Monday, October 9, 2017

Is Parallel Programming still hard?

CppCon 2017 is over and some of the talks are starting to trickle in.
One that is definitely worth watching is the one by Michael Wong, Paul McKenney and Maged Michael. If you don't know who these guys are then go and read this post.

Here are the two videos:


My favorite quote from the talk:
"It is possible to split atoms, but doing that spoils them for electronics use".

Nice intro to false sharing, store buffers and caches. All of these are hardware details that (unfortunately) matter in multi-threaded programming, and they are rarely talked about.
 
About the Reader-Writer Lock slides (a subject I've dug deep into with Andreia), there is a much simpler way than using a TLS RW-Lock, namely the scalable reader-writer locks. More info can be seen on this post:
http://concurrencyfreaks.blogspot.nl/2015/01/scalable-reader-writer-lock-in-c1x.html

Let's see what the next talks will be about...

Wednesday, September 6, 2017

Dawn of the Universal Constructs


Historical Introduction

Some time ago I saw this nice presentation by Andrei Alexandrescu about Generic Locking.
https://www.youtube.com/watch?v=ozOgzlxIsdg
He starts the talk by giving an introduction to how the "Concurrency Problem" has been attacked.




And what is the "Concurrency Problem" you might ask?
Well, remember this little issue about CPU clock speeds having stopped scaling back in 2004 (the end of the Moore's Law free lunch), and that nowadays to get more performance you need to use multiple cores, which means using some form of concurrency? Yeah, that's the "Concurrency Problem"!
Sure, if you can write your application or algorithm in a map-reduce form, then you can parallelize it, but not all algorithms are inherently parallel enough to profit from this approach, and those are the ones we need to worry about.

As Andrei says in his presentation, in the beginning there were mutual exclusion locks, and those were pretty much the only way to deal with concurrency.
Then came reader-writer locks, lock-free and wait-free data structures, CSP, Actor Models, Software Transactional Memory, Hardware Transactional Memory and many others.
For a while, lock-free and wait-free data structures seemed to be the thing, but the lack of a memory model to program them in, the difficulty of verifying their correctness, and the difficulty of doing efficient lock-free memory reclamation (wait-free reclamation is even harder) caused people to try the other approaches. Nowadays there is no clear winner, and maybe there never will be. Actor models have been gaining a lot of followers, with the whole Rx/Reactive movement based on them... I'm not sure they all realize that at the base of the actor model is the message-passing queue between the actors, which is itself typically based on a lock-free queue.

Anyway, fast-forwarding to 2017, one of the upcoming CppCon talks that will definitely be worth watching is the one by Paul McKenney, Maged Michael and Michael Wong:
https://cppcon2017.sched.com/event/Bgtm/the-landscape-of-parallel-programming-models-is-it-still-hard-or-just-ok-part-1-of-2?iframe=no&w=100%&sidebar=no&bg=no
As they mention in the abstract, there was a time when writing a simple loop seemed as perilous then as writing a lock-free data structure seems now.
I don't know what their talk will be about, but I do agree that most of the dangers in concurrent programming can be avoided by using better programming tools, and on that note, I would dare go so far as to say that we're entering "The Dawn of the Universal Constructs"!


What is a Universal Construct

Universal Constructs aren't a new thing. The first Wait-Free Universal Construct (WFUC) was invented by Maurice Herlihy back in 1993:
http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p745-herlihy.pdf
A WFUC is a construction that encapsulates an object (or data structure) written for single-threaded usage, and provides a way to call its methods with wait-free progress.
It's kind of like wrapping all the methods of an object with a call to lock()/unlock() on a mutual exclusion lock, except that a lock is blocking while a WFUC is, well, wait-free!



In summary, a Wait-Free Universal Construct lets you write your favorite data structure as if it were "single-threaded code" and then wraps its methods to provide wait-free progress.
If you saw Andrei's presentation at the beginning of the post, you'll know that he has a cool way (in C++) of automatically exporting the methods of an object/data-structure wrapped with a read-lock or a write-lock, depending on whether the methods are mutative or read-only. The same kind of approach can be used with WFUCs, which means that they are truly generic.


How come I never heard of this before?

Well, you kind of have heard about this (if you read this blog).
Left-Right, for example, is a Universal Construct that provides wait-free progress for readers and blocking (starvation-free) progress for writers:
http://concurrencyfreaks.blogspot.nl/2013/12/left-right-concurrency-control.html
It's fast, it reclaims memory, it's "universal", but it's not completely wait-free, i.e. it's not wait-free for mutations.

There are true WFUCs in the literature, and we've even added (wait-free) memory reclamation to a few of them. The most famous of these is P-Sim, by Panagiota Fatourou and Nikolaos Kallimanis:
http://thalis.cs.uoi.gr/tech_reports/publications/TR2011-01.pdf

But even the best of them still works on the copy-on-write (COW) principle, in the sense that it requires a full copy of the data structure for every mutation.
Yes, that's right, if we have a map with 1M keys/values, and we want to insert a new key (or remove an existing one), we have to make a full copy of the 1M nodes of the data structure.

Waaaaiiiit wwwhhhaaat? Did you just say that this thingy has to copy 1 million nodes every time I want to insert a new key/value in my hashmap?

Yes, I'm afraid that's how it works. It's not as bad as it looks if the data structure is small, but as soon as you start going to anything bigger than a few thousand nodes in the data structure, throughput is going to go down the drain. And cache locality is a big issue with COW.


What's on the horizon

I happen to have insider knowledge that it is possible to make a wait-free universal construct that is fast enough to be used in production, because it doesn't have to copy the entire data structure (except on very rare occasions). This new way of doing WFUCs is fast enough to be used in real-life applications, in fact, it's capable of beating hand-written lock-free and wait-free data structures.
Yes, it's that good. We call it CX and the paper is almost ready.
... did I say it does wait-free memory reclamation as well?  ;)

As soon as other researchers see what CX is capable of, I'm sure they will have their own ideas and improve upon it, thus bringing about a new age of production-capable WFUCs.


Going back to the Concurrency Problem, how does this change the way we program multi-threaded applications?

Based on my experience as a developer, software engineers rarely use an off-the-shelf data structure as is. We always add a little twist to it.
I don't know whether this happens because we have big egos and want to give our little personal touch to everything we do (I'm generalizing here), or simply because we're tinkerers who like to see what's under the hood and tweak it to our own desire. The truth is, if we could use things as is, then there wouldn't be a need to write all the software people write; there would be just one search engine, just one banking application, just one customer management system, etc.

Whatever the reason may be, the consequence is that we always make changes to the data structures, and the problem is that lock-free and wait-free data structures fall apart when they're touched. They have sequences of instructions that must execute in a particular order, optimizations for relaxed memory orderings, subtle memory reclamation schemes, and so on. The temptation to tinker with a lock-free data structure is just too strong, we end up getting burned, and then we say that "lock-free data structures suck", which is not true.
The algorithms behind these data structures are very subtle, while our human nature is to tinker, and the two are incompatible.

This is where Wait-Free Universal Constructs come in.
With an efficient WFUC, any software engineer without any knowledge of lock-free data structures (or atomics, or the memory model) can just take a "single-threaded data structure", modify it to satisfy their needs (or ego), wrap it in the WFUC, and BOOOOM, out comes a fast and scalable implementation of their customized data structure, with wait-free progress.
Even better, another software engineer working on the code can later fix bugs or add new features to the data structure without breaking its correctness. Try doing that with a lock-free data structure: it's hard enough to get one right the first time; adding new functionality later is just too nuts.
As a software engineer, what more could I ask for?

No wonder WFUCs are seen by researchers as the Holy Grail of concurrency: they make it easy to solve the Concurrency Problem in production.
Yes, I'm a fanboy of Universal Constructs, but the hype is called for and needed. This stuff doesn't solve all the concurrency problems, but it solves a lot of them, enough to become a standard way of dealing with multi-threaded code, or at least for writing multi-threaded data structures.

Whatever happens, Universal Constructs are here to stay as one of the easiest tools to use when writing multi-threaded applications, and we all know that when it comes to multi-threaded programs, any help we can get is not enough  ;)