Tuesday, May 10, 2016

Resilient Distributed Datasets and origins of my obsession

Yes- it is true. I am obsessed with in memory data processing and it goes waaayyy back. Back in my college days I had a very simple but very powerful insight or vision. I feel super lucky because over the years every now and then I just happen to stumble upon more and more opportunities to make it real...I'll talk about LINQ, Project Orleans and then finally RDDs.   
In the beginning 
The obsession  started some 17-18 years back when I was writing a paper for my college seminar. Being a compiler freak, my idea was to write a language that manipulated in memory collections using sql like syntax.  What remains today of that dream is a couple dozen hand written versions of that paper and reams of BNF grammars for that dream language.  Here are couple of examples ... 
And here is another example ...(didn't I tell that this obsession goes way back) 

While( !done) iterate;

One idea that I kept on toying with (rather ad infinitum) was that of representing a sql query as a enumeration. (See example of one such idea.) I didn't actually get it working in the college. I almost forgot about it.  
Later on while at Microsoft, my passion was rekindled . Off the normal day job hours, I tried to meddle with some research projects like COmega and Common compiler Infrastructure.  Unfortunately I was not able to make any real contributions.  I moved on and joined Amazon.com...a number of years passed.  

Laziness Pays Off

Later in 2008 Microsoft released LINQ and then eventually java followed it by lambdas and streams. It felt like my dream was realized by someone else already. I needed a a new obsession and so I was free to forget about the whole idea. Except - the Lazy Evaluation was destined to turn my world upside down again.  
Back to the Future<> in 2011 : Project Orleans at Microsoft Research
I was lucky enough to get a contract position in Microsoft's eXtreme Computing Group in Microsoft Research. More specifically in Cloud Computing Futures team. The project was using Asynchronous Message Passing to implement an Actor Model based compute engine. Think of these actors as objects distributed in a cluster using some sort of Consistent Hashing Ring. Almost like a memcache for actively callable objects. Admittedly this project was the coolest project I have ever worked on and worked with some of the brightest engineers I have ever met. ( I helped with implementing state management and code generation.)    With the help of "mostly in memory" processing we were able to achieve amazing throughput for live Halo 4 services. 

Enter Apache Spark and Resilient Distributed Datatypes 

After a trip down a memory lane we arrive at Today's state of the art - Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. From the RDD paper ...
.... RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. If a parti-tion of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.

References :

No comments:

Post a Comment