Tuesday, May 10, 2016

Phonetics: 2500 year old Sanskrit grammar in BNF form

In the last post I talked about syntax parsing and the X-Bar theory that applies to natural languages. In this post we'll start our exploration with a 2500-year-old grammar expressed in a form similar to Backus-Naur Form, and move on to the internal structure of words and syllables.


My mother tongue is Marathi, a daughter language of Sanskrit that preserves most of Sanskrit's grammatical features, including 8 cases and 3 genders. About 50% of its vocabulary comes directly from Sanskrit (tat-sama) and another 25% can be derived by a single sound shift (tat-bhava).

3 years of Painful High School Grammar

I took 3 years of Sanskrit in high school, but all I remember is the sheer pain of learning its grammar.
  • Imagine all those Latin conjugation and declension tables? Well, Sanskrit has 10 times more tables.
  • You think French liaisons and elisions are hard? Sanskrit has far more such rules for sound changes.
  • You think German has long words? Sanskrit has those as well, including the famous splitting of the prefix from the main verb.


Panini's Grammar

The most authoritative grammar of Sanskrit is formulated using a form similar to modern context-free grammars.
  • Panini developed his grammar around 500 BC.
  • He used 14 classes of phonemes and, taking those as the terminal symbols of his grammar, described all grammatically correct ways to construct a Sanskrit sentence with a set of 3,959 rules.
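The rewrite-rule idea behind such a grammar is easy to sketch in code. Below is a toy BNF-style grammar in Python; the rules and vocabulary are invented English examples for illustration, not actual Paninian sutras:

```python
import random

# A toy grammar in BNF-like form, illustrating the rewrite-rule idea.
# These rules are invented for illustration only.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N":   [["grammarian"], ["rule"]],
    "V":   [["writes"], ["applies"]],
}

def generate(symbol="S"):
    """Expand a symbol by repeatedly applying production rules."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    out = []
    for sym in production:
        out.extend(generate(sym))
    return out

print(" ".join(generate()))
```

Every string this process can emit is, by construction, grammatical with respect to the rules; that is the sense in which a rule set "describes all grammatically correct sentences".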


The word factory 

Human - a talking animal 

Despite our differences, all humans are born with an ability for language. Most human languages have the following properties:
  • 30-60 sounds that are put together to make syllables
  • Syllables are put together to make words
  • Words form phrases
  • Phrases form sentences
  • Sentences form discourse

Structure of Words and Syllable




How the human voice is produced

Human speech articulation begins in the larynx. The vibrations created there are modified throughout the vocal tract and ultimately come out of the mouth as a series of sounds, which we understand as a sequence of phonemes. The larynx is where the "voice" starts, but that sound is progressively modified in the mouth by forming various shapes with our tongue and lips. I have modified the standard diagram to show a schematic of the vocal tract. The key concept is this: our tongue is a very flexible muscle that can form various shapes, and that profoundly affects the way we perceive the basic sound that begins at the larynx. [Edited from original here: http://en.wikipedia.org/wiki/File:Illu01_head_neck.jpg]
That brings us to the next concept: there are two types of sounds, vowels and consonants.
When the airflow from the larynx is not obstructed, we perceive a vowel. Anyone who has dealt with waveforms of human speech knows that most of the waveform is basically vowel sound. However, the most interesting sounds are generated when the airflow is blocked (or substantially modified) for a brief amount of time. This time frame is tiny compared to the surrounding vowel sound, and we perceive it as a consonant.
In terms of digital signal processing, the waveform of the voice that starts in the larynx is modified by the resonance of the vocal tract, so we can model the result as a convolution of the glottal source with a simplified vocal-tract filter.
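That source-filter convolution can be sketched in a few lines of plain Python. All numbers below (sample rate, pitch, decay, formant frequency) are invented for illustration; real vocal-tract responses are measured or modeled:

```python
import math

def convolve(x, h):
    """Discrete convolution: y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

fs = 8000                                  # sample rate (Hz)
n = 400                                    # 50 ms of signal
# Glottal source: impulse train at 125 Hz (the "voice" from the larynx)
source = [1.0 if i % (fs // 125) == 0 else 0.0 for i in range(n)]
# Vocal tract as a crude resonator: a decaying sinusoid near a
# formant-like frequency (700 Hz). Changing this filter changes the
# perceived vowel; the source stays the same.
tract = [math.exp(-i / 40) * math.sin(2 * math.pi * 700 * i / fs)
         for i in range(64)]

speech = convolve(source, tract)[:n]
print(len(speech))                         # 400
```

The design point mirrors the text: the same larynx "source" filtered by different tract shapes yields different perceived sounds.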


Vowels take up most of the time in the speech signal. All consonants are meaningful only against the background of a robust vowel signal. The larynx is actively producing "voice" during vowels.
The quality of a vowel depends on four attributes.
  1. Height of tongue position - in other words, whether the gap between the tongue and the roof of the mouth is closed or open. Generally four positions are considered: open, open-mid, close-mid and close.
  2. Position of tongue - front, central or back.
  3. Roundness of lips - rounding the lips adds another effect; they are generally rounded for back vowels.
  4. Nasalization - whether the vowels are nasalized.
Below is the list of known vowels and their IPA symbols.

Edited from original here : http://en.wikipedia.org/wiki/File:IPA_chart_2005_png.svg

The best way to analyze vowels is to look at their frequency spectrogram and find the peaks. The first three formants are especially interesting. The roundness of the lips changes the positions of the formants.
There are also diphthongs and glides, which are combinations of two vowels.


Consonants are interesting because they are numerous, show great variation in how they are produced, and the signal is available for only a short time. They are "negative" in the sense that the voicing and vowel are interrupted or blocked by the position of the tongue. Here I am going to use somewhat non-standard terminology and classification, only because that is how I understand it, and because I was introduced to the subject via Sanskrit grammar.
There are three main things to look for: (A) properties of the original sound that is obstructed, (B) the manner in which the airflow is obstructed/modified, and (C) the place of obstruction/modification.

Properties of original sound

  1. Voiced vs unvoiced - whether voicing continues or not. [b-p], [k-g], [t-d] are some examples. (saghosh and aghosh, सघोष-अघोष)
  2. Aspirated vs unaspirated - this difference is important in Indic languages; it effectively doubles the consonants available for use. (alpa prana and maha prana, अल्पप्राण-महाप्राण)

Manner in which flow is obstructed /Modified

  1. Stops - the airflow stops completely
  2. Fricatives - a sound produced by friction is introduced into the mix
  3. Affricates - a stop followed by friction at the same place
  4. Liquids and approximants - the airflow is only partially blocked
  5. Nasals - the airflow goes through the nasal channel


Place of articulation

  • Labial 
    • Bilabial
    • Labio-dental
  • Coronal 
    • Dental
    • Alveolar
  • Palatal 
    • Alveo-palatal
    • Retroflex
    • Palatal
  • Dorsal
    • Velar
    • Uvular 
  • Radical 
    • Pharyngeal/Epiglottal
    • Glottal

Edited from original here : http://en.wikipedia.org/wiki/File:IPA_chart_2005_png.svg

Syntax: Colorless green ideas sleep furiously

In 1957, the father of modern linguistics, Noam Chomsky, came up with the sentence "colorless green ideas sleep furiously" as an example of a sentence that is syntactically correct but semantically nonsensical. He founded a branch of grammar referred to as "generative grammar", based on the idea of production rules. Here I talk about the basics, X-Bar theory and generalized phrase structure grammar.

The basics of Context Free Grammar 

A context-free grammar is specified by a system of production rules. Starting from a set of terminal symbols, non-terminal symbols and a distinguished start symbol, each production rule specifies how to generate a valid sentence of the grammar.
Given a sentence, identifying the sequence of rules that generates it is called parsing. The raw text is first "tokenized" by a lexer, and a syntax tree is generated during parsing.
  • The grammar is generally specified in a form similar to Backus-Naur form.
  • There are two ways to parse: top-down or bottom-up.
  • LL parsers look ahead k symbols and deterministically choose the next production rule to apply. (Often a simple recursive descent parser can be written "by hand", making generous use of switch statements.)
  • LALR parsers, on the other hand, use "shift/reduce" parsing, where a state transition table decides whether to "shift" the current token onto the stack or to "reduce" the stack by applying a grammar rule. This usually requires the help of parser generators like yacc.
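A minimal hand-written recursive descent (LL(1)) parser, sketched for a toy expression grammar of my own invention; one token of lookahead is enough to choose each production:

```python
# Toy grammar:
#   Expr ::= Term (('+' | '-') Term)*
#   Term ::= NUMBER | '(' Expr ')'

def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("NUM", int(src[i:j])))
            i = j
        else:
            tokens.append((ch, ch))
            i += 1
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):      # one-token lookahead picks the rule
            op, _ = self.next()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        kind, val = self.next()
        if kind == "NUM":
            return val
        if kind == "(":
            value = self.expr()
            assert self.next()[0] == ")", "expected ')'"
            return value
        raise SyntaxError(f"unexpected token {kind}")

print(Parser(tokenize("1 + (2 - 3) + 10")).expr())   # → 10
```

Each non-terminal of the grammar becomes one method; this is exactly the "by hand" style mentioned above, just with `if`/`elif` in place of switch statements.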

The "Language Instinct" and X-Bar Theory

As linguists started applying formal grammar theory to natural languages, they realized there is an underlying common structure to all of them.
  • A sentence is made up of phrases.
  • A sentence always has a verb.
  • The order of Subject (S), Object (O) and Verb (V) is fixed for a given language. E.g. English is SVO, Indic languages are SOV, while Semitic languages tend to be VSO.
  • The order of adjectives (A), nouns (N) and specifiers (Sp) is fixed.
  • This ordering is similar to the way individual phrases themselves are structured internally.

X-Bar theory

The theory, formally called "X-bar theory", posits that each phrase is built as a binary tree. Each phrase must have a distinguished element called the "head", and the phrase structure is specified by the following rules. The structure applies to noun phrases, verb phrases and preposition phrases (even the sentence itself!).
  • Phrase ::= Spec X'' 
  • X'' ::= X' Adjunct*  
  • X' ::= X Complement*
  • X ::= Head
By changing the order of symbols in each rule we get the rules for your favorite language. E.g. in English and Indic languages the adjective comes before the noun, whereas in Spanish (and other Romance languages) it comes after the noun.
  • the blue sky
  • el cielo azul
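The parameter-setting idea can be sketched as a single boolean per language. This is a deliberately over-simplified model (real X-bar parameters cover heads, complements and specifiers too); the parameter name is my own:

```python
# Sketch: the same rule shape with a per-language order parameter.
# "adjective_first" is the only thing that varies between languages.
LANG_PARAMS = {
    "english": {"adjective_first": True},
    "spanish": {"adjective_first": False},
}

def noun_phrase(lang, det, adj, noun):
    p = LANG_PARAMS[lang]
    if p["adjective_first"]:
        return f"{det} {adj} {noun}"
    return f"{det} {noun} {adj}"

print(noun_phrase("english", "the", "blue", "sky"))    # the blue sky
print(noun_phrase("spanish", "el", "azul", "cielo"))   # el cielo azul
```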

Language Instinct

Even languages spontaneously generated by kids show the same "inherent structure", leading to the hypothesis that the "parsing ability" is part of a cognitive circuit every human is born with. Cognitive psychologist Steven Pinker explains this very well in his book "The Language Instinct".
Essentially, for each language we learn the parameters that fix the order of symbols in the X-Bar rules.

Generalized Phrase Structure Grammar 

In the BNF form of a grammar, the production rules are given in the following manner:
  • A ::= B C D
where A is a parent that has three child nodes in that order.
Generalized phrase structure grammar, however, separates immediate dominance from linear precedence.
Instead of specifying a single production rule, the rule is broken down into three separate parts:
  • Immediate Dominance: in simple terms this says A is the parent of B, C and D. It does not specify the order in which B, C and D should occur.
E.g. the rule A --> B, C, D specifies that A can have these three children.

  • Linear Precedence: specifies the order in which children appear. Orders can be partially specified.
E.g. the rule B < C specifies that under A, B must occur before C.
  • Head Identifying Rule: each such production has exactly one head element, and it must occur in the production.
E.g. ConditionalBlock ~ IfClause

if {} - is valid
if {} else {} - is valid
else {} - is not because the head is missing.

Here is an example:
QueryExpression --> SelectClause, WhereClause, FromClause

With the separation of dominance and precedence, all of the following are valid productions of this rule:

select * from tbl where col == 10;
from tbl where col ==10 select *;
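A sketch of checking the three rule types separately, using the hypothetical QueryExpression rule above (the rule tables and function are my own illustration, not any standard GPSG notation):

```python
# Dominance, precedence, and head rules stored as separate tables.
DOMINANCE = {"QueryExpression": {"SelectClause", "WhereClause", "FromClause"}}
PRECEDENCE = []                  # no ordering constraints: any permutation is fine
HEAD = {"QueryExpression": "SelectClause"}

def valid(parent, children):
    # Immediate dominance: every child must be allowed under the parent
    if not set(children) <= DOMINANCE[parent]:
        return False
    # Head rule: the head must be present in the production
    if HEAD[parent] not in children:
        return False
    # Linear precedence: each (before, after) pair must hold
    for before, after in PRECEDENCE:
        if before in children and after in children:
            if children.index(before) > children.index(after):
                return False
    return True

print(valid("QueryExpression",
            ["SelectClause", "FromClause", "WhereClause"]))   # True
print(valid("QueryExpression",
            ["FromClause", "WhereClause", "SelectClause"]))   # True
print(valid("QueryExpression",
            ["FromClause", "WhereClause"]))                   # False (head missing)
```

With an empty precedence table both orderings of the query clauses pass, which is exactly why both SELECT-first and FROM-first spellings above are valid.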

My JVM dream, now 20 years later..

Back in 1996, I first encountered the Java Virtual Machine while still in college, and immediately fell in love. But worse - I got obsessed with it. Obsessed with the crazy idea of writing new languages that targeted the JVM. 20 years later that obsession secretly continues...

Fortran Compiler For Java Virtual Machine

To complete my Computer Engineering degree, we had to submit a group project at the end of the final year. I was able to convince my awesome and insanely talented friends to take this on as our project.
  • We chose Fortran because there was lots of literature available on it.
  • We contacted the Center for Development of Advanced Computing (C-DAC), and they indicated interest in porting their scientific simulations in Fortran to our JVM.
  • The Sun JVM team shared their JVM specs with us. (I just realized I actually work for the same company now!!!)

What we delivered 

  • An assembler for the JVM
  • A Fortran compiler that outputs our assembly language
  • A Java-based IDE for editing and building Fortran code


The BIG DREAM - JVM as a target for writing new languages 

Obviously, back in those days as a young college student, my intention was to conquer the world - by developing new and interesting higher-order languages...

The Paradigm Shift I really wanted was... LINQ!

The language I really wanted us to implement had native SQL syntax... here is what my handwritten seminar notes say...

Where are we now 20 years later?

Top Languages running on JVM
  • Scala
  • Groovy
  • Jython
  • JRuby

Lambda and Streams 

  • Java now supports lambda and stream processing

Where to next ?

It looks like the world is still warming up to the notion of targeting languages to the JVM and getting to know lambdas.
And I feel like my dream is still alive and a driving force for me... 

Scarce Data Problem: Mysterious Indus Valley Script

Once upon a time, nearly 5000 years ago, there was an urban civilization with over 1000 planned cities across North-Western India. It is called the Indus Valley Civilization.

The Cities 

Most of the cities were along the banks of a now-dry river, and at its peak this civilization had a population of over five million.
For that era it was a very advanced civilization, with houses built out of baked bricks and public facilities like baths and huge granaries. One of the cities, Harappa, had about 700 houses with private wells, and most cities had citadels. The civilization had also developed metallurgy.

The Oldest Writing on Indian Subcontinent  

One of the most interesting things about this civilization is that it developed a script and left behind around 4000 seals with around 400 distinct symbols.
The script is still undeciphered and the underlying language has not been identified.
  • The script is found all over North-Western India as well as in Egypt and Mesopotamia.
  • The script continued over hundreds of years, with regularity and similarity between seals found across the region and across the centuries.
  • There is no bilingual text available, like the Rosetta Stone.
  • Most seals are extremely brief, the longest being only 17 characters.

Original Scarce Data Problem

  1. Does this script represent a spoken language?
  2. If yes, what language? What can we learn about this language? How is it related to the present-day languages of South Asia?
  3. What are these seals trying to tell us? What were they used for?
Answers to these questions can shed light on the early history of the Indian subcontinent and its cultural sphere. But this is not just an old Indian civilization - it is world heritage for all of humanity.


Call to Action:

How can we use our big-data tool set and machine learning to decipher the Indus Valley script?

[All images from Wikipedia - thank you to all original creators]

Managing Projects: Software By Numbers

I read a wonderful book in 2004 named Software By Numbers by Mark Denne and Jane Cleland-Huang. It introduced the concepts of the "Minimum Marketable Feature" and the "Incremental Funding Methodology". It radically changed how I viewed project management and continues to inspire me even today.
The main message is: by carefully choosing the way in which software components are assembled, we can create identifiable units of market value before the application/product is anywhere near completion.

Minimum Marketable Feature (MMF)

Typically an MMF creates market value in the following ways:
  • Competitive differentiation
  • Revenue generation
  • Cost Saving
  • Brand Projection
  • Enhanced loyalty  
As the name suggests, the features need to be marketable, which means you can use them to decide the price of the product, promote the product, and more importantly create revenue by helping sell it.

Kano Matrix

An incomplete piece of software can still have value to the user. However, a car that consists of only an engine and a wheelbase is unlikely to be thought useful - you need more than basic functionality. Similarly, a super cool luxury car that does not move is equally useless. What you need is a balance of "cool" and "required" features. This idea is captured in a model developed by Noriaki Kano.
The model categorizes features into a few interesting categories:
  1. Must-Have or Basic Needs: lack of must-have features causes dissatisfaction, but once these features are present, adding more Must-Have features does not make the product more attractive to customers.
  2. Delighters: lack of these features does not make the customer unhappy, but it does make your product boring. When added, these features create excitement and make your product attractive.
  3. One-dimensional or Performance Needs: customer satisfaction is directly proportional to 'how much' of the feature you are providing. For cars, mileage is one such feature - the more the better, and as it goes down, customer dissatisfaction goes up.

Incremental Funding Methodology

The goal of incremental funding is to accelerate reaching first the self-funding point and then the breakeven point.
Below is a high-level procedure to do exactly that.
  1. Define the dollar value of the MMF: once we have MMFs defined and categorized, the next task is to assign a dollar value to each feature. This is the incremental revenue generated by adding the MMF to the product, calculated from additional buyers and possible price increases. For a Must-Have feature, the dollar value is defined by the potential loss if the feature were absent.
  2. Estimate cost: next we estimate the cost of creating and maintaining/servicing the feature.
  3. Identify risks: we also need to capture the uncertainties associated with the features.
  4. Identify the dependencies between MMFs.
  5. Calculate the rate of return for each MMF. The rate of return can be calculated using standard techniques like Net Present Value (NPV) or Internal Rate of Return (IRR).
  6. Sequence: finally, sequence the delivery of the MMFs to maximize return by delivering them in the minimum amount of time.
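Step 5 can be sketched with a small NPV helper. The cash-flow numbers below are invented purely for illustration:

```python
# Net Present Value of one MMF's cash-flow stream: discount each
# period's net cash back to today at the given rate and sum.
def npv(cash_flows, rate):
    """cash_flows[t] is net cash in period t (t = 0 is now)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

# Hypothetical MMF: costs 100 now, returns 40 per period for 4 periods
flows = [-100, 40, 40, 40, 40]
print(round(npv(flows, 0.10), 2))   # 26.79
```

Computing this figure per MMF, then ordering MMFs to deliver the highest-value ones earliest, is the essence of the sequencing step.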


[Images:  for Kano wikipedia and for incremental funding the book]

Look Ma, No Locks - Act III - "STRONG" Eventual Consistency with CRDTs and other Geeky Stuff

In Act I we saw that Consistency and Serializability were the goals our Hero Developer was striving for; the plot started with the basic idea of two-phase locking. In Act II we encountered the concept of "tunable consistency" in the form of isolation levels and then stumbled upon the CAP theorem - both very important developments. In this Act III, the story continues with new and exciting developments in the research community - namely CRDTs and coordination-free consistency...

What is this "Eventual Consistency" business?

If our micro-services were dealing with read-only data, then we could scale them infinitely by adding one clone after another - all nodes would have the same data and every result would always be consistent. But alas - the world changes, and so must our data - on each individual node responsible for that piece of data. The problem is that a change has to "start" somewhere and then spread to all other nodes in the cluster, and eventually we must somehow detect that the change propagation has "finished". The network is unreliable, and THAT exactly is the curse of being a distributed system, my friends. There is a window of time when the change has "started" but not yet completely "finished". How do you deal with the next change request already in the queue?
Now, you can be pessimistic and wait for the first change to finish completely before the next one starts. Or you can be optimistic, assume the best case, apply the next change before the first one finishes, and deal with any issues only when they arise. The intuition here is that the series of changes may be re-ordered or delayed, but "eventually" everyone sees all the changes and "deals with it" when problems arise. (It is as if my otherwise consistently normal friends behave a little strangely when in Las Vegas, but they are eventually back to normal when we are back in Seattle...)
In short here is the reality 
  • By the time you detect the consistency problem, the client requesting the change may be long gone, and you are left alone with the inconsistencies to resolve.
  • Eventual consistency is a meaningless term until you define what it means to "reconcile" the differences/inconsistencies. E.g. what happens after you detect that two customers have added the same last remaining item to the cart, or have booked the same last seat on the plane? What is your MERGE strategy?
  • In eventual consistency you decide to pay a very small price upfront for consistency, but you are signing up to pay a heavy price later to reconcile and correct the inconsistency when it arises. The math works as long as the cost of "fixing" is amortized over mostly problem-free executions.
  • It is an absolute requirement that ultimately all nodes resolve the inconsistency the same way. I repeat - it is an absolute requirement that ultimately all nodes resolve the inconsistency the same way. It is turtles all the way down - how do all the nodes come to know what the resolution decision was, and how do you propagate that decision consistently? Fun times...
I'm already thinking of writing the next best-selling novel based on the effects of eventual consistency and the meaning of destiny - is it cute Ms. Isabella who gets the last seat, or is it really her destiny to miss the flight and meet Mr. Right on the next one??...

Commutativity : 2 + 3 == 5 == 3 + 2

  • What if there were no additional costs to pay when an inconsistency was detected in an eventually consistent system?
  • What if no coordination was required to resolve inconsistencies in a distributed system?
  • What if every node already knew how to resolve the inconsistency in a prescribed, deterministic way?
  • What if there were no bad effects of message re-ordering to begin with?
Suppose our Hero Developer wins three lotteries in a row in three days. Now, it doesn't matter in what order he gets paid from each lottery. As long as he gets the money in a reasonable time - it's all good. The commutative nature of addition on natural numbers "adds up" to solid, logic-based reasoning that eventually our Hero Developer will be richer by the amount that equals the sum of all three lottery prizes!
In 2007, Marc Shapiro and others at INRIA formalized this notion in what are called "Conflict-free Replicated Data Types".
This system works under the following conditions...
  • If the system is monotonically increasing, then it never has to "roll back".
  • And if state transitions are partially ordered and the MERGE operation is associative, commutative and idempotent,
  • THEN replicas that have received and applied the same set of updates immediately have equivalent state. There is no conflict-arbitration process, because conflicts do not exist in such strongly eventually consistent systems.
In short, if the only operation we allow is INCR(account, amount), then the state is monotonically increasing; and with addition being commutative and associative, as long as the merge is idempotent (so "at least once" delivery behaves like "exactly once"), the system is STRONGLY eventually consistent and coordination free!!!
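A minimal sketch of this INCR-only scheme is the classic grow-only counter (G-Counter) CRDT. The class below is my own toy implementation of that well-known design:

```python
# G-Counter: each node increments only its own slot; merge takes the
# element-wise max. Merge is commutative, associative and idempotent,
# so replicas that have applied the same updates converge regardless
# of message order or duplication.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                 # node_id -> count

    def incr(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.incr(5)
b.incr(3)
# Merge in either order (and repeatedly): the result is the same.
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # 8 8
```

Note the design choice: merge uses max rather than addition, which is what makes re-delivering the same state harmless (idempotence).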

Mr. Luca Pacioli and Double Entry Accounting

The father of the accounting profession, Luca Pacioli, was an Italian mathematician who codified what is now known as the double-entry accounting system. This system revolutionized the world of commerce; it is as fundamental to modern civilization as Newton's mechanics or the theory of social contracts. In his system there is a column for debits and another column for credits. The total balance is the result of subtracting one column from the other.

How is that relevant ?  

Both debits and credits are monotonically increasing and follow the rules of addition - in short, each column can be implemented as a CRDT. And any number can be represented as the sum of a positive and a negative part.
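That observation can be sketched as a PN-Counter: two grow-only "columns", mirroring Pacioli's debit and credit, each merged independently by element-wise max. This toy class is my own illustration of the well-known design:

```python
class PNCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.credits = {}   # node -> total credited by that node
        self.debits = {}    # node -> total debited by that node

    def credit(self, amount):
        self.credits[self.node_id] = self.credits.get(self.node_id, 0) + amount

    def debit(self, amount):
        self.debits[self.node_id] = self.debits.get(self.node_id, 0) + amount

    def merge(self, other):
        # Merge each column exactly like a grow-only counter
        for col, other_col in ((self.credits, other.credits),
                               (self.debits, other.debits)):
            for node, total in other_col.items():
                col[node] = max(col.get(node, 0), total)

    def balance(self):
        # Pacioli's rule: one column's total minus the other's
        return sum(self.credits.values()) - sum(self.debits.values())

a, b = PNCounter("a"), PNCounter("b")
a.credit(100)
b.debit(30)
a.merge(b)
b.merge(a)
print(a.balance(), b.balance())   # 70 70
```

Decrement is never performed in place: it is recorded as growth of the debit column, which keeps each column monotonic and therefore mergeable without coordination.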

What other data structures can be implemented as CRDTs

  • Sets
  • Maps
  • Graphs 

Shipping Pie vs Shipping Pie Recipe 

The theory of CRDTs makes a great deal of distinction between shipping the end result (state) and shipping the instructions to achieve the result (operations). Shipping the end result is cumbersome, but it is theoretically easy to deal with and easy to prove properties about. Shipping the recipe is lightweight, but it is not as accessible to theoretical treatment. The genius of the team that developed CRDTs at INRIA was to show the equivalence of the two approaches.

The story continues next week ....

Look Ma, No Locks - Act II - Tunable Consistency

The last post described why our industry started using locks so heavily in database systems to achieve serializable transactions. The conflict and tension between Availability and Consistency became more serious as days passed by...
This post first talks about how the concept of database isolation levels is actually an early attempt at defining tunable consistency, and then describes the essence of the now-famous CAP theorem.
The next post will lay out the foundations necessary to understand current research in this area, including commutativity and an example of a STRONG eventual consistency model called Conflict-free Replicated Data Types.

Locks everywhere in all shapes and sizes.

To increase performance we had to start using different types of locks for different kinds of data and operations. There were obviously separate reader and writer locks. There are page-level locks, and then there are row-level locks. Sometimes we use transaction-level locks, but most of the time the locks are statement-level locks. In short, there were locks everywhere, in all sizes and shapes. But our insatiable thirst for throughput meant that we were ready to let go of some consistency guarantees. The stage for the Availability vs Consistency battle was set.

Are these Database Isolation Levels or Tunable Consistency Levels in disguise?

To increase throughput (aka availability) we started dropping locks, starting with range locks, then read locks and even write locks. But that soon started showing undesirable side effects. Recall that any transaction is, in the final analysis, a sequence of reads and writes, and that reads and writes to the same data items can potentially produce inconsistent state as a result of arbitrary scheduling.
These bad effects were later categorized as follows:
  • P0 - Dirty read: a transaction can read uncommitted data written by another transaction. In short, your reads are dirty.
  • P1 - Non-repeatable read: if a transaction reads the same data item twice, the value may differ between the two reads. Your reads are not reliably repeatable.
  • P2 - Phantom read: this happens when a query with a WHERE clause is executed twice, and between the two calls new data satisfying the query is added or deleted - the result is phantom rows. Other names for this are motion blur and fuzzy read.
  • In addition there are many other phenomena, viz. lost update, cursor lost update, read skew, write skew, etc.

In short, there is no free availability without reduced consistency

Based on these side effects, the ANSI standard defines the following four isolation levels. In reality these are really consistency levels, and you can choose reduced consistency for better performance.
  • Read Uncommitted - the lowest consistency guarantee. Dirty reads are possible, and so are non-repeatable and phantom reads.
  • Read Committed - at this next-higher consistency level, only committed data is read. Although dirty reads are prevented, the other anomalies are still possible.
  • Repeatable Read - at this consistency level single-item reads are repeatable, but phantom reads are still possible.
  • Serializable - the highest consistency guarantee. Dirty reads, phantom reads and non-repeatable reads are all impossible.
  • Snapshot - yet another consistency level (beyond the ANSI four), where your transaction sees only a snapshot of the data as it existed when the transaction began.
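The non-repeatable read (P1) anomaly can be sketched without a real database. Below is a toy interleaving over a shared in-memory store; all names are invented and no locking is involved, which is exactly why the anomaly appears:

```python
# T1 reads x twice; T2 commits a write between the two reads.
store = {"x": 1}
log = []

def t1_read_1():
    log.append(("T1 read #1", store["x"]))

def t2_write_commit():
    store["x"] = 42
    log.append(("T2 write+commit", 42))

def t1_read_2():
    log.append(("T1 read #2", store["x"]))

# The problematic schedule: T2 slips in between T1's two reads
for step in (t1_read_1, t2_write_commit, t1_read_2):
    step()

first = log[0][1]
second = log[2][1]
print(first, second)   # 1 42: the same read, repeated, gives two values
```

Under Repeatable Read or Serializable, a real database would block or abort one of the transactions so that both of T1's reads return the same value.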

Distributed Transactions – the basics

  1. As computer systems were connected over networks, our industry started exploring master-backup architectures. However, there is always some delay between when a transaction commits at the master and when the backup is updated - leaving a window of inconsistency.
  2. Alternatively you can use synchronous updates. Protocols like two-phase commit and three-phase commit were introduced for exactly this purpose. In essence, a transaction is not committed until all machines participating in the distributed atomic transaction first commit their share of it. The obvious side effect is that while consistency is increased, availability is decreased because of all this coordination effort…

Enter the CAP theorem

The CAP theorem says that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance  (the system continues to operate despite arbitrary partitioning due to network failures)

The theorem is easy to understand and equally easy to misapply. Many have interpreted it as a binary choice between consistency and availability (because partitions are ever-present).
However, what the theorem REALLY demands of you is the following:
  • better partition detection,
  • better partition management,
  • and, more importantly, mitigation and recovery when the partition heals.
More on this next time ...


Look Ma, No Locks - Act I

I have seen too many people signing up for the "SQL vs NoSQL war" without understanding what exactly they are fighting for, and more importantly, for whom and for what reasons.
This is the first act of the drama that is the "SQL vs NoSQL war"... enjoy.

Opening sequence: Writes, they always conflict 

Ever since John von Neumann started reading and writing to memory addresses, he left a problem for future generations to solve. If you read a variable twice (or two different variables), then the values obtained during those reads depend on whether someone else wrote and changed the variable(s) behind your back between those calls!
The problem is pervasive - from processor registers and caches to memory and disks, from single processors to distributed systems.
The problem becomes even more severe when transactions are involved.
Say you have just two statements with the primitive operations add, multiply and assign, and the initial values of the variables are x = 1 and y = 2:
  1. z = add(x, y)
  2. y = multiply(10, x)
What is the value of z if you could simulate all combinations of reads and writes for the 3 variables x, y and z happening one after another?
  • If 1 is executed before 2, then z == 3.
  • In the other case it is 11!
Which one is it - REALLY?  
Now what if there are two transactions with 8 steps each? What about 679 steps each? I get dizzy just thinking about this - I hope you see the problem here...
This is NOT a theoretical problem: with pipelined architectures the operations are indeed executed in a different order every time you run. With reordering of messages in a distributed system this is a very real problem.
What happens if we "double click" and break the statements above down into the smallest possible units of execution - maybe single processor instructions, or even micro-instructions?!
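The step-level interleavings of the two statements above can be enumerated exhaustively. Each statement becomes a fixed sequence of primitive reads and one write, and we try every way of weaving the two sequences together:

```python
from itertools import combinations

# S1: r(x), r(y), z := x_read + y_read
# S2: r(x),       y := 10 * x_read
S1 = [(1, "r", "x"), (1, "r", "y"), (1, "w", "z")]
S2 = [(2, "r", "x"), (2, "w", "y")]

def run(schedule):
    env = {"x": 1, "y": 2, "z": None}
    regs = {}                              # (statement, var) -> value read
    for who, op, var in schedule:
        if op == "r":
            regs[(who, var)] = env[var]
        elif who == 1:                     # S1 writes z = x + y
            env["z"] = regs[(1, "x")] + regs[(1, "y")]
        else:                              # S2 writes y = 10 * x
            env["y"] = 10 * regs[(2, "x")]
    return env["z"]

results = set()
slots = len(S1) + len(S2)
# Choose which slots S1's steps occupy; order within each statement
# is preserved, which is exactly what an interleaving means.
for positions in combinations(range(slots), len(S1)):
    schedule, i1, i2 = [], 0, 0
    for slot in range(slots):
        if slot in positions:
            schedule.append(S1[i1]); i1 += 1
        else:
            schedule.append(S2[i2]); i2 += 1
    results.add(run(schedule))

print(sorted(results))                     # [3, 11]
```

Even this tiny example has two distinct outcomes, depending solely on whether S1 reads y before or after S2's write; with 8-step or 679-step transactions the number of interleavings explodes combinatorially.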
Real geeks know it already: it is all about History.
  • Real geeks represent this sort of thing as a directed graph, where an operation that depends on other operations is connected to them by directed edges.
  • Real geeks truly understand that there are many ways to traverse the nodes in such a connected graph.
  • Real geeks also know that there are many permutations and combinations in which to visit the nodes, and each path, or History, is just one way things could have happened.
  • There are many alternative Histories of your travel through the graph - you just happen to take one path and know one History.

The Seer : Transactions - Do or Don't, Don't try 

One of the greatest inventions was the idea of transactions. You read a number of items and write a number of items - but here is the rule everyone follows: ALL of the variables change at once, or NONE of them change at all!
Seers invented this great idea of ATOMIC operations: batch your operations in transactions and at the end either COMMIT or ABORT, but never leave your application in a ZOMBIE state.
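A minimal sketch of this all-or-nothing idea (a toy in-memory store of my own invention, not any particular database's API): writes are buffered until COMMIT and simply discarded on ABORT.

```python
class AtomicStore:
    """Toy key-value store: changes are buffered until COMMIT,
    and thrown away on ABORT - all or nothing, never a ZOMBIE state."""
    def __init__(self):
        self.data = {}
        self.pending = None

    def begin(self):
        self.pending = {}

    def write(self, key, value):
        self.pending[key] = value   # buffered, not yet visible

    def read(self, key):
        if self.pending and key in self.pending:
            return self.pending[key]
        return self.data[key]

    def commit(self):
        self.data.update(self.pending)  # ALL the variables change at once
        self.pending = None

    def abort(self):
        self.pending = None             # NONE of them change at all

store = AtomicStore()
store.begin()
store.write("x", 1)
store.write("y", 2)
store.commit()
store.begin()
store.write("x", 99)
store.abort()                # x stays 1
print(store.data)            # → {'x': 1, 'y': 2}
```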
Are all the problems solved?
NO!!! Mr. Problem shows up again. The effect of a series of transactions still depends on which transaction ran first!!

First Breakthrough : Serializable transactions - what if we decide order should not matter at all - just like that ? 

What if you did not allow transactions to complete at all if they are NOT serializable, meaning that the side effects must not depend on the order in which the transactions were executed? Here is a simple rule that solves the problem: if the order between transactions does not matter, then and only then allow them to proceed - or else fail them right away.
No mercy - you need to be Serializable! Anyway, what is the difference between an explicit ABORT and an implicit ABORT (when a transaction does not even meet our requirement of serializability)?
  • Here is our requirement: given transactions T1 and T2, the effects are the same as if T1 in its entirety was executed before T2, or vice versa.
  • A stronger requirement is that ANY interleaving of the operations in T1 and T2 produces the SAME effect.

First Problem : But how do you even know when a set of transactions is serializable?

The Seers know the answers - as always:
A History of transactions is Serializable if and only if its operation dependency graph does not contain any cycles.
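Here is one way to sketch that check in code (the history format and function names are my own; this is the conflict-serializability flavor of the test): build the precedence graph from conflicting operations and look for a cycle.

```python
def precedence_edges(history):
    """history: list of (txn, op, item) with op 'r' or 'w'.
    Two operations conflict if different txns touch the same item
    and at least one of them is a write; the earlier one gets the edge."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(history):
        for t2, op2, x2 in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_serializable(history):
    """Serializable iff the precedence graph has no cycle (DFS coloring)."""
    graph = {}
    for a, b in precedence_edges(history):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True          # back edge => cycle
        color[n] = BLACK
        return False
    return not any(color[n] == WHITE and dfs(n) for n in graph)

good = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x")]
bad  = [("T1", "r", "x"), ("T2", "w", "x"),
        ("T2", "w", "y"), ("T1", "r", "y")]
print(is_serializable(good), is_serializable(bad))  # → True False
```

In the bad history T1 must come before T2 (on x) and T2 must come before T1 (on y) - a cycle, so no serial order can have the same effect.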

The Journey begins:

The serializability theorem looks like a very promising destination. It feels like a worthy purpose for a database system to live for...
The Question is - how do you get there?

First Wrong Turn: Two phase locking 

In the old days, when there was a simple one-core computer, the answer was obvious... use two-phase locking.
The algorithm is pretty simple:
  1. In phase ONE, try to acquire all the locks you will ever need (or else abort if you can't)
  2. Do your stuff (censored version)
  3. In phase TWO, release all the locks you acquired in step 1
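The three steps above can be sketched like this (a toy using Python's threading locks; this is the conservative "grab everything up front" variant, with a fixed lock order thrown in to dodge the deadlocks we will meet later):

```python
import threading

class TwoPhaseTxn:
    """Toy conservative two-phase locking: the grow phase acquires
    every lock (or aborts), the shrink phase releases them all."""
    def __init__(self, *locks):
        self.locks = sorted(locks, key=id)   # fixed global order
        self.held = []

    def begin(self):                         # phase ONE: grow
        for lock in self.locks:
            if not lock.acquire(blocking=False):
                self.release_all()
                raise RuntimeError("ABORT: could not acquire all locks")
            self.held.append(lock)

    def release_all(self):                   # phase TWO: shrink
        while self.held:
            self.held.pop().release()

lx, ly = threading.Lock(), threading.Lock()
txn = TwoPhaseTxn(lx, ly)
txn.begin()                        # grabs both locks or aborts
# ... do your stuff (censored version) ...
txn.release_all()                  # phase TWO: everything released
print(lx.acquire(blocking=False))  # → True (the lock is free again)
```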

The Shortest theorem ever !!

  • If there are cycles in the "History Graph" then phase ONE will fail for one of the transactions, and that transaction is not a member of the Serializable Transaction Set.
  • That is it - that is the theorem: in a Directed Acyclic Graph you can't visit the same node twice... no two transactions can hold the same lock at the same time!!

Deadlocks : Oh my god !!!

The story continues in next week's episode...

Einstein and Time Travel in Distributed Systems

Exactly 100 years ago this month, in Nov 1915, Einstein published his Field Equations and forever changed physics and our understanding of space and time. His ideas are still relevant in the field of Distributed Systems. This is my tribute to Einstein, showing how his fundamental principles directly apply to distributed system design.

Accept the reality and work backwards to fix your theories

In the last century Einstein turned physics upside down several times, simply by accepting reality as it is and working backwards to fix the physics to deal with it.

  • First, Einstein used the reality of the constant speed of light to derive special relativity and the famous E=mc².
  • Then he used the simple everyday fact of the equivalence of inertial mass and gravitational mass to derive General Relativity. Theories about unimaginable stuff like black holes and the Big Bang followed.
  • He accepted the fundamental reality of the quantum nature of the interactions between light and matter, and laid the foundation of quantum mechanics by explaining the photoelectric effect. By accepting the simple fact that two photons are indistinguishable from each other, he derived the Bose-Einstein statistics that is the basis of how lasers work - at the root of everything in IT, from VLSI to printers to optic fibers to Compact Discs.

Photon the Messenger of Force

The greatest achievement of the last century is Quantum Field Theory, the most successful part of which is Quantum Electro-Dynamics (QED).
The theory posits that all interactions between matter particles are mediated through force particles. In short, one matter particle sends a message to another particle in the form of a photon. After receiving this event/message, the second matter particle changes its internal state. Quantum Field Theory reduced all physical laws to this basic interaction scheme. A Fermion (a matter particle) keeps on doing what it is doing, i.e. remains in the same state, until it either sends or receives a message - a Boson (a force particle).
In short, nature at its most fundamental level works on this Message Passing paradigm.
No Global Clock and Only Local Effects
When you look at the night sky there are billions of shiny stars. You can see a star only because a photon from that star started out perhaps thousands of years ago and is hitting your eye right now, getting converted into a chemical-electrical signal. Have you ever wondered that the moonlight you see now actually started out ~8 minutes ago at the surface of the sun, and reached you only after reflecting off the surface of the moon? Suppose you see a shooting star and two minutes later see a supernova. Which event do you think really happened first, and which one after? It really depends on the frame of reference - your local frame of reference.
In a distributed system, messages get reordered and messages get delayed. How do you know which event happened before another event? This is super important, because without an ordering of the events, any attempt at maintaining consistency is impossible.
But all hope is not lost, because we still know two things:
  • that the star became unstable before it went supernova
  • that the light from the shooting star was emitted before you received it.
Formally this relation is captured in a Strict Partial Order < such that:
  1. If event A happened on the same process before event B, then A < B
  2. If event A is the sending of message M and B is the receiving of that message, then A < B
  3. Transitive : if A < B and B < C then A < C
  4. Irreflexive : an event cannot happen before itself!
  5. Asymmetric : if A < B is true then B < A must be false.
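This happened-before relation is exactly what Lamport's logical clocks capture in practice. A minimal sketch (class and method names are my own): each process keeps a counter, every event bumps it, and a receive jumps ahead of the message's timestamp, so causally related events always get increasing timestamps.

```python
class Process:
    """Lamport logical clock: if A happened before B (in the < order
    above), then timestamp(A) < timestamp(B)."""
    def __init__(self):
        self.clock = 0

    def local_event(self):           # rule 1: same-process ordering
        self.clock += 1
        return self.clock

    def send(self):                  # the timestamp travels with the message
        self.clock += 1
        return self.clock

    def receive(self, msg_time):     # rule 2: send < receive
        self.clock = max(self.clock, msg_time) + 1
        return self.clock

p, q = Process(), Process()
a = p.local_event()   # a == 1
m = p.send()          # m == 2
b = q.receive(m)      # b == 3, always > m even though q's clock was 0
print(a, m, b)        # → 1 2 3
```

Note the converse does not hold: two concurrent events may still get ordered timestamps, which is exactly why this is only a partial order on the real causality.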

Black holes and Gravitational Lensing 
In Distributed Systems messages often get lost, or sometimes a node just falls into a black hole. We know of the presence of a node and its state only by receiving messages from it.
But equally importantly, because of the gravitational bending of light we might see multiple images of the same star. The parallel in distributed systems is when you get multiple duplicate messages for the same event and need to deal with them.
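A tiny sketch of dealing with those duplicate "images" (the message ids and names here are made up for illustration): remember which message ids you have already delivered, and drop the repeats.

```python
class DedupReceiver:
    """Deliver each message (event) exactly once,
    however many copies of it arrive."""
    def __init__(self):
        self.seen = set()
        self.delivered = []

    def on_message(self, msg_id, payload):
        if msg_id in self.seen:
            return False             # a second image of the same event
        self.seen.add(msg_id)
        self.delivered.append(payload)
        return True

r = DedupReceiver()
r.on_message("m1", "star went supernova")
r.on_message("m1", "star went supernova")   # gravitationally lensed copy
print(r.delivered)  # → ['star went supernova']
```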
In conclusion, 
Einstein based all his theories on this fundamental principle of causality, and his genius was converting a theory of local effects into a theory that can describe the entire universe and the birth of Space and Time.
Below is the photograph that proved the light-bending effect of the sun's gravitation and turned Einstein's General Theory of Relativity into a Law of Nature.
Thank you very much Albert Einstein.

Note : all images from wikipedia - many thanks to the original creators of those images.