I have seen too many people signing up for the "SQL vs NoSQL war" without understanding what exactly they are fighting for and, more importantly, for whom and for what reasons.
This is the first act of the drama that is the "SQL vs NoSQL war"... enjoy.
Opening sequence: Writes, they always conflict
Ever since John von Neumann started reading from and writing to memory addresses, he left a problem for future generations to solve: if you read a variable twice (or read two different variables), the values you get depend on whether someone else wrote to the variable(s) behind your back between those reads!
The problem is pervasive: from processor registers and caches to memory and disks, from single processors to distributed systems.
The problem becomes even more severe when transactions are involved.
Say you have just two statements built from the primitive operations add, multiply, and assign, and the initial values of the variables are x = 1 and y = 2:
- z = add(x, y)
- y = multiply(10, x)
What is the value of z if you simulate every possible combination of reads and writes of the three variables x, y, and z, happening one after another?
- If statement 1 executes before statement 2, then z == 3.
- In the other case it is 11!
Which one is it - REALLY?
Now what if there are two transactions with 8 steps each? That is already 12,870 possible interleavings. What about 679 steps each? I get dizzy just thinking about this - I hope you see the problem here...
This is NOT a theoretical problem: with pipelined processor architectures the operations really can execute in a different order every time you run, and with reordering of messages in a distributed system it is a very real problem.
What happens if we "double click" and break the statements above down into the smallest possible units of execution - maybe single processor instructions, or even micro-instructions?!
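To make this concrete, here is a small Python sketch - the step encoding and names are mine, purely for illustration - that breaks the two statements into their primitive reads and writes and brute-forces every interleaving:

```python
from itertools import combinations

# The two statements, broken down to their "smallest units": single reads and writes.
# T1: z = add(x, y)       -> read x, read y, write z
# T2: y = multiply(10, x) -> read x, write y
# t1/t2 hold each statement's private copies of what it has read so far.

def run(order):
    shared = {"x": 1, "y": 2, "z": None}   # initial values from the text
    t1, t2 = {}, {}
    def step(label):
        if label == ("T1", 0):   t1["x"] = shared["x"]              # T1 reads x
        elif label == ("T1", 1): t1["y"] = shared["y"]              # T1 reads y
        elif label == ("T1", 2): shared["z"] = t1["x"] + t1["y"]    # T1 writes z
        elif label == ("T2", 0): t2["x"] = shared["x"]              # T2 reads x
        elif label == ("T2", 1): shared["y"] = 10 * t2["x"]         # T2 writes y
    for label in order:
        step(label)
    return shared["z"]

# An interleaving = choosing which 3 of the 5 time slots belong to T1.
outcomes = set()
for t1_slots in combinations(range(5), 3):
    order, i1, i2 = [], 0, 0
    for slot in range(5):
        if slot in t1_slots:
            order.append(("T1", i1)); i1 += 1
        else:
            order.append(("T2", i2)); i2 += 1
    outcomes.add(run(order))

print(outcomes)   # {3, 11} -- the value of z depends on the interleaving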
Real geeks know it already: It is all about History
- Real geeks represent this sort of thing as a directed graph, where an operation that depends on another operation is connected to it by a directed edge.
- Real geeks truly understand that there are many ways to traverse the nodes in such a connected graph.
- Real geeks also know that there are many permutations and combinations in which the nodes of this graph can be visited, and each such path, or History, is just one way things could have happened (there is a tiny sketch of this right after the list).
- There are many alternative Histories of your travel through the graph - you just happen to take one path and know one History.
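If you would rather see it than imagine it, here is a toy sketch - the operation names and dependencies are invented - that treats every ordering of a tiny dependency graph that respects its edges as one possible History:

```python
from itertools import permutations

# A toy dependency graph: edge (a, b) means operation a must happen before b.
# These operations and edges are hypothetical, just to show "many Histories".
ops = ["read_x", "read_y", "write_z", "write_y"]
must_precede = {("read_x", "write_z"), ("read_y", "write_z")}

def respects_dependencies(order):
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in must_precede)

# Each valid ordering is one History -- one way the walk could have gone.
histories = [order for order in permutations(ops) if respects_dependencies(order)]
for h in histories:
    print(h)
print(f"{len(histories)} possible Histories for this tiny graph")
```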
The Seer: Transactions - Do or Don't, Don't try
One of the greatest inventions was the idea of transactions. You read a number of items and write a number of items - but here is the rule everyone follows: ALL the variables change at once, or NONE of them change at all!
The Seers invented this great idea of ATOMIC operations: batch your operations into transactions and, at the end, either COMMIT or ABORT - but never leave your application in a ZOMBIE state.
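Here is roughly what that contract looks like in practice, sketched with Python's standard sqlite3 module; the accounts table and the doomed transfer are invented for the example:

```python
import sqlite3

# ALL or NONE: two writes that must land together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
    row = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
    if row[0] < 0:
        raise ValueError("insufficient funds")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
    conn.commit()        # ALL the changes become visible at once
except Exception:
    conn.rollback()      # ...or NONE of them do; no ZOMBIE state left behind

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 100), ('bob', 0)] -- the aborted transfer left no trace
```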
Are all the problems solved?
NO!!! Mr. Problem shows up again. The effect of a series of transactions still depends on which transaction ran first!!
First Breakthrough: Serializable transactions - what if we decide order should not matter at all, just like that?
What if you did not allow transactions to complete at all unless they are serializable, meaning that the net effect does not depend on the order in which the transactions were executed? Here is a simple rule that solves the problem: if the order between transactions does not matter, then and only then allow them to proceed - or else fail them right away.
No mercy - you need to be Serializable! Anyway, what is the difference between an explicit ABORT and an implicit ABORT (when a transaction does not even meet our requirement of serializability)?
- Here is our requirement: given transactions T1 and T2, the effect must be the same as if T1 in its entirety executed before T2, or vice versa.
- A stronger requirement: ANY interleaving of the operations in transactions T1 and T2 produces the SAME effect.
First Problem: But how do you even know when a set of transactions is serializable?
The Seers know the answers - as always
A History of transactions is Serializable if and only if its operation dependency graph does not include any cycles.
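One way to picture the theorem is to build that dependency graph from a History and look for a cycle. A rough sketch - the History, the step format, and the helper names are all mine:

```python
# Each step of the History is (transaction, action, item); two steps conflict
# when they touch the same item, belong to different transactions, and at
# least one of them is a write.
history = [
    ("T1", "read",  "x"),
    ("T2", "write", "x"),
    ("T2", "write", "y"),
    ("T1", "read",  "y"),
]

def precedence_edges(history):
    edges = set()
    for i, (ti, ai, xi) in enumerate(history):
        for tj, aj, xj in history[i + 1:]:
            if xi == xj and ti != tj and "write" in (ai, aj):
                edges.add((ti, tj))   # ti's step came first, so ti -> tj
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, target, seen=()):
        return any(nxt == target or (nxt not in seen and
                   reachable(nxt, target, seen + (nxt,)))
                   for nxt in graph.get(start, ()))
    return any(reachable(node, node) for node in graph)

edges = precedence_edges(history)
print(edges)                                  # both T1 -> T2 and T2 -> T1
print("serializable?", not has_cycle(edges))  # False
```

The two edges point in opposite directions, so this History is equivalent to neither "T1 then T2" nor "T2 then T1" - it is not Serializable.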
The Journey begins:
The serializability theorem looks like a very promising destination. It feels like a worthy purpose for a database system to live for...
The Question is - How do you get there?
First Wrong Turn: Two phase locking
In the old days, when there was a simple one-core computer, the answer was obvious... use two phase locking.
The algorithm is pretty simple (a rough code sketch follows the list):
- In phase ONE (the growing phase), acquire every lock you will need - and release nothing yet
- Do your stuff (censored version)
- In phase TWO (the shrinking phase), release all the locks you acquired in phase ONE - and never acquire a new one
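A bare-bones sketch of the two phases, using plain Python threads and one lock per data item; the items and the little transfer are invented for the example:

```python
import threading

# One lock per data item; this lock table and the transfer are hypothetical.
locks = {"x": threading.Lock(), "y": threading.Lock()}
data = {"x": 1, "y": 2}

def move_ten_from_y_to_x():
    needed = ("x", "y")              # everything this transaction will touch
    try:
        # Phase ONE: acquire every lock you will need, release nothing yet.
        for item in needed:
            locks[item].acquire()
        # Do your stuff while holding the locks.
        data["y"] -= 10
        data["x"] += 10
    finally:
        # Phase TWO: release everything, acquire nothing more.
        for item in needed:
            locks[item].release()

threads = [threading.Thread(target=move_ten_from_y_to_x) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(data)   # {'x': 21, 'y': -18}: both updates applied, no lost update
```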
The Shortest theorem ever !!
- If every transaction follows the two-phase rule, the resulting History Graph cannot contain a cycle: a cycle would require some transaction to grab a new lock after it had already started releasing locks in phase TWO, and the rule forbids exactly that.
- That is it - that is the theorem: two-phase locking hands you only Serializable histories!!
Deadlocks: Oh my god!!!
....
....
the story continues in next week's episode ....