Tuesday, May 10, 2016

Microservices: Managing for Frequent Reads and Rare Writes

It is a rare service that does not need to persist its state. 
Your design choices depend critically on how frequently data is written and on the allowable delay between a write and the moment readers (and other nodes in the cluster) see that update. Let us examine a class of services with very rare writes and massive numbers of reads. In my experience this is the most common kind of service. Here I outline a common way to design such a microservice. I have used this simple "design recipe" for many services and it has worked for me every time over the last decade.

Step 1: Treat It as a Read-Only Service

The first step towards designing the service is to assume it is read-only and optimize it for fast access and scalability. When data is read-only you can make a million copies of your service without any coordination: they all return the same results for the same input.
An example would be a service that returns the maximum or minimum recorded temperature for a given ZIP code. Input for this service is historical data, which we assume is somehow injected into the system already. Temperature gets recorded daily, so it is okay to update the result set on a daily basis. Until we get new updates, the service just serves read-only data. Another example would be catalog search on a football memorabilia/collectibles site, which will have some structured data in its product catalog. It is highly unlikely that new products are added every hour or that product details are updated every second.
  1. Precompute results: The first optimization, obviously, is to pre-compute results. For example, min/max results for given ZIP codes are calculated offline in a batch (or Hadoop) job. The resulting data is then loaded into some persistent database, and your service becomes just a lookup of this pre-computed data for a given key. Performance increases manyfold, and you can push it even further by trading space for speed and denormalizing the results. How often you recalculate the result set is domain specific; it can range from minutes to daily to even weekly. Most domains tolerate delays extremely well and do not require true "as of now" results. Where do you store the output of the batch job? I prefer distributed, highly available key-value stores for such denormalized, pre-computed results. The service code then mostly looks up data in the key-value store (plus some parameter validation and minor post-processing).
  2. Cache results: We are still hitting the datastore for every request, and networks, while fast these days, are not fast enough. The next optimization is to use distributed caching to decrease latency; the result data is treated as read-only anyway.
  3. Cache results locally in memory: We can increase performance even further with multiple levels of cache, keeping some of the data locally in server memory.
The "hot data" lives mostly in memory,  the "warm data" is a RPC call away in  distributed cache and "cold data" is always reliably stored in distributed key value store and summoned at any time.  It is very easy to scale your service up or down by adding more caching or service instances. 

Now is the time to start thinking about those Writes. 

Step 2: Handle Writes

The writes are rare but DO change the state of the system. We must handle concurrency and consistency properly. 

Separation of Concerns

There is a common pattern called Command Query Responsibility Segregation (CQRS). Essentially you have two separate interfaces: one for update and the other for read.  We designed our read interface earlier.  
The write interface is generally a plain vanilla three-tier web service backed by a high-performance relational database. The data in this database is the ultimate source of truth for your service, and I prefer it to be strongly consistent, i.e. ACID. The read interface never, ever reads from this database; it reads from the persistent key-value store. A batch job periodically reads the data from the master database, transforms it into key-value pairs, and stores them in the NoSQL store. We have already discussed how to serve this read-only key-value data effectively. Note that the eventual consistency comes from the delay between when data is added to the master database and when it shows up in the key-value store. There are mature, battle-tested protocols for handling failures, backups, and fail-overs of the master read/write RDBMS, and given the low write traffic it is entirely appropriate to use standard database technology. Again, recall that this database does not contain data about the entire universe, only the slice of reality relevant to our microservice, so data size is usually not a problem these days.
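
As an illustration, here is a minimal sketch of the batch job that moves data from the write-side relational database to the read-side key-value store, using plain JDBC. The daily_temperature table, its columns, the key format, and the KeyValueStore interface are assumptions made for the sketch; any JDBC-accessible database and any key-value client would work the same way.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ReadViewSyncJob {

    // Hypothetical client for the read-side key-value store.
    interface KeyValueStore {
        void put(String key, String value);
    }

    private final String jdbcUrl;           // e.g. a JDBC URL for the master RDBMS (assumption)
    private final KeyValueStore readStore;

    public ReadViewSyncJob(String jdbcUrl, KeyValueStore readStore) {
        this.jdbcUrl = jdbcUrl;
        this.readStore = readStore;
    }

    /** Invoked periodically (cron, scheduler, etc.) to rebuild the denormalized read view. */
    public void run() throws SQLException {
        String sql = "SELECT zip_code, MIN(temp_c) AS min_c, MAX(temp_c) AS max_c "
                   + "FROM daily_temperature GROUP BY zip_code";   // table/columns are assumptions

        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                String key = "minmax:" + rs.getString("zip_code");
                String value = rs.getDouble("min_c") + "/" + rs.getDouble("max_c");
                readStore.put(key, value);     // the read side only ever sees this store
            }
        }
    }
}
```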

Optimistic Concurrency 

Although write traffic is low, it still requires locking, properly scoped transactions, and correct isolation levels. Optimistic concurrency, typically implemented with version checks instead of long-held locks, can increase the performance of your database without sacrificing consistency.
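
A minimal sketch of the optimistic check, assuming a plain version column on the table and JDBC; the product table and its columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ProductWriter {

    /**
     * Updates a product description only if nobody else changed the row since we read it.
     * Returns true on success; false means the version moved on and the caller should
     * re-read and retry (or report a conflict).
     */
    public boolean updateDescription(Connection conn, long productId,
                                     long expectedVersion, String newDescription)
            throws SQLException {
        String sql = "UPDATE product SET description = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";   // optimistic check: no lock held while the user edits
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, newDescription);
            ps.setLong(2, productId);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1;            // 0 rows means a concurrent writer won
        }
    }
}
```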

Use Event Sourcing 

Event sourcing is a technique where we record and preserve each and every change made to the system as an immutable event. It is yet another way to increase the throughput and resiliency of the system.
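
Here is a toy sketch of the idea: every change is appended to a log of immutable events, and any view of the current state is rebuilt by replaying that log. The event type and the in-memory list are illustrative; a real system would persist the events durably and usually snapshot derived state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TemperatureEventLog {

    /** Immutable fact: a temperature reading was recorded for a ZIP code. */
    public static final class TemperatureRecorded {
        final String zipCode;
        final double celsius;
        final long recordedAtEpochMs;
        TemperatureRecorded(String zipCode, double celsius, long recordedAtEpochMs) {
            this.zipCode = zipCode;
            this.celsius = celsius;
            this.recordedAtEpochMs = recordedAtEpochMs;
        }
    }

    // Append-only log; in a real service this would live in durable storage.
    private final List<TemperatureRecorded> events = new ArrayList<>();

    public void append(TemperatureRecorded event) {
        events.add(event);                 // never update or delete, only append
    }

    /** Rebuild the "max temperature per ZIP" view by replaying every event. */
    public Map<String, Double> replayMaxPerZip() {
        Map<String, Double> maxPerZip = new HashMap<>();
        for (TemperatureRecorded e : events) {
            maxPerZip.merge(e.zipCode, e.celsius, Math::max);
        }
        return maxPerZip;
    }
}
```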

Use Complex Event Processing

So far we have relied on a batch model, but that is not a hard and fast requirement. You can build complex pipelines on top of the event stream using the likes of ReactiveX, Spark Streaming, or Apache Storm.
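
As a toy illustration, the sketch below uses RxJava 2 (one ReactiveX implementation for the JVM, assumed to be on the classpath) to maintain a running maximum as temperature events arrive, instead of waiting for the next batch run; the event type and fixed input list are made-up examples standing in for a real message stream.

```java
import io.reactivex.Observable;

public class StreamingMaxTemperature {

    // Hypothetical event type arriving on the stream.
    static final class TemperatureRecorded {
        final String zipCode;
        final double celsius;
        TemperatureRecorded(String zipCode, double celsius) {
            this.zipCode = zipCode;
            this.celsius = celsius;
        }
    }

    public static void main(String[] args) {
        // In a real pipeline the Observable would be fed by a message bus, not a fixed list.
        Observable<TemperatureRecorded> events = Observable.just(
                new TemperatureRecorded("94105", 18.5),
                new TemperatureRecorded("94105", 21.0),
                new TemperatureRecorded("94105", 19.2));

        events.filter(e -> e.zipCode.equals("94105"))
              .map(e -> e.celsius)
              .scan(Math::max)              // running maximum as each event arrives
              .subscribe(max -> System.out.println("max so far: " + max));
    }
}
```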

Step 3: Get Closer to Data

Use Ephemeral Local Store: One disadvantage of caching is that memory is limited and you still need to fetch data from a remote machine (be it memcached or your NoSQL store). Yet another optimization is to copy the data to each mid-tier machine. There are probably a dozen different ways to achieve this, each with its own pros and cons, but the important point is that data access then becomes a purely local lookup, much faster than calling a remote machine.
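
One simple variant, sketched below: at startup each mid-tier machine pulls a full snapshot of the (small, pre-computed) result set into a local map and refreshes it on a timer. The SnapshotSource interface and the hourly cadence are assumptions for the sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LocalReplica {

    /** Hypothetical source of full snapshots (e.g. a dump file or a scan of the key-value store). */
    interface SnapshotSource {
        Map<String, String> loadAll();
    }

    private final SnapshotSource source;
    private volatile Map<String, String> data = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public LocalReplica(SnapshotSource source) {
        this.source = source;
        refresh();                                                // load once at startup
        scheduler.scheduleAtFixedRate(this::refresh, 1, 1, TimeUnit.HOURS); // cadence is domain-specific
    }

    private void refresh() {
        // Replace the whole map atomically; readers always see one consistent snapshot.
        this.data = new ConcurrentHashMap<>(source.loadAll());
    }

    /** Purely local lookup: no network hop on the read path. */
    public String get(String key) {
        return data.get(key);
    }
}
```

The trade-off is the usual one: you accept bounded staleness (until the next refresh) in exchange for removing the network hop entirely.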
Use Actors: You can also use actor-based runtimes to add fault tolerance and optimize write performance; Akka is one such example.
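
For completeness, here is a toy sketch using Akka's classic Java API: a single actor owns the mutable state, and because it processes one message at a time, concurrent writers never contend on explicit locks. The message types and counter semantics are illustrative assumptions, not a prescribed design.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class WriteCounterActor extends AbstractActor {

    /** Message asking the actor to record one more write for a ZIP code. */
    public static final class RecordWrite {
        final String zipCode;
        public RecordWrite(String zipCode) { this.zipCode = zipCode; }
    }

    private long writesSeen = 0;   // state is confined to the actor: no locks needed

    public static Props props() {
        return Props.create(WriteCounterActor.class);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(RecordWrite.class, msg -> writesSeen++)                    // one message at a time
                .matchEquals("count", msg -> getSender().tell(writesSeen, getSelf()))
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("write-side");   // keeps running in a real service
        ActorRef counter = system.actorOf(WriteCounterActor.props(), "writeCounter");
        counter.tell(new RecordWrite("94105"), ActorRef.noSender());
    }
}
```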

Conclusion:

Many domains tolerate a delay between data being updated and that update becoming visible to everyone. Many microservices have very high read traffic but rare writes. There is a simple and effective "design recipe" for building them that works wonderfully in practice.
Happy Coding!

References:

  1. Data Concurrency and Consistency in Oracle Databases. http://docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm
  2. Oracle High Availability Database. http://www.oracle.com/us/products/database/high-availability/overview/index.html
  3. Oracle NoSQL Database. http://www.oracle.com/us/products/database/nosql/overview/index.html
  4. Memcached. http://memcached.org
  5. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics. http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-hadoop-oracle-194542.pdf
  6. Why use event sourcing? http://codebetter.com/gregyoung/2010/02/20/why-use-event-sourcing/
  7. Actor model. https://en.wikipedia.org/wiki/Actor_model
