Why an API Developer Should Care About Big Data Technologies


As developers we have to be careful not to become so narrowly focused on our immediate responsibility that we lose sight of the bigger picture. In pursuit of serving up the almighty contract, care must be taken not to overlook design decisions that are almost certain to have negative performance implications. One such decision revolves around fetching and presenting data from a database.

This problem seems rather cut and dried - it's been done for decades, after all. The consumer sends a request, the API fetches the data from the database and sends it to the consumer. Simple? Agreed, until sub-300 millisecond service level agreements come into play and the databases hold many millions of artifacts. Now we have design considerations that are no longer trivial.

The Aggregation Document Problem

SELECT N+1 is a well-known problem in database design and querying. The name describes the query pattern itself: 1 query for the initial 'parent object' plus N additional queries, one for each 'child object'. In Big O terms that is O(N+1), which reduces to O(N): a linear algorithm whose performance degrades, in linear fashion, with the size of the dataset, and each of those N queries is a separate round trip to the datastore. This type of query is common when building a document that is an aggregation of various related pieces, because multiple traversals of a graph, or multiple SELECT queries, must happen to fetch the pieces required to produce the document. Databases do not handle this access pattern well. In a highly normalized relational model, the linear queries manifest as multiple SELECT and JOIN statements. In a graph, they manifest as multiple, discrete traversals of the graph.
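
In code, the relational flavor of the problem looks something like the sketch below. The schema and names are invented purely for illustration, but the shape is the classic one: a single query for the parents followed by one more query per parent.

```python
import sqlite3

# Toy in-memory schema, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE classes  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE students (id INTEGER PRIMARY KEY, class_id INTEGER, name TEXT);
    INSERT INTO classes  VALUES (1, 'Biology 101'), (2, 'History 201');
    INSERT INTO students VALUES (1, 1, 'Jane'), (2, 1, 'Ravi'), (3, 2, 'Jane');
""")

# 1 query for the 'parent objects'...
classes = conn.execute("SELECT id, name FROM classes").fetchall()

# ...plus N more queries, one per parent, for the 'child objects': SELECT N+1.
for class_id, class_name in classes:
    students = conn.execute(
        "SELECT name FROM students WHERE class_id = ?", (class_id,)
    ).fetchall()
    print(class_name, [name for (name,) in students])
```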

See the following graph:
Example Graph

In this arbitrary example, I want to get all the homework assignments for all the Janes' classmates and return them in a single document. But I also want to build an aggregate of data from the various nodes:

  • Each Jane’s data (name, age, enrollment status, etc)
  • Each Class's data (name, course number, instructor name, etc)
  • Each Classmate’s data (similar to Jane’s data above)
  • Finally, the grade and content for each assignment for each classmate

To put together such a document, several traversals have to happen, and this is where the N+1, or even O(N²), problem shows up. One traversal might be the following (sketched in code after the list):

  • Get all the Classes in the Institution
  • Get all the Janes in each Class
  • Get all the Students enrolled in said Class
  • Get all the Assignments for each Student in said Class
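
A naive, on-demand implementation of that traversal might look like the following sketch. The data-access helpers are hypothetical stand-ins; in a real system each call would be another SELECT or graph traversal, and therefore another round trip.

```python
# Hypothetical in-memory stand-ins for the datastore. In a real system each
# helper below would issue a separate query or graph traversal.
CLASSES = [{"id": 1, "name": "Biology 101", "instructor": "Dr. Smith"}]
STUDENTS = {1: [{"id": 10, "name": "Jane"}, {"id": 11, "name": "Ravi"}]}
ASSIGNMENTS = {10: [{"title": "Lab 1", "grade": "A"}],
               11: [{"title": "Lab 1", "grade": "B"}]}

def get_classes(institution_id):          # 1 query
    return CLASSES

def get_students(class_id):               # +1 query per class
    return STUDENTS.get(class_id, [])

def get_assignments(student_id):          # +1 query per student
    return ASSIGNMENTS.get(student_id, [])

def build_homework_document(institution_id):
    """Builds the aggregated document on demand: the N+1 pattern in action."""
    doc = {"institution": institution_id, "classes": []}
    for cls in get_classes(institution_id):
        entry = {"class": cls["name"], "instructor": cls["instructor"], "students": []}
        for student in get_students(cls["id"]):
            entry["students"].append({
                "name": student["name"],
                "assignments": get_assignments(student["id"]),
            })
        doc["classes"].append(entry)
    return doc

print(build_homework_document("institution-1"))
```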

SELECT N+1 is typified by ORM tools such as Hibernate. Without care, auto-generating classes and/or tables via an ORM tool will produce this exact scenario. Likewise, early and rapid prototyping can produce these queries against NoSQL databases, along with the constituent API calls required to produce an aggregate. All too often, prototypes make it into production before optimization, and the N+1 performance issues are left behind in production-grade artifacts.

This N+1 issue highlights a larger point: on-demand construction of large aggregations is not suitable for services that ultimately serve up aggregated documents over HTTP. The root problem lies in our monolithic thinking, a holdover from the old days: the One-Database-To-Rule-Them-All syndrome. The monolithic assumption is that a single representation of the data in a single datastore should serve every purpose, from complex analysis to the simple storage and retrieval of presentation data. In many use cases a monolithic data approach is a bad approach. Instead, multiple views of the data should exist, each driven by its use case. In the case of presenting an aggregated document over HTTP, a precomputed document should be used.
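
As a sketch, the precomputed view for the example above might be a single document per student, keyed so the HTTP layer can fetch it with one lookup. The key and field names here are illustrative, not prescriptive:

```python
# One possible shape for a precomputed, aggregated document. The key and
# field names are illustrative only.
precomputed_doc = {
    "_id": "institution-1:student-10:homework",
    "generated_at": "2016-03-01T04:00:00Z",   # when the last precompute ran
    "student": {"name": "Jane", "age": 20, "enrollment_status": "active"},
    "classes": [{
        "name": "Biology 101",
        "course_number": "BIO-101",
        "instructor": "Dr. Smith",
        "classmates": [{
            "name": "Ravi",
            "assignments": [{"title": "Lab 1", "grade": "B", "content": "..."}],
        }],
    }],
}
```
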
There are two distinct problems, and a third implied one, with constructing this precomputed view:

  1. The first problem lies in the analysis task. The Analysis problem is the calculation of the constituent components of a larger aggregate object or document. This analysis could be any number of sufficiently complex computations on a data set that require significant processing or overhead, which is undesirable in an on-demand scenario.
  2. The second problem is the retrieval of the Analysis for presentation. In this discussion, presentation is the aggregate document and will always be some kind of document retrieved on-demand.
  3. The third problem is the implied one: the analysis must somehow be transformed and then stored as a document in a store that is well suited to document retrieval.

Polyglot persistence is one solution to this problem. Polyglot persistence describes the practice of using different data storage technologies to solve different parts of a problem. For example, use a graph database such as Titan to perform the analysis and generate the documents to be served up on the web, but do not perform that computation on-demand. Instead, store those results in a document database such as MongoDB. A document database naturally fits the type of data served up as aggregated documents, and a properly designed and indexed document database should perform well when serving JSON and XML documents to an HTTP-based service. Keep in mind that the purpose of the document database is simply to respond to trivial queries, not complex analysis. Any sophisticated searching and querying should be accomplished via another tool (e.g. Elasticsearch or a contemporary alternative). The document database simply serves up the precomputed documents. This polyglot approach solves problems one and two.
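
The read path then collapses to a single keyed lookup. The sketch below assumes MongoDB via pymongo behind a Flask endpoint; the database, collection, and route names are all invented for illustration.

```python
from flask import Flask, abort, jsonify
from pymongo import MongoClient

app = Flask(__name__)
# Hypothetical database and collection names.
documents = MongoClient("mongodb://localhost:27017").school.homework_documents

@app.route("/students/<student_id>/homework")
def get_homework_document(student_id):
    # A single indexed lookup: no joins, no traversals, no on-demand analysis.
    doc = documents.find_one({"_id": f"institution-1:{student_id}:homework"})
    if doc is None:
        abort(404)
    return jsonify(doc)

if __name__ == "__main__":
    app.run()
```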

For problem three, data must be Analyzed and Transformed from graph to document and made available to the HTTP request path. This transformation should not happen on demand - otherwise we just have the same old nasty N+1 problem hidden behind more layers of complexity. Analysis and Transformation should be accomplished via some batch-style application - via Apache Spark, Hadoop, Dataflow or some other approach that can make heavy use of parallelism and horizontal scaling. This Analysis and Transformation layer should run at regular intervals and/or in response to events, generating the documents for the document database. The result is a simple, fast document store that serves up the required documents.
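
Here is a minimal sketch of that batch layer, assuming Apache Spark (PySpark) and flat JSON exports of students and assignments as input; the paths, column names, and output location are all assumptions. The job rolls the data up into one document per student and writes the results out for loading into the document store.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct, current_timestamp

spark = SparkSession.builder.appName("homework-atl").getOrCreate()

# Hypothetical flat exports of the source data.
students = spark.read.json("/data/exports/students.json")        # student_id, name, class_id
assignments = spark.read.json("/data/exports/assignments.json")  # student_id, title, grade, content

# Analysis/Transformation: roll the child rows up into one document per student.
documents = (
    students
    .join(assignments, "student_id")
    .groupBy("student_id", "name", "class_id")
    .agg(collect_list(struct("title", "grade", "content")).alias("assignments"))
    .withColumn("generated_at", current_timestamp())
)

# Load: write the precomputed documents somewhere the document store can ingest them.
documents.write.mode("overwrite").json("/data/precomputed/homework_documents")
```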

Architectural Overview

Caveats

This proposed architecture comes with a caveat: it does not aim to generate and report analysis in near real-time. Instead, it aims for near real-time retrieval of a document containing analysis completed at a known point in time.

Additionally, the documents presented will never be exact representations of the underlying data as it exists at the moment the document is requested from the document store. A lag will always be present between when the last precompute ran and the retrieval of a precomputed document.
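
One lightweight way to make that lag explicit to consumers is to stamp each document during the precompute (as in the `generated_at` field sketched earlier) and expose its age alongside the document. A hypothetical sketch:

```python
from datetime import datetime, timezone

def document_age_seconds(doc):
    """How stale is this precomputed document? Assumes an ISO-8601 generated_at field."""
    generated_at = datetime.fromisoformat(doc["generated_at"].replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - generated_at).total_seconds()

doc = {"generated_at": "2016-03-01T04:00:00Z", "student": {"name": "Jane"}}
print(f"document is {document_age_seconds(doc):.0f} seconds old")
```
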
Individual Use Case analysis should then drive these concerns:

  1. Is this architecture a good fit for the use case? It is complex; make sure the complexity is required. The use case presented here is very high-speed retrieval of a complex aggregate document.
  2. How frequently should the Analysis and Transformation run? Will the HTTP service be presenting documents that change quarterly, daily or every few seconds?
  3. How quickly must the Analysis and Transformation step complete?
  4. What is the estimated size of the data? Are there unrealistic expectations between how fast analysis must be completed and the data’s footprint?

The answers to questions two, three and four will in turn drive concerns revolving around costs, servers, scaling, and the technologies used to implement parallelism.

The N+1 problem by itself may not be of major concern, depending upon the use case. This proposed architecture comes with a lot of complexity and will not be necessary unless you are dealing with large datasets and moderately complex construction of aggregations. Another good candidate for this architecture is constructing aggregations from disparate data sources. The datasets involved should be large, with multiple and nested SELECT N+1 scenarios, or sufficiently slow to aggregate.

Analysis, Transformation and Computation Time

One key concern with this architecture is the potential computation time of the Analysis, Transformation and Load process. This "ATL" process is similar to the classical database-driven Extract, Transform and Load (ETL) process. However, Analysis is called out specifically instead of Extraction because this type of analysis is more complex than simply retrieving data from one datastore, transforming it into the format of another, and loading it.

The size of the data and the complexity of the relationships/graph come into play here (the footprint). ATL is a problem that should be solved via MapReduce and similar solutions driven by parallelism. Apache Spark is one good candidate: it can do its work in memory instead of via disk I/O, which implies a significant performance advantage over typical Hadoop disk I/O. Spark is also capable of performing cluster computing tasks on streams of data. These are tools and techniques typically reserved for so-called 'big data' problems.
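
The in-memory advantage shows up when an intermediate dataset feeds several downstream aggregations. A brief sketch, assuming the same hypothetical JSON exports as the earlier ATL example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atl-cache-sketch").getOrCreate()
students = spark.read.json("/data/exports/students.json")
assignments = spark.read.json("/data/exports/assignments.json")

# Keep the joined dataset in memory so each downstream aggregation reuses it
# instead of recomputing it (or re-reading it from disk).
joined = students.join(assignments, "student_id").cache()

per_student = joined.groupBy("student_id").count()   # reuses the cached join
per_class = joined.groupBy("class_id").count()       # reuses the cached join
per_student.show()
per_class.show()
```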

Compute time and cost will be in tension. Faster compute times cost more money - in terms of memory, the hardware behind any virtual machines, and the number of those virtual machines. One key requirement is the ability to quickly provision these machines and tear them down, so some kind of automated deployment and provisioning system must be constructed. This is no different from any other big data job.

Typical Solution - Horizontal Scaling and Sharding

One typical solution to database performance issues is scaling and sharding. Thanks to technologies like Cassandra we can scale horizontally very easily, and sharding lets us break the data apart logically so that each piece has a smaller footprint. Yes, there are ways to spread a large database footprint and its queries amongst servers, and these approaches certainly help. However, in the case of aggregated documents, horizontal scaling is solving the wrong problem.

All the SELECTs still have to happen on-demand; scaling and sharding just throw more resources at them. This is the wrong (and potentially more expensive) approach. Generating these types of documents on demand illustrates a fundamental problem with retrieving relationships on the fly, and horizontal scaling will not make that go away. Scaling and sharding are valid approaches to data-scaling issues - just not the issue discussed in this post.

Solving SELECT N + 1

How does this architecture solve the N+1 problem? It doesn't. Instead, it minimizes it. The N+1 problem is still there - in the case of an aggregated document it always will be. The difference is that for any one document, the necessary traversals happen exactly one time, during the precompute. With a monolithic database approach, the traversals happen every time the document is requested - that is, every request incurs N+1 traversals. N+1 doesn't magically go away when producing documents containing aggregations, but SELECT N+1 can and should be moved into a layer more appropriate than the one responsible for serving up HTTP requests.

Implementation Guidance

Future articles will present partial walkthroughs, implementations and lower level details of this architecture.

Please feel free to comment, provide criticism or otherwise rant.

Inspirations and Unknown Contributors

This article is inspired by Dr. Chris Retford and by Nathan Marz's Lambda Architecture. My thoughts here are the result of issues exposed while implementing graph databases at a startup.