Wednesday, August 21, 2013

NoSQL Databases and Financial Applications

-----
Update: AWS DynamoDB offers consistent reads. One consideration, however, is the need for complex queries or queries with changing criteria - unless you know exactly what questions you have in advance NoSQL may make sense. For complex and changing queries you may opt for RDBMS because NoSQL requires designing systems for queries at time of creation - not good for adhoc queries.
-----
I recently attended an Amazon AWS architecture class which included an overview of when and where to apply certain technologies within your architecture. Sometimes new technologies come out and they are all the rage. Everything starts looking like a nail and the new technology is the hammer.

One of these absolutely beyond cool new technologies (actually they have been around a while but starting to get widely used) are NoSQL databases:

http://nosql-database.org/

NoSQL databases are interesting because they overcome certain performance limitations such as your vertically scaling database architecture falling over under load and the only way to fix that is more database servers and convoluted replication and ID management strategies which require re-writes of all your ID sequence dependent applications. According to trainer at AWS very few companies do this well because it is very hard to do. Chances are restructuring database may be more desirable.

NoSQL databases are great in the case of what I'm playing around with on Amazon AWS (DynamoDB: http://aws.amazon.com/dynamodb/) for creating new databases, fast access, distributed systems, parallel processing of data using MapReduce. There are many types of unstructured data or applications that create user generated types on the fly that can benefit immensely from this type of database, as well as huge amounts of data (ok I'll use the buzzword: Big Data) that can be processed in parallel. It's also cool to be able to store certain types of data very fast and retrieve it very fast such as when you're loading a web page. DynamoDB runs on SSD so it's zippy. I was able to set up an application that sends data to a queue and log it to DynamoDB in about three days (this having never worked with SQS or DynamoDB before...)

The fact that the data is schema-less yes, does make things faster. The structure of the database supports horizontal scaling. Cool. The problem with schema-less is similar to the problem with not having a strongly typed programming language. You can have various types of flaws a strongly typed language will prevent. I just heard a speaker talking about never using JavaScript for financial applications because there is no decimal type and he had seen many mistakes related to decimal places. Someone at the event mentioned that he works at a very large company, which shall remain nameless, that has an online calculator with exactly that problem at the time of this writing.

Analogous to that, a NoSQL database has a bunch of key value pairs which can really be anything or maybe loosely enforced depending on the database. So you cannot count on a money value having the appropriate number of decimals, for example, or depending on the system even that it is a number if it's a generic list of key-value pairs (Some systems may enforce this, I don't know them all - leave as an exercise for the reader if you are planning to use this type of database for your financial application and you care. You should.) If you use a relational database you can be pretty sure that your defined column which is a decimal with two digits will be just that or throw an error if someone tries to put anything else in there.

[Please don't start going on about obscure rounding issues with money data types or whatever...that's for another blog post but as an example - you can find many for different vendors - check out the rounding issue after the 15th digit in Sybase IQ  15.2 decimal type: http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1510/html/iqrefbb/X315932.htm].

Another factor to consider when thinking about using a NoSQL database is lack of data integrity due to no schema, traditional joins or enforcement of complex relationships. The whole reason for 1st, 2nd and 3rd normal form in relational databases and the whole entity relationship diagram exercise is to maintain data integrity - preventing duplicate mismatched data, preventing orphans, enforcing unique values, etc. If you're unfamiliar with database normalization it's a big topic worthy of a book:

http://www.amazon.com/Database-Design-Databases-Books/b?ie=UTF8&node=3902

In some cases you don't want all that schema around your data. The reason people use warehouses and flatten everything out is to eliminate the joins and make reports run a lot faster. They still use the relational database in the first place where the data is entered for data integrity. In a similar fashion you need to consider the data you are storing, the purpose of it, how it will be used and how important data integrity is in this case, to decide where is most appropriate to put that data. NoSQL databases often repeat the same data across tables which can be an issue for data integrity but may be just what you need for the system you are implementing because you're only reading, not writing, or speed is more important that precision.

With a NoSQL database in many cases you generally lose triggers. Of course you can mimic triggers in your application but depending on how your system is structured you may end up with multiple points of data entry and the same rules in multiple places if not carefully managed that can end up getting mismatched down the line. (Duplicate code is the root of all evil...http://www.informit.com/articles/article.aspx?p=1313447 ). The same logic could be in a web site middle tier and a GUI application used internally and a batch processing application. Putting this logic in the database makes it easier to enforce certain rules at the point of insert, update and delete regardless of which application or piece of code is touching the data.

If using a NoSQL database you'll also want to make sure you have consistent reads and writes if you have a financial application:

http://aws.amazon.com/dynamodb/faqs/#What_does_read_consistency_mean_Why_should_I_care

There are, of course, alternative solutions to this problem and other environment dependent decision making criteria, but then you have to consider the cost of those alternate solutions vs. just using an RDBMS. What's the point anyway? To prove you can use NoSQL or follow some ivory tower principle you read in a book... or to create a superior business application that provides more ROI to the business? Some issues with NoSQL can be worked around to design a financial application if needed because you could funnel all requests through a queue and an application layer that enforces all your rules and the tool that parses the data on the way out could perform other functions typically provided by an RDMS. But is it worth it? That's a lot of expense and possibly error-prone code to replace a technology that is pretty solid and been around a long time. You may have other objectives and pain points not addressed here such as an RDBMS that is tipping over - those are just some considerations.

Perhaps the solution is applying the right technology in the right place to appropriately solve the problem and distributing data across databases to horizontally scale your application and move data that doesn't require a formal schema to a more cost-effective, fast, scalable data store. Your NoSQL database has certain data that does not have strict data integrity rules such as product descriptions, web request data, web content, or other unstructured more free form data and your the data that is highly critical to reconcile for security and auditing purposes remains in your RDBMS. Querying your RDBMS and stuffing the results into a NOSQL work table for use by a batch process that only reads, doesn't update, for the purposes of sending data to a vendor but it later queried and used by the application to flag results or rety errors. That could be interesting. 

Here's how some people are using NoSQL databases for highly scalable applications:

http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html

I always say there are many ways to solve a problem...So I am not suggesting a one size fits all answer in this post. I am simply highlighting some considerations which are missed in an otherwise interesting and useful book like:

Disruptive Possibilities and How Big Data Changes Everything

Picking an optimal solution has many factors and typically a combination of ideas can be more productive than turning your system into a nail....and in the case of financial applications where reconciling at the end of the day is pretty darn important you want to carefully consider the cost of implementing rock solid alternatives and what each technology buys you vs. what will be required to implement something across a large organization with disparate systems feeding and reading financial data. Then again the use of parallel processing of data for end of day applications is pretty intriguing...something to ponder.