Tuesday, May 06, 2014

AWS NoSQL Event

Gartner says Amazon has five times the compute power of the next 18 cloud providers combined.

The pace of innovation increases when you increase deployment iterations and reduce risk.

AWS builds custom servers: optimized performance at about 30% lower cost than a private cloud buying servers from vendors.

DynamoDB gives you NoSQL with strongly consistent (as opposed to only eventually consistent) reads.
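
For reference, a minimal sketch of a strongly consistent read using boto3 (the AWS SDK for Python); the Sessions table and SessionId key are hypothetical, not from the talk:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# ConsistentRead=True requests a strongly consistent read instead of the
# default eventually consistent one (it consumes more read capacity).
resp = dynamodb.get_item(
    TableName="Sessions",                 # hypothetical table
    Key={"SessionId": {"S": "abc-123"}},  # hypothetical key
    ConsistentRead=True,
)
item = resp.get("Item")  # None if the key does not exist
```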

Because Amazon has built so many data centers, they have accumulated expertise and keep getting better at it.

Amazon's success is based in a big way on a distributed model of teams who manage their own technology [and interact through services, according to a blog post I read].

Scaling SQL databases is easy: partition the database. The problem is repartitioning the data while taking on new traffic. Initially Amazon avoided this by buying bigger boxes.

Then....

Amazon wrote a paper on Amazon Dynamo, a highly available key-value store.

Distributed hash table.

Trade off consistency for availability.

Allowed scalability while taking live traffic.

Scaling was easier but still required developers to benchmark new boxes, install software, wear pagers, etc.

Was a library, not a service.

Then...

DynamoDB: a service.

- durability and scalability
- scale is handled (specify requests per second; see the sketch after this list)
- easy to use
- low latency
- fault tolerant design (things fail - plan for it)
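
A minimal sketch of what "scale is handled" looks like in practice with boto3: you declare read/write capacity up front and the service handles the partitioning. Table and key names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Sessions",
    AttributeDefinitions=[
        {"AttributeName": "SessionId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "SessionId", "KeyType": "HASH"},
    ],
    ProvisionedThroughput={
        "ReadCapacityUnits": 100,  # ~100 strongly consistent reads/sec
        "WriteCapacityUnits": 50,  # ~50 writes/sec
    },
)
```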

At Amazon, when they talk about durability and scalability, they always design for redundancy across three points of failure.

Quorum in distributed systems:
http://en.m.wikipedia.org/wiki/Quorum_(distributed_computing)
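
The core idea, as a sanity check: with N replicas, writes wait for acknowledgements from W nodes and reads consult R nodes; whenever R + W > N the two sets must overlap, so every read reaches at least one replica holding the latest acknowledged write.

```python
# Dynamo-style quorum settings: any read quorum intersects any write quorum.
N, W, R = 3, 2, 2  # replicas, write quorum, read quorum
assert R + W > N   # overlap guarantees a read sees the latest write
```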

DynamoDB handles different scenarios of replica failures so developers can focus on the application.

SimpleDB has a 10 GB max, and customers have to manage their own permissions.

Design for minimal payload, maximum throughput.

You can run MapReduce jobs against DynamoDB. EMR gives you Hive on top of DynamoDB.

Many AWS videos and re:Invent sessions are on the AWS web site.

HasOffers uses DynamoDB for session tracking and deduplication.

Session tracking is perfect for NoSQL because you look everything up by a single key: the session ID.

Deduplication: event deduplication, i.e., recording each event at most once.
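
A sketch of one way to do event deduplication in DynamoDB with a conditional write (the talk did not show code; table and attribute names are hypothetical). The put succeeds only for the first writer of a given event ID:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def record_event_once(event_id: str) -> bool:
    """Return True only the first time event_id is seen."""
    try:
        dynamodb.put_item(
            TableName="Events",
            Item={"EventId": {"S": event_id}},
            # Write only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(EventId)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate: this event was already recorded
        raise
```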

Fixing DynamoDB problems: double the capacity (maybe twice), fix the problem, then drop the capacity back down.

Being asynchronous and using queues is a nice option.

Relational databases are more flexible for querying. That is something to consider when deciding between an RDBMS and NoSQL.

---
A hash key is a single key. You can also have a composite key.

Hash key = distribution key

Optimal design = large number of unique hash keys + uniform distribution across hash keys.

Important to pick a hash key with high cardinality.

Range key: makes a composite primary key for 1:N relationships. Supports optional range conditions like ==, <, >, >=, <=.

e.g., customer ID is the hash key and photo ID is the range key.
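
A sketch of that 1:N query in boto3, fetching one customer's photos within a range of photo IDs; table, attribute names, and values are illustrative:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hash key pins the customer; the range condition narrows the photo IDs.
resp = dynamodb.query(
    TableName="Photos",
    KeyConditionExpression="CustomerId = :c AND PhotoId BETWEEN :lo AND :hi",
    ExpressionAttributeValues={
        ":c":  {"S": "cust-42"},
        ":lo": {"S": "photo-0100"},
        ":hi": {"S": "photo-0200"},
    },
)
for item in resp["Items"]:
    print(item["PhotoId"]["S"])
```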

Local secondary indexes: e.g., two customers share a key. They require more throughput capacity.

Hash + Range must be unique

Data types supported: string, number, binary, and sets of each of the three.
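
A sketch of an item exercising those types in the wire format boto3 uses: S (string), N (number), B (binary), plus the set variants SS/NS/BS. Names and values are made up:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.put_item(
    TableName="Photos",
    Item={
        "CustomerId": {"S": "cust-42"},           # string
        "PhotoId":    {"S": "photo-0100"},        # string
        "SizeBytes":  {"N": "1048576"},           # numbers travel as strings
        "Thumb":      {"B": b"\x89PNG..."},       # binary
        "Tags":       {"SS": ["beach", "2014"]},  # string set
    },
)
```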

Cannot add or change secondary indexes after the initial creation of the table... support may be coming.

Global secondary indexes are separate tables asynchronously updated on your behalf. GSI lookups are eventually consistent: a write may require one or more index updates before it is visible.

Local secondary index = max 10 GB per hash key. That may be a reason to move to a GSI.

A GSI has its own provisioned reads and writes, whereas LSIs use the table's provisioned reads and writes.
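
A sketch of where that shows up in the API: in CreateTable, each GSI declares its own ProvisionedThroughput, separate from the table's. Names and capacities are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Photos",
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "PhotoId", "AttributeType": "S"},
        {"AttributeName": "UploadDate", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "CustomerId", "KeyType": "HASH"},
        {"AttributeName": "PhotoId", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "ByUploadDate",
        "KeySchema": [
            {"AttributeName": "UploadDate", "KeyType": "HASH"},
            {"AttributeName": "PhotoId", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
        "ProvisionedThroughput": {       # the GSI's own capacity
            "ReadCapacityUnits": 10,
            "WriteCapacityUnits": 10,
        },
    }],
    ProvisionedThroughput={              # the base table's capacity
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": 50,
    },
)
```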

1:1 relationship: hash key and secondary index

1:N relationship: hash key and range key

NoSQL - no transaction support in DynamoDB

Throughput can only be doubled per change. Amazon is looking at changing this.
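
For reference, throughput changes go through UpdateTable; per the note above, each call could raise capacity to at most double the previous value, so a bigger jump (say 100 to 400 reads/sec) took multiple calls. A sketch with illustrative numbers:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="Sessions",
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,   # at most 2x the previous 100
        "WriteCapacityUnits": 100,
    },
)
```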

Choosing the right data store:

SQL: structured data, complex queries, transactions.

NoSQL: unstructured data, easier scaling

AWS Data Pipeline automates moving data between data stores.

A client-only app (DynamoDB Local) is available that emulates DynamoDB, so you can develop without paying AWS fees.