Saturday, August 24, 2013

AWS Cloud Error Logging: EC2 instances, S3, DynamoDB, Alerts

In the past I have written generic error handlers that log errors by writing out the stack trace and messages in a format that is easier to review later than a typical log file, and that attach application-specific data. The problem I have with standard log files is the time it takes to parse through them to find a particular error, and the lack of application-specific data that helps quickly replay and troubleshoot the error. There may be additional overhead in more specific logging, but if it helps me eliminate errors more quickly, then the number of errors remains small.
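
A minimal sketch of what I mean by a generic handler; the class name, field names, and context map are just placeholders for whatever application-specific data is worth attaching:

import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Date;
import java.util.Map;

// Sketch of a generic error handler: one reviewable block of text per error,
// with application-specific context attached. Names are placeholders.
public class ErrorReport {

    public static String format(Throwable t, Map<String, String> appContext) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);

        pw.println("timestamp: " + new Date());
        pw.println("message:   " + t.getMessage());
        // Application-specific data (user id, request parameters, etc.) that helps replay the error.
        for (Map.Entry<String, String> entry : appContext.entrySet()) {
            pw.println(entry.getKey() + ": " + entry.getValue());
        }
        pw.println("stack trace:");
        t.printStackTrace(pw);
        return sw.toString();
    }
}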

Now I'm attempting to pass these errors asynchronously through an AWS SQS queue over to S3 and/or DynamoDB so the error logging doesn't tie up the system, the logs are separated from my EC2 instances in case one dies, the logs are consolidated, and I can find ways to optimize costs by reviewing usage stats and quickly fixing errors that might consume extraneous resources. It is definitely scalable, and the queue will store the messages for up to 4 days. I am not in a hurry to get the errors to the log store, so they can sit in the queue for a while. I can also hook up alerts using the SQS, SES or SNS services from AWS, depending on which is most appropriate, for the case where my error logging goes berserk due to some system problem, so I can catch it right away and minimize the impact.
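
The producer side is roughly this with the AWS SDK for Java; the class name, queue name, and credential handling are my own placeholders, and there is no retry logic:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.SendMessageRequest;

// Sketch: hand the formatted error report off to an SQS queue so the application
// thread isn't blocked writing to S3 or DynamoDB directly.
public class ErrorQueuePublisher {

    private final AmazonSQS sqs;
    private final String queueUrl;

    public ErrorQueuePublisher(String accessKey, String secretKey, String queueName) {
        this.sqs = new AmazonSQSClient(new BasicAWSCredentials(accessKey, secretKey));
        // createQueue returns the URL of the existing queue if it was already created
        // with the same attributes, so this doubles as a lookup.
        this.queueUrl = sqs.createQueue(new CreateQueueRequest(queueName)).getQueueUrl();
    }

    public void publish(String errorReport) {
        sqs.sendMessage(new SendMessageRequest(queueUrl, errorReport));
    }
}

The consumer side would be a small worker, possibly on a separate instance, that calls receiveMessage, writes the report out to S3 or DynamoDB, and then deletes the message from the queue.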

A Java program will throw different types of errors as explained in the article below, plus every other standard and custom exception a programmer might throw, so the code will need to deal with each of those types appropriately. Basically I treat them all as Throwables for my purposes, get the error message, the stack trace, the root cause message and its stack trace, and store it all in a file along with the information around that exception that helps me troubleshoot.

http://www.javaworld.com/jw-07-1998/jw-07-exceptions.html
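
The catch-everything part boils down to something like this sketch; the class and method names are just placeholders:

// Sketch of the catch-everything approach: whatever gets thrown, handle it as a
// Throwable and walk the cause chain to find the root cause.
public final class CatchAll {

    // Follow getCause() down to the original exception, guarding against
    // a Throwable that is its own cause.
    public static Throwable rootCause(Throwable t) {
        Throwable cause = t;
        while (cause.getCause() != null && cause.getCause() != cause) {
            cause = cause.getCause();
        }
        return cause;
    }

    public static void runAndReport(Runnable work) {
        try {
            work.run();
        } catch (Throwable t) {
            Throwable root = rootCause(t);
            // In the real handler the messages and both stack traces go into the
            // formatted report that gets queued; printing here keeps the sketch small.
            System.err.println("error: " + t.getMessage() + ", root cause: " + root);
        }
    }
}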

I'm wondering about the implications of storing individual errors vs. dumping all the errors into one file, and of using DynamoDB vs. S3. In the case of errors you may want to further extend your basic error with additional information depending on what type of error it is. That means it doesn't fit neatly into a single type of record, because different parts of the system may log different information with each error. There are many alternatives, but for my first cut I'm going to log the unstructured error information to a file, and then later explore logging a record in DynamoDB with the error message, file name, timestamp, and ID for use with a user interface that can pull up the file from S3 if needed. At this time I can probably live without the UI.
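
Something like this is where that could end up once the DynamoDB piece is added; the bucket name, table name, and attribute names are placeholders, and I'm skipping error handling around the AWS calls themselves:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Sketch: unstructured report to S3, small structured index record to DynamoDB.
public class ErrorStore {

    private final AmazonS3Client s3;
    private final AmazonDynamoDBClient dynamo;
    private final String bucket;
    private final String table;

    public ErrorStore(BasicAWSCredentials credentials, String bucket, String table) {
        this.s3 = new AmazonS3Client(credentials);
        this.dynamo = new AmazonDynamoDBClient(credentials);
        this.bucket = bucket;
        this.table = table;
    }

    public void store(String errorMessage, String errorReport) {
        String id = UUID.randomUUID().toString();
        String key = "errors/" + id + ".txt";

        // Full unstructured error text goes to S3 as a plain text object.
        byte[] body = errorReport.getBytes(StandardCharsets.UTF_8);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(body.length);
        s3.putObject(bucket, key, new ByteArrayInputStream(body), metadata);

        // Index record goes to DynamoDB so a UI could find the S3 file later.
        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("id", new AttributeValue(id));
        item.put("message", new AttributeValue(errorMessage));
        item.put("s3Key", new AttributeValue(key));
        item.put("timestamp", new AttributeValue(String.valueOf(System.currentTimeMillis())));
        dynamo.putItem(new PutItemRequest(table, item));
    }
}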