Tuesday, July 11, 2017

Timeout Connecting to S3 Endpoint From Lambda

In my last post I explained how to turn on detailed Boto logging to get more information about an AWS error. The specific problem I was having was sporadic timeouts connecting to an S3 Endpoint from a Lambda function.

Update: 7/12/2017 - For the past three nights (since I started tracking more closely) this happens at 8:30 p.m. PST. I post the errors on Twitter when they happen, so follow me (@teriradichel) if you want to compare notes.

Update 7/14/2017 - It appears that running the Lambda function one time generates multiple CloudWatch entries. Trying to determine if the logs are just duplicated or if the function is actually running twice from one call.

Initially the only information I could see was a generic timeout message when connecting to the S3 bucket.

After turning on the detailed Boto logging I got an error which included this message:

ConnectionError: ('Connection aborted.', gaierror(-3, 'Temporary failure in name resolution'))

Name resolution sounds like DNS name resolution, meaning something is having a hard time translating the S3 bucket URL into an IP address to connect to in order to retrieve or upload files.

In this case it is unlikely that there would be S3 logs since the traffic wouldn't be able to make its way to the S3 bucket.

After getting this error on and off for days and sending the detailed logs to AWS Support, it looks like they may have uncovered an issue. I'm still waiting for the final answer, but it seems like a resolution is forthcoming. I am also trying to confirm this has nothing to do with traffic traversing one of my EC2 hosts, but I don't think that is the case.

Update: 7/12/2017 - AWS Support closed this issue and said they are working on it. A suggestion was made to architect for timeouts; however, the Boto library already times out after a number of retries in about two minutes. If you ran the function twice it would simply fail twice, as I have been running this repeatedly. That would cost double the money and not fix the problem. The other option is to change the Boto library timeout, but the maximum it could be set to is five minutes, which is the maximum time allowed for a Lambda function.

It looks like the Boto library timeout is configurable and can be set to the maximum time a Lambda function can run: http_socket_timeout = 5
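If you'd rather set timeouts in code, botocore's Config class takes connect and read timeouts plus a retry cap. Here's a minimal sketch; the bucket name, paths, and the specific values are placeholders, not anything AWS recommended:

import boto3
from botocore.config import Config

# Fail fast instead of letting retries consume the whole Lambda execution time.
# connect_timeout and read_timeout are in seconds; max_attempts caps botocore's retries.
config = Config(
    connect_timeout=5,
    read_timeout=10,
    retries={'max_attempts': 2}
)

s3 = boto3.client('s3', config=config)
s3.upload_file('/tmp/results.txt', 'my-example-bucket', 'results.txt')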

If you experience this issue please report it to AWS Support.

It's always a good idea to gather as many logs as possible and to submit questions to AWS Support when you need help with errors.

Also, it is a very good idea to always use S3 endpoints for data that doesn't need to be publicly accessible, in light of all the recent S3 bucket data breaches. There was another one today... I explain S3 endpoints in my previous post from May 16 (at which point this error was not occurring, by the way, so I know this solution can work!)

http://websitenotebook.blogspot.com/2017/05/accessing-files-in-s3-via-lambda.html

Sunday, July 09, 2017

Detailed AWS Boto Library Logging

In my last post I explained how to turn on AWS X-Ray to log details about Lambda functions. That feature is probably most useful when you have a number of Lambda functions chained together. It doesn't change your logs; it simply helps you trace which function called another so you can drill down to the original error (I think; still testing).

For detailed logging you still need to manage this within your code. For the AWS Boto library you can add this line to your code (be careful where you send these logs, because it can output a lot of details, some of which you may want to remain private):

boto3.set_stream_logger(name='botocore')
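
In a Lambda function that might look roughly like this (the bucket and key names are just placeholders); the DEBUG output from botocore ends up in the function's CloudWatch log stream:

import boto3

# Send botocore's DEBUG output to the log stream (CloudWatch Logs when running in Lambda).
boto3.set_stream_logger(name='botocore')

s3 = boto3.client('s3')

def handler(event, context):
    # Placeholder bucket and key - the detailed logging applies to any boto3 call.
    s3.download_file('my-example-bucket', 'data/input.txt', '/tmp/input.txt')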

In my last blog post the error message simply stated that the Lambda function could not reach the S3 endpoint. This seems to happen randomly. With the detailed Boto library logging enabled, the error tells us there is a DNS issue reaching the endpoint.

Starting new HTTPS connection (5): firebox-private-cli-bucket-876833387914-us-west-2.s3.amazonaws.com
2017-07-10 03:30:55,224 botocore.endpoint [DEBUG] ConnectionError received when sending HTTP request.
...
ConnectionError: ('Connection aborted.', gaierror(-3, 'Temporary failure in name resolution'))

Since the error occurs in an AWS Lambda function using AWS DNS servers I submitted a support case to get some help with this issue. Having the extra logs and knowing more precisely what failed should help get the problem resolved more quickly.

Enable AWS X-Ray for Lambda Function using CloudFormation

I just realized there's a check box under the Lambda Configuration tab to enable X-Ray. I was trying to enable it in other, more complicated ways.


If you get this error when trying to check the X-Ray box on the Configuration tab of a Lambda function:

  • The Configuration tab failed to save. Reason: The provided execution role does not have permissions to call PutTraceSegments on XRAY

This means some permissions have to be added to the Lambda role, as the error message states. The message above says the permissions will be added automatically, but it seems they are not.

An IAM role can perform the following X-Ray actions, if allowed:


Out of the full list, the role only requires the two "Put" actions, so, following best practices, I only added those two actions to my Lambda function role.

          - 
            Effect: "Allow"
            Action: 
              - "xray:PutTelemetryRecords"
            Resource: "*"
          - 
            Effect: "Allow"
            Action: 
              - "xray:PutTraceSegments"
            Resource: "*" 

Here's the lambda role I'm using with the actions added:


While we're at it, let's automatically add tracing to the Lambda in CloudFormation. The Lambda function CloudFormation properties include something called TracingConfig:


Clicking on TracingConfig shows two modes, Active and PassThrough, with an explanation. We'll go with the default (PassThrough):


Although this is configurable, it doesn't appear to actually turn tracing on. Hopefully that's coming soon. For now, log into the console, click on the Lambda function Configuration tab, then Advanced settings, and check the box to enable X-Ray tracing. Save the function and run it to see X-Ray in action.
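
If you don't want to click through the console every time, the tracing mode can also be flipped with the Lambda API after the stack is deployed. A rough sketch using boto3's update_function_configuration, which accepts a TracingConfig (the function name below is a placeholder):

import boto3

awslambda = boto3.client('lambda')

# Turn on active tracing for an existing function after the stack is deployed.
awslambda.update_function_configuration(
    FunctionName='my-function',
    TracingConfig={'Mode': 'Active'}
)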

So bummer, the title is a bit misleading but at least we were able to set up our IAM role with the necessary X-Ray permissions.

Once X-Ray is turned on, click on the Monitoring tab of the Lambda function and you'll see some additional information. The first display has the list of executions:


Click on one of the executions to see more details about the trace:


Finally, click on one of the lines in the trace to get even more details, such as the code stack trace that caused an error - in this case, the connection to an S3 endpoint is failing (and I have a support ticket out to AWS about this because it is a random occurrence in my case):


Note that this does not necessarily increase your logging, but it will provide some additional information, and what I hope is that it will track the error to the source in a micro-services environment with APIs calling other APIs. When you need more details in the logs, turn up the logging as explained in this example for the AWS Boto Python library: http://websitenotebook.blogspot.com/2017/07/detailed-aws-boto-library-logging.html


I am not sure what visibility Amazon has into these logs, but as always, be careful about what you log where and who has access to see it, especially since I see no way to delete these traces... still looking. The data is auto-deleted after 30 days, and you may be able to delete it using the API. http://docs.aws.amazon.com/xray/latest/devguide/xray-guide.pdf
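
If you want to see what X-Ray is holding onto, the trace data can at least be read back through the API. A rough boto3 sketch (the one-hour window is arbitrary, and note this only retrieves traces; I still don't see a delete call):

from datetime import datetime, timedelta

import boto3

xray = boto3.client('xray')

# List the traces recorded in the last hour to see what data is being kept.
summaries = xray.get_trace_summaries(
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow()
)

for summary in summaries['TraceSummaries']:
    print(summary['Id'], summary.get('ResponseTime'))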

In the case that you want Amazon to see the logs to help troubleshoot, X-Ray may help. I'll let you know!

Wednesday, July 05, 2017

Waiting For an EC2 Instance To Initialize Before Completing a Script

Sometimes when running a script to create AWS Resources, an EC2 instance needs to be created and up and running before the script can continue.

In my case I'm instantiating a WatchGuard Firebox Cloud. Then I need to wait until it is ready before I can instantiate a Lambda function to configure it.

I have a script that gets a value (called get_value.sh), as described in my last post on deleting a Lambda function from CloudFormation, which is used below. I query for an instance that has a particular tag and whose state is either pending or running. Then I use the instance-status-ok command to wait until the instance is ready before I run my Lambda function.

#this code assumes one firebox in account
#would need to get fancier to handle multiple instances with same tag

aws ec2 describe-instances --filters Name=tag-value,Values=firebox-network-firebox Name=instance-state-name,Values=pending,running > firebox.txt  2>&1

fireboxinstanceid=$(./execute/get_value.sh firebox.txt "InstanceId")

echo "* waiting for firebox instance...see status check column in EC2 console"

aws ec2 wait instance-status-ok --instance-ids $fireboxinstanceid

echo "* firebox instance running"

Once the instance is running, the Lambda function that connects to it can execute. Note that just checking the instance is running with the instance-running command will cause problems, because at that point the instance isn't actually ready to receive connections.

The above code lives in the following file:

https://github.com/tradichel/PacketCaptureAWS/blob/master/code/execute/action.sh
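
For what it's worth, the same wait can also be done from Python with boto3's built-in waiters. A sketch assuming the same tag value as the script above and a single matching instance:

import boto3

ec2 = boto3.client('ec2')

# Find the (single) Firebox instance by tag, in pending or running state.
reservations = ec2.describe_instances(
    Filters=[
        {'Name': 'tag-value', 'Values': ['firebox-network-firebox']},
        {'Name': 'instance-state-name', 'Values': ['pending', 'running']}
    ]
)['Reservations']

instance_id = reservations[0]['Instances'][0]['InstanceId']

# Block until both status checks pass, not just until the instance is "running".
ec2.get_waiter('instance_status_ok').wait(InstanceIds=[instance_id])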

CloudFormation Won't Delete Lambda

I am noticing that AWS CloudFormation has difficulties deleting my particular Lambda function.

If you are having this problem, the trick is to use the CLI to first forcibly detach the Lambda function's ENI and then delete it.

In my case I created a generic get_value.sh script that parses a file for a value from a query using the AWS CLI. It uses some funky bash, so if you wanted you could rewrite it in your favorite, more readable language.

Also note I am still *testing* this code. Make sure it does not delete the wrong ENIs. Unfortunately, there is no good way to get a handle on the ENI specific to a Lambda function, either.


# usage: ./get_value.sh <file> <key> - prints the first value for <key> found in the file
filename=$1;key=$2; var=""
# take everything after the first colon on the matching line, then strip whitespace, quotes, and the trailing comma
var="$(cat $filename | grep "\"$key\""  | cut -d ':' -f 2- | sed -e 's/^[ \t]*//' -e 's/"//' -e 's/"//' -e 's/,//')"
var="$(echo "${var}" | tr -d '[:space:]')"
echo "$var"

https://github.com/tradichel/FireboxCloudAutomation/blob/master/code/execute/get_value.sh

A series of commands use the above function to query for the attachment ID and network interface ID. These two values are used to force the detachment and delete the ENI. Use the name of the Lambda function to find the correct ENI to delete; the function name is at the end of the Requester ID field.

#!/bin/sh
#get our lambda ENI as we need to force a detachment
aws ec2 describe-network-interfaces --filter Name="requester-id",Values="*ConfigureFirebox" > lambda-eni.txt  2>&1
attachmentid=$(./execute/get_value.sh lambda-eni.txt "AttachmentId")
if [ "$attachmentid" != "" ]; then
    echo "aws ec2 detach-network-interface --attachment-id $attachmentid --force"
    aws ec2 detach-network-interface --attachment-id $attachmentid --force

    #I don't see a good way to wait for the network interface to detach.
    #Pausing a few seconds here and hoping that works.
    sleep 5

    networkinterfaceid=$(./execute/get_value.sh lambda-eni.txt "NetworkInterfaceId")
    echo "aws ec2 delete-network-interface --network-interface-id $networkinterfaceid"
    output=$(aws ec2 delete-network-interface --network-interface-id $networkinterfaceid)
    if [ "$output" != "" ]; then
        echo "If an error occurs deleting the network interface run the script again. Contacting AWS for a better solution..."
    fi
fi

This code lives in the following file:

https://github.com/tradichel/FireboxCloudAutomation/blob/master/code/execute/delete_lambda_eni.sh

Once the above code completes, the Lambda function can delete successfully.
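
The same steps can also be done in Python instead of shelling out to the CLI. A sketch making the same assumptions as the script above (ENIs whose requester ID ends in ConfigureFirebox, and a fixed pause for the detachment):

import time

import boto3

ec2 = boto3.client('ec2')

# Find the ENI(s) the Lambda function created; the requester ID ends with the function name.
enis = ec2.describe_network_interfaces(
    Filters=[{'Name': 'requester-id', 'Values': ['*ConfigureFirebox']}]
)['NetworkInterfaces']

for eni in enis:
    attachment = eni.get('Attachment')
    if attachment:
        # Force the detachment, then give it a few seconds to complete.
        ec2.detach_network_interface(AttachmentId=attachment['AttachmentId'], Force=True)
        time.sleep(5)
    ec2.delete_network_interface(NetworkInterfaceId=eni['NetworkInterfaceId'])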

The full code I am testing is found here:

https://github.com/tradichel/FireboxCloudAutomation