AWS Serverless Quirks

At 1way2cloud we focus on delivering software solutions with low operational cost, low maintenance effort, and high scalability, resilience and performance. Our weapon of choice for meeting such requirements is the serverless AWS stack. After delivering a number of solutions based on it, we are convinced that this kind of technology is future-proof and the best choice for the majority of businesses.

A typical solution architecture that we build for our customers breaks down into the following building blocks:

  • Endpoint protection: Amazon CloudFront, AWS WAF and AWS Certificate Manager
  • Authentication: Amazon Cognito
  • Websocket endpoint: AWS AppSync
  • REST API endpoint: Amazon API Gateway
  • Frontend hosting: AWS Amplify
  • Environment security: AWS Security Hub, Amazon GuardDuty and AWS Config
  • Request tracing: AWS X-Ray
  • Backend services: AWS Lambda
  • Event pipes and schedulers: Amazon EventBridge
  • Workflow orchestration: AWS Step Functions
  • Storage: Amazon RDS and Amazon DynamoDB
  • Data analytics: Amazon S3, AWS Glue, Amazon Redshift and Power BI
  • Notifications: Amazon SNS, AWS Chatbot and Slack
  • Logging: Amazon CloudWatch
  • Messaging: Amazon SQS
  • Email sending/receiving: Amazon SES

Depending on the specific use case there might be more or fewer services in the mix, but in general this is our go-to stack for the majority of customers. Building such an architecture is quite straightforward and fast, since we already have the infrastructure prepared as reusable Terraform code blocks. With such IaC blocks it takes us just a few hours to have the whole environment set up, source code repositories defined and automation pipelines configured.

However, while working with these serverless AWS services we encountered some not-so-obvious quirks that made us scratch our heads and forced us to either dig deeper into the documentation, ask for specialist help or search for alternative solutions.

I wanted to share some of these findings so that others don’t need to search for solutions themselves.

Limit on DynamoDB streams consumers

DynamoDB streams are a great way to do CDC (change data capture). We rely heavily on DynamoDB streams to invoke different Lambda functions as soon as something changes in a DynamoDB table (a record is updated, added or deleted). In one such use case we needed four Lambda consumers attached to the same stream.

With this setup we noticed significant throttling: some of the Lambda consumers were taking a long time to get invoked.

After digging through the documentation we found the following limit:

Simultaneous readers of a shard in DynamoDB Streams

For single-Region tables that are not global tables, you can design for up to two processes to read from the same DynamoDB Streams shard at the same time. Exceeding this limit can result in request throttling.

You cannot see how many shards exist in a DynamoDB stream, so it is not possible to plan consumers around shards.

We decided to use a fan-out Lambda as a solution: a single consumer of the DynamoDB stream that asynchronously invokes our four actual consumers, passing the whole event object on to them.
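A minimal sketch of such a fan-out handler, assuming Python with boto3 (the consumer function names are placeholders):

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# Placeholder names of the real downstream consumers.
CONSUMER_FUNCTIONS = [
    "consumer-one",
    "consumer-two",
    "consumer-three",
    "consumer-four",
]

def handler(event, context):
    """Single DynamoDB stream consumer that fans the batch out."""
    payload = json.dumps(event).encode("utf-8")
    for function_name in CONSUMER_FUNCTIONS:
        # InvocationType="Event" invokes asynchronously and returns
        # immediately, so only this one function reads the shard.
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",
            Payload=payload,
        )
```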

This works perfectly well without any throttling.

We raised this issue with AWS and were told that an improvement is coming at the end of this year: the limit will be raised to 10 consumers and there will be visibility into existing DynamoDB shards.

Error handling of DynamoDB streams

How do we handle errors in DynamoDB streams?

DynamoDB streams are a robust system based on internal SQS queues. Each DynamoDB shard has one internal queue assigned to it, and the streams write modified records into those queues.

If a Lambda function is defined as a listener for a DynamoDB stream, that Lambda polls the internal queues for new messages. That means that if the downstream consumer (the Lambda in our case) is unavailable for whatever reason, the internal queue keeps piling up stream-record messages for a duration of 24 hours.

To capture this situation we use Lambda’s IteratorAge metric, which measures the latency between when a record is added to a DynamoDB stream and when the function processes that record. An increase in IteratorAge means that Lambda isn’t keeping up with the records being written to the stream. This metric triggers an alarm and notifies us of such situations.
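A minimal sketch of such an alarm in boto3 (the alarm name, function name, threshold and SNS topic ARN are placeholders to adapt):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="stream-consumer-iterator-age",  # placeholder name
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "stream-consumer"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    # IteratorAge is reported in milliseconds; alert when records sit
    # in the stream for more than a minute before being processed.
    Threshold=60000,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # placeholder ARN
)
```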

Another error-handling situation arises when a Lambda receives one or more (in the case of batch delivery) DynamoDB stream records and fails to process them. Here we can define a DLQ (dead letter queue) for the DynamoDB stream trigger that receives all delivered but unprocessed records. But there is a quirk! The messages in that DLQ are not the actual DynamoDB records, only pointers into the internal SQS queues, and those are valid for only 24 hours. You have to retrieve the records yourself from the internal DynamoDB queues within the first 24 hours, otherwise they are lost.

Here is a Python sketch that extracts the actual DynamoDB stream records based on the pointers in the DLQ (it assumes AWS’s documented on-failure payload format, where the pointers live in a DDBStreamBatchInfo block):
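```python
import json
import boto3

streams = boto3.client("dynamodbstreams")

def extract_records(dlq_message_body: str) -> list:
    """Resolve a DLQ 'pointer' message into the actual stream records."""
    # The failure message carries only the coordinates of the failed batch.
    batch_info = json.loads(dlq_message_body)["DDBStreamBatchInfo"]

    iterator = streams.get_shard_iterator(
        StreamArn=batch_info["streamArn"],
        ShardId=batch_info["shardId"],
        ShardIteratorType="AT_SEQUENCE_NUMBER",
        SequenceNumber=batch_info["startSequenceNumber"],
    )["ShardIterator"]

    records = []
    while iterator:
        response = streams.get_records(ShardIterator=iterator)
        records.extend(response["Records"])
        # Stop once we have read past the end of the failed batch.
        if records and int(records[-1]["dynamodb"]["SequenceNumber"]) >= int(
            batch_info["endSequenceNumber"]
        ):
            break
        iterator = response.get("NextShardIterator")
    return records
```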

Simple! 🙂

Cognito custom domains

Cognito is a publicly accessible authentication service and as such has a public endpoint, which can be customized to use your own domain name, e.g. auth.your_application.com.

However, setting up a custom domain for Cognito is different from setting up custom domains for other publicly accessible endpoints such as API Gateway, AppSync or Amplify. Those services generate CNAME records that you add to your DNS zone, and that is all that is needed for the custom domain to be verified.

But not for Cognito. For Cognito you actually need to have an A record in your account’s Route53 zone.

Usually you would delegate DNS domains from a centralized Network or Shared Services account to your specific workload account. In the workload account you would then just add CNAME records and the custom domain would resolve. But Cognito for some reason expects to see an A record with that custom domain name.

So the trick is to add an A record for the custom domain in your workload account’s Route53 zone, and since an A record requires an IP address, you can simply use any IP address, e.g. 55.44.33.22.

Once Cognito resolves that DNS name and verifies your custom domain, you can simply delete that A record.
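A sketch of the whole workflow in boto3 (the hosted zone ID, user pool ID and certificate ARN are placeholders; note that the ACM certificate for a Cognito custom domain has to live in us-east-1):

```python
import boto3

route53 = boto3.client("route53")
cognito = boto3.client("cognito-idp")

DOMAIN = "auth.your_application.com"
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # placeholder: the workload account's zone

# 1. Create a placeholder A record for the custom domain;
#    the IP address value is irrelevant.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": DOMAIN,
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [{"Value": "55.44.33.22"}],  # any IP works
            },
        }]
    },
)

# 2. Create the Cognito custom domain.
cognito.create_user_pool_domain(
    Domain=DOMAIN,
    UserPoolId="eu-west-1_EXAMPLE",  # placeholder pool id
    CustomDomainConfig={
        "CertificateArn": "arn:aws:acm:us-east-1:123456789012:certificate/EXAMPLE"
    },
)

# 3. Once the domain is verified, delete the A record again
#    (same change batch with "Action": "DELETE").
```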

Weird…

Lambda destinations/DLQs

If something goes wrong with a Lambda, such as retries being exhausted, the maximum event age being exceeded or the underlying Lambda host dying, that failure needs to be handled somehow. Lambda comes with two mechanisms for this: DLQs (dead letter queues) and Destinations. Lambda Destinations is the newer mechanism and is generally preferred by AWS over DLQs. Destinations come in OnSuccess and OnFailure flavors; OnFailure is used to pass failed events on to some other error-handling mechanism (we usually use Chatbot -> Slack).
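Configuring an OnFailure destination is a one-liner, sketched here with boto3 (the function name and topic ARN are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Send events from failed asynchronous invocations to an SNS topic,
# which Chatbot then forwards to Slack.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-function",  # placeholder name
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sns:eu-west-1:123456789012:lambda-failures"
        }
    },
)
```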

However, there is a quirk!

Lambda Destinations and DLQs work only if the Lambda is invoked asynchronously; for synchronous calls, these mechanisms do not work. I assume the logic here is that a synchronous caller expects an immediate response, which in the case of a Lambda failure can be a message containing the error.

But why? Why would the functionality of consumers depend on how they were invoked? If a Lambda fails, it should be able to report that failure to a DLQ or Destination no matter how it was invoked, synchronously or asynchronously.

To make things even murkier, services that integrate with Lambda “call” those Lambdas in unexpected ways. For example, our above-mentioned DynamoDB Streams are considered to invoke Lambda synchronously. I suppose that is because under the hood DynamoDB Streams are not actually calling Lambda; rather, Lambda is polling the internal stream queues for messages. The same holds for SQS, Managed Kafka, Kinesis and the other poll-based sources that invoke Lambda synchronously (https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html).

This breaks a few design patterns and is not practical to use. For example, we have 10+ DynamoDB streams, 10+ SQS queues, 10+ S3 buckets and a few API Gateways, all triggering Lambda functions. Instead of setting up error handling in one place (on each Lambda function), we would have to mix Destinations/DLQs for the services that invoke Lambda asynchronously with error handling on the source producers for the services that invoke it synchronously. We won’t do that; it’s just a mess.

What we do instead is let the Lambdas fail and report ERROR in their CloudWatch log streams. We have subscription filters on those log streams that capture each such error and report it onward (Chatbot -> Slack).
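A sketch of such a subscription filter in boto3 (the log group and destination ARN are placeholders; the destination Lambda also needs a resource-based policy allowing logs.amazonaws.com to invoke it):

```python
import boto3

logs = boto3.client("logs")

# Forward every ERROR log line to a central error-forwarding Lambda.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/my-function",  # placeholder log group
    filterName="forward-errors",
    filterPattern="ERROR",
    destinationArn="arn:aws:lambda:eu-west-1:123456789012:function:error-forwarder",
)
```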

React Amplify library

If you try to open an AppSync subscription through the Amplify library while authenticating with Cognito User Pools, it won’t work.

We have a web application developed in React and use the Amplify Gen2 library for interacting with the AppSync service (https://docs.amplify.aws/react/start/quickstart/). With AppSync we have queries (data reads), mutations (data writes) and subscriptions (websockets) that we call from that React Amplify library. Each call is authenticated with Cognito User Pools.

Queries and mutations work well in this setup, but subscriptions do not.

When we used an API Key as the authentication method instead of Cognito User Pools, all operations worked well. It is worth mentioning that almost all available AWS documentation uses only API Keys in examples and tutorials. We weren’t able to find a single example using Cognito User Pools as the authentication mechanism, even though it is the preferred way of authenticating for production deployments.

We tried alternatives to see what would work. We used plain vanilla JavaScript to call AppSync subscriptions with Cognito User Pool authentication, and it worked well. We also tried the Apollo client library (https://www.apollographql.com/docs/react/) instead of Amplify, and there too everything worked well.

That is when we decided to ditch the Amplify library altogether and use Apollo for all GraphQL calls.

We have also opened an AWS support case documenting this apparent bug in the Amplify library. I hope they are working on resolving it.
