Monitoring AWS DynamoDB performance and latency

Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. Because it is a managed service, we don’t have to worry about tasks such as hardware provisioning, configuration, scaling, replication, or patching. Although AWS does most of the heavy lifting, we still need to be vigilant about how we use DynamoDB and understand how it fits into the larger context of our applications. In this article we’ll cover the common issues that DynamoDB users should look out for and how they impact your application, as well as the key metrics to monitor to get a comprehensive view of the health and performance of DynamoDB.

Common DynamoDB Issues and Troubleshooting

Limited Visibility

As with other AWS services, it can be difficult to see exactly what’s happening inside DynamoDB. The service sits behind many layers of abstraction, which makes troubleshooting difficult.

For example, when requests are throttled even though the table appears to have enough capacity available, there is no easy way to find out why. Nothing is logged to make the cause obvious, so you have to work through the documentation to understand how the requests were handled. You can try querying the APIs that AWS exposes through its SDKs, but this approach takes considerable effort.

Also, when errors occur, DynamoDB does not do a great job of explaining what went wrong. You can see and chart the performance of each query you run, but that’s pretty much the limit when it comes to debugging performance issues; there’s nothing more you can do out of the box.

Permissions

AWS manages permissions to services and data using IAM policies and roles, but DynamoDB has many IAM actions and policy options, and their descriptions can be vague. Getting them right is critical for privacy and security. Overly broad permissions can cause data leaks, while missing permissions can prevent you from accessing data through the console or from an application, depending on the use case. Even when the error message makes it clear that the problem is a permission issue, it may give little indication of which permission is required or missing, which further delays debugging.

Code debugging and exception handling

Debugging any code that works with AWS services or runs on AWS Lambda is always challenging. Because the errors are not very verbose or clear, it’s difficult to understand what the issues are and how to fix them.

The lack of tools to debug AWS services adds to the difficulty of debugging and exception handling. There are some tools on the market to make this easier. For example, Lumigo’s execution tags can help in tagging DB save attempts. You simply tag such DB responses to easily search for them in Lumigo and debug issues whenever necessary.

Potential for high latency and its impact

Slow processing of data

Processing slows down when the data runs to hundreds of gigabytes or terabytes, or when too many transformations or computations run over very large datasets. Unoptimized code and poor system design have the same effect. When data takes a long time to process, latency increases and the entire system slows down.

Timed-out requests

When DynamoDB is not able to respond to a request within the specified time, the request times out, which can in turn fail every part of the system that depends on it. A bad network connection or busy or unresponsive nodes can cause timeouts, so make sure you allow enough time for the request to complete when you create an AWS SDK client in your applications.

Read/Write

DynamoDB read and write capacity units, and how they are configured, determine the performance of our queries. Based on the use cases and the load on a given DynamoDB table, we need to provision enough units for both reads and writes. One read capacity unit covers one strongly consistent read per second of an item up to 4 KB, and one write capacity unit covers one write per second of an item up to 1 KB. Reading or writing a larger item consumes multiple units, so the same operation uses up capacity faster. You should calculate how many units you need when creating tables and configure them accordingly to avoid throttling and high latencies.
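As a rough illustration of that calculation, the following sketch estimates the capacity units for a steady workload. It assumes the documented unit sizes (4 KB per read unit, 1 KB per write unit, and half cost for eventually consistent reads) and ignores transactions and indexes; the function names are made up for this example.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """Estimate the RCUs needed for a steady read rate on items of one size."""
    # Reads are metered in 4 KB blocks, rounded up per item.
    blocks = max(1, math.ceil(item_size_bytes / 4096))
    units = blocks * reads_per_second
    # An eventually consistent read costs half of a strongly consistent one.
    if not strongly_consistent:
        units = math.ceil(units / 2)
    return units


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """Estimate the WCUs needed for a steady write rate on items of one size."""
    # Writes are metered in 1 KB blocks, rounded up per item.
    blocks = max(1, math.ceil(item_size_bytes / 1024))
    return blocks * writes_per_second
```

For example, 100 strongly consistent reads per second of 6 KB items need 2 blocks per read, so about 200 read capacity units, or about 100 if eventually consistent reads are acceptable.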

Throughput

DynamoDB provides two capacity modes: provisioned and on-demand. Provisioned mode lets you set a limit on the number of reads and writes per second. It works best when the application’s traffic is predictable and not expected to have bursts or spikes. If traffic exceeds the configured limit, requests are throttled, and the application can experience high latency and even failed requests.

On-demand mode scales capacity automatically with traffic, so requests are far less likely to be throttled. It is more accommodating of unpredictable load, but because it scales dynamically and bills per request, it can also cause a spike in billing.

Performance metrics to look for with DynamoDB

DynamoDB exposes a few metrics we can use to gauge its performance. Using these, we can decide if the DynamoDB instance needs any tuning or not. 

Throttled Requests

Each resource in DynamoDB (table or index) has a defined throughput limit. Whenever operations exceed this limit, requests are throttled. Read throttling is counted in ReadThrottleEvents, write throttling in WriteThrottleEvents, and throttled API calls in ThrottledRequests. For batch requests, the ThrottledRequests count is incremented only if all the requests in the batch are throttled.
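A common way for client code to cope with throttling is to retry with exponential backoff and jitter. The AWS SDKs already retry throttled requests on their own, so the sketch below only illustrates the idea. `ThrottledError` is a hypothetical stand-in for whatever throttling exception your client raises, and the delay values are arbitrary assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a client."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ThrottledError with full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            # Give up after the last allowed attempt.
            if attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting; in normal use the defaults are fine.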

Latency 

Latency is the time a service takes to respond to a request. DynamoDB promises single-digit millisecond latencies regardless of the size of the data or the scale we’re working at. Even as our application scales, we shouldn’t see a significant change in the performance of queries on DynamoDB, so a sustained change in latency could indicate that something is wrong in the data pipeline. In CloudWatch, the SuccessfulRequestLatency metric reports this value.

Errors 

When a request fails, DynamoDB returns an HTTP error response (all requests are submitted over HTTP) with three components: an HTTP status code, an exception name, and an error message. A 5xx status code indicates a server-side error and should be looked into. A 4xx status code indicates a problem with the request itself, such as invalid parameters or missing permissions; throttling errors are also reported with a 4xx code.
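One way to act on these components is to sort each error into "retry" or "fix the request". The sketch below is a simplified assumption of how an application might do that, not an exhaustive mapping. The exception names in the set are real DynamoDB throttling errors that come back with a 400 status but are safe to retry; the function name and return labels are invented for this example.

```python
# Throttling-type errors: HTTP 400, but retrying later can succeed.
RETRYABLE_CLIENT_ERRORS = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "RequestLimitExceeded",
}


def classify_error(status_code: int, exception_name: str) -> str:
    """Return 'retry', 'fix_request', or 'unknown' for an error response."""
    # 5xx: a server-side fault, usually transient.
    if 500 <= status_code <= 599:
        return "retry"
    # Throttling is reported as a 4xx but is not a bug in the request.
    if exception_name in RETRYABLE_CLIENT_ERRORS:
        return "retry"
    # Any other 4xx means the request itself must change.
    if 400 <= status_code <= 499:
        return "fix_request"
    return "unknown"
```

With this split, a `ValidationException` surfaces immediately as something to fix in code, while a `ProvisionedThroughputExceededException` is treated as a capacity signal.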

Capacity 

DynamoDB reports read and write capacity metrics for tables and global secondary indexes. When consumption exceeds the provisioned read or write capacity of a table or index, further requests are throttled until capacity is available again. It’s important to closely watch consumed read and write capacity units so that requests aren’t throttled and the performance of the system doesn’t degrade.
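To judge how close a table is to its limit, consumed capacity can be compared with provisioned capacity. CloudWatch reports consumed units as a sum over a period, so the sketch below first converts that sum to a per-second average. The function name is an assumption made for this example, and an average over a period can hide short bursts.

```python
def capacity_utilization(consumed_sum: float, period_seconds: int,
                         provisioned_units_per_second: float) -> float:
    """Return average consumed capacity as a fraction of provisioned capacity."""
    if period_seconds <= 0 or provisioned_units_per_second <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    # Convert the period total into an average per-second rate.
    consumed_per_second = consumed_sum / period_seconds
    return consumed_per_second / provisioned_units_per_second
```

For instance, 4,800 consumed read units in a 60-second period on a table provisioned with 100 read units is an average utilization of 0.8, or 80%.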

Monitoring and Troubleshooting DynamoDB

DynamoDB, like other AWS services, is integrated with Amazon CloudWatch, the centralized place for monitoring activity across AWS services. DynamoDB publishes its metrics to CloudWatch, and the alarms and dashboards built on them are configurable and can be tuned.

CloudWatch collects this information and stores it for a defined retention period. The metrics DynamoDB exposes can be viewed in CloudWatch and in other monitoring tools. With these metrics, you can monitor the performance of DynamoDB and catch issues as they happen. The following is a short list of such metrics:

  • ConsumedReadCapacityUnits
  • ConsumedWriteCapacityUnits
  • ReadThrottleEvents
  • WriteThrottleEvents
  • SystemErrors
  • UserErrors
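As a hedged sketch of how one of these metrics might be requested, the helper below builds the parameters for CloudWatch's GetMetricStatistics call. The helper name and the table name `orders` are made up for this example. It assumes the metric is reported per table; some metrics, such as UserErrors, are not broken down by table and would need different dimensions.

```python
from datetime import datetime, timedelta, timezone


def metric_query(table_name, metric_name, minutes=60, period_seconds=60,
                 now=None):
    """Build GetMetricStatistics parameters for a per-table DynamoDB metric."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        # Assumption: the metric is published with a TableName dimension.
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }
```

With boto3 installed and credentials configured, the result could be passed on as `boto3.client("cloudwatch").get_metric_statistics(**metric_query("orders", "ReadThrottleEvents"))`.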

To troubleshoot these issues, developers can use CloudWatch together with Lumigo, which provides one-click distributed tracing to monitor and troubleshoot managed services like DynamoDB.

Optimizing DynamoDB for Performance

Once we understand what’s causing issues with DynamoDB, we can easily tune the configuration to improve performance.

Read/Writes

DynamoDB enforces read and write capacity limits and throttles requests that exceed them during high-traffic situations. These limits can therefore also cause high latency. To fix this, monitor read and write capacity usage and raise the limits where consumption regularly approaches them, so that fewer read and write operations are throttled.

On-Demand vs. Provisioned 

Whether you use on-demand or provisioned capacity for a DynamoDB table can have a huge impact on the performance of applications. If we know what capacity a table needs and are sure that there will be no surprises or unexpected spikes in read or write traffic, we can provision the capacity ourselves to reduce cost.

On the other hand, if there is even a slight chance of an unexpected spike in traffic or load, we should let DynamoDB scale on demand so that read and write requests are served immediately rather than throttled, thereby keeping latency to a minimum.

Autoscaling 

When you enable auto scaling on a DynamoDB table or index, you control how DynamoDB adjusts its provisioned capacity. You can have it scale only for reads, only for writes, or both, within minimum and maximum limits that you set. You can also set a target utilization so that DynamoDB keeps the ratio of consumed to provisioned capacity near this target.
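The relationship between consumed capacity, target utilization, and the scaling limits can be sketched with a small model. This is a simplified assumption of what target tracking aims for, not the actual scaling algorithm, which also involves alarms and cooldown periods; the function name is made up for this example.

```python
import math


def target_capacity(consumed_units_per_second: float,
                    target_utilization_percent: int,
                    min_capacity: int, max_capacity: int) -> int:
    """Capacity at which consumption would sit at the target utilization."""
    if not 0 < target_utilization_percent <= 100:
        raise ValueError("target utilization must be in (0, 100]")
    # Provisioned capacity such that consumed / provisioned equals the target.
    desired = math.ceil(
        consumed_units_per_second * 100 / target_utilization_percent)
    # Scaling never goes below the minimum or above the maximum.
    return max(min_capacity, min(max_capacity, desired))
```

With a 70% target, a table consuming 70 units per second would settle near 100 provisioned units, while a very quiet table would stay at its configured minimum.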

Throughput 

Unlike traditional databases, Amazon DynamoDB scales out instead of scaling up to improve query performance. This means it adds more partitions, and with them more storage and throughput, whenever data grows. DynamoDB distributes the data across these partitions so that throughput is not affected as the table gets larger. You can also specify the level of throughput you need, but be sure to design your applications, and in particular your partition keys, to make full use of DynamoDB’s design.
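One design technique that takes advantage of partitioning is to spread writes for a "hot" key across several partition key values by appending a calculated suffix. The sketch below is a minimal illustration of that idea; the function name, the `#` separator, and the shard count of 10 are assumptions chosen for this example.

```python
import hashlib


def sharded_partition_key(base_key: str, item_id: str,
                          shard_count: int = 10) -> str:
    """Derive a partition key that spreads one logical key over several shards."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Hash the item id so the same item always maps to the same shard.
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_key}#{shard}"
```

The trade-off is on the read side: fetching everything for one logical key means querying every shard suffix and merging the results.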

Summary

DynamoDB is a popular NoSQL database used in many modern applications. Although Amazon already takes care of most of the optimization and performance tuning, you need to monitor it closely to make sure you’re getting the most out of it.
