Modern applications are built from broad sets of microservices, with containers and serverless becoming the architectures of choice for many cloud applications. Both facilitate highly scalable systems, and while the merits of each approach are routinely debated, the two technologies are increasingly used in tandem. However, despite the benefits microservices bring to application development, they add new complexities that make gaining visibility and control increasingly difficult.
This article focuses on understanding the observability issues in serverless and containers and how to address them in practice.
The serverless movement is still gaining popularity, with 4 million developers and counting using cloud functions through services like AWS Lambda and Google Cloud Run, while containers continue to be rapidly adopted by companies looking to modernize their applications.
In the early days of serverless, AWS made the model mainstream by introducing AWS Lambda for running code. With the growing demand for serverless, AWS has pushed its boundaries and expanded its serverless offerings into many other areas, such as Database as a Service (DBaaS), Storage as a Service (STaaS), and Containers as a Service (CaaS).
Serverless transactions and containerized applications are often a combination of owned code (in Lambda or ECS, for example), cloud-managed services like DynamoDB, SNS, SQS, and Kinesis, and third-party APIs like Twilio and Stripe. Offloading core components to these outside services makes applications easier and faster to build, but the resulting highly distributed environments are increasingly opaque and harder to control. Poor visibility prolongs any troubleshooting that has to take place when issues arise, so gaining better insight into these services facilitates fast recovery and minimal end-user impact.
Monitoring is one aspect of application management that allows administrators and developers to track an application's health and performance and helps to identify errors and other issues. Application monitoring comes down to collecting metrics and logs from your applications. Metrics can help users recognize patterns and trends, such as CPU usage spikes or a Lambda crash, but they don't explain why a particular component isn't working, how to fix it, or even who should fix it.
Logs help to provide the much-needed contextual information regarding an event, such as why CPU usage spikes or why a Lambda crashed. This additional information allows administrators and developers to effectively narrow down to the root cause of the issues.
It is important to note that developers must add logging to the application code for these logs to be available later. The developer must identify, during the design and development stages, which components need to generate logs.
Since logs are textual, humans can easily read and understand them. At scale, however, logs overwhelm users and eventually lose their value.
Follow a standard logging format. Ensure that the applications follow structured logging such as JSON and always include additional metadata that’ll help identify the context of the log, such as the environment, user ID, and other fields. The easiest way of doing this is to use an automated logger or a logging library across all teams and the applications they build, thus ensuring uniformity amongst the logs.
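As a minimal sketch of the structured-logging practice above, the snippet below uses Python's standard `logging` module with a custom JSON formatter that stamps every log line with shared metadata (the field names, such as `environment` and `user_id`, are illustrative; a shared logging library would define your organization's actual schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with shared metadata."""

    def __init__(self, environment):
        super().__init__()
        self.environment = environment  # added to every log line for context

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "environment": self.environment,
            # context passed via `extra=` appears as attributes on the record
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)

def build_logger(name, environment="production"):
    """Hypothetical shared factory so every team's logs stay uniform."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(environment))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

logger = build_logger("checkout")
logger.info("payment accepted", extra={"user_id": "u-123"})
```

Because every line is machine-parseable JSON with consistent fields, downstream tooling can filter and aggregate logs without fragile text parsing.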
Always try to minimize the number of logs when the application runs normally. However, make sure that there is an option to turn on additional logging when the application is malfunctioning.
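One simple way to make extra logging switchable without redeploying, sketched below, is to read the level from configuration at startup; `LOG_LEVEL` is a hypothetical environment variable, and any config source would work:

```python
import logging
import os

def configure_verbosity(logger, default="WARNING"):
    """Stay quiet during normal operation, but allow DEBUG logging to be
    switched on (e.g. while troubleshooting) via the LOG_LEVEL variable."""
    level_name = os.environ.get("LOG_LEVEL", default).upper()
    # fall back to the default if an unknown level name is supplied
    logger.setLevel(getattr(logging, level_name, logging.WARNING))
    return logger

logger = logging.getLogger("orders")
configure_verbosity(logger)          # normal operation: WARNING and above only
# Setting LOG_LEVEL=DEBUG in the service's environment turns on verbose logs.
```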
End-to-end observability and easy, effective debugging are critical for fast development and high-performing applications, but they are next to impossible for many companies to achieve with monitoring and logging alone. These traditional methods, which only give visibility into the working state of specific components, are no longer enough to serve the complexities of cloud-native architectures. But where they fail, distributed tracing can help.
With distributed tracing, we observe a request or transaction as it moves through its workflow via a unique identifier to give us a better understanding of how services and resources are interacting in the larger context of the application.
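The core mechanism can be sketched in a few lines: a unique identifier is created at the edge (or reused if one arrives with the request) and forwarded with every downstream call, so all work done on behalf of one transaction shares the same ID. The `X-Trace-Id` header name and function names here are illustrative; real systems typically use a standard such as the W3C `traceparent` header:

```python
import uuid

def handle_edge_request(headers):
    """Entry point: reuse an incoming trace ID or start a new trace."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    outgoing = {**headers, "X-Trace-Id": trace_id}
    return call_downstream_service(outgoing)

def call_downstream_service(headers):
    """Every hop forwards the same ID, so all spans correlate to one request."""
    return {"status": "ok", "trace_id": headers["X-Trace-Id"]}
```

Because the identifier survives every hop, a tracing backend can stitch the individual spans back into a single end-to-end view of the transaction.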
Traditional monitoring relies heavily on agents to extract telemetry data, and this method worked well when developers coded and controlled applications in their entirety. In today's cloud-native environments, however, many core components are offloaded to managed and third-party services where agents can't be installed. Sticking to an agent-based approach creates gaps in the data collection process and incomplete visibility.
Distributed tracing removes the need for agents and correlates all the metrics and logs from managed services into a structured format, allowing developers to quickly drill down and identify the root cause of an issue.
To identify the severity of the failing element, users must consider its dependencies and the externally exposed services this component provides. For example, a single Lambda function may serve two different services.
There are three main components of implementing distributed tracing:
Three main types of solutions are available in the market that enable the implementation of distributed tracing within an application:
The basics of setting up a DIY distributed tracing solution boil down to the following components:
The first steps to setting up a tracer are to:
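Since the concrete steps vary by stack, the following is only an illustrative sketch of what a homegrown tracer typically does: generate a trace ID, time individual spans of work, and collect finished spans for export. All class and method names here are hypothetical:

```python
import time
import uuid

class Span:
    """One timed unit of work, linked to a trace by a shared trace_id."""

    def __init__(self, name, trace_id):
        self.name = name
        self.trace_id = trace_id
        self.start = time.time()
        self.end = None

    def finish(self):
        self.end = time.time()

class Tracer:
    """Illustrative DIY tracer: creates spans and collects them for export."""

    def __init__(self):
        self.finished_spans = []

    def start_span(self, name, trace_id=None):
        # reuse a caller-supplied trace ID, or start a new trace
        return Span(name, trace_id or uuid.uuid4().hex)

    def finish_span(self, span):
        span.finish()
        # a real tracer would ship these to an ingestion backend
        self.finished_spans.append(span)

tracer = Tracer()
span = tracer.start_span("db-query")
# ... do the traced work ...
tracer.finish_span(span)
```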
The ingestion mechanism requires a highly scalable and redundant architecture to keep up with the inherent scalability of a microservice application. As an application scales, so does the number of traces, and with it the need to quickly ingest, correlate, and visualize observability data in real time.
Many open-source solutions in the market provide comprehensive traceability. One well-known solution is OpenTelemetry, which contains tools and libraries that provide extensive trace collection capabilities. Another solution is Jaeger, which offers centralized visualization capabilities for distributed tracing.
Open-source tools provide a way to build your own traceability solution; however, they lack some of the more advanced features that third-party observability platforms offer out of the box.
These platforms simplify visualization by providing intuitive system maps that show the relationships between the different components and services within an application, rather than just providing timelines of transaction executions. This enables developers to make changes to the architecture without worrying about unknown or unexpected dependencies.
These advanced platforms outperform their open source counterparts by enabling automated distributed tracing of each component, and some can even trace the various requests and responses made to managed services such as DynamoDB.
This level of tracing enables a user to minimize logging requirements, as traces contain most of the information that typically comes from logs. Even if the configured logs do not show the required information, traces can fill in the gaps to provide context for each transaction.
In addition to these advanced features, these platforms perform core functionalities such as visualizing timelines and collecting logs and metrics. The added advantage of using tracing over traditional log collection is that the users don’t need to look for specific logs related to a transaction since the unique identifiers help narrow down the logs for each particular transaction.
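To illustrate why the unique identifier narrows things down, the snippet below filters a set of structured log lines down to one transaction's entries; the field names and sample data are hypothetical:

```python
import json

# Hypothetical structured log lines, each carrying its transaction's trace ID.
log_lines = [
    '{"trace_id": "abc123", "message": "request received"}',
    '{"trace_id": "def456", "message": "cache miss"}',
    '{"trace_id": "abc123", "message": "payment accepted"}',
]

def logs_for_transaction(lines, trace_id):
    """Narrow thousands of log lines down to one transaction by its identifier."""
    entries = (json.loads(line) for line in lines)
    return [entry for entry in entries if entry["trace_id"] == trace_id]

transaction_logs = logs_for_transaction(log_lines, "abc123")
# every returned entry belongs to the same end-to-end request
```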
It is practically impossible to conduct tracing manually for applications handling hundreds or thousands of transactions each minute. Therefore, the only way to make tracing effective and efficient is to automate these processes from day one.
While automating these functions, it is crucial to collect traces from all the components within the application since missing any components in the transaction flow will hinder the view of each transaction.
Finally, correlating logs and metrics will provide more context and information about each transaction, which is key to having a complete tracing solution.
It is essential to understand that there is no hard rule dictating the choice of containers over serverless. However, designing the architecture around the desired application functionality will prevent most of the issues within cloud-native applications.
When finalizing the architecture for a cloud-native application with distributed environments, make sure it appropriately incorporates distributed tracing. While this is achievable in different ways, choosing a method suited to the application can help increase efficiency and reduce the time taken to troubleshoot.