Microservices Monitoring: Challenges, Metrics, and Tips for Success

What Is Microservices Monitoring?

Microservices is an application development approach in which large-scale applications are built as modular components, known as services. Each service supports a specific task or business goal and communicates with other services using simple, well-defined interfaces, typically application programming interfaces (APIs).

Monitoring and managing microservices can be particularly challenging, because there can be hundreds of services and thousands of service instances in a microservices application. Monitoring all service components and their interactions can be complex, and requires building observability into individual services, enabling failure detection, centralized log collection, and the ability to collect metrics from logs to identify a range of production issues.

This is part of an extensive series of guides about microservices.

Monolith vs. Microservice Architecture Monitoring: What Is the Difference?

Software architecture has changed over the past two decades from traditional, tightly coupled monolithic applications to loosely coupled microservices. Monitoring systems are undergoing changes no less dramatic. Monitoring the health and performance of microservices is not as easy as tracking CPU and memory usage on a single, known server, which creates new challenges for development teams.

The flexibility of microservices environments comes at the price of higher operational complexity. In a traditional monolithic application it was possible to simply open a remote shell to a server to diagnose a problem, but this approach does not scale when an application has tens or hundreds of unique microservices.

There are two more key differences between monolith and microservices monitoring:

  • Microservices typically run in containers (such as Docker) that can scale across multiple hosting environments and are commonly managed by container orchestrators (like Kubernetes). Monoliths typically run on bare metal machines or virtual machines (VMs), which have very different operating characteristics.
  • Container runtimes generate their own metrics and logs. Containers have a shorter life cycle than virtual machines, making it difficult to access logs and metrics while debugging. This makes it critical to implement a centralized logging system.
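
As a rough illustration of the centralized logging point above, the following sketch has a service write one JSON object per log line to stdout, where a log shipper could pick it up even after the container is gone. The service name `checkout`, the instance ID `checkout-7d9f`, and the field names are assumptions made for this example, not part of any particular logging product.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, instance: str) -> None:
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,    # hypothetical service name
            "instance": self.instance,  # hypothetical container or pod ID
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str, instance: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout for a log shipper to collect."""
    logger = logging.getLogger(f"{service}.{instance}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter(service, instance))
        logger.addHandler(handler)
    return logger


log = build_logger("checkout", "checkout-7d9f")
log.info("order %s accepted", "A-1001")
```

Tagging every line with the service and instance is what lets a central store separate the output of many short-lived containers.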

Related content: Read our guide to cloud native monitoring (coming soon).

Why Is Monitoring Microservices Health Important?

In a microservices architecture, every service is an independent unit, which interacts with external users as well as other microservices. These services are combined to provide end users with the capabilities they need.

If there is a problem with one of the services, teams need to know as soon as possible to take action. They’ll need to know what happened, why, when, and under what conditions. This information makes it possible to identify the root cause and recover the malfunctioning service quickly.

While it is more complex to monitor health in a microservices environment, given the right data, it is also easier to resolve issues. Each microservice is small and self-contained, making it easier to identify problems and resolve them (for example, by releasing a patch or a new version of the individual microservice). Problems can be addressed quickly, without impacting other microservices or the application as a whole.

Learn more in our detailed guide to microservices health checks (coming soon).

Microservices Monitoring vs. Observability

Observability for microservices focuses on giving development teams access to the data they need to identify problems and efficiently resolve them. By contrast, monitoring is the process of tracking performance and identifying problems and anomalies.

Observability is a characteristic of a microservices system, which means components of the system provide information necessary to discover issues in microservices. Monitoring builds on observability, collecting available metrics and using them to understand the health, performance, efficiency, and other important characteristics of a microservices application.

For monitoring to be effective, system architects must identify a set of key metrics that provide a baseline for the overall health of the system, such as acceptable latency and call failure rates. In turn, developers need to ensure their services are observable, meaning they generate these metrics and make them accessible to make monitoring possible.
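
To make "observable" more concrete, here is a minimal sketch of a service keeping its own counters and rendering them as text that a monitoring system could scrape. The `MetricsRegistry` class and the metric name `http_requests_total` are hypothetical; the output is only loosely modeled on the style of the Prometheus text format and is not a complete implementation of it.

```python
from collections import defaultdict


class MetricsRegistry:
    """In-process counters that a service could expose on a metrics endpoint."""

    def __init__(self) -> None:
        self._counters = defaultdict(float)

    def inc(self, name: str, amount: float = 1.0, **labels: str) -> None:
        """Increase the counter identified by its name and label set."""
        key = (name, tuple(sorted(labels.items())))
        self._counters[key] += amount

    def render(self) -> str:
        """Render counters as text, one sample per line."""
        lines = []
        for (name, labels), value in sorted(self._counters.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{name}{suffix} {value:g}")
        return "\n".join(lines)


registry = MetricsRegistry()
registry.inc("http_requests_total", service="checkout", status="200")
registry.inc("http_requests_total", service="checkout", status="200")
registry.inc("http_requests_total", service="checkout", status="500")
print(registry.render())
```

In practice a team would more likely use an existing metrics client library; the point of the sketch is only that the service itself produces the numbers that monitoring later consumes.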

Learn more in our detailed guides to:

  • Microservices observability (coming soon)
  • Cloud native observability (coming soon)

Microservices Monitoring Challenges

A major challenge to monitoring microservices environments is to ensure that all the services and components cooperate and deliver high performance and a smooth user experience. Measuring the functioning of these services and how they impact each other and the overall user experience is also challenging.

Traditional monitoring tools focus on infrastructure, specific software components, or overall operational health. These tools are usually only moderately effective, but they may be sufficient for legacy systems with a monolithic architecture. However, microservices deployments expose the weaknesses of conventional monitoring tools.

Microservice architectures host components in containers or virtual machines distributed across a private, public, or hybrid cloud environment.

Measuring the ability of services to communicate with each other and deliver the expected results requires specific monitoring capabilities:

  • End-user experience monitoring—measures client operations and performance on browser and mobile devices.
  • System interaction monitoring—measures the system interactions required to service each transaction. These include interactions between the end-user device and the microservices and other components involved in the user’s request.
  • End-to-end monitoring—helps isolate issues across the microservices environment.

Another significant challenge is identifying the team responsible for each service, because different microservices have different teams who understand them and can fix issues. In a microservices deployment, there is usually a small team responsible for the whole life cycle of each service. Each team must maintain service-specific and cross-service observability throughout the pipeline’s build, test, and release phases. Monitoring should be part of the continuous integration and continuous delivery (CI/CD) pipeline to verify the performance of new code releases.
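
One way to picture monitoring as part of a CI/CD pipeline is a release gate that compares a candidate build against the current baseline. The sketch below is an assumption-laden example: the `ReleaseStats` fields, the 10% latency margin, and the 1% error ceiling are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    p95_latency_ms: float
    error_rate: float  # fraction of failed requests, between 0.0 and 1.0


def release_gate(baseline, candidate, max_latency_regression=0.10, max_error_rate=0.01):
    """Return a list of violations; an empty list means the release may proceed."""
    violations = []
    allowed_latency = baseline.p95_latency_ms * (1 + max_latency_regression)
    if candidate.p95_latency_ms > allowed_latency:
        violations.append("p95 latency regressed beyond the allowed margin")
    if candidate.error_rate > max_error_rate:
        violations.append("error rate exceeds the allowed maximum")
    return violations


baseline = ReleaseStats(p95_latency_ms=200.0, error_rate=0.002)
candidate = ReleaseStats(p95_latency_ms=250.0, error_rate=0.004)
problems = release_gate(baseline, candidate)
print(problems or "release may proceed")
```

A pipeline step could fail the build whenever the returned list is not empty.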

A final challenge is managing the complexity of shared, dynamic services, which requires continuous, accurate documentation and awareness across teams. It is important to train new employees to understand how each component interacts with the others.

Metrics to Monitor in Microservices

Platform Metrics

Monitoring platform metrics is critical to keeping microservices infrastructure running smoothly. This is low-level data that can indicate problems in the underlying compute, storage, or networking equipment. Careful monitoring of these metrics can highlight performance degradation and prevent system-wide failures.

Platform metrics include:

  • Number of requests per second/minute
  • Failed requests per second
  • Average response time per service endpoint
  • Distribution of time required for each request
  • Average execution time for the fastest 10% and slowest 10% of queries
  • Success/failure rate by service
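
The metrics in this list can all be derived from a simple log of requests. The following sketch assumes a hypothetical `Request` record with an endpoint, a duration, and a success flag, and computes the listed values over one time window. It illustrates the arithmetic only and is not how any specific monitoring tool works.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    endpoint: str
    duration_ms: float
    ok: bool


def summarize(requests, window_seconds):
    """Compute request-level metrics for the requests seen in one time window."""
    if not requests:
        raise ValueError("no requests to summarize")
    by_endpoint = defaultdict(list)
    for r in requests:
        by_endpoint[r.endpoint].append(r.duration_ms)
    durations = sorted(r.duration_ms for r in requests)
    tenth = max(1, len(durations) // 10)  # size of the fastest and slowest 10% groups
    failed = sum(1 for r in requests if not r.ok)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "failed_per_second": failed / window_seconds,
        "failure_rate": failed / len(requests),
        "avg_ms_by_endpoint": {e: mean(d) for e, d in by_endpoint.items()},
        "avg_fastest_10pct_ms": mean(durations[:tenth]),
        "avg_slowest_10pct_ms": mean(durations[-tenth:]),
    }


sample = [Request("/cart", d, True) for d in (10.0, 20.0, 30.0, 40.0, 50.0)]
sample += [Request("/pay", d, d != 500.0) for d in (100.0, 200.0, 300.0, 400.0, 500.0)]
report = summarize(sample, window_seconds=5)
print(report)
```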

Resource Metrics

The infrastructure provider typically provides resource metrics that are useful for monitoring infrastructure health. In the cloud these metrics are provided by a system like AWS CloudWatch; in an on-premises environment they could be Kubernetes metrics collected by a system like Prometheus.

Examples of resource metrics include:

  • CPU and memory utilization of nodes and containers: unlike tracking CPU and memory usage on a single, known server, this requires tracking multiple, ephemeral resources.
  • Host count: the number of hosts or pods running the system (enables the identification of availability issues resulting from crashed pods).
  • Live threads: the number of threads spawned by the service (enables the detection of multi-threading issues).
  • Heap usage: statistics related to heap memory usage (for debugging memory leaks).
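
Some of these values can be sampled from inside a service process. The sketch below uses Python's standard `threading` and `tracemalloc` modules to read the live thread count and the memory allocated by Python objects. Note the assumption: `tracemalloc` only tracks allocations made through Python's allocator, so it approximates heap usage rather than reporting what the operating system or container runtime sees.

```python
import threading
import tracemalloc


def sample_runtime():
    """Return a snapshot of thread and Python memory statistics for this process."""
    current, peak = tracemalloc.get_traced_memory()
    return {
        "live_threads": threading.active_count(),
        "heap_current_bytes": current,
        "heap_peak_bytes": peak,
    }


tracemalloc.start()
before = sample_runtime()

buffer = [bytes(1024) for _ in range(1000)]  # allocate roughly 1 MB
stop = threading.Event()
worker = threading.Thread(target=stop.wait, daemon=True)
worker.start()                               # one extra live thread

after = sample_runtime()
stop.set()
worker.join()
tracemalloc.stop()

print(before)
print(after)
```

Comparing snapshots over time is what reveals a leak or a runaway thread pool; a single reading says little on its own.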

Golden Signals

The concept of “golden signals” refers to metrics that are highly useful for monitoring the health of a microservice, or of the entire microservices application, and for identifying and resolving problems. Examples of golden signals include:

  • Availability: the system’s state as measured from the client’s perspective, such as the ratio of errors to total requests.
  • Health: the system’s state as measured using regular pings.
  • Request rate: the rate of requests coming into the system.
  • Saturation: how loaded the system is relative to its capacity, measured as idle time or system load (e.g., available memory or queue depth).
  • Usage: the system’s usage level (CPU load, memory usage, etc.), expressed as a percentage.
  • Error rate: the rate of errors the system is experiencing.
  • Latency: the system’s response time, usually measured at the 95th or 99th percentile.
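
As a worked example, several of these signals can be computed from raw samples as follows. This sketch uses the nearest-rank method for percentiles, expresses availability as the share of successful requests, and treats queue depth relative to a hypothetical queue capacity as the saturation measure. All three are assumptions made for illustration; other definitions are also in common use.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(latencies_ms, errors, window_seconds, queue_depth, queue_capacity):
    """Derive request rate, error rate, availability, latency, and saturation for one window."""
    total = len(latencies_ms)
    return {
        "request_rate": total / window_seconds,
        "error_rate": errors / total,
        "availability": (total - errors) / total,
        "latency_p95_ms": percentile(latencies_ms, 95),
        "latency_p99_ms": percentile(latencies_ms, 99),
        "saturation": queue_depth / queue_capacity,
    }


latencies = list(range(1, 101))  # 100 synthetic samples: 1 ms to 100 ms
signals = golden_signals(latencies, errors=2, window_seconds=50,
                         queue_depth=30, queue_capacity=120)
print(signals)
```

Percentiles are preferred over averages for latency because a small number of very slow requests can hide behind a healthy-looking mean.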

5 Tips for Effective Microservice Monitoring

1. Monitor Containers and What’s Running Inside Them

Containers have become a popular component of microservices. Their portability, speed, and isolation make them an important building block throughout the development lifecycle.

However, containers can make monitoring and troubleshooting more difficult, because they act as black boxes. From a DevOps point of view, it is not enough to know that containers exist; you need to understand what is running inside them.

The traditional monitoring process, which uses a VM or an agent that runs in the host userspace, does not work well with containers because they are small, independent processes with minimal dependencies. Running many monitoring agents can also become expensive, even in a medium-sized deployment.

To overcome this, developers can either directly instrument their code or use common kernel-level instrumentation methods to monitor all container and application activity on the host.
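
The first option, direct code instrumentation, can be sketched with a decorator that records how often a function is called, how often it fails, and how long it takes. The `instrumented` decorator, the `CALLS` table, and the `reserve_stock` function are hypothetical names for this example. Kernel-level instrumentation is a different technique and is not shown here.

```python
import functools
import time
from collections import defaultdict

# Per-function statistics collected by the decorator below.
CALLS = defaultdict(lambda: {"count": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(func):
    """Record call count, error count, and cumulative duration for func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = CALLS[func.__name__]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise
        finally:
            stats["count"] += 1
            stats["total_seconds"] += time.perf_counter() - start
    return wrapper


@instrumented
def reserve_stock(quantity):
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity


reserve_stock(3)
try:
    reserve_stock(0)
except ValueError:
    pass
print(dict(CALLS["reserve_stock"]))
```

The trade-off is that instrumentation like this has to be added and maintained in each service's code, whereas host-level approaches observe containers from the outside.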

2. Service Performance Monitoring

To automate software deployment for containerized applications, DevOps teams use orchestration systems such as Kubernetes, which take a logical blueprint of an application and deploy it as a set of containers. Developers use Kubernetes to define microservices and understand the state of deployed services.

DevOps teams should configure alerts to focus on attributes closely related to the service experience. These alerts can immediately let operational staff know if anything is affecting the application or end users. Container-native monitoring solutions can use orchestration metadata to dynamically aggregate container and application data at a service level.

It is important to look at threshold values for individual containers and their hosts, and also to monitor at the pod level, where a pod groups the containers used to run a microservice. Service performance monitoring covers both application-level and infrastructure-level information. It involves tracking metrics like slowest query response time, URLs showing errors, containers that consume too many resources, and resource utilization on hosts.
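
A simplified picture of service-level aggregation: each container sample carries a service name taken from orchestration metadata (for example, a label), and samples are rolled up per service before alert thresholds are applied. The `ContainerSample` fields and the 5% error threshold below are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ContainerSample:
    container_id: str
    service: str        # assumed to come from orchestration metadata, such as a label
    cpu_percent: float
    error_count: int
    request_count: int


def aggregate_by_service(samples):
    """Roll container-level samples up to one summary per service."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s.service].append(s)
    summary = {}
    for service, items in grouped.items():
        requests = sum(i.request_count for i in items)
        errors = sum(i.error_count for i in items)
        summary[service] = {
            "containers": len(items),
            "avg_cpu_percent": sum(i.cpu_percent for i in items) / len(items),
            "error_rate": errors / requests if requests else 0.0,
        }
    return summary


def services_to_alert(summary, max_error_rate=0.05):
    """Return the services whose aggregated error rate exceeds the threshold."""
    return sorted(name for name, s in summary.items() if s["error_rate"] > max_error_rate)


samples = [
    ContainerSample("c1", "checkout", 40.0, 2, 100),
    ContainerSample("c2", "checkout", 60.0, 18, 100),
    ContainerSample("c3", "search", 20.0, 1, 200),
]
summary = aggregate_by_service(samples)
print(services_to_alert(summary))
```

Alerting on the aggregated service rather than on each container keeps alerts tied to what users actually experience.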

3. Monitoring Multi-location and Elastic Services

Elastic services are not a new concept, but they change much faster in containerized environments than virtualized ones. A changing environment can make monitoring more difficult.

In monolithic applications, monitoring typically required manual tuning of metrics based on individual deployments—for example, configuring collection for a specific metric on specific servers. This approach is not feasible in a large microservices environment. Microservice monitoring should have fully automated metrics collection and be able to scale up and down without human intervention. It also needs to run dynamically in a cloud-native environment across multiple data centers, clouds, and edge locations.
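
Automated collection in an elastic environment usually comes down to reconciling the set of monitored targets against whatever instances currently exist. The sketch below assumes a hypothetical `discover` callable that returns the addresses of live instances; in a real system this information would come from the orchestrator or a service registry.

```python
def reconcile(current_targets, discovered):
    """Return (targets_to_add, targets_to_remove) given the latest discovery result."""
    return discovered - current_targets, current_targets - discovered


class Collector:
    """Keeps its target list in sync with discovery, with no manual configuration."""

    def __init__(self, discover):
        self._discover = discover  # hypothetical callable returning live instance addresses
        self.targets = set()

    def refresh(self):
        to_add, to_remove = reconcile(self.targets, set(self._discover()))
        self.targets = (self.targets | to_add) - to_remove
        return to_add, to_remove


live_instances = {"10.0.0.1:8080", "10.0.0.2:8080"}
collector = Collector(lambda: live_instances)
collector.refresh()                      # both instances start being collected
live_instances.discard("10.0.0.1:8080")  # one instance is scaled down
live_instances.add("10.0.0.3:8080")      # a replacement is scaled up
added, removed = collector.refresh()
print(sorted(collector.targets))
```

Calling `refresh` on a schedule lets collection follow scale-up and scale-down events without anyone editing a target list.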

4. Monitor APIs

APIs are the common language in a microservices environment. In a properly defined microservice, the API is the only element of the service exposed to other teams or external systems. In fact, API response and conformance can be the de facto SLA of a microservice even when no formal SLAs are defined. This makes API monitoring extremely important.

The basic form of API monitoring is binary uptime checks. But this is not enough. Here are a few ways to extend API monitoring and make it more useful:

  • Monitoring the most commonly used endpoints at a given point in time. This allows the team to see if there are any noticeable changes to the use of the service due to changes in the design of other microservices, or changes in the way users consume the service.
  • Monitoring slow endpoints. Identifying the slowest endpoints across the environment can reveal serious problems, or at least point to areas of your system that need optimization.
  • Distributed tracing of service calls through the system. This type of analysis helps teams understand the end-to-end user experience and how requests interact with infrastructure and application components.
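
The core idea behind distributed tracing can be sketched in a few lines: every call records a span, child spans reuse the trace ID of the call that triggered them, and the spans are later joined into a call tree. The `Span` fields and the service names below are hypothetical, and real tracing systems carry considerably more context than this.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def new_span(service, duration_ms, parent=None):
    """Create a span; a child reuses its parent's trace ID so calls can be joined later."""
    return Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        service=service,
        duration_ms=duration_ms,
    )


def render_trace(spans):
    """Return indented lines showing the call tree for the spans of one trace."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.service} ({span.duration_ms:g} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


root = new_span("gateway", 120)
cart = new_span("cart", 30, parent=root)
pay = new_span("payment", 80, parent=root)
bank = new_span("bank-adapter", 65, parent=pay)
print("\n".join(render_trace([root, cart, pay, bank])))
```

Reading the tree makes it clear which downstream call accounts for most of the time in a slow request.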

5. Map Monitoring to Your Organizational Structure

As organizations adopt microservices, they typically reorganize teams in a microservices-compatible structure. These small, decoupled teams have a high degree of control over the languages they use, how errors are handled, and even operational responsibilities.

Monitoring also needs to reflect this structure. A microservices monitoring solution should allow individual teams to define their own alerts, metrics, and dashboards, while providing a broader view of the system that is shared by all teams.
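
One possible shape for this is sketched below: each alert rule is owned by a team, rules are evaluated against current metric values, and the results can be viewed per team or flattened into a shared list. The `AlertRule` fields, team names, and metric names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    team: str
    metric: str
    threshold: float


def evaluate(rules, metrics):
    """Return {team: [metric names]} for rules whose metric exceeds its threshold."""
    fired = {}
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.setdefault(rule.team, []).append(rule.metric)
    return fired


def shared_view(fired):
    """Flatten per-team alerts into one sorted list for a cross-team dashboard."""
    return sorted((team, metric) for team, names in fired.items() for metric in names)


rules = [
    AlertRule("checkout", "checkout_error_rate", 0.05),
    AlertRule("checkout", "checkout_p95_ms", 500),
    AlertRule("search", "search_p95_ms", 300),
]
metrics = {"checkout_error_rate": 0.08, "checkout_p95_ms": 420, "search_p95_ms": 350}
fired = evaluate(rules, metrics)
print(fired)
print(shared_view(fired))
```

Keeping the team on every rule means each team can own its thresholds while the same data still feeds a system-wide view.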

Microservices Monitoring with Lumigo

Lumigo is a troubleshooting platform purpose-built for microservice-based applications. Developers using container technologies like Kubernetes, or serverless services like AWS Lambda, to orchestrate their applications can use Lumigo to monitor, trace, and troubleshoot issues fast. Deployed with zero code changes and automated in one click, Lumigo stitches together every interaction between microservices and managed services into end-to-end stack traces. These traces, served alongside request payload data, give developers complete visibility into their container environments. Using Lumigo, developers get:

  • End-to-end virtual stack traces across every microservice and managed service that makes up a serverless application, in context
  • API visibility that makes all the data passed between services available and accessible, making it possible to perform root cause analysis without digging through logs
  • Distributed tracing that is deployed with no code and automated in one click
  • Unified platform to explore and query across microservices, see a real-time view of applications, and optimize performance

