Microservices is an application development approach in which large-scale applications are built as modular components, known as services. Each service supports a specific task or business goal and communicates with other services using simple, well-defined interfaces, typically application programming interfaces (APIs).
Monitoring and managing microservices can be particularly challenging because a microservices application may contain hundreds of services and thousands of service instances. Monitoring all service components and their interactions is complex: it requires building observability into individual services, detecting failures, collecting logs centrally, and deriving metrics from logs to identify a range of production issues.
This is part of an extensive series of guides about microservices.
Software architecture has changed over the past two decades from traditional, tightly coupled monolithic applications to loosely coupled microservices. Monitoring systems are undergoing changes no less dramatic. Monitoring the health and performance of microservices is not as easy as tracking CPU and memory usage on a single, known server, which creates new challenges for development teams.
The flexibility of microservices environments comes at the price of higher operational complexity. In a traditional monolithic application it was possible to simply open a remote shell to a server to diagnose a problem, but this is no longer possible when applications have tens or hundreds of unique microservices.
There are two more key differences between monolithic and microservices monitoring, covered in the sections below: how the health of services is monitored, and the relationship between observability and monitoring.
Related content: Read our guide to cloud native monitoring (coming soon).
In a microservices architecture, every service is an independent unit that interacts with external users as well as with other microservices. These services are combined to provide end users with the capabilities they need.
If there is a problem with one of the services, teams need to know as soon as possible to take action. They’ll need to know what happened, why, when, and under what conditions. This information makes it possible to identify the root cause and recover the malfunctioning service quickly.
While it is more complex to monitor health in a microservices environment, given the right data, it is also easier to resolve issues. Each microservice is small and self-contained, making it easier to identify problems and resolve them (for example, by releasing a patch or a new version of the individual microservice). Problems can be addressed quickly, without impacting other microservices or the application as a whole.
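As an illustration, a minimal health-check endpoint might look like the following sketch, which uses only the Python standard library; the dependency checks are hypothetical placeholders for a service's real downstream dependencies.

```python
# Minimal health-check endpoint sketch using only the Python standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    return True  # placeholder: e.g., run "SELECT 1" against the service's database

def check_cache() -> bool:
    return True  # placeholder: e.g., ping the cache backend

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        checks = {"database": check_database(), "cache": check_cache()}
        healthy = all(checks.values())
        body = json.dumps(
            {"status": "ok" if healthy else "degraded", "checks": checks}
        ).encode()
        # A 200 lets callers (e.g., an orchestrator's probes) treat the service as healthy.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Returning the status of each dependency, not just an overall flag, helps teams see what happened and under what conditions when a check fails.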
Learn more in our detailed guide to microservices health checks (coming soon).
Observability for microservices focuses on giving development teams access to the data they need to identify problems and efficiently resolve them. By contrast, monitoring is the process of tracking performance and identifying problems and anomalies.
Observability is a characteristic of a microservices system: the components of the system expose the information necessary to discover issues. Monitoring builds on observability, collecting the available metrics and using them to understand the health, performance, efficiency, and other important characteristics of a microservices application.
For monitoring to be effective, system architects must identify a set of key metrics that provide a baseline for the overall health of the system, such as acceptable latency and call failure rates. In turn, developers need to ensure their services are observable, meaning they generate these metrics and make them accessible to make monitoring possible.
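For example, a service can be made observable by exposing exactly these kinds of baseline metrics. The following sketch uses the prometheus_client library (an assumption; any metrics library would do) to publish request latency and failure counts; the metric names and simulated workload are illustrative.

```python
# Sketch: exposing latency and failure metrics with prometheus_client
# (assumed installed via `pip install prometheus-client`).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency")
REQUEST_FAILURES = Counter("request_failures_total", "Failed requests")

def handle_request():
    with REQUEST_LATENCY.time():              # records latency for the baseline metric
        try:
            time.sleep(random.random() / 10)  # stand-in for real work
            if random.random() < 0.05:
                raise RuntimeError("simulated failure")
        except RuntimeError:
            REQUEST_FAILURES.inc()            # records call failures for alerting

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for a monitoring system to scrape
    while True:
        handle_request()
```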
A major challenge in monitoring microservices environments is ensuring that all the services and components cooperate to deliver high performance and a smooth user experience. Measuring how these services function, how they impact each other, and how they affect the overall user experience is equally challenging.
Traditional monitoring tools focus either on infrastructure, specific software components, or overall operational health. These tools are usually only moderately effective, but they may be sufficient for legacy systems with a monolithic architecture. However, microservices deployments expose the weaknesses of conventional monitoring tools.
Microservice architectures host components in containers or virtual machines distributed across a private, public, or hybrid cloud environment.
Measuring the ability of services to communicate with each other and deliver the expected results requires monitoring capabilities designed specifically for these distributed, dynamic environments.
Another significant challenge is identifying the team responsible for each service—different microservices have different teams who understand them and can fix issues. In a microservices deployment, there is usually a small team responsible for the whole life cycle of each service. Each team must maintain service-specific and cross-service observability throughout the pipeline’s build, test, and release phases. Monitoring should be part of the continuous integration and continuous delivery (CI/CD) pipeline to guarantee the performance of new code releases.
A final challenge is managing the complexity of shared, dynamic services, which requires continuous, accurate documentation and organizational awareness. It is also important to train new employees to understand how each component interacts with the others.
Monitoring platform metrics is critical to keeping microservices infrastructure running smoothly. This is low-level data that can indicate problems in the underlying compute, storage, or networking equipment. Careful monitoring of these metrics can highlight performance degradation and prevent system-wide failures.
Platform metrics include low-level measurements such as CPU utilization, memory consumption, disk I/O, and network throughput on the machines that run the services.
The infrastructure provider typically exposes resource metrics that are useful for monitoring infrastructure health. In the cloud, these metrics are provided by a system like AWS CloudWatch; in an on-premises environment, they could be Kubernetes metrics generated by a system like Prometheus.
Examples of resource metrics include CPU utilization, memory usage, disk space, and network throughput for the nodes and pods that run each service.
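As a sketch of what collecting such metrics can look like, the following snippet pulls average CPU utilization from AWS CloudWatch using boto3; the instance ID is a placeholder, and AWS credentials are assumed to be configured.

```python
# Sketch: fetching a resource metric from AWS CloudWatch with boto3
# (assumed installed via `pip install boto3` and configured with credentials).
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')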
The concept of “golden signals” refers to metrics that are especially useful for monitoring the health of a microservice, or of the entire microservices application, and for identifying and resolving problems. The four golden signals are latency, traffic, errors, and saturation.
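The sketch below illustrates how the four signals might be derived from raw counters a service exposes; the data structure and numbers are made up for illustration.

```python
# Illustrative sketch: deriving the four golden signals (latency, traffic,
# errors, saturation) from raw counters over a measurement window.
from dataclasses import dataclass

@dataclass
class WindowStats:
    request_count: int      # requests in the window
    error_count: int        # failed requests in the window
    total_latency_s: float  # summed request latency in seconds
    window_s: float         # window length in seconds
    used_capacity: float    # e.g., busy worker threads
    max_capacity: float     # e.g., total worker threads

def golden_signals(s: WindowStats) -> dict:
    return {
        "latency_avg_s": s.total_latency_s / max(s.request_count, 1),
        "traffic_rps": s.request_count / s.window_s,
        "error_rate": s.error_count / max(s.request_count, 1),
        "saturation": s.used_capacity / s.max_capacity,
    }

print(golden_signals(WindowStats(1200, 18, 96.0, 60.0, 42, 64)))
```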
Containers have become a popular component of microservices. Their portability, speed, and isolation make them an important building block throughout the development lifecycle.
However, containers can make monitoring and troubleshooting more difficult, because they act as black boxes. From a DevOps point of view, not only do you need to know that containers exist; you need to understand them in detail.
The traditional monitoring approach of running an agent per VM, or in the host userspace, does not work well with containers, which are small, independent processes with minimal dependencies. Running a monitoring agent alongside every container also becomes expensive at scale.
To overcome this, developers can either directly instrument their code or use common kernel-level instrumentation methods to monitor all container and application activity on the host.
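As a rough illustration of host-side visibility without per-container agents, the sketch below reads a container's memory usage directly from cgroup v2 files; the cgroup path is hypothetical and depends on the container runtime and operating system.

```python
# Sketch: reading container memory usage from cgroup v2 files on the host.
# Paths assume cgroup v2; run on the host with suitable permissions.
from pathlib import Path

def read_container_memory(cgroup_dir: str) -> dict:
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text())
    peak_file = base / "memory.peak"  # present only on newer kernels
    peak = int(peak_file.read_text()) if peak_file.exists() else None
    return {"current_bytes": current, "peak_bytes": peak}

# Hypothetical path; the actual layout depends on the container runtime.
print(read_container_memory("/sys/fs/cgroup/system.slice/docker-abc123.scope"))
```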
To automate software deployment for containerized applications, DevOps teams use orchestration systems such as Kubernetes, which take a logical blueprint of an application and deploy it as a set of containers. Developers use Kubernetes to define microservices and understand the state of deployed services.
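For example, the state of the pods backing a service can be queried programmatically. The sketch below uses the official Kubernetes Python client (assumed installed via `pip install kubernetes`); the namespace and label selector are placeholders.

```python
# Sketch: checking the state of a deployed service's pods via the
# official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("default", label_selector="app=checkout")  # placeholders
for pod in pods.items:
    ready = all(c.ready for c in (pod.status.container_statuses or []))
    print(pod.metadata.name, pod.status.phase, "ready" if ready else "not ready")
```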
DevOps teams should configure alerts to focus on attributes closely related to the service experience. These alerts can immediately let operational staff know if anything is affecting the application or end users. Container-native monitoring solutions can use orchestration metadata to dynamically aggregate container and application data at a service level.
It is important to look at threshold values for individual containers and their hosts, and also to monitor at the pod level, which groups the containers used to run a microservice. Service performance monitoring spans both application-level and infrastructure-level data. It involves tracking metrics like the slowest query response times, URLs returning errors, containers that consume too many resources, and resource utilization on hosts.
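A simplified illustration of such threshold checks at both the pod and container level follows; the metric values are made up and stand in for data pulled from a monitoring backend.

```python
# Illustrative threshold checks at pod and container level.
CONTAINER_CPU_LIMIT = 0.90  # assumed alert threshold (fraction of CPU limit)
POD_MEMORY_LIMIT = 0.85     # assumed alert threshold (fraction of memory limit)

pod = {  # stand-in for values fetched from a monitoring backend
    "name": "checkout-7f9c",
    "memory_utilization": 0.88,
    "containers": {
        "app": {"cpu_utilization": 0.95},
        "sidecar": {"cpu_utilization": 0.30},
    },
}

alerts = []
if pod["memory_utilization"] > POD_MEMORY_LIMIT:
    alerts.append(f'pod {pod["name"]}: memory at {pod["memory_utilization"]:.0%}')
for name, c in pod["containers"].items():
    if c["cpu_utilization"] > CONTAINER_CPU_LIMIT:
        alerts.append(f"container {name}: CPU at {c['cpu_utilization']:.0%}")

for alert in alerts:
    print("ALERT:", alert)
```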
Elastic services are not a new concept, but they change much faster in containerized environments than in virtualized ones. A changing environment can make monitoring more difficult.
In monolithic applications, monitoring typically required manual tuning of metrics based on individual deployments—for example, configuring collection for a specific metric on specific servers. This approach is not feasible in a large microservices environment. Microservice monitoring should have fully automated metrics collection and be able to scale up and down without human intervention. It also needs to run dynamically in a cloud-native environment across multiple data centers, clouds, and edge locations.
APIs are the common language in a microservices environment. In a properly defined microservice, the API is the only element of the service exposed to other teams or external systems. In fact, API response and conformance can be the de facto SLA of a microservice even when no formal SLAs are defined. This makes API monitoring extremely important.
The basic form of API monitoring is a binary uptime check, but this is not enough. API monitoring should also validate what an API returns and how quickly, not just whether it responds, as in the sketch below.
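A minimal sketch of such an extended check, using only the Python standard library; the URL, expected fields, and latency budget are hypothetical.

```python
# Sketch: an API check that goes beyond a binary uptime probe by measuring
# latency and validating the shape of the response.
import json
import time
import urllib.request

URL = "https://api.example.com/v1/status"  # hypothetical endpoint
EXPECTED_FIELDS = {"status", "version"}    # hypothetical response fields
LATENCY_BUDGET_S = 0.5                     # assumed latency budget

start = time.monotonic()
with urllib.request.urlopen(URL, timeout=5) as resp:
    latency = time.monotonic() - start
    assert resp.status == 200, f"unexpected status {resp.status}"
    body = json.load(resp)

assert latency <= LATENCY_BUDGET_S, f"latency {latency:.3f}s exceeds budget"
missing = EXPECTED_FIELDS - body.keys()
assert not missing, f"response missing fields: {missing}"
print(f"OK: {latency * 1000:.0f} ms, all expected fields present")
```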
As organizations adopt microservices, they typically reorganize teams in a microservices-compatible structure. These small, decoupled teams have a high degree of control over the languages they use, how errors are handled, and even operational responsibilities.
Monitoring needs to also reflect this structure. A microservices monitoring solution should allow individual teams to define their own alerts, metrics, and dashboards, while providing a broader view of the system that is shared by all teams.
Lumigo is a troubleshooting platform, purpose-built for microservice-based applications. Developers using container technologies like Kubernetes, or serverless services like AWS Lambda, to orchestrate their applications can use Lumigo to monitor, trace, and troubleshoot issues fast. Deployed with zero code changes and automated in one click, Lumigo stitches together every interaction between microservices and managed services into end-to-end traces. These traces, served alongside request payload data, give developers complete visibility into their container environments.
Get started with a free trial of Lumigo for your microservice applications
Together with our content partners, including CodeSee, Lumigo, and Tigera, we have authored in-depth guides on several other topics that can also be useful as you explore the world of microservices.