May 22 2020
One of the great things about Lumigo is that it records a lot of context about each Lambda invocation. This includes the invocation event and its return value, as well as the environment variables that were in use at the time.
I find this super helpful because it gives me all the relevant information about an invocation in one place. I don’t have to jump between different screens to find the relevant information and then piece the clues together in my head.
It recently dawned on me that, because environment variables can change between deployments, this makes Lumigo a time machine for your environment variables!
As we often pass static configuration and resource identifiers such as DynamoDB table names into Lambda via environment variables, this can be a powerful tool for debugging problems related to environmental changes. And it proved very useful while debugging a tricky error on a client project recently.
Here’s what went down.
When passing information across service boundaries, I prefer using SSM parameters. In this case, we have a shared-infrastructure repo with all the infrastructure pieces that are shared by our services. Amongst these is a Cognito User Pool that all user-facing APIs use. Some APIs need to implement custom authorization logic to support groups (e.g. only “admin” users can access certain endpoints). These services implement their own custom authorizer Lambda functions, which are deployed as part of their service stacks.
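To make the group check concrete, here is a minimal sketch of what such a custom authorizer might do. It is not the client's actual code: the function names are hypothetical, and it assumes the JWT has already been verified and decoded into a `claims` dictionary (token verification is deliberately left out).

```python
# Hypothetical sketch of a group-based custom authorizer.
# Assumption: `claims` comes from a JWT that was already verified elsewhere.


def build_policy(principal_id, effect, resource):
    """Build the policy document that a Lambda authorizer returns to API Gateway."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resource,
                }
            ],
        },
    }


def authorize(claims, method_arn, required_group="admin"):
    """Allow the request only if the user belongs to `required_group`."""
    # Cognito puts group membership in the "cognito:groups" claim.
    groups = claims.get("cognito:groups", [])
    effect = "Allow" if required_group in groups else "Deny"
    return build_policy(claims.get("sub", "anonymous"), effect, method_arn)
```

A real authorizer would also validate the token's signature, issuer and expiry against the User Pool, which is why it needs to know which User Pool to trust.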
To share the ARN of this Cognito User Pool with other services, the shared-infrastructure CloudFormation stack would provision an SSM parameter. This way, the other services can reference and fetch the ARN in their serverless.yml (we use the Serverless framework) using the ${ssm:/path/to/parameter} syntax.
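The important detail is that the reference is resolved when the service is deployed, so the function's environment variable holds whatever the parameter contained at that moment. The sketch below is only a toy model of that behaviour, not how the Serverless framework is implemented: `resolve_ssm_refs` is a hypothetical helper, and a plain dictionary stands in for the SSM Parameter Store.

```python
import re

# Matches references of the form ${ssm:/path/to/parameter}.
SSM_REF = re.compile(r"\$\{ssm:([^}]+)\}")


def resolve_ssm_refs(value, parameter_store):
    """Replace each ${ssm:/path} reference with the value stored at that path.

    `parameter_store` is a dict standing in for SSM (an assumption made so the
    example runs without AWS access).
    """

    def lookup(match):
        path = match.group(1)
        if path not in parameter_store:
            raise KeyError(f"SSM parameter not found: {path}")
        return parameter_store[path]

    return SSM_REF.sub(lookup, value)


def resolve_environment(environment, parameter_store):
    """Resolve every environment variable once, as a deployment would."""
    return {
        name: resolve_ssm_refs(value, parameter_store)
        for name, value in environment.items()
    }
```

If the parameter is changed after `resolve_environment` has run, the resolved environment keeps the old value until the next deployment, and a deployment that happens while the parameter holds a temporary value bakes that temporary value in.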
Long story short, our end-to-end tests in the dev environment suddenly broke after a deployment. Looking at the Issues page in Lumigo pointed us to the custom authorizer function as the culprit.
And comparing the failed invocations with the last successful one quickly showed that the COGNITO_USER_POOL environment variable had changed.
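The comparison itself is simple once you have the two snapshots. As a rough illustration (this is not Lumigo's API; `diff_env` and the sample values are made up for the example), diffing two environment snapshots could look like this:

```python
def diff_env(before, after):
    """Return the variables that were added, removed, or changed.

    Each entry maps a variable name to an (old, new) tuple, where None means
    the variable was absent from that snapshot.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical snapshots from the last successful and the first failed invocation.
last_success = {"COGNITO_USER_POOL": "pool-arn-original", "STAGE": "dev"}
first_failure = {"COGNITO_USER_POOL": "pool-arn-temporary", "STAGE": "dev"}

for name, (old, new) in diff_env(last_success, first_failure).items():
    print(f"{name}: {old!r} -> {new!r}")
```

Unchanged variables such as `STAGE` drop out of the result, which is what makes the one variable that did change stand out.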
“Interesting, how could that have happened?” I mused.
Anyhow, we went to the SSM parameter, checked its history, and found the person who made the change. A quick Slack exchange later, we were able to solve the mystery.
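We did this in the console, but the same history is available through the SSM `GetParameterHistory` API. The sketch below assumes you pass in a boto3 SSM client; the helper name `recent_changes` and the parameter path are hypothetical.

```python
def recent_changes(ssm_client, name, limit=5):
    """Return the most recent versions of an SSM parameter, newest first.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    versions = []
    kwargs = {"Name": name}
    while True:
        page = ssm_client.get_parameter_history(**kwargs)
        versions.extend(page["Parameters"])
        token = page.get("NextToken")
        if not token:
            break
        # Keep paging until the full history has been collected.
        kwargs["NextToken"] = token

    versions.sort(key=lambda v: v["Version"], reverse=True)
    return [
        {
            "version": v["Version"],
            "value": v["Value"],
            "modified_by": v["LastModifiedUser"],
            "modified_at": v["LastModifiedDate"],
        }
        for v in versions[:limit]
    ]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   for change in recent_changes(boto3.client("ssm"), "/path/to/parameter"):
#       print(change)
```

The `modified_by` field is what tells you who to message on Slack.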
Turns out, another team had manually changed the SSM parameter value briefly to test something, before quickly changing it back. Because the SSM reference is resolved at deployment time, we were unlucky enough to deploy at just the wrong moment and picked up the temporary value!
To prevent such misfortune in the future, people shouldn’t change these shared parameters by hand. Giving each team its own environment would also have prevented this.
But nonetheless, this was almost a non-event: it took us less than five minutes to work out what had happened, because we had snapshots of the environment variables from before the deployment to compare against. In the past, this kind of error would have been a lot harder to debug.
When something breaks immediately after a code change, you tend to (and rightly so) suspect the code change and not the execution environment (which is assumed to be fairly static).
Being able to compare snapshots of environment variables makes it trivial to detect these problems. 99% of the time, environment variables would not be the cause, and being able to rule them out quickly allows you to focus your effort on more likely culprits.
I hope this post helps you appreciate this not-so-flashy feature Lumigo offers. And for more blog posts like this, subscribe to our newsletter and sign up to the platform for free at lumigo.io.