A basic requirement of any software is that you verify it works properly before you deliver it to the end user. The question is how to adapt our approach to testing and debugging for serverless architectures, given their cloud-based, highly distributed nature.
There have been different trends around testing in recent years, but the overriding goal is that we want to release good working software to our customers as quickly as we possibly can.
To do that, one of the major changes has been the move away from the endless cycle in which developers build the software and then, at the end of the development phase, hand it off to the QA or testing department, which sends it back to the developers to fix. Instead, we’ve moved toward a concept called “shift left”.
Testing has moved to an earlier stage of the development process, and that means that developers are more responsible for testing the software they create.
At a startup like Lumigo we don’t have a QA department at all, and even at much larger, more established enterprises developers take on more responsibility for testing, while QA departments focus on more holistic work such as integration testing.
In his book Succeeding with Agile, Mike Cohn described testing as a pyramid with three layers: unit tests, component tests and user interface tests.
In the unit test we test a small portion of the software. One example might be a function where the user presses a button and a message appears on the screen. So, the developer builds an automated test to check that specific function at the unit level.
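As a rough illustration of the button example, here is a minimal sketch of a unit test in Python. The function `on_button_press` and its messages are hypothetical, invented only for this example; the point is that the test exercises one small function in isolation.

```python
def on_button_press(user_name: str) -> str:
    """Hypothetical handler: returns the message shown after the button is pressed."""
    if not user_name:
        return "Hello, guest!"
    return f"Hello, {user_name}!"


# pytest-style unit tests: each one checks a single behavior of the function.
def test_greets_named_user():
    assert on_button_press("Dana") == "Hello, Dana!"


def test_falls_back_to_guest():
    assert on_button_press("") == "Hello, guest!"
```

A test runner such as pytest would pick up both `test_` functions automatically; they can also be called directly.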
With automated component tests we test the behavior of different components in our application and the interaction between them. Sometimes this is called API testing, or integration testing. Even end-to-end tests are sometimes part of component level testing.
If you’re working on a part of the software that involves user interaction, you get to user interface testing. Typically, this is the kind of testing that is performed by the QA team. While unit tests and component tests are usually automated, most user interface tests are done manually, although some automation tools are available.
One of the reasons Cohn described this testing methodology as a pyramid is that the number of tests increases as you move down the pyramid. That’s because tests at the lower levels are simpler and cheaper to write, maintain and execute. Moving in the other direction, testing becomes more complex and more expensive.
By focusing on unit testing we’re able to get much better “coverage” of our codebase than in years past when manual testers would act as users, doing exploratory testing in search of bugs and other issues.
In many ways, very little changes about our trusty pyramid when it comes to testing serverless applications.
At the bottom, unit testing remains largely unchanged. The same goes for the apex of the pyramid, the UI tests. It’s the middle of the pyramid that becomes the main focus in serverless testing.
One of the defining features of serverless is the greater granularity of services: the combination of our functions and managed third-party services like DynamoDB, SQS, and so on. So, while unit tests are important, the application is a combination of many components and, because of that, component testing becomes much more important in a serverless environment. You have so many moving parts that if you don’t know how they interact you can quickly get lost.
The big testing question with serverless, which is cloud-native by nature, is whether you should do your testing locally on your machine, or in the real environment where it will run afterward.
There are many tools that enable you to do local testing. Serverless Framework is the best known, and then there is AWS SAM, which has mocks for some of the services you’ll be using in a production environment, such as API Gateway.
The main benefit is that it’s much faster to test on your own machine, so you’re saving time. It’s also cheaper because you don’t need to pay Amazon for testing infrastructure. Additionally, it’s much easier to test locally because you can see and understand exactly what is going on in your environment. When it’s in the cloud it’s harder to understand the root cause of a problem.
There are, however, several disadvantages to local testing that outweigh the benefits. A big one is that other team members are working on other functions, and the application is a combination of all these different components. I may not be testing against the latest version of my colleagues’ functions.
Another issue is that in serverless, Lambda functions are the glue between the different managed services, which we simply configure to do what we want.
When I’m running everything against local mocks I have different configurations, and a problem I encounter may be due to the configuration. For example, with DynamoDB in provisioned mode you pay Amazon for read and write throughput, that is, a set number of reads and writes per second. If I have a function that writes to a database and I test it locally on my own, I may not come across any issues. But in the real environment there may be 50 other components writing to the same database simultaneously, and my writes may be throttled. It’s a totally different scenario.
One of the most problematic issues is that we have mocks for some of the managed services, like DynamoDB and API Gateway, but not for everything. We don’t have them for Kinesis, SNS, SQS and others, so it’s difficult to get a complete picture.
With a hybrid methodology, all the managed services run in the cloud, where we access them, while our functions run locally. It’s easy to debug because our code is on our own machine. The code under test is smaller because we’re only testing our own, we’re using all the correct configurations, and we still don’t need much time for deployment, so it is relatively simple.
The disadvantage is that we now need to pay Amazon for a testing environment as well as the production environment. On top of that, we might still be using older versions of other developers’ functions, so we don’t know whether there are any integration issues between our function and theirs. Remember that without a well-orchestrated CI/CD pipeline, the tests run before the merge with the latest code. This is not a concern if all tests run as part of a pipeline that always uses the latest version. The drawback of testing that way is the build time: the duration is reasonable before a merge, but not as part of a developer’s ongoing work.
The third option is to do all the testing in the cloud. The big advantage is that it’s exactly like production. The downside is that it costs more money because each developer on your team needs their own account to serve as a testing environment, although it shouldn’t come to more than several hundred dollars for a big team.
Deploy time is an issue with cloud testing because you need to upload everything. But with some tweaks you can make this problem more manageable. We use a bash script to check what is new and push only the things that have changed to the cloud. So, as developers, we check our environment: if somebody has flagged something as changed, we deploy everything to the cloud for testing. If not, we only need to push the function we’re working on to the testing environment.
This process is done automatically, of course, at the push of a button. Again, each developer runs the first basic tests on their own machine, and the overall testing is part of the pipeline process.
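To make the "only push what changed" idea concrete, here is a minimal Python sketch. It is not the bash script mentioned above; it assumes a hypothetical layout with one directory per function, and a hypothetical record (`deployed`) of the content hash of each function at its last deployment. The actual upload step is left out.

```python
import hashlib
from pathlib import Path


def directory_digest(directory: Path) -> str:
    """Hash every file under a function's directory (relative paths plus contents)."""
    digest = hashlib.sha256()
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        digest.update(path.relative_to(directory).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def functions_to_deploy(functions_root: Path, deployed: dict[str, str]) -> list[str]:
    """Return the names of functions whose content differs from the last deployment.

    `deployed` maps function name -> digest recorded at the last deployment
    (a hypothetical bookkeeping format chosen for this sketch).
    """
    changed = []
    for function_dir in sorted(p for p in functions_root.iterdir() if p.is_dir()):
        if deployed.get(function_dir.name) != directory_digest(function_dir):
            changed.append(function_dir.name)
    return changed
```

A deploy script could call `functions_to_deploy`, push only the returned functions, and then store their new digests for the next run.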
Testing in the cloud is the only way to discover whether you’re exceeding the limits of what is allowed within the environment.
At the time of writing, AWS sets a limit of 1,000 concurrent Lambda executions (these limits change from time to time, so you’d better check the current values). Limits also exist for storage and memory. You need to know what the limits are, and you need to test your software against them to find out whether there will be a problem once it reaches production. Concurrency and timeouts are the big ones you need to watch out for.
Another thing you can only test properly in the cloud is efficiency and cost, because everything is about cost. The amount of memory that you allocate to your function also dictates how much CPU it gets.
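Before running a load test, a back-of-the-envelope estimate can hint at whether the concurrency limit is within reach. The sketch below uses the common approximation that concurrency is the request rate multiplied by the average execution time. The traffic figures are made up for illustration, and the default limit of 1,000 is simply the number quoted above, so check the current value for your account.

```python
import math


def estimated_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Approximate concurrent executions: arrival rate times average execution time."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def exceeds_limit(requests_per_second: float, avg_duration_seconds: float,
                  concurrency_limit: int = 1000) -> bool:
    """True if the estimated concurrency is above the given limit."""
    return estimated_concurrency(requests_per_second, avg_duration_seconds) > concurrency_limit


# Hypothetical traffic: 400 requests per second.
fast = exceeds_limit(400, 0.5)   # short executions stay well under the limit
slow = exceeds_limit(400, 3.0)   # the same traffic with slow executions goes over it
```

The estimate only tells you where to look. The real behavior under throttling and timeouts still has to be observed in the cloud.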
The amount you pay depends on the memory you allocate and the duration of execution. So, a higher memory allocation may end up cheaper than a smaller allocation because the execution time could be cut significantly. There is no hard and fast rule, so it requires experimentation to arrive at the most efficient configuration.
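The trade-off can be sketched with a little arithmetic. The example below assumes that compute cost is memory (in GB) times duration (in seconds) times a price per GB-second. The price constant and the measured durations are placeholders for illustration, not real measurements, and the per-request fee is ignored.

```python
# Placeholder price per GB-second; look up the current price for your region.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: int, duration_ms: float,
                    price_per_gb_second: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of a single invocation from memory size and execution time."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second


def cheapest_memory(measurements: dict[int, float]) -> int:
    """Given {memory_mb: average duration_ms}, return the cheapest memory setting."""
    return min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))


# Hypothetical results from running the same function at three memory sizes.
measured = {512: 800.0, 1024: 300.0, 2048: 250.0}
best = cheapest_memory(measured)
```

With these made-up numbers the 1,024 MB setting wins: doubling the memory from 512 MB more than halves the duration, while going to 2,048 MB barely speeds things up and so costs more.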
Let’s say that we ran a test and found a problem. To solve it, we need some way to correlate all the different components in a highly distributed environment, which means getting a more complete picture of that environment. This is where distributed tracing can help us. One way is to trace manually: add a trace ID to each component and collect the resulting spans in a distributed tracing engine.
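Here is a minimal sketch of what manual tracing can look like, assuming each component receives its input as a dictionary-style event. The component names, the `trace_id` key and the in-memory `collected_spans` list are all hypothetical; a real setup would send the spans to a tracing backend instead of a list.

```python
import functools
import time
import uuid
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    component: str
    start: float
    end: float


# Stand-in for a tracing backend: spans are simply appended to a list.
collected_spans: list[Span] = []


def traced(component: str):
    """Decorator that reuses the incoming trace ID (or creates one) and records a span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(event: dict):
            trace_id = event.get("trace_id") or uuid.uuid4().hex
            event = {**event, "trace_id": trace_id}
            start = time.time()
            try:
                return fn(event)
            finally:
                collected_spans.append(Span(trace_id, component, start, time.time()))
        return wrapper
    return decorator


@traced("order-api")
def handle_order(event: dict) -> dict:
    # Pass the trace ID along so the downstream component joins the same trace.
    return process_payment({"order_id": event["order_id"], "trace_id": event["trace_id"]})


@traced("payment-worker")
def process_payment(event: dict) -> dict:
    return {"status": "paid", "trace_id": event["trace_id"]}
```

Calling `handle_order({"order_id": 1})` records two spans that share one trace ID, which is what lets you stitch the calls back together when something fails.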
Another alternative is to use a commercial third-party tool (Lumigo.io is one of them) that can give you an immediate overview of your environment and help you identify the root cause more quickly.