Many articles have been written about AWS Step Functions since it was first introduced in 2016. Most of them create the impression that the service is simply an extension of the Lambda function that allows us to stitch together multiple Lambda functions to call each other.
But actually, it is much more than that. Step Functions allows us to design and build the flow of execution of AWS serverless modules in our application in a simplified manner. This lets a developer focus solely on ensuring that each module performs its intended task, without having to worry about connecting each module with the others.
In this article, we will explore the what, the how and the why of Step Functions, before walking through some use cases, limitations and best practices around using the service.
AWS Step Functions is an orchestrator that helps to design and implement complex workflows. When we need to build a workflow or have multiple tasks that need orchestration, Step Functions coordinates between those tasks. This makes it simple to build multi-step systems.
Step Functions is built on two main concepts: tasks and state machines.
All work in the state machine is done by tasks. A task performs work by using an activity or an AWS Lambda function, or passing parameters to the API actions of other services.
A state machine is defined using the JSON-based Amazon States Language. When an AWS Step Functions state machine is created, it stitches the components together and shows the developers their system and how it is being configured. Have a look at a simple example:
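As a rough sketch of what such a definition looks like, the Python snippet below builds a minimal two-state machine as a dictionary and serializes it to Amazon States Language JSON. The state names and the Lambda ARN are hypothetical placeholders, not taken from a real account.

```python
import json

# Hypothetical state names and Lambda ARN, used only to show the overall shape.
definition = {
    "Comment": "A minimal two-state workflow",
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",
            "Next": "Done",
        },
        # A Succeed state ends the execution successfully.
        "Done": {"Type": "Succeed"},
    },
}

# The JSON text is what would be supplied as the state machine definition.
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

Every state is reachable through "StartAt" or a "Next" field, which is what lets the service draw the workflow for us.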
Can you imagine if you had to do it yourself using a Messaging Queue, Istio or App Mesh? It would be a big task, and that’s without considering the overhead of actually maintaining that component.
It’s really great to see what features it provides out of the box. However, it would have been even better if AWS had added the ability to design it visually rather than through JSON.
As discussed earlier, the state machine is a core component of the AWS Step Functions service. It defines communication between states and how data is passed from one state to another.
A state is referred to by its name, which can be any string but must be unique within the scope of the entire state machine. It does the following functions:
Here is an example of a state definition for the Task type:
"States": {
"FirstState": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-southeast-2:710187714096:function:DivideNumbers",
"Next": "ChoiceState"
}
}
For Step Functions, input is always passed to the first state as JSON text. Within each state, that input then passes through InputPath, ResultPath and OutputPath before the state's output is produced. The JSON output is then passed to the next state.
InputPath – selects which parts of the JSON input to pass to the task of the Task state (for example, an AWS Lambda function).
ResultPath then selects what combination of the state input and the task result to pass to the output.
OutputPath can filter the JSON output to further limit the information that’s passed to the output.
Let’s take a look at an example to better understand this in detail:
For Lambda execution, Input is described as JSON like above. That input is bound to the symbol $ and passed on as the input to the first state in the state machine.
By default, the output of each state is bound to $ and becomes the input of the next state. In each state, the InputPath, ResultPath and OutputPath attributes filter the input and shape the final output. In the above scenario, the “ExamResults” state selects the “lambda” node, appends the result of the state execution to the “results” node, and outputs just the “results” node rather than the whole JSON object:
Hence, the final output will be:
{ "math": 80, "eng": 93, "total": 173 }
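To make the three filters more concrete, here is a small Python simulation of how a Task state applies them. It is only a sketch: it handles simple "$.a.b" paths, the input document and the add_total task are hypothetical stand-ins for the “ExamResults” Lambda function, and the path values are assumptions chosen to reproduce the output above.

```python
import copy

def select(doc, path):
    """Resolve a simple '$.a.b' path (no wildcards or array indexes)."""
    node = doc
    for key in path.split(".")[1:]:
        node = node[key]
    return node

def merge_result(doc, path, result):
    """Return a copy of doc with result stored at path, as ResultPath does."""
    if path == "$":
        return result  # the result replaces the whole state input
    merged = copy.deepcopy(doc)
    keys = path.split(".")[1:]
    node = merged
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return merged

def run_task_state(state_input, task, input_path="$", result_path="$", output_path="$"):
    task_input = select(state_input, input_path)                  # InputPath
    result = task(task_input)                                     # the task itself
    combined = merge_result(state_input, result_path, result)     # ResultPath
    return select(combined, output_path)                          # OutputPath

def add_total(scores):
    """Hypothetical task: add a total to the scores it receives."""
    return {**scores, "total": scores["math"] + scores["eng"]}

# Hypothetical state input; only the "lambda" node is sent to the task.
state_input = {"comment": "exam run", "lambda": {"math": 80, "eng": 93}}
output = run_task_state(state_input, add_total, "$.lambda", "$.results", "$.results")
print(output)  # {'math': 80, 'eng': 93, 'total': 173}
```

Note that ResultPath merges the task result into the original state input, not into the InputPath selection, which is why OutputPath can still reach any node of the original document.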
Step Functions can be triggered in four ways:
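However an execution is triggered, it ultimately begins with a StartExecution call, which can also be made directly from the SDK. The sketch below assumes a client that behaves like boto3.client("stepfunctions"); the function name start_workflow and the "run-" naming scheme are hypothetical.

```python
import json
import uuid

def start_workflow(sfn_client, state_machine_arn, payload):
    """Start one execution and return its ARN.

    sfn_client is assumed to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Execution names must be unique per state machine, so we append a UUID.
        name=f"run-{uuid.uuid4()}",
        # The execution input must be JSON text, not a Python dict.
        input=json.dumps(payload),
    )
    return response["executionArn"]
```

Passing the client in as a parameter keeps the function easy to test with a stub instead of a real AWS account.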
As I mentioned earlier, Step Functions is not only about Lambda functions. It also integrates with several other services, such as SQS, DynamoDB, SNS, ECS, and many others.
There are many use cases that can be resolved using Step Functions. However, we’ll restrict ourselves to a few major ones here:
If you have many batch jobs that must be processed sequentially and need to coordinate the data between them, Step Functions is the best solution. For example, an e-commerce website can first read the product data; the next job finds out which products will run out of stock soon; and the third job sends a notification to all the vendors to expedite the supply process.
If a workflow needs manual approval or intervention, AWS Step Functions would be the best solution to coordinate it. Take an employee promotion process, which needs approval from the manager. The state machine can send an email through Amazon SES with Approve and Reject links and, once it receives the response, trigger the next action using Lambda or ECS jobs.
AWS Step Functions can help decide how best to process data. Based on the file size, you can choose Lambda, ECS or an on-premises activity to optimize both cost and runtime.
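A size-based decision like this maps naturally onto a Choice state. The sketch below builds one in Python and mimics how its rules are evaluated; the thresholds, the "$.fileSizeMb" field and the target state names are all hypothetical, and only the NumericLessThan comparison is handled.

```python
# Hypothetical size thresholds (in megabytes), chosen only for illustration.
SMALL_FILE_MB = 50
MEDIUM_FILE_MB = 1024

route_by_size = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.fileSizeMb", "NumericLessThan": SMALL_FILE_MB, "Next": "ProcessWithLambda"},
        {"Variable": "$.fileSizeMb", "NumericLessThan": MEDIUM_FILE_MB, "Next": "ProcessWithEcs"},
    ],
    # Taken when no rule matches.
    "Default": "ProcessOnPremises",
}

def pick_next_state(choice_state, state_input):
    """Mimic Choice evaluation: rules are checked in order and the first match wins."""
    for rule in choice_state["Choices"]:
        field = rule["Variable"][2:]  # strip the leading "$."
        if state_input[field] < rule["NumericLessThan"]:
            return rule["Next"]
    return choice_state["Default"]

print(pick_next_state(route_by_size, {"fileSizeMb": 200}))  # ProcessWithEcs
```

Because the rules are evaluated in order, the smallest threshold has to come first.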
"Retry": [ {
"ErrorEquals": [ "States.Timeout" ],
"IntervalSeconds": 3,
"MaxAttempts": 2,
"BackoffRate": 1.5
} ]
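To see what this retry policy means in practice, the short Python sketch below computes the wait before each retry attempt. The helper name retry_delays is hypothetical; its defaults mirror the Amazon States Language defaults for IntervalSeconds, MaxAttempts and BackoffRate.

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0):
    """Return the wait, in seconds, before each retry attempt.

    The first retry waits interval_seconds; every later wait is the
    previous one multiplied by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]

# Values from the Retry block above: the task is retried after 3 s, then after 4.5 s.
print(retry_delays(interval_seconds=3, max_attempts=2, backoff_rate=1.5))  # [3.0, 4.5]
```

If the task still times out after the second retry, the state fails unless a Catch block handles the error.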
It can also catch Lambda service exceptions (Lambda.ServiceException) and even unhandled errors (Lambda.Unknown). A typical example of error handling:
"Catch": [ {
"ErrorEquals": [ "States.TaskFailed", "States.Permissions" ],
"Next": "state x"
} ]
You can bet that it was never this easy to implement error handling like this with any other workflow solution.
Despite all the powerful features Step Functions offers, there are still a few things missing:
"ExamResults": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloFunction",
"TimeoutSeconds": 200,
"HeartbeatSeconds": 30,
"End": true
}
Let’s take an example where HeartbeatSeconds value is 30 seconds and TimeoutSeconds is 400 seconds for a long activity worker process.
When the state machine and activity worker process starts, the execution pauses at the activity task state and waits for your activity worker to poll for a task. Once a taskToken is provided to your activity worker, your workflow will wait for SendTaskSuccess or SendTaskFailure to provide a status. If the execution doesn’t receive either of these or a SendTaskHeartbeat call before the time configured in TimeoutSeconds, the execution will fail and the execution history will contain an ExecutionTimedOut event. So, by configuring these, we can design a long running workflow effectively.
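The worker side of this handshake can be sketched in Python as follows. The sketch assumes a client that behaves like boto3.client("stepfunctions"); the function name run_activity_task, the worker name and the idea of splitting the work into a list of step callables are hypothetical choices made for illustration.

```python
import json

def run_activity_task(sfn_client, activity_arn, work_steps, worker_name="example-worker"):
    """Poll for one activity task, heartbeat between steps, then report the outcome.

    sfn_client is assumed to behave like boto3.client("stepfunctions").
    work_steps is a hypothetical list of callables, each taking and returning a dict.
    Returns False if the poll came back without a task, True otherwise.
    """
    task = sfn_client.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended without any scheduled task

    try:
        data = json.loads(task["input"])
        for step in work_steps:
            data = step(data)
            # Tell Step Functions the worker is still alive; this resets
            # the HeartbeatSeconds clock for the task.
            sfn_client.send_task_heartbeat(taskToken=token)
        sfn_client.send_task_success(taskToken=token, output=json.dumps(data))
    except Exception as exc:
        # Report the failure so the workflow does not wait for a timeout.
        sfn_client.send_task_failure(taskToken=token, error=type(exc).__name__, cause=str(exc))
    return True
```

Each step should finish well within HeartbeatSeconds, otherwise the task times out before the next heartbeat is sent.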
Similar to Lambda functions, Step Functions also sends logs to CloudWatch and generates several metrics, for example execution metrics, activity metrics and Lambda metrics. Below is an example of Execution Metrics:
The Visual Workflow panel shows the current status of the execution. We can see the details of any state, shown on the right side of the panel, by clicking on the state in the workflow diagram. It shows the input, output, and exception (if any) for the state.
It also logs the execution history and status for each state. The AWS Console provides a nice visual of the states from start to end. We can also click on CloudWatch Logs to go to the log groups and see detailed logs.
One recommendation is to create a unique trace ID and pass it to all the integration services these states connect to. It will help to track transactions easily.
It also has integration with CloudTrail to log the events.
In this article, we explored the basic concepts of Step Functions and how it works. We also talked about how the Visualization panel, error handling and retry features make the workflow creation process much smoother. Step Functions should properly be described as state-as-a-service. Without it, we would have to maintain the state of each execution across multiple Lambda functions and activities ourselves.
Just keep in mind that you need to keep a watch on your bills as it can burn a hole in your pocket very fast. And the best way to do that is to ensure that proper monitoring and metrics are in place.