Does Step Functions' new TestState API make end-to-end tests obsolete?

Yan Cui , Jan 15 2024

Step Functions added support for testing individual states [1] with the new TestState API [2], which lets you execute an individual state by providing the following:

the state definition
an input
an IAM role

And returns the following:

the output of the state
the status — whether it succeeded, errored, or caught an error
the next state in the execution
the error and cause (where applicable)
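
As a rough sketch of what that response looks like in code, the helper below collects the four parts listed above into one object. The name summariseTestState is hypothetical (it is not part of the AWS SDK), and the sketch assumes that output, when present, arrives as a JSON string.

```javascript
// Hypothetical helper (not part of the AWS SDK): collect the parts of a
// TestState response that the list above describes.
const summariseTestState = (resp) => ({
  // assumption: `output` is a JSON string when the state produced output
  output: resp.output === undefined ? undefined : JSON.parse(resp.output),
  // e.g. 'SUCCEEDED' or 'CAUGHT_ERROR', the two values asserted later in this post
  status: resp.status,
  // undefined when the state ends the execution
  nextState: resp.nextState,
  // only populated where applicable
  error: resp.error,
  cause: resp.cause
})
```

Keeping the parsing in one place means individual tests can assert on plain objects instead of JSON strings.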

With the TestState API, you can thoroughly test every state and achieve close to 100% coverage of a state machine.

So, does this eliminate the need for Step Functions Local [3]?

Can we do away with end-to-end tests as well?

If not, where should this new API fit into your workflow, and how should you use it?

What problems does the TestState API solve?

As I have described before [4], I use a combination of methods to test Step Functions:

Component tests for individual Lambda functions.
End-to-end tests for most execution paths.
Step Functions Local for hard-to-reach execution paths (using mocks to direct the execution to the target branches).

The TestState API lets you test these hard-to-reach states directly. It should help you improve your state machine’s test coverage with less effort.

However, it’s worth remembering that it’s not a local simulation tool. In most cases, it wouldn’t help you improve the speed of your feedback loop.

For example, if you’re testing a Lambda-based Task state, then the referenced Lambda function and the relevant IAM role must be deployed first. Similarly, after you change the Lambda function, you have to deploy the change before you can test the state.

Another good use case for the TestState API is testing input and output processing logic [5]. This includes modifying the current input with the Pass state's Result field.

Because the TestState API takes the state definition as an argument, you do not have to redeploy the state machine after every change. Instead, you can iterate and test your settings by passing the modified state definition to the TestState API.
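
To illustrate that iteration loop, here is a minimal sketch that applies a local edit to a state and builds the request parameters for the TestState API without touching the deployed state machine. The helper name buildTestStateParams, the Pass state, the input, and the role ARN are all made up for illustration.

```javascript
// Hypothetical helper: merge local edits into a state definition and build
// the parameters for a TestState call. Nothing is deployed or sent here.
const buildTestStateParams = (state, changes, input, roleArn) => ({
  // the API takes the state definition and the input as JSON strings
  definition: JSON.stringify({ ...state, ...changes }),
  input: JSON.stringify(input),
  roleArn
})

// Made-up Pass state that injects a Result under $.discount
const passState = {
  Type: 'Pass',
  Result: { percent: 10 },
  ResultPath: '$.discount',
  End: true
}

// Try a different Result locally, without redeploying the state machine
const params = buildTestStateParams(
  passState,
  { Result: { percent: 25 } },
  { orderId: 'order-1' },
  'arn:aws:iam::123456789012:role/hypothetical-test-role' // placeholder ARN
)
```

The resulting params object has the same shape as the one passed to TestStateCommand in the when module later in this post.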

How to use the TestState API

For example, take the Task 2 state from the imaginary state machine above:

Task 2:
  Type: Task
  Resource: !GetAtt task2.Arn
  Catch:
    - ErrorEquals: [ "States.ALL" ]
      Next: Task 3
  End: true

We can write tests to make sure that:

In the happy path, the execution succeeds and there is no nextState.
In the error case, the execution errs, but the error is caught, and the execution should proceed to the Task 3 state.

We need a way to fetch the definition of our state machine and the IAM role we should use. I like to encapsulate this into a given module, like this:


const { SFNClient, DescribeStateMachineCommand } = require("@aws-sdk/client-sfn")
const client = new SFNClient()

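// Fetch the deployed state machine's definition and execution role
const a_state_machine = async (stateMachineArn) => {
  const command = new DescribeStateMachineCommand({ 
    stateMachineArn
  })
  const resp = await client.send(command)
  
  return {
    // the definition is returned as a JSON string, so parse it into an object
    definition: JSON.parse(resp.definition),
    roleArn: resp.roleArn
  }
}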

module.exports = {
  a_state_machine
}

We also need a way to call the TestState API with our state definition and input. I like to encapsulate this into a when module:


const { SFNClient, TestStateCommand } = require("@aws-sdk/client-sfn")
const client = new SFNClient()

const we_invoke_a_state = async (state, input, roleArn) => {
  const command = new TestStateCommand({ 
    definition: JSON.stringify(state),
    input: JSON.stringify(input),
    roleArn
  })

  const response = await client.send(command)
  return response
}

module.exports = {
  we_invoke_a_state
}

This keeps my test code simple and easy to read.


require('../steps/init')
const given = require('../steps/given')
const when = require('../steps/when')

describe('When the task errors', () => {
  it('"Task 3" should be the next state', async () => {
    const { definition, roleArn } = await given.a_state_machine(process.env.MyStateMachineArn)
    const task2 = definition.States['Task 2']
    const resp = await when.we_invoke_a_state(task2, { ErrorProbability: 1 }, roleArn)
    expect(resp.status).toEqual('CAUGHT_ERROR')
    expect(resp.nextState).toEqual('Task 3')
  })
})

describe('When the task succeeds', () => {
  it('The execution should end', async () => {
    const { definition, roleArn } = await given.a_state_machine(process.env.MyStateMachineArn)
    const task2 = definition.States['Task 2']
    const resp = await when.we_invoke_a_state(task2, { ErrorProbability: 0 }, roleArn)
    expect(resp.nextState).toBeUndefined()
    expect(resp.status).toEqual('SUCCEEDED')
  })
})

(You can try out this demo project here [6])

I can write tests like this for every state in the state machine and cover every scenario.
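
One way to scale this to many states is a table of cases. The sketch below is only an illustration: runStateCases and task2Cases are hypothetical names, and invoke is assumed to have the same shape as when.we_invoke_a_state(state, input, roleArn) above.

```javascript
// Hypothetical table-driven runner. `invoke` is assumed to behave like
// when.we_invoke_a_state(state, input, roleArn) from the when module.
const runStateCases = async (definition, roleArn, cases, invoke) => {
  const failures = []
  for (const { state, input, status, nextState } of cases) {
    const resp = await invoke(definition.States[state], input, roleArn)
    // record every case whose status or next state differs from expectations
    if (resp.status !== status || resp.nextState !== nextState) {
      failures.push({
        state,
        input,
        expected: { status, nextState },
        actual: { status: resp.status, nextState: resp.nextState }
      })
    }
  }
  return failures
}

// The two "Task 2" scenarios from the tests above, expressed as data
const task2Cases = [
  { state: 'Task 2', input: { ErrorProbability: 1 }, status: 'CAUGHT_ERROR', nextState: 'Task 3' },
  { state: 'Task 2', input: { ErrorProbability: 0 }, status: 'SUCCEEDED', nextState: undefined }
]
```

A test could then assert that runStateCases returns an empty array, and covering another state becomes a matter of adding rows.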

However, as I mentioned before, both the Lambda function (used by the Task state) and the IAM role must be deployed first. So your typical workflow would be as follows:

Work on the state machine design.
Implement the Lambda functions.
Deploy the project, including the state machine, Lambda functions, IAM roles, etc.
Run tests against individual states.

As you iterate on your state definitions and Lambda functions, how do you maintain a fast feedback loop? Can you avoid redeploying the project every time you make a change?

Yes, you can. That’s why we need a full suite of different tests.

Do we still need component tests?

Yes, you should still perform component-level testing on the Lambda functions involved.

Use “remocal testing” (i.e. execute the Lambda function code locally against remote AWS resources) to maintain a fast feedback loop as you iterate on your Lambda function.

As you iterate on your Lambda function, you can run these tests and execute the latest code locally. Because the code is executed locally, you don't need to deploy it to the Lambda service.
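
As a very rough sketch of the local half of this idea, the snippet below calls a handler in-process instead of through the Lambda service. The handler is a made-up stand-in: I am assuming it fails with the probability given in ErrorProbability, like the input used in the tests above. It touches no AWS resources, whereas a real remocal test would let the handler talk to the deployed resources it depends on.

```javascript
// Made-up stand-in for a Lambda handler. Assumption: it throws with the
// probability given in event.ErrorProbability (0 = never, 1 = always).
const handler = async (event) => {
  if (Math.random() < event.ErrorProbability) {
    throw new Error('boom')
  }
  return { ok: true }
}

// Call the handler in-process, as a local test would, and capture either
// the returned value or the thrown error.
const invokeLocally = async (fn, event) => {
  try {
    return { result: await fn(event) }
  } catch (error) {
    return { error }
  }
}
```

In a real project the handler would be required from the function's source file, so every run exercises the latest code on disk.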

But a Task state is more than just the Lambda function. There are input and output processing and error handling settings as well.

The TestState API helps you test these settings as seen in the example above.

Do we still need Step Functions Local?

Step Functions Local was best used to test execution paths that are difficult to reach, thanks to its mocking capability.

The ability to test individual states means this is no longer necessary.

Another potential use for Step Functions Local is to iterate on your state machine locally without redeploying the project.

Unfortunately, this doesn’t work very well in practice.

Your state machine likely depends on Lambda functions, SNS topics, and other AWS resources. So you have to either provide a full simulation of all these resources (e.g. by running LocalStack [7]) or you still have to deploy your project first.

The same dynamic still exists with the TestState API.

But no, you don’t need to use Step Functions Local anymore.

Do we still need end-to-end tests?

End-to-end tests execute the state machine in the cloud and make sure everything works together. Before the TestState API, end-to-end tests played an important role in my test strategy.

They were the workhorse in my test suites.

From a test coverage perspective, you no longer need end-to-end tests. You can achieve better test coverage with less effort by testing individual states with the TestState API.

However, it’s easy to lose sight of the forest when looking only at the individual trees.

I think there is still value in having end-to-end tests for business-critical execution paths. This is to ensure that all the individual states do indeed function together as a unit.

In a state machine, data flows from one state to the next. You need to make sure that if you change the output from Task #1 (see below) then you also change the conditions in Choice #2.

It’s easy to break the contract between Task #1 and Choice #2 when testing them separately.

This is similar to the kind of integration problems that you often face in a microservices environment. In the context of a state machine, end-to-end tests can help you catch these “integration” problems early.
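
For completeness, here is a minimal sketch of the waiting part of an end-to-end test: start an execution, then poll until it leaves the RUNNING status. The helper name waitForExecution and its options are hypothetical, and describe is an injected function that I assume resolves to an object with a status field (for example by sending a DescribeExecutionCommand).

```javascript
// Hypothetical polling helper for end-to-end tests. `describe(executionArn)`
// is assumed to resolve to an object with a `status` field.
const waitForExecution = async (
  describe,
  executionArn,
  { intervalMs = 1000, maxAttempts = 30 } = {}
) => {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const execution = await describe(executionArn)
    // anything other than RUNNING is treated as finished
    if (execution.status !== 'RUNNING') {
      return execution
    }
    // wait before polling again
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error(`Execution ${executionArn} still running after ${maxAttempts} attempts`)
}
```

A test for a critical path could start the execution, wait with this helper, and then assert on the final status and output.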

Summary

To summarise:

The new TestState API is awesome! You can use it to achieve nearly 100% test coverage of your state machines.
Because the business logic of a state machine is often split across Lambda functions and state definitions, you should still have tests for Lambda functions.
You should use “remocal tests” for Lambda functions to help you maintain a fast feedback loop.
Because the TestState API invokes the remote resources referenced by the Task state, you still have to deploy the project first.
You don’t need to use Step Functions Local anymore.
There is still value in end-to-end tests. You should use them to ensure critical business workflows work end-to-end.