AWS Step Functions are difficult to test. I found an approach to testing that helped in one particular scenario. It relies on using the same programming language for both testing and infrastructure-as-code (IaC). I hope the idea helps others tame their Step Functions into a testable submission.
I’m most familiar with unit testing. I’m used to lifting some small component from my system and isolating it in a different runtime environment. There, I poke it with my special inputs, and prod at the results until I’m satisfied it behaves how I want. By isolating it, I gain confidence it serves its purpose when it is executed as part of the system.
This familiar technique of decomposing a program into smaller units to instantiate and test individually is not an option with Step Functions. The definition of a state machine is one atomic unit. I found no way to take an individual state, and poke and prod it in isolation. If I want to test a Step Function, I deploy it to my developer environment on AWS and execute the whole thing. Note that this is not the same debate as whether you should test locally or in the cloud. Either way, the execution of a Step Function cannot be decomposed.
If that explanation does not work for you, try this: Step Functions are difficult to test in the same way stored procedures on a database server are hard to test.
I have taken this approach before. As is typical for integration tests, I found them easy to define, but difficult to make deterministic, reliable and fast. There are some nice aspects to the AWS API that help – in particular, retrieving the events in an execution’s history is useful for asserting that particular state transitions were made.
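As a sketch of what that assertion can look like (my own illustration, not code from the original tests), using the aws-sdk v2 Step Functions client:

```ts
// aws-sdk v2 client; the helper name and test are hypothetical
import StepFunctions from "aws-sdk/clients/stepfunctions";

const sfnClient = new StepFunctions();

// the names of all states entered during an execution
const enteredStates = async (executionArn: string): Promise<string[]> => {
  const history = await sfnClient
    .getExecutionHistory({ executionArn, maxResults: 1000 })
    .promise();
  return history.events
    .filter((event) => event.stateEnteredEventDetails !== undefined)
    .map((event) => event.stateEnteredEventDetails!.name);
};

// usage in a test, given an executionArn from a prior startExecution:
//   expect(await enteredStates(executionArn)).toContain("acquirePermit");
```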
However, having to transition through the entire state machine for every test made set-up harder, made each test case slower and less reliable, and made every change to the Step Function definition require a deployment, which stretches out the feedback loop (even with the CDK’s --hotswap option, which made a huge difference).
When I came to introduce a DynamoDB task to a Step Function, I wanted something better.
I had a use case for adding a DynamoDB task to a Step Function. The specifics of why are not too important; what matters is that a single DynamoDB task covers a lot of behaviour in a small task definition. I’ll explain just enough of the use case and the DynamoDB interaction to highlight why testing via Step Function executions causes problems.
My use case was to limit concurrent access to a shared resource. I decided on a semaphore implemented via DynamoDB. Before accessing the shared resource, the Step Function should try to “acquire a permit”, and could not continue if none were available. A condition expression and a counter would model the metaphor of acquiring permits until they’ve all been taken.
The DynamoDB updateItem task in the Step Function task definition contained these expressions:
"acquirePermit": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:updateItem",
"Parameters": {
"ConditionExpression": "(attribute_not_exists(#ownerCount) or #ownerCount < :limit) and attribute_not_exists(#ownerId)",
"UpdateExpression": "SET #ownerCount = if_not_exists(#ownerCount, :initializedOwnerCount) + :increase, #ownerId = :acquiredAt"
}
// several attributes omitted for brevity
}
A brief overview of the desired behaviour:

- When permits are still available, acquirePermit should be successful, the owner count should be incremented, and the ownerId added as a key.
- When no item exists yet, acquirePermit should create an item and initialize the ownerCount as 1.
- When every permit has been taken, the acquirePermit task should fail and leave the item untouched.

There are even more semantics and edge cases not included here, before even getting to the subsequent DynamoDB call to release a permit.
I couldn’t bring myself to write several tests which all execute the Step Function. I want fewer slow and unreliable tests in my life. Getting the conditions correct required upfront thinking, but also some trial and error while I explored and improved the code. The feedback loop had to be tight. I needed to make the task of the next developer to touch this easier, especially if that developer is me.
Instead of testing the Step Function by executing it, I took advantage of using TypeScript for both CDK and integration tests in the same repository. I defined constants in their own file:
```ts
// semaphoreExpressions.ts
export const acquireUpdateExpression =
  "SET #concurrencyLimit = :limit, #ownerCount = if_not_exists(#ownerCount, :initializedOwnerCount) + :increase, #ownerId = :acquiredAt";
export const acquireConditionExpression =
  "(attribute_not_exists(#ownerCount) or #ownerCount < :limit) and attribute_not_exists(#ownerId)";
```
In defining the Step Function construct with CDK, those expressions are only an import away:
```ts
// cdkStack.ts
// (aws-cdk-lib v2 module paths assumed)
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";
import { DynamoReturnValues } from "aws-cdk-lib/aws-stepfunctions-tasks";
import {
  acquireConditionExpression,
  acquireUpdateExpression
} from "./semaphoreExpressions";

const acquirePermit = new tasks.DynamoUpdateItem(this, "acquirePermit", {
  table: myTable,
  updateExpression: acquireUpdateExpression,
  conditionExpression: acquireConditionExpression,
  returnValues: DynamoReturnValues.ALL_NEW,
  resultPath: "$.acquirePermitResult"
  // several attributes omitted for brevity
});

const definition = sfn.Chain.start(
  acquirePermit.next(nextStatesWhichAccessSharedResource));

new sfn.StateMachine(this, "MyStateMachine", { definition });
```
Big whoop, right? I’ve taken a string constant and imported it into another file. Hardly the stuff of Knuth or Dijkstra. So what?
With this simple change, I can use the DynamoDB client to execute the query in a Jest test:
```ts
// dynamoBasedSemaphore.test.ts
// (aws-sdk v2 client assumed)
import { DocumentClient } from "aws-sdk/clients/dynamodb";
import {
  acquireConditionExpression,
  acquireUpdateExpression
} from "./semaphoreExpressions";

const acquire = async (/* params omitted for brevity */) => {
  const documentClient = new DocumentClient();
  const acquirePermitUpdateItem: DocumentClient.UpdateItemInput = {
    TableName: "MyDynamoTableInDevelopmentEnvironment",
    UpdateExpression: acquireUpdateExpression,
    ConditionExpression: acquireConditionExpression,
    ReturnValues: "ALL_NEW"
    // several attributes omitted for brevity
  };
  return documentClient.update(acquirePermitUpdateItem).promise();
};

it("can acquire a permit", async () => {
  const response = await acquire();
  expect(response).toEqual(/* verify successful response */);
  // can fetch the Dynamo item and assert on attributes here if you wish
});
```
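The same acquire helper supports the unhappy path too. This sketch is my own addition, not from the original post; with the aws-sdk v2 DocumentClient, a failed condition rejects with an error whose code is ConditionalCheckFailedException:

```ts
it("fails to acquire a permit once the limit is reached", async () => {
  // assumes set-up has already acquired all available permits for this item
  await expect(acquire()).rejects.toMatchObject({
    code: "ConditionalCheckFailedException"
  });
  // fetching the item here and asserting it is unchanged would cover the
  // "leave the item untouched" behaviour too
});
```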
I run the test from my machine. I can iterate on the query without worrying about deploying or executing the Step Function. Sure, I need a real DynamoDB table, but that is rarely modified.
Hello again, Feedback Loop, my old friend.
Where does this leave me? I now have a Step Function containing a dynamodb:updateItem task, where I have confidence in the behaviour of its query. If I need to make a change, I can get feedback by running an integration test that only depends on DynamoDB and the TypeScript code in my editor. Which is a damn sight easier to make robust, deterministic, and fast. I’ve avoided using a Lambda task and the operational burden it introduces.
I still need to have confidence that my Step Function transitions to and from this state correctly. I resorted to an integration test for the happy path in this case; here, one test case is less bad than several. I could have added a CDK unit test to verify the Step Function construct uses the same condition string, but I felt that was overkill.
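For the curious, such a test is cheap to sketch with the CDK assertions module. This is my own illustration (the CdkStack class name is hypothetical), and it leans on the condition expression containing no CloudFormation tokens, so it should appear verbatim in the synthesized template:

```ts
import { App } from "aws-cdk-lib";
import { Template } from "aws-cdk-lib/assertions";
import { acquireConditionExpression } from "./semaphoreExpressions";
import { CdkStack } from "./cdkStack"; // hypothetical stack class

it("embeds the shared condition expression in the state machine definition", () => {
  const stack = new CdkStack(new App(), "TestStack");
  // serialize the whole template and look for the shared constant
  const templateJson = JSON.stringify(Template.fromStack(stack).toJSON());
  expect(templateJson).toContain(acquireConditionExpression);
});
```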
This approach would not have been as easy without a shared programming language for both tests and infrastructure-as-code. If the Step Function were defined in the YAML of a CloudFormation or SAM template, for example, sharing a constant would have been more effort than just importing across two different TypeScript files.
Step Functions are hard to test. Having a shared programming language across IaC and tests allowed a creative way to gain confidence in my system, with more maintainable tests.
As a parting thought, it would be interesting to approach this testing problem from “above” the Step Function instead of from “below” as I’ve done here. For example, maybe the same DynamoDB behaviour could be written once in a declarative model and used to generate code for local testing, as well as the ASL task definition. When Step Function ASL is generated or transpiled from a different model, there would be lots of clever things you could do.
I found this pattern to play nicely with DynamoDB tasks; what other Step Function task types could it benefit?
AWS Step Functions come in two flavours: Standard and Express. They share a definition language called ASL, but numerous features are different. Standard offers better observability features than Express, which is reflected in its higher cost.
The executions list shows an up-to-date view of the most recent executions of a Step Function. You can see new executions starting, watch their status update, and search by execution name prefix. Searching by prefix is useful if you encode a meaningful domain identifier into the start of each execution name.
Being able to filter by status is useful for investigating several failed executions without maintaining a separate list of execution names. Or if you need to cycle through a batch of failed executions to retry them.
The Execution History is a nicely rendered UI widget showing the events that occurred during an execution: human-readable timestamps, expanding content, and JSON-aware syntax highlighting and formatting. Reading a nicely rendered table beats eyeballing lines of JSON in a log stream.
The coolest feature is the Graph Inspector: a rendering of the Step Function definition, overlaid with details of this specific execution. If you want to find out which branch the execution took, see the input or output of an individual task, or cycle through the outputs of a Map state, this widget will show you.

An execution of a Step Function is tightly coupled to the flow and shaping of data between states and tasks. Since there’s no step-through debugger to replay an execution, the Graph Inspector helps tremendously in understanding what happened after the fact.
With Express, you get CloudWatch Logs. And that’s about it. Yeah, you get what you pay for.
Express is (generally) cheaper than Standard. I expect that’s why those useful features are not available. The execution data would need to be stored and indexed to allow for querying with these access patterns. Removing those capabilities allows customers who don’t need them to run Step Functions at a lower cost. Even creating the logs from Express executions is opt-in, so you can further trade usability for reduced cost.
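In CDK terms, that opt-in looks something like this (a sketch using names of my own choosing):

```ts
import * as logs from "aws-cdk-lib/aws-logs";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";

// inside a Stack or Construct: opt in to execution logs for an Express workflow
const logGroup = new logs.LogGroup(this, "ExpressExecutionLogs");

new sfn.StateMachine(this, "MyExpressStateMachine", {
  definition,
  stateMachineType: sfn.StateMachineType.EXPRESS,
  logs: {
    destination: logGroup,
    level: sfn.LogLevel.ALL,      // log every event type
    includeExecutionData: true    // include input/output payloads
  }
});
```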
Standard vs. Express maybe isn’t the fairest comparison. Aside from sharing a definition language, they are quite different. Perhaps a fairer comparison is Express vs. Lambda. I was being uncharitable when I said Express only produces CloudWatch Logs: there is a pre-configured dashboard on opening the Step Function.
It doesn’t look too dissimilar to what’s provided with a Lambda. There’s not much more available by default with Lambda either: logs, metrics, and traces are opt-in too, either via a configuration flag or by creating logs and metrics explicitly yourself.
That comparison only goes so far, though. The code executing in a Lambda is more easily simulated outside of AWS (e.g. running the same code in a local Node or Java process), so more confidence can be gained in how it behaves. Since Step Functions get their power from subtraction (constraining you to ASL and their SDK), you’re more dependent on what information they expose to be able to make sense of your system. So the comparison with Lambda is perhaps too charitable.
There is an interesting aspect of the CloudWatch logs emitted from Express.
Each log event has:

- an execution_arn property, uniquely identifying this execution
- a type attribute, like TaskStateEntered; the values for type are uniform across all Step Functions and executions
- an id attribute, which is a consecutively incremented integer, allowing for robust in-order processing
- a previous_event_id attribute, which is usually set to id - 1, but is also used to track fan-out states like Map and Parallel

I suspect these properties mean that all the lost features shown above could be reproduced from logs alone, if you were willing to put in the effort. The execution_arn is a great unique identifier to index by for fast lookups. The status of a completed execution can be inferred from type=ExecutionSucceeded and type=ExecutionFailed events, which could also be indexed for querying by status. The id and previous_event_id of the sequence of events make it possible to build a graph data structure that could be rendered atop the state machine definition.
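To make that concrete, here is a sketch of inferring a completed execution’s status from its Express log events. The ExpressLogEvent shape is my assumption, based only on the fields described above:

```ts
// my assumed shape of a parsed Express log event
type ExpressLogEvent = {
  execution_arn: string;
  type: string;               // e.g. "TaskStateEntered", "ExecutionSucceeded"
  id: number;                 // consecutively incremented within an execution
  previous_event_id: number;
};

const executionStatus = (events: ExpressLogEvent[]): string => {
  // id increments consecutively, so sorting by it restores execution order
  const ordered = [...events].sort((a, b) => a.id - b.id);
  const terminal = ordered.find(
    (e) => e.type === "ExecutionSucceeded" || e.type === "ExecutionFailed"
  );
  return terminal ? terminal.type : "RUNNING";
};
```

Status is the easy part; reconstructing the Graph Inspector from id and previous_event_id would be far more involved.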
Clearly a lot of work to develop and operate. Especially for what is likely undifferentiated heavy lifting unrelated to your core business.
I wouldn’t rule out AWS offering these features with Express, à la carte, at some point down the road. A first iteration could be a screen where you enter a log stream ARN (identifying an execution) and the browser fetches the logs and reshapes them for rendering as the Graph Inspector and the Execution History.
AWS Step Functions come in two flavours: Standard and Express. They share a definition language called ASL, but a lot of the API operations are not supported across both types. I didn’t find a good side-by-side comparison showing the support for each individual API operation, so I made one. This could help you decide if your Standard workflows could be replaced by Express workflows.
There is a good side-by-side comparison for an overview of the two types of workflow. Understanding those differences helps to explain why certain operations are not supported.
Takeaways
| Category | Operation | Standard | Express |
|---|---|---|---|
| State Machine | createStateMachine | ✅ | ✅ |
| | updateStateMachine | ✅ | ✅ |
| | deleteStateMachine | ✅ | ✅ |
| | describeStateMachine | ✅ | ✅ |
| | listStateMachines | ✅ | ✅ |
| Execution | startExecution | ✅ | ✅ |
| | startSyncExecution | ❌ | ✅ |
| | stopExecution | ✅ | ❌ |
| | getExecutionHistory | ✅ | ❌ |
| | describeExecution | ✅ | ❌ |
| | listExecutions | ✅ | ❌ |
| | describeStateMachineForExecution | ✅ | ❌ |
| Activities | createActivity | ✅ | ❌ |
| | deleteActivity | ✅ | ❌ |
| | listActivities | ✅ | ❌ |
| | describeActivity | ✅ | ❌ |
| | getActivityTask | ✅ | ❌ |
| Callback | sendTaskFailure | ✅ | ❌ |
| | sendTaskSuccess | ✅ | ❌ |
| | sendTaskHeartbeat | ✅ | ❌ |
| Tags | tagResource | ✅ | ✅ |
| | untagResource | ✅ | ✅ |
| | listTagsForResource | ✅ | ✅ |