AWS Step Functions come in two flavours: Standard and Express. They share a definition language called ASL, but numerous features are different. Standard offers better observability features than Express, which is reflected in its higher cost.
This widget shows an up-to-date view of the most recent executions of a Step Function. You can see new executions starting, see their status update, and search by execution name prefix. Searching by prefix is useful if you encode a meaningful domain identifier into the start of each execution name.
Being able to filter by status is useful for investigating several failed executions without maintaining a separate list of execution names, or for cycling through a batch of failed executions to retry them.
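The same status and prefix filtering can be approximated outside the console. Below is a minimal sketch, assuming `sfn_client` is a boto3 Step Functions client (`boto3.client("stepfunctions")`) and using its `list_executions` call with a `statusFilter`. The function name `failed_executions` is hypothetical, and the prefix match is done client-side on the returned names, which is my assumption and not necessarily how the console does it. As far as I know, this API only works for Standard state machines.

```python
def failed_executions(sfn_client, state_machine_arn, name_prefix=""):
    """Return the names of FAILED executions whose name starts with name_prefix.

    sfn_client is assumed to behave like boto3.client("stepfunctions").
    """
    names = []
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "FAILED"}
    while True:
        page = sfn_client.list_executions(**kwargs)
        # Prefix filtering happens here, on the client side (an assumption).
        names.extend(
            execution["name"]
            for execution in page["executions"]
            if execution["name"].startswith(name_prefix)
        )
        token = page.get("nextToken")
        if not token:
            return names
        # Ask for the next page of results.
        kwargs["nextToken"] = token
```

If execution names start with a domain identifier, something like `failed_executions(client, arn, "order-")` gives you the batch to investigate or retry.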
A nicely rendered UI widget showing events that occurred during an execution. Human-readable timestamps, expanding content and JSON-aware syntax highlighting and formatting. Reading a nicely rendered table beats eyeballing lines of JSON in a log stream.
The coolest feature. A rendering of the Step Function definition, overlaid with details of this specific execution. If you want to find out which branch the execution took, the input or output of an individual task, or cycle through the outputs of a Map state, this widget will show you.
An execution of a Step Function is tightly coupled to the flow and shaping of data between states and tasks. Since there’s no step-through debugger to replay an execution, the Graph Inspector helps tremendously in understanding what happened after the fact.
And that’s about it. Yeah, you get what you pay for.
Express is (generally) cheaper than Standard. I expect that’s why those useful features are not available. The execution data would need to be stored and indexed to allow for querying with these access patterns. Removing those capabilities allows customers who don’t need them to run Step Functions at a lower cost. Even creating the logs from Express executions is opt-in, so you can further trade usability for reduced cost.
Standard vs. Express maybe isn’t the fairest comparison. Aside from sharing a definition language, they are quite different. Perhaps a fairer comparison is Express vs. Lambda. I was being uncharitable when I said Express only produces CloudWatch Logs: there is also a pre-configured dashboard on opening the Step Function.
Doesn’t look too dissimilar to what’s provided with a Lambda. There’s not much more available by default with Lambda either. Logs, metrics and traces are opt-in too, either via a configuration flag or by creating logs and metrics explicitly yourself.
That comparison only goes so far though. The code executing in a Lambda is more easily simulated outside of AWS (e.g. running the same code in a local Node or Java process) so more confidence can be gained on how it behaves. Since Step Functions get their power from subtraction (constraining to ASL and their SDK) you’re more dependent on what information they expose to be able to make sense of your system. So the comparison with Lambda is perhaps too charitable.
There is an interesting aspect of the CloudWatch logs emitted from Express.
Each log event has:
- an execution_arn property, uniquely identifying this execution
- a type attribute like TaskStateEntered
  - the values for type are uniform across all Step Functions and executions
- an id attribute, which is a consecutively incremented integer, allowing for robust in-order processing
- a previous_event_id attribute, which is usually set to id - 1, but is also used to track fan-out states like Map and Parallel

I suspect these properties mean that all the lost features shown above could be reproduced from logs alone. If you were willing to put in the effort. The execution_arn is a great unique identifier to index by for fast lookups. The status of a completed execution can be inferred from type=ExecutionSucceeded and type=ExecutionFailed events, which also could be indexed for querying by status. The id and previous_event_id of the sequence of events make it possible to build a graph data structure that could be rendered atop the state machine definition.
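To make that concrete, here is a rough sketch of the three ideas: index by execution_arn, infer status from type, and link events via previous_event_id. It assumes the log events have already been fetched and parsed into Python dicts, and that id and previous_event_id may arrive as strings (hence the `int()` conversions). The function names are hypothetical, and only the two terminal types mentioned above are handled.

```python
from collections import defaultdict

# Only the terminal event types discussed in the text; others are not handled.
TERMINAL_TYPES = {
    "ExecutionSucceeded": "SUCCEEDED",
    "ExecutionFailed": "FAILED",
}


def index_events(log_events):
    """Group events by execution_arn, each group sorted by numeric id."""
    by_execution = defaultdict(list)
    for event in log_events:
        by_execution[event["execution_arn"]].append(event)
    for events in by_execution.values():
        events.sort(key=lambda e: int(e["id"]))
    return dict(by_execution)


def infer_status(events):
    """Return the status implied by a terminal event, else "UNKNOWN"."""
    for event in events:
        status = TERMINAL_TYPES.get(event["type"])
        if status is not None:
            return status
    return "UNKNOWN"  # still running, or ended in a way not covered here


def build_graph(events):
    """Map each previous_event_id to the ids of the events that follow it."""
    children = defaultdict(list)
    for event in events:
        children[int(event["previous_event_id"])].append(int(event["id"]))
    return dict(children)
```

An id with more than one child in the graph is where an execution fanned out. Mapping those ids back onto states in the definition is the part that would take the real effort.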
Clearly a lot of work to develop and operate. Especially for what is likely undifferentiated heavy lifting unrelated to your core business.
I wouldn’t rule out AWS offering these features with Express, a la carte, at some point down the road. A first iteration could be a screen where you enter a log stream ARN (identifying an execution) and the browser fetches the logs and reshapes them for rendering as the Graph Inspector and the Execution History.
ABOUT GRUNDLEFLECK
Graham "Grundlefleck" Allan is a Software Developer living in Scotland. His only credentials as an authority on software are that he has a beard. Most of the time.