Grundlefleck's Blog - AWS Step Functions: Observability features not available to Express

AWS Step Functions come in two flavours: Standard and Express. They share a definition language called ASL, but numerous features are different. Standard offers better observability features than Express, which is reflected in its higher cost.

Takeaways

Standard has useful features in the AWS web console which are not available with Express
CloudWatch Logs from Express expose the same raw data used by Standard-only features
With Express, you’re on your own in finding ways to use those logs in a similar way

Observability Features available only to Standard

Query executions, filtering by status

This widget shows an up-to-date view of the most recent executions of a Step Function. You can see new executions starting, see their status update, and search by execution name prefix. Searching by prefix is useful if you encode a meaningful domain identifier into the start of each execution name.
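To make the prefix idea concrete, here is a minimal sketch of building an execution name that starts with a domain identifier. The function name, the choice of a random suffix, and the conservative character whitelist are my own assumptions for illustration; the 80-character cap reflects the documented limit on execution names.

```python
import re
import uuid
from typing import Optional

# Step Functions caps execution names at 80 characters.
MAX_NAME_LENGTH = 80


def execution_name(domain_id: str, suffix: Optional[str] = None) -> str:
    """Build an execution name that starts with a searchable domain identifier.

    Hypothetical helper: the naming scheme is an illustration, not an AWS API.
    """
    # Conservative whitelist: replace anything outside letters, digits, "_" and "-".
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", domain_id)
    # A random suffix keeps names unique when the same identifier is reused.
    suffix = suffix or uuid.uuid4().hex[:12]
    # Truncate the identifier rather than the suffix, so uniqueness survives.
    room = MAX_NAME_LENGTH - len(suffix) - 1
    return f"{safe_id[:room]}-{suffix}"
```

With something like this, searching the console for the prefix "order_123" would surface every execution started for that order.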

Being able to filter by status is useful for investigating several failed executions without maintaining a separate list of execution names. It also helps when you need to cycle through a batch of failed executions to retry them.
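The same filtering is available outside the console for Standard workflows through the ListExecutions API and its statusFilter parameter. Below is a rough sketch of cycling through failed executions. It assumes you pass in a boto3 Step Functions client; the function name is hypothetical, and what you do with each execution (inspect, retry) is left out.

```python
from typing import Any, Dict, Iterator


def executions_with_status(
    client: Any, state_machine_arn: str, status: str = "FAILED"
) -> Iterator[Dict[str, Any]]:
    """Yield executions of a Standard state machine that have the given status.

    `client` is assumed to behave like boto3.client("stepfunctions").
    """
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": status}
    while True:
        page = client.list_executions(**kwargs)
        yield from page["executions"]
        # Follow the pagination token until the service stops returning one.
        token = page.get("nextToken")
        if not token:
            return
        kwargs["nextToken"] = token


# Usage sketch (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# for execution in executions_with_status(boto3.client("stepfunctions"), arn):
#     print(execution["name"], execution["status"])
```

As far as I know, ListExecutions is not supported for Express state machines, which is the API-level version of the gap described in this post.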

Execution history

A nicely rendered UI widget showing the events that occurred during an execution, with human-readable timestamps, expanding content, and JSON-aware syntax highlighting and formatting. Reading a nicely rendered table beats eyeballing lines of JSON in a log stream.

Graph inspector

The coolest feature. A rendering of the Step Function definition, overlaid with details of this specific execution. If you want to find out which branch the execution took, the input or output of an individual task, or cycle through the outputs of a Map state, this widget will show you.

An execution of a Step Function is tightly coupled to the flow and shaping of data between states and tasks. Since there’s no step-through debugger to replay an execution, the Graph Inspector helps tremendously in understanding what happened after the fact.

Observability Features available to Express

CloudWatch Logs

And that’s about it. Yeah, you get what you pay for.

Express is (generally) cheaper than Standard. I expect that’s why those useful features are not available. The execution data would need to be stored and indexed to allow for querying with these access patterns. Removing those capabilities allows customers who don’t need them to run Step Functions at a lower cost. Even creating the logs from Express executions is opt-in, so you can further trade usability for reduced cost.

Standard vs. Express maybe isn’t the fairest comparison. Aside from sharing a definition language, they are quite different. Perhaps a fairer comparison is Express vs. Lambda. I was being uncharitable when I said Express only produces CloudWatch Logs: there is also a pre-configured dashboard shown on opening the Step Function.

It doesn’t look too dissimilar to what’s provided with a Lambda. There’s not much more available by default with Lambda either. Logs, metrics and traces are opt-in too, either via a configuration flag or by creating logs and metrics explicitly yourself.

That comparison only goes so far though. The code executing in a Lambda is more easily simulated outside of AWS (e.g. running the same code in a local Node or Java process) so more confidence can be gained on how it behaves. Since Step Functions get their power from subtraction (constraining to ASL and their SDK) you’re more dependent on what information they expose to be able to make sense of your system. So the comparison with Lambda is perhaps too charitable.

An Interesting Aside on the Structure of Express Logs

There is an interesting aspect of the CloudWatch logs emitted from Express.

Each log event:

contains an execution_arn property, uniquely identifying this execution
has a type attribute like TaskStateEntered; the values for type are uniform across all Step Functions and executions
has an id attribute, which is a consecutively incremented integer, allowing for robust in-order processing
has a previous_event_id attribute, which is usually set to id - 1, but is also used to track fan-out states like Map and Parallel
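As a rough illustration of how those attributes support in-order processing, the sketch below groups raw log lines by execution_arn and orders each group by id. It assumes each log line is a single JSON object carrying the attributes listed above; the function name and the sample events in the usage note are made up for illustration.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def events_by_execution(log_lines: Iterable[str]) -> Dict[str, List[Dict[str, Any]]]:
    """Group parsed log events by execution_arn, each group ordered by id."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for line in log_lines:
        event = json.loads(line)
        grouped[event["execution_arn"]].append(event)
    for events in grouped.values():
        # id may arrive as a string, so convert before comparing ("10" < "9" as text).
        events.sort(key=lambda e: int(e["id"]))
    return dict(grouped)
```

Feeding it interleaved lines from two executions, for example hypothetical events with ids "2" and "1" for one ARN and "1" for another, gives back one ordered list per ARN regardless of arrival order.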

I suspect these properties mean that all the lost features shown above could be reproduced from logs alone, if you were willing to put in the effort. The execution_arn is a great unique identifier to index by for fast lookups. The status of a completed execution can be inferred from type=ExecutionSucceeded and type=ExecutionFailed events, which could also be indexed for querying by status. The id and previous_event_id of the sequence of events make it possible to build a graph data structure that could be rendered atop the state machine definition.
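Here is a minimal sketch of two of those ideas: inferring a status from the terminal event, and turning id / previous_event_id into a graph. It works on already-parsed events for a single execution. The function names and the status labels are my own, and it only handles the two terminal event types mentioned above.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional

# Only the two terminal event types discussed in this post are handled here.
TERMINAL_TYPES = {"ExecutionSucceeded": "SUCCEEDED", "ExecutionFailed": "FAILED"}


def execution_status(events: Iterable[Dict[str, Any]]) -> Optional[str]:
    """Return a status label if a terminal event is present, else None."""
    for event in events:
        status = TERMINAL_TYPES.get(event["type"])
        if status:
            return status
    return None


def event_graph(events: Iterable[Dict[str, Any]]) -> Dict[int, List[int]]:
    """Map each event id to the ids of the events that directly follow it."""
    children: Dict[int, List[int]] = defaultdict(list)
    for event in events:
        children[int(event["previous_event_id"])].append(int(event["id"]))
    return dict(children)
```

An event id with more than one child in the resulting mapping is where the execution fanned out, for example at a Map or Parallel state. Rendering that on top of the state machine definition is the hard part, and is not attempted here.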

Clearly a lot of work to develop and operate. Especially for what is likely undifferentiated heavy lifting unrelated to your core business.

I wouldn’t rule out AWS offering these features with Express, a la carte, at some point down the road. A first iteration could be a screen where you enter a log stream ARN (identifying an execution) and the browser fetches the logs and reshapes them for rendering as the Graph Inspector and the Execution History.
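For a feel of what the "reshape for rendering" step might involve, here is a small sketch that turns the events of one execution into plain-text Execution History rows. It assumes each event also carries an event_timestamp attribute holding epoch milliseconds, which is not something covered above, so treat that field name as an assumption. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List, Tuple

Row = Tuple[int, str, str]


def history_rows(events: Iterable[Dict[str, Any]]) -> List[Row]:
    """Return (id, type, ISO timestamp) rows ordered by event id."""
    rows: List[Row] = []
    for event in sorted(events, key=lambda e: int(e["id"])):
        # Assumption: event_timestamp is epoch milliseconds, possibly as a string.
        when = datetime.fromtimestamp(int(event["event_timestamp"]) / 1000, tz=timezone.utc)
        rows.append((int(event["id"]), event["type"], when.isoformat(timespec="milliseconds")))
    return rows


def render(rows: Iterable[Row]) -> str:
    """Format rows as fixed-width lines: id, timestamp, then event type."""
    return "\n".join(
        f"{event_id:>4}  {timestamp}  {event_type}"
        for event_id, event_type, timestamp in rows
    )
```

It is a long way from the real widget, with no expanding content or syntax highlighting, but it shows that the raw material is there.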


ABOUT GRUNDLEFLECK

Graham "Grundlefleck" Allan is a Software Developer living in Scotland. His only credentials as an authority on software are that he has a beard. Most of the time.
