Tin's Posts · April 28, 2026
The Same Machine, Twice
I built (nearly) the same thing twice last year. Once in twenty lines of Python. Once with five Lambda functions, an S3 handoff, a distributed map, and EventBridge wired to a DynamoDB audit trail.
The thing: a state machine. The concept was identical both times. The infra wasn't.
The small one
Here's the entire design of the Python version. First, the states — the actual enum from the source:
```python
from enum import StrEnum

class SyncStatus(StrEnum):
    NEW = "new"
    BUILD_FAILED = "build_failed"
    READY = "ready"
    SUBMIT_FAILED = "submit_failed"
    SUBMITTED = "submitted"
```
It's really that simple. Five states in a StrEnum (essentially a string with types). Six transitions came out of it, and it drove a real-world integration. No library, no framework, and not a line of YAML either.
The system syncs legal matters from a Salesforce-based platform into an internal API. Records arrive, we try to map them, we try to submit them, we track where each one sits. If something breaks, we know exactly which step broke, and a retry task runs at 10, 30, and 55 minutes past the hour to pick up the pieces.
The state machine made one thing possible that an if/elif chain never does: you can look at any record in the database and immediately know what happened to it and what can happen next. build_failed means the mapping logic isn't written yet, or the dependencies failed to resolve (lazy loading is fun...). submit_failed means the API rejected it. submitted is a terminal state - it's done, nothing touches it again.
When a new platform came in, we added states for it. When a mapping failed, we fixed the code and reran process_pending(). The machine itself didn't change; only the interfaces performing the transitions needed new implementations.
The large one
This version processes monthly invoices. Each invoice run fans out across potentially hundreds of bookings - each one needing a ClickHouse query, a DynamoDB lookup, a processor evaluation, and a line item stored. Then those results aggregate back into a single report with a final status.
That's a different problem. A Python loop works fine at ten records. At three thousand, you want parallelism. If one booking fails, you want to know without losing the other two thousand, nine hundred, and ninety-nine. If a Lambda crashes mid-run, you want the execution to resume, not restart.
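You can get part of the way in plain Python with a thread pool and per-record error isolation - a sketch, not the production approach, with every name here invented:

```python
from concurrent.futures import ThreadPoolExecutor


def process_booking(booking_id: int) -> dict:
    # Hypothetical per-booking step; the real one does a ClickHouse query,
    # a DynamoDB lookup, a processor evaluation, and a line-item write.
    if booking_id % 1000 == 7:  # simulate one bad record in the batch
        raise ValueError(f"booking {booking_id} failed")
    return {"booking_id": booking_id, "status": "completed"}


def fan_out(booking_ids: list[int]) -> list[dict]:
    # Isolate failures per record: one exception must not sink the run.
    results = []
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = {pool.submit(process_booking, b): b for b in booking_ids}
        for future, booking_id in futures.items():
            try:
                results.append(future.result())
            except Exception as exc:
                results.append({"booking_id": booking_id,
                                "status": "fatal_error", "error": str(exc)})
    return results
```

But checkpointing, resuming after a crash, and runs that outlive a single process are exactly where this stops being fun to hand-roll.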
You want to buy what Amazon's selling.
So: AWS Step Functions. Five Lambdas in sequence. S3 as the handoff between steps - the list of bookings written to a file, the distributed map reading from it, the results written back, the aggregator reading those. EventBridge listening for execution completions and writing an audit record to DynamoDB regardless of outcome.
The state machine here doesn't even live in application code. It's an infrastructure construct: loads of YAML, machine definitions, and minutiae for each of those steps. You could build it in the visual UI... but let's be real, we're doing real work here.
The states are the same shape:
pending -> processing
processing -> completed
processing -> fatal_error
The infrastructure is just what you attach them to.
Where they both break
Here's what the code review of the AWS machine turned up: missing validation at state boundaries.
A Lambda in the middle of the chain expected inputId to be present in its input. It wasn't validated. If it arrived as undefined, the update would fail silently - the line item would never reach completed or fatal_error, it would just stop. A field not threaded through. An assumption instead of an assertion. Real easy mistake to make.
The Python machine had the same failure mode, differently dressed: a matter could sit in ready with a null payload if the build step stored the status before confirming the payload was actually set.
Different scale, same mistake. Both caught by looking carefully at what each step assumes it will receive. The fix in both cases was the same: validate redundantly at the entry of every state transition. State machines protect you from control-flow chaos; they don't magically clean up the data.
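A minimal sketch of that entry-point validation, in the Python flavour. inputId is the real field from the review; everything around it is illustrative:

```python
def mark_completed(item: dict) -> dict:
    # Validate at the entry of the transition: assert, don't assume.
    input_id = item.get("inputId")
    if not input_id:
        # Fail loudly into a named state instead of stalling silently
        # between completed and fatal_error.
        return {**item, "status": "fatal_error",
                "error": "inputId missing at state boundary"}
    return {**item, "status": "completed"}
```

The point isn't the check itself; it's that a missing field now lands in fatal_error, a state you can see, instead of nowhere.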
The AWS machinery - the retries, the execution history, the distributed map - didn't catch this. Naming the states and reviewing the transitions did.
The point
Most business logic is "just" a state machine. You simply don't see it yet.
A support ticket goes: open to in_progress to resolved. An order goes: placed to payment_confirmed to fulfilling to shipped. A sync record goes: new to ready to submitted. All states.
The code that moves between them is full of transitions. The bugs live in the unmodelled ones - the "what if it's already in this state when we try to do that" cases that an if/elif chain handles inconsistently and a state machine handles explicitly.
You don't need Step Functions to get this. You don't need a library. You need an enum, a clear definition of which transitions are valid, and the discipline to only write the record once you're sure the transition succeeded.
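That last discipline - do the work first, then write the state - fits in a few lines. FakeAPI and the dict standing in for the database are stand-ins, not anything from the real system:

```python
class FakeAPI:
    """Stand-in for the third-party API."""

    def submit(self, payload) -> bool:
        # Hypothetical rule: the API rejects empty payloads.
        return payload is not None


def submit_record(record: dict, api: FakeAPI, db: dict) -> None:
    # Do the work first; write the record's state only once you know
    # which way the transition actually went.
    ok = api.submit(record.get("payload"))
    db[record["id"]] = "submitted" if ok else "submit_failed"
```

Written the other way round - status first, API call second - you get the null-payload-in-ready bug from the previous section.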
Step Functions earns its keep when you need to process a lot of records, reliably and fast - Lambda timeouts, S3 writes, third-party APIs, fan-out. For the rest: name your states. Write them in a comment at the top of the file if nowhere else. Just thinking about them makes a lot of the bugs obvious.
The machine was the same both times. The question was just how many moving parts it needed to survive.