
· 11 min read
Adam Berger

One of the most important decisions in software architecture is deciding where to draw lines. If we can get the right groupings of concerns, separating the parts we can separate and keeping the right aspects of our system together, we'll be able to express our solutions more simply and straightforwardly. In fact, many of the big shifts in the way we think about software architecture can be thought of as drawing different boxes around the same components.

For example:

  • The microservices movement advocated drawing a physical box around individual domains, aligning physical architecture with domain structure.
  • React advocated removing the line between UI logic and templating and drawing a box around both.
  • Document databases advocated moving constraints and schema maintenance into the application code (outside of the data storage box), creating a boundary between them.

Line drawing doesn't change the set of things we can do but it does change the set of simple things we can build and the set of things we can express easily.

Simple things do not cross lines and embody a single concept. Easy things make use of few simple things.

If we see our job as platform builders as making simple easy, drawing lines determines both what simple things we can build and what's easy to build. So where we draw our lines matters. A lot.

We'll take a look at the boundaries that a traditional backend chooses and how a traditional backend as a service modifies those boundaries to create a different set of tradeoffs. Finally, we'll discuss a new set of boundaries that define what we're calling the "smart storage" architecture and examine what benefits we get from this new grouping of concerns.

A traditional backend architecture

Traditional backend

The traditional backend draws its top-level lines between the client and the server and between the server and the database.

The simple things we can easily create in this architecture are:

  • Named backend operations (e.g. getOrder or updateShoppingCart). These reside entirely within the server and are easy to individually invoke from the client.
  • Static constraints on the shape of our data. Our data store has everything it needs to evaluate the constraints before saving our data.
  • Authorizing clients to run specific backend operations based on static client claims.

The things we may want to do that this architecture makes complex (non-simple) and hard are:

  • Orchestrating multiple backend operations. Even if we build a new endpoint, the orchestration requires both logic and state, which requires crossing the line between our backend and our datastore.
  • Dynamic constraints on the evolution of our data. For example, we may want to enforce the constraint that our order status can never go from placed to shipped without first transitioning to fulfilled. We can attempt to enforce this in our server box but it requires support from our data store. Further, if we have drawn some lines between our endpoints, any dynamic constraints will have to cross the lines between every endpoint that deals with the entity we're considering.
  • Authorizing clients to run backend operations where the authorization decision requires additional data. This introduces logic and data store access into our authorization layer and requires our authorization layer to know about the internals of our endpoint logic, requiring us to cross quite a few lines.

A traditional backend as a service architecture

Traditional backend as a service

The traditional backend as a service in the model of Firebase, Supabase, etc. collapses the client and server boxes from our "traditional backend," removing the line that had been between them, and moves the authorization box to sit in front of the data store rather than in front of the endpoint logic.

We can see that we have generally the same set of components, just with different boundaries drawn between them.

So what are the simple things that this architecture enables?

  • Reads of and updates to individual data items from the client are simple and easy. This is the bread and butter of traditional backends as a service.
  • Static constraints on the shape of our data. Our data store still has everything it needs to evaluate the constraints on our data, though specific backends as a service vary in their support for this.
  • Authorizing clients to view and update individual tables based on static client claims (Supabase has cleverly extended this to support authorization based on arbitrary data by collapsing the authorization and data storage boxes).

And what are the capabilities we likely want that this architecture makes complex and hard?

  • Enforcing invariants across entities. The best of the traditional BaaS platforms offer client-side transactions, so it is possible for cooperative clients to maintain invariants across entities but, by nature, many of our cross-entity invariants combine aspects of authorization with static schema constraints. This creates a very complex interaction across the line separating "authorization" from "data storage". This means that, even in the best case, it is likely not possible to prevent invariant violations across entities with uncooperative clients (i.e. anyone who knows how to open dev tools).
  • Dynamic constraints on the evolution of our data. Remember: here we have in mind the example of requiring an order entity to transition to a fulfilled state before being marked as shipped. At least in the traditional backend architecture, it was possible to enforce dynamic constraints by creating a complex interaction between endpoints and our data store. Unfortunately, it is incredibly difficult to enforce these constraints at all with the typical BaaS without crossing every single line we've drawn. We could allow running arbitrary code in the authorization layer or data store to inspect the old and new version of every updated record but we would then need to write a validator in the authorization layer for every update that we wanted to write on the client side. At that point, we've essentially re-built our traditional backend but required every endpoint to be built twice.
  • Orchestrating backend operations that should outlive the client. Because we don't have any logic running on the server, this is not just hard but impossible without building additional systems.

A "smart storage" architecture

We've seen that there is quite a lot to like about the traditional backend architecture vs the traditional backend as a service. We had to sacrifice our ability to control the evolution of our data in order to get the convenience of centralizing our logic (i.e. placing all logic on the client instead of splitting it between client and server).

What if we could re-draw these lines to preserve the things we liked about the traditional backend architecture, expand the set of simple things that could be built easily, and still get the convenience of the traditional backend as a service?

Enter: the "smart storage" architecture.

Smart storage architecture

The primary innovation of the traditional backend as a service over the traditional backend was to remove the line drawn between the client and the server.

In our "smart storage" architecture, we will instead remove the boundary between the server logic and the database. We'll also remove the boundaries between the logic for different endpoints.

When we say that we'll remove the boundary between the server logic and the database, we mean this at a logical layer. We'll make the storage of server logic state automatic and transparent to the server itself by constraining our server logic to look like a reducer, transforming a current state and an event into a new state and a set of effects. We likely still store data in a physically separate data store but, while physical boundaries do matter, the logical boundaries determine the high order bits of what's simple and easy to express.
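As a rough sketch (the type names here are ours, for illustration, not any specific platform's API), the transition logic in this model is just a reducer whose persistence is handled for it:

// Hedged sketch of the "smart storage" transition-logic shape. State, Event,
// and Effect are whatever the domain needs; the platform, not this code, is
// responsible for loading and storing State.
type Transition<State, Event, Effect> = (
  state: State,
  event: Event,
) => { state: State; effects: Effect[] };

// Example: an order entity whose data can only evolve through this function,
// which makes the placed -> fulfilled -> shipped constraint easy to enforce.
type OrderState = { status: "placed" | "fulfilled" | "shipped" };
type OrderEvent = { type: "fulfill" } | { type: "ship" };
type OrderEffect = { type: "notifyCustomer"; message: string };

const orderTransition: Transition<OrderState, OrderEvent, OrderEffect> = (state, event) => {
  if (event.type === "fulfill" && state.status === "placed") {
    return { state: { status: "fulfilled" }, effects: [] };
  }
  if (event.type === "ship" && state.status === "fulfilled") {
    return {
      state: { status: "shipped" },
      effects: [{ type: "notifyCustomer", message: "Your order shipped!" }],
    };
  }
  // Any other transition would violate our dynamic constraints, so reject it.
  return { state, effects: [] };
};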

Our client can send events to our smart storage backend through the authorization layer and can read the storage state.

Since we are still encapsulating all of the backend logic into a box that is easily managed by a platform, our "smart storage" architecture fits nicely as a target for a traditional backend (e.g. while migrating) or as a backend as a service, hit directly by clients.

So what are the simple things made easy by this architecture?

  • Arbitrary backend operations can be implemented in one place and invoked by the client (by sending the corresponding event).
  • Static constraints on the shape of our data are easy to enforce because all of our transition logic sits in one place and can easily share validations or validations can be performed as part of the transparent storage updates.
  • Authorizing clients to run specific operations (send events) is easy because our authorizer has the information it needs at the time of authorization.
  • Orchestrating multiple backend operations is easy because all of our transition logic lives in one place and the platform can provide persistent queues and reliable timers to the transition logic on top of our data store access (without crossing any boundaries).
  • Enforcing dynamic constraints on the evolution of our data is easy because our data only evolves as dictated by our transition logic, which has no internal boundaries.

What's still complex and hard?

  • Enforcing invariants across entities. We've found that this is easier and simpler but still not fully simple or easy. By nature, enforcing a constraint across entities involves crossing boundaries and the "smart storage" architecture doesn't change that. However, by modeling storage records as actors with internal logic that can send and receive events, cross-entity communication is a bit more structured and predictable than it otherwise would be. In this architecture, one entity can send an event to another, allowing for structured cross-entity coordination that's at least co-located with the rest of the entity's update logic. When implemented as state machines, there are even reasonable ways to prove properties of systems of interacting machines.
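For a concrete sense of that structured coordination (a sketch with hypothetical names, not a particular platform's API), an entity's transition logic can emit an effect that the platform delivers to another entity instance as an ordinary event:

// Hypothetical sketch: cross-entity coordination expressed as an effect that
// the platform delivers to another entity instance as an event.
type SendEventEffect = {
  type: "sendEvent";
  to: { entity: string; id: string }; // e.g. { entity: "courier", id: "123" }
  event: { type: string; [key: string]: unknown };
};

// Inside the order entity's transition logic: when the order becomes ready,
// ask a courier entity to pick it up. The invariant still crosses a boundary,
// but the coordination lives next to the rest of the order's update logic.
function effectsForOrderReady(orderId: string, courierId: string): SendEventEffect[] {
  return [
    {
      type: "sendEvent",
      to: { entity: "courier", id: courierId },
      event: { type: "pickupRequested", orderId },
    },
  ];
}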

Move fast, don't break things

Much of the convenience of traditional backends as a service centers on the ability to ship quickly, though development speed may slow down over time as the workarounds for hard problems pile up.

We can achieve the same (or better) time to market with a "smart storage" backend as a service and, because we're able to provide simple solutions for more desired capabilities, we don't see the same slow down over time.

Developing with a traditional backend architecture, feature development proceeds by:

  1. Defining an interface between the client and backend.
  2. Defining a data storage schema.
  3. Implementing backend business logic.
  4. Implementing the backend storage interface.
  5. Building the client.
  6. Connecting the client to the backend.

Traditional backends as a service remove the need for (1) and (4), combine (3) and (5), and standardize (6).

"Smart storage" backends as a service remove the need for (1) and (4) and standardize (6).

So, we remove and standardize the same amount of work. Traditional BaaS also combine backend and client-side business logic but, because there is little duplicated between them, we haven't seen this result in significant savings. In fact, we have seen that this mushing together of business logic on the client can prevent us from preserving crucial invariants about our data. Traditional BaaS providers know this too, which is why they all offer arbitrary backend functions as a service in addition to the standard backend offering. With a "smart storage" backend as a service, one architecture provides everything we need to move quickly and preserve our data integrity and business rules.

All together, we expect equivalent initial development times and significantly lower maintenance cost over time from the "smart storage" BaaS architecture.

Implementation

As we said at the top of this article, the principal factor determining how valuable an architecture is tends to be where its boundaries are drawn. So, regardless of how each component is implemented, we think that there's significant value in this re-thinking of the traditional backend or backend as a service architecture. However, we've found that the best representation for the transition logic in the "smart storage" architecture is a state machine.

Our goal is always to make our code match our mental model of the solution as closely as possible and, in this case, a state machine with associated data captures exactly how our transition logic is intended to work.

Our model for what our transition logic layer should do is as follows: Events come in and are processed sequentially. Processing entails applying some logic to the current state and the provided event to generate new state and effects. The new state should be transparently stored and the effects executed, potentially producing new events.
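Concretely, the loop the platform runs around that transition logic looks something like this (a simplified sketch; a real implementation also has to worry about concurrency, retries, and durability):

// Simplified sketch of the platform's processing loop: load state, apply the
// transition logic, persist the new state transparently, then run effects,
// which may themselves produce new events.
async function processEvent<State, Event, Effect>(
  instanceId: string,
  event: Event,
  transition: (state: State, event: Event) => { state: State; effects: Effect[] },
  store: {
    load: (id: string) => Promise<State>;
    save: (id: string, state: State) => Promise<void>;
  },
  runEffect: (effect: Effect) => Promise<Event | undefined>,
): Promise<void> {
  const current = await store.load(instanceId);
  const { state, effects } = transition(current, event);
  await store.save(instanceId, state); // storage is transparent to the logic
  for (const effect of effects) {
    const followUp = await runEffect(effect);
    if (followUp !== undefined) {
      await processEvent(instanceId, followUp, transition, store, runEffect);
    }
  }
}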

State machines are the right, simple abstraction to easily represent a process that works like that.

State Backed

We built State Backed backend as a service using the "smart storage" architecture and, having built many apps on top of it, we've seen the benefits that we expected first-hand. More importantly, our clients have too.

You can try out the "smart storage" architecture with State Backed now, for free. Let us know what you think!

· One min read
Adam Berger

We're excited to announce the release of our completely free GitHub bot to visualize XState state machines in pull requests!

One of the amazing things about state machines is that your high-level logic becomes pure data that's easy to visualize and quickly understand.

You can install our bot in about 11 seconds here. Once installed, whenever a PR creates or updates a state machine, you'll see a helpful comment from the State Backed Visualizer bot with a visualization of the new machine to easily keep the whole team aligned.

State Backed visualizer bot

Obviously, we built the State Backed visualizer bot as a persistent workflow in State Backed! We found a really nice pattern for handling webhooks:

  1. Identify the entities in the webhook
  2. Map from entity types (e.g. pull request, repo, user) to a State Backed machine definition
  3. Create a machine instance for the specific entities in the webhook (e.g. PR 12345 or user 123) if one doesn't already exist
  4. Send the webhook as an event to the machine instance for each referenced entity

This pattern makes it really easy to keep the state of the entity up to date in your machine instance.
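In code, the pattern looks roughly like this (the getOrCreateInstance and sendEvent calls below are hypothetical stand-ins for your State Backed client calls, not the exact API):

// Hedged sketch of the webhook-handling pattern with illustrative names.
type Entity = { type: "pullRequest" | "repo" | "user"; id: string };
type Webhook = { repo: string; prNumber?: number; user?: string; [key: string]: unknown };

// 1. Identify the entities referenced by the webhook payload.
function entitiesIn(webhook: Webhook): Entity[] {
  const entities: Entity[] = [{ type: "repo", id: webhook.repo }];
  if (webhook.prNumber) {
    entities.push({ type: "pullRequest", id: `${webhook.repo}#${webhook.prNumber}` });
  }
  if (webhook.user) {
    entities.push({ type: "user", id: webhook.user });
  }
  return entities;
}

// 2-4. Map each entity type to a machine, create the instance if it doesn't
// exist, and deliver the webhook to that instance as an event.
async function handleWebhook(
  webhook: Webhook,
  getOrCreateInstance: (machine: string, instance: string) => Promise<void>,
  sendEvent: (machine: string, instance: string, event: unknown) => Promise<void>,
): Promise<void> {
  const machineFor: Record<Entity["type"], string> = {
    pullRequest: "pr-machine",
    repo: "repo-machine",
    user: "user-machine",
  };
  for (const entity of entitiesIn(webhook)) {
    const machine = machineFor[entity.type];
    await getOrCreateInstance(machine, entity.id);
    await sendEvent(machine, entity.id, { type: "webhook", payload: webhook });
  }
}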

· 22 min read
Adam Berger

Whether you intended to or not, you’re probably building a state machine right now.

That's because any time you have a set of steps with some ordering between them, your system can be represented and built simply and visually as a state machine.

On the frontend, it’s a bit easier to squint and see the states and events you’re modeling. After all, you actually talk about transitions and "paths" the user can take through your app. The mapping from the familiar world of screens and popups and nested components to the language of hierarchical states and transitions is fairly straightforward. So, thankfully, we’ve seen more and more (though not yet enough!) adoption of state machines for modeling frontend flows.

On the backend, however, while it’s just as true that many of the systems we build are implicitly state machines, I’ve yet to see many teams explicitly model them that way.

I get it. Backend concerns seem quite different. Whiteboards in the conference rooms of backend-focused teams are covered in boxes and arrows depicting information flows and architectural dependencies rather than states and transitions.

So many of us backend engineers are so consumed with the mind-boggling concurrency of our systems that we may even scoff at the idea of a system being in a "state." If the frontend seems deterministically Newtonian, the backend seems stubbornly relativistic or, on its worst days, quantum.

But our users most certainly expect that each logical grouping of their data is self-consistent. While we’re thinking about data en masse, our users really care about data in the small: this document or that ad campaign.

We’re talking about queues, eventual consistency, and reliable timers. All our users care about is having our business logic applied to their data consistently.

There is a better way. And, as is so often the case, it requires a change of perspective, a jump in the level of abstraction at which we’re working.

What we need on the backend is a focus on logic over infrastructure, an investment in dealing with the essential complexity of our business use cases rather than re-addressing the purely accidental complexity of our architecture with every new project.

The mechanism we need to accomplish that is none other than the lowly state machine.

The five sentence state machine intro

A state machine[1] definition consists of states and transitions between them. Transitions happen in response to events and may have conditions that determine whether they’re active or not. Hierarchy allows for parallel (simultaneous) states. Each instance of a state machine is in one set of states at a time (a set rather than a single state to accommodate parallel states) and owns some arbitrary data that it operates on. States can define effects that run when the state is entered or exited and transitions can define effects that run when the transition is taken; those effects can atomically update the instance’s data or interact with external systems.

This structure is easily visualized like this:

State machine example
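In code (XState shown here as one popular way to write statecharts; this is a minimal sketch rather than a complete machine), those five sentences look something like this:

import { createMachine, assign } from "xstate";

// A small statechart showing states, events, guarded transitions, parallel
// regions, owned data (context), and entry/transition effects.
const reviewMachine = createMachine({
  id: "review",
  type: "parallel", // the two regions below are active simultaneously
  context: { approvals: 0 },
  states: {
    approval: {
      initial: "pending",
      states: {
        pending: {
          on: {
            APPROVE: [
              {
                // transition with a condition: require a prior approval first
                target: "approved",
                cond: (ctx) => ctx.approvals >= 1,
                actions: assign({ approvals: (ctx) => ctx.approvals + 1 }),
              },
              // otherwise just record the approval and stay in "pending"
              { actions: assign({ approvals: (ctx) => ctx.approvals + 1 }) },
            ],
          },
        },
        approved: {
          entry: () => console.log("notify the author"), // effect on entering the state
        },
      },
    },
    ci: {
      initial: "running",
      states: {
        running: { on: { CI_PASSED: "passed", CI_FAILED: "failed" } },
        passed: {},
        failed: {},
      },
    },
  },
});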

The backend state machine value proposition

We’ll talk about exactly how state machines help us solve the major classes of problems we face in backend development but, first, let’s look at the high-level value proposition.

State machines are a mechanism for carefully constraining the updates to our critical data and the execution of effects in a way that allows us to express solutions to many classes of problems we encounter and to effectively reason about those solutions.

Let's break that down.

  1. Constraints are actually good. Like, really good. We’re all trying to build systems that perform tasks that people care about and operate in ways that we can understand, not least because we’d really like to fix them when they misbehave. Unconstrained code leaves no bulwark between our too-burdened brains and the chaos of executing billions of arbitrary operations per core every second. We all consider GOTOs harmful because Dijkstra convinced us that we should aim to "make the correspondence between the program (spread out in text space) and the process (spread out in time) as trivial as possible."

    There are few better ways to simplify the correspondence between what your program looks like and what it does than by constraining your program’s high-level structure to a state machine. With that reasonable constraint in place, it suddenly becomes trivial to understand, simulate, and predict what the systems we build will actually do.

  2. Protect your data and orchestrate your effects. Just as the infrastructure of our system only exists to support our business logic, our business logic only exists to act on our data and the external world. Data updates are forever and the changes we effect in the world or external systems can have serious repercussions.

    As we saw above, with state machines, data updates and effects are only executed at specific points, with clean error handling hooks, and easy simulation. When you know exactly where and under which conditions these critical actions will happen, your entire system becomes intelligible, invariants become comprehensible, and your data becomes trustworthy.

  3. Reasoning about your system is not optional. There’s the old adage: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." Kernighan said that in the era of standalone programs. How quaint those times seem now. Once you connect two programs together, the emergent effects of your system (unexpected feedback loops, runaway retries, corrupted data) create a mess many orders of magnitude more “clever” than any one component.

    If we’re going to have any hope of understanding the systems we build (and we’d better, if we want them to do useful things for people), then we have no option but to constrain ourselves to simple parts. Because they are so simple, state machines are just the right high-level structure for the components of a system you hope to be able to understand.

  4. We left off expressiveness. Expressiveness is the point at which I hear the groans from some of the folks in the back. We've all been burned by the promise of a configuration-driven panacea before. What happens when your problem demands you step beyond the paved road that the platform envisioned? And so began the rise of the "everything as code" movement that's now ascendant. It makes sense. You simply can't forego expressivity because expressivity determines your ability to solve the problems you're faced with. It's non-negotiable.

    But expressivity is the key, not arbitrary code executing in arbitrary ways. State machines are expressive enough to model processes in any domain, naturally. They simply provide the high-level structure within which your code executes. This constraint ensures you can naturally express your logic while preserving your ability to model the system in your head. Even non-engineers can typically understand a system's logic by looking at its state machines.

Now, let's look at the two primary types of backend systems and examine how state machines might form a helpful core abstraction for each. First, we'll examine reactive systems and then proactive systems (aka, workflows).

Reactive systems

Most of our APIs fall into this camp. Get a request, retrieve or update some data, return a response; they lay dormant until some external event spurs them to act.

Whether we write these as microservices, macroservices, miniliths, or monoliths, we have a bunch of seemingly-decoupled functions responding to not-obviously-connected requests by updating some very-much-shared state.

A reactive system example

Let's look at an example to understand how state machines can help us build better reactive systems. We’ll walk through the traditional way of building a traditional app: a food delivery service, focusing on the flow of an order.

We’ll simplify the flow to this: users submit an order, we offer it to the restaurant, the restaurant accepts or rejects it, and a bit later, we send the delivery request to a courier and wait until the courier marks the order complete.

To build that in a traditional way, we’ll probably want an order service with endpoints for users to create an order, restaurants to accept an order, couriers to accept a delivery and mark it complete, and timers to notify us that they’ve elapsed.

To wildly simplify what was, in my past job, a few hundred person-years of work, you likely put together some code structured like this:

Endpoint files

To represent a process that I’m pretty sure you’re picturing in your head right now like this:

Order state machine v1

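A sketch of that machine in XState (wildly simplified, with illustrative action and delay names) might look like this:

import { createMachine } from "xstate";

// Simplified sketch of the order flow as a single statechart.
// sendOfferToRestaurant, offerDeliveryToCourier, and COURIER_DISPATCH_DELAY
// are illustrative names you'd define in the machine's options.
const orderMachine = createMachine({
  id: "order",
  initial: "submitted",
  states: {
    submitted: {
      entry: "sendOfferToRestaurant",
      on: {
        RESTAURANT_ACCEPTED: "accepted",
        RESTAURANT_REJECTED: "rejected",
      },
    },
    accepted: {
      // a reliable timer rather than a client call: dispatch the courier later
      after: { COURIER_DISPATCH_DELAY: "awaitingCourier" },
    },
    awaitingCourier: {
      entry: "offerDeliveryToCourier",
      on: {
        COURIER_ACCEPTED: "inDelivery",
        COURIER_REJECTED: "awaitingCourier", // re-enter and offer to the next courier
      },
    },
    inDelivery: {
      on: { ORDER_DELIVERED: "complete" },
    },
    rejected: { type: "final" },
    complete: { type: "final" },
  },
});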

The problem with the traditional approach

It looks like those endpoints sitting in their separate files are decoupled but, within each route, we have a bunch of assumptions about where we are in the flow. Orders can only be accepted once; couriers need the order information we stored during order acceptance when they pick up the order and shouldn’t be able to accept early, since they’re paid based on time spent. We’ll also need to make sure that, if we offer a job to a courier who rejects it, they can’t subsequently accept after another courier is assigned.

In short, to be correct, each endpoint must validate aspects of the overall flow so, to coherently understand this system, we need to think about the whole thing; we can't really understand any part in isolation. The overall process is what our customers are paying for, not a set of endpoints. Having spent many sleepless nights attending to outages within just such a system, I know firsthand that seemingly innocent changes to a supposedly isolated endpoint can have unintended consequences that ripple through the entire system.

Basically, all of the critical structure around and between the endpoints that jumps right out at us in the state machine is completely hidden and hard to extract from the "decoupled" endpoints.

Now, let’s imagine an all-too-real request: after building this system, our business team decides that we could offer wider selection faster if we send couriers out to buy items from restaurants we have no relationship with (and, therefore, no way to send orders to directly).

With that feature, we’ve broken all of the assumptions buried in our supposedly decoupled endpoints. Now, couriers get dispatched first and orders are accepted or rejected after the courier is on their way.

With the traditional structure, we satisfy this new requirement by painstakingly spelunking through each of our endpoints and peppering in the appropriate conditionals, hoping that, in the process, we don’t disrupt the regular orders flowing through our system.

Then, to satisfy restaurants that want to perform their own deliveries, we add a new option: for some orders, instead of dispatching couriers, we give the restaurant the delivery information so they can bring the customer their food. We wade through the mess of conditionals in our "decoupled" endpoints, struggling to trace distinct, coherent flows, painstakingly adding to the confusion as we implement our new feature.

Doing better

The trouble here lies in the difference between coupling and cohesion. Most systems have some degree of coupling, some interdependent assumptions between different endpoints or components. The degree of coupling is directly related to the difficulty of understanding a part of the system separately from the whole. As it becomes harder to understand this endpoint without also understanding those endpoints, it becomes more and more important to treat the system as a cohesive whole rather than pretending each part is an isolated component.

As the coupling between endpoints grows, so too do the benefits of representing the system as an explicit state machine.

If you’re blessed with a generally stateless problem domain where you can build truly isolated endpoints, you should certainly do so! Our goal is always simplicity in the service of comprehensibility and, by that measure, nothing beats an isolated pure function.

If, however, your problem domain, like most, requires inter-component assumptions, I highly recommend that you architect your system as it is (as a whole) instead of pretending it is composed of isolated pieces. As the dependencies between the endpoints of your system intensify, you’ll find more and more value from representing your requests as events sent to an instance of a state machine and your responses as pure functions of the machine’s state and owned data. In these systems, your primary concern is to understand the inter-component flow and that’s exactly what a state machine provides. You then build truly decoupled data updates, actions and conditions that your state machine orchestrates into a coherent whole.
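As a rough sketch of that shape (with hypothetical names, and glossing over how an event is routed to a specific machine instance), an endpoint shrinks to two decoupled pieces: forward the request as an event, then derive the response purely from the instance's state and owned data:

// Hypothetical sketch: the endpoint forwards an event to the order's machine
// instance and the response is a pure function of the resulting state.
type OrderSnapshot = { value: string; context: { etaMinutes?: number } };

async function acceptOrderEndpoint(
  orderId: string,
  sendEvent: (instance: string, event: { type: string }) => Promise<OrderSnapshot>,
) {
  const snapshot = await sendEvent(orderId, { type: "RESTAURANT_ACCEPTED" });
  return toResponse(snapshot);
}

// All flow knowledge lives in the machine; this is just a projection of state.
function toResponse(snapshot: OrderSnapshot) {
  return { status: snapshot.value, etaMinutes: snapshot.context.etaMinutes ?? null };
}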

Returning to our example, it doesn’t take a state machine expert to understand our complex, 3-part flow from this executable diagram, but I can assure you that after 6 years in the trenches with the "decoupled" endpoint version of this system, I still struggled to piece together a view of what the system was actually doing.

Order state machine v3


By making this shift, we can solve the general problem of running consistent instances of these machines once and then spend all of our time building the business logic our users actually need.

Which brings us to our second class of system…

Proactive systems (workflows)

Proactive systems are distinguished by being primarily self-driven. They may wait on some external event occasionally, but the primary impetus driving them forward is the completion of some process or timer they started.

The fundamental problem with workflows is that computers run code as processes and, while processes are permanent(ish) at the timescale of a request, they are decidedly ephemeral at the timescale of a long-lived workflow. We used to string together cron jobs, queues, and watchdogs to ensure forward progress in the face of machine and process failures. That made things work but created a mess: as with the "decoupled" endpoints we saw above, there was no cohesion to the separately-deployed dependencies. All of the above arguments for building more cohesive systems apply doubly so for workflows built around queues, event buses, and timers; understanding a system from those parts demands top-rate detective work.

Workflow engines and the clever hack

In the past few years, we’ve seen the rise of the cohesive workflow as an abstraction in its own right[2]. Just write code and let the workflow engine deal with reliably running on top of unreliable processes. Wonderful! Except that nearly all such platforms suffer from two major flaws: a lack of upgradability and an iffy, leaky abstraction at their core.

There is only one constant across every software project I’ve seen: change. We’ve created this infinitely malleable construct and (of course!) we’re going to take advantage of its amazing ability to change. But there is no coherent upgrade story for any major workflow platform. After kicking off a job that’s going to run for a year, there’s no reasonable way to change how it works!

The best of these systems allow you to litter your code with version checks to manually recover missing context. Understanding is the hardest part of the job and trying to reason about a workflow littered with “if (version > 1.123) {...}” checks is like betting your business on your ability to win at 3D chess; we shouldn’t need to introduce a time dimension to our code.

This obvious problem of wildly complicated updates derives from the less obvious, more insidious issue with workflow platforms: at their core is a Shlemiel the painter algorithm. They cleverly provide the illusion of resuming your code where it left off but that’s simply not possible with the lack of constraints present in arbitrary code, where any line can depend on arbitrary state left in memory by any code that previously ran. They provide this illusion by running from the beginning on every execution and using stored responses for already-called effects, thereby re-building all of the in-process context that your next bit of code might depend on.
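To make that trick concrete, here's a stripped-down sketch of the replay mechanism (not any particular engine's API): every run starts from the top of the workflow function, and activities that already ran return their recorded results instead of executing again.

// Minimal sketch of workflow replay: already-recorded activities return their
// stored results; only the first not-yet-recorded activity actually runs.
type HistoryEntry = { name: string; result: unknown };

class Replay {
  private cursor = 0;
  constructor(private history: HistoryEntry[]) {}

  async activity<T>(name: string, fn: () => Promise<T>): Promise<T> {
    const recorded = this.history[this.cursor];
    if (recorded && recorded.name === name) {
      this.cursor++;
      return recorded.result as T; // replayed from history, no side effect re-run
    }
    const result = await fn(); // first time through: execute the real effect
    this.history.push({ name, result });
    this.cursor = this.history.length;
    return result;
  }
}

// Every resumption re-executes the entire workflow body against a Replay built
// from stored history, which is exactly the Shlemiel-the-painter shape above.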

It is clever!

It is also the wrong abstraction because it starts from the assumption that we programmers aren’t open to adopting something better than arbitrary, unconstrained code.

A better abstraction

With state machines as the core abstraction for workflows, upgrading becomes a simple data mapping exercise because we know exactly what any future code depends on: our state and our owned data. We can write one function to map states from the old version to states from the new version and one function to map the owned data from the old version to owned data from the new version. Then we can upgrade instances of our state machine whenever we want and our logic itself can ignore the history of mistakes and rethought features that more rightly belong in our git history than our production deployment.

There’s more. State machines are inherently resumable because, again, we know exactly how to rebuild the state that any future execution depends on: just load the state and owned data. No clever tricks required.

A workflow example

Let’s look at an example of an onboarding workflow we might run with a standard workflow engine today:

export async function OnboardingWorkflow(email: string) {
  await sendWelcomeEmail(email);
  await sleep("1 day");
  await sendFirstDripEmail(email);
}

Workflow engines treat each of our awaited functions as "activities" or "steps", recording the inputs and outputs of each and providing us the illusion of being able to resume execution just after them.

Now, we decide that we want our welcome email to vary based on the acquisition channel for our user. Simple, right?

export async function OnboardingWorkflow(email: string) {
  const acquisitionChannel = await getAcquisitionChannel(email);
  await sendWelcomeEmail(email, acquisitionChannel);
  await sleep("1 day");
  await sendFirstDripEmail(email);
}

Nope! If a user has already passed the sendWelcomeEmail step, workflow execution engines have no choice but to throw an error or send a second welcome email with this change. Let’s see why.

The first time the engine runs the first version of our workflow, it will execute the sendWelcomeEmail activity and store its result, then execute the sleep activity, which will register a timer and then throw an exception to stop the execution. After the timer elapses, the engine has no way[3] to jump to the line of code after our call to sleep. Instead, it starts at the very top again and uses stored results for any functions it already executed. It has to do this because there’s no other way to rebuild all of the program state that we might depend on (e.g. local variables, global variables, arbitrary pointers, etc.). So, we’ll need to write our updated version more like this:

export async function OnboardingWorkflow(email: string) {
  if (getVersion() < 2) {
    await sendWelcomeEmail(email, defaultAcquisitionChannel);
  } else {
    const acquisitionChannel = await getAcquisitionChannel(email);
    await sendWelcomeEmail(email, acquisitionChannel);
  }
  await sleep("1 day");
  await sendFirstDripEmail(email);
}

Now imagine that we had more updates (great software engineering teams push multiple changes a day, right?) and imagine that the steps of our workflow had more direct dependencies between them. Maybe you could still mentally model the overall flow after v3. What about after v7?

Again but with a state machine

With state machines, things are a bit different.

We start with this state machine:

Workflow state machine v1


For simple workflows, this diagram is helpful but it’s admittedly not a huge improvement in understandability over the code. As things get more complex though, a visual representation of the high-level structure of the workflow is really helpful. More importantly for our analysis here though, this is what an upgrade looks like:

Workflow state machine upgrade


As an engineer, we need to do 3 things to cleanly migrate running instances from one version of our machine to another:

  1. We build the new version of our state machine. We don't need to include any vestiges of the old version that are no longer needed. This is represented in the right-hand side of the above diagram.
  2. We write a function to map our old states to our new states (a trivial mapping in this case). This is represented as the left-to-right arrows in the diagram.
  3. We write another function to map the owned data for the old version to owned data for our new version of the machine. For example, if we had used the acquisition channel in future states, we would want to populate the acquisition channel in our owned data as part of this mapping.

Because of the constraints of state machines, those mapping functions are straightforward to write and easy to test.
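For the onboarding example, a sketch of those two mapping functions might look like this (the state names and context shapes here are assumptions for illustration, not State Backed's actual upgrade API):

// Hypothetical state names and owned-data shapes for the two machine versions.
type OldState = "sendWelcome" | "waitOneDay" | "sendDrip" | "done";
type NewState = "lookupChannel" | "sendWelcome" | "waitOneDay" | "sendDrip" | "done";

type OldContext = { email: string };
type NewContext = { email: string; acquisitionChannel?: string };

// 2. Map old states to new states. Trivial here: every old state keeps its
// name, and no running instance needs to be moved into lookupChannel.
function mapState(state: OldState): NewState {
  return state;
}

// 3. Map old owned data to new owned data. Instances upgraded mid-flight never
// looked up a channel, so leave it undefined (or backfill a default here).
function mapContext(context: OldContext): NewContext {
  return { ...context, acquisitionChannel: undefined };
}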

This upgrade mechanism allows us to keep our workflow implementation clean and completely separate from our handling of changes over time. The inherent ability of state machines to actually resume execution from any state is what allows us to disentangle our change history from our point-in-time state machine definition.

Putting them together

Examined more broadly, few systems fall entirely into the reactive or proactive categories. An application likely has reactive aspects that kick off proactive processes that wait for reactive events and so forth. With today’s paradigms, these are incredibly awkward to model uniformly, so we tend to create subsystems built around different abstractions with different operational teams with different expertise. Because state machines are driven by events and are inherently resumable, they easily model both reactive and proactive systems within a single paradigm that’s able to naturally express both types of solutions.

Migrating

Great! So now that you're convinced of the value of state machines, you just need to rewrite your whole backend as a set of state machines in a big, all-at-once migration, right?

Not quite.

You don’t need to replace your existing code with state machines. In many cases, you’ll want to wrap calls to your (simplified) existing code in a state machine. That’s because, for most backends, the entire concept of a flow is simply missing. Once you introduce a state machine that’s responsible for executing the code that previously sat behind your endpoints, you can update your clients or API layer to send events to your new state machine instead of directly invoking the endpoints. Then, you can remove the flow-related checks and logic from the former endpoint code that now sits behind your state machine. Finally, you can lift your state management out of the former endpoint code to move ownership of the data to the state machine itself.

Obviously, all of this can be applied just to new projects and migrations can easily be approached piecemeal, wrapping one related set of endpoints at a time.

Let's examine the value gained at these key milestones:

  1. Creating a state machine to wrap a set of endpoints will yield valuable insight into the system you thought you knew. This executable documentation will allow future engineers to understand the overall flow and confidently make changes. You'll even remove the potential for a whole class of race conditions. Often, you'll discover never-before-considered user flows lurking in your existing implementation and this is a great time to finally define how they're supposed to work.
  2. Pulling the flow-related checks and validations out of the former endpoint code will simplify things as only deleting code can. You'll likely even find a few lurking bugs in those complex validations.
  3. Lifting state management out of the former endpoint code and into the state machine removes yet more code with yet more potential for bugs. Importantly, you'll find that your next project finishes faster and with fewer outages because you've pulled many application concerns up to the platform level.

Getting started

Ready to start implementing your backends as state machines?

The most important first step is to start thinking in terms of states and transitions. Immediately, you'll start to see improvements in your ability to understand your software.

There are even some great libraries you can use to build state machines on the backend, including XState.

And there's a new service that can definitely help you adopt this pattern...

I was lucky enough to be a part of Uber Eats' journey from an endpoint-oriented to a workflow-oriented architecture. Complex dependencies between endpoints had made working on them incredibly difficult and error-prone. With the migration to a workflow abstraction, we gained immense confidence in our system by finally having a cohesive view of the user-relevant flows that we were building.

This was super exciting but, as I'm sure you can tell by now, I saw huge potential for state machines to expand upon that value. So I started State Backed. We recently released our state machine cloud to make it incredibly easy to deploy any state machine as a reliable workflow or a real-time, reactive backend. We'd be proud to help you adopt state machines for your own backend or we're happy to share notes and help however we can if you choose to build a state machine solution yourself.

You can have a state machine deployed in the State Backed cloud in the next 5 minutes if you'd like to try it out.


  1. Technically, we’re talking about statecharts throughout this article because we want the expressivity benefits of hierarchical and parallel states. We’ll use the more common term just for familiarity.
  2. This was pioneered by platforms like Cadence. These platforms were a huge leap forward for proactive system design because they enabled cohesion in this type of software for the first time. The fact that we believe that state machines are a more suitable abstraction doesn't detract at all from the amazing advance that these platforms made.
  3. The only exception we’re aware of is Golem, a workflow engine built around WebAssembly. You can't snapshot the memory of a regular process and restore it but, because of WebAssembly’s sandbox model, they are able to capture the full program state and do actual resumption. This is a beautiful abstraction for resumption but doesn't address upgrading running instances.

· 6 min read
Adam Berger

Cohesion is essential to great software but it seems like such a squishy concept. As it turns out, I think you can define it fairly intuitively. And, unfortunately, I find it missing in most backend systems.

Cohesion and coupling are, to a large extent, duals. Pete Hunt introduced the concepts wonderfully in his great Rethinking Best Practices talk introducing React.

At the time (back in 2013), common wisdom held that templates should be "de-coupled" from app logic. Instead, Pete argued that any such de-coupling was a fiction and that the benefits to cohesion from colocating templating and logic outweighed the perceived coupling concern.

All these years later, this remains the principal advance that React ushered in, but the conceptual framework to understand coupling vs cohesion still hasn't been fully developed or fully adopted by the software community.

Unlike cohesion, the concept of coupling seems much easier to nail down. Coupling measures the degree to which different components depend on one another. We've all rightly had it drilled into our heads that unnecessary coupling is bad; we want to build de-coupled modules.

But we too-often forget that it's unnecessary coupling that's bad. We can't just pretend that necessarily coupled modules are de-coupled and, while tricks sometimes suffice to turn coupled code into seemingly de-coupled code, the degree of real coupling ultimately depends on the actual thing we're trying to build. The nature of the solution determines how much each of its parts must interact.

That is, the "whole" (the solution) exerts constraints on the "parts", the components.

Cohesion, on the other hand, is about ensuring that the things that we have grouped together actually do belong together. False cohesion is obviously bad because it's just unnecessary coupling.

So what is cohesion?

Cohesion, I believe, is measured by how closely a module reflects the explanation of the essential aspects of what it's supposed to do.

When you have 10 endpoints that all operate on the same data, they necessarily have deep assumptions and dependencies on each other, whether those dependencies are explicit or not. After all, they operate on the same data. Each endpoint embeds some assumptions about what that data must look like by the time it's called.

Any explanation of how that system works would have to discuss a level of abstraction above any of the endpoints themselves. Explaining why any of the endpoints behave as they do would require talking about how the whole set of endpoints is supposed to behave. Likely, when one engineer explains to another how a system like that works, they would talk about the flow that those endpoints jointly implement. They are parts of a whole and that whole doesn't exist in the codebase.

That indicates a lack of cohesion because the explanation of how the system works relies on an abstraction that has no analog anywhere in the code! In most endpoint-oriented codebases, there is no flow to point at at all even though everyone talks to their colleagues about flows all the time. In fact, it is necessary to talk about a flow if you want to have a good explanation of how the system behaves or why any particular endpoint is built the way it is.

Instead, if the flow were reified and represented directly, e.g. by creating a state machine that represented the flow itself or a workflow that directly implemented the flow, then we could say that the flow demonstrated cohesion because the explanation of how the system worked could refer to actual entities in the code, namely, the state machine or workflow.

The endpoints could even remain in separate modules. Each of them, individually, could be considered somewhat cohesive at a certain level of explanation but to achieve cohesion at the level most of us care about (i.e. the flow), you have to introduce a higher-level structure that matches the explanation of the system.

Good explanations refer to good, cohesive abstractions. And good abstractions must be things you can point to in your system.

The science analog of a codebase without explicit flows and just fictionally-decoupled endpoints would be to ignore chemistry, biology, and psychology because they could be derived from physics. While the effects at each of these emergent levels could theoretically be derived reductively, the explanations of any of the higher-level effects that we actually experience would be tortuous and so wildly complex that it would be hard to even consider them explanations in the normal sense of the word. We need levels of abstraction that correspond to good explanations of the things we care about.

David Deutsch, in The Beginning of Infinity refers to a thought experiment that's particularly apt:

Consider one particular copper atom at the tip of the nose of the statue of Sir Winston Churchill that stands in Parliament Square in London. Let me try to explain why that copper atom is there. It is because Churchill served as prime minister in the House of Commons nearby; and because his ideas and leadership contributed to the Allied victory in the Second World War; and because it is customary to honour such people by putting up statues of them; and because bronze, a traditional material for such statues, contains copper, and so on. Thus we explain a low-level physical observation – the presence of a copper atom at a particular location – through extremely high-level theories about emergent phenomena such as ideas, leadership, war and tradition.

He goes on to explain how silly an explanation of how that copper atom came to rest at the tip of that particular statue's nose would look if it were only to refer to phenomena at the level of atoms and physics.

In any complex system, there is an intricate dance of higher levels of emergence creating constraints and influencing lower levels of emergence (in this case, culture influencing atoms even though culture is, itself, an emergent property of atomic effects).

Similarly, in our systems, we have different levels of emergence at which different types of abstractions exist.

Just as is the case for physics and culture though, there is no single direction in which explanations flow between different levels of abstraction.

The crucial point, if we are going to build systems that we can understand (i.e. explain), is that we must have a language for talking about each relevant level of abstraction in our codebases.

Too often, the flow that connects various endpoints is absent from our backend code and it severely hampers our ability to understand our equivalent of why that particular copper atom came to be at the tip of that particular statue's nose - in our scenarios: why this endpoint has some particular validation or guards against some strange phenomena.

If we build software that contains the entities that we talk about when we explain how it works, we will build better, more easily understood systems.

That's one of the crucial insights that got us excited about centering backend systems around state machines. Every engineer explains the components of their systems as a flow but no flow can be found in their codebase. Why not build your system the way you think about it?

· One min read
Adam Berger

We're thrilled to release the State Backed web UI!

You now have two options to choose from to create machines, deploy versions, and administer your State Backed account: the smply CLI and our brand new web dashboard.

We still recommend using the smply CLI to connect your CI/CD environment and build pipelines to State Backed but the web dashboard is really nice to quickly check in on your instances and to tweak deployments.

We even included a full-blown in-browser IDE to easily author new machine versions. We're not talking about just an editor. We're talking npm and a command line right there in your browser.

Check it out:

· 3 min read
Adam Berger

State Backed, the first platform to allow anyone to launch any state machine as a persistent, reliable cloud actor, subscribe to real-time state updates from anywhere, and send fully authorized events from any client or server, is now available for use.

During this open beta period, we will not be charging clients for their usage and are looking for feedback from the community as we strive to create the nicest developer experience of any backend as a service platform.

Already, deploying a new machine consists of running a single command in our smply CLI or using our in-browser IDE and flow visualizer. Launching a new instance and connecting from the browser is just 3 substantive lines of code:

import { StateBackedClient } from "@statebacked/client";
import { useStateBackedMachine } from "@statebacked/react";
import { useActor } from "@xstate/react";

// Create an anonymous session.
// You can also easily create authenticated sessions using your existing identity provider,
// just by altering the config.
const client = new StateBackedClient({
  anonymous: {
    orgId: "org_YOUR-ORG-ID",
  },
});

function YourComponent() {
  const { actor } = useStateBackedMachine(
    client,
    {
      machineName: "your-machine",
      instanceName: "your-instance",
      getInitialContext() {
        return { "any": "initial-context" };
      },
    },
  );

  const [state, send] = useActor(actor);
  // render UI based on real-time updated state and send events
}

The traditional approaches to building backends are getting more and more cumbersome. Having built a ton of these traditional backends ourselves, we're tired of pretending that our individual endpoints or GraphQL resolvers are nicely decoupled bits of logic. In reality, bundles of endpoints, whether you call them microservices or domains within your monolith, are highly interdependent, with lots of assumptions baked into each about the overall flow of state updates. We are, after all, building cohesive user experiences, not merely APIs to be called in any order.

Pretending that these endpoints are decoupled only makes it more and more difficult to piece together an understanding of the high-level flow that we actually care about.

We're confident that the better path forward is to treat flows as first-class entities on the backend. That's what State Backed is about. You build a state machine that describes the logic for a flow or set of flows you care about and deploy it as a single package to the State Backed cloud. That means that you can always understand the most crucial aspects of your app - the evolution of your state and triggering of external effects. State Backed takes care of ensuring that you can launch as many instances of your flows as you want, that each instance is persistent and durable, and that every state update and effect creates a consistent, linearizable history for that machine instance.

And we forgot to mention: every instance creation, read request, and event you send is controlled by the simple authorization functions you provide as part of your deployment package, based on the trusted user claims from your existing identity provider (or custom claims for anonymous access or specialized use cases).

We've started building invincible workflows and real-time, multiplayer backends on top of the platform (check out our examples) and we're super excited about the potential of applying this paradigm to these use cases.

You can deploy your first State Backed backend in less than five minutes at StateBacked.dev.

We'd love your feedback!

· 2 min read
Adam Berger

Everywhere you look, you'll find little bundles of state with little snippets of logic.

A user onboarding tutorial, general user activation state, a document that can be created, shared, and edited, a workflow to approve a piece of content, a rate-limited call to an external API.

There is nothing more annoying than discovering that you need yet another little bundle of state. State needs a place to live, its access patterns become hard to change at scale, changes and reads require careful authorization logic, and the structure of the state at rest rarely matches exactly what we want for the potentially-simple logic that we need to apply to it.

What if these little bundles of state and logic could live in a simple file next to their authorization logic, come with their own simple-as-memory consistent data storage, and take one command to deploy?

Answering that question led us to build State Backed, the XState backend as a service.

You can now launch any XState state machine as an API-accessible, persistent state machine in the cloud with one command.

Send events and read instance state from your frontend or backend with simple, end-to-end authorization for every end-user and every action.

Check out our quick start to launch your first machine in the next 5 minutes.