Patterns of Progress
Modern software systems involve many asynchronous tasks, but reporting status on tasks is often handled poorly. An asynchronous task is one that processes in the background and allows the client or caller to work on something else in the meantime (“non-blocking”). Architecturally, asynchronous processing is often the right choice for operations and scaling reasons, but it does add complexity. Both the end-users and the operating teams will need to understand the state of the task and, if the wrong mechanism is chosen, teams will spend their time solving issues such as:
- Stuck tasks. A task is not progressing. Possible causes include a missed update or write (the task progressed but the update message was lost), the task being dropped from the queue, a task that is genuinely stuck in an infinite loop or deadlock, or a terminal state that was not accurately modeled.
- Failed tasks. A task is in a failure state and needs some operator intervention. This is usually for tasks that have exhausted their retry budget, but must be completed rather than abandoned. Universally, engineering teams will need to care about failing tasks, but a poor progress design will amplify the operational pain and impede automation improvements. End-users generate many complaints and escalations when they cannot differentiate between an “in-progress” task and a failed task, as every delay becomes a suspected failure.
- Orchestration limitations. If the client cannot tell when an asynchronous task is complete, or cannot separate success from failure, then its ability to orchestrate follow-on tasks that depend on the result of the previous task will be hampered by data quality and concurrency issues.
- Availability issues. The progress reporting mechanism may be subject to denial-of-service attacks where multiple clients can be impacted by a single bad client or data issue. If the processing systems cannot update status on tasks, the updates may be lost.
Many of these issues will only be seen once a system becomes popular, particularly orchestration and availability issues.
NB: In this article, an end-user is a human, while a client is software, operating at the behest of some end-user, that has an interest in the task.
Put simply, something is happening out of sight from the client and the client needs some way to know when that something has completed. However, this definition can obscure the actual design space. If we ask these three questions:
- Why is this task being executed? What is the context of the call?
- What kind of updates does the client need?
- How complex is the client? How does the client receive updates?
we can design a solution based on the use case and the required data model, and then use that design to find a lower-cost technical implementation.
Example: Generating a report
The end-user (an authorized web user) has requested a report of the past month of activity, which will take several minutes to compile. This is an ad hoc report specific to this user (versus a report production that many clients may be tracking). The data comes from multiple data sources and a ‘partial’ report is undesirable, so a single data source failing will prevent the report’s production and there are many ways the report generation may fail. The end-user is using a web application to retrieve the successful report, so the client can handle some protocol complexity but may not be able to persist data reliably (the user could delete stored data) and cannot be invoked from the server.
Example: Exporting data after transforming it
The end-user is collecting data within an online spreadsheet table and wants to transform the data collected since the last export and, once that is complete, export it daily (as input to a larger workflow). Internally, transformations are handled by one service and exports by a separate service. Transformations and exports operate on the same resource (the table). If the transformation fails, the export is unnecessary. The end-user is automating this process using shell scripts and a simple scheduler system.
Client Context and Knowledge
Know Nothing (Special Case)
The client has no knowledge of the asynchronous task. This usually indicates that the task is a ‘side effect’ of some other workflow. For example, a log message may eventually be recorded in a store, but the client has no need to know the state of the log because it will not impact their workflow.
Task will eventually process (Special Case)
The client is told that the request will be eventually handled, but they have no visibility into its state. Email is an example; after being given confirmation that an email will eventually get to the recipients, the sender has no way to track the current state or ultimate delivery of their message¹. If the task is modifying or creating a resource, the clients may be able to determine completion by checking on the resource.
The client can tell if a task has completed, but has no further details. For a state machine, this is equivalent to knowing that a task is in a terminal state.
If a non-terminal state is treated as “complete”, clients can see tasks switch back to incomplete, which is counter to normal expectations. If the clients are using the complete status as a part of a work orchestration, this can lead to processes becoming unexpectedly parallelized or data seeming inconsistent. If a task can be retried, the complete status should only be set once all attempts have been exhausted.
Success or Failure
The client can tell if the task completed successfully or if it failed. For a state machine, there is generally only a single successful terminal state but there may be many failure states. The client may be able to extract details about a failure (e.g. exceptions, status code).
Failures can reveal security-sensitive information about an infrastructure, so systems will often have a two-tier reporting mechanism — a limited error message or opaque code (e.g. transaction id) for outsiders and more detailed diagnostics for insiders/engineering team.
Similar to above, if a task can change from a failed state to a successful state, this can confuse clients and foil orchestration. Conceptually, this model is for tracking “permanent failure”.
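The terminal-state rules above can be sketched as a small state machine. This is a minimal illustration, not a prescribed design: the state names, the `ALLOWED` transition table, and the `client_view` helper are all assumptions made for the example. The point it shows is that a retrying task is not terminal, so the client only ever sees "failure" once the failure is permanent.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical states; a real system would define its own.
    QUEUED = "queued"
    RUNNING = "running"
    RETRYING = "retrying"
    SUCCEEDED = "succeeded"
    FAILED_TIMEOUT = "failed_timeout"
    FAILED_BAD_INPUT = "failed_bad_input"


# One successful terminal state, several failure terminal states.
FAILURES = {TaskState.FAILED_TIMEOUT, TaskState.FAILED_BAD_INPUT}
TERMINAL = {TaskState.SUCCEEDED} | FAILURES

# RETRYING is deliberately non-terminal: it can go back to RUNNING,
# or to a failure state once the retry budget is exhausted.
ALLOWED = {
    TaskState.QUEUED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.RETRYING} | TERMINAL,
    TaskState.RETRYING: {TaskState.RUNNING} | FAILURES,
}


def transition(current, new):
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


def client_view(state):
    """Collapse the internal state into what the client is told."""
    if state is TaskState.SUCCEEDED:
        return "success"
    if state in FAILURES:
        return "failure"
    return "in_progress"
```

Because terminal states have no entry in `ALLOWED`, a task that has reported success or failure cannot move again, which is the property orchestrating clients rely on.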
The client has some view into the status of the task as it is being executed. This view could be an indicator of:
- Work completed (e.g. percent complete)
- Work left (e.g. distance to destination)
- Estimated time to complete
- Individual steps completed
or many other variations or combinations. For instance, when I was generating the SSL certificate for this site, the first attempt failed. The user interface reported that a failure occurred, the time of the failure, and that there would be an automatic retry. While the data reported was minimal, this was sufficient for the purposes of the task.
Simulated In-progress Details (Special Case)
The client has some view into the status of the task, but the status is a simulation. For example, the server might be projecting an estimated time to complete based on a model of processing times, rather than monitoring actual progress. The server may report a mixture of real and simulated data; for example, a server may report the real completion status of the task but may add in a projected percent complete if the infrastructure to report a real percent complete is missing.
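A mixed real and simulated status might look like the following sketch. The function name and parameters are hypothetical, and the "model" here is assumed to be nothing more than a typical duration for this kind of task. The completion flag is real; the percentage is a projection and is capped so that it never claims completion on its own.

```python
def simulated_percent(elapsed_seconds, typical_seconds, finished):
    """Project a percent complete from elapsed time.

    `finished` is the real completion status reported by the task.
    `typical_seconds` is an assumed model of how long such tasks take.
    """
    if finished:
        return 100
    if typical_seconds <= 0:
        return 0
    projected = int(100 * max(0, elapsed_seconds) / typical_seconds)
    # Cap at 99: only the real completion signal may report 100.
    return min(99, projected)
```

For example, `simulated_percent(30, 120, False)` projects 25, while a task that has overrun its typical duration stays at 99 until the real status reports completion.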
Data Model and Lifecycle
A progress status may be modeled as its own resource (e.g. with a progress id) or as an attribute of a resource. If the process does not produce a resource or if a process may be applied multiple times to a resource (perhaps even concurrently), then each task should have a unique (or unique enough) identifier. If the task is 1:1 with a resource, then Occam’s razor suggests that creating a new entity is unnecessary; use the resource identifier instead.
If a progress id and resource are connected, but there isn’t a way to look up that relationship, you are likely to run into operational visibility issues where end-users can only report one of the ids. Business reporting on the state of your system will likely require some join between the two ids, so plan ahead.
If you need to report in-progress details and the task is indeterminate upon creation (e.g. the task will be scanning a number of files but the number of files is unknown), then you will want to structure your reporting for “monotonic reporting”. Clients expect progress to move forwards, stall, or stop, but not move backwards. For example, in the scanning example, if you report status as the proportion of the number of files scanned over the number of total files, it is possible for the percentage to decline over time if finding files and scanning files are not sequential. Instead, represent the data to make the non-monotonicity clear, perhaps by differentiating a “known, fixed” denominator versus an “in-progress, changing” denominator or using an event history approach.
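One way to represent the “known, fixed” versus “in-progress, changing” denominator is sketched below for the file-scanning example. The class and field names are assumptions for illustration. The numerator and the discovered total are each non-decreasing counters, and a percentage is only published once discovery has finished and the total can no longer grow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanProgress:
    files_scanned: int        # only ever increases
    files_found: int          # only ever increases while discovery runs
    discovery_complete: bool  # True once files_found is final

    def report(self):
        """Build the status payload shown to clients."""
        payload = {
            "files_scanned": self.files_scanned,
            "files_found": self.files_found,
            "total_is_final": self.discovery_complete,
        }
        # A ratio is only meaningful once the denominator is fixed;
        # before that, publishing it could make progress appear to go backwards.
        if self.discovery_complete and self.files_found > 0:
            payload["percent_complete"] = round(
                100 * self.files_scanned / self.files_found
            )
        return payload
```

A client that wants a progress bar before discovery finishes can still show the raw counters, and the `total_is_final` flag tells it that the total may change.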
The utility of progress data tends to decline rapidly with time. Based on your operational and client needs, plan to delete old progress data periodically and automatically². However, if the client can request data for a valid process that simply has not been registered yet (due to eventual consistency or delays in progress history processing), there may be an ambiguity between “old and likely deleted data” and “very new data” which can cause orchestration issues. A potential mitigation would be to encode the creation date within the progress id such that the server could return a 410 Gone versus 404 Not Found status.
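The creation-date mitigation could be sketched as follows. The id layout (a UTC timestamp, a hyphen, then random hex), the 30-day `RETENTION` window, and the dictionary standing in for the progress store are all assumptions for the example. Note that an unsigned timestamp can be forged by a client, so a real system might also sign the id.

```python
import secrets
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed retention window
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def new_progress_id(now):
    """Build an id whose prefix is the UTC creation time."""
    return f"{now.strftime(STAMP_FORMAT)}-{secrets.token_hex(8)}"


def created_at(progress_id):
    """Recover the creation time; raises ValueError if the id is malformed."""
    stamp = progress_id.split("-", 1)[0]
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def lookup_status(progress_id, store, now):
    """Return the HTTP status code the server would answer with."""
    if progress_id in store:
        return 200
    try:
        created = created_at(progress_id)
    except ValueError:
        return 404  # not an id this server could have issued
    if now - created > RETENTION:
        return 410  # old enough that the data was likely deleted
    return 404      # recent: possibly not registered yet, worth retrying
```

With this split, an orchestrating client can retry on 404 for a freshly issued id but stop immediately on 410.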
Security-wise, authorizing access to a progress status can be tricky if it is not 1:1 with a resource. Additionally, if the client polls for status, the server could generate a significant number of authorization checks. Thus, from a risk trade-off perspective, avoiding storing any client or sensitive data as part of the progress and skipping authorization checks may be acceptable.
Types of Client Mechanisms
Wait / Synchronous Call Wrapper (Special Case)
The client makes a synchronous call which handles waiting on an asynchronous task. The client’s call may need to be kept active through keep-alive or HTTP 100 Continue messages. If the asynchronous portion is a small part of the overall workflow and the asynchronous task has a well-bounded time to complete, this pattern can simplify the client interface. This approach works best if the task and resource are 1:1, such as resource creation or deletion.
Periodically, the client makes a call to get the current status of the task.
Since there is a cost to every call, this mechanism can create denial-of-service conditions when a large number of calls are made in a very short period, and it can tie up network bandwidth. Server mitigations include early request rejections (potentially with an HTTP 429 Too Many Requests status), long polling (a request may stay open until some timeout is reached, allowing late updates to be fulfilled within a single request), and caching. On the client side, the client can add randomness to its polling interval or increase the delay between calls to reduce server load.
For the overall workflow, the delays between calls add latency to the overall processing time. For clients, this is an incentive to decrease the time between requests. For a server, the worst case would be a client ‘busy wait’-ing, where each call is made as soon as the previous call has completed.
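A client-side polling loop with randomness and increasing delays might be sketched like this. The `get_status` callable, the status strings, and the default delays are hypothetical; `sleep` and `rng` are injectable only so the loop can be exercised without real waiting. The delay ceiling doubles after each call and the actual wait is a random fraction of it, so many clients started together do not poll in lockstep.

```python
import random
import time


def poll_until_done(get_status, base_delay=1.0, max_delay=30.0,
                    max_attempts=20, sleep=time.sleep, rng=random.random):
    """Poll get_status() until it returns a terminal status.

    get_status is a hypothetical callable returning "in_progress",
    "success", or "failure".
    """
    delay = base_delay
    for _ in range(max_attempts):
        status = get_status()
        if status in ("success", "failure"):
            return status
        # Wait a random fraction of the current ceiling, then raise the
        # ceiling (exponential backoff) up to max_delay.
        sleep(rng() * delay)
        delay = min(max_delay, delay * 2)
    raise TimeoutError("task did not reach a terminal state")
```

The `max_attempts` bound keeps a stuck task from turning the client into an endless poller; what the client does after the timeout depends on the workflow.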
On either completion or an update to status, the server makes a call to the client. Within the Observer pattern, this is equivalent to invoking the update function. Callbacks may be modeled as both 1:1 and 1:n to clients.
If the clients do not have an addressable location (e.g. web browser), callbacks are not an option.
Compared to polling, the advantage of this method is that, since the server controls the callback, the server can minimize wasted “no new data” calls to the client. The server can maintain an audit trail of clients that were successfully notified, an important feature for some domains. Callbacks fit very naturally into an orchestration or workflow management system, allowing efficient “wait until this is done” steps, since the callback can be used to trigger the next step in the flow.
Security-wise, the callback can be unsafe, particularly if it is controlled fully by the client. Servers can be tricked into participating in denial of service attacks, invoking vulnerabilities, or divulging internal data.
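One common way to reduce this risk is to restrict which callback addresses a client may register. The sketch below assumes an allowlist of hosts (`ALLOWED_CALLBACK_HOSTS` and its contents are made up for the example) and only checks the URL text; it does not cover DNS-based tricks or redirects, which a production check would also need to consider.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts the server is willing to call.
ALLOWED_CALLBACK_HOSTS = {"hooks.example.com"}


def is_acceptable_callback(url):
    """Return True only for HTTPS URLs on an allowlisted host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    if parts.username or parts.password:
        return False  # reject embedded credentials
    return parts.hostname in ALLOWED_CALLBACK_HOSTS
```

Rejecting at registration time means a bad address never reaches the dispatch path, so internal addresses such as metadata endpoints are never called on a client's behalf.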
The callback from the server can fail, for reasons including a bad registration, network failures, or client availability issues. The onus is placed on the server for retries, and it is possible that the client will never see the callback. If clients need a way to see missing callbacks or replay previous callbacks, callbacks may be combined with event streams.
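Server-side retries and the audit trail can be sketched together. `CallbackDispatcher`, its `send` transport, and the audit record layout are all hypothetical; the transport is injected so the sketch stays independent of any HTTP library. Every attempt is recorded, whether or not it succeeds, and the caller learns if delivery was abandoned.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CallbackDispatcher:
    # Hypothetical transport: send(url, payload) returns True on delivery
    # and may raise OSError (e.g. ConnectionError) on network failure.
    send: Callable[[str, dict], bool]
    max_attempts: int = 3
    audit_log: list = field(default_factory=list)

    def notify(self, url, payload):
        """Try to deliver the callback, recording every attempt."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                delivered = bool(self.send(url, payload))
            except OSError:
                delivered = False
            self.audit_log.append(
                {"url": url, "attempt": attempt, "delivered": delivered}
            )
            if delivered:
                return True
        # Retry budget exhausted: the client may never see this callback.
        return False
```

A real dispatcher would usually wait between attempts; the delay is omitted here to keep the sketch short. When `notify` returns `False`, the status is still only known to the server, which is the gap that combining callbacks with an event stream is meant to close.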
On a change to status, the server writes a new event to an event log, and clients read from that log. The event log is not necessarily persisted, but it usually is with technologies like Kafka. If the event log is persisted, clients may have the ability to replay or see historical data. The protocol could be server-sent events or may be custom.
An event stream may be modeled per progress, per resource, or per client.
Internally, a system may use an event stream (potentially to decouple async processing from the API layer) but present the data via a different mechanism. For example, a background process may read the event stream and update a key-value store to represent the latest “snapshot” of the data.
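The stream-to-snapshot idea can be sketched as a fold over events. The event shape (`task_id`, `seq`, `status`) and the plain dictionary standing in for the key-value store are assumptions for the example. Each event carries a per-task sequence number so that replayed or out-of-order events cannot overwrite newer data.

```python
def apply_event(snapshot, event):
    """Fold one event into the snapshot, ignoring stale or replayed events."""
    current = snapshot.get(event["task_id"])
    if current is not None and current["seq"] >= event["seq"]:
        return  # already have this update or a newer one
    snapshot[event["task_id"]] = {
        "seq": event["seq"],
        "status": event["status"],
    }


def build_snapshot(events):
    """Rebuild the latest-status view from an event history."""
    snapshot = {}
    for event in events:
        apply_event(snapshot, event)
    return snapshot
```

Because `apply_event` is idempotent, the background reader can safely reprocess part of the stream after a crash, and the snapshot can be rebuilt from scratch whenever the full history is available.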
As the most complex mechanism, you can expect more implementation-specific and operational issues than in polling or callbacks. For instance, storage space is a function of the number of things being tracked multiplied by the number of updates, while straightforward implementations of polling and callbacks only require space proportional to the number of things being tracked.
Operationally, however, maintaining a history allows replay of events which can help mitigate server and client failure modes, so the complexity may be worth the cost.