Why We Chose QueueFile for Reliable Request Batching on Android

Segment’s mobile SDKs are designed to track behavioral data from your app and translate and route that data to hundreds of downstream integrations. One of the SDK’s core tasks is to upload behavioral data to our servers. Since every network request requires your app to power up the device’s radio, uploading this data in real-time can quickly drain a battery.

To minimize the impact of the SDK on battery life, we queue these behavioral events and upload them in batches periodically. This results in 3x less battery drain over an implementation without batching.

Our Android queuing system is built on QueueFile, which was developed by Square. In this article we’re going to run though our queueing requirements, some of traditional solutions, and explain in detail why we chose to build on Queue File.

Our Queueing Requirements

Queues are a deceptively simple concept, but there are a two main considerations that make it complicated in practice: durability and atomicity.

Durability - An element that has been added to the queue will survive permanently. An in-memory queue is easy to implement, but it lacks durability — events are queued until the process dies, and lost thereafter.

Atomicity - An element is either added to the queue or isn’t; the queue won’t ever be in an partial/invalid state. For our purposes, we needed to save events to disk as quickly and reliably as possible, and then worry about uploading them later.

We were looking for a solution that would guarantee these contracts would be be honored even in the face of process deaths or system crashes. Such scenarios are inevitable on mobile, e.g. the user could lose battery power in the middle of an operation, or the operating system could kill your application to reclaim memory.

Traditional Solutions

File: The most obvious way to persist data to disk is to use a plain old Java file. However, writing to a file is not atomic (in most cases), and it’s easy to run into cases of a corrupted file. A trivial implementation would also keep the entire queue in memory, which is not ideal for devices that might go offline for a long time.

AtomicFile: AtomicFile (also available from the support library) is a simple helper that can guarantee atomic writes on a regular file. It guarantees atomicity by creating a backup of the original file before writing any changes, and waiting until the write is completely written to disk before deleting the backup. As long as the backup file exists, the original file is considered to be invalid. However, AtomicFile requires the caller to maintain track of the corrupted state and restore itself from one.

SharedPreferences: The simplest way to persist data onto disk is by using SharedPreferences from the Android framework. Although designed to store small key-value pairs, it can certainly be used to store any data in a pinch. The SharedPreferences class is a high level wrapper around its own implementation of an atomic file and an in-memory cache. Writes are committed to a map in memory first, and then saved to disk. However, writing to SharedPreferences is not durable. SharedPreferences silently swallows any disk write error, leaving the memory cache and disk out of sync, with callers left guessing about the result of an operation.

SQLite: The typical solution to designing a reliable disk queue on Android is to layer it on top a SQLite database. SQLite works great for large datasets that need multithreaded access, but being a fully featured database, it is designed with optimizations for advanced querying and not for the purposes of a queue. SQLite is complex with many moving parts, and this made us wary of relying on it to back such a critical section of our code.

Why We Chose QueueFile

Although all of the above solutions could have been coerced to work for us, none of them were designed specifically to be used as queues. Luckily, Square had also run into this problem for accepting payments, and they built QueueFile to accomplish this. QueueFile guarantees that all operations are atomic, writes are durable, and is designed to survive process and even system-level crashes.

It was a solution tailor-made for our use case. Adding and removing elements from QueueFile is ordered first in first out (FIFO) and takes constant time, both of which are a nice bonus. QueueFile’s tiny size also makes it perfect for us to embed in our SDK.

Core Concepts of QueueFile

There are a few guarantees that the filesystem provides:

renaming a file is an atomic operation

fsync is durable

segment writes are atomic

QueueFile is particularly clever about the way it stores and updates data — Bob Lee has an excellent talk on the subject. QueueFile consists of a 16 byte file header, and a series of items called elements, as a circular buffer. The grey area (not to scale) represents empty space in the file.

The file header consists of four 4-byte integers that represent the length of the file, the number of elements in the file, and a pointer to the location of the first and last elements. Since the length of the file header is only 16 bytes (smaller than the size of a segment), changes to the file header are atomic. QueueFile relies on this by making modifications to the file visible only when the header is committed as well.

The components of a file header.

Each element itself is comprised of a 4-byte element header that stores the length of the element, and the element data (variable length) itself.

The components of an element.

Adding Data

Consider adding an element to the QueueFile below.

QueueFile first writes (and fsyncs) the element and its length. Notice that the file header remains unchanged. If writing the second element fails, the QueueFile is still left in a valid state since the header hasn’t been updated (even though the new data may be on disk). Upon restart, the QueueFile would still report that queue contains only a single element.

When the write operation completes successfully, the header is committed (and fsynced) as well. If updating the header fails, then the QueueFile remains the same as above, and the change is aborted. Otherwise, we’ve successfully added our data!

Calling fsync after every write prevents the filesystem from reordering our writes, and makes the transaction durable. This ensures that the QueueFile is never left in an invalid or corrupted state.

Removing Data

Removing and clearing do the reverse. Let’s start from the result of our previous operation.

QueueFile writes (and fsyncs) the header first. If committing the header fails, then the change is simply aborted and the QueueFile remains the same as above. When the header is written successfully, the removal is committed, and now has a size of 1 (even though the data is on disk).

QueueFile goes one step further and zeroes out the removed data, which leaves us just with the element we added previously.

In Conclusion

QueueFile has been a key part of the queueing system we’ve built which powers data collection for over 1,400 Android apps running on more than 200 million devices. While the incredibly small size of the library helps prevent SDK bloat, its simplicity does also create a few limitations. For instance, QueueFile is limited to a size of 1GB, and it cannot be used by multiple processes concurrently. Neither were deal breakers for us — we don’t want your queued analytics data using too much disk space and we can create separate queues for different processes — but are good to be aware of.

We’re working on porting this approach over to iOS and have some ideas to expand QueueFile to work on a broader range of filesystems like iOS and desktop apps. If you’re interested in helping us build infrastructural components like this for mobile apps and SDKs, we’re hiring.