API pagination techniques
The code corresponding to this post can be found here.
Software engineers interact a lot with HTTP APIs. In high-traffic systems, using APIs correctly can be tricky: how do you retrieve millions of records efficiently? How do you make sure that you don’t miss any item? How can you, as an API developer, make life easier for your consumers?
When accessing APIs, pagination is frequently used in order to decrease response size, prevent network timeouts, and generally to get results faster. With pagination comes the task of navigating through the pages of the entire result set.
Note: pagination is only meaningful in an ordered list of results. If the results are not ordered, pagination doesn’t make sense, as records can move between pages. The ordering can be visible or transparent to the client, but it must be consistent between requests.
# OFFSET + LIMIT
At its simplest, pagination is done with the
limit keywords. For instance, let’s suppose we have a REST API at
http://my-dummy-api.com, and suppose we have resources called
todos. If the server supports pagination, then when calling the
todos endpoint, we could specify the
offsetquery parameters in the URL:
http://my-dummy-api.com/todos?limit=10&offset=0. This would fetch the first “page” of all the records, with 10 items on that page. Then the call for the next page would be
limit=10&offset=10 to skip the first ten rows, and so on, always incrementing the
offset parameter as we consume the API.
The implementation of this approach relies on the
LIMIT SQL keywords (see the Postgres docs for a summary), in its shortest form it looks something like:
# Pros and cons
- This is a straightforward, easy to implement, and widely used techique.
- Its performance can be fine for reasonably sized datasets, but it will degrade as the dataset grows, and as the offset gets larger. This is because the database still reads all records before cutting off the ones not needed.
- In addition, there can be missed and duplicate records in the results, depending on the ordering: imagine that a new record is created while you are paginating. Whether you miss any records will depend on the ordering that the server internally uses: if records are returned by an ordering which is not something like
ORDER BY created_at ASC, then it can not be guaranteed that a new record created during pagination will not be lost, since it can happen that the new record belongs to the beginning of the result set, which you might already have fetched. This is not exclusive to this technique, but the slow performance makes this issue more prominent.
Overall, this approach is fine for some use cases, it is definiely not the most performant and not the most reliable.
# Cursor pagination
Cursor pagination (also called keyset or seek pagination) is also a widespred technique which aims to avoid the pitfalls of the
The idea is for the server to provide a unique cursor to the client with each response. The client has to send back the cursor at the next request, and the server then will fetch the next page based on the cursor. The cursor is a value which identifies an exact position in the result set that the server can use to identify the current paging position.
At its simplest, a cursor can be an internal ID of the last record of a page. The client would then ask for the following page using
cursor=10 indicates that we are at position 10 in the results (however, it is strongly recommended to obfuscate the cursor and not to use the internal ID of the record as it is - ideally it would be an encrypted value, meaningless to the client.) The server would then execut a SQL similar to:
This ensures consistent ordering of the results (in fact, it relies on it), and also has a better performance, since it uses a primary key for filtering, which has an index.
# UUID primary keys
What happens if we cannot use the primary key? For instance, for UUID primary keys, ordering can be expensive - that’s not what UUIDs are for. In that case, I would suggest using a timestamp column, which broadly corresponds to the creation date of the record. For instance, the cursor could be the
created_at of the item. This creates some difficulties, though:
- We have to make sure that the timestamp column has an index on it.
- What happens when two records have the same
created_atvalue? The solution is to use a composite key: use the timestamp for ordering, and use the UUIDs for secondary ordering, to decide between records with the same timestamp. A composite key like this can be encoded and provided to the client.
An example implementation can be seen here.
# Pros and cons
- Cursor pagination is good because it’s efficient and reliable, even with very large datasets. Your database has to handle much less data, as it doesn’t have to read loads of unneeded rows into memory, as it does with the offset pagination. This advantage can outweigh all the downsides if your dataset is large enough.
- Its downsides include the fact that it needs more effort: the server has to provide this functionality, and the client also has to have this extra logic.
- It also makes concurrent requests impossible, as in, you cannot fire off 10 async requests each requesting a page from the results (as you could do with the
offsetversion), since you don’t know the cursor value needed for any subsequent request. Therefore the client also can’t jump to a specific page.
# Try it yourself
I have put together a playground to experiment with pagination here. It includes an API that exposes data and offers pagination, both the limit + offset and cursor-based one. There are two clients, one hammering the API and updating and creating records. The other client is a hypothetical consumer who wants to export all the data.
The results are the following:
It can be seen that the offset pagination can offer flexible sorting, but it does not produce consistent results: there are missing items, unless the ordering is by creation date, in ascending order. And if you don’t set any explicit ordering and rely on the database default, pagination is a mess: there are thousands of missed rows.
The cursor (keyset) based pagination, however, is faster and does not miss any records. Sorting is part of the implementation, so it’s less flexible: it’s only possible by the cursor column.
Thanks to all the authors of the articles below for helping me understand this topic!
# Resources & further reading
- A Hackernews discussion about keyset pagination: here
- A blog post from Slack engineering about how they do pagination: here
- Related Slack API docs: here
- JSON API specification of the cursor paginatio: here
- All you need is backend article
- A Wordpress post
- A great article by Moesif
- Cursor-based pagination and UUIDs on Stackoverflow
- Cursor based pagination with non-unique key on Stackoverflow
- Use the index Luke post
- A Hackernoon post
- How Facebook offers pagination in their graph api
- Why limit + offset is slow
- Paginating real-time data