API pagination techniques
The code corresponding to this post can be found here.
Software engineers interact a lot with HTTP APIs. In high-traffic systems, using APIs correctly can be tricky: how do you retrieve millions of records efficiently? How do you make sure that you don’t miss any item? How can you, as an API developer, make life easier for your consumers?
When accessing APIs, pagination is frequently used in order to decrease response size, prevent network timeouts, and generally to get results faster. With pagination comes the task of navigating through the pages of the entire result set.
Note: pagination is only meaningful for an ordered list of results. If the results are not ordered, pagination doesn’t make sense, as records can move between pages. The ordering can be visible to the client or hidden from it, but it must be consistent between requests.
# OFFSET + LIMIT
At its simplest, pagination is done with the `offset` and `limit` keywords. For instance, let’s suppose we have a REST API at `http://my-dummy-api.com`, and suppose we have resources called `todo`s. If the server supports pagination, then when calling the `todos` endpoint, we could specify the `limit` and `offset` query parameters in the URL: `http://my-dummy-api.com/todos?limit=10&offset=0`. This would fetch the first “page” of all the records, with 10 items on that page. The call for the next page would then use `limit=10&offset=10` to skip the first ten rows, and so on, always incrementing the `offset` parameter as we consume the API.
The implementation of this approach relies on the `OFFSET` and `LIMIT` SQL keywords (see the Postgres docs for a summary). In its shortest form, assuming an illustrative `todos` table, it looks something like:
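```sql
-- Page 1 of the todos, 10 records per page (table and column names illustrative).
-- Page 2 would use OFFSET 10, page 3 OFFSET 20, and so on.
SELECT *
FROM todos
ORDER BY id
LIMIT 10
OFFSET 0;
```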
# Pros and cons
- This is a straightforward, easy-to-implement, and widely used technique.
- Its performance can be fine for reasonably sized datasets, but it degrades as the dataset grows and the offset gets larger. This is because the database still has to read all the skipped rows before discarding them (see the sketch after this list).
- In addition, there can be missed and duplicate records in the results, depending on the ordering: imagine that a new record is created while you are paginating. Whether you miss any records depends on the ordering the server uses internally: if records are not returned by something like `ORDER BY created_at ASC`, there is no guarantee that a record created during pagination won’t be lost, since the new record may belong to the beginning of the result set, which you might already have fetched. This is not exclusive to this technique, but the slow performance makes the issue more prominent.
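To make the degradation concrete, here is a sketch of what a deep page looks like, using the same illustrative schema:

```sql
-- To produce this page, the database must still generate the first
-- 1,000,000 rows of the ordered result set, only to throw them away.
SELECT *
FROM todos
ORDER BY id
LIMIT 10
OFFSET 1000000;
```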
Overall, this approach is fine for some use cases, but it is definitely neither the most performant nor the most reliable.
# Cursor pagination
Cursor pagination (also called keyset or seek pagination) is also a widespread technique, which aims to avoid the pitfalls of the `offset` + `limit` approach.
The idea is for the server to provide a unique cursor to the client with each response. The client has to send back the cursor with the next request, and the server will then fetch the next page based on it. The cursor is a value that identifies an exact position in the result set, which the server can use to locate the current paging position.
At its simplest, a cursor can be an internal ID of the last record of a page. The client would then ask for the following page using `http://my-dummy-api.com/todos?cursor=10&limit=10`, where `cursor=10` indicates that we are at position 10 in the results. (However, it is strongly recommended to obfuscate the cursor and not to expose the internal ID of the record as it is - ideally it would be an encrypted value, meaningless to the client.) The server would then execute a SQL query similar to:
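```sql
-- cursor=10 is taken here as the ID of the last record on the previous
-- page (illustrative). The query seeks past it instead of skipping rows.
SELECT *
FROM todos
WHERE id > 10
ORDER BY id
LIMIT 10;
```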
This ensures consistent ordering of the results (in fact, it relies on it), and it also performs better, since it filters on the primary key, which is indexed.
# UUID primary keys
What happens if we cannot use the primary key? For instance, for UUID primary keys, ordering can be expensive - that’s not what UUIDs are for. In that case, I would suggest using a timestamp column, which broadly corresponds to the creation date of the record. For instance, the cursor could be the `created_at` of the item. This creates some difficulties, though:
- We have to make sure that the timestamp column has an index on it.
- What happens when two records have the same `created_at` value? The solution is to use a composite key: use the timestamp for ordering, and use the UUIDs for secondary ordering, to decide between records with the same timestamp. A composite key like this can be encoded and provided to the client, as sketched below.
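A minimal sketch, assuming a Postgres `todos` table with a `created_at` timestamp and a UUID `id`; the `:cursor_created_at` and `:cursor_id` placeholders stand for the decoded values of the cursor:

```sql
-- A composite index lets the database seek straight to the cursor position.
CREATE INDEX IF NOT EXISTS todos_created_at_id_idx ON todos (created_at, id);

-- Row-value comparison: "strictly after the cursor row" in (created_at, id)
-- order, so ties on created_at are broken deterministically by id.
SELECT *
FROM todos
WHERE (created_at, id) > (:cursor_created_at, :cursor_id)
ORDER BY created_at, id
LIMIT 10;
```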
An example implementation can be seen here.
# Pros and cons
- Cursor pagination is good because it’s efficient and reliable, even with very large datasets. Your database has to handle much less data, as it doesn’t have to read loads of unneeded rows into memory the way it does with offset pagination. This advantage can outweigh all the downsides if your dataset is large enough.
- Its downsides include the extra effort it needs: the server has to provide this functionality, and the client also has to implement the extra logic.
- It also makes concurrent requests impossible: you cannot fire off 10 async requests, each requesting a page of the results (as you could with the `offset` version), since you don’t know the cursor value needed for any subsequent request. For the same reason, the client can’t jump to a specific page.
# Try it yourself
I have put together a playground to experiment with pagination here. It includes an API that exposes data and offers pagination, both the limit + offset and the cursor-based kind. There are two clients: one hammers the API, updating and creating records; the other is a hypothetical consumer who wants to export all the data.
The results are the following: offset pagination can offer flexible sorting, but it does not produce consistent results - items are missed unless the ordering is by creation date, in ascending order. And if you don’t set any explicit ordering and rely on the database default, pagination is a mess: there are thousands of missed rows.
The cursor (keyset) based pagination, however, is faster and does not miss any records. Sorting is part of the implementation, so it’s less flexible: it’s only possible by the cursor column.
Thanks to all the authors of the articles below for helping me understand this topic!
# Resources & further reading
- A Hacker News discussion about keyset pagination: here
- A blog post from Slack engineering about how they do pagination: here
- Related Slack API docs: here
- JSON API specification of cursor pagination: here
- All you need is backend article
- A WordPress post
- A great article by Moesif
- Cursor-based pagination and UUIDs on Stack Overflow
- Cursor-based pagination with a non-unique key on Stack Overflow
- Use The Index, Luke post
- A Hackernoon post
- How Facebook offers pagination in their Graph API
- Why limit + offset is slow
- Paginating real-time data