A note to my future self about performance, dealing with large datasets, and incremental improvements.
# A problem
Hello. You might be writing a software component, probably a piece of backend code that uses a pretty standard stack, like Python & SQLAlchemy over some relational database. Your task might be to implement something relatively straightforward. It might even be trivial.
But when you are done coding the logic of your component (almost in pseudocode style, in Python - it's readable, it's clear, it's fun), you might realise that your job hasn't even begun. You find yourself looking at logs and metrics, and it seems like your code just doesn't cut it: it's too slow, it cannot keep up with traffic, latency is too high, a request takes too long to complete. Maybe it's a data migration that takes ages and cannot work through a large backlog in a queue or a database table. Or you might get errors in Sentry saying that some other service cannot execute a statement due to a lock timeout - your code has held an exclusive lock for too long. You start wondering: how can I make this thing faster?
# Reflection and asking for help
I’ve found myself in this situation recently. My task was a logically trivial script that had to backfill a column in a table. Except that the amount of data to backfill was large: millions of rows. The fact that those tables were in constant use by other services didn’t help. I scratched my head, then asked for advice.
I was fortunate enough to work with more experienced developers who, on multiple occasions, pointed me in the right direction. I would like to put into writing the things I learned during this process, so my future self can remember them better (or read this if they have forgotten).
# Take one - the database
Let’s suppose that the task at hand is querying a large table (tens of millions of rows), and populating a column by looking for a corresponding row in another table (but it could be any backend task really - I think the points below would still apply). After writing the initial implementation, see where it can be optimised.
If you are using the ORM, can you use it in a way that makes fewer queries?
If you are iterating over a collection of ORM models, and updating them one by one using a property setter, is there a way to avoid this? Perhaps by using the SQLAlchemy expression language? If your queries are relatively simple (and are not likely to change), could you use raw SQL instead?
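For example, instead of loading every row as a model and setting the attribute in a Python loop, a single set-based UPDATE pushes the work to the database. Here is a minimal sketch using the stdlib sqlite3 module so it runs anywhere; the table and column names are invented for illustration, and with SQLAlchemy you would express the same statement via the expression language or raw SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, user_email TEXT);
    INSERT INTO users  VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO orders (id, user_id) VALUES (10, 1), (11, 2);
""")

# One set-based statement instead of a Python loop that loads each
# ORM instance, sets the attribute, and flushes N separate UPDATEs.
conn.execute("""
    UPDATE orders
    SET user_email = (SELECT email FROM users WHERE users.id = orders.user_id)
    WHERE user_email IS NULL
""")
conn.commit()
```

The database does the join and the writes in one round trip, which is usually dramatically cheaper than millions of per-row updates from the application.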
If you have an intuition that DB interactions are slow: prove it. Use `EXPLAIN ANALYZE` to see the query plan. Try to make sense of it and see what has the largest cost in the plan. Is there a sequential scan that could be avoided?
Could you add an index to a column which would not be very expensive, but would increase performance? Does it make sense to add an index to that column? If values in that column are very common (the result set returned by the `WHERE` clause is more than around 10% of the whole table), then it might not be useful at all, as a sequential scan will be faster. Is it critical that this table have optimal write performance? Again, an index can affect the time it takes for inserts and updates to be committed.
Experiment by adding the index in your local database, and inspecting the results of `EXPLAIN ANALYZE` before and after. Take the results of your experiment with a pinch of salt: you cannot assume that these results on a small dataset will match the plan on the production database, since query costs are not linear with respect to the size of the dataset.
If you are sorting by anything in your queries, think carefully: do you really need to sort? Do you really need that `ORDER BY`? Does the column you are sorting on have an index?
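On Postgres you would run `EXPLAIN ANALYZE` for this; as a self-contained stand-in, SQLite's `EXPLAIN QUERY PLAN` illustrates the same idea - the plan switches from a full table scan to an index search once the index exists (table and index names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany("INSERT INTO events (kind) VALUES (?)",
                 [("click",), ("view",)] * 50)

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the plan detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM events WHERE kind = 'view'"
before = plan(query)  # full table scan
conn.execute("CREATE INDEX ix_events_kind ON events (kind)")
after = plan(query)   # now searches via the index
print(before)
print(after)
```

The same before-and-after comparison on your local Postgres, with realistic data volumes, is a cheap way to sanity-check an index before proposing it for production.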
# Take two - the code
On the application level, do you know where most of the time is spent? Are you sure it is mostly spent interacting with the DB? If so, perhaps there are other things your app could do while waiting for results from the database. In that case your code may be I/O bound, and it could benefit from multithreading. Even if you don't want to bother with asyncio (but maybe you should learn how to use it, for fun and for profit), you could potentially leverage the `concurrent.futures` module. In an I/O-bound process, the `ThreadPoolExecutor` can come in handy: while one thread is waiting on a network request or a database call, another can do some useful work. This can yield significant performance benefits.
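A minimal sketch of that pattern - here `fetch` is a made-up stand-in for a network or database call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(row_id):
    # Stand-in for an I/O-bound call (network request, database query).
    time.sleep(0.05)
    return row_id * 2

ids = range(20)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, ids))   # threads overlap the waiting
threaded = time.perf_counter() - start

start = time.perf_counter()
serial_results = [fetch(i) for i in ids]   # one call at a time
serial = time.perf_counter() - start
```

`pool.map` preserves input order, so the results are identical to the serial version; only the waiting overlaps. Note that this helps for I/O-bound work - for CPU-bound work the GIL means you would reach for multiprocessing instead.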
Try profiling your code to see where most of the time is spent. It might be as simple as running `python -m cProfile myscript.py` and carefully inspecting the output. Use the `-s` option to specify the key on which to sort - see the docs.
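The same data is available programmatically via `cProfile` and `pstats`, which can be handy when you only want to profile one part of a script - `busy_work` here is just a made-up example function:

```python
import cProfile
import io
import pstats

def busy_work():
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

# Sort by cumulative time, like `python -m cProfile -s cumulative myscript.py`.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time surfaces the call trees where your script actually spends its life, which is usually the right starting point.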
Tracing can help too, and is not too much work to set up. Some libraries are very well integrated with the OpenTracing/OpenCensus stack, SQLAlchemy and Requests being two examples. You can introduce tracing into your code to see even more information about exactly what actions are taken - see the docs.
# Wrapping up
There is always something you can do to improve the performance of your code. To sum up, these steps could be worth looking into:
- Profiling your code: `python -m cProfile <your script>.py`
- Optimising your database calls
  - run an `EXPLAIN ANALYZE` and inspect the output
  - avoid unneeded ordering
  - think about indexes
  - think about doing the query in batches
  - think about using the SQLAlchemy expression language (if applicable)
- Giving concurrency some thought
  - is your process I/O bound or CPU bound?
  - could it benefit from threading, or multiprocessing?
  - OpenCensus/OpenTracing integration could give you more insight into the bottleneck
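The batching point deserves a quick sketch: processing a big table in small chunks, committing after each one, keeps every transaction short, so other services are not stuck behind a long-held lock. Again using stdlib sqlite3 with invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, processed INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO items (processed) VALUES (?)", [(0,)] * 25)
conn.commit()

BATCH_SIZE = 10
while True:
    # Pick up one small batch of unprocessed rows.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM items WHERE processed = 0 ORDER BY id LIMIT ?",
        (BATCH_SIZE,),
    )]
    if not ids:
        break
    placeholders = ",".join("?" * len(ids))
    conn.execute(f"UPDATE items SET processed = 1 WHERE id IN ({placeholders})", ids)
    conn.commit()  # short transaction: locks are released after every batch
```

The loop terminates because each batch shrinks the set of unprocessed rows, and a crash midway simply leaves the remaining rows for the next run.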
But first, take a nap, go for a walk, pet the dog. Then, read that documentation, try some things and see what works - and keep learning.