Chosen links

Links - 7th March 2021

Postgres scaling advice for 2021

Try to avoid the serious mistake of having your data “architecture” centered around a single huge table

You’d be surprised how often we see that… so slice and dice early, or set up some partitioning. Partitioning also does a lot of good for the long-term health of the database, since it allows multiple autovacuum workers to run against the same logical table, and it can speed up IO considerably on enterprise storage. If IO does become a bottleneck at some point, you can employ Postgres’ native remote partitions, so that some older data lives on another node.
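
For illustration, here is a minimal sketch of the “slice and dice early” idea using declarative range partitioning; all table, column and server names below are made up for the example:

-- A time-partitioned table; each partition is vacuumed independently,
-- so several autovacuum workers can run against the same logical table.
CREATE TABLE measurements (
    device_id   bigint NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY RANGE (recorded_at);

CREATE TABLE measurements_2021_03 PARTITION OF measurements
    FOR VALUES FROM ('2021-03-01') TO ('2021-04-01');

-- If IO becomes a bottleneck, an older partition can live on another node
-- as a remote partition, assuming a postgres_fdw server named archive_node
-- has already been defined.
CREATE FOREIGN TABLE measurements_2020 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01')
    SERVER archive_node;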

Make sure to “bake in” a proper sharding key for your tables/databases

Initially, the data can just reside on a single physical node. If your data model revolves around the “millions of independent clients” concept for example, then it might even be best to start with many “sharded” databases with identical schemas, so that transferring out the shards to separate hardware nodes will be a piece of cake in the future.
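
As a hedged sketch of what “baking in” a sharding key might look like, assuming a per-client key named client_id (all names invented for the example):

-- Every table carries the client it belongs to, and the sharding key
-- leads every primary key, so one client's rows can later be moved to
-- a separate node wholesale.
CREATE TABLE clients (
    client_id bigint PRIMARY KEY,
    name      text NOT NULL
);

CREATE TABLE orders (
    client_id  bigint NOT NULL REFERENCES clients (client_id),
    order_id   bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (client_id, order_id)
);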

Best Practice: Application Frameworks

Another challenge is that adding more hops between microservices may add latency to the overall request. In some cases, this latency can be mitigated by architectural improvements that happen as part of a microservices rewrite. For its microservices platform, Google has also ensured that requests between services that happen to be co-located in the same server use an optimized in-process transport.

The Unexpected Find That Freed 20GB of Unused Index Space

While we were looking at the sizes of the indexes after we finished rebuilding them, an interesting thing caught our eye.

One of our largest tables stores transaction data. In our system, after a payment is made, the user can choose to cancel and get a refund. This does not happen very often; only a fraction of the transactions end up being cancelled.

In our transactions table, there are foreign keys to both the purchasing user and the cancelling user, and each field has a B-Tree index defined on it. The purchasing user has a NOT NULL constraint on it so all the rows hold a value. The cancelling user on the other hand, is nullable, and only a fraction of the rows hold any data. Most of the values in the cancelling user field are NULL.

We expected the index on the cancelling user to be significantly smaller than the index on the purchasing user, but they were exactly the same. Coming from Oracle, I was always taught that NULLs are not indexed, but in PostgreSQL they are! This “Aha” moment led us to the realization that we were indexing a lot of unnecessary values for no reason.

This was the original index we had for the cancelling user:

CREATE INDEX transaction_cancelled_by_ix
ON transactions(cancelled_by_user_id);

To check our thesis, we replaced the index with a partial index that excludes null values:

DROP INDEX transaction_cancelled_by_ix;

CREATE INDEX transaction_cancelled_by_part_ix
ON transactions(cancelled_by_user_id)
WHERE cancelled_by_user_id IS NOT NULL;

The full index after we reindexed it was 769MB in size, with more than 99% null values. The partial index that excluded null values was less than 5MB. That’s more than 99% of dead weight shaved off the index!
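
For reference, index sizes like the ones quoted above can be checked with Postgres’ built-in size functions; a minimal example against the article’s index name:

-- Report the on-disk size of a single index in human-readable form.
SELECT pg_size_pretty(pg_relation_size('transaction_cancelled_by_part_ix'));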

A Data Pipeline is a Materialized View

To update a materialized view, there are two high-level properties you typically care about: the update trigger, and the update granularity. The former affects the freshness of your output, which impacts end-users of the data, and the latter affects the performance of your update process, which impacts the engineers or operators responsible for that process.

Update Trigger

The update trigger is the event that prompts a refresh of the materialized view, e.g. by running your pipeline against the latest source data.

That event may be a file landing in a shared drive, or some data arriving on an event stream, or another pipeline completing. For some pipelines, the update trigger may just be a certain time of day, in which case it might be more useful to talk about the update frequency rather than trigger.

A typical batch pipeline, for example, might run on a daily or hourly cadence, whereas a streaming pipeline may run every few seconds or minutes, or whenever a new event is delivered via some sort of event stream. Whenever the pipeline runs, it updates its output, and the whole process can be viewed as a refresh of the materialized view.
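
To make the time-of-day trigger concrete in the materialized-view framing: with something like the pg_cron extension, a nightly refresh could be scheduled roughly as follows (the view name and schedule are assumptions for illustration):

-- Run a full refresh of the hypothetical view every night at 01:00.
-- CONCURRENTLY keeps the view readable during the rebuild, at the cost
-- of requiring a unique index on it.
SELECT cron.schedule(
    '0 1 * * *',
    'REFRESH MATERIALIZED VIEW CONCURRENTLY top_selling_authors'
);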

Update Granularity

The update granularity refers to how much of the materialized view needs to be modified to account for the latest changes to the source data.

A common update granularity is the full refresh. No matter how small or large the change to the source data, when the pipeline runs it throws away the entire output table and rebuilds it from scratch.

A more sophisticated pipeline might rebuild only a subset of the table, like a date partition. And an extremely precise pipeline may know how to update exactly the output rows that are impacted by the latest changes to the source data.
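
A date-partition rebuild of that kind could be sketched as deleting and recomputing just one day of the output (all names assumed):

BEGIN;
-- Rebuild only the affected date slice of the output table.
DELETE FROM daily_author_sales WHERE sale_date = DATE '2021-03-06';
INSERT INTO daily_author_sales (sale_date, author_id, total)
SELECT purchased_at::date, author_id, sum(amount)
FROM purchases
WHERE purchased_at::date = DATE '2021-03-06'
GROUP BY purchased_at::date, author_id;
COMMIT;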

The update trigger and granularity are independent. You can have a pipeline that runs every second and does a full refresh of its output, and you can have a pipeline that runs once a day but carefully updates only the rows that it needs to.

Typical Examples

Let’s explore these two properties a bit using our example pipeline that computes the top-selling authors of the month.

The Daily Batch Update

Every night at 1 a.m., an automated process looks for a dump of the latest purchases from the previous day. The dump is a compressed CSV file.

The update process uses this dump to recompute the month’s sales numbers for all authors. It replaces the entire output table with all-new calculations for all authors. Many of the authors' numbers may not have changed since the last update (because they had no new sales in that time period), but they all get recomputed nonetheless.

This is a very typical example of a batch pipeline. It has a scheduled update trigger at 1 a.m. every night, and an update granularity of the entire output.
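
In SQL, that nightly full refresh might look roughly like the following; the file path, table names and columns are all invented for the sketch:

BEGIN;
-- Append the previous day's compressed dump to the purchases table
-- (COPY ... FROM PROGRAM needs appropriate server-side privileges).
COPY purchases (author_id, amount, purchased_at)
FROM PROGRAM 'gunzip -c /imports/purchases_latest.csv.gz'
WITH (FORMAT csv, HEADER);

-- Throw away the whole output and recompute every author's monthly
-- total, whether or not their numbers changed.
TRUNCATE top_selling_authors;
INSERT INTO top_selling_authors (author_id, month_total)
SELECT author_id, sum(amount)
FROM purchases
WHERE purchased_at >= date_trunc('month', now())
GROUP BY author_id;
COMMIT;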

The Live-Updating Table

In this version of our top-selling authors pipeline, individual purchases are streamed in as they happen, via a stream processor like Apache Kafka. Every purchase on this stream triggers an update to the calculation of top-selling authors.

The update process uses each individual purchase to incrementally recompute the sales total for the relevant author. If an author has no new sales over a given span of updates, their sales total is not recomputed (though their rank in the top-selling authors may need to be updated).

This is an example of a precise streaming pipeline. The update trigger is the purchase event that is streamed in, and the update granularity is the sales total for a single author.
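
The per-author granularity maps naturally onto an upsert keyed on the author; a consumer of the purchase stream might run something like this for each event (names assumed, with $1 and $2 bound from the event, and author_id carrying a unique constraint):

-- Fold a single purchase into the running total for its author.
INSERT INTO top_selling_authors (author_id, month_total)
VALUES ($1, $2)  -- author_id and purchase amount taken from the event
ON CONFLICT (author_id)
DO UPDATE SET month_total = top_selling_authors.month_total + EXCLUDED.month_total;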

Weird architectures weren’t supported to begin with

Program packagers and distributors (frequently separate from project maintainers themselves) are very used to C’s universal presence. They’re so used to it that they’ve built generic mechanisms for putting entire distributions onto new architectures with only a single assumption: the presence of a serviceable C compiler.

A long term solution to the problem of support for platforms not originally considered by project authors is going to be two-pronged:

  • Builds need to be observable and reviewable: project maintainers should be able to get the exact invocations and dependencies that a build was conducted with and perform automatic triaging of build information. This will require environment and ecosystem-wide changes: object and packaging formats will need to be updated; standards for metadata and sharing information from an arbitrary distributor to a project will need to be devised. Reasonable privacy concerns about the scope of information and its availability will need to be addressed.

  • Reporting needs to be better directed: individual (minimally technical!) end users should be able to figure out what exactly is failing and who to phone when it falls over. That means rigorously tracking the patches that distributors apply (see build observability above) and creating mechanisms that deliver information to the people who need it. Those same mechanisms need to support interaction: there’s nothing worse than a flood of automated bug reports with insufficient context.

I put this one last because it’s flippant, but it’s maybe the most important one: outside of hobbyists playing with weird architectures for fun (and accepting the overwhelming likelihood that most projects won’t immediately work for them), open source groups should not be unconditionally supporting the ecosystem for a large corporation’s hardware and/or platforms.

RIP Flash

Unlike its namesake, Flash lived a long life and died a slow death. It was given to the world with the intent of making advertisements move and play sound. However, the world learned to instead be moved and play games.

Common management failures in developing individual contributors

When you don’t give your team the context for the work and just pass on tasks and work items to them, you make it clear that they are simply “doers” and your job is the job of the “decider”. There is a fine line between giving the team focus time and excluding them from meetings where they would get the necessary information and context to feel ownership of the projects. Your growth challenge is to learn the balance of providing information to the team and inviting them along to get that information, while not overwhelming them with meetings.