PostgreSQL Slow Cascading Deletes
This is less a thing I recently learned than one that I have re-learned multiple times through real-world experience. Imagine that you are migrating a reasonably sized dataset from a document store to PostgreSQL. You have enabled dual writes and you are progressing slowly through a backfill process when, sadly, latency begins to spike, and you find that all the slow queries are DELETE statements and that PostgreSQL is consuming quite a bit of CPU trying to complete these deletes. When you look at the schema, you see that the deletes are associated with a foreign key constraint that specifies ON DELETE CASCADE. You might be tempted, as many on Stack Overflow have done, to conclude generally that "DELETE CASCADE in PostgreSQL extremely slow". That's not strictly true, though. Let's take a look at an example and explain why this happens.
First, let's create a couple of tables with the kind of foreign key constraint that we're talking about.
CREATE TABLE IF NOT EXISTS people (
id INTEGER PRIMARY KEY,
name VARCHAR(255) NOT NULL);
CREATE TABLE IF NOT EXISTS phone_numbers (
id INTEGER PRIMARY KEY,
person_id INTEGER NOT NULL REFERENCES people(id)
ON DELETE CASCADE ON UPDATE CASCADE,
number VARCHAR(255) NOT NULL);
Now, let's run a little experiment where we insert N rows into the people table and N rows into the phone_numbers table. Then we'll simply delete all the rows in the people table (which will also delete all the rows in the phone_numbers table) using the following statement:

DELETE FROM people;

We'll do this for increasing values of N, and we'll also do the same thing in MySQL to see whether PostgreSQL deserves its reputation for slow cascading deletes.
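For reference, here is a minimal sketch of how this experiment can be run in psql, using generate_series to produce the rows and psql's \timing to measure the delete. The data-generation details (names, phone numbers, N = 25,000) are illustrative, not the exact harness behind the numbers below.

-- Populate both tables with N rows (N = 25,000 shown here).
INSERT INTO people (id, name)
SELECT i, 'person ' || i
FROM generate_series(1, 25000) AS i;

INSERT INTO phone_numbers (id, person_id, number)
SELECT i, i, '555-' || lpad(i::text, 7, '0')
FROM generate_series(1, 25000) AS i;

-- Turn on client-side timing, then run the cascading delete.
\timing on
DELETE FROM people;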
The results of this experiment clearly show that there is some merit to the claim that PostgreSQL has slow cascading deletes. MySQL is able to delete all 25,000 people rows along with their associated phone numbers in 174ms. PostgreSQL, on the other hand, takes over 14 seconds. Such a massive difference in performance is a good indicator that this is not merely a matter of PostgreSQL being slower than MySQL. Let's see what indexes were created in PostgreSQL when we built our tables earlier.
# SELECT tablename, indexname
FROM pg_indexes
WHERE schemaname = 'public';
tablename | indexname
---------------+--------------------
people | people_pkey
phone_numbers | phone_numbers_pkey
(2 rows)
The only indexes that we have are the primary key indexes for each of the tables. When we delete a row from people, we have to be able to find the corresponding rows in phone_numbers that have the same person_id in order to delete them as well. As you can see, though, there is no index on the person_id column in phone_numbers, so each cascading delete requires a full table scan of phone_numbers.
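You can see this cost directly in PostgreSQL, because EXPLAIN ANALYZE on a delete reports the time spent in the foreign key's internal trigger. A sketch of the check (the constraint name in the comment follows PostgreSQL's default naming; exact output will vary):

EXPLAIN ANALYZE DELETE FROM people WHERE id = 1;
-- Look for a line like "Trigger for constraint phone_numbers_person_id_fkey"
-- in the output; without an index on person_id, nearly all of the
-- statement's time is spent in that trigger.

Does this suggest that MySQL builds indexes automatically that PostgreSQL does not? Let's see.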
mysql> SELECT table_name, index_name
FROM INFORMATION_SCHEMA.STATISTICS
WHERE table_name IN ('people', 'phone_numbers');
+---------------+------------+
| TABLE_NAME | INDEX_NAME |
+---------------+------------+
| people | PRIMARY |
| phone_numbers | PRIMARY |
| phone_numbers | person_id |
+---------------+------------+
3 rows in set (0.004 sec)
It appears so. MySQL not only has an index on the primary key of phone_numbers but has also added an index on person_id. This explains why PostgreSQL is so slow relative to MySQL. Let's add that index in PostgreSQL to confirm.
CREATE INDEX phone_numbers_person_id_idx ON phone_numbers (person_id);
As expected, once you add the missing index, PostgreSQL's performance is on par with MySQL's (a bit better, actually).
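To verify the mechanism rather than just the wall-clock time, you can re-run the EXPLAIN ANALYZE check from earlier (shown here against freshly repopulated tables); with the index in place, the foreign key trigger should account for only a tiny fraction of the statement's time.

EXPLAIN ANALYZE DELETE FROM people WHERE id = 2;
-- The "Trigger for constraint ..." line should now report a per-call
-- time orders of magnitude smaller than before the index existed.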
Is this on purpose?
It seems so. Let's look at the PostgreSQL documentation first.
Since a DELETE of a row from the referenced table or an UPDATE of a referenced column will require a scan of the referencing table for rows matching the old value, it is often a good idea to index the referencing columns too. … declaration of a foreign key constraint does not automatically create an index on the referencing columns. PostgreSQL 17.5: 5.5.5. Foreign Keys
While the documentation doesn't explicitly say why PostgreSQL decided not to create an index automatically in this case, its chapter on indexes suggests using them "sensibly", and leaving index design to the application developer seems consistent with that principle.
Indexes are a common way to enhance database performance. An index allows the database server to find and retrieve specific rows much faster than it could do without an index. But indexes also add overhead to the database system as a whole, so they should be used sensibly. PostgreSQL 17.5: 11. Indexes
And if we look at the documentation for MySQL, we see that while they do create these indexes automatically, there seems to be more complexity than is apparent in our trivial example.
MySQL requires indexes on foreign keys and referenced keys so that foreign key checks can be fast and not require a table scan. In the referencing table, there must be an index where the foreign key columns are listed as the first columns in the same order. Such an index is created on the referencing table automatically if it does not exist. This index might be silently dropped later if you create another index that can be used to enforce the foreign key constraint. index_name, if given, is used as described previously. MySQL 8.4: 15.1.20.5 FOREIGN KEY Constraints
As you can see, MySQL will create an index on the referencing table if one doesn't already exist, but it can use another existing index as long as that index leads with the necessary key columns. The docs also say that if an index is later created that can be used to enforce the foreign key constraint, the automatically created index may be silently dropped. I can understand why PostgreSQL would want to avoid this complexity and ultimately defer to the developer to decide on a "sensible" approach.
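To make that behavior concrete, here is a sketch (the composite index name is my own illustration) of declaring the foreign key when a usable index already exists. Because person_id is the leading column of person_id_number_idx, MySQL can use it to enforce the constraint and doesn't need to create a separate index:

-- MySQL: the existing composite index satisfies the foreign key,
-- so no additional index is created automatically.
CREATE TABLE phone_numbers (
id INTEGER PRIMARY KEY,
person_id INTEGER NOT NULL,
number VARCHAR(255) NOT NULL,
INDEX person_id_number_idx (person_id, number),
FOREIGN KEY (person_id) REFERENCES people(id)
ON DELETE CASCADE ON UPDATE CASCADE);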
Conclusion
Modern relational databases are complex, and I think there is a tendency to treat them as if they were somewhat magical. But despite the many cool things they can do, the basic data structures used to store and retrieve data haven't really changed in decades, and they have tradeoffs. Every index you add consumes additional storage and adds write overhead; at the same time, you have to make sure that you have indexes to support the read patterns your application needs. These are design tradeoffs that simply cannot be made in the database itself. It's admirable that MySQL has attempted to address this foot gun. At the same time, you can understand why PostgreSQL would want to avoid silently creating indexes that were never requested.
After dealing with relational databases for many years as a product-level engineer, I've become pretty suspicious of any kind of implicit magic at the storage level. I've also had all of my initial hopes that code reviews would be effective at reviewing schema changes crushed repeatedly. The one thing that has saved me time and time again, though, has been runtime logging of slow or unindexed queries, via MySQL's Slow Query Log or the pg_stat_statements extension in PostgreSQL.
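As a rough sketch of what enabling that looks like (the thresholds here are illustrative; tune them to your own latency budget):

# postgresql.conf -- the preload setting requires a server restart
shared_preload_libraries = 'pg_stat_statements'
log_min_duration_statement = 250   # log any statement slower than 250ms

# my.cnf
slow_query_log = 1
long_query_time = 0.25             # seconds
log_queries_not_using_indexes = 1

After restarting PostgreSQL, run CREATE EXTENSION pg_stat_statements; in the database you want to inspect to expose the statistics view.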