Hi. I am travis and I live and work in a place long ago abandoned by the gods: New York City. By day, I am a web developer at Blue Apron. By night, i am a normal person.

Postgres: Adding Foreign Keys With Zero Downtime

When I think about foreign keys in Postgres, I shudder with fear recalling the pain and suffering they've caused me. Adding foreign key constraints, like many schema changes in Postgres (and pretty much all database engines for that matter), should never be taken lightly. If you underestimate the impact that adding them can have on your running application, you will be severely burned.

Fortunately, there are a few sweet morsels of knowledge that will spare you this pain. Let's imagine that we have a very innovative app with a users table. Let's get real crazy and pretend that we also have an addresses table with a user_id column referring to--stay with me--the users table. This is a perfect candidate for a foreign key constraint!

The first very important thing to understand is that foreign key constraints operate using triggers. For our addresses table, we need to guarantee that when we insert a row, it refers to an actual user. Similarly, when we delete a user row, we want to make sure there's not some orphaned address somewhere, cold and alone. This is accomplished by adding a trigger to both tables. So even though we're altering the addresses table, we also acquire an AccessExclusiveLock on users. To illustrate:

myapp=# BEGIN;
BEGIN
myapp=# ALTER TABLE addresses ADD CONSTRAINT "ive_made_a_huge_mistake" FOREIGN KEY ("user_id") REFERENCES "users" ("id");
ALTER TABLE
myapp=# SELECT locktype, relation::regclass, mode, transactionid AS tid, virtualtransaction AS vtid, pid, granted FROM pg_locks;
   locktype    |          relation          |        mode         |   tid   |  vtid   |  pid  | granted
---------------+----------------------------+---------------------+---------+---------+-------+---------
 relation      | pg_locks                   | AccessShareLock     |         | 2/80438 | 56984 | t
 relation      | users_pkey                 | AccessShareLock     |         | 2/80438 | 56984 | t
 relation      | index_addresses_on_user_id | AccessShareLock     |         | 2/80438 | 56984 | t
 relation      | addresses_pkey             | AccessShareLock     |         | 2/80438 | 56984 | t
 virtualxid    |                            | ExclusiveLock       |         | 2/80438 | 56984 | t
 relation      | addresses                  | AccessShareLock     |         | 2/80438 | 56984 | t
 relation      | addresses                  | AccessExclusiveLock |         | 2/80438 | 56984 | t
 transactionid |                            | ExclusiveLock       | 3919702 | 2/80438 | 56984 | t
 relation      | users                      | AccessShareLock     |         | 2/80438 | 56984 | t
 relation      | users                      | RowShareLock        |         | 2/80438 | 56984 | t
 relation      | users                      | AccessExclusiveLock |         | 2/80438 | 56984 | t
(11 rows)

Generally, a table like users will get a lot of attention in an app, so this lock will be a huge problem. Everything conflicts with this lock, including reads (simple SELECT statements). In other words, literally every statement referencing users occurring after the ALTER TABLE statement will start to queue up. If anything slow is in front of the foreign key in the queue (like perhaps a long-running job that selected a single user row) you've essentially brought the entire database to a halt.

The second thing you should know about foreign key constraints is that they need to be validated, i.e. Postgres needs to make sure that the existing data honors the rule (again, no orphaned addresses). It does this by running an anti-join query returning addresses without a corresponding user:

myapp=# EXPLAIN SELECT fk."user_id" FROM ONLY "public"."addresses" fk LEFT OUTER JOIN ONLY "public"."users" pk ON ( pk."id" OPERATOR(pg_catalog.=) fk."user_id") WHERE pk."id" IS NULL AND (fk."user_id" IS NOT NULL);
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Hash Anti Join  (cost=399473.67..644282.78 rows=1 width=4)
   Hash Cond: (fk.user_id = pk.id)
   ->  Seq Scan on addresses fk  (cost=0.00..135949.93 rows=7098976 width=4)
         Filter: (user_id IS NOT NULL)
   ->  Hash  (cost=335247.39..335247.39 rows=8671795 width=4)
         ->  Seq Scan on users pk  (cost=0.00..335247.39 rows=8671795 width=4)
(6 rows)

...yeahhhh. Basically, this will be totally fine if your addresses table is empty, and terrible if it's not. What have we learned so far? Yes...never use foreign keys. They will murder your family and reduce your happiness to ashes.

There's no good way to deal with the first problem. Try to introduce foreign keys when traffic is lowest, disable any non-web processes, and keep your eyes on the current locks. For the second problem, there is a simple strategy which substantially reduces this risk. Postgres allows you to create an invalid FK constraint and skip the giant query by specifying NOT VALID in the ALTER TABLE statement. This allows you to quickly introduce an invalid foreign key in one transaction and then validate it in a separate transaction. Validation only acquires a RowShareLock on the users table which does not block reads or writes!

myapp=# BEGIN;
BEGIN
myapp=# ALTER TABLE addresses ADD CONSTRAINT "i_dont_regret_this" FOREIGN KEY ("user_id") REFERENCES "users" ("id") NOT VALID;
ALTER TABLE
myapp=# COMMIT;
COMMIT
myapp=# BEGIN;
BEGIN
myapp=# ALTER TABLE addresses VALIDATE CONSTRAINT "i_dont_regret_this";
ALTER TABLE
myapp=# SELECT locktype, relation::regclass, mode, transactionid AS tid, virtualtransaction AS vtid, pid, granted FROM pg_locks;
   locktype    |          relation          |           mode           |   tid   |  vtid   |  pid  | granted
---------------+----------------------------+--------------------------+---------+---------+-------+---------
 relation      | users_pkey                 | AccessShareLock          |         | 2/80443 | 56984 | t
 relation      | index_addresses_on_user_id | AccessShareLock          |         | 2/80443 | 56984 | t
 relation      | addresses_pkey             | AccessShareLock          |         | 2/80443 | 56984 | t
 relation      | addresses                  | AccessShareLock          |         | 2/80443 | 56984 | t
 relation      | users                      | AccessShareLock          |         | 2/80443 | 56984 | t
 relation      | users                      | RowShareLock             |         | 2/80443 | 56984 | t
 relation      | pg_locks                   | AccessShareLock          |         | 2/80443 | 56984 | t
 virtualxid    |                            | ExclusiveLock            |         | 2/80443 | 56984 | t
 relation      | addresses                  | ShareUpdateExclusiveLock |         | 2/80443 | 56984 | t
 transactionid |                            | ExclusiveLock            | 3919706 | 2/80443 | 56984 | t
(10 rows)

This strategy, while not the most elegant thing in the world, works well assuming your data is actually valid. If you have any references to non-existent rows, Postgres will freak out, so make sure you DELETE the orphans before running it. If your data is fine, then you'll successfully create a foreign key and you can tell all your friends and dog about it.

P.S. Currently the only way to accomplish this in Rails is with execute statements in migrations, but I have a PR to change that. Hopefully someday this will be a little bit easier to accomplish using built-in Rails code.

Postgres and the Rails Sandbox