Hi. I am Travis and I work in a place long ago abandoned by the gods: New York City. By day, I am a Software Engineer at Patreon. By night, I am a normal person.

Beware the ORM: Locking and Joins

There are a lot of people who have a lot of opinions about ORMs. This person and this other person both suggest that ORMs are an anti-pattern. I can't really empathize with these people at all, and I assume they cry themselves to sleep at night. Unlike them, I have things to do, which is why I'm a fan of tools like Rails that accelerate and simplify my development efforts.

99% of the time, I feel like Rails (and other ORMs) positively contribute to my code. They keep things simple, readable, and maintainable, which is a very important thing when you're onboarding new engineers. Furthermore, they provide pretty much all of the DB functionality I need at a fraction of the development effort. In the case of Rails, it's easy to build composable pieces of logic using scopes that lead to extremely expressive and elegant code.
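For example (a toy sketch of my own using a hypothetical model, not code from this post), scopes compose into a single query that reads almost like a sentence:

# Hypothetical model just to illustrate composable scopes.
class Invoice < ApplicationRecord
  scope :unpaid,       -> { where(paid: false) }
  scope :overdue,      -> { where('due_on < ?', Date.current) }
  scope :for_customer, ->(customer_id) { where(customer_id: customer_id) }
end

# Chaining the scopes produces one SQL query.
Invoice.unpaid.overdue.for_customer(42)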

Having said that, the one problem with ORMs is that, as developers, we almost completely stop paying attention to the SQL that's being generated. Again, 99% of the time this is totally fine and is, of course, the intention of the design pattern's abstraction. Once in a while, though, the wrong query will burn your ORM paradise to the ground. In my experience, this very frequently involves locking.

Let's say that we have a company where we sell some stuff to people, and then we have to ship that stuff to them from one of our warehouses. Here's what our models look like:

class User < ApplicationRecord
  has_many :orders
end

class Order < ApplicationRecord
  belongs_to :user
  belongs_to :warehouse
  has_many :order_items
end

class OrderItem < ApplicationRecord
  belongs_to :order
  belongs_to :product
end

class Product < ApplicationRecord
  # Has a uniquely identifying sku attribute
end

class Warehouse < ApplicationRecord
  # Has a uniquely identifying code attribute
end

Let's say that, for some reason, we want to fulfill orders for different warehouses on different days, and we do this per user since we need to charge their credit card or something. We do all of this in a transaction because we don't want to charge the user unless all of the orders successfully commit. Now we add some scopes, write some functions, and our code looks like this:

class User < ApplicationRecord
  has_many :orders

  def commit_orders_for_warehouse(warehouse_code)
    transaction do
      orders.with_status('pending')
        .shipping_from(warehouse_code)
        .lock
        .each(&:send_to_fulfillment!)

      amount_to_charge = orders.sum(:total_price)
      charge!(amount_to_charge)
    end
  end

  def charge!(amount)
    # Make a network call to a payment provider
    sleep 1
  end
end

class Order < ApplicationRecord
  belongs_to :user
  belongs_to :warehouse
  has_many :order_items

  scope :shipping_from, ->(code) { joins(:warehouse).where(warehouses: { code: code }) }

  scope :with_status, ->(status) { where(status: status) }

  def send_to_fulfillment!
    # Make a network call to a 3PL or something

    update_attributes!(status: 'fulfilled')
  end
end

We're making a few fatal mistakes here, the first of which I see constantly: making network calls in a transaction. This makes sense in theory: "I don't want to do the thing unless the other thing works." In practice, long-running transactions are bad, and the time spent on a network call (especially one we forgot to set a timeout on) is a lifetime in the methamphetamine nightmare that is your database engine. This is particularly damaging when we notice the other serious flaw in this code...
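(Before we get to that second flaw: the usual fix for the first one, sketched below with a hypothetical PaymentProvider and hypothetical statuses, is to keep only database work inside the transaction and make the network calls after it commits. The trade-off is that a charge that fails after commit now needs an explicit compensating step.)

# Anti-pattern: the row stays locked for the entire duration of the network call.
Order.transaction do
  order.lock!
  PaymentProvider.charge(order.total_price)  # slow, and maybe we forgot the timeout
  order.update!(status: 'paid')
end

# Better shape: the transaction covers only DB work; the network call happens after COMMIT.
order.with_lock { order.update!(status: 'charging') }
PaymentProvider.charge(order.total_price)    # a failure here must be handled explicitly
order.update!(status: 'paid')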

Let's take a look at the SQL generated for one user:

BEGIN
  SELECT "orders".* FROM "orders" INNER JOIN "warehouses" ON "warehouses"."id" = "orders"."warehouse_id" WHERE "orders"."user_id" = $1 AND "orders"."status" = $2 AND "warehouses"."code" = $3 FOR UPDATE  [["user_id", 1], ["status", "pending"], ["code", "NYC"]]
  SELECT  "warehouses".* FROM "warehouses" WHERE "warehouses"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  UPDATE "orders" SET "status" = $1 WHERE "orders"."id" = $2  [["status", "fulfilled"], ["id", 1]]
  SELECT  "warehouses".* FROM "warehouses" WHERE "warehouses"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  UPDATE "orders" SET "status" = $1 WHERE "orders"."id" = $2  [["status", "fulfilled"], ["id", 2]]
  SELECT  "warehouses".* FROM "warehouses" WHERE "warehouses"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  UPDATE "orders" SET "status" = $1 WHERE "orders"."id" = $2  [["status", "fulfilled"], ["id", 3]]
  SELECT SUM("orders"."total_price") FROM "orders" WHERE "orders"."user_id" = $1  [["user_id", 1]]
COMMIT

If we run these statements one at a time, we can analyze the locking behavior of our code (for reference, here's a good post on examining locks in Postgres):

test=# BEGIN;
BEGIN
test=# SELECT "orders".* FROM "orders" INNER JOIN "warehouses" ON "warehouses"."id" = "orders"."warehouse_id" WHERE "orders"."user_id" = 1 AND "orders"."status" = 'pending' AND "warehouses"."code" = 'NYC' FOR UPDATE;
 id | user_id | warehouse_id | status  | total_price
----+---------+--------------+---------+-------------
  1 |       1 |            1 | pending |          10
  2 |       1 |            1 | pending |          10
  3 |       1 |            1 | pending |          10
(3 rows)

test=# SELECT locktype, relation::regclass, mode, transactionid AS tid, virtualtransaction AS vtid, pid, granted FROM pg_locks;
   locktype    |    relation     |      mode       |  tid   |   vtid   |  pid  | granted
---------------+-----------------+-----------------+--------+----------+-------+---------
 relation      | pg_locks        | AccessShareLock |        | 2/298107 | 93051 | t
 relation      | warehouses_pkey | AccessShareLock |        | 2/298107 | 93051 | t
 relation      | orders_pkey     | AccessShareLock |        | 2/298107 | 93051 | t
 relation      | warehouses      | RowShareLock    |        | 2/298107 | 93051 | t
 relation      | orders          | RowShareLock    |        | 2/298107 | 93051 | t
 virtualxid    |                 | ExclusiveLock   |        | 2/298107 | 93051 | t
 transactionid |                 | ExclusiveLock   | 947996 | 2/298107 | 93051 | t
(7 rows)

Notice the line that mentions a RowShareLock on warehouses? Yeah... that's really bad. Because our Order scope joins to Warehouse, the FOR UPDATE locks not only the matching orders but also the matching warehouse row; our code has effectively turned the warehouse into a mutex, which makes absolutely no sense. The side effect is that we can only process one user per warehouse at a time. To illustrate, if we open another transaction in another session and try to run the same statement for user 2, we halt completely:

test=# BEGIN;
BEGIN
test=# SELECT "orders".* FROM "orders" INNER JOIN "warehouses" ON "warehouses"."id" = "orders"."warehouse_id" WHERE "orders"."user_id" = 2 AND "orders"."status" = 'pending' AND "warehouses"."code" = 'NYC' FOR UPDATE;

An easy solution is to split this into two queries: the first selects the IDs of the pending orders for the specific warehouse, and the second locks those orders by ID, without the join. This is what the resulting code looks like:

def commit_orders_for_warehouse(warehouse_code)
  transaction do
    order_ids = orders.with_status('pending')
      .shipping_from(warehouse_code)
      .pluck('orders.id')

    orders.where(id: order_ids)
      .lock
      .each(&:send_to_fulfillment!)

    amount_to_charge = orders.sum(:total_price)
    charge!(amount_to_charge)
  end
end
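
If you're on Postgres, another option (a sketch I haven't benchmarked here) is to keep the join but restrict the row locking to orders using Postgres's FOR UPDATE OF clause, which Rails lets you pass as a string to lock:

# Sketch: keep the join, but only lock rows from the orders table.
orders.with_status('pending')
  .shipping_from(warehouse_code)
  .lock('FOR UPDATE OF orders')
  .each(&:send_to_fulfillment!)

The numbers below are for the two-query version.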

With the sleep statement in our charge! method, we can easily see how the before and after code compare. I created five users with orders shipping from NYC. Here's some code to time the processing for all of those users:

def self.run_all_users(warehouse_code)
  Benchmark.ms do
    threads = []

    User.all.each do |user|
      threads << Thread.new { user.commit_orders_for_warehouse(warehouse_code) }
    end

    threads.each(&:join)
  end
end

# Before the code change
irb(main):050:0> User.run_all_users('NYC')
=> 5274.9072830192745

# After
irb(main):005:0> User.run_all_users('NYC')
=> 1099.306590971537

The results are pretty stark. The first scenario takes about 5 seconds (1 second per user, since the threads block one another), while the second takes just over a second (the threads actually run concurrently). This also presupposes that you even have a connection pool big enough for all of these threads to sit around and wait in (I set mine to 10). More likely, many of these threads would start raising ActiveRecord::ConnectionTimeoutError.

Caught up in your intense love affair with Rails, you thought you were safe from these horrors. Locking all of the orders was the right thing to do, but because of your beautiful yet naive code, you created a huge bottleneck and brought your site (or more likely your job queue) to a screeching halt.

Perhaps this is an argument against ORMs, but more likely it's a testament to the complexity of our systems and the need for developers to continuously improve their understanding of the technologies therein. Yes, the goal of an ORM is to abstract away interactions with the database, but that doesn't mean you can pretend it isn't there. As DHH eloquently put it, Ruby, Rails, and other ORMs are "sharp knives", and developers must be empowered to make better decisions "by convention, by nudges, and through education. Not by banning sharp knives from the kitchen and insisting everyone use spoons to slice tomatoes."
