Albert Wang

A Years Old Revenue-hurting, Client-facing Database Locking Bug

How can you find a race-condition bug that isn't picked up by your tests nor detectable by alerting like Sentry?

That happened to me, where we were building out an appointment system where we were linking clients with therapists.

The Scenario:

At Soothe, our client (the user) would make a booking for a massage which we represent in our database as an AppointmentRequest. We'd create an AppointmentOffer for each therapist, so when a therapist opens up their feed, they'd see their AppointmentOffers.

After a therapist accepts an appointment offer, the status on the AppointmentOffer goes from pending to accepted, and any other therapist who had an AppointmentOffer for that AppointmentRequest became expired. An Appointment gets created, and both clients and therapists have an Appointment feed which shows their upcoming Appointment.

The Issue

This doesn't seem like there's anything wrong with the above scenario, until we found out that multiple therapists would try to accept their AppointmentOffer, creating more Appointment than we had expected. I only knew because I was tracking on Slack our call center about issues that they thought. If more therapists showed up than expected, we'd still end up paying them. This was showing up infrequently at first, but grew as the company grew.

We confirmed this issue by querying our database where there were more Appointment than expected.

Discussing The Solution

It was immediately clear that we had a locking issue.

We couldn't lock in the backend using simple business logic, as we were running multiple processes on Heroku.

We couldn't use a unique index, as users could book a couples AppointmentRequest as there are 2 Appointment associated with a filled AppointmentRequest.

We couldn't do a simple implementation using optimistic and pessimistic locking against the AppointmentRequest and AppointmentOffer, as there's enough after_save changes that this would require a bigger rewrite.

My Proposed Solution

Since we didn't want extra insertions, I proposed that the creation of our Appointment to when the AppointmentRequest was made. Then, we used pessimistic locking to prevent a double update.

Could we have used NoSQL for locking?

At the time, we weren't aware of Redis's ability to do locking using setnx and even if we did, there's the added risk of depending on one database for locking against another database.

Key Takeaways

  • Listen to customers and call center to catch bugs that won't show up in tests or Sentry
  • Pre-create rows rather than locking the table for easy row-locking