Sidekiq Race Conditions [Chris Toomey]

Chris discusses a nasty bug he wrote with Sidekiq, and Team Temporal discusses his bug.

Listen to The Bikeshed (24mins in) https://www.bikeshed.fm/313

Transcript


So we had a bug that occurred in the application where something was supposed to have happened. And then there was an email that needed to go out to tell the user that this thing had happened. And the bug popped up within AppSignal and said something was nil that shouldn't have been nil.


Particularly, we're using a gem called Time For a Boolean, which is by Caleb Hearth. And he's a former thoughtboter and maintains this wonderful gem that instead of having a Boolean for like, is this thing approved, or is it paid? Or is it processed? You use a timestamp. And then this gem gives you nice Boolean-like methods on top of that timestamp. Because it turns out, very often just having the Boolean of like, this was paid, it turns out you really want to know when it was paid. That would be a really useful piece of information. And so, while you're still in Postgres land, it's nice to be able to reach for this and have the affordances of the Boolean-like interface but also have the timestamp where available.


So anyway, the email was trying to process but that timestamp...let's pretend that it was paid as the one that matters here so paid at was nil, which was very concerning. Because this was the email that's like, hey, that thing was processed. Or let's say it was processed, actually, because that's closer to what it was. Hey, this thing was processed, and here's an email notification to tell you that. But the process timestamp was nil. I was like, oh no. Oh no. And so when I saw this pop up, I was like, this is very bad. Everything is very bad. Oh goodness.


Turns out what had happened was...because I very quickly chased after this, looked in the background job queue, looked in Sidekiq's UI, and the job was gone. So it had been processed. I was like, wait a minute, how? How did this fix itself? Like, that's not the kind of bug that resolves itself, except, in this case, it was. This was an interaction that I'd run into many times before. Sidekiq was immediately processing the job. But the job was being enqueued from within the context of a database transaction. And the database transaction had not been committed yet. But Sidekiq was already off to the races trying to process.


So the record that was being worked on, the database record, had local changes within the context of that transaction, but that hadn't been committed. Sidekiq then reads that record from the database, but it's now out of sync because that tiny bit of Sidekiq is apparently very fast off to the races immediately. And so there's just this tiny little bit of time that can occur. And this is also a fun one where this isn't going to happen every time. It's only going to happen sometimes. Like, if the queue had a couple of other things in it, Sidekiq probably would have not gotten to this until the database transaction had fully closed.


So the failure mode here is super annoying. But the solution is pretty easy. You just have to make sure that you enqueue outside of the database transaction. But I'm going to be honest, that's difficult to always do right.


STEPH: That's a gnarly bug or something to investigate that I don't think I have run into before. Could you talk a little bit more about enqueueing the job outside the database transaction?


CHRIS: Sure. And I think I've talked about this on a previous episode a while back because I have run into this one a few times. But I think it is sufficiently rare; like, you need almost a perfect storm because the database transaction is going to close very quickly. Sidekiq needs to be all that much more speedy in picking up the job in order for this to happen.


But basically, the idea is within some processing logic that we have in our system; we find a record, we do some work. And then we need to update that record to assign this timestamp or whatever it is. And then we also want to inform the user, so we're going to enqueue a job to send the email notification. But for all of the database work, we are wrapping it in a transaction because we want it to either succeed or fail atomically. So there are three different records that we need to update. We want all of them to be updated or none of them to be updated. So, therefore, we wrap it in a transaction.


And the way we had written, this was to also enqueue the job from within the transaction. That wasn't something we were actively intentionally doing because those are different systems. It doesn't really mean anything. But we were still within the block of ApplicationRecord.transaction do. We're now inside of that block. We're doing all of the record updates. And then the last piece of work that we want to think about is enqueueing the job to send the email.


The problem is if we're still within that database transaction if it's yet to be committed, then when Sidekiq picks up that job to run it, it will see the prior state of the world. And it's only if the Sidekiq job waits a little bit that then the database transaction will have been committed. The record is now updated and available to be read by Sidekiq in the correct updated state.


And so there's this tiny little bit of inconsistency that can happen. It's basically because Sidekiq is going out to Redis, which is a distinct system. It doesn't have any knowledge of the database transaction at play. That's why I sometimes consider using a Postgres-backed background job system because then actually the job can be as part of the database transaction.


STEPH: Cool. That's helpful. That makes a lot of sense the way you explained the whole you're actually enqueueing the job from inside that transaction. I'm curious, that prompts another question. In the case where you mentioned you're using a transaction because you want to make sure that if something fails to update so, everything gets updated together, in the event that something does fail to update because you were previously enqueueing that job from the transaction, does that mean that the update could have failed but that email would still have gone out?


CHRIS: That does not. And the reason for that is because we're within dry-monad world. And so dry-monad will implicitly capture the ActiveRecord rollback, which I think is an exception that gets raised or somehow...But basically, if that database transaction fails for any reason and ends up getting rolled back, then dry-monads will not continue processing through the rest of the sequential operation. And so, therefore, even if we move the enqueuing of the email outside of the database transaction, the sequential nature of that processing and the dry-monad stuff that we have in play will handle that. And I think that would more generally be true because I think Rails raises an exception on rollback. Not certain there. But I know in our case, we're fine on that. And we have actually explicitly checked7 for that sort of thing.


STEPH: So I meant a slightly different question because that makes sense to me everything that you just said where if it's outside of the transaction, then that sequential order won't fire because of that ActiveRecord migration error. But when you have the enqueuing inside of the transaction because then that's going to be inside of the sequential order, maybe before the rollback error gets raised. Does that make sense?


CHRIS: Yes. I think what you're asking is basically like, do we make sure to not send the job if the rest of the stuff didn't succeed?


STEPH: I'm just wondering from a transaction perspective, actually. If you have a transaction wrapped block and then you have in there, like, update this record, send email, end block, let's say update...well, I guess it's going raise because you've got probably like an update bank. Okay, so then yeah, you won't get to the next line. Got it. Got it. Got it. I just had to walk myself through that because I forgot that you probably...I have to visualize [laughs] as to what that code probably looks like. All right, that answered my question.


CHRIS: Okay. So back up to the top level then, this is the problem that we have. And looking through the codebase, we actually have it in a bunch of different places. So the solution in any one of those cases is to just take the line of code where we're saying enqueue UserMailer.deliver_later take that line of code, move it outside of the database transaction, and make sure it only happens if the database transaction succeeds. That's very easy to do in one case.


But my concern was this is a very easy failure mode to end up in. And this is a very easy incorrect version of the code to write. As far as I can tell, we never want to write the code where this is happening inside of the transaction because it has this failure mode. But how do we enforce that? That was the thing that came to mind. So I immediately did a quick look of like, is there a RuboCop thing I can do here or something?


And I actually found something even more specific, which was so exciting to find. It's a gem called Isolator. And its job is to detect non-atomic interactions within database transactions. And so it's fantastic. I was like, wait, really? Is this going to do the thing? And so I just installed the gem, configured it where I wanted, and then ran the test suite. And it showed me every place throughout the app right now where we were doing this pattern of behavior like enqueueing work from within a database transaction, which was great.


STEPH: Ooh, that's really nifty. I kind of want to install that and just run it on my current client's codebase and see what I find.


CHRIS: This feels like something like strong migrations where it's like, yeah, this is great. I kind of want to have this as part of my core toolset now. This one feels even perhaps slightly more so because sometimes I look at strong migrations, and I'm like, no, no, no, strong migrations, I get why you would say that, but for reasons, this is actually fine. And they have configurations within it to say, like, no, this is okay. Isolator feels like it's always telling me something I want to know. So this, very quickly, I'm like, I think this might be part of my toolset moving forward on every single app forever.

2021 Swyx