Lineage Driven Fault Injection [Kolton Andrus]

The theoretical foundation of Chaos Engineering.

Audio source: Gremlin Podcast https://www.gremlin.com/blog/podcast-break-things-on-purpose-ep-9-kolton-andrus-ceo-and-co-founder-at-gremlin/ (33 mins in)

Reading

Transcript

Rich Burroughs:
Hey, so to shift gears a little bit Kolton, so you're one of the authors of a paper about Lineage Driven Fault Injection or LDFI. And I tried to read that paper and it was a bit over my head. So, I'm hoping you can explain to me and the listeners like we're five years old what LDFI is.


Kolton Andrus:
Yes, it's both a mouthful and, as an academic paper, it can be a little hard to digest. There is a Netflix tech blog post where we try to show some pictures and simplify it for folks that want to follow along at home. So the idea behind Lineage Driven Fault Injection is that systems really stay up because there's some amount of redundancy. Whether it's hardware redundancy, a host failed, we had another host to take its place, or it's a logical redundancy, we had a bit of code and it failed, but we have some other way to fill in that data or to have a fallback for that data.


Kolton Andrus:
And so the key idea was, if we have some way to walk the system, we have some way to graph it, think like tracing, and we can see how the pieces fit together, so we can see the dependencies, then we could start to reason about: if one of these dependencies failed, could something else take its place? And so at its heart, it's an experiment. We're walking this graph, and we're failing a node, and then we're checking to see what the user response was. So this is a key part: you have to be able to measure whether the failure manifested to the user, or whether the user was able to continue doing what they wanted to do.
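That core loop, model the dependency graph, fail one node at a time, and check the user-visible outcome, can be sketched in a few lines of Python. This is a toy illustration with made-up service names, not Netflix's actual FIT API: a request succeeds if any alternative path succeeds, and a path succeeds if all of its dependencies do.

```python
# Hypothetical sketch of the LDFI experiment loop. Each service maps to a
# list of alternatives; each alternative is a list of dependencies that
# must ALL be up. The request succeeds if ANY alternative works.
GRAPH = {
    "api":              [["ratings"], ["ratings-fallback"]],  # two paths: redundancy
    "ratings":          [["cassandra"]],
    "ratings-fallback": [[]],  # static default, no dependencies
    "cassandra":        [[]],
}

def succeeds(service, failed):
    """True if `service` can still serve a request given the failed nodes."""
    if service in failed:
        return False
    return any(all(succeeds(dep, failed) for dep in alt)
               for alt in GRAPH[service])

# Fail each node in turn and observe the user-visible outcome.
for node in GRAPH:
    outcome = "OK" if succeeds("api", {node}) else "USER-VISIBLE FAILURE"
    print(f"fail {node}: {outcome}")
```

Failing `cassandra` or `ratings` still yields "OK" because the fallback covers them; only failing `api` itself takes the user down, which is exactly the redundancy the experiment is designed to surface.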


Kolton Andrus:
And that sounds easy. It's like, oh, just check if the service returned a 200, or a 500. But in reality, you have to go all the way back to the user experience and measure that, à la real user monitoring, to see if the user had a good experience or not. Because the server could return a 200, and then the device that received that response could find that inside that 200 is a JSON payload that said error, everything failed. That's not a hypothetical, it happened. That was a learning from the process.


Kolton Andrus:
So, we build this service graph, we walk it, we fail something, and then we rerun that request, or we look for another one of the same type of request. And we see if something else popped up and took its place or if that request failed. And then the other computer sciencey piece is, in the end, these service graphs are something that we can put into a satisfiability (SAT) solver. And so we can basically reduce it down to a bunch of ORs and ANDs. Hey, we've got this tree; obviously, if we cut off one of the root nodes of that tree, we're going to lose all of the children and all of those branches. And so we don't have to search all of those if we find a failure higher up, because we can be intelligent about knowing we'll never get to those.
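The "ORs and ANDs" reduction and the pruning can be shown concretely. Below is a minimal sketch (illustrative names, brute force instead of a real SAT solver) that writes a toy dependency tree as a boolean formula and searches for minimal sets of failures that take the request down, skipping any combination that contains a failure set already known to break things:

```python
from itertools import combinations

# Toy graph: "api" needs either the "ratings" service (which needs
# "cassandra") or a static "ratings-fallback". Names are illustrative.
NODES = ["api", "ratings", "ratings-fallback", "cassandra"]

def request_ok(failed):
    """The dependency tree written out as ANDs and ORs."""
    api_up = "api" not in failed
    ratings_path = "ratings" not in failed and "cassandra" not in failed
    fallback_path = "ratings-fallback" not in failed
    return api_up and (ratings_path or fallback_path)

minimal_cuts = []
for size in range(1, len(NODES) + 1):          # escalate: 1, then 2, 3, 4 failures
    for combo in combinations(NODES, size):
        failed = set(combo)
        # Pruning: any superset of a known failing cut will also fail,
        # so we never have to run that experiment.
        if any(cut <= failed for cut in minimal_cuts):
            continue
        if not request_ok(failed):
            minimal_cuts.append(failed)

for cut in minimal_cuts:
    print(sorted(cut))
```

For this toy graph the search finds three minimal cuts: `{api}` alone, and the two-node combinations `{ratings, ratings-fallback}` and `{cassandra, ratings-fallback}`, which is the "fail two, three, or four things at the same time" escalation described below, with the pruning keeping the search tractable.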


Kolton Andrus:
So at its root, it's: build a graph in steady state, build a formula that tells us what things are most valuable for us to fail first, then on subsequent or retried requests, fail those things and see if the system either has redundancy that we find, and the request succeeds, or the request fails. And then as we go, we're getting into more and more complicated scenarios where we start failing two, or three, or four things at the same time.


Rich Burroughs:
Oh, wow. Yeah, we actually just had Haley Tucker from Netflix on our last episode and I think that we talked about some of this and I didn't realize that we were talking about LDFI, so thank you for that explanation.


Kolton Andrus:
Yes, I mean, there's a lot of cool things. Building FIT at Netflix really enabled LDFI because we needed that framework to cause the failure very precisely to run the experiments. It enabled CHAP, so the chaos automation platform is entirely built on FIT, where it's essentially routing traffic to canary and control clusters, and then causing failures with FIT to see how they respond and how they behave. And then I believe, Haley and her team are continuing that forward and even looking at other ways to do more of this A/B Canary style testing around failure.


Rich Burroughs:
Yes, she mentioned that they're adding in load testing along with the Chaos Engineering in that scenario, which I think is super cool. I love that idea of doing that A/B testing and doing the actual statistical analysis on what's going on.


Jacob Plicque:
Yes, I think it's interesting too because I feel like we're seeing a lot of the different pieces come together. Obviously, things like continuous chaos within a CI/CD pipeline is typically where we first start with that more automated chaos. So of course you have your build or the canary cluster like you mentioned, but adding the load testing in front of that to help drive a steady state metric before you even kick it off makes a lot of sense.


2021 Swyx