Guest(s): Ishwari Aghav - Production Engineer; Joe Romano - Software Engineer
In this episode of the Meta Tech Podcast, Pascal Hartig interviews Ishwari Aghav and Joe Romano, engineers focused on configuration change safety and release reliability at Meta. They discuss how configurations enable rapid, safe updates across Meta services, the risks of misconfigurations, and the safeguards — like canaries, progressive rollouts, and health checks — that prevent outages. The episode also covers how AI and data-driven approaches are improving detection, reducing noise, and speeding up troubleshooting, as well as ongoing challenges in balancing developer velocity with system reliability.
Pascal: Hello and welcome to episode 84 of the Meta Tech Podcast, an interview podcast by Meta where we talk to engineers who work on our different technologies. My name is Pascal and I usually have a little opening joke here about current events. Not today, not with a 10-foot pole. Swiftly moving on, today we are talking about configurations: the settings and parameters that control how services behave and let you make meaningful changes without deploying your whole service. That separation is incredibly powerful. It’s also why config changes can be uniquely dangerous. They can propagate across huge parts of the fleet extremely quickly, sometimes in seconds, and they don’t always come with the nice, predictable deploy moment you get with a normal code rollout.
To unpack how we made config changes safer at Meta, and what’s still hard even after years of investment, I’m joined by Ishwari and Joe, who work on configuration change safety and release reliability. We talk about what config as code looks like in practice, how we use canaries and progressive rollouts to control blast radius, why health signals can be noisy enough that people convince themselves it’s probably just flakiness, and how the systems can catch the patterns before they turn into a very bad day.
We also get into where AI is starting to help, not as magic but in very specific places like time series analysis and narrowing down suspects when a rollout batch regresses. On the Meta Engineering blog, there’s one article that caught my eye, FFmpeg at Meta and our continued commitment to upstreaming impactful changes. We’ve had a whole episode about video transcoding before, so if you’re curious about how we use multi-lane transcoding for live streams, check out the link in the show notes. Just one quick note before we start with the interview. We talk a lot about SEVs. For those new to the podcast, at Meta we call incidents SEVs, which originally stood for site events, but at this point the term encompasses everything from planned events that could affect other services, like a large migration we want to raise awareness of, all the way to the "oh no, everything is down" kind of events.
Now that we know what this is, we can get started. So here’s my conversation with Ishwari and Joe.
Whenever you read a postmortem of a major site going down, there is a good chance that a configuration change shows up as the root cause. Now, by the time we're recording this, and I cannot stress this caveat enough, a major sitewide outage hasn't happened at Meta for a while, and that wasn't just dumb luck, but the result of a lot of hard work. To discuss how we made config changes safer, and what makes Meta's systems quite different from what you might find in the rest of the industry, I have two fantastic guests, Joe and Ishwari. Welcome to the Meta Tech Podcast.
Joe: Thank you.
Pascal: Ishwari, can I kick things off with you before we dive into the topic and just ask a bit about your journey at Meta and before. So how long have you been here and what did you do before?
Ishwari: Yeah. I've been at Meta for nine years now. Before joining, I completed my master's degree at Carnegie Mellon University. Since graduating, I focused on managing our infrastructure for monetization. Initially, my work centered around host management, disaster recovery, and new region turn-ups. Over the past five years, my focus has shifted primarily towards release and reliability, with an emphasis on configuration change safety.
Pascal: Fantastic. Joe, can I pass it onto you?
Joe: I've been here for seven years. I'm even a returning intern: I had an internship here before I graduated from college, then came back full-time, and that whole time I've been in Core Systems, which is our internal cloud org. So we are like the internal equivalent of AWS, with a very similar set of products. I've been working on service management there for about half the time, and on change safety and deployment for the other half.
Pascal: Fantastic. I'm just gonna jump then straight to the next question. Usually I would ask you about your teams, but I think you've answered this quite well already, so let's talk about it. What do we actually mean when we say configurations? I feel like most of us maintain our .bashrc or Fish config or whatever shell config you want.
So what does it actually mean in the sense of distributed systems?
Ishwari: Yeah, configurations, most commonly called configs, are settings and parameters that control how a service behaves. You can think of them like, as you mentioned, a .bashrc file on your computer, which customizes your shell environment. At Meta, configs are used in a wide variety of ways. They let us adjust features, enable experiments, update models that serve predictions, and change system behavior, all without modifying the underlying code.
This flexibility is quite crucial: it allows us to iterate quickly, roll out changes safely, and respond to new requirements at a much faster pace without needing a full code deployment. In short, configs empower us to make updates and improvements quite efficiently.
Pascal: That makes sense, but one might say that this is really important if you have something like a weekly deployment. I think Facebook was famously deployed on a Tuesday decades ago, I know because I happened to interact with the API, and that was also when most of the issues would materialize. But these days we have continuous deployments that happen every few hours.
So why do we need a sec second separate layer on top of that?
Joe: We see in a lot of SEVs that it can still take time, when something breaks, to change that code, push it, and roll it back. And not only that, a lot of these configs end up shared across multiple services or across the entire fleet. So you could have one config that controls some library shared by all the services that run Facebook, Instagram, and WhatsApp, and you can't roll that out and roll it back within all those services separately.
Especially if you have an incident you're trying to fix, you really wanna be able to change that config right away, or roll it out without restarting the services. You wanna be able to control, for example, what percent of users have access to something instead of what percent of machines.
Pascal: That makes sense. So let's talk a bit about the specific systems that we use at Meta. How do they work?
Joe: Yeah, I can chat about that a little bit. We built a system to complement the monorepo at Meta; you may have seen how we have all of our code in one place. We have the same idea with all of our configs in one place. We have one big repo, essentially, set up where you can author configs using config as code, mostly Python. That eventually generates some JSON files that anyone can write and that any service in the fleet can read and consume. And we built up a system that enables those JSON files to go everywhere, all at once, essentially instantly. The SLA is about one or two minutes, but in practice we've regularly seen a change reach the entire fleet in five seconds or less. And that's super important to be able to manage the code and libraries that are shared across everywhere with configs that end up being shared across everywhere too.
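To make that description a bit more concrete, here is a minimal config-as-code sketch in Python. The dataclass, field names, and output path are invented for illustration and are not Meta's actual schema; the point is simply that a Python module materializes a plain JSON artifact that any service can read.

```python
# A minimal config-as-code sketch: a Python module that materializes a
# plain JSON file which any service could read. All names are illustrative.
import dataclasses
import json


@dataclasses.dataclass
class AdsRankingConfig:
    model_version: str = "v42"
    experiment_enabled: bool = False
    traffic_fraction: float = 0.05  # fraction of users, not machines


def materialize(path: str) -> None:
    """Generate the JSON artifact that would get distributed to the fleet."""
    cfg = AdsRankingConfig()
    with open(path, "w") as f:
        json.dump(dataclasses.asdict(cfg), f, indent=2, sort_keys=True)


if __name__ == "__main__":
    materialize("ranking.json")
```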
Pascal: That makes a lot of sense. When I said earlier that we roll out a complete site-wide change every few hours, that is still a very different number than five seconds to reach all the services at once, with no restarts required. So one thing I found interesting was that you said mostly Python.
So what else is going on in there and what's the story behind the kind of pluralistic use of languages?
Joe: Yeah, ultimately we end up having just some set of data that can be distributed. Some of these configs are completely automation driven; we have shard maps living in there, some security configs, routing configs. We might use Python to generate some of these.
Beyond users writing configs, we'll have a bunch of other stuff that just directly generates the raw files and distributes them, coming from basically anywhere. So it's really an "anyone can write with almost anything, and anyone can read from almost anything" setup. And that lets us integrate this into pretty much every single product, every single server; even the edge will read from these files.
Automation, humans, everywhere.
Pascal: And I think I need to give you one last softball question where people might roll their eyes while listening to this if they work in this space. But why do these configuration changes attract so many, as we call them here, SEVs, or incidents, or outages? What's the reason behind that?
Ishwari: I think, as Joe mentioned, configs can be deployed to the entire host fleet within a few seconds, if not minutes. And they introduce risk at that same speed: with a misconfiguration, you could update model settings or feature settings that quickly reach thousands of servers and take down entire services.
Pascal: So luckily, again, as of the time of recording, this hasn't happened for a while. So what went on there? Were we just lucky?
Ishwari: It is not just luck. We have invested heavily in safeguards, testing, and monitoring for these systems. For example, there have been cases where a risky config change was caught by our automated checks and canary deployments, preventing a potential outage. These systems help detect problems before they even reach all of our production fleet.
And we have years' worth of investment going into this. We have several deployment mechanisms, for example canaries and progressive rollouts, and we have invested in checks that help us detect whether any of our service SLIs are being impacted, and catch the change early on.
Joe: And it's interesting you said "for a while"; there have been outages that made it into the press. I think, if you're listening, you have seen these before. And we've responded to those and put a lot of effort into learning from them. We have a strong culture of looking at incidents, analyzing them, and taking steps.
So we've had massive company-wide efforts to make sure that those don't happen again. And we know it's not luck because those efforts have actually led to places where we've caught issues before they made it to the entire fleet. So we do have incidents almost every day, right? But a lot of those are cases where we're catching things early on.
We're able to fix those quickly. And then we can look back at some of the successes, at what you'd consider near misses, right? Where we catch it before the site goes down, because we test in a single region or on a single host and see the issue first, before it goes everywhere.
Pascal: Can you maybe talk us through a regular SEV review process? Because people who don't work for large companies, where I think this process is relatively standardized, might just think this is where the person who pushed the wrong button gets blamed. But that's practically the opposite of what happens.
Especially as you just said, we are often catching things before they actually hit production. So what is the process like?
Joe: Yeah, in SEV review the motto we really have is that we're trying to improve our systems: we blame the systems, not the people. So it's all about improving the technology and process. It might be presented by somebody who did land a bad change, but we're always trying to figure out,
how do we learn from that? How do we improve the detection? We have a setup we call DERP: detection, escalation, remediation, and prevention. And we try to address each of those and say, how do we detect it better? Did we escalate fast enough? Did we get to the right people?
Did we fix it fast enough, and what are we doing to prevent it? And for prevention, we avoid stuff like, oh, this person should know better. It's more of, how does the system test this better? How is it more foolproof? And that's becoming more and more important with AI even, right?
We want more things to be automated, so truly, we're not able to blame the AI at this point. We need to have the systems protect themselves.
Pascal: Yeah, it probably won't be very fruitful to try to blame Claude or some other agent for doing it and hope that next time it will do better. Actually, can we talk about specific examples? Because you said there were some near misses. So what did our systems catch and prevent from happening?
Ishwari: Yeah, I can give an example. There was a case where our progressive rollouts were rolling out a config change that unexpectedly prevented a particular model from loading, but we did catch it in our testing environment, because our health checks monitor things like model loading failures and model fallbacks.
All these checks monitor for these kinds of situations. We caught the change in the progressive rollout itself, and we reverted it after bisecting across the change list.
Pascal: Can we talk a bit more about this progressive rollout or canary approach? So how does that actually work in practice?
Ishwari: Yeah, so there are two approaches. Canary is more of a short-form method of verifying changes. You take a test tier, canary those changes for about 10 to 15 minutes on those test tiers, and check whether any SLIs or health checks have been impacted.
Once that's done, it moves on to an entire region, where you test it for 10 minutes, and then eventually it lands in production if everything looks good. The other mechanism that we use for deployment is progressive rollouts. Instead of landing within a time span of 30 minutes, these take a couple of hours on average to roll out.
The reason we choose that deployment mechanism is that some services require a longer period of time to consume these configs and eventually publish the changes. So we may need an hour and a half to apply these new config changes to the service and then eventually push to production.
And there are some services that only consume certain configs on startup. So we trigger an artificial restart of those services, which forces the service to consume that config. That makes our testing more effective before the change is eventually pushed to production. So there are two main deployment mechanisms, as I mentioned: a short-form canary as well as longer progressive rollouts.
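As a rough illustration of the two mechanisms Ishwari describes, a canary stage followed by larger stages with a health gate at each step might look something like the sketch below. The stage names, durations, and the deploy() and check_health() helpers are placeholders, not the real tooling.

```python
# Illustrative rollout driver: canary on a test tier, then one region,
# then production, with a health check gate after each stage.
import time

STAGES = [
    ("canary-test-tier", 15 * 60),  # short-form canary, roughly 10-15 minutes
    ("single-region", 10 * 60),     # soak in one region
    ("global", 0),                  # full production rollout
]


def deploy(config_version: str, target: str) -> None:
    print(f"deploying {config_version} to {target}")


def check_health(target: str) -> bool:
    # A real system would query SLIs and health checks for the target here.
    return True


def roll_out(config_version: str) -> bool:
    for target, soak_seconds in STAGES:
        deploy(config_version, target)
        time.sleep(soak_seconds)  # let signals accumulate before judging
        if not check_health(target):
            print(f"health check failed in {target}, reverting {config_version}")
            return False
    return True
```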
Pascal: That already highlights, I think, one of the things that makes this so challenging to build. We've said there's a monorepo for this; everybody pulls from the same repository, but the services consume it differently. Some might get the config at runtime, others, as you say, at startup time. We have our web codebase, which is in Hack, which generally speaking pulls stuff at runtime on a per-request basis, so there you have a completely different model again. How do you even keep track of how a config is consumed and what mechanism is needed to actually test that the config is correctly applied?
Joe: This is actually one of the big problems we're facing now. Some of these configs that are read at startup are the ones that are the riskiest right now and the hardest to test. Part of the issue with decoupling the code and config release is that some of the progressive rollout techniques don't necessarily restart the task running that code to get the new config.
And we have seen places where, when that config is read at process startup, we're actually missing the signal during the rollout, during the release, until much later. This is one of the problems our team is trying to solve today: when we have this monolithic collection of configs, how do we classify some as startup configs, for example, and then use a different release process, or classify some separately into, maybe they need to be split up and pushed with each service. This is something we work a lot on together with Ishwari and monetization to figure out: how do we push differently based on the service itself and how it needs the config?
This is definitely a frontier for us. Now that things have gotten so big, we need to actually go back and split some of it up to have specialization.
Pascal: Yeah, I can see this, especially because the interface you're providing is deceptively simple. You put your configs there, there's some API to grab them back again, and then you basically figure out all the rest, which makes it so nice to use. So now we're talking a bit about the life of a mutation: we've talked about how it's tested, and how you're already applying this slightly bifurcated logic depending on where a config is actually consumed.
So what happens now? At some point, you will make a decision whether something is good or not. How, firstly, are you deciding whether a config is good or bad and needs to be rolled back?
Ishwari: We rely heavily on the health checks we have in place. As I indicated earlier, there are service-level health checks and then there are top-line health checks that look at, let's say, the entire ads ecosystem, and we check whether we're functioning correctly. The testing also happens in multiple stages: you start with a small testing environment, which is a reflection of the production environment.
And eventually you move on to a larger blast radius. For example, you go to an entire region, and if it looks good in an entire region, it's likely good to roll out to all of production. So we gradually increase our rollout to all of production.
Joe: I'd love to give an example that kind of highlights this, another incident we prevented. There was one case where we had a bad config that actually caused crashes. And like we talked about earlier, these configs can be read by internal libraries across the entire fleet.
This was one of those configs: it just caused everything it went to to crash. And you'd actually be surprised, these health signals are not perfect. They're pretty noisy. So even if somebody tries to test one of these changes, they might think that their change is good.
Everybody's landing something they think is good. They may actually retry a second time thinking, oh, I'm sure this was just noise, right? My change couldn't break the site. That's how every SEV happens, right? People assume something's good. And so we've actually not only built these health signals, but we've even started to build a layer on top that looks at what happens if somebody retries twice.
Now we can actually use the signal from both independent trials and say, oh, it failed once, and then when they tried again, it failed a second time. And if you look at the whole time series, you can even see two or three separate spikes. Then we pinpoint that and we say, "Hey, this isn't actually noise, right?" We reach out directly, automatically, and say, we actually think you're about to break the site, please stop retrying. And this actually prevented a major site outage, when somebody tried a few times, caused crashes, and we caught it with that kind of meta-analysis across multiple trials.
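A toy version of that "multiple independent trials" check could look like the following; the data shapes, thresholds, and helper names are invented for illustration.

```python
# Sketch: if an error metric spikes inside every window in which the same
# change was being canaried, it is very unlikely to be noise.
from typing import List, Tuple

Sample = Tuple[float, float]   # (timestamp, metric value)
Window = Tuple[float, float]   # (start, end) of one canary attempt


def spiked(series: List[Sample], window: Window,
           baseline: float, factor: float = 3.0) -> bool:
    """True if any sample inside the window exceeds factor x baseline."""
    start, end = window
    return any(v > factor * baseline for t, v in series if start <= t <= end)


def looks_like_real_regression(series: List[Sample],
                               trial_windows: List[Window],
                               baseline: float) -> bool:
    """Require a spike in every retry window before paging the author."""
    return len(trial_windows) >= 2 and all(
        spiked(series, w, baseline) for w in trial_windows
    )
```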
Pascal: Okay. So you say in some cases people just retry, and luckily we catch this, but I guess there could also be cases where we miss something. So what happens in that case? How do we mitigate these issues?
Joe: Yeah, so like in that example, right? Let's say somebody tried a few times and maybe they were just barely breaking something, right? So maybe we actually let it through, because these health checks are today manually tuned. We're trying to auto-tune them. We've talked a bit about how some of these run in the past, at that scale.
But let's say we miss one and it goes out. The change was still tested, for example, in a region or on a service, so we do still have all of the metrics that were genuinely impacted by that change. So when a first responder gets paged, they'll still see some kind of time series chart that will show, for example, a spike when the change was first tested and maybe a spike when it was tested again with canary, even if the health check passed.
And we've actually built tools where you can highlight and pinpoint what was being tested during each of these spikes and use that as what we consider a fingerprint of the change. That enables us to isolate down to the set of changes deployed with the exact same fingerprint, tested in the same place at the same time.
And that can usually give us a handful of changes to comb through instead of the thousands, tens of thousands, that we have every hour going out from humans and automation. And that has also helped us to look at major issues and catch them, especially when it's a progressive rollout, right?
Because that's a slower-moving change and it hasn't gotten everywhere yet, there are a lot of cases where, by pushing a little bit slower, we give a human a chance to use that signal before we go everywhere, which is awesome.
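Here is a simplified sketch of that fingerprinting idea: match metric spikes against the windows in which each candidate change was being tested, assuming hypothetical data structures for the deployment footprints.

```python
# Sketch: intersect the spike times seen on a metric with the (time window)
# footprint of every candidate change, so a responder combs through a
# handful of diffs instead of thousands. All data shapes are illustrative.
from typing import Dict, List, Tuple

Window = Tuple[float, float]  # (start, end) of one test/deploy phase


def matches_fingerprint(spikes: List[float], footprint: List[Window]) -> bool:
    """A change is a suspect if every observed spike falls inside one of the
    windows in which that change was being tested or rolled out."""
    return all(any(s <= t <= e for s, e in footprint) for t in spikes)


def suspects(spikes: List[float],
             deployments: Dict[str, List[Window]]) -> List[str]:
    """Return the change IDs whose deployment footprint matches the spikes."""
    return [change for change, fp in deployments.items()
            if matches_fingerprint(spikes, fp)]
```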
Pascal: Yeah, absolutely. And it also shows that the time aspect is always so crucial. I used to work very briefly on something configuration-safety related, where we tried to figure out which end-to-end tests to run, because they're really handy in these cases. They're just little health markers: click around on the site, see if this works.
But there's a lot of flakiness in that particular signal as well. Plus you have a certain time budget that you need to work in. You can't just run all the tests on every change, because then suddenly we'd find ourselves again in a situation where we might ship once a week.
So that means we always have to trade off time, which means we're slowing developers down and everybody wants everything to be faster.
How are you tackling this? What is the mindset you apply to this safety versus velocity trade-off?
Ishwari: We do invest a lot in making sure our health check signals are not flaky. There are systems designed to test these signals regularly. We either run A/A tests and check whether they are still failing, or we look at our health checks and auto-tune them periodically to make sure they are not frequently blocking releases without actually detecting a true positive change in the change list. So there's a lot of investment on both sides: we make sure the canary specs are healthy, as well as the health checks that are always run during these rollouts.
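For a flavor of what auto-tuning a health check threshold from known-good history might look like, here is a small hypothetical sketch; the quantile, headroom factor, and function name are assumptions, not the real system.

```python
# Sketch of threshold auto-tuning: pick a high quantile of the metric seen
# during healthy rollouts, plus some headroom, so an A/A comparison (same
# build on both sides) almost never trips the check.
import statistics


def autotune_threshold(healthy_samples: list[float],
                       headroom: float = 1.2) -> float:
    """Return a threshold slightly above the 99th percentile of healthy runs."""
    p99 = statistics.quantiles(healthy_samples, n=100)[98]
    return headroom * p99
```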
Joe: Yeah, I guess the only other thing I could add is one step we've taken recently, which has been to stand up some metrics on the precision and recall of some of these systems. That's a newer thing we figured out we needed to do. People said, oh, this is so noisy. I can't fight someone who shows up and says it's noisy when I say it's not noisy.
Then who's right? So in the last year or two we've actually started to stand up better measurements of things like false positive rates, even getting ground truth labels to improve the system, leaning into the idea that it's a classification problem. We actually need to have data.
And when we have the data, I know it's common sense for product organizations, but in infra organizations, being more data-driven like this has actually enabled us to really improve things. We've reduced the noise by almost 10x across the board, and some teams, like Ishwari's team, have made even further progress on improving recall and catching incidents early, by being data-driven about the incidents with the most impact and then figuring out how to catch those.
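Treating the blocking decision as a classification problem boils down to something like the following, where the check verdicts and ground-truth labels are assumed inputs rather than anything the real pipeline exposes.

```python
# Sketch: given ground-truth labels ("did this change actually regress
# anything?"), compute precision and recall of the blocking verdicts.
def precision_recall(verdicts: list[bool],
                     truths: list[bool]) -> tuple[float, float]:
    tp = sum(1 for v, t in zip(verdicts, truths) if v and t)
    fp = sum(1 for v, t in zip(verdicts, truths) if v and not t)
    fn = sum(1 for v, t in zip(verdicts, truths) if not v and t)
    precision = tp / (tp + fp) if tp + fp else 1.0  # how often a block was right
    recall = tp / (tp + fn) if tp + fn else 1.0     # how many regressions we caught
    return precision, recall
```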
Ishwari: Also, to add to that, past outages have been like a map for us to understand what we lacked, and to eventually add those checks to our progressive rollout systems to prevent similar kinds of outages in the future.
Pascal: Yeah, I guess that's always one of the follow-up questions from the SEV review: how do we prevent this in the future? Are there any tests or health checks that could be added? So I think we're fairly good at building up a better defense strategy based on those. So Joe, you mentioned that part of the effort is basically reducing noise, and one tool that everybody swings at everything these days, and that is fairly good at sifting through noisy signals, is AI.
Have you found that in general to be helpful to either prevent or even resolve configuration issues?
Joe: Yeah, I think we're still in the early stage of things, but one place we've used AI is actually in the time series analysis. We started with machine-learning-based approaches, instead of just the static thresholds some of our health checks use, like if you go above 80% memory, then fail the health check.
Now we're able to use AI and even LLMs to do time series analysis. We have yet to get this to be competitive with some of the traditional ML approaches, but we've seen glimmers of hope where an LLM can look at a longer time series window, and an LLM can get the context of the change itself to understand whether it's risky or not and what types of tests might be needed.
So I would say this is all more in the experimental stage, even LLMs for root-causing, but there is some promise we're starting to see out of applying LLMs. Especially traditional ML has actually been super useful and, I think, underutilized in this space as well.
Ishwari: I'd add that one of the ways we're using AI to detect the changes that are the root cause of outages is in progressive rollouts, where you're essentially batching at least a few hundred changes together, testing them together, and rolling them out. Once you figure out that there's a regression in the testing environment, you need to go from the hundred changes that you have and figure out which one of those is exactly the bad one, and revert that. In the past that has been a very manually heavy process for the on-call and the release operator: they had to bisect, which taxes the test capacity that we have, to eventually find the bad change.
In the past this has taken about 8 to 10 hours at least, and has slowed down all the developers waiting for the release to roll out. So the place where we have actually used AI is in figuring out, of the hundred diffs, and based on the health check signal that is regressing, which of the hundred changes is likely to have caused the regression, and then starting the bisect from, let's say, the top five suspects.
If we are accurate, we save a lot of time bisecting, and we pinpoint the bad change very quickly and revert it. If not, we fall back to the traditional mechanism of bisecting the entire change list.
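A sketch of that suspect-first strategy might look like this: check an AI-ranked shortlist before falling back to a classic bisect, with the ranking model and the test environment abstracted behind callables. Everything here is hypothetical scaffolding, not the actual release tooling.

```python
# Sketch: locate the single bad change in a batch known to regress a check.
from typing import Callable, List, Optional


def find_bad_change(batch: List[str],
                    rank_suspects: Callable[[List[str]], List[str]],
                    passes_with: Callable[[List[str]], bool],
                    top_k: int = 5) -> Optional[str]:
    """passes_with(changes) re-runs the canary with only `changes` applied;
    rank_suspects(batch) is the (hypothetical) AI ranking of likely culprits."""
    # 1. Check the AI-ranked shortlist first: drop one suspect and re-test.
    for suspect in rank_suspects(batch)[:top_k]:
        if passes_with([c for c in batch if c != suspect]):
            return suspect
    # 2. Fall back to bisecting the whole change list (assumes one bad change).
    lo, hi = 0, len(batch)  # invariant: the bad change is in batch[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_with(batch[:mid]):  # prefix is clean, culprit is later
            lo = mid
        else:
            hi = mid
    return batch[lo] if batch else None
```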
Joe: Another place that we use AI a lot is actually in the developer workflows: the developer experience, the cost of maintaining change safety. Some of our most successful experiments so far are actually on the maintenance of what we call the canary spec itself, which defines how these canaries run in production.
These actually live in the config repository itself, because they are the configs for our systems that define how change safety works. So it's a little bit recursive: we truly do keep all the configs in one place. And writing and editing those directly with LLMs, instead of forcing people to learn some complicated configuration schema, has made it a lot faster for people to set up and manage their own safety setups for every system at the company that needs safety.
And the same applies to authoring health checks, and just understanding and managing these systems. We would love to get to a state where things are totally agent-autonomous, right? Instead of having a human spend a day a week managing these systems or editing health checks, they're totally managed by AI.
So we can focus on shipping the products and features in the services, not everybody doing the same setup and maintenance of change safety.
Pascal: You already touched on some of the big bets you're making, and this might be part of it as well, but when you zoom out and just think about the big challenges that you want to address, with or without AI, what are you currently looking at in that space?
Ishwari: One of the big challenges we face today is still related to AI: we are seeing a multifold increase in the amount of changes going through our deployment systems. And this is a headwind for us, not just in terms of test capacity and the ability to test changes; if AI productivity is going up, you also need to deploy at a much faster pace without compromising on reliability.
So that's one of the places where we are taking another look at how we test our changes. Are there any optimizations we can do in testing those? It's basically taking a re-look at how we deploy our changes.
Pascal: Do you have any specific frameworks you want people to adopt right now, something like the canary specs you mentioned before? Or is it mostly about you trying to analyze how configs are used, for instance, as you mentioned before?
Ishwari: Yes, this will happen at a later stage, where we will analyze how the configs are consumed in services, what parts need to be tested, and which tests are the most appropriate ones to run during the progressive rollouts. I think this will happen mostly after the user has clicked "Ship it"; that's when our systems kick in.
Joe: Yeah, we've been pretty good so far, I think, on the incidents, the major ones, because of such a big push on adoption. When we were analyzing incidents over the last few years, we were seeing that a lot of them could have been prevented if we had adopted something that we had already built.
But now we've mostly adopted the stuff that's been built, so it's back on us to do that type of optimization and build the next wave of tools, so that we can catch the next set of incidents with a two or three x or higher rate of change from AI.
Pascal: So are there still certain problem categories that you feel are completely unsolved and could really have some potential listeners here join you in solving?
Joe: I think the biggest thing is needing to step up to the challenge of the rate of change. In the past we looked at what the risk is that each change might cause a major incident, and we could say, okay, if we keep the riskiness of every change about the same, we're probably fine.
We improve that a little bit. But if you start to say you're gonna have three x, or you look forward to basically infinite code volume, then unless that code can also be a multiplier more reliable, which I don't think is the case, you're gonna need basically the inverse multiplier of safety.
Or you'll see a huge explosion in the number of incidents, and it'll take even longer to diagnose those because they're all gonna be going out at the same time. So the real challenge is essentially reinventing ourselves to step up to this. We're exploring ideas like, do we need to completely remove code-triggered incidents and put everything behind configuration?
Because, like we talked about, you can more quickly mitigate incidents with configs, and you can more quickly diagnose them. Code is a lot harder to manage. So we're exploring this idea: let's just wrap everything in some kind of if statement with a config, everywhere, globally, and manage things that way. That's a huge change to how we develop; we can't just do that with what we have today, or everyone would be slowed down too much.
So figuring out how to integrate that as a core part of the workflow, that's an unsolved problem for sure.
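The "wrap everything behind a config" idea is essentially a config-backed gate around new code paths, so a risky path can be turned off fleet-wide in seconds without a code push. A minimal hypothetical sketch, where the flag file path, flag name, and config client are all invented:

```python
# Sketch of a config-gated code path: the gate reads a locally distributed
# flag file, so flipping the flag does not require a code deployment.
import json


def get_config(name: str, default: bool = False) -> bool:
    """Read a boolean flag from a locally distributed config file."""
    try:
        with open("/tmp/flags.json") as f:
            return bool(json.load(f).get(name, default))
    except FileNotFoundError:
        return default


def new_ranker(items):
    return sorted(items, reverse=True)  # placeholder for the new, risky logic


def rank_items(items):
    if get_config("use_new_ranker", default=False):
        return new_ranker(items)  # new path, gated by config
    return sorted(items)          # existing behavior as the fallback
```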
Ishwari: Yeah, adding to what Joe said, I think we are also looking at how we can fast-track automated changes, which have more predictable risk compared to human-authored changes, which at times can be quite risky. So yeah, I think we are looking into ways to speed up that deployment, have more predictable testing, and so on.
Joe: For all we talk about safety, there's the complement: now that we can prototype so quickly, we actually need to keep the fast path available, right? So now we have to work in this double world of, how do the safe things get prototyped in five seconds, one minute, right? How do they go out without breaking everything else?
And then for everything else, how do we let those have five x more changes too, without falling over? So it's really interesting to spend the morning trying to make the new prototype apps go as fast as possible, and the afternoon responding to peak rates of change and figuring out how we prevent SEVs there.
Pascal: That basically goes back to one of the initial points that you brought up about safety versus developer velocity, because right now developers are also just feeling this need for speed. Everything feels so much faster, and if there's suddenly this one mechanism in the middle that feels like it slows them down,
there will be a lot of eyes on you, basically. And at the same time, you probably have a very busy on-call right now unless you keep up with the rate of change and put more automated checks in there, so you don't need to, as you said before, analyze three different graphs and figure out manually whether the spikes match.
Okay, we are running a bit close to time now, but maybe you have one interesting war story about one change that really involved a bunch of critical thinking. So do you have one for us?
Joe: Yeah, I can talk about another example where change safety systems saved the day. Every time we talk about it, it feels like jinxing it. For months now we've been saying stuff like, oh yeah, the site hasn't gone down in x number of days or months, and every time we say that, people are knocking on wood.
So I'll keep doing that, right? But it's nice to have the success stories. One of them: we had a change rolling out in Instagram that caused error rates that were essentially directly user visible, causing top-line impact. And that was actually leveraging our progressive rollout tooling, which goes region by region.
We split up our services into each region, and for the most part they can fail independently, and this is why we lean so much into the regional deployment. So this change was going out and had gone to four regions at this point, so it actually was creating a visible incident. But we were able to use that capability: we saw the change breaking one region at a time, and we could go back and say, oh, it was this change. And it was a config change, which then let us very quickly revert it. If we had pushed, for example, the Instagram server and it had gone to four regions,
remediation would not have been seconds; it would've taken the time to push all those regions again. So that's a nice story. We still had an incident on our hands, we still had the SEV review and the postmortem, which gets into, how do we not have that type of incident at all?
How do we catch it in one region instead of four? So there's still the challenge: we can prevent the massive site outage, but now we wanna prevent the operational costs, the business impact, of even the things that break just one region. That could be someone's month where they're now working on a postmortem.
That could be a huge top-line impact. We wanna get that down, to catch it in testing, and hopefully we'll get to cover in a future podcast how we catch that before deployment with validation or fuzz testing or other pre-production techniques.
Pascal: There are definitely some interesting follow ups. Ishwari, do you have a story to share as well?
Ishwari: Yes. I would like to share a story from a progressive rollout where we would have caught the bad change at the region stage, but we caught it even earlier, when we were testing in a much smaller prod environment that serves less than 1% of traffic. We did see top-line impact even in that smaller production environment.
The problem, though, was that this environment is used by multiple progressive rollouts that are pushing isolated config changes to production. So we had quite a bit of back and forth to find out which release was causing this top-line revenue impact.
Eventually we had to debug it and figure it out. But yes, it was a potential SEV 0 that would have taken down pretty much most of our ads services if it had gone out.
Pascal: Thanks for sharing, and unfortunately we're out of time. Ishwari and Joe, thank you both so much for keeping our site up. I would knock on wood, but I think that just makes for terrible podcasting. And of course, thank you for joining me here on the Meta Tech Podcast.
Joe: Thank you.
Ishwari: Thank you.