Guest(s): Katherine Zak, Software Engineer; Dustin Shahidehpour, Software Engineer
Ever wonder what it’s really like to be an engineer at Meta? Software engineers Dustin Shahidehpour and Katherine Zak share their day-to-day experiences.
Straight from the Meta Tech Podcast archives, host Pascal Hartig is joined by software engineers Dustin Shahidehpour and Katherine Zak in Episode 55 (Aug 2023) for a candid conversation about life at Meta. They discuss everything from the interview process and onboarding to the tools and workflows they use every day. They share what it’s like to work on products that impact billions of people and the unique challenges and opportunities that come with that scale.
Pascal: Hello and welcome to a bonus episode of the Meta Tech Podcast, a podcast by Meta where we talk to engineers working on different technologies. My name is Pascal, and for only the second time in nearly seven and a half years of making this podcast, I'm reaching into the archives. Why? Well, I'm sure some of my fellow software engineers will sympathize.
I'm hopeless with dates. Meta Connect is just around the corner September 17th and 18th. That's 2025 for those of you from the future. It's thrilling news for developers, creators, AI buffs, and our fans alike. But if you're me, and you need someone to review your freshly recorded interview, you'll find yourself politely shuffled to the back of the queue. So, rather than an episode on training models to generate 3D assets and weaving them into our products, I'm dusting off a classic from August 2023 that's truly stood the test of time.
I sat down with Katherine and Dustin, two names almost everyone at Meta knows by their internal handles, Dust and Zak. We dive into what it's really like to write and ship code here.
While many companies slow down as they grow, not just from process or red tape, but from the sheer weight of code, tests, static analyzers, and source control groaning under tens of millions of files, Meta's move-fast mantra is still very much alive.
In this chat, we explore some of the infrastructure that keeps us nimble. But enough of me rambling into my travel mic, which is currently hogging precious space on my 3,300-kilometer Interrail trip. Just so you don't miss an episode in August, let's hop into the time machine for my conversation with Dustin and Katherine.
Pascal: Today we are going to have a discussion that I've been wanting to have for years. We will be chatting about what it's actually like to write code at a company of this size, and the unique culture at Meta that allows us to move fast despite tens of thousands of engineers and many millions of lines of code. And I'm excited to have two long-term Facebookers, and now Metamates, with me to discuss all that.
Katherine and Dustin, welcome and actually welcome back to the Meta Tech Podcast.
Dustin: Thanks
Katherine: Thanks. Yeah, thanks for having me.
Pascal: Amazing. Katherine, can we start with you because you're new here on the podcast. How long have you been at Meta and what did you do before?
Katherine: Uh, I've been at Meta for a little over 11 years now. I guess it's coming up on 11.5. Prior to that, I was actually working at a thin-film solar cell manufacturing company. Like, literally the solar panels that would go onto your roof. I majored in chemical engineering in college and made the pivot on the job. I learned how to code once I got here.
Pascal: That is so exciting for multiple reasons. Not only because I feel like solar is one of the few appropriately hyped technologies, in the sense that it might actually have a shot at blunting the blow the climate crisis is dealing us, but also, obviously, the second part: you did not know how to code before you got here?
So what was the interview process like for you, and how was it learning on the job?
Katherine: I did a lot of data analysis for the thin-film solar cell manufacturing company. So the interview was actually for a data analyst position. It was, 'Here's two tables. How do you do a left join? What does a right join mean?' So I got the job as a data analyst.
But when I got here, I realized there's just so much more power in coding. I wasn't really that good at the analysis. Truthfully, what I really enjoyed more was figuring out how to program, how to automate the analysis so that the query would run on a daily basis using our internal framework for ETL: extract, transform, and load.
And so from there, I just leveraged all the internal resources that we had, leaned on other engineers to help me, and then went through our bootcamp process about a year after I joined and transitioned to be a software engineer.
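As a side note, the left-join question from that interview can be sketched in a few lines of Python with the standard sqlite3 module (the tables and data here are invented for illustration):

```python
import sqlite3

# In-memory database with two toy tables (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (1, 9.99);
""")

# LEFT JOIN keeps every row from users; where no order matches,
# the columns from orders come back as NULL (None in Python).
rows = conn.execute("""
    SELECT u.name, o.amount
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    ORDER BY u.id
""").fetchall()

print(rows)  # [('ada', 9.99), ('grace', None)]
```

A right join is the mirror image: every row from the right table survives, and unmatched left-table columns come back as NULL.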
Pascal: That is such a cool journey. We've had an entire episode about internal mobility before, and I had no idea that you were also one of the prime examples of how far you can go. As I said before we set up this interview, you are one of those people that everybody working at Meta, in certain parts at least, will at some point cross paths with, because you are all over diffs that touch anything code review or task related. You can submit one, and within two seconds you will have a Katherine comment on it.
So it's really amazing. But okay, before we actually get into that whole topic, Dustin, let me pass this on to you. So how long have you been now at Meta, and which space do you work in?
Dustin: I have been at Meta for coming up on nine years, and I work in the mobile space. Professionally, I've always been a software engineer. My story is not nearly as cool. I did go to college thinking I'd be a music performance major. That's where I started before switching to computer science. I will say I do have an interesting internal mobility story too.
I joined the company knowing Java. I previously worked at Orbitz, a travel company, and I came in looking for a server job, and everything here is well known to be written in Hack. So I was doing bootcamp tasks in Hack with a very well-known engineer here named Zeff. And Zeff gave me a task, and during bootcamp I caused a SEV, and the SEV was around Life Events. You know, when you post, like, 'Graduated college. Had a child.' Everyone's Life Event on Facebook.com was 'Graduated from Stanford University, January 1st, 2004.'
So they filed a SEV, and I was still in bootcamp, and I was terrified. And the thing that gave me a lot of comfort was that there was no blame. They were pretty funny, you know; they got it reverted pretty quickly. But I immediately started looking for other teams, and I met with a manager who worked in mobile, which I had very, very little experience in, and I was like, 'I got to get out of server. I gotta get out of there.' And he was like, 'That's totally fine. We're looking for someone to kind of ramp up on this mobile app called Paper.' And, you know, the rest is kind of history for me.
Had it not been for that SEV, and their willingness to accept a person who had never written Objective-C, who knows where I'd be. I've been working on mobile ever since, and I owe it to the bootcamp process and to them letting me move around to something I did not interview for at all.
Pascal: Amazing. Just a quick side note here: the SEV model is basically our internal incident management process, and we might get into more details later. But I'll start here and explain some of our internals, because that is actually one of the core topics that we're here for today, and it helps to set the stage.
So, Dustin, you wrote a blog post a while ago about the architecture of our iOS app, and if I may summarize the responses on Twitter and Mastodon, it was a lot of ‘this is stupid. Why don't you just do X?’ And to be fair to those people, it turns out that there are a lot of kind of facts of life from working at large companies over decades that are not at all intuitive if you're on the outside.
So that's why we want to actually cover some of those basics of working here. And I thought maybe we could just start with one of them.
And that's, why do we have a monorepo?
Dustin: That is a deep question.
Uh, you're going to have to subscribe to our Patreon for that one. You know, the extended cut. My take on the monorepo, or why I see the benefits of it, because I work on infra, is that I have the ability to upgrade what APIs people are using, or we can kind of upgrade everybody at once.
At companies I previously worked at, we had repositories for everything. You would more or less release a version. You're like, 'okay, I'm updating this API, here's V3.' And then you would have some kind of program where you'd be like, 'okay, everyone, I'm going to stop supporting V2 in a year.' The thing I like about here is that I can skip all of that. I can skip the post, I can skip the SLA to a certain degree. I can just go fix things up. And we have infra to support doing that across however many call sites. For me, it's the code sharing. But I'm curious what Katherine thinks.
Katherine: Well, the flip side of not having a monorepo is that you then need to sync code between repositories. And that just gets hairy. You're syncing potentially large JSON files or large GraphQL files from one repository to another, needing to remember which one's the actual source of truth. Right now, our main client-side code is not in the monorepo, so it does require a lot of this syncing infrastructure in order to give product developers access to the server side.
Pascal: Yeah, there's this meme internally: we love monorepos so much, we have two of them! But even that is not entirely true. There are plenty of little repositories, but the ones that most people will touch are basically our big WWW repository, which we've talked about before, and the kind of everything-else repository.
And I guess I'm not sure, you kind of touched on it, Dustin, but the ability to share code between different mobile platforms is also just absolutely critical for how we work. And I've seen other companies where it's much more common to just reimplement absolutely everything across Android, iOS and so on. Whereas over here we also have VR, for instance, where a lot of code sharing is happening.
And the more platforms you need to support and need to rewrite, the more the kind of need arises to find some sort of code sharing infrastructure between all those places.
Dustin: Yeah, definitely. One of the things that I've benefited from with the monorepo more and more: a lot of times, we would have continuous integration jobs. Everybody's got them at their companies. And the continuous integration code would be in WWW, the one repo, and then a lot of the stuff you would be building, compiling, or running the tests on was in the mobile repo.
You know, the long-tail one; we call it fbsource. And the thing that would happen is you would go into the WWW codebase and write a script in one repo, saying, 'okay, if you're building this job and you get this flag,' and then you'd be referencing the flags in another repo, or parsing args, or doing whatever else. And there was, to your point, a syncing issue.
And so over time, as they've moved more and more things into the same repo, it becomes nice, because you can put up one, whatever you want to call it, diff, pull request, or merge request, and everything gets reflected in that one source of truth. Especially as, over time, the rollouts to the servers have gotten faster and faster.
So it eliminates a lot of time spent on boilerplate for me. Everybody's been through the thing of, 'I'm building a REST API, so I'm going to make a new endpoint, wait for it to hit prod, then update the client so it can hit the new thing, and then delete the old one.' The monorepo, to me, has eliminated a lot of that. In one diff, I can just go, 'and now you're hitting the new one.' And that's pretty awesome for me.
Selfishly.
Pascal: Yeah. Another great example that I've benefited from is, if you work on a library or a framework internally and you want to make API changes, that also eliminates that whole step of what you've previously described. So like, I'm going to release a new version, but now I also need to provide support for the old version until everybody has moved over to the new one.
And then you need to go around and chase people like, 'hey, can you please update to V2? Because I really want to delete V1 now.' With a monorepo, you land your changes to the API, and in the same diff (that's like a pull request in our model) you change all the call sites. Then it is just live, and the old API doesn't even exist anymore.
Katherine: It's much riskier operating in a non-monorepo world, because it's so hard to have any guarantees that something isn't being used. It takes a lot more effort to determine if something is finally deprecated.
Pascal: Yeah, for sure. Okay, I think we've made a fairly good case now for why it makes sense for companies of this size to use a monorepo model. So now let's talk about why it's hard to have a monorepo.
Katherine: I have a limited understanding of source control, but that is definitely one of the major challenges. Any source control operations parsing through that many files, it just has the potential to be slow. But our source control team is phenomenal, and doing Mercurial commands in the monorepo is shockingly fast.
Dustin: Yeah, agreed. Basically anything like indexing or searching. Imagine, I came from a place where I was like, 'oh, I just need to grep,' and there were hundreds of files. Here, the order of magnitude is gigs, right? So grepping through that takes a second. And despite what we just outlined, where we're like, 'it's great! Everybody can share code,' that totally does not happen everywhere, you know?
There are definitely corners of the codebase, right? Because there were previously things that weren't in the repo. So despite the fact that it's all there, you'd like to think, 'in a perfectly architected codebase, there's one implementation of the circular profile pic, and WhatsApp and Facebook and Instagram all hold hands and use that code, and it's perfect. It logs, and it's resilient.'
And the reality is, there are like 15 of them. And then sometimes on a project we're like, maybe there shouldn't be 15, because we have to make sure they're all privacy-safe and we have to do all these things. It makes these projects a lot more palatable, or interesting, because we have the capability to do it, and there's a lot of value added in not having 15 circular profile picture implementations.
But I think the reality is just like, like any code base or any code bases, there's bifurcation, you know.
Pascal: Yeah, I think that's a great point about the tools, because it doesn't just affect source control. Obviously, we have an amazing team there; I say obviously because we've had them on the podcast, and they were obviously amazing on it. But it affects basically any tool that you want to run on the codebase. If you just try to open a standard IDE on the entire codebase, it will probably not finish indexing before the heat death of the universe, unless you put an additional layer on top that restricts its focus to certain files.
First, there's a lot of infrastructure you actually need to invest into. And I hear from a lot of small companies about like, ‘let's set up a monorepo!’ And I always try to tell them, you can do that, but make sure you actually staff a team that looks into the infrastructure, because this is not easy.
Katherine: We also run into challenges with continuous integration. You touch a file, and there's a potential fan-out in the number of tests that then need to run as a result of that change. We run into a lot of issues where hundreds of thousands of tests and jobs are potentially running for a single change. How do we display the results to the developer?
How do we determine which jobs are the important ones to run?
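One common way to bound that fan-out, sketched here under invented dependency data (this is not Meta's actual test-selection system), is to walk a reverse-dependency graph from the changed file to the tests that could be affected:

```python
from collections import deque

# Hypothetical map: each file -> the files that depend on it.
REVERSE_DEPS = {
    "util/strings.py": ["feed/render.py", "profile/render.py"],
    "feed/render.py": ["tests/test_feed.py"],
    "profile/render.py": ["tests/test_profile.py"],
}

def affected_tests(changed_file: str) -> set[str]:
    """Breadth-first walk of the reverse-dependency graph,
    collecting every test file reachable from the change."""
    seen: set[str] = set()
    tests: set[str] = set()
    queue = deque([changed_file])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node.startswith("tests/"):
            tests.add(node)
        queue.extend(REVERSE_DEPS.get(node, []))
    return tests

print(affected_tests("util/strings.py"))
# {'tests/test_feed.py', 'tests/test_profile.py'}
```

Touching a leaf file selects only its own tests, while touching a widely used utility fans out to everything downstream, which is exactly the explosion Katherine describes.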
Dustin: Yeah, I was going to say, I think you've had a lot of people on this podcast touting the tech built to solve exactly this. Like, you had a Buck2 talk, right?
Pascal: Yup
Dustin: Okay. Buck2 is a lot faster at parsing all these hundreds of thousands of BUCK files that we have. You've had someone talking about Sapling, our Mercurial-compatible source control. You've had someone talking about Eden at some point, I think; it's probably come up. It's open source. It's our file system.
All these things exist to basically support the repo, especially as it gets bigger and we keep merging more in. That's the thing that astounds me: over time, the source control team is like, 'No, bring it on. More.' And then, 'okay, we're going to fold this other repo in,' and then, 'Cool. All right. Right on.' And I was like, 'Wow, very impressive.' But it works.
Tested it out and it works.
Pascal: Yeah. It's a little scary, that attitude of 'Bring on the pain. We're in the process of solving this one issue; just bring us 20 more while we deal with it.' And somehow they still manage to sort it all out.
It's happened multiple times in the past that we saw this kind of laptop cliff emerging, where it just felt like: we cannot fit all the stuff we need to build this one application onto a laptop. What are we going to do?
We could just see this graph creep up and up and up toward a certain line. And that line can literally be something like, Apple doesn't sell us laptops with a drive larger than one terabyte. I think they have them now, but that was something we could not rely on in the past, that they would actually increase the capacity of the machines.
And then you need to figure out, ‘okay, how do we solve this? Can we put some of it into the cloud?’ Can we, as you said, ‘put it onto some remote file system?’ All these different questions that emerge over time.
Dustin: Yeah. Going back to what you were talking about, that post that I wrote, a lot of the questions were also coming up around making changes to the codebase. We're talking about gigabytes and gigabytes of files, and people might be wondering, what are the files? Like, I understand Meta has a lot of products, but how is it so big? What's in it? Imagine all the code for Facebook, Instagram, WhatsApp. We have tons of internal apps that we make, it's like filters, and on and on. It definitely adds up. But Katherine knows the WWW codebase way better than me. What's in there? Why is it the way it is? It seems huge. I don't know.
Katherine: We have a lot of code to represent objects in the graph. We have a framework called the Ent framework, where an Ent is basically an object. And for all those apps that you just mentioned, you can imagine that there are a lot of objects and associations in that graph: a user connected to another user, a user's home location, and just the fan-out of all the different connections that we have in our products.
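As a loose sketch of that objects-and-associations idea (the real Ent framework is far richer, and every name here is invented):

```python
from dataclasses import dataclass, field

@dataclass
class Ent:
    """A toy entity: an object in the graph with fields and
    named associations (edges) to other entities."""
    id: int
    type: str
    fields: dict = field(default_factory=dict)
    assocs: dict = field(default_factory=dict)  # assoc name -> list of ids

    def add_assoc(self, name: str, other: "Ent") -> None:
        """Record a directed edge from this entity to another."""
        self.assocs.setdefault(name, []).append(other.id)

# A user connected to another user, with a home location field.
alice = Ent(1, "user", {"home": "Menlo Park"})
bob = Ent(2, "user")
alice.add_assoc("friend", bob)
print(alice.assocs)  # {'friend': [2]}
```

The point of a single framework like this is that every team models objects and edges the same way, instead of each team inventing its own storage layer.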
Pascal: I think one part of the answer is also that at Meta, we love generating code. That is something that exists not just in WWW but in all sorts of mobile applications too. Just look at Litho, at least during the Java days, where you would write one spec that basically said, here is the layout tree I want to set up, and it would generate an actual Java implementation of it, with builders and all of that.
That generated code actually takes up quite a lot of space. And in WWW, we usually check those files in. So that covers what Katherine describes: the outline of an object, but also privacy rules. Who can access this? Under which circumstances are you allowed to look at it?
All of this is usually one file that is checked in for one file that you've written. And I don't know what the ratio is like, but for certain parts of the code base, it really feels like you touch one file and 20 of the same size are generated for you.
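As a rough illustration of that spec-to-generated-code ratio, here is a toy generator. All names are invented, and real codegen at this scale is far more involved; this just shows how one small spec balloons into builder-style source:

```python
def generate_class(name: str, fields: list[str]) -> str:
    """Emit source for a class with a builder-style setter per field,
    the way a spec-driven code generator might."""
    lines = [f"class {name}:"]
    lines.append("    def __init__(self):")
    for f in fields:
        lines.append(f"        self._{f} = None")
    for f in fields:
        lines.append(f"    def set_{f}(self, value):")
        lines.append(f"        self._{f} = value")
        lines.append("        return self  # builder-style chaining")
    return "\n".join(lines)

# A two-field "spec" already produces a dozen lines of source.
src = generate_class("ProfileHeader", ["title", "image_url"])
namespace = {}
exec(src, namespace)  # compile the generated source into a real class
obj = namespace["ProfileHeader"]().set_title("hi")
print(obj._title)  # hi
```

Because every generated class follows the same template, a fix to the generator (or to the template) instantly fixes every generated file, which is the optimization leverage discussed below.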
Dustin: Yeah, definitely. And I think what you're getting at here is a theme, which is the reason we generate all the code. Because you do the same thing in mobile. This comes up on Reddit like once every couple of years: someone's like, 'oh my gosh, I took apart the Facebook app. What did I find?' I don't know what the exact number is, but they're like, 'I found like a million classes,' and everyone's like, 'What?'
But as someone who's reading it, I'm like, well, it's exactly what Katherine's talking about. We use GraphQL. It's well known across the world that Facebook uses GraphQL.
Pascal: Great example.
Dustin: And to represent all the different types that we have, like users, profiles, groups, whatever, we generate objects, you know, or types. And all those types show up in the compiled binary that you can take apart from the App Store. I think when people find that, they assume we're handwriting these things, but a lot of that stuff is code-generated, so it's easy to be mistaken. It would be really hard to handwrite all those things.
But we do that because we have so many people working on these things, whether it be the website codebase or the mobile app codebases. And we're trying to make sure that people don't just trip and fall and write something by accident that could be privacy-unsafe or crashy or whatever.
And so more and more, I find that like the reason we do this code gen is that it, it softens the edges of the damage that one could do. Like me in bootcamp, making every Life Event ‘Graduating from Stanford.’ You know, like, I don't think I could do that as easily now because of those things as I did in 2014.
But maybe I'm wrong. I doubt it, though.
Katherine: Prior to the Ent framework, every team basically had to write their own MySQL DB writers and getters. And if every team has its own unique, bespoke implementation of effectively the same thing, it makes mobility much more difficult, because a person would have to ramp up on that team's specific way of doing something. So we have a common framework: this is the one way in the company that you create an object, this is how you use that object to write data to a database, and this is how you read data from that database.
It makes it much easier for people to move from a, from one team to another, because the common infrastructure just makes that possible.
Pascal: Yeah. And also optimizations: you can make them in one place and apply them across the entire codebase. There have been so many examples in the Litho space where that happened. Somebody figured out, oh, if we just lazily mount this thing here, we can actually make scroll performance better by N percent.
And I'm sure the Ent team that looks after our data queries has plenty of stories like this too. Like, oh, this particular query pattern here, if we put an index over it (and I have no idea whether that's how our databases actually work, but something like this, right?).
Dustin: Yeah. I mean, the same thing happens on the iOS side of mobile. We did something similar to Litho: we have ComponentKit, which we've talked about before. And a real problem we ran into around 2017 was that you'd write a component, or compose a bunch of components, to make a screen. And components were represented as Objective-C classes, which all carry a lot of metadata that doesn't come for free. There are a lot of great posts on the internet about how many bytes it costs just to declare a class in Objective-C. And the fact is, we had tens of thousands of components. So much so that when we looked at the Facebook app size, we were like, why is it the size that it is?
There was a non-trivial amount that was just metadata for the classes getting included in the app for component hierarchies. So much so that the ComponentKit team actually started messing around with, 'Do we need classes? Can it just be a struct?' You know, or a grab bag of functions.
Similar to React hooks, or the pure-function React components, where they're not actual classes. Very similar idea. And what they found was that codegen was really powerful internally, because over time, it's exactly what we're saying here: you can optimize just the internal generation of the code, and all of a sudden it fixes the 60,000 classes and we find big app-size savings.
But it all comes at a cost. You know, like, most of this codegen is not stuff that people use outside of the company. So you got to ramp up on it one time and then hopefully you don't bump into it again.
Pascal: Yeah, I think you're describing a very common pattern at Meta, where we often just let a thousand flowers bloom. We put our framework out there. It's probably not perfect, and it has some problems in the long term, because it will, for instance, add a lot of code size. And when you look at a graph, it will go up and up and up. And if it actually finds a lot of adoption, like ComponentKit and Litho did, you will at some point see this exponential growth and be like, uh oh, this is not looking sustainable.
But then somebody will look into a way of optimizing this. In the Litho case, we had so many classes in there that were not strictly necessary at runtime. So the brilliant Redex team came along and wrote an optimization to remove all those classes, or just merge them together through some amazing bytecode hacks. And suddenly you see the graph drop to the bottom again, and you have a much more, I don't know, logarithmic growth curve all of a sudden. And for what you've described in ComponentKit, that was effectively the same story.
And I feel like we see this all over the place: we start with a simple solution, see if it actually takes off, and then we look into a way of making the growth curve a little more sustainable.
Dustin: I also feel like, over the years I've been working here, the expectations around owning frameworks like this have become much more, rigid is the wrong word, but, you know, in the old days it was much more likely that your team could build some kind of bespoke framework and just let it grow naturally with adoption. And then you could kind of be caught on your back foot. Like you said, you're like, oh, when I built this, it was originally just for News Feed, but then you started using it in Profile, and then all of a sudden the whole app is using it and it's not scaling that well.
And so you go back and you work on the optimizations. And I feel like nowadays that stuff, if you're going to build something from scratch, is supposed to be something that's going to proliferate like that. You do have to go through a lot more like due diligence to prove out a lot of these things that we've learned the hard way over the years.
But I think it's part of the nature of the codebase, which is, we don't have owner files like a lot of companies do. Like it's much more possible for something to kind of like grow in a corner of this massive codebase and then proliferate. And I think a lot of people hear that and just, you know, that is a little mind boggling. Like it was for me when I first started out.
You know, I joined from a company that had QA testing and people whose full time jobs it was to be testers. And there were like architects, you know, and so I would put code up for review and like a singular person had to okay it. And when I joined here, you could kind of add anyone you want as a reviewer from anywhere. And they could be like, ‘Right on.’ That was like, this is wild. And I was like, so who tested it? To like, who's the QA person? And they're like, you are. And I was like, whoa, it's really empowering. But it's also like, you know, it can be an overwhelming sense of responsibility.
And so I think that's the culture, right? Everything's in the codebase. It can grow. You can review each other's code. There is ownership, though. I don't want this to be aggregated and turned into a headline like, 'Do whatever you want in the Meta codebase. There are no rules.' There are rules. But
Pascal: Just need a buddy to accept your diff. That’s all.
Dustin: Yeah, but there's, it's like
Pascal: I’m joking.
Dustin: there is no instrumentation, as far as I know, and Katherine's the expert on this, to say that you can only push a change to this single file if one specific person agrees to it. Which, in the ten years I've been here, has never been enough of a problem that I've been like, we need to change this. But I'd love to hear what Katherine thinks about this.
Katherine: This is definitely one of the challenges of the monorepo. Everything is open, and you do sometimes get surprised, like, hey, this function that I wrote was intended for my team only, and suddenly it's being used all over the place. And when I want to change my function, I now have to go into hundreds of call sites.
Ah, with regards to ownership, um, it is true that anyone can accept any diff, any change, in basically any codebase as long as you are an engineer and you have the permissions to do so.
We kind of operate under a cultural understanding, I want to believe, that if you have no context about the code, if you cannot really say whether it's correct or not, you generally avoid just accepting the diff for your teammate, and you wait for someone with context, someone who quote-unquote 'owns the code,' to review it.
There have been a lot of teams that have come to us asking for an implementation of, effectively, required reviewers, because their part of the codebase is particularly sensitive, such as the networking team, where any change to a traffic configuration could suddenly drop the site. And there have been similar requirements from external auditors.
And so we actually do have a way for the owners of a part of the codebase to do a retroactive review. After the diff lands, if neither the author nor the reviewer were the owners, a retroactive review is triggered. That way the owners can at least see the change that was made and approve it after the fact.
So that's the solution that we always recommend and push for, because it doesn't block the developers during the actual diff authoring process.
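The retroactive-review rule Katherine describes boils down to a simple predicate. This is a hypothetical sketch, not Meta's actual implementation, and all names are invented:

```python
def needs_retroactive_review(author: str, reviewers: set[str],
                             owners: set[str]) -> bool:
    """After a diff lands, trigger a retroactive review when neither
    the author nor any accepting reviewer owns the touched code."""
    participants = {author, *reviewers}
    return not (participants & owners)

# Neither the author nor the reviewer owns the code: flag it.
assert needs_retroactive_review("dust", {"pascal"}, {"zak"})
# The author is an owner: no retroactive review needed.
assert not needs_retroactive_review("zak", {"pascal"}, {"zak"})
```

The appeal of this shape is exactly what she says: the check runs after landing, so diff authoring is never blocked on a specific owner being available.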
Dustin: Um hmm.
Pascal: Actually, can we take a massive step back? Because I think we've not really talked through the entire flow of making a change and then putting it out. We've already kind of talked about CLs a little, and this auditing step that can happen, but let's talk about how we even put up a change, and why we have pre-commit code review, which for us seems like second nature, but in many places doesn't exist.
Dustin: Hmm
Katherine: Yeah. So a developer writes their code in their editor of choice. They submit that change to our internal code review tool, and the tool has the ability to comment on the code and add inline comments. You see the left and right, like, before and after. It highlights the changes, with red for what was removed and green for what was added. It's actually a really slick user interface for the actual, like, code review process.
I really believe in the code review process. I think it's really important to have another set of eyes looking at the changes. It's actually the way that I think developers grow and learn. Like, I would not be able to be a software engineer today without the people that reviewed my diffs way back when and taught me the best practices. They taught me the code smells to look out for. Once a diff is accepted, then the developer can land their change to the codebase.
Pascal: I think one part that is also kind of unique about Meta culture is that we rely a lot on lints, which are also directly shown in the diff review tool. A lot of the ways people safeguard their files against accidental changes that could trigger some behavior are, obviously, tests. But sometimes you can also have lints that show inline warnings like, hey, do you actually want to call this particular method without checking that the return value was actually used, for instance? And that is yet another way you can ensure that people who might not actually know your particular codebase fall into the pit of success.
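A lint like the one Pascal mentions can be surprisingly small. Here's a hedged sketch using Python's standard `ast` module; the function name `compute_checksum` and the warning format are invented for illustration, and real lint infrastructure would be far more general:

```python
import ast

# Hypothetical set of functions whose return value must not be discarded.
MUST_USE = {"compute_checksum"}

def lint_unused_returns(source):
    """Flag bare expression statements that call a must-use function
    and throw the result away, like the inline diff warnings described."""
    warnings = []
    for node in ast.walk(ast.parse(source)):
        # A Call sitting directly inside an Expr statement means
        # nothing captures its return value.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in MUST_USE:
                warnings.append(f"line {node.lineno}: result of {name}() is unused")
    return warnings

print(lint_unused_returns("compute_checksum(data)"))
print(lint_unused_returns("x = compute_checksum(data)"))  # result is used: no warning
```

Because the source is only parsed, not executed, a check like this can run cheaply on every diff and surface its warnings inline in the review tool.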
Dustin: Yeah, I always recommend that folks stray from lints and actually codify a lot of these things. Um, you know, because the classic, the worst-case scenario, right, is that somebody comes into your codebase, like Katherine talked about. You write a function that was never intended to be, you know, the 'get my top five friends,' you know, and then someone's like, 'This is it. This is the function.' And all of a sudden you go to fix it, optimize it, whatever else, and then you find that you're not the only caller.
You know, people have changed how it works. People have, you know, changed how many times it's getting called. And I think that's, like, the default example that I would go to if I didn't work here. I'd be like, 'oh, well, this is just chaos, right?' Like, people could make the performance worse.
And I say, if you have any qualms about any of that stuff, we have at Meta all different ways of doing performance testing, unit testing, integration testing. You know, you name the kind of test you want to write. Um, our CI is pretty good, I would say very good, all things considered, at targeting, you know, based on the files that got changed, which tests need to be run. And if you want to make sure that Joe Engineer or Jane Engineer doesn't come into your codebase and regress it, write tests.
That is the most surefire way. Because you can write lints, you know, you can write whatever. You can put a big spooky comment on the diff. But at some point, someone can land it, and then it might take us a little bit of time after the fact to go, oh well, it did show them a lint, but maybe that person missed it. You know, maybe that, whatever. So I think it's one of those things with the culture that, you know, comes with pros and cons: you do get the freedom, but you do have a lot of responsibility to, you know, make sure that you're being a good citizen, or actually, kind of, protect your stuff.
There's nothing spookier for me than when I'm going around, like changing code, you know, or I'm refactoring something and I just run into a place with no tests. Because then, as a person changing it, I'm like, well, how do I test this?
It just doesn't inspire a lot of confidence. And the person's like, 'oh, you just boot up a simulator, and you install Oculus on your headset, and then you get a Windows machine, just,' you know, I'm just like, 'please write a test.' I don't have, you know, I don't have something running FreeBSD, you know, handy next to my desk, you know, with the checkout of all your stuff.
I mean, that can take hours. So it's in everybody's best interest to just test, test, test, test, in my opinion.
Pascal: That is such a good point, because it actually changes how you write tests here. I feel like in many ways I'm not following the kind of industry-wide accepted best practices in the kinds of tests I write, because quite often they fall under this kind of change-detector model. It's like, oh, something in the implementation has changed. That is the kind of test that people will tell you you should actually delete before you check in the code.
But it can actually still be quite helpful to ensure, like, 'hey, I know exactly what this function should be doing,' and if somebody else accidentally comes by and changes it, they also need to make sure that the kinds of inputs and outputs that are defined here still make sense.
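A change-detector test in the sense Pascal describes might look like this. The `top_friends` helper (borrowing Dustin's earlier 'get my top five friends' example) is hypothetical; the point is that the test pins an exact input/output contract:

```python
def top_friends(scores, n=5):
    """Hypothetical helper: rank friend ids by interaction score."""
    return [fid for fid, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:n]

def test_pins_exact_behavior():
    # A deliberate "change detector": it asserts one concrete
    # input/output pair, so anyone who alters the ranking logic,
    # even accidentally, gets a loud, immediate failure.
    assert top_friends({"a": 3, "b": 9, "c": 5}, n=2) == ["b", "c"]

test_pins_exact_behavior()
```

Conventional wisdom says such tests are brittle; the argument in the episode is that in a monorepo touched by strangers and robots, that brittleness is exactly the safeguard you want.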
I feel like one aspect that we haven't really talked too much about is that it's not just humans that change your code. Quite often, it's also robots that come along and might change your code. And guarding against our robot overlords is another very, very important role that tests can play. Dustin, do you have some insight into the robot codemod infrastructure that we have?
Dustin: Yes. Yeah, yeah. We have a service where, you know, it ultimately works like this. You write two steps. One step is: what files do you need me to change? You identify them, you know, you return the list of file names. And then the second step is: what do you want me to do to them?
And then you tell it, 'If you see a function call for Dustin's cool func one, I need you to change it to Dustin's cool func two,' you know, and add a parameter or whatever. You manipulate the file with whatever technology you need, you know, whether it's AST or regex or, you know, find and replace, whatever.
You do that all yourself. And then you provide the robot some kind of configuration to say, 'Okay, so I found a million files. Am I supposed to put this all in one PR?' and you go, 'No, no, no, no. One file per PR.' And then it puts up a million PRs. And for every PR, I want you to add, like, the last ten people who touched the file, make them the reviewers, kind of thing.
And so this gets rid of a lot of time spent for a developer, so that you don't have to do this locally. Because it's a lot of, like, rinse and repeat. You know, a lot of people end up writing a script, because, well, 'how do I figure out all the files? How do I?' You know, we call it command-line golf. It's like, what's the command-line thing to run this on a million files, and then split it so that one PR is, you know, one file, and then submit them to the diff tool.
It takes a lot of that logic out, but it is, you know, supported and maintained a lot by the people who work on code review, which, like, Katherine does.
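The two-step shape Dustin describes can be sketched in a few lines. Everything here is hypothetical, the function names, the in-memory `repo` dict standing in for the monorepo, and the config keys, but it shows the division of labor: you write the file selector and the transform, the service handles the batching:

```python
import re

# Step 1: which files need changing? (a stand-in for a real repo query)
def find_files(repo):
    return [path for path, src in repo.items() if "cool_func_one(" in src]

# Step 2: what do I do to each file? (regex here; an AST rewrite works too)
def transform(source):
    return re.sub(r"\bcool_func_one\(", "cool_func_two(extra_arg, ", source)

# The service handles the "rinse and repeat" around your two steps:
# one diff per file, reviewers picked from recent authors of that file.
CONFIG = {"diffs": "one-per-file", "reviewers": "last-10-authors"}

repo = {"a.py": "x = cool_func_one(1)", "b.py": "print('hi')"}
for path in find_files(repo):
    repo[path] = transform(repo[path])
print(repo["a.py"])  # → x = cool_func_two(extra_arg, 1)
```

This is exactly the 'command-line golf' the service replaces: the loop, the splitting into per-file diffs, and the reviewer assignment all move out of an ad-hoc shell script and into declarative configuration.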
So I don't know do you, do you have a different perspective on that stuff or do you see it as like a good thing or a bad thing?
Katherine: I think it's something that's been really powerful for making these large-scale changes across the codebase. As someone who was here pre-codemod service, any time that I tried to do, like, a rename, it was so much more manual effort than just writing that config once and letting the system take care of all of the diff state management.
We also use codemod service for the syncing of files that I mentioned earlier, from one repository to another. Like, that would be such a pain if a developer had to do that manually each time those files needed to be synced.
Dustin: Um hmm. Yeah. It ends up being, like, a mix of automation for a piece of work that you might need to repeat, like, literally a million times, and also, like, a cron job, to a certain degree, where you can just be like, hey, I need you to do this once a day. I just need you to run this thing, put up a diff, and then, uh, have somebody review it, just, like, make sure it makes sense.
But, to the original point, sometimes, you know, I write a lot of these. You know, the nature of being on the mobile infra team, there are lots of things we're updating, you know, or trying to improve. And so this tool's really helpful for me.
But what I've noticed over time is that a lot of times a change made by one codemod, you know, one config, will basically trigger changes by another one. And so there will be days, like I saw yesterday, where I reviewed a diff made by a robot, which, you know, ran code that I wrote, ran my script, put up a diff that I reviewed, which triggered the work of another robot. And then they kind of go into this, like, virtuous circle where the codebase is getting better, or, like, linted, over time, and it's from, like, six robots that I have putting up the diffs. And again, it's important that those things be reviewed by humans.
But I think, from outside of the codebase, imagining having so many files that you can't, like, open it in IntelliJ or Xcode and just use the refactor tool, having that fall over, can be a bit confusing or hard to grasp, for sure.
Like I wouldn't have, I wouldn't have got it.
Pascal: I want to get to the review part in a bit. But just quickly, I was kind of low-key hoping that your question, Dustin, about, like, 'hey Katherine, what is your view of codemods?' would have been answered with, like, 'I hate them! Please, Dustin, stop!' That would have been some great drama here for the podcast.
About the review step, because I think that is also really interesting. If you put up a thousand diffs because you change everything across the repository, how do you find the right people to review them?
Katherine: We have definitely tried to be smarter about that, using ML to determine, like, who are appropriate reviewers for a particular file, leveraging things like: who made the change last? Who reviewed it last? Which teams are most represented among those who have touched this file in the past X amount of time? Finding the right reviewers is definitely probably one of the most challenging parts of the monorepo.
So the more that we can automate that, the more that we can have tooling that helps developers find the right reviewer, the faster the overall cycle time gets.
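The signals Katherine lists (last author, last reviewer, recent team activity) could be combined in many ways; a deliberately simple recency-weighted heuristic, which is an illustration only and not the ML model she refers to, might look like:

```python
from collections import defaultdict

def suggest_reviewers(history, top_n=3):
    """history: (engineer, days_ago, role) events on a file, where
    role is "author" or "reviewer". Recent activity scores higher,
    and authoring counts more than reviewing in this toy weighting."""
    scores = defaultdict(float)
    for engineer, days_ago, role in history:
        weight = 1.0 if role == "author" else 0.5
        scores[engineer] += weight / (1 + days_ago)  # decay with age
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

history = [("zak", 2, "author"), ("dust", 30, "author"), ("pascal", 1, "reviewer")]
print(suggest_reviewers(history, top_n=2))  # → ['zak', 'pascal']
```

A real system would learn these weights from data rather than hard-code them, but the inputs, who touched the file and how recently, are the ones named in the episode.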
Pascal: Yeah, it's super helpful to have this sentinel bot available. But I've also noticed that now reviewers are even just added on top of the list that I've created, because there are often, kind of, people who are known by the system to be the absolute experts in a certain area. And it has made a huge difference for me in just getting my diffs reviewed as quickly as possible.
Katherine: People can also create rules: if a diff touches files x, y, and z, then automatically add me, or automatically add my team, as a reviewer to those diffs. I care about these files, so I'm going to be automatically added as a reviewer.
Dustin: Yeah, I use that all the time. It's also a great way to see how people are using a lot of the frameworks and things that our teams build. You know, like, we'll roll something out, and then anytime somebody writes a diff and the words, like, 'cool framework feature I wrote' appear in the text, you know, put me on it, because I'd like to look. Or tags, you know, you can put tags on the diffs, and then we have, like, a search tool: find any diff that was tagged in the past week or whatever. That's really helpful for me, because then I like to go through and be like, 'interesting, cool.' Like, you know, I built this thing, and I expect it to be used this way, and they're actually using it a completely different way. And that ultimately has a lot of influence on how stuff gets made, I think. Especially for internal tech, like, if it's not open source. If I'm like, oh, wow, 99% of people just really want it for this one thing, then I might tweak and optimize it for that one thing, as opposed to building it more generally.
Pascal: And what's particularly cool about this rule framework is that it's not specific to the diff tool; it works across all our internal tools. It's, like, a big if-this-then-that framework, where you could say, like, 'if we're having a full moon and somebody mentions this feature in a diff, then wake me up at night, or send me an SMS,' or something like this. So you can express pretty much anything in there.
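The shape of such a rule, a trigger condition plus an action, is easy to sketch. The rule table, team names, and glob patterns below are all invented for illustration; only the if-touches-then-add-reviewer idea comes from the conversation:

```python
from fnmatch import fnmatch

# Hypothetical rules: "if a diff touches these files, add these reviewers" —
# the reviewer-specific case of the broader if-this-then-that framework.
RULES = [
    {"if_touches": ["networking/*.conf"], "then_add": ["traffic-team"]},
    {"if_touches": ["ui/*"], "then_add": ["zak"]},
]

def reviewers_for(changed_files):
    added = []
    for rule in RULES:
        if any(fnmatch(f, pat) for f in changed_files for pat in rule["if_touches"]):
            added.extend(rule["then_add"])
    return added

print(reviewers_for(["networking/edge.conf", "docs/readme.md"]))  # → ['traffic-team']
```

Because the trigger and the action are decoupled, the same rule engine can route a diff to a reviewer, fire an SMS, or do anything else an action handler implements.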
Dustin: Yeah, it's really slick.
Pascal: Just to quickly summarize: we have all our code basically in one big folder. Anyone can make changes to anything, and in order to get it landed, they just need one person to review it and give it a stamp, and then they can land it. And hopefully there will be tests that cover it, or some rules that wake you up at night if somebody touches your file.
So with that, how are things not constantly on fire?
Katherine: Well, we have a release process, um, where changes go out in a staged rollout. The first stage is to employees only, so employees are pretty much constantly dogfooding the apps, the products. Employees are a pretty good first line of defense against anything bad going out to the general public. Employees are encouraged to submit bugs as they see them. And so that's kind of one way that we ensure the quality of our products.
If it passes through employees, then it goes to a very, very small percentage of the population. And again, we're checking, are there bug reports? Are we seeing an increase in exceptions? Our logging frameworks are very rich and enable a lot of alerting functionality.
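A staged rollout like the one Katherine outlines is often implemented with deterministic user bucketing. This sketch is hypothetical, the stage names, percentages, and hashing scheme are assumptions, not Meta's release infrastructure:

```python
import hashlib

# Hypothetical stages: employees first, then a tiny public canary, then everyone.
STAGES = [("employees", 100.0), ("public-canary", 0.1), ("public", 100.0)]

def in_rollout(user_id, percent):
    """Deterministic bucketing: hashing the id means the same user
    always lands in the same bucket, so a canary audience is stable."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10000
    return bucket < percent * 100  # percent of 10,000 buckets

def gets_new_build(user_id, is_employee, stage):
    name, percent = STAGES[stage]
    if name == "employees":
        return is_employee  # dogfooding: employees see every change first
    return in_rollout(user_id, percent)

print(gets_new_build("alice", True, 0))   # → True
print(gets_new_build("bob", False, 0))    # → False
```

Between stage promotions is where the bug reports and exception-rate alerting Katherine mentions come in: a regression caught at 0.1% never reaches the other 99.9%.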
Dustin: Yeah. I also think that we work with, I work with very smart people, you know, who have souls. Aside from the robots putting up diffs. But I guess that's like a different topic.
I mean, realistically, if I'm changing code in a file that I've never seen, part of me is afraid that I will, you know... like, I have a conscience. I'm not just willy-nilly changing files and being like, 'if I blow up the website, so be it.' You know. I mean, I think at a certain point there might be consequences if you've just set the whole codebase ablaze for, like, weeks on end. I think at some point, you know, there will be some feedback given, like, hey, I love your bias toward action here. I love the way you're just taking the codebase by storm and trying to really enact, you know, change. But, you know, I mean, this is still a job, and you have expectations, you know? So I think part of it is tests. Part of it is the dogfooding.
And then part of it is just the code review process. I can't overstate how important it is. And the diff tool is awesome. It is so cool, you know, and it does a great job, with that ML, of cc'ing the right people. And I have very rarely, over the years, run into a case where something I own exploded because a complete stranger changed it and I didn't see it, or nobody on my team saw it, and it made it through the dogfooding process, and it passed all the tests and made it through all those barriers.
And, you know, as time goes on, I just get more confirmation of that. And we keep adding enhancements over time. Like, we added the final review process.
So the idea was, like, okay, let's say someone accepts your diff, and then you just change the whole thing. Can you still submit it? Historically, the answer might have been yes, but now the answer is, like, definitely not. It basically puts it up for review again. It's like, hey, after you said this change looked good, they basically rewrote it, you know, or they changed a bunch of stuff. Is the stuff still okay?
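The core of that 'final review' check could be as simple as comparing the content that was accepted against the content being landed. This is a minimal sketch of the idea, not the actual mechanism (a real system would, for instance, tolerate trivial rebases):

```python
import hashlib

def digest(diff_content):
    return hashlib.sha256(diff_content.encode()).hexdigest()

def can_land(accepted_digest, current_content):
    """If the diff changed after it was accepted, the accept no
    longer counts: review is reset and the reviewer must re-approve."""
    return digest(current_content) == accepted_digest

accepted = digest("def f(): return 1")
print(can_land(accepted, "def f(): return 1"))    # → True: unchanged since accept
print(can_land(accepted, "def f(): return 999"))  # → False: re-review required
```

The point is that an accept is bound to a specific version of the change, closing the loophole where an approved diff gets rewritten and landed without fresh eyes.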
And so there's so much signal created, and so much ability to create signal, with tests, the if-this-then-that tool, and all this stuff, to make sure that these things don't fall through, that it works. And it gives us the flexibility. But I think that's the trade-off of this system, right? We are given the power to make these changes, but it comes with the responsibility, and stuff like that.
Katherine: It shows that there's a lot of trust in the employees, which I think we all really appreciate.
Dustin: Yeah.
Pascal: For sure. And what you said, Dustin, about expectations, I think, goes in both directions. If you're making a change to a part of the codebase you might not know much about, the expectation is that you will get somebody who is familiar with it to review it, and that you actually go through the test plans.
We haven't even really talked about this, but we almost always have a summary section about your change, and then the test plan: what have you done to actually test that this change works as expected? And you can look at previous test plans, for instance, to see, like, how have people tested this before, or how can I verify that the changes actually work correctly.
So that's kind of one part of the expectations. But then there's also the expectation, in terms of better engineering, that you actually ensure that your codebase is in a healthy state, that it is well tested, well documented, and has all these fail-safes in place.
Dustin: And the code review tool has all this stuff now. Like, there is AI, or ML, I don't know what the right term is here, but it, like, wrote my diff summary the other day, way more succinctly than I could have. It was like, do you want me to write it? And I was like, yeah, and it was great. I was actually really surprised.
But, you know, it will say, like, hey, based on your test plan, it kind of looks a little light, and it gives me a hyperlink, being like, check out this one. You know, or, like, maybe you should include screenshots from this other plan. And I'm like, whoa. So it seems like there's something happening behind the scenes to help augment them so that they're even better.
Because, to your point, a lot of times when I make changes, I go look at the blame on a file, you know, whether it's git blame or hg blame, we're all familiar. But yeah, I go back and look at a previous change, and a lot of times the test plans have instructions for how to properly test it. Which is, like, the most helpful thing.
Katherine: The how-to commands, yeah.
Pascal: Yeah. Honestly, I'm gutted that we are running out of time because I feel like we could talk for at least another hour or two. And if people like this kind of rambling discussion about how you actually write and land code at Meta, I'm absolutely happy to have you back on in the near future. But until then, Katherine and Dustin, thank you so much for discussing codebase health and shipping stuff at Meta with me here.
Katherine: Thank you so much. This was great.
Dustin: Yeah. Thanks.
Pascal: Welcome back to the present, or, technically, the past by the time you're listening to this. Don't forget to tune in to Meta Connect on September 17th by going to Meta.com/connect, and leave us a five-star rating if you're so kind. But that's it for another episode of the Meta Tech Podcast. Until next time, have a lovely rest of the summer. Toodle-loo.