Feb 10 2026
Multimodal AI for Ray-Ban Meta glasses
Listening time: 39 mins

Guest(s): Shane Moon, AI Research Scientist

Multimodal AI—models that can process multiple inputs like text, audio, images, and even motion sensors—is transforming what wearables can do. With Ray‑Ban Meta glasses, multimodal AI helps the device understand what you’re seeing, so you can ask questions about the world in front of you—from translating text to recognizing landmarks.

But what does it take to bring AI into a wearable form factor?

In this episode, host Pascal Hartig sits down with Meta research scientist Shane Moon in the Wearables AI team, who has spent the past seven years working at the intersection of computer vision, perception, and multimodal learning. Shane shares what it’s like to do research with direct product impact, how the feedback loop between research and engineering has accelerated in the generative AI era, and why wearables introduce unique constraints that change how you build foundational models.


Transcript:

Pascal: Hello and welcome to episode 72 of the Meta Tech Podcast, an interview podcast by Meta where we talk to the engineers who work on our different technologies. My name is Pascal, and although the podcast is entering its 8th year of operation soon, my outie is still blissfully unaware of its existence. Let's keep it that way.

We are breaking the Android streak with this episode in a radical gear shift. I'll be talking to Shane, who is a research scientist working on multimodal AI for products like the Ray-Ban Meta glasses.

This puts Shane in a historically somewhat rare position where his research directly influences the direction of entire product categories. But that's not where the fun ends. Much of Shane's work is public and open source. And in our discussion, he shares how vibrant the back and forth with the open-source AI community is.

If you want to hear more about Meta's open source efforts, don't forget to follow @MetaOpenSource on Threads. You already follow @MetaTechPodcast, right? The team has resumed their open-source 101 series that highlights projects you may not have heard about, like CacheLib, an engine optimized to transparently leverage DRAM and SSDs to build high-throughput, low-overhead caching services. To learn more, check them out on Threads or follow the links in the show notes.

And now, here's my interview with Shane.

Pascal: I have been saying in the intro line to this podcast for years now that the show is about the engineers who work on our different technologies. My guest, however, according to his internal profile, is a research scientist. But he works at the exact intersection of research and engineering, which, at least from my vantage point, has only increased in relevance with the emergence of generative AI over the past few years.

I couldn't be more excited to dive into the interplay between these areas with my guest. So without further ado, Shane, welcome to the Meta Tech Podcast.

Shane: Thank you very much for the introduction. Excited to be here.

Pascal: Can you tell us a bit about yourself before we go into the topic? So how long have you been at Meta and what did you do before?

Shane: Yeah, sure thing. I've been at Meta for about seven years now, which by Meta standards, anyway, is fairly long, I think. I did my Ph.D. focusing on multimodal learning and transfer learning, and at Meta I was fortunate enough to continue that research, which was one of the reasons that kept my passion for the team.

Pascal: Fantastic. So can you tell us a bit about your team that you work on? What's the name? What's the mission? What do you do there?

Shane: Yeah, absolutely. So I'm part of the Wearables AI org, specifically focusing on computer vision, perception, and multimodal AI. We build foundational models for wearables, so think smart glasses, Ray-Ban Meta, and so on. And because of the unique form factor of wearables, there are many interesting challenges for the AI models that we develop. We're basically in charge of addressing them.

Pascal: Fabulous. So now to the title I mentioned in the intro: you're not an engineer, at least not based on the slightly reductive titles that we're passing around. But most of this podcast's audience works as software engineers in this space.

So can you tell us a bit about the role? What does it look like? What does the day-to-day look like as a research scientist?

Shane: Yeah, of course

Pascal: And also maybe as an additional little wrinkle on this, can you tell us about how this has evolved over time?

Shane: Oh, yeah, for sure. My main role, in short, as an AI lead research scientist is advancing the AI technologies for pretty much all wearable use cases. So I work a lot with PMs on building the AI components for the product vision that they have. But oftentimes I think of my role as exploring beyond what's currently feasible and possible, and in return inspiring and influencing the product vision based on my research as well.

And because of that, I was often encouraged to do very exploratory research, at least a few years out, before there's any product feasibility. But I do sense that these days the pattern has definitely changed, and that's mainly because the underlying AI models, as you mentioned, the generative models, are so powerful that a lot of the research we do now has direct applicability in products.

So the timeline between research and product has become much shorter. We're putting research into products as we find new and exciting research results. So, yeah, I work very closely with all the engineers in the org to put the research findings into products that can help users, while ensuring all the safety nets are in place and so on.

Pascal: Yeah, to me, what you said there about influencing the product roadmaps sounds really exciting. And correct me if I'm wrong here, but that sounds like a fairly new aspect to this. At least my perception of somebody working in science is that this is somebody sitting in a little lab, sciencing away. And at some point, somebody from a product organization may look at the work you've done, find something they find interesting, and try to turn it into a product.

But you are now in a position where you're actually directly affecting what is being built. So has this been something that you've noticed as well, how this has changed over time?

Shane: Yeah, absolutely. I mean, that's part of the reason that I've stayed on this team for such a long time: I have the avenue to influence the product vision and so on. As a solo researcher, it's often harder to find that direct product outlet. And for us, it's the wearable devices, right?

That has direct application of the research. So yeah, it's been really, really fun.

Pascal: Cool. Then let's talk a bit about multimodal AI. Can you tell us in your words what it is?

Shane: Yeah, absolutely. I probably should have defined it early on.

Pascal: I think this is a good point for it.

Shane: Sounds good. Yeah. Multimodal AI, as the name suggests, means we build AI models that can understand and process multiple modalities. And modalities here you can think of basically as different forms of input, different sensory inputs. So think images, text, audio, motion sensor signals, all combined.

So an analogy that I sometimes make is: when we as humans communicate, it's much harder if you communicate in only a single modality. For instance, if I mute my audio right now, it's probably pretty hard to understand what I'm saying. And if I mute my video, it's probably also hard to see all the visual cues that I'm making, or the objects that I'm referring to. For instance, I could say, oh, look at this book. But if you're not looking, if you don't have any access to images or videos, you'd have no idea which book I'm referring to from language alone.

Pascal: Which may be a bit harsh given that this is mostly an audio podcast, but you're still making a good point there.

Shane: Yeah. So the point is, um, our communication in general is very often multimodal in nature. And if an AI model can also process multiple modalities, it can understand a much richer context to allow for more natural interactions. And yeah, that's what we study with multimodal AI. Basically the ways to jointly understand various heterogeneous contexts and modalities.

Pascal: That's a great overview. So now maybe we can do what we've discussed before and actually apply this to the product.

Can you tell us how the multimodal AI is applied in our products?

Shane: Yeah, absolutely. So our hero use case is with smart glasses: we have the Ray-Ban Meta glasses that are equipped with powerful cameras and some on-device processors. We launched a multimodal AI assistant about a year ago that can essentially see the same thing as the user does, because the camera faces the same egocentric view as the user.

So users can assume that the multimodal context is shared between the AI assistant and the user, and ask questions naturally. For instance, you could ask, "Hey Meta, what am I looking at?" and it'll describe the scene based on the camera image that it just took. Or more complicated questions, like, "Hey Meta, tell me more about the artistic style of that painting, or the story behind that artist." Or if you're traveling, you could ask, "Hey Meta, look at this menu in Italian and tell me which one is vegetarian," right?

So the multimodal AI model that we built has to understand the textual query itself that the user has asked, but also understand the scene that the user is looking at, do joint reasoning between the modalities, and then answer the question.

Yeah, and the smart glasses form factor, I think, is really the best venue to host a multimodal AI, in my opinion, because it just sits in your glasses seamlessly. You don't have to hold a phone or anything; you can simply ask a question about what you see. So yeah, when we launched this product, I was really proud of all the amazing engineering and research work that we put together.

Pascal: I think that seems to be one of the realizations from the people working on the product: how much easier it is when the thing you interact with is just always there.

Shane: Exactly

Pascal: As you say, you don't need to pull a phone out of your pocket, unlock it, and then ask a question. But it's right there. It's on your face. It looks at what you're looking at, in that very moment.

A question about the aspect of the different modalities you're describing there. So, so far, you've mostly talked about speech plus video, or kind of visual component to it. I'm curious if you're also taking sounds into consideration? Because when I activate one of the various voice assistants around me, what happens behind the scenes, I think, it's that basically transcribes what I'm saying, sends it to a server, evaluates it and comes back with an answer.

But does that change? Do you actually look at a broader spectrum of audio inputs than just the speech pattern of it?

Shane: Yes, certainly. That's the scope of the research we're doing right now. We are looking into ways to process the audio signals, the acoustic signals, not just the human speech signals, into conditioning context for the assistant. So, for instance, if there's a siren going on, that's an interesting context for the system to be aware of, and to help the user if necessary, and so on.

So, yeah, the scope of our research touches on understanding the acoustic signals, as well as even the IMU motion sensor signals, so that it can better understand what activities the users are currently engaged in and so on, which is all helpful context to assist the user with.
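
To make this concrete, here's a minimal sketch of what feeding acoustic events into an assistant's conditioning context could look like. The classifier, label set, and threshold here are hypothetical stand-ins, not Meta's production pipeline:

```python
from dataclasses import dataclass

@dataclass
class AcousticEvent:
    label: str         # e.g. "siren", "dog_bark", "appliance_beep"
    confidence: float

def classify_audio(waveform: list) -> list:
    """Stand-in for a pre-trained acoustic event classifier."""
    return [AcousticEvent("siren", 0.92)]

def build_assistant_context(query: str, waveform: list,
                            threshold: float = 0.5) -> str:
    # Surface only confident non-speech events to the reasoning model,
    # alongside the user's spoken query.
    events = [e.label for e in classify_audio(waveform)
              if e.confidence >= threshold]
    header = f"[acoustic events: {', '.join(events) if events else 'none'}]"
    return f"{header}\nUser: {query}"

print(build_assistant_context("What's going on outside?", [0.0] * 16000))
```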

Pascal: That to me is really exciting. I'm almost more excited about this than the visual aspect of it. And maybe because I haven't thought about it too much, but selfishly, I would love to have an assistant where I could just play a recording of the weird sounds that my washing machine sometimes makes and have it understand them, because that is entirely impossible right now. I do not have the capability of describing the weird melody that it sometimes sings for me in a way that I can put into Google. So this could be a complete game changer for me.

Shane: 100 percent agreed. Yes.

Pascal: Okay, so I want to talk a bit about the kind of day to day work. You gave us a bit of an overview already of how you work as a research scientist, but who are the kind of people and the different roles of them that you interact with on a day to day basis?

Shane: Yeah, absolutely. Well, because I work in multimodal learning, which is a very multidisciplinary field, I have to work very closely with the computer vision experts as well as the NLP experts. So there are research scientists in other orgs as well, like FAIR and GenAI and some others.

And yeah, we collaborate with them very closely to continue to develop our models. And of course, I also work very closely with our machine learning engineers to productionize the models and keep improving the models that are currently facing users.

Pascal: One of the cool aspects of your work is that a lot of it is actually out there in the open in terms of products that everybody can use, but also in terms of actual open research that you have published in terms of papers. I wanted to ask you a bit about one of the papers that you've recently published.

And first I need to actually know from you how it's pronounced. Is it Any M A L? Is it AnyMAL?

Shane: I'm pronouncing it "animal." It was a little joke on the quirky convention the machine learning community had that all model names have to be based on animal names. And we chose "AnyMAL," Any-Modality Augmented Language Model, because it's a Meta model. And so, therefore, just AnyMAL.

Pascal: Oh my God. There are layers to it. Okay. That, that is very clever. I like it.

So can you tell us a bit about what it is about and what an augmented language model is and how it differs from other models in the space?

Shane: Yeah, absolutely. So if a typical large language model is a text-only reasoning model, we proposed the AnyMAL architecture as an efficient method to extend it to multiple different modalities. We started fusing images, videos, audio, and, as I said, even IMU motion sensor signals into the LLM space, so that we can easily reason over various different modalities. And a lot of the research focus was on improving the efficiency of coming up with this reasoning model that can process multiple modalities and so on.
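
As a rough illustration of what "fusing modalities into the LLM space" means, here's a minimal PyTorch sketch of the general pattern: a frozen pre-trained encoder produces features, and a small trainable projection maps them into the LLM's token-embedding space so they can be consumed alongside text tokens. The dimensions, token count, and module names are illustrative assumptions, not AnyMAL's published configuration:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps frozen-encoder features into the LLM token-embedding space."""
    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        # One feature vector becomes `num_tokens` pseudo-tokens in LLM space.
        self.proj = nn.Linear(encoder_dim, num_tokens * llm_dim)
        self.num_tokens, self.llm_dim = num_tokens, llm_dim

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, encoder_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(features).view(-1, self.num_tokens, self.llm_dim)

# Usage: prepend projected modality "tokens" to the embedded text query
# and feed the combined sequence through the (frozen) language model.
image_features = torch.randn(1, 1024)        # from a frozen image encoder
projector = ModalityProjector(encoder_dim=1024, llm_dim=4096)
modality_tokens = projector(image_features)  # (1, 32, 4096)
text_embeddings = torch.randn(1, 12, 4096)   # embedded text tokens
llm_input = torch.cat([modality_tokens, text_embeddings], dim=1)
```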

Pascal: One of the aspects you describe in the paper is how you use an encoder zoo to look into the different modalities. So these are modality-specific encoders that are pre-trained on other models or inputs. Can you talk a bit about how these different encoders work in unison and what the result of that work is?

Shane: Yeah, absolutely. The design philosophy was that we build upon existing open source modules as much as possible for our research. We call them an encoder zoo, and we can simply pick and choose the best-performing encoder for each modality at any given time. And these modality-specific encoders are what I call the perception modules, whose job is to process and project the raw input signals into a well-represented feature space, so that it's not just ones and zeros, which it would be if you worked with just the raw input.

And then there's the reasoning module, which does the secondary reasoning over the text to be able to do more complex tasks like question answering and so on. So the role of all the encoders from the encoder zoo is basically to do the perception task: to understand and perceive the different modality signals. And yeah, we were able to use a lot of the open source encoders for each of the different modalities.
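
Conceptually, the encoder zoo is a pick-and-choose registry: one perception encoder per modality, swappable whenever a better open source model comes along. A toy sketch, with placeholder encoders standing in for the real ones:

```python
from typing import Callable, Dict
import torch

def image_encoder(x: torch.Tensor) -> torch.Tensor:
    return torch.randn(x.shape[0], 1024)   # stand-in for, say, a ViT

def audio_encoder(x: torch.Tensor) -> torch.Tensor:
    return torch.randn(x.shape[0], 768)    # stand-in for an audio model

# The "zoo": swap entries as better open source encoders appear.
ENCODER_ZOO: Dict[str, Callable[[torch.Tensor], torch.Tensor]] = {
    "image": image_encoder,
    "audio": audio_encoder,
}

def perceive(modality: str, raw: torch.Tensor) -> torch.Tensor:
    """Perception step: raw signal -> well-represented feature space."""
    return ENCODER_ZOO[modality](raw)

features = perceive("image", torch.zeros(1, 3, 224, 224))
```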

Pascal: I mean, it's fantastic that this not just contributes to open source, but actually builds on top of other building blocks that already exist out in the space. And you can really see how the combined effort of the open source community, or open research community, whatever term you prefer here, is so synergetic.

Shane: Yeah, exactly.

Pascal: So about the encoders that you've chosen there, some of them are even overlapping or are kind of operating in the same domain. How do you choose then which one effectively wins for a given input or is it more like a collaboration between them?

Shane: Yeah, um, so oftentimes you know, we use the benchmark tasks to determine which encoders out of the available encoders are the best. And for different tasks, we sometimes interleave them. If interleaved different modalities and we have some interesting findings that if you interleave different modalities that actually helps with a specific task. So, which proves my point earlier that you really need to understand multiple different contexts at the same time to be able to assist the user in a more, well, in a richer context as well. So, yeah, oftentimes it comes down to doing ablations over different experimental settings and so on.

Pascal: You also mentioned that you had a really strong zero shot performance with this particular model. Was that something that surprised you? Was that expected or were there any other potentially unexpected results?

Shane: Yeah, certainly surprising to me, for sure. To train this AnyMAL model, we first designed a pre-training task of giving a caption description of the corresponding modality signal. For instance, we'd have data sets of image-and-caption pairs, audio-and-audio-description pairs, and so on. So we simply pre-trained the model to describe the modality signals in words, in textual format. And what was surprising was that once the model learned to perceive these signals, it had zero-shot ability to reason over them almost immediately; it was able to do question answering out of the box, which shows me just how powerful these LLMs are and how efficient this AnyMAL method is.

So to me, that was a really interesting bit.
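
A hedged sketch of that captioning pre-training objective: condition the language model on the projected modality tokens and train with standard next-token cross-entropy on the paired caption. All shapes and the toy "LLM" here are illustrative stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def caption_loss(llm, embed, modality_tokens, caption_ids):
    # modality_tokens: (B, M, D) from the trainable projector
    # caption_ids:     (B, T) tokenized caption, e.g. "a red bicycle ..."
    caption_embeds = embed(caption_ids)                  # (B, T, D)
    inputs = torch.cat([modality_tokens, caption_embeds], dim=1)
    logits = llm(inputs)                                 # (B, M+T, vocab)
    # Each caption position predicts the *next* caption token; the
    # modality positions only condition the prediction.
    m = modality_tokens.shape[1]
    pred = logits[:, m:-1, :]                            # (B, T-1, vocab)
    target = caption_ids[:, 1:]                          # (B, T-1)
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]),
                           target.reshape(-1))

# Toy stand-ins: a linear "LLM" head and an embedding table.
vocab, dim = 100, 64
llm = nn.Linear(dim, vocab)      # placeholder for a frozen decoder LLM
embed = nn.Embedding(vocab, dim)
loss = caption_loss(llm, embed,
                    modality_tokens=torch.randn(2, 8, dim),
                    caption_ids=torch.randint(0, vocab, (2, 16)))
print(loss.item())
```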

Pascal: This really just puts my ignorance on display here, but how does it actually work to train this model? Because you say you showed it an image and it immediately responded to it. Does that mean you have some incremental training where you can actually check: I provide this input now, and I can look at the output? Or do you always have to run basically the entire training stage again, wait for, I don't know, making it up, a week, and then you can actually look at your results? What's the actual day-to-day experience of working on a model? What's that like?

Shane: Yeah, absolutely. To keep the reasoning capabilities of the model, we keep the LLM portion mostly frozen so that it maintains that reasoning ability. What we do train are these perception modules, basically projecting the different encoder signals into the space that the language model can understand.

So we do the training over time and monitor the progress based on some evaluation and validation tests that we set for every epoch, and we measure the progress over time. And it's really interesting to see how the model learns to perceive the specific modality better and better.

At the beginning, it's writing out gibberish, and then eventually it learns to describe the modality in really high detail. So it goes through that pre-training cycle first.
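
Put together, the recipe Shane describes looks roughly like this: freeze the LLM, train only the projection, and check progress every epoch. A runnable toy version with stand-in modules and data:

```python
import torch
import torch.nn as nn

llm = nn.Linear(512, 1000)        # stand-in for the frozen reasoning LLM
projector = nn.Linear(128, 512)   # the trainable perception/projection part

for p in llm.parameters():
    p.requires_grad = False       # keep the reasoning module frozen

opt = torch.optim.AdamW(projector.parameters(), lr=1e-4)

for epoch in range(3):
    for _ in range(10):                       # stand-in training loader
        features = torch.randn(8, 128)        # frozen-encoder outputs
        targets = torch.randint(0, 1000, (8,))
        loss = nn.functional.cross_entropy(llm(projector(features)), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Per-epoch checkpoint: watch descriptions go from gibberish to detail.
    print(f"epoch {epoch}: train loss {loss.item():.3f}")
```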

Pascal: That sounds incredibly fun. Once you hit these milestones, you can actually see your work yielding results in almost real time there.

But does that mean you're basically working in a very modular fashion? That's what it sounds like to me: that you can actually focus on one of the models that contribute to the overall system, fine-tune it, and then check in a more orchestrated fashion how they're all working together.

Shane: That's right. That's exactly right. So we focus on each modality first and then later think about ways to combine all of them, as more of a curriculum learning, as we call it, which is a way to train each portion of the model and then combine them jointly later.
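
A sketch of what such a curriculum could look like as a staged schedule; the stage names and module lists are hypothetical:

```python
# Curriculum sketch: per-modality stages first, then a joint stage.
curriculum = [
    {"stage": "image", "train": ["image_projector"]},
    {"stage": "audio", "train": ["audio_projector"]},
    {"stage": "imu",   "train": ["imu_projector"]},
    {"stage": "joint", "train": ["image_projector", "audio_projector",
                                 "imu_projector"]},
]

for stage in curriculum:
    # Each stage would reuse a loop like the one sketched earlier.
    print(f"training stage '{stage['stage']}' -> modules {stage['train']}")
```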

Pascal: This is fantastic, because I think I'm getting a much better understanding of what it's like on the ground working on these models now than I've had before. As I think our listeners can tell, we always try to keep our chats prior to the interview very brief, so that I don't need to fake my ignorance here.

It's as real as it gets. I really do not know what you're going to tell me about it, but that just means that my excitement about this is very real. And I guess the other part of the paper that I wanted to ask you about is that you came up with different sizes, as is very common for LLMs. You have 7 billion, 13 billion, and 70 billion parameters, and evaluated the performance of each.

It turned out that the largest one was also the best performing model. Was that something you expected to be so pronounced? Or was that something that surprised you?

Shane: Yes, that was a very interesting result. We definitely found that the model based on the 70-billion-parameter large language model performs the best on our benchmark tests, which focus on the visual reasoning capabilities of the models, or any other modality reasoning capabilities. As I mentioned earlier, the way that I think of it is that there is the perception module and the reasoning module, and the role of the underlying large language model, 13 billion or 70 billion, is really on the reasoning side, right?

Whereas the perception module is really staying the same across the multiple different variations that we tried. And so even though they were able to perceive the world in a mostly similar way, as we trained them with a similar recipe essentially, it was the 70 billion model that had the best zero-shot reasoning capabilities, that really extended that learning into unknown areas or unseen tasks, basically.

So, to some degree, it was very interesting to see that it was indeed the 70 billion model that performed the best on the reasoning tasks. But what was also interesting to me was that if you just give the description task, as in, ask the model to simply describe in a sentence what it's seeing or hearing, the performance was very similar across all model variants. Which goes back to our point that it's really the reasoning model that does the reasoning over multiple different tasks, but the perception module itself is pretty much consistent.

Pascal: Okay, I would like to move us on now from the more theoretical realm to the practical application. So when we're talking about glasses and image recognition of what is in front of you, can you talk us through the different steps of how this actually works in practice? So like, where is the image taken? Where does the evaluation take place?

Talk us through the kind of round trip that a request might take.

Shane: Yeah, absolutely. Once we get a user query, again just focusing on the visual modalities, we take images or videos from the user and do some first-step processing on the glasses themselves. For instance, OCR on images: if the task is primarily textual, then the model has to do a really good job of reading the text seen in the images. But we only send thumbnail-sized images to the server to save on bandwidth and so on, whereas the OCR tasks definitely require larger-resolution images to be processed well.

So we have some computer vision modules that run on-device. But then the majority of the information gets sent to the server side, which is where we host the larger language models I was referring to; all the secondary reasoning steps are processed there, and then the results are given back to the user.
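
Here's a hedged sketch of that round trip. The routing heuristic, function names, and responses are hypothetical stand-ins for the on-device modules and the server-side model:

```python
def looks_text_heavy(query: str) -> bool:
    return any(w in query.lower() for w in ("read", "menu", "sign", "text"))

def run_on_device_ocr(frame) -> str:
    return "Margherita ... Funghi ..."      # stand-in for an OCR module

def downscale(frame, max_side: int):
    return frame                             # stand-in for image resizing

def send_to_server(payload: dict) -> str:
    return "The Margherita is vegetarian."   # stand-in for the LLM service

def handle_query(query: str, frame) -> str:
    payload = {"query": query}
    if looks_text_heavy(query):
        # OCR needs the full-resolution frame, so it runs on-device;
        # only the extracted text is shipped.
        payload["ocr_text"] = run_on_device_ocr(frame)
    # Bandwidth saver: the server receives a thumbnail, not the raw frame.
    payload["thumbnail"] = downscale(frame, max_side=512)
    return send_to_server(payload)

print(handle_query("look at this menu and tell me which one is vegetarian",
                   frame=None))
```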

Pascal: Yeah, that's very interesting, because I would imagine you probably don't have the processing capabilities in something that you strap onto your face, and that needs to remain relatively cool, to run the full suite of the different models that you described earlier.

But you mentioned there that this is now like a thumbnail that is sent to the server. Is there also work going on to understand more than just like a static snapshot of the world, but a continuous stream, a video of what's going on around you?

Shane: Yes, absolutely. And that's certainly the direction we're taking now, so that the model doesn't just rely on a single image snapshot, but can really understand a long period of video as the conditioning input for the model.

And there are a lot of challenges that come with processing and understanding video signals. Of course, there are challenges with regard to having to stream the video and constantly run the model to process it and so on. But the modeling capability in general to understand really long context is something that is not entirely solved in the community yet. So there are a lot of research efforts going on there to improve video understanding capabilities in general.

Pascal: What's super interesting about this is that I can't think of many other use cases where the bandwidth capacity actually matters as much as in this particular case. There have been many memes over the years now about the kind of marketing spiel that we all got about 5G when it was announced and its real-world effects.

I think the only example somebody could bring up was that somebody managed to perform surgery on a banana in a different country. And I still doubt that 5G was actually required for this. But as you're talking about this: if I want to upload a video stream in a reasonably high resolution, so you can still perform something like OCR on it, to a server, and get a reply back with some reasonable latency, I guess it actually makes a difference how fast your connection is.

Shane: Yeah, absolutely. The network bandwidth is really important to us. And of course, we do a lot of optimizations to reduce the load itself. But because 5G connections are so widely available, we can even think about applications like this.

You know, I'm only working in the multimodal learning space, but you're absolutely right that the product we're building is really a cumulative work by engineers of every field, I think. Which is really fascinating.

Pascal: Yeah, it's a combination of so many different aspects. Without mobile phones, we wouldn't have the chips that are small enough that we can put them now into glasses and all the other steps that we've already discussed. So you brought up there that this is obviously a challenge. There are tons of optimizations that can be done to make things faster, more efficient.

But in general, it still feels like in the earliest prototypes of something like this, at least as I could imagine them, you'd stream the entire uncompressed or semi-compressed stream to some server somewhere. And I would assume it would take a ton of capacity to just perform this one request. Can you talk a bit about how you think about going from a prototype stage to something that can actually be used by billions of people, which is usually what Meta aims for with the products that we provide to our users?

Shane: Yeah, absolutely. Bringing this model to potentially millions and billions of people and being able to answer complex multimodal queries in sub-seconds, which we do now, has obviously been very challenging. There are a lot of optimizations done on the model hosting side just so that we can cache a lot of the information that has been streamed to the server. And not just in the raw format: we encode it as much as possible so that the model doesn't have to re-encode all the information every time it wants to run a query, for instance. So there are various smart optimizations around KV caching that are required there.
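
The "encode once, reuse" idea can be sketched as a content-addressed feature cache: pay the encoding cost the first time, then hit the cache. A toy illustration; the real KV-cache machinery inside an LLM serving stack is considerably more involved:

```python
import hashlib

_feature_cache: dict = {}

def encode(raw: bytes) -> list:
    return [len(raw)]             # stand-in for an expensive encoder pass

def get_features(raw: bytes) -> list:
    key = hashlib.sha256(raw).hexdigest()
    if key not in _feature_cache:
        _feature_cache[key] = encode(raw)   # pay the encoding cost once
    return _feature_cache[key]

get_features(b"streamed context")   # first call: encodes
get_features(b"streamed context")   # second call: cache hit, no re-encode
```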

And on the model training side as well, we're dealing with a really big model with lots of data. So there were a lot of optimizations done to parallelize the training across different GPUs and GPU nodes, and across different data and so on, all of which were novel challenges, especially a couple of years ago, to make the training pipeline productionizable. So, yeah, it's been a really interesting challenge, and there are still some complex challenges that we have to overcome.
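
For a flavor of what parallelizing training across GPUs looks like, here's a minimal data-parallel sketch using PyTorch's DistributedDataParallel. This is a generic open source approach, not Meta's actual training stack, and the model here is a placeholder:

```python
# Typically launched with: torchrun --nproc_per_node=NUM_GPUS train.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = nn.Linear(1024, 1024).to(rank)     # placeholder model
    ddp_model = DDP(model, device_ids=[rank])  # grads sync via all-reduce
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for _ in range(10):                        # stand-in data loader
        x = torch.randn(8, 1024, device=rank)
        loss = ddp_model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                        # all-reduce happens here
        opt.step()
    dist.destroy_process_group()
```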

And yeah, it's exciting to be able to work on something that has huge impact and such a large potential audience.

Pascal: From your perspective, where's most of that optimization happening right now?

Is it primarily on the production engineering side, figuring out how to build bigger data centers and shoving more GPU nodes in there? Or is it more on the research side, really thinking about whether there's a way to encode this more efficiently so that we don't need to send the entire image up to the server? Obviously just as an example, but how do you see the split at the moment?

Or is it just purely a complete collaboration between the two forces?

Shane: It is absolutely a collaboration between the two forces. We first have to expand the capacity as much as possible and reduce all the intra-node communication bottlenecks and so on. But at the same time, from the modeling perspective, our job is to reduce the need for capacity. So it's really a two-pronged approach, in my opinion, to be able to serve really billions of people.

We really need both, for sure.

Pascal: Got it. So now that we're talking about products, I guess one of the core aspects of how we build anything at Meta is iteration. Get something out as early as possible, especially with kind of trusted testers, our employee base and so on, and then iterate on it.

How do you incorporate feedback for this particular workstream?

Because I would imagine, if you have a brand-new app that you want to get out and you send it out to your dogfooders, they will, like a million monkeys on a typewriter, eventually cover every single aspect of the app and figure out: oh, this button doesn't work, and here is some wrong padding.

But when you're literally talking about, I can look at any object in the world, any movement in the world, any sound in the world, and I should get a reasonable answer, that sounds absolutely insurmountable to me.

So how do you break this down and actually gather good feedback and apply it to your product?

Shane: Yeah, that's a really good question. We obviously try to gather all the feedback signals that we can get. For instance, on the app we have thumbs up and thumbs down buttons, so if users weren't satisfied with the responses that we gave, they can always give a thumbs up or thumbs down.

It's mostly thumbs down rather than thumbs up. And you're right that the feedback is more ad hoc, for lack of a better word: we want to build a model that is universal and can do any task, whereas the feedback that we're getting is on a specific task.

Like, oh, I asked the model to do this crossword puzzle for me, but it didn't do so, right? It's a very specific use case, and it's a fascinating use case, but it's so hard for us to go after building a model just for that specific crossword-puzzle-solving use case, right?

So, yeah, it is a hard balance that we strike. But the general philosophy we have in the way that we approach this modeling is that the model really has to be as general as possible. So we do gather very specific signals like that, but we try to combine them and come up with the best training recipe, as we call it, to make sure that we are good at those specific tasks that users are asking for, but also maintain those general reasoning capabilities across the board.

And user feedback is always super helpful. We've been able to utilize those signals as part of our training as well.

Pascal: So we touched on this before, but I want to talk briefly about open source. We already talked about how this actually enables some of the work you do, because you are effectively standing on the shoulders of giants who have already contributed to the space. But in general, what's the approach like?

How does open source contribute to the work that you're doing and how does that influence how you approach problems?

Shane: Yeah, absolutely. It has always been this self-improving positive feedback loop for us. We build the models, we open source them and get them out to the public, and the public improves on different aspects. And all those research findings have been super fascinating; we're able to incorporate them back into our research and our product as well.

So it's been really fascinating to see that. One example that I want to give is the workshop that we organized on multimodal conversational AI. We have hosted this workshop for a few years, basically targeting that specific research area. And the problem with multimodal conversational AI in general is that there are no data sets to work with where images are intertwined with the dialogues, and so it's such an intricate data set to build.

And so, because this was the field that we're interested in, we were able to build several different data sets. And we hosted our workshop around this, so that participants could use the data sets, build their models, and improve the performance on the benchmark tests that we shared with the public.

And they were able to come up with very interesting novel approaches over the course of three, four years. All those findings were really interesting, so we were able to incorporate them back into the product. Which is how we got to this place where we can now productionize this multimodal conversational AI model for the public.

So, yeah, that's always been a positive reinforcement loop for us as well.

Pascal: That is fascinating to hear. I think we hear this message quite often, that there are these massive benefits to operating in the open when it comes to AI. But this is a really clear example you're giving there of how this actually works, instead of the very high-level overview of, yeah, in theory this should lift all boats; you can see what the interplay actually looks like on the ground. So I find this absolutely fascinating.

Another thing I wanted to briefly ask about is the Be My Eyes program. Can you tell us about that?

Shane: The Be My Eyes program is basically hosted on our glasses. Be My Eyes is a program that existed before, and the idea is that it can help visually impaired people: the user holds up the phone and asks questions, and they get connected to someone who can see the same thing the user is currently looking at and gives audio guidance through the phone.

And the idea with glasses is, obviously, that you don't have to hold the phone anymore, because the glasses can just stream the same thing the user is seeing. So, yeah, it's been really fascinating to see that going.

And my personal goal would be to assist people with those tasks using just the AI capabilities, so that even if the user cannot be paired with someone who sees the same thing and assists, the system can still quite often help the user. So yeah, that's something that I'm looking into.

Pascal: Can you tell me what it actually feels like to contribute to something like this? It feels like so many of these little improvements, not to underplay them, make a meaningful difference in our lives. But this could literally be life-changing if people effectively gain a part of their vision back through a program like this.

So what is it like to actually have this kind of impact?

Shane: Yeah, it's really exciting, for sure, that my research can immediately reach the audience that needs the help. I also love the idea that this technology can potentially help, let's say, visually impaired people, because the AI can perceive the same scene the user is looking at.

There was actually a Wall Street Journal article on exactly this, I think, yesterday, on this application and its implications. And because so many people use the model, I have this immense responsibility as well to conduct the research well and make sure that the model is safe to interact with and so on. So yeah.

Pascal: I will definitely make sure to leave a link to that in the show notes.

Okay, maybe one question to slowly wrap us up now, but we have a ton of AI experts across the company. And this is obviously a huge collaboration. Can you talk a bit about what it's like to work with other AI experts at Meta?

Shane: Yeah, absolutely. We truly have a lot of talented people who are experts in different fields. And again, because multimodal AI specifically is a multidisciplinary field, I was fortunate enough to work with them closely. Basically, whenever you have a question, you don't need to Google it on the web; you can simply set up a meeting with them, discuss the idea, brainstorm different approaches, and so on.

It's been, to be honest, a really intellectually stimulating experience for me to collaborate with a lot of experts within the company. So, yeah, that's some benefit that I get from working with very clever people.

Pascal: That's fantastic. Did you have any unique learning experiences that come to mind?

Shane: Yeah, absolutely. For instance, model optimization was a new field for me a few years ago. So I personally learned a lot from working with production engineers and researchers in that field to really understand the different avenues where we can optimize the model. That has been really interesting, and it was really helpful for me to learn as well.

Pascal: And that also sounds like a fairly unique opportunity, because there are, I don't know, maybe five or six companies that operate at the scale that we do. As a random researcher somewhere in a university, you will probably not have that exact feedback loop. You can't ask: hey, if I want to put this into production and give this to a billion users, what would that actually take?

What would be required of me to make that possible?

Shane: Yeah, you're touching on really interesting points: the challenges that we deal with are very unique to the companies or people who are working at this scale of models, right? And so oftentimes a lot of the open source community hasn't really addressed this specific challenge.

And so the best resource that we have access to is the people who have been working in this field for a while. And we have a lot of them here. So that's been fantastic.

Pascal: All right. Okay. And I think with that, Shane, thank you so much for advancing the state of multimodal AI and joining me here on the Meta Tech Podcast.

Shane: Awesome! Thank you so much

Pascal: And that was my interview with Shane. If you have any questions or feedback, then hit us up on Threads or Instagram, where we are still @MetaTechPod, or slide into our DMs. If you have any requests for future topics, that's also the best place to leave them, just as Eric from Sweden recently did, asking us for an episode about the infrastructure that allows us to scale GraphQL for our mobile apps. I'm on the case. And that's it for another episode of the Meta Tech Podcast. Until next time, toodle-loo.
