SiteOps Global Product Hardware Lead Engineer - GPU

Meta is seeking a forward-thinking, experienced AI/ML (Artificial Intelligence/Machine Learning) Product Hardware Platform Lead Engineer to join the Data Center Site Operations team. The Product Hardware Platform Engineering (PHE) team is responsible for the overall performance of Meta’s production compute, storage, and AI/ML platforms throughout their life cycles in our data centers. This role will lead the subset of the PHE team that focuses on AI/ML platform hardware. AI/ML is an important priority for Meta that involves complex GPU-based systems operating in shared computing clusters. The role focuses on maintaining and improving the health of the AI/ML platforms from verification testing through mass production to end of life. Key responsibilities include identifying systemic hardware, firmware, and tooling issues; engaging in hands-on problem solving; and collaborating effectively with cross-functional engineering and tooling teams to improve performance of the fleet. Our data centers, and the tens of thousands of servers installed in them, are the foundation upon which our rapidly scaling infrastructure efficiently operates and upon which our innovative services are delivered. Meta is at the leading edge of the global data center industry in both how data centers are designed and how they are operated. This person should enjoy working in a fast-paced environment where adaptability and flexibility will be key to their success. We seek an individual who can quickly absorb and understand the technical challenges facing subject matter experts and local site operations teams, create alignment between these globally distributed teams and partner organizations, and set informed priorities and direction while getting buy-in and commitment from relevant stakeholders.
SiteOps Global Product Hardware Lead Engineer - GPU Responsibilities
  • Lead other AI/ML PHE team members through efforts that provide end-to-end lifecycle ownership (verification test through end-of-life decommissioning) of AI/ML hardware platforms and associated new technologies in the data centers
  • Serve as the central point of contact representing the AI/ML hardware platforms and associated new technologies across SiteOps, and be the subject matter expert on hardware platform issues for data center operations teams
  • Drive complex, global AI/ML technical investigations spanning multiple disciplines, such as Hardware, Software/Firmware, Networking, and Power & Cooling
  • Work closely with other PHE team members to share best practices and ensure appropriate feedback is given to cross-functional teams
  • Issue timely alerts and support fixes to operations teams, and ensure a robust feedback pipeline to engineering teams
  • Provide serviceability feedback on AI/ML production hardware to engineering design teams
  • Provide technical mentorship on large scale data center projects and initiatives to global, cross-functional teams
  • Build strong relationships and collaboration with engineering and cross-functional teams across the company. Actively solicit feedback from teams, and use that feedback to improve operational effectiveness as infrastructure scales
  • Own the cross-functional communication with other technical operations groups to help resolve incidents
  • Collaborate with stakeholders, functional owners and subject matter experts to interpret and articulate business and operations needs
  • Ability to travel up to 30% required
Minimum Qualifications
  • Experience managing multiple concurrent projects and competing timelines
  • 10+ years of experience in hardware development and/or validation, working with cross-functional teams to deliver products to production
  • BS or BA in technical field or commensurate experience
  • Effective technical writing skills, with experience creating documentation for users of all levels
  • Experience in processing and analyzing large sets of data
  • Experience triaging and debugging hardware platforms
  • Knowledge of server and storage platforms, principles, technologies, protocols, and standards
  • Experience working with Linux or Unix operating systems
  • Experience working independently within a multi-disciplinary team of hardware and operations engineers
  • Experience working across a diverse global organization and building partnerships with cross-functional teams inside and outside of the organization
Preferred Qualifications
  • Experience with GPU-based platform hardware that operates in AI/ML computing clusters
  • Large-scale data center environment experience, including hardware deployments and deep knowledge of Linux systems, server hardware, networking, network protocols, supply chain, and data center automation
  • Leadership presence and presentation skills
  • Experience in data center system and process automation
  • Bash, PHP, Python, or Perl scripting experience
About Meta
Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today—beyond the constraints of screens, the limits of distance, and even the rules of physics.
Meta is committed to providing reasonable support (called accommodations) in our recruiting processes for candidates with disabilities, long term conditions, mental health conditions or sincerely held religious beliefs, or who are neurodivergent or require pregnancy-related support. If you need support, please reach out to accommodations-ext@fb.com.
(Colorado only*) Estimated salary of $193,000/year + bonus + equity + benefits
*Note: Disclosure as required by SB19-085 (8-5-20)
Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, political views or activity, or other applicable legally protected characteristics. You may view our Equal Employment Opportunity notice here. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. We may use your information to maintain the safety and security of Meta, its employees, and others as required or permitted by law. You may view Meta's Pay Transparency Policy, Equal Employment Opportunity is the Law notice, and Notice to Applicants for Employment and Employees by clicking on their corresponding links. Additionally, Meta participates in the E-Verify program in certain locations, as required by law.

Meta is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, you may contact us at accommodations-ext@fb.com.