> job detail
🤖ML Engineer
AI/ML Ops Lead, Central Technology
Chan Zuckerberg Initiative · Redwood City, CA (Hybrid)
// classified as
ML Engineer (Productionizing models, serving, MLOps.)
posted
196d ago
location
Redwood City, CA (Hybrid)
languages
c, go, java
tools
aws, azure, kafka
> stack
c, go, java, php, python, aws, azure, kafka, kubernetes, airflow, dagster, keras
> education
ms, phd
> description
<div class="content-intro"><p>The Chan Zuckerberg Initiative was founded in 2015 by Priscilla Chan and Mark Zuckerberg to help solve some of society’s toughest challenges — from curing or preventing disease to improving education and addressing the needs of our local communities. We provide the operational support across our areas of work.</p></div><h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;">The Team</span></h2>
<p>Across our work in Science, Education, and within our communities, we pair technology with grantmaking, impact investing, and collaboration to help accelerate the pace of progress toward our mission. Our Operations organization provides the support needed to push this work forward. </p>
<p>Operations consists of our Brand & Communications, Central Tech, Finance, People, Real Estate/Workplace/Events/Facilities/Security (REWFS), Strategy & Operations, and Ventures teams. These teams provide the essential operations, services, and strategies needed to support CZI’s progress toward achieving its mission to build a better future for everyone.</p>
<h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;">The Opportunity</span></h2>
<p><span style="font-family: helvetica, arial, sans-serif;">Our Central Tech team provides technology and security support for CZI, the Biohub Network, and our grantees. We believe that Engineering and Security are most effective when in sync and learning from each other on a daily basis. Our AI Infrastructure Engineering team enables our AI Research teams to achieve their goals faster and more securely. We leverage technology to automate manual processes, constantly innovate to optimize operations, provide first-class support, and build solutions to enable the scale and execution of our business partners' strategies and initiatives.</span></p>
<p><span style="font-family: helvetica, arial, sans-serif;">The AI/ML and Data Engineering Infrastructure organization builds shared tools and platforms used across the Chan Zuckerberg Initiative, partnering with and supporting a wide range of Research Scientists, Data Scientists, and AI Research Scientists, as well as Engineers focusing on Education and Science domain problems. Members of the shared infrastructure engineering team have an impact on all of CZI's initiatives by enabling the technology solutions used by other engineering teams at CZI to scale. A person in this role will build these technology solutions and help cultivate a culture of shared best practices and knowledge around core engineering.</span></p>
<h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;">What You'll Do</span></h2>
<ul>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Provide technical MLOps leadership: manage and lead a team of MLOps Engineers in operating our heterogeneous AI training and inference systems, and collaborate on the design and build of our AI platform components.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Drive the application of MLOps and DevOps principles: work across our multiple platforms to ensure peak operational efficiency in our AI operations and the process automation necessary for a world-class, large-scale AI model training environment.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Provide Instrumentation and Observation technical leadership: define the MLOps team's end-to-end metrics program, including full proactive monitoring and alerting systems.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Facilitate model training through collaboration with our AI Researchers: work with the rest of the AI Infrastructure Eng team to make sure that the models we train and release to inference follow machine learning and deep learning best practices and, through code automation libraries, are fully resilient to restarts and checkpoint recoveries.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Continuously optimize our Kubernetes-based AI Lifecycle platform: apply our IaC-based practices, integrate our MLOps AI Lifecycle platform tooling, and combine it with our On-Prem HPC systems into a cohesive heterogeneous platform.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Collaborate on Data systems for our AI model training: work with our Data Infrastructure Eng team and the Science data teams on the end-to-end data usage that drives our AI model training.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Lead our MLOps team in supporting our on-call rotation: combine automation and proactive alerting to reduce on-call load and improve self-healing AI system operations. On-call volume will be low, but we do have 24/7 coverage, and the rotation will include members of the rest of the AI team for escalation and on-call coverage.</span></li>
</ul>
<h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;">What You'll Bring</span></h2>
<ul>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">7+ years of relevant coding and systems experience</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">5+ years of systems Architecture and Design experience, with a broad range of MLOps experience across Data Infrastructure and AI/ML platforms</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Proven technical leadership in SRE and MLOps, as well as either direct or indirect people management experience</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Proven SRE and MLOps knowledge and related experience</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Strong experience scaling containerized applications on Kubernetes or Mesos, including expertise with creating custom containers using secure AMIs and continuous deployment systems that integrate with Kubernetes or Mesos. (Kubernetes preferred)</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Cloud Platform proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with On-Prem and Colocation Service hosting environments</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">MLOps experience working with medium to large scale GPU clusters in Kubernetes (Kubeflow), HPC environments, or large scale Cloud based ML deployments</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Working knowledge of Nvidia CUDA and AI/ML custom libraries.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Knowledge of Linux systems optimization and administration </span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Solid Coding experience</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Proven coding ability with a systems language such as Rust, C/C++, C#, Go, Java, or Scala</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Expertise with a scripting language such as Python (preferred), PHP, or Ruby</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Experience in integrating Data with the AI Lifecycle</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">AI/ML Platform Operations experience in an environment with demanding data and systems platform challenges</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Large scale Streaming data systems integration experience</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Hadoop, Spark, and/or Kafka deployments (or their counterparts such as Pulsar, Flink, and/or Ray)</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Workflow scheduling tools such as Apache Airflow, Dagster, or Apache Beam </span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">PyTorch, Keras, or TensorFlow experience is a strong nice-to-have</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">HPC and Slurm experience is a strong nice-to-have</span></li>
</ul>
<h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;">Compensation</span></h2>
<p><span style="font-family: helvetica, arial, sans-serif;">The Redwood City, CA base pay range for this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience, as evaluated throughout the interview process. </span></p>
<h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;">Work Mode</span></h2>
<p><span style="font-family: helvetica, arial, sans-serif;">As we grow, we’re excited to strengthen in-person connections and cultivate a collaborative, team-oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team’s manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.</span></p>
<h2><span style="color: rgb(40, 40, 39); font-family: helvetica, arial, sans-serif;"><strong>Benefits for the Whole You</strong> </span></h2>
<p><span style="font-family: helvetica, arial, sans-serif;">We’re thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible. </span></p>
<ul>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">CZI provides a generous employer match on employee 401(k) contributions to support planning for the future.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">An annual benefit that employees can use in whatever way is most meaningful for them and their families, such as housing, student loan repayment, childcare, commuter costs, or other life needs.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">CZI Life of Service Gifts are awarded to employees to “live the mission” and support the causes closest to them.</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Paid time off to volunteer at an organization of your choice. </span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Funding for select family-forming benefits. </span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;">Relocation support for employees who need assistance moving to the Bay Area</span></li>
<li style="font-family: helvetica, arial, sans-serif;"><span style="font-family: helvetica, arial, sans-serif;"><a href="https://chanzuckerberg.com/careers/#benefits"><span style="text-decoration: underline; color: rgb(222, 31, 38);">And more</span></a>!</span></li>
</ul>
<p><span style="font-family: helvetica, arial, sans-serif;">If you’re interested in a role but your previous experience doesn’t perfectly align with each qualification in the job description, we still encourage you to apply as you may be the perfect fit for this or another role.</span></p>
<p><span style="font-family: helvetica, arial, sans-serif;">Explore our <a href="https://chanzuckerberg.com/careers/working-reimagined/"><span style="text-decoration: underline; color: rgb(222, 31, 38);">work modes</span></a>, <a href="https://chanzuckerberg.com/careers/#benefits"><span style="text-decoration: underline; color: rgb(222, 31, 38);">benefits</span></a>, and <span style="text-decoration: underline; color: rgb(222, 31, 38);"><a style="color: rgb(222, 31, 38); text-decoration: underline;" href="https://chanzuckerberg.com/careers/candidate-journey/">interview process</a></span> at <a href="https://www.chanzuckerberg.com/careers"><span style="color: rgb(222, 31, 38);">www.chanzuckerberg.com/careers</span></a>.</span></p>
<p><span style="font-weight: 400; color: #7e8c8d;">#LI-Hybrid #CZI</span></p><div class="content-conclusion"><table style="width: 100%; height: 60px; margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td style="width: 10%;"><span style="color: #000000;"> </span></td>
<td style="width: 10%;"><span style="color: #000000;"> </span></td>
<td class="" style="width: 10%;" width="30"><span style="color: #000000;"><a style="color: #000000;" href="https://www.facebook.com/chanzuckerberginitiative/" target="_blank"> <img style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" src="https://chanzuckerberg.com/wp-content/themes/czi/img/facebook-red.svg" alt="Facebook" border="0"></a></span></td>
<td class="" style="width: 10%;" width="30"><span style="color: #000000;"><a style="color: #000000;" href="https://www.instagram.com/chanzuckerberginitiative/" target="_blank"><img style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" src="https://chanzuckerberg.com/wp-content/themes/czi/img/instagram-red.svg" alt="Instagram" border="0"></a></span></td>
<td class="" style="width: 10%;" width="30"><span style="color: #000000;"><a style="color: #000000;" href="https://www.threads.net/@chanzuckerberginitiative/" target="_blank"><img style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" src="https://chanzuckerberg.com/wp-content/themes/czi/img/threads-red.svg" alt="Threads" border="0"></a></span></td>
<td class="" style="width: 10%;" width="30"><span style="color: #000000;"><a style="color: #000000;" href="https://www.linkedin.com/company/chan-zuckerberg-initiative/" target="_blank"><img style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" src="https://chanzuckerberg.com/wp-content/themes/czi/img/linkedin-red.svg" alt="Linkedin" border="0"></a></span></td>
<td class="" style="width: 10%;" width="30"><span style="color: #000000;"><a style="color: #000000;" href="https://twitter.com/chanzuckerberg" target="_blank"><img style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" src="https://chanzuckerberg.com/wp-content/themes/czi/img/x-red.svg" alt="X" border="0"></a></span></td>
<td class="" style="width: 10%;" width="30"><span style="color: #000000;"><a style="color: #000000;" href="https://www.youtube.com/channel/UCZioJ6fb9SuRdLIO7DlE09w" target="_blank"> <img style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" src="https://chanzuckerberg.com/wp-content/themes/czi/img/youtube-red.svg" alt="YouTube" width="30" border="0"></a></span></td>
<td style="width: 10%;"><span style="color: #000000;"> </span></td>
<td style="width: 10%;"><span style="color: #000000;"> </span></td>
</tr>
</tbody>
</table>
<h5 style="text-align: center;"> </h5></div>