
Machine Learning Operations Manager


Our mission is to solve the most important and fundamental challenges in AI and Robotics to enable future generations of intelligent machines that will help us all live better lives.

Who we are looking for:

We are seeking a Machine Learning Operations (MLOps) Manager who is both technically adept and an effective leader. In this role, you will lead a small team of engineers while also being hands-on in designing, building, and maintaining infrastructure that supports the entire lifecycle of Machine Learning (ML) projects. If you have a passion for building scalable ML infrastructure, mentoring engineers, and collaborating with world-class researchers, this is the role for you!

What You Will Do

Technical Leadership & Strategy: Drive the design, development, and maintenance of company-wide MLOps platforms and tools, leveraging Kubernetes infrastructure for ML and data processing applications.
Team Management & Mentorship: Manage and mentor a small team of engineers, providing technical guidance, setting priorities, and fostering a collaborative team culture.
Scalability & Performance: Enable self-service access to ML compute resources across on-prem and cloud environments, ensuring workload scalability, fault tolerance, and efficient job scheduling.
Monitoring & Observability: Enhance system observability through integrations with tools and services such as FluentD, Prometheus, Grafana, and DataDog to improve reliability and debugging.
Experiment & Model Lifecycle Management: Integrate ML applications with experiment tracking and model management services such as Weights and Biases.
Best Practices & Collaboration: Champion engineering best practices, drive improvements in CI/CD, infrastructure automation, and reproducibility. Work closely with ML Engineers, Data Engineers, DevOps teams, and researchers to accelerate research and deployment.

What You Will Bring

BS or MS in Computer Science, Engineering, or equivalent
5+ years of experience in an MLOps, DevOps, ML Engineering, or software engineering role
2+ years of experience managing or mentoring engineers (can be formal management or technical leadership)
Strong, hands-on experience with Kubernetes for ML applications
Experience developing MLOps platforms (covering data/artifact management, reproducibility, fault tolerance, experiment tracking, and model serving)
Proficiency in Python, Docker, and environment management tools (pip, poetry, uv, or similar)
Familiarity with CI/CD tools (GitHub Actions, ArgoCD) and Infrastructure as Code (Terraform)

Skills We Value

Experience with job scheduling mechanisms like Kueue
Hands-on experience with workflow orchestration tools (Airflow, Metaflow, Argo Workflows)
Experience managing cloud infrastructure (GCP, AWS) and hybrid-cloud environments
Knowledge of scalable AI/ML platforms like Ray or PyTorch Lightning
Experience with logging & monitoring tools (FluentD, Prometheus, Grafana, DataDog, or similar)
Exposure to ML model serving frameworks (TorchServe, ONNX Runtime, or similar)
Previous experience collaborating with research teams in academic or industrial settings

We provide equal employment opportunities to all employees and applicants for employment and prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.

Average salary estimate

$140,000 per year (est.)

Min: $120,000
Max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Machine Learning Operations Manager, Robotics and AI Institute

Join us as a Machine Learning Operations Manager in the vibrant tech hub of Cambridge, MA! We're on a mission to tackle some of the most pressing challenges in AI and robotics, paving the way for future intelligent machines that can greatly enrich our lives. In this pivotal role, you'll not only lead a small, talented team of engineers but also play a hands-on part in designing and maintaining robust infrastructure that underpins our Machine Learning projects. If you're passionate about creating scalable ML infrastructure, mentoring your fellow engineers, and collaborating with some of the brightest minds in research, this opportunity is tailor-made for you! Your responsibilities will include driving the design and maintenance of MLOps platforms leveraging Kubernetes, enhancing system observability with tools like Prometheus and Grafana, and advocating for engineering best practices across teams. You’ll bring your technical expertise in Python, Docker, and CI/CD tools while ensuring that our ML applications are efficient, reliable, and scalable. With your experience and leadership, you'll empower the next generation of ML projects and make a real difference in our organization.

Frequently Asked Questions (FAQs) for Machine Learning Operations Manager Role at Robotics and AI Institute

What are the key responsibilities of a Machine Learning Operations Manager at this company?

As a Machine Learning Operations Manager at our company in Cambridge, MA, you will lead a team of engineers while taking charge of the technical leadership and strategy for our MLOps initiatives. Your responsibilities will include designing and developing platforms for ML applications, overseeing the scalability and performance of our infrastructure, and enhancing system observability through various monitoring tools. Additionally, you will be integral in managing projects that integrate ML applications with experiment tracking and model management services.