Job details

Machine Learning Operations Manager

Our mission is to solve the most important and fundamental challenges in AI and Robotics to enable future generations of intelligent machines that will help us all live better lives.


Who we are looking for:

We are seeking a Machine Learning Operations (ML-Ops) Manager who is both technically adept and an effective leader. In this role, you will lead a small team of engineers while also being hands-on in designing, building, and maintaining infrastructure that supports the entire lifecycle of Machine Learning (ML) projects. If you have a passion for building scalable ML infrastructure, mentoring engineers, and collaborating with world-class researchers, this is the role for you!


What You Will Do
  • Technical Leadership & Strategy: Drive the design, development, and maintenance of company-wide MLOps platforms and tools, leveraging Kubernetes infrastructure for ML and data processing applications.
  • Team Management & Mentorship: Manage and mentor a small team of engineers, providing technical guidance, setting priorities, and fostering a collaborative team culture.
  • Scalability & Performance: Enable self-service access to ML-compute resources across on-prem and cloud environments, ensuring workload scalability, fault tolerance, and efficient job scheduling.
  • Monitoring & Observability: Enhance system observability through integrations with tools and services such as FluentD, Prometheus, Grafana, and DataDog to improve reliability and debugging.
  • Experiment & Model Lifecycle Management: Integrate ML applications with experiment tracking and model management services such as Weights and Biases.
  • Best Practices & Collaboration: Champion engineering best practices, drive improvements in CI/CD, infrastructure automation, and reproducibility. Work closely with ML Engineers, Data Engineers, DevOps teams, and researchers to accelerate research and deployment.


What You Will Bring
  • BS or MS in Computer Science, Engineering, or equivalent
  • 5+ years of experience in an ML-Ops, DevOps, ML Engineering, or software engineering role
  • 2+ years of experience managing or mentoring engineers (can be formal management or technical leadership)
  • Strong, hands-on experience with Kubernetes for ML applications
  • Experience developing ML-Ops platforms (covering data/artifact management, reproducibility, fault tolerance, experiment tracking, and model serving)
  • Proficiency in Python, Docker, and environment management tools (pip, poetry, uv, or similar)
  • Familiarity with CI/CD tools (GitHub Actions, ArgoCD) and Infrastructure as Code (Terraform)


Skills We Value
  • Experience with job scheduling mechanisms like Kueue
  • Hands-on experience with workflow orchestration tools (Airflow, Metaflow, Argo Workflows)
  • Experience managing cloud infrastructure (GCP, AWS) and hybrid-cloud environments
  • Knowledge of scalable AI/ML platforms like Ray or PyTorch Lightning
  • Experience with logging & monitoring tools (FluentD, Prometheus, Grafana, DataDog, or similar)
  • Exposure to ML model serving frameworks (TorchServe, ONNX Runtime, or similar)
  • Previous experience collaborating with research teams in academic or industrial settings


We provide equal employment opportunities to all employees and applicants for employment and prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.

Average salary estimate

$140,000 / year (est.)
min: $120,000
max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Machine Learning Operations Manager, Robotics and AI Institute

Join us as a Machine Learning Operations Manager in the vibrant tech hub of Cambridge, MA! We're on a mission to tackle some of the most pressing challenges in AI and robotics, paving the way for future intelligent machines that can greatly enrich our lives. In this pivotal role, you'll not only lead a small, talented team of engineers but also play a hands-on part in designing and maintaining robust infrastructure that underpins our Machine Learning projects. If you're passionate about creating scalable ML infrastructure, mentoring your fellow engineers, and collaborating with some of the brightest minds in research, this opportunity is tailor-made for you! Your responsibilities will include driving the design and maintenance of MLOps platforms leveraging Kubernetes, enhancing system observability with tools like Prometheus and Grafana, and advocating for engineering best practices across teams. You’ll bring your technical expertise in Python, Docker, and CI/CD tools while ensuring that our ML applications are efficient, reliable, and scalable. With your experience and leadership, you'll empower the next generation of ML projects and make a real difference in our organization.

Frequently Asked Questions (FAQs) for Machine Learning Operations Manager Role at Robotics and AI Institute
What are the key responsibilities of a Machine Learning Operations Manager at this company?

As a Machine Learning Operations Manager at our company in Cambridge, MA, you will lead a team of engineers while taking charge of the technical leadership and strategy for our MLOps initiatives. Your responsibilities will include designing and developing platforms for ML applications, overseeing the scalability and performance of our infrastructure, and enhancing system observability through various monitoring tools. Additionally, you will be integral in managing projects that integrate ML applications with experiment tracking and model management services.

What qualifications are required for the Machine Learning Operations Manager position?

To qualify for the Machine Learning Operations Manager position at our company, candidates should possess a BS or MS in Computer Science, Engineering, or a related field. We are looking for individuals with at least 5 years of experience in ML-Ops, DevOps, or software engineering, along with a minimum of 2 years in a mentoring or management role. Strong expertise in Kubernetes and proficiency with Python, Docker, and CI/CD tools are also essential.

What skills set apart a successful Machine Learning Operations Manager?

A successful Machine Learning Operations Manager at our company should not only have technical expertise in ML-Ops platforms but also strong leadership abilities. Skills that set candidates apart include experience with cloud infrastructure management (GCP, AWS), knowledge of AI/ML platforms like Ray or PyTorch Lightning, and familiarity with tools for logging and monitoring such as FluentD and DataDog. Additionally, exposure to collaborative research environments is highly valued.

What programming languages and tools should I be familiar with for this role?

In the Machine Learning Operations Manager position with us, familiarity with programming languages like Python is crucial, as well as experience with Docker for containerization. It's beneficial to know CI/CD tools such as GitHub Actions and infrastructure as code tools like Terraform. Additionally, experience with job scheduling mechanisms and workflow orchestration tools will greatly enhance your effectiveness in this role.

Is prior management experience necessary for the Machine Learning Operations Manager role?

Yes, prior management or mentoring experience is necessary for the Machine Learning Operations Manager role at our company. We require candidates to have at least 2 years of experience in managing or mentoring engineers, which can be through formal management roles or through providing technical leadership within teams. Leadership skills and the ability to foster a collaborative culture are essential for success.

Common Interview Questions for Machine Learning Operations Manager
What are the primary challenges you expect to face as a Machine Learning Operations Manager?

As a Machine Learning Operations Manager, you may face challenges such as ensuring the scalability and performance of ML applications while managing a diverse team. When answering this question, focus on your strategies for overcoming technical and team dynamics issues, including your experience with collaboration, problem-solving, and implementing best practices.

How do you prioritize tasks in a fast-paced Machine Learning environment?

In a fast-paced ML environment, prioritizing tasks is essential. Share your approach to using agile methodologies, setting team objectives, and aligning tasks with company goals. Discuss how communication and collaboration with your team aids in managing priorities effectively.

Can you describe your experience with Kubernetes in machine learning applications?

When discussing your experience with Kubernetes, highlight specific projects where you've utilized Kubernetes for deploying ML applications. Explain how you managed scaling, load balancing, and fault tolerance within those projects to demonstrate your technical competence and relevant experience.

What is your approach to mentoring engineers on your team?

Your mentoring approach should focus on fostering an environment where engineers feel supported and encouraged to grow. Discuss techniques like setting up regular one-on-one sessions for feedback, creating opportunities for collaboration, and encouraging experimentation to help your team develop their skills.

How do you ensure observability in machine learning systems?

To ensure observability in ML systems, you should discuss your experience with tools like Prometheus and Grafana to monitor system performance. Explain how you integrate logging and monitoring frameworks into your workflow for troubleshooting and performance optimization, providing specific examples if possible.

What role does collaboration play in successful MLOps implementation?

Collaboration is key in MLOps implementation. Highlight your experiences working with cross-functional teams such as ML Engineers, Data Engineers, and DevOps teams. Emphasize the importance of communication, shared goals, and regular updates for driving effective collaboration.

How do you manage resource allocation across multi-cloud environments?

Discuss your strategies for managing resource allocation across multi-cloud environments, including your experience with tools like Terraform. Explain how you assess resource needs for various ML tasks and balance these with cost considerations across different platforms.

What CI/CD practices do you find essential for MLOps?

In your answer, mention CI/CD practices such as automated testing, continuous integration and delivery pipelines, and version control management that are crucial for MLOps. Talk about how these practices help maintain robustness, reliability, and efficiency in deploying machine learning models.

Can you provide an example of a successful ML project you managed?

Be prepared to share details about a successful ML project, including your role, the outcomes, the challenges faced, and how you collaborated with your team. Highlight the measurable impact of the project and your contributions to its success.

How do you stay current with advancements in Machine Learning and Operations?

Staying current with advancements in ML and operations is vital. Discuss how you regularly engage in learning through attending conferences, participating in webinars, following industry literature, and connecting with peers in the field. This shows your commitment to continuous improvement and keeping your skills fresh.

Employment type: Full-time, on-site
Date posted: March 26, 2025
