Our client is at the forefront of the AI revolution, providing cutting-edge infrastructure that's reshaping the landscape of artificial intelligence. They offer an AI-centric cloud platform that empowers Fortune 500 companies, top-tier innovative startups, and AI researchers to drive breakthroughs in AI. This publicly traded company is committed to building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, tools, and services for developers.
Company Type: Publicly traded
Product: AI-centric GPU cloud platform & infrastructure for training AI models
Candidate Location: Remote anywhere in the US
Their mission is to democratize access to world-class AI infrastructure, enabling organizations of all sizes to turn bold AI ambitions into reality. At the core of their success is a culture that celebrates creativity, embraces challenges, and thrives on collaboration.
As an MLOps Professional Services Engineer (Remote), you’ll play a key role in designing, implementing, and maintaining large-scale machine learning (ML) training and inference workflows for clients. Working closely with a Solutions Architect and support teams, you’ll provide expert, hands-on guidance to help clients achieve optimal ML pipeline performance and efficiency.
Design and implement scalable ML training and inference workflows using Kubernetes and Slurm, focusing on containerization (e.g., Docker) and orchestration; a brief illustrative sketch of such a workload follows this list
Collaborate with data scientists and engineers to optimize ML model training and inference performance
Develop and expand a library of standardized, ready-to-deploy training and inference solutions by designing, deploying, and managing Kubernetes and Slurm clusters for large-scale ML training
Integrate Kubernetes and Slurm with popular ML frameworks such as TensorFlow, PyTorch, and MXNet, ensuring seamless execution of distributed ML training workloads
Develop monitoring and logging tools to track distributed training performance, identify bottlenecks, and troubleshoot issues
Create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform, or Python
Participate in industry conferences, meetups, and online forums to stay up-to-date with the latest developments in MLOps, Kubernetes, Slurm, and ML
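To make the first responsibility above more concrete, here is a minimal, hedged sketch of a distributed PyTorch training entrypoint of the kind such workflows launch; it could run via torchrun inside a Docker container on Kubernetes or under a Slurm allocation. The model, dataset, and hyperparameters are illustrative placeholders only, not the client's actual stack.

```python
# Minimal sketch (placeholder model/data, not the client's code): a distributed
# PyTorch training entrypoint suitable for launching with torchrun on Kubernetes
# or under a Slurm allocation.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model standing in for a real training job.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under Slurm, a batch script would typically start one torchrun process per node (or use srun with the rendezvous environment variables set); on Kubernetes, the same container image can run as a Job or through a training operator. Exact launch mechanics vary by cluster, so treat this purely as an illustration.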
At least 3 years of experience in MLOps, DevOps, or a related field
Strong experience with Kubernetes and containerization (e.g., Docker)
Experience with cloud providers like AWS, GCP, or Azure
Familiarity with Slurm or other distributed computing frameworks
Proficiency in Python, with experience in ML frameworks such as TensorFlow, PyTorch, or MXNet
Knowledge of ML model serving and deployment
Familiarity with CI/CD pipelines and tools such as Jenkins, GitLab CI/CD, or CircleCI
Experience with monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack (a brief illustrative sketch follows this list)
Solid understanding of distributed computing principles, parallel processing, and job scheduling
Experience with automation tools such as Ansible and Terraform
Passion for AI and transformative technologies
A genuine interest in optimizing and scaling ML solutions for high-impact results
Results-driven, problem-solving mindset
Adaptability and ability to thrive in a fast-paced startup environment
Comfortable working with an international team and diverse client base
Communication and collaboration skills, with experience working in cross-functional teams
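As a small illustration of the monitoring experience listed above (see the Prometheus/Grafana item), the sketch below exports training throughput metrics with the prometheus_client Python library. The metric names, port, and dummy training step are assumptions made for the example, not a prescribed setup.

```python
# Minimal sketch (assumed metric names/port, not the client's stack): exposing
# training throughput metrics to Prometheus from a training loop.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

STEP_TIME = Gauge("train_step_seconds", "Wall-clock time of the last training step")
SAMPLES = Counter("train_samples_total", "Total training samples processed")


def train_step(batch_size: int) -> None:
    # Placeholder for a real forward/backward pass.
    time.sleep(random.uniform(0.05, 0.15))


def main() -> None:
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    batch_size = 64
    while True:
        start = time.time()
        train_step(batch_size)
        STEP_TIME.set(time.time() - start)
        SAMPLES.inc(batch_size)


if __name__ == "__main__":
    main()
```

A Prometheus server would scrape the /metrics endpoint on port 9100, and Grafana could chart step latency and samples processed; short-lived batch jobs often push to a Prometheus Pushgateway instead of exposing a long-lived endpoint.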
Competitive compensation: $130,000-$175,000 (negotiable based on experience and skills)
Full medical and life insurance benefits: 100% coverage for health, vision, and dental insurance for employees and their families
401(k) match program with up to a 4% company match
PTO and paid holidays
Flexible remote work environment
Reimbursement of up to $85/month for mobile and internet
Work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs (H100 and L40S, with H200 and Blackwell chips coming soon)
Be part of a team that operates one of the most powerful commercially available supercomputers
Contribute to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings
Level 1: Virtual interview with the Talent Acquisition Lead (General fit, Q&A)
Level 2: Virtual interview with the Hiring Manager (Skills assessment)
Level 3: Interview with C-level leadership (Final round)
Reference and Background Checks: Conducted post-interviews
Offer: Extended to the selected candidate
We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.
Join an innovative team as an MLOps Professional Services Engineer with our client, a leading company revolutionizing the AI landscape! This exciting opportunity, based in San Francisco but open for remote work across the US, allows you to design, implement, and maintain large-scale machine learning (ML) workflows that are crucial for clients’ success. You'll work alongside talented Solutions Architects and support teams, offering hands-on guidance to optimize ML pipeline performance. With your expertise in Kubernetes and containerization—specifically Docker—you'll create scalable ML training and inference workflows that make a real impact. What’s more, you'll get to integrate with widely-used ML frameworks like TensorFlow and PyTorch, ensuring that high-volume workloads run smoothly. The company fosters a vibrant culture of creativity and collaboration, enabling you to thrive in a fast-paced environment. If you're passionate about AI and eager to tackle new challenges, this role is perfect for you. Plus, enjoy competitive compensation, comprehensive medical benefits, and a flexible work environment. Get ready to contribute to groundbreaking AI technologies and advance your career with a company that prioritizes both innovation and employee well-being!