MLOps Professional Services Engineer (Cloud & AI Infra)

About the Company

Our client is at the forefront of the AI revolution, providing cutting-edge infrastructure that's reshaping the landscape of artificial intelligence. They offer an AI-centric cloud platform that empowers Fortune 500 companies, top-tier innovative startups, and AI researchers to drive breakthroughs in AI. This publicly traded company is committed to building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, tools, and services for developers.

  • Company Type: Publicly traded

  • Product: AI-centric GPU cloud platform & infrastructure for training AI models

  • Candidate Location: Remote anywhere in the US

Their mission is to democratize access to world-class AI infrastructure, enabling organizations of all sizes to turn bold AI ambitions into reality. At the core of their success is a culture that celebrates creativity, embraces challenges, and thrives on collaboration.

The Opportunity

As an MLOps Professional Services Engineer (Remote), you’ll play a key role in designing, implementing, and maintaining large-scale machine learning (ML) training and inference workflows for clients. Working closely with a Solutions Architect and support teams, you’ll provide expert, hands-on guidance to help clients achieve optimal ML pipeline performance and efficiency. 

What You'll Do

  • Design and implement scalable ML training and inference workflows using Kubernetes and Slurm, focusing on containerization (e.g., Docker) and orchestration

  • Optimize ML model training and inference performance with data scientists and engineers

  • Develop and expand a library of ready-to-deploy, standardized training and inference solutions by designing, deploying, and managing Kubernetes and Slurm clusters for large-scale ML training

  • Integrate Kubernetes and Slurm with popular ML frameworks such as TensorFlow, PyTorch, or MXNet, ensuring seamless execution of distributed ML training workloads

  • Develop monitoring and logging tools to track distributed training performance, identify bottlenecks, and troubleshoot issues

  • Create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform, or Python

  • Participate in industry conferences, meetups, and online forums to stay up to date with the latest developments in MLOps, Kubernetes, Slurm, and ML

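The automation work described above can be made concrete with a small example. Below is a hypothetical, stdlib-only sketch of generating a Slurm batch script for a multi-node distributed training run; the job name, node counts, and training script path are all illustrative:

```python
from textwrap import dedent

def make_sbatch(job_name: str, nodes: int, gpus_per_node: int, script: str) -> str:
    """Render a Slurm batch file for a multi-node distributed training run."""
    return dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name={job_name}
        #SBATCH --nodes={nodes}
        #SBATCH --gpus-per-node={gpus_per_node}
        #SBATCH --ntasks-per-node=1
        # One torchrun launcher per node; workers rendezvous via torchrun's defaults.
        srun torchrun --nnodes={nodes} --nproc-per-node={gpus_per_node} {script}
        """)

batch = make_sbatch("llm-pretrain", nodes=4, gpus_per_node=8, script="train.py")
print(batch)
```

In practice a script like this would be templated per client and submitted with `sbatch`; the point is that cluster launch logic becomes versioned, testable code rather than hand-edited shell files.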

What You Bring

  • At least 3 years of experience in MLOps, DevOps, or a related field

  • Strong experience with Kubernetes and containerization (e.g., Docker)

  • Experience with cloud providers like AWS, GCP, or Azure

  • Familiarity with Slurm or other distributed computing frameworks

  • Proficiency in Python, with experience in ML frameworks such as TensorFlow, PyTorch, or MXNet

  • Knowledge of ML model serving and deployment

  • Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI

  • Experience with monitoring and logging tools like Prometheus, Grafana, or the ELK Stack

  • Solid understanding of distributed computing principles, parallel processing, and job scheduling

  • Experience with automation tools like Ansible and Terraform

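The parallel-processing background listed above can be demonstrated with something as small as a data-parallel preprocessing step. A stdlib-only sketch (the `normalize` function is an invented stand-in for real feature preprocessing):

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(batch):
    """Scale a batch of values into [0, 1]: a stand-in for a real preprocessing step."""
    lo, hi = min(batch), max(batch)
    return [(x - lo) / (hi - lo) for x in batch]

def preprocess_parallel(batches, workers=4):
    # Batches are independent, so they map cleanly onto a worker pool.
    # CPU-bound work at scale would use processes (or Slurm array jobs) instead of threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalize, batches))

print(preprocess_parallel([[1, 2, 3], [10, 20, 30]]))
```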
Key Attributes for Success

  • Passion for AI and transformative technologies

  • A genuine interest in optimizing and scaling ML solutions for high-impact results

  • Results-driven mindset and problem-solver mentality

  • Adaptability and ability to thrive in a fast-paced startup environment

  • Comfortable working with an international team and diverse client base

  • Communication and collaboration skills, with experience working in cross-functional teams

Why Join?

  • Competitive compensation: $130,000-$175,000 (negotiable based on experience and skills)

  • Full medical and life insurance benefits: 100% coverage for health, vision, and dental insurance for employees and their families

  • 401(k) match program with up to a 4% company match

  • PTO and paid holidays 

  • Flexible remote work environment

  • Reimbursement of up to $85/month for mobile and internet

  • Work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs (H100, L40S, with H200 and Blackwell chips coming soon)

  • Be part of a team that operates one of the most powerful commercially available supercomputers

  • Contribute to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings

Interviewing Process

  • Level 1: Virtual interview with the Talent Acquisition Lead (General fit, Q&A)

  • Level 2: Virtual interview with the Hiring Manager (Skills assessment)

  • Level 3: Interview with a C-level executive (Final round)

  • Reference and Background Checks: Conducted post-interviews

  • Offer: Extended to the selected candidate

We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity, or expression, sexual orientation, or any other characteristic protected by applicable federal, state or local law.


What You Should Know About MLOps Professional Services Engineer (Cloud & AI Infra), Lavendo

Join an innovative team as an MLOps Professional Services Engineer with our client, a leading company revolutionizing the AI landscape! This exciting opportunity, based in San Francisco but open for remote work across the US, allows you to design, implement, and maintain large-scale machine learning (ML) workflows that are crucial for clients’ success. You'll work alongside talented Solutions Architects and support teams, offering hands-on guidance to optimize ML pipeline performance. With your expertise in Kubernetes and containerization—specifically Docker—you'll create scalable ML training and inference workflows that make a real impact. What’s more, you'll get to integrate with widely-used ML frameworks like TensorFlow and PyTorch, ensuring that high-volume workloads run smoothly. The company fosters a vibrant culture of creativity and collaboration, enabling you to thrive in a fast-paced environment. If you're passionate about AI and eager to tackle new challenges, this role is perfect for you. Plus, enjoy competitive compensation, comprehensive medical benefits, and a flexible work environment. Get ready to contribute to groundbreaking AI technologies and advance your career with a company that prioritizes both innovation and employee well-being!

Frequently Asked Questions (FAQs) for MLOps Professional Services Engineer (Cloud & AI Infra) Role at Lavendo
What skills are required for the MLOps Professional Services Engineer role at this AI-focused company?

To excel as an MLOps Professional Services Engineer at this innovative AI company, you should have strong experience with Kubernetes and containerization technologies like Docker. Additionally, familiarity with cloud providers such as AWS, GCP, or Azure is essential. A solid understanding of ML frameworks including TensorFlow, PyTorch, or MXNet will be vital to ensure integration and deployment of ML models. Your technical proficiency should also extend to automation tools like Ansible or Terraform, as well as monitoring solutions like Prometheus or Grafana.

What does the daily workflow look like for an MLOps Professional Services Engineer?

As an MLOps Professional Services Engineer at this cutting-edge AI company, your daily tasks will revolve around designing and optimizing ML training and inference workflows. You’ll collaborate closely with data scientists and engineers, ensuring the seamless operation of distributed ML training workloads. Expect to develop automation scripts, monitor ML performance, and leverage Kubernetes and Slurm to streamline processes. Engaging with clients and guiding them towards optimizing their ML pipelines will also be a critical component of your role.

What technologies will the MLOps Professional Services Engineer work with?

In this role, the MLOps Professional Services Engineer will work with a range of advanced technologies, including Kubernetes for container orchestration, and Slurm for distributed computing. Familiarity with ML frameworks such as TensorFlow, PyTorch, or MXNet will be crucial for model deployment. The position also involves using automation tools like Ansible and Terraform, as well as monitoring and logging software like Prometheus and Grafana, to ensure efficient and effective ML operations.

What are the benefits of working as an MLOps Professional Services Engineer in this company?

Working as an MLOps Professional Services Engineer in this AI-centric firm comes with numerous benefits, including competitive compensation ranging between $130,000-$175,000, plus full medical benefits for you and your family. There’s a robust 401(k) match program, flexible remote work options, and reimbursement for mobile and internet expenses. You'll also have the chance to engage with state-of-the-art technology, such as the latest NVIDIA GPUs, fostering both professional growth and personal satisfaction.

How does this company support career development for MLOps Professional Services Engineers?

This innovative company places a strong emphasis on career development for MLOps Professional Services Engineers. You’ll have opportunities to participate in industry conferences and meetups, which will keep you up-to-date with the latest ML and MLOps advancements. The hands-on role facilitates active engagement in cutting-edge projects, enabling you to refine your skills while contributing to impactful initiatives. Overall, the company’s culture fosters continuous learning and knowledge sharing within a collaborative environment.

What is the work culture like for MLOps Professional Services Engineers at this AI company?

The work culture for MLOps Professional Services Engineers at this AI-focused company is vibrant and dynamic, valuing creativity and collaboration. Employees are encouraged to embrace challenges and cultivate innovative solutions to complex problems. The organization is predominantly remote-friendly, allowing for a balanced work environment where adaptation and teamwork are at the forefront. This inclusive atmosphere supports diverse perspectives and ideas, making it an exciting place to work for technology enthusiasts.

What is the interviewing process for the MLOps Professional Services Engineer position?

The interviewing process for the MLOps Professional Services Engineer role involves three stages. Initially, a virtual interview with the Talent Acquisition Lead assesses general fit and answers any questions you might have. Next, you’ll have a technical assessment interview with the Hiring Manager to gauge your skills and capabilities. Finally, a conversation with a C-level executive constitutes the last stage, followed by reference and background checks. This thorough process ensures candidates are a good fit for both technical skills and cultural alignment.

Common Interview Questions for MLOps Professional Services Engineer (Cloud & AI Infra)
Can you explain your experience with Kubernetes and how you have used it in past MLOps projects?

In answering this question, focus on specific projects where you utilized Kubernetes, detailing how you set up clusters and managed applications. Mention any challenges faced and how Kubernetes helped in scaling your ML models. Highlight your familiarity with orchestration and any containerization using Docker, and share any results that demonstrate improved efficiency or performance.
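If it helps to ground an answer like this, a Kubernetes training Job can be described programmatically rather than in hand-written YAML. A hypothetical sketch that builds a `batch/v1` Job manifest as plain data (the image name and GPU count are invented):

```python
import json

def training_job_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a Kubernetes batch/v1 Job manifest for a single training pod."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # fail fast: do not retry a crashed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }

manifest = training_job_manifest("bert-finetune", "registry.example.com/trainer:latest", gpus=8)
print(json.dumps(manifest, indent=2))
```

Serializing manifests from code like this makes cluster configuration reviewable and easy to template across clients.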

How do you optimize the performance of ML training and inference pipelines?

When discussing optimization, emphasize your approach to identifying bottlenecks in ML pipelines and how you address them. You might mention specific tools or techniques, such as using distributed computing frameworks like Slurm, and integrating with ML frameworks like TensorFlow or PyTorch to improve efficiency. Sharing metrics, results, or case studies enhances the credibility of your answer.
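A simple way to make "identifying bottlenecks" concrete in an interview is to show how per-stage timings expose the slowest pipeline step. A stdlib sketch with invented stage names and numbers:

```python
def find_bottleneck(stage_seconds: dict) -> tuple:
    """Return the pipeline stage consuming the largest share of wall time."""
    total = sum(stage_seconds.values())
    stage = max(stage_seconds, key=stage_seconds.get)
    return stage, stage_seconds[stage] / total

# Hypothetical per-step timings from a profiled training loop.
timings = {"data_loading": 42.0, "forward": 18.0, "backward": 25.0, "optimizer": 5.0}
stage, share = find_bottleneck(timings)
print(f"bottleneck: {stage} ({share:.0%} of step time)")
```

Here data loading dominates, which would point toward prefetching or more loader workers rather than, say, a bigger GPU.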

Which automation tools have you used, and how have they improved your MLOps workflows?

Share your experience with automation tools such as Ansible or Terraform, explaining how they were implemented in MLOps processes. Discuss the impact of these tools on workflow efficiency, error reduction, or deployment speed. Providing specific examples of scripts you developed can illustrate your hands-on expertise while reinforcing your problem-solving capabilities.

What is your experience with CI/CD pipelines and how do you implement them in ML projects?

For this question, explain your approach to implementing Continuous Integration/Continuous Deployment (CI/CD) in machine learning projects. Discuss the importance of version control, testing, and deployment automation in ensuring high-quality ML models. Mention any tools you’ve used, like Jenkins or GitLab CI/CD, and give a brief overview of a successful pipeline you’ve established.
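One pattern worth citing in such an answer is a CI stage that blocks deployment when a freshly trained model regresses. A minimal, hypothetical quality gate (the metric names and thresholds are invented):

```python
THRESHOLDS = {"accuracy": 0.90, "auc": 0.85}  # minimum metrics a candidate model must meet

def gate(metrics):
    """Return the metrics that fall below their required threshold."""
    return [name for name, floor in THRESHOLDS.items() if metrics.get(name, 0.0) < floor]

def run_gate(metrics):
    failures = gate(metrics)
    if failures:
        print("model gate FAILED on: " + ", ".join(failures))
        return 1  # a non-zero return would become a failing exit code in CI
    print("model gate passed")
    return 0

exit_code = run_gate({"accuracy": 0.93, "auc": 0.88})
```

Wired into Jenkins or GitLab CI/CD as a pipeline step, a script like this turns "we test our models" into an enforced, auditable policy.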

Can you describe your familiarity with ML frameworks and model deployment?

In your response, highlight your hands-on experience with popular ML frameworks like TensorFlow, PyTorch, or MXNet. Describe how you have deployed ML models into production, optimizing for performance and scalability. This is a great opportunity to mention any challenges faced and strategies used to overcome them, showcasing your technical competence and strategic approach.

How do you approach monitoring and logging for distributed ML systems?

Outline the monitoring and logging tools you've used, such as Prometheus or Grafana. Discuss how you set them up to track performance metrics and detect anomalies in distributed ML systems. It’s vital to convey the importance of real-time logging and how you've utilized these systems to troubleshoot issues effectively and ensure high uptime.
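To go one level deeper than tool names, the answer can describe the kind of check those dashboards alert on. A stdlib sketch that flags a throughput drop against a rolling baseline (the numbers and thresholds are illustrative):

```python
import statistics

def throughput_alerts(samples_per_sec, window=5, drop_ratio=0.8):
    """Flag indices where throughput falls below drop_ratio times the mean of the prior window."""
    alerts = []
    for i in range(window, len(samples_per_sec)):
        baseline = statistics.mean(samples_per_sec[i - window:i])
        if samples_per_sec[i] < drop_ratio * baseline:
            alerts.append(i)
    return alerts

series = [1000, 990, 1010, 1005, 995, 600, 1000]  # sudden drop at index 5
print(throughput_alerts(series))
```

This is essentially what a Prometheus alerting rule encodes declaratively; showing the logic in code demonstrates you understand what the rule is actually computing.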

What challenges do you anticipate in the role of MLOps Professional Services Engineer?

Here, it's valuable to show both your problem-solving mindset and your knowledge of industry challenges. For example, you may discuss the complexities of managing distributed systems, ensuring seamless integrations between ML frameworks and infrastructure, or helping diverse clients with varying needs. Explain how your past experiences have prepared you to tackle these challenges head-on.

How would you explain complex technical concepts to non-technical team members?

Emphasize your communication skills and ability to break down complex topics into more digestible parts. Highlight any experiences where you successfully presented technical information to non-technical audiences, emphasizing patience and clarity. Providing an example of a subject you taught or simplified showcases your ability to foster understanding across teams.

Can you give an example of a successful ML project you worked on? What role did you play?

When describing a successful ML project, explain your role and contributions in depth. Discuss the project’s goals, methodologies used, and outcomes achieved. Quantifying results like performance improvements or efficiency gains adds weight to your response and showcases your impact on team success.

What motivates you to work in MLOps and how do you stay current with industry trends?

In your response, convey your passion for AI and transformative technology. Mention specific resources like blogs, online courses, or industry events that you follow to stay informed. Elaborating on your participation in relevant communities or conferences showcases ongoing commitment to professional development in the MLOps field.

EMPLOYMENT TYPE
Full-time, remote
DATE POSTED
November 24, 2024
