
Platform engineer, MLOps

✍🏽 About Writer

Writer is the full-stack generative AI platform delivering transformative ROI for the world’s leading enterprises. Named one of the top 50 companies in AI by Forbes and one of the best places to work by Inc. Magazine, Writer empowers hundreds of customers like Accenture, Intuit, L’Oreal, Mars, Salesforce, and Vanguard to transform the way they work. 

Writer’s fully integrated solution makes it easy to deploy secure and reliable AI applications and agents that solve mission-critical business challenges. Our suite of development tools is powered by Palmyra, Writer’s state-of-the-art family of LLMs, alongside our industry-leading graph-based RAG and customizable AI guardrails.

Founded in 2020 with office hubs in San Francisco, New York City, Austin, Chicago, and London, our team of over 250 employees thinks big and moves fast, and we’re looking for smart, hardworking builders and scalers to join us on our journey to create a better future of work. 

📐 About this role 

As a Platform engineer, MLOps, you will be critical to deploying and managing the cutting-edge infrastructure that powers our AI/ML operations, and you will collaborate with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. Your expertise will also extend to setting up and maintaining monitoring, logging, and alerting systems to oversee extensive training runs and client-facing APIs. You will ensure that training environments are consistently available and efficiently managed across multiple clusters, enhancing our containerization and orchestration systems with advanced tools like Docker and Kubernetes.

This role demands a proactive approach to maintaining large Kubernetes clusters, optimizing system performance, and providing operational support for our suite of software solutions. If you are driven by challenges and motivated by the continuous pursuit of innovation, this role offers the opportunity to make a significant impact in a dynamic, fast-paced environment.

🦸🏻‍♀️ Your responsibilities:

  • Work closely with AI/ML engineers and researchers to design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.

  • Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs.

  • Ensure training environments are consistently available and prepared across multiple clusters.

  • Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes.

  • Operate and oversee large Kubernetes clusters with GPU workloads.

  • Improve reliability, quality, and time-to-market of our suite of software solutions.

  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.

  • Provide primary operational support and engineering for multiple large-scale distributed software applications.

⭐️ Is this you? 

  • You have professional experience with: 

    • Model training

    • Huggingface Transformers

    • PyTorch

    • vLLM

    • TensorRT

    • Infrastructure as code tools like Terraform

    • Scripting languages such as Python or Bash

    • Cloud platforms such as Google Cloud, AWS or Azure

    • Git and GitHub workflows

    • Tracing and monitoring

  • You're familiar with high-performance, large-scale ML systems

  • You have a knack for troubleshooting complex systems and enjoy solving challenging problems

  • You're proactive in identifying problems, performance bottlenecks, and areas for improvement

  • You take pride in building and operating scalable, reliable, secure systems

  • You're familiar with monitoring tools such as Prometheus, Grafana, or similar

  • You're comfortable with ambiguity and rapid change

Preferred skills and experience:

  • 5+ years building core infrastructure 

  • Experience running inference clusters at scale

  • Experience operating orchestration systems such as Kubernetes at scale

Curious to learn more about who we are and how we operate? Visit us here

🍩 Benefits & perks

  • Generous PTO, plus company holidays

  • Medical, dental, and vision coverage for you and your family

  • Paid parental leave for all parents (12 weeks)

  • Fertility and family planning support

  • Early-detection cancer testing through Galleri

  • Flexible spending account and dependent FSA options

  • Health savings account for eligible plans with company contribution

  • Annual work-life stipends for:

    • Home office setup, cell phone, internet

    • Wellness stipend for gym, massage/chiropractor, personal training, etc.

    • Learning and development stipend

  • Company-wide off-sites and team off-sites

  • Competitive compensation, company stock options and 401k

Writer is an equal-opportunity employer and is committed to diversity. We don't make hiring or employment decisions based on race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other basis protected by applicable local, state or federal law. Under the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

By submitting your application on the application page, you acknowledge and agree to Writer's Global Candidate Privacy Notice.

Writer Glassdoor Company Review: 4.8
Writer DE&I Review: 5.0

Average salary estimate

$135,000 / year (est.)
Min: $120,000
Max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Platform engineer, MLOps at Writer

If you're passionate about harnessing the power of AI and machine learning, Writer is excited to offer a fantastic opportunity for a Platform Engineer, MLOps in New York City! At Writer, we’re not just about building cutting-edge generative AI platforms; we’re about transforming the way enterprises operate. As a Platform Engineer focusing on MLOps, you’ll play a pivotal role in shaping the infrastructure that supports our AI and ML initiatives. You'll collaborate closely with our talented AI/ML engineers and researchers to design a robust CI/CD pipeline, ensuring our experiments are both safe and reproducible. Your expertise will be essential for setting up and maintaining monitoring and logging systems for our complex training runs and client-facing APIs. With a focus on optimizing system performance and availability across clusters, you'll get to dive deep into advanced tools like Docker and Kubernetes. This position is perfect for those who thrive on challenges and embody a spirit of innovation. If you're eager to make an impact in a dynamic environment, providing operational support and enhancing our infrastructure, we want to hear from you! Join Writer and contribute to changing how businesses leverage AI!

Frequently Asked Questions (FAQs) for Platform engineer, MLOps Role at Writer
What are the primary responsibilities of a Platform Engineer, MLOps at Writer?

As a Platform Engineer, MLOps at Writer, your primary responsibilities include designing and deploying a CI/CD pipeline in collaboration with AI/ML engineers, managing extensive training runs through robust monitoring and logging systems, and ensuring consistent availability of training environments across clusters. You’ll also enhance our containerization and orchestration systems using technologies like Docker and Kubernetes.

What qualifications are needed for the Platform Engineer, MLOps position at Writer?

To qualify for the Platform Engineer, MLOps position at Writer, candidates should possess professional experience with model training, Huggingface Transformers, PyTorch, and infrastructure as code tools like Terraform. Proficiency in scripting languages such as Python or Bash, expertise in cloud platforms like AWS or Azure, and a solid understanding of monitoring tools such as Prometheus and Grafana are also essential.

What tools will a Platform Engineer, MLOps at Writer be expected to use?

In this role, you will be using an array of tools including Docker and Kubernetes for containerization and orchestration, along with monitoring tools like Prometheus and Grafana. Familiarity with frameworks such as PyTorch and Huggingface Transformers, as well as infrastructure as code tools like Terraform, will also be important for success in the role.

How does the Platform Engineer, MLOps role contribute to AI initiatives at Writer?

The Platform Engineer, MLOps at Writer is integral to our AI initiatives as this role ensures that the infrastructure is not only reliable but also highly efficient in managing AI models. By deploying CI/CD pipelines and overseeing large Kubernetes clusters for GPU workloads, you will enable our AI/ML teams to conduct safe, reproducible experiments, significantly impacting our innovative projects.

What is the work culture like for a Platform Engineer, MLOps at Writer?

At Writer, the work culture for a Platform Engineer, MLOps is dynamic and collaborative. You will be part of a fast-paced team that values innovation and creativity, working alongside smart, hardworking colleagues. The company prioritizes diversity and inclusion, providing an environment where your ideas can thrive and where continuous improvement is embedded in the company's ethos.

Common Interview Questions for Platform engineer, MLOps
Can you explain the role of a CI/CD pipeline in MLOps?

A CI/CD pipeline in MLOps is crucial for ensuring that the machine learning models are delivered reliably and efficiently. It automates the process of integrating code changes, testing them, and deploying them to production. A strong answer may include discussing how it improves collaboration, reduces errors, and enables rapid iterations in model development.
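
As a concrete illustration, here is a minimal Python sketch of the kind of reproducibility step a CI/CD pipeline might run before training: it pins the random seed and records hashes of the config, data, and code so an experiment can be re-run exactly. The file paths, config values, and helper names are hypothetical, not Writer's actual tooling.

```python
# Hypothetical CI step: capture everything needed to reproduce a training run.
# Paths and config values are placeholders; assumes the script runs inside a git checkout.
import hashlib
import json
import random
import subprocess


def file_sha256(path: str) -> str:
    """Hash an artifact (e.g., the training dataset) for provenance."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reproducibility_manifest(config: dict, data_path: str, seed: int = 42) -> dict:
    """Pin the seed and record hashes so the experiment can be re-run exactly."""
    random.seed(seed)  # frameworks such as PyTorch would also have their seeds set here
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    return {
        "git_commit": commit,
        "seed": seed,
        "config_sha256": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest(),
        "data_sha256": file_sha256(data_path),
    }


if __name__ == "__main__":
    manifest = reproducibility_manifest({"lr": 3e-4, "epochs": 3}, "train.jsonl")
    print(json.dumps(manifest, indent=2))  # archived alongside the run's artifacts
```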

What experience do you have with containerization technologies like Docker?

When asked about your experience with Docker, focus on specific projects where you've utilized Docker to streamline development and deployment. Discuss how containerization aids in application scalability and reliability, and mention any particular challenges you’ve overcome using Docker to manage complex applications.
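
If it helps to ground that discussion, here is a small, hedged sketch using the Docker SDK for Python to run a throwaway containerized job with a memory cap. It assumes a local Docker daemon; the image and command are placeholders.

```python
# Illustrative only: run a short containerized job with the Docker SDK for Python.
import docker

docker_client = docker.from_env()
container = docker_client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "print('hello from an isolated, reproducible environment')"],
    detach=True,
    mem_limit="512m",  # cap memory so one job cannot starve the host
)
container.wait()                  # block until the job finishes
print(container.logs().decode())  # prints the container's stdout
container.remove()
```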

How would you troubleshoot a performance issue in a Kubernetes cluster?

For troubleshooting a performance issue in a Kubernetes cluster, describe a systematic approach: start by monitoring resource usage, checking logs, and analyzing metrics. Elaborate on the tools, such as Prometheus and Grafana, that you would use for monitoring; this will highlight your technical skills and problem-solving ability.
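
For instance, a first triage step might be a quick script against the Prometheus HTTP API to find the busiest pods in a namespace. This is only a sketch: the Prometheus URL and namespace are placeholders, and it assumes standard cAdvisor container metrics are being scraped.

```python
# Illustrative only: list the top CPU-consuming pods in a namespace via Prometheus.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint


def top_cpu_pods(namespace: str, limit: int = 5) -> None:
    query = (
        f'topk({limit}, sum by (pod) '
        f'(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[5m])))'
    )
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        cores = float(result["value"][1])
        print(f"{pod}: {cores:.2f} CPU cores (5m avg)")


if __name__ == "__main__":
    top_cpu_pods("ml-training")  # hypothetical namespace
```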

What are the best practices for optimizing Kubernetes workloads for AI/ML tasks?

Best practices for optimizing Kubernetes workloads for AI/ML tasks include leveraging GPU resources effectively, using node affinities for deployment, and properly configuring resource limits and requests. Additionally, mention the importance of horizontal pod autoscaling and efficient data management.
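
To make that concrete, here is a hedged sketch using the official `kubernetes` Python client to submit a training pod with explicit GPU requests/limits, node affinity, and a toleration. The namespace, image, instance-type label, and pod name are assumptions that depend on the cluster; an NVIDIA device plugin is assumed for the `nvidia.com/gpu` resource.

```python
# Illustrative only: a GPU training pod with resource limits, node affinity, and a toleration.
from kubernetes import client, config


def gpu_training_pod() -> client.V1Pod:
    container = client.V1Container(
        name="trainer",
        image="nvcr.io/nvidia/pytorch:24.01-py3",  # placeholder training image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "1"},
        ),
    )
    # Pin the pod to GPU nodes via a label the cluster is assumed to expose.
    affinity = client.V1Affinity(
        node_affinity=client.V1NodeAffinity(
            required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
                node_selector_terms=[
                    client.V1NodeSelectorTerm(
                        match_expressions=[
                            client.V1NodeSelectorRequirement(
                                key="node.kubernetes.io/instance-type",
                                operator="In",
                                values=["gpu-a100"],  # hypothetical instance type
                            )
                        ]
                    )
                ]
            )
        )
    )
    spec = client.V1PodSpec(
        containers=[container],
        affinity=affinity,
        tolerations=[
            client.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
        restart_policy="Never",
    )
    return client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name="gpu-train-demo", labels={"team": "mlops"}),
        spec=spec,
    )


if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    client.CoreV1Api().create_namespaced_pod(namespace="ml-training", body=gpu_training_pod())
```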

How do you ensure security and reliability in your MLOps processes?

To ensure security and reliability in MLOps, you should implement data privacy measures, use role-based access control, and set up comprehensive monitoring and alerting systems. Explain your understanding of secure coding practices and regular vulnerability assessments, emphasizing your commitment to maintaining compliance.

What cloud platforms have you worked with, and how did you utilize them in your previous roles?

Discuss specific projects you've undertaken on cloud platforms such as AWS, Azure, or Google Cloud. Explain how you leveraged cloud services for ML model deployment, scaling applications, or reducing costs, providing clear examples that demonstrate your expertise.

Can you describe your experience with Huggingface Transformers?

Provide a concise overview of how you've utilized Huggingface Transformers in building and deploying language models or NLP applications. Mention any specific challenges you faced and how you overcame them, showcasing your expertise in working with modern machine learning frameworks.
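
A minimal example along these lines, using the Transformers `pipeline` API with a small public model (the model choice and generation settings are placeholders, not Writer's stack):

```python
# Illustrative only: minimal text generation with Hugging Face Transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small public model for a quick demo
outputs = generator(
    "MLOps engineers keep training pipelines",
    max_new_tokens=30,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```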

How do you stay updated with the latest trends in MLOps and AI technologies?

When asked about staying updated with trends, detail the resources you use, such as academic papers, industry blogs, workshops, and conferences. Emphasize your commitment to continuous learning and your participation in relevant communities or forums.

Describe your experience with Infrastructure as Code (IaC). Why is it important?

Explain your experience with Infrastructure as Code, highlighting tools like Terraform. Discuss how IaC enables automated environment provisioning, consistent setups, and version-controlled infrastructure definitions, ensuring environments are reproducible and auditable.
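
One way to show the "automated and repeatable" point is a tiny CI helper that drives the Terraform CLI non-interactively. This sketch assumes the `terraform` binary is installed and that the (hypothetical) `infra/` directory and var file exist.

```python
# Illustrative only: run Terraform non-interactively from a CI job.
import subprocess


def terraform(args: list[str], cwd: str = "infra/") -> None:
    """Run a terraform subcommand and fail the CI job on any error."""
    subprocess.run(["terraform", *args], cwd=cwd, check=True)


def plan_and_apply(var_file: str = "envs/staging.tfvars") -> None:
    terraform(["init", "-input=false"])
    terraform(["plan", "-input=false", f"-var-file={var_file}", "-out=tfplan"])
    terraform(["apply", "-input=false", "tfplan"])


if __name__ == "__main__":
    plan_and_apply()
```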

What role do monitoring and logging play in MLOps?

Monitoring and logging are critical in MLOps for gaining real-time insight into deployed models, identifying performance issues quickly, and ensuring system reliability. Discuss your experience setting up monitoring systems with tools like Prometheus and Grafana, and share how these tools help in proactive problem-solving.
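
As an illustration, this is roughly what instrumenting a model-serving process with `prometheus_client` looks like, so Prometheus can scrape request counts and latencies and Grafana can chart them. The metric names, label value, and port are placeholders.

```python
# Illustrative only: expose request metrics from a serving process for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Inference requests served", ["model"])
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds", ["model"])


def handle_request(model_name: str) -> None:
    with LATENCY.labels(model=model_name).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real model inference
    REQUESTS.labels(model=model_name).inc()


if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request("demo-model")
```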


Employment type: Full-time, on-site
Date posted: December 24, 2024
