Software Engineer - ML Platform


At Replicate, we’re on a mission to redefine AI infrastructure. We’re not just another AI company; we’re a team of developers, engineers, and innovators from organizations like Docker, Spotify, Dropbox, GitHub, Heroku, NVIDIA, and more. We’ve built foundational technologies like Docker Compose and OpenAPI, and now, we’re applying that expertise to make AI deployment as intuitive and reliable as web deployment.

Our goal is straightforward: build the best platform for creating, deploying, and running machine learning models. As an Infrastructure Engineer on the Platform team, you’ll play a key role in making generative AI available to everyone.

The Platform team at Replicate oversees the entire lifecycle of models, from packaging and deployment to serving, scaling, and monitoring. You’ll be developing the infrastructure that supports thousands of models and powers millions of predictions daily. This is a chance to build something truly innovative, where each decision you make has a tangible impact and allows your creativity to shine.

What you’ll be doing:

  • Designing and building our deployment and model-serving platform.

  • Building technology to operate the latest advancements in the ML and AI space.

  • Designing systems to maximize the utilization and reliability of our Kubernetes clusters and GPUs, including multi-regional traffic shifting and failover capabilities.

  • Owning and optimizing fair and reliable task allocation and queuing across a diverse set of customers with heterogeneous workloads.

  • Working with our Models team to speed up model inference through techniques like caching, weights management, machine configurations, and runtime optimizations in Python and PyTorch.

  • Working with technologies such as:

    • Python, Go, and Node.js

    • Kubernetes and Terraform

    • Redis, Google BigQuery, and PostgreSQL

We're looking for the right person, not just someone who checks boxes, but it’s likely you have…

  • Experience building platforms at scale.

  • Worked in complex systems with many moving parts; you have opinions on monoliths vs. services.

  • Designed and implemented developer-friendly APIs to enable scalable and reliable integration.

  • Hands-on experience setting up and operating Kubernetes.

  • A passion for building tools that empower developers.

  • Strong communication and collaboration skills, with the ability to understand customer needs and distill complex topics into clear, actionable insights. We believe that programming is about more than writing code; building a platform requires a collaborative approach.

  • At least 3 years of full-time software engineering experience.

These aren’t hard requirements, but we definitely want to talk with you if…

  • You have worked on machine learning platform teams in the past.

  • You have experience working with or on teams that have put ML/AI into production, even though this role does not entail building ML models directly.

  • You have some exposure to serving Generative AI features where GPUs are costly commodities and workloads can take significant time to finish.

This role can be remote (anywhere in the United States) or in-person. We have a strong preference for people in PST. If possible, we like people to come into our San Francisco office at least 3 days a week.

Average salary estimate

$130,000 per year (est.)
Range: $100,000 (min) to $160,000 (max)

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Software Engineer - ML Platform, Replicate

Join Replicate as a Software Engineer - ML Platform and help us revolutionize AI infrastructure! We've gathered an incredible team of developers and innovators hailing from renowned organizations like Docker, Spotify, and NVIDIA, and we’re working together to make AI deployment as seamless as traditional web deployment. At Replicate, we're dedicated to building a top-notch platform for creating, deploying, and running machine learning models, and you will play a crucial role on our Platform team. Here, you will oversee the entire lifecycle of machine learning models, diving into packaging, deployment, serving, scaling, and monitoring. Imagine developing the infrastructure that not only supports thousands of models but also powers millions of daily predictions. Your responsibilities will include designing our model-serving platform, maximizing Kubernetes cluster efficiency, and collaborating with our talented Models team to enhance model inference through various optimizations in Python and PyTorch. With your experience in building scalable platforms and knowledge of technologies like Go, Node.js, Redis, and Google BigQuery, you’ll significantly impact our mission. More than just technical skills, we value your ability to communicate effectively and work collaboratively as we create tools that empower developers to thrive. This role can be remote or based in our San Francisco office, depending on your preference. If you're passionate about innovating within the generative AI space, we want to hear from you!

Frequently Asked Questions (FAQs) for Software Engineer - ML Platform Role at Replicate
What are the main responsibilities of a Software Engineer - ML Platform at Replicate?

As a Software Engineer - ML Platform at Replicate, your primary responsibilities include designing and building our deployment and model-serving platform, enhancing the utilization of Kubernetes clusters and GPUs, and owning task allocation and queuing across diverse customer workloads. Additionally, you'll collaborate with the Models team to optimize model inference and work with technologies such as Python, Go, and Redis to provide an innovative ML infrastructure.

What qualifications are ideal for a Software Engineer - ML Platform at Replicate?

Ideal candidates for the Software Engineer - ML Platform position at Replicate typically have at least three years of software engineering experience, with a strong background in building scalable platforms. Experience with Kubernetes, designing developer-friendly APIs, and familiarity with machine learning platforms is a plus. Strong collaboration and communication skills are also important for understanding customer needs and sharing insights among team members.

What technologies should a Software Engineer - ML Platform be familiar with at Replicate?

A Software Engineer - ML Platform at Replicate should be familiar with a range of technologies including Python, Node.js, Go, and tools for container orchestration such as Kubernetes and Terraform. Additionally, proficiency in databases like PostgreSQL, Redis, and Google BigQuery will enhance your contributions to our infrastructure and model-serving capabilities.

Can a Software Engineer - ML Platform work remotely at Replicate?

Yes, the Software Engineer - ML Platform position at Replicate offers flexibility for remote work across the United States. While there is a preference for candidates in the Pacific Time Zone who can come into our San Francisco office at least three days a week, remote options are available for those who coordinate effectively with the team.

How does team collaboration enhance the role of a Software Engineer - ML Platform at Replicate?

Collaboration is key for a Software Engineer - ML Platform at Replicate. By working closely with the Models team and developers from diverse backgrounds, you'll engage in decision-making that influences the development of tools that empower your teammates. This collaborative approach aims to break down complex problems into manageable insights, making it essential for the role.

Common Interview Questions for Software Engineer - ML Platform
Can you describe a previous project where you built a scalable platform?

When answering this question, emphasize your role in the project's architecture and the technologies you employed. Discuss how you approached scalability challenges and enhanced the performance of the platform, including specific outcomes like increased user satisfaction or reduced costs.

How do you prioritize tasks in a complex system with multiple moving parts?

Discuss your methodology for task prioritization, using examples from past experiences in managing project timelines and team collaboration. Highlight tools or strategies you utilize to ensure efficient workflow, such as kanban boards or agile methodologies, to showcase your structured approach.

What is your experience with Kubernetes and how have you utilized it in your past roles?

Explain your hands-on experience with Kubernetes, focusing on specific projects where you set up and managed clusters. Mention any issues you resolved related to scaling or load balancing, showing your deep understanding of Kubernetes' capabilities and advantages.

What methods do you employ for optimizing machine learning models?

Share techniques you've used for model optimization, such as caching strategies, resource management, or utilizing enhanced configurations in ML frameworks like PyTorch. Provide insights on performance enhancements you've achieved and the impact on delivery timelines.

How do you ensure effective communication within a team environment?

Outline your strategies for fostering communication, such as regular check-ins, using collaboration tools, and promoting an open-door policy. Use examples to illustrate how these practices have led to successful project milestones and team cohesion.

Explain your approach to designing developer-friendly APIs.

Highlight your philosophy on API design, discussing principles you swear by like clarity, simplicity, and robustness. Share specific examples where you've created or contributed to APIs that significantly improved developer experience with enhanced documentation or intuitive endpoints.

Describe how you handle code reviews and provide feedback.

Discuss your philosophy on code reviews, emphasizing the importance of constructive feedback and knowledge sharing. Provide examples of how you have given or received feedback, and the positive outcomes that arose from those exchanges.

Can you discuss a time you had to manage conflicting priorities?

Use this question to illustrate your problem-solving and decision-making abilities. Provide a specific example, detailing how you assessed the situation, communicated with stakeholders, and implemented a solution that aligned with project goals.

What role do you think collaboration plays in building effective ML platforms?

Discuss the importance of collaboration in understanding diverse viewpoints and requirements across teams. Provide examples of projects that benefited from collaborative input, showing how it led to deeper insights and improved product offerings.

How do you keep up with the latest trends in AI and machine learning?

Share your strategies for staying informed about advancements in AI, such as reading research papers, attending conferences, or participating in online forums. Highlight how these efforts have influenced your work and helped you introduce innovative solutions.


Machine learning can now do some extraordinary things, but it's still hard to use. You spend all day battling with messy Python scripts, broken Colab notebooks, perplexing CUDA errors, misshapen tensors. It's a mess. The reason machine learning is s...

Employment type: Full-time, hybrid
Date posted: December 5, 2024
