Job details

Staff Software Engineer, ML Ops and Infrastructure

Who are we?

Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future!

Why this team?

This team is responsible for building world-class infrastructure that is critical to all of Cohere’s success. Stability, scalability, and observability are all paramount, as this work acts as the foundation for all members of technical staff.

Our team optimizes for a wide range of technical skillsets (some of which are outlined below). Being self-directed and adaptable, and able to identify and solve key problems, is essential.

Please Note: All of our infrastructure roles require participating in a 24x7 on-call rotation, where you are compensated for your on-call schedule. 

For this role, we are targeting candidates who live in EMEA.

In order to be successful in the role, you have:

  • 5+ years of engineering experience running production infrastructure at a large scale 

  • Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters

  • Experience working with GCP, Azure, AWS and/or OCI 

  • Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments

  • Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork

  • The grit and adaptability to solve complex technical challenges that evolve day to day

Bonus qualifications:

  • You have worked with or supported MLEs or data scientists

  • Familiarity with troubleshooting RDMA networking

----------

As a Senior Site Reliability Engineer, you will:

  • Build self-service systems that automate managing, deploying, and operating services, including our custom Kubernetes operators that support language model deployments.

  • Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.

  • Take steps required to ensure we hit defined SLOs, including participation in an on-call rotation.

  • Build strong relationships with internal developers and influence the Infrastructure team’s roadmap based on their feedback.

  • Develop our team through knowledge sharing and an active review process.

You may be a good fit if:

  • You have proven production experience with Kubernetes.

  • You have hands-on coding experience developing services and automated tests (we use Go).

  • You prefer contributing to Open Source solutions rather than building solutions from the ground up.

  • You have experience scaling and debugging cloud-based infrastructure (we use Oracle, GCP, and CoreWeave).

  • You draw motivation from building systems that help others be more productive.

  • You see mentorship, knowledge transfer, and review as essential prerequisites for a healthy team.

If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply! If you consider yourself a thoughtful worker, a lifelong learner, and a kind and playful team member, Cohere is the place for you.

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees at Cohere enjoy these Perks:

🤝 An open and inclusive culture and work environment 

🧑‍💻 Work closely with a team on the cutting edge of AI research 

🍽 Weekly lunch stipend, in-office lunches & snacks

🦷 Full health and dental benefits, including a separate budget to take care of your mental health 

🐣 100% Parental Leave top-up for 6 months for employees based in Canada, the US, and the UK

🎨 Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement

🏙 Remote-flexible, with offices in Toronto, New York, San Francisco, and London, plus a co-working stipend

✈️ 6 weeks of vacation

Note: This post is co-authored by both Cohere humans and Cohere technology.

Cohere Glassdoor company rating: 3.8

Average salary estimate

$140,000 / yearly (est.)
min: $120,000
max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Staff Software Engineer, ML Ops and Infrastructure, Cohere

Join Cohere as a Staff Software Engineer specializing in ML Ops and Infrastructure, where you'll become part of a passionate team dedicated to scaling intelligence for humanity. At Cohere, we focus on building innovative AI systems that support cutting-edge solutions like content generation and semantic search. In this pivotal role, you'll be responsible for creating robust infrastructure essential to the success of our organization, ensuring that our systems are stable, scalable, and observable. You will apply your extensive engineering experience, particularly with Kubernetes and cloud services like GCP, Azure, and AWS, to design and manage large-scale distributed systems that operate seamlessly. Your collaboration skills will help build mission-critical systems and foster relationships with internal developers, allowing you to positively influence future infrastructure roadmaps. If you are self-directed, enjoy solving complex problems, and thrive in a dynamic environment, this is the perfect opportunity for you. We’re on the lookout for passionate individuals who, like us, value diversity and are eager to contribute to a culture of inclusion while working on the frontier of AI. If you meet our requirements, we encourage you to apply and help shape the future with us at Cohere!

Frequently Asked Questions (FAQs) for Staff Software Engineer, ML Ops and Infrastructure Role at Cohere
What are the responsibilities of a Staff Software Engineer, ML Ops and Infrastructure at Cohere?

As a Staff Software Engineer focusing on ML Ops and Infrastructure at Cohere, your key responsibilities will include building self-service systems to automate the management and deployment of services, particularly focusing on our custom Kubernetes operators. You'll also ensure that our environment is observable and resilient, enabling developers to troubleshoot issues effectively. Participation in an on-call rotation is vital to meet defined Service Level Objectives (SLOs), alongside nurturing relationships with internal developers and contributing to the team through knowledge-sharing.

What qualifications are required for the Staff Software Engineer position at Cohere?

To be successful in the Staff Software Engineer role at Cohere, you should possess over five years of engineering experience working with large-scale production infrastructure. Familiarity with designing highly available distributed systems using Kubernetes and cloud services such as GCP, Azure, or AWS is necessary. In addition, having excellent collaboration and troubleshooting skills to build mission-critical systems is essential, along with a tenacious mindset towards solving complex technical challenges.

What is the team culture like at Cohere for the Staff Software Engineer, ML Ops and Infrastructure?

Cohere prides itself on fostering an open and inclusive culture where every team member contributes uniquely. As a Staff Software Engineer in ML Ops and Infrastructure, you’ll collaborate with a diverse group of experts who are passionate about their crafts. Teamwork is highly valued, and knowledge sharing occurs actively to enhance personal development and collective success. We're committed to creating a workplace that embraces diversity and promotes a sense of belonging.

Is remote work an option for the Staff Software Engineer, ML Ops and Infrastructure position at Cohere?

Yes, Cohere supports a remote-flexible work environment, allowing you to thrive whether you are based in Toronto, New York, San Francisco, London, or anywhere else. We also provide a co-working stipend, making it easier to stay productive while enjoying the benefits of a hybrid workspace arrangement.

What benefits can Staff Software Engineers at Cohere expect?

Staff Software Engineers at Cohere enjoy a comprehensive benefits package, including full health and dental coverage, a weekly lunch stipend, and a budget for mental health resources. We provide generous parental leave and personal enrichment benefits, as well as a healthy work-life balance with six weeks of vacation. Our culture promotes inclusiveness and support for personal growth.

Common Interview Questions for Staff Software Engineer, ML Ops and Infrastructure
Can you describe your experience with Kubernetes and how it applies to building infrastructure?

In preparing for this question, think about specific projects you've worked on involving Kubernetes. Discuss how you utilized its capabilities to automate deployments, manage container orchestration, and optimize resource allocation, ensuring high availability and scalability. Highlight any unique challenges you faced and how you overcame them using Kubernetes tools and strategies.

What strategies do you employ for troubleshooting complex Linux-based systems?

When addressing this question, consider mentioning different methods and tools you rely on for diagnostics, like logs, monitoring tools, and system performance metrics. Provide real examples of challenges you've tackled, detailing how you identified the root causes of issues and subsequently deployed solutions to enhance system reliability.

How do you ensure effective collaboration within an engineering team?

Discuss your approach to fostering collaboration, such as utilizing Agile methodologies, regular stand-ups, and code reviews. Emphasize the importance of open communication, leveraging collaborative tools, and how you foster a supportive environment where team members feel valued and encouraged to share ideas.

Can you explain your hands-on coding experience and how it benefits your engineering role?

Describe your programming background, including languages and frameworks you are proficient in, particularly focusing on practical applications in your projects. Explain how your coding experience enhances your understanding of system architecture, allows you to create efficient solutions, and aids in mentoring other team members.

What is your experience with cloud service providers like AWS, GCP, or Azure?

Be prepared to discuss your direct experience with these platforms, specifically detailing projects where you deployed or scaled applications. Talk about the services you've used (e.g., storage, database, serverless functions) and how you optimized costs while ensuring high performance and security.

How do you approach maintaining operational excellence, including SLOs?

For this question, explain methodologies you employ to track and meet SLOs, using concrete metrics and performance indicators. Discuss the importance of proactive monitoring, scheduled maintenance, and incident response strategies to maintain operational integrity in production infrastructure.
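One concrete way to talk about SLO tracking is the error budget: the share of requests that may fail before the objective is missed. The sketch below is only an illustration under stated assumptions. The function name `error_budget_remaining`, the request-based availability SLO, and the sample numbers are all hypothetical and are not taken from Cohere's tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for a request-based SLO.

    slo_target is the availability objective as a fraction (for example 0.999).
    The result is 1.0 when nothing has been spent, 0.0 when the budget is
    exactly used up, and negative when the SLO has been missed.
    All names and numbers here are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic in the window means no budget has been spent.
        return 1.0
    # Number of failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% target.
    remaining = error_budget_remaining(0.999, 1_000_000, 400)
    print(f"error budget remaining: {remaining:.1%}")
```

In this hypothetical window the target tolerates about 1,000 failures, so 400 failures leave roughly 60% of the budget. A negative result would signal that the SLO has already been missed for the window.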

Can you provide an example of a significant technical challenge you've tackled?

Prepare a compelling story about a specific technical obstacle, explaining the context, your thought process, the actions you took, and the outcome. Focus on demonstrating your problem-solving skills, resilience, and how you ensured that systems continued to operate smoothly for end users.

How do you keep up with the latest developments in ML Ops and infrastructure engineering?

Discuss the resources you use to stay current with industry trends, such as conferences, webinars, online courses, or tech blogs. Mention how you actively apply what you learn to your work and share insights with your team to drive innovation.

What does diversity and inclusion mean to you in a tech environment?

Reflect on the significance of diverse perspectives in technology. Share your thoughts on how a diverse and inclusive culture enhances problem-solving and innovation. Discuss personal experiences where inclusivity improved team dynamics or project outcomes.

Why do you want to work at Cohere, and what do you hope to contribute?

For this question, emphasize what excites you about working at Cohere, such as the mission, values, or the opportunity to be at the forefront of AI technology. Share specific skills or experiences you can bring to the team and how you envision making a positive impact on Cohere’s projects and culture.


Cohere, founded by AI pioneers, offers a leading enterprise AI platform that combines ease-of-use, data privacy, and unparalleled flexibility with its cloud-agnostic and API-accessible services.

CULTURE VALUES
Startup Mindset
Collaboration over Competition
Growth & Learning
Inclusive & Diverse
EMPLOYMENT TYPE
Full-time, remote
DATE POSTED
December 18, 2024
