Job details

MLOps Engineer - Supercomputing

Who are we?

Coastal Carbon is a seed-funded startup on a mission to create positive impact through earth observation and AI. Founded at the University of Waterloo by a team of PhDs and engineers, we’re backed by some of the best AI and climate tech investors, such as HF0, Inovia Capital and Propeller Ventures, angels such as James Tamplin (cofounder of Firebase) and Sid Gorham (cofounder of OpenTable and Granular), and partners such as AWS and the United Nations.

What do we do?

We’re building multimodal foundation models for the natural world. We believe there’s more to the world than the internet, and more to intelligence than memorizing the internet. Our models are trained on satellite remote sensing and real-world ground-truth data, and are used by our customers in nature conservation, carbon dioxide removal, and government to protect and positively impact our rapidly changing world. Our ultimate goal is to build AGI of the natural world.

About the role

We are seeking an MLOps Engineer to join our team, help run large-scale experiments, and manage the infrastructure that lets foundation models and other large machine learning models run efficiently on GPUs.

The role will involve:

GPU Programming:
- Implement scalable pipelines, optimize models for performance and accuracy, and ensure they are production-ready.
- Write low-level code that makes full use of the capacity of high-end GPUs.
- Enhance models and pipelines for efficient inference, integrating efficient low-level code into a high-level MLOps framework.
Cloud Supercomputing:
- Collaborate with AI scientists and engineers to build and maintain scalable, efficient, and secure infrastructure.
- Demonstrate strong expertise in the distributed computation infrastructure of current-generation GPU clusters.
- Maintain and update training environments on clusters.
Host Management:
- Operate large GPU supercomputing clusters both on-premises and in the cloud for training and serving production models.
- Implement Infrastructure as Code (IaC) best practices, enhance deployment pipelines, and ensure robust, secure service delivery across production environments.

Requirements

Bachelor’s degree in computer science, engineering, a related field, or equivalent experience.
3+ years of relevant work experience.
Experience with scalable training and inference pipelines on a major cloud platform (AWS, GCP, Azure, etc.), with a strong preference for AWS.
Location: strong preference for in-person work in San Francisco; remote work is possible for exceptional candidates.

Nice to have

Proficiency in scripting languages such as Python, Bash, or PowerShell.
Demonstrated experience with deep learning and transformer models.
Understanding of generative AI, with knowledge of or interest in fine-tuning and applying foundation models.
Proficiency in frameworks like PyTorch or TensorFlow.
Experience with containerization and orchestration technologies such as Docker and Kubernetes.
Team player, willing to undertake various tasks to support the team.