
Member of Technical Staff: ML Infrastructure, Platform Engineer

About Us

Essential AI’s mission is to deepen the partnership between humans and computers, unlocking collaborative capabilities that far exceed what could be achieved today. We believe that building delightful end-user experiences requires innovating across the stack - from the UX all the way down to models that achieve the best user value per FLOP.

We believe that a small, focused team of motivated individuals can create outsized breakthroughs. We are building a world-class, multi-disciplinary team of people who are excited to solve hard real-world AI problems. We are well-capitalized and supported by March Capital and Thrive Capital, with participation from AMD, Franklin Venture Partners, Google, KB Investment, and NVIDIA.

The Role

The ML Infra Platform Engineer will be responsible for architecting and building the compute infrastructure that powers the training and serving of our models. This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels.

Running and training models at scale often requires solving novel systems problems. In this role, you'll be responsible for identifying these problems and developing systems that optimize the throughput and robustness of our distributed infrastructure. Drawing on proven experience building large-scale platforms, you will build and advance the systems that allow research and engineering organizations to iteratively develop, test, and deploy new features reliably, with high velocity, and with a fast, frictionless development cycle.

What you’ll be working on

  • Design, build, and maintain scalable machine learning infrastructure to support our model training, inference, and applications

  • Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods that make training fast and reliable

  • Help oversee and drive the vision of how we build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research

  • Develop tools and frameworks to automate and streamline ML experimentation and management

  • Collaborate with other researchers and product engineers to bring magical product experiences through large language models

  • Work on lower levels of the stack to build high-performing, optimal training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements

  • Optimize performance and efficiency across different accelerators

What we are looking for

  • A strong understanding of the architectures of new AI accelerators (e.g., TPU, IPU, HPU) and their tradeoffs.

  • Knowledge of parallel computing concepts and distributed systems.

  • Prior experience in performance tuning of LLM training and/or inference workloads. Experience with MLPerf or internal production workloads is valued.

  • 6+ years of relevant industry experience leading the design of large-scale, production ML infrastructure systems.

  • Experience training and building large language models using frameworks such as Megatron and DeepSpeed, and with deployment frameworks such as vLLM, TGI, and TensorRT-LLM.

  • Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and compilers like XLA.

  • Experience with INT8/FP8 training and inference, quantization and/or distillation

  • Knowledge of container technologies like Docker and Kubernetes and cloud platforms like AWS, GCP, etc.

  • Intermediate fluency with networking fundamentals such as VPCs, subnets, routing tables, and firewalls.

We encourage you to apply for this position even if you don’t check all of the above requirements but want to spend time pushing on these techniques.

We are based in-person in SF and work fully onsite 5 days a week. We offer relocation assistance to new employees.

The base pay range target for the role seniority described in this job description is up to $225,000 in San Francisco, CA. Final offer amounts depend on various job-related factors, including where you place on our internal performance ladders, which is based on factors including past work experience, relevant education, and performance on our interviews and our benchmarks against market compensation data. In addition to cash pay, full-time regular positions are eligible for equity, 401(k), health benefits, and other benefits like daily onsite lunches and snacks; some of these benefits may be available for part-time or temporary positions.

Essential AI commits to providing a work environment free of discrimination and harassment, as well as equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. You may view all of Essential AI’s recruiting notices here, including our EEO policy, recruitment scam notice, and recruitment agency policy.

What You Should Know About Member of Technical Staff: ML Infrastructure, Platform Engineer, Essential AI

At Essential AI, we're on a mission to deepen the partnership between humans and computers, and we're looking for a talented Member of Technical Staff: ML Infrastructure, Platform Engineer to join our dynamic team in San Francisco. In this role, you’ll be responsible for architecting and building the compute infrastructure that powers the training and serving of our models. Your expertise will help us identify and solve unique system challenges as we optimize the throughput of distributed systems. You won't just be immersed in the code; you’ll design, build, and maintain scalable machine learning infrastructure that supports our cutting-edge applications. You'll collaborate closely with researchers and product engineers to drive innovations that result in magical experiences with large language models. With 6+ years of relevant industry experience, you'll leverage your understanding of AI accelerator architectures and parallel computing concepts to propel us forward. If you’re passionate about developing tools and systems that enhance the machine learning workflow, and eager to push the limits of what’s possible in AI, then Essential AI is the perfect place for you! Let's inspire the world together through revolutionary technology, where your contributions can lead to real-world breakthroughs.

Frequently Asked Questions (FAQs) for Member of Technical Staff: ML Infrastructure, Platform Engineer Role at Essential AI
What are the responsibilities of a Member of Technical Staff: ML Infrastructure, Platform Engineer at Essential AI?

As a Member of Technical Staff: ML Infrastructure, Platform Engineer at Essential AI, you will be tasked with designing, building, and maintaining scalable machine learning infrastructure to support our model training and serving. You will also oversee the vision for model development and collaborate with teams to ensure seamless integration of new features in our products.

What qualifications are required for the Member of Technical Staff: ML Infrastructure position at Essential AI?

To qualify for the Member of Technical Staff: ML Infrastructure role at Essential AI, candidates should possess a strong understanding of architectures for new AI accelerators such as TPUs or IPUs, along with experience in parallel computing and distributed systems. A minimum of 6 years designing large-scale ML infrastructure systems is preferred, as is familiarity with training large language models.

What technologies should a candidate be familiar with for the Member of Technical Staff: ML Infrastructure role?

Candidates applying for the Member of Technical Staff: ML Infrastructure role at Essential AI should have knowledge of container technologies like Docker and Kubernetes, cloud platforms like AWS or GCP, and be comfortable working with various kernel languages and compilers. Experience with performance tuning of ML workloads is also valuable.

What is the work environment like for the Member of Technical Staff: ML Infrastructure at Essential AI?

The work environment at Essential AI is collaborative, supportive, and driven by innovation. The role is fully onsite in San Francisco, providing opportunities to engage directly with your team members and contribute to exciting projects in real-time.

Does Essential AI offer relocation assistance for the Member of Technical Staff: ML Infrastructure role?

Yes, Essential AI offers relocation assistance to new employees joining us for the Member of Technical Staff: ML Infrastructure position, ensuring you have the support needed to transition smoothly to your new role.

Common Interview Questions for Member of Technical Staff: ML Infrastructure, Platform Engineer
Can you describe your experience with scalable machine learning infrastructure?

In your answer, highlight specific projects where you were responsible for designing or optimizing machine learning infrastructure. Discuss the challenges faced, technologies used, and the impact your work had in terms of performance and efficiency.

What parallel computing concepts are you familiar with, and how have you applied them?

Share your understanding of fundamental parallel computing concepts, and give examples of how you’ve implemented these concepts in real-world projects, particularly in relation to distributed systems or large-scale ML training.

How do you handle system performance tuning for machine learning workloads?

Discuss your approach to system performance tuning, including any specific techniques or tools you’ve used to measure and improve performance. Make sure to share examples of previous success in optimizing training or inference workloads.

What is your experience with container technologies and cloud platforms?

Provide examples of your hands-on experience with Docker and Kubernetes, as well as any cloud platforms like AWS or GCP. Discuss how you’ve utilized these technologies in building scalable ML systems.

How familiar are you with AI accelerator architectures?

Talk about your understanding of AI accelerators like TPU, HPU, or IPU, emphasizing any direct experience you have with leveraging these technologies for machine learning applications.

Can you explain a complex problem you solved in ML infrastructure?

Choose a specific instance where you identified a challenging issue within ML infrastructure. Explain the problem, the steps you took to solve it, and the eventual outcome or improvements resulting from your solution.

What strategies do you use for collaborating with product engineers and researchers?

Discuss your communication style and any collaboration tools or methodologies you prefer. Provide examples of past collaborative projects and how you contributed to successful outcomes.

How do you approach the testing and deployment of ML models?

Outline your process for testing and deploying ML models, including any specific frameworks or practices you follow to ensure reliability and efficiency in the deployment cycle.

What experience do you have with quantization and distillation in machine learning?

Describe any projects where you utilized quantization or distillation techniques. Discuss the goals, results, and how you ensured performance did not degrade while achieving these objectives.

Why do you want to work at Essential AI as a Member of Technical Staff: ML Infrastructure?

Share your motivation for wanting to join Essential AI. Talk about aspects of the company's mission, culture, or projects that resonate with you and how your skills align with the role’s responsibilities.

Employment type: Full-time, on-site
Date posted: December 24, 2024
