
ML Engineer Large-scale AI Infrastructure

Headquartered in Silicon Valley, we are a newly established start-up in which a collective of visionary scientists, engineers, and entrepreneurs is dedicated to transforming the landscape of biology and medicine through the power of Generative AI. Our team comprises leading minds and innovators in AI and Biological Science, pushing the boundaries of what is possible. We are dreamers who envision a new paradigm for biology and medicine.


We are committed to decoding biology holistically and enabling the next generation of life-transforming solutions. As the first mover in pan-modal Large Biological Models (LBM), we are pioneering a new era of biomedicine, with our LBM training leading to ground-breaking advancements and a transformative approach to healthcare. Our exceptionally strong R&D team and leadership in LLM and generative AI position us at the forefront of this revolutionary field. With headquarters in Silicon Valley, California, and a branch office in Paris, we are poised to make a global impact. Join us as we embark on this journey to redefine the future of biology and medicine through the transformative power of Generative AI.


Job Description
  • GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
  • Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times.
  • Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
  • Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into GenBio AI’s model development and deployment frameworks. 
  • Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements. 
  • Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.
  • Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.


Job Requirements:
  • Master’s or Ph.D. degree in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
  • 2+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
  • Strong expertise in distributed deep learning and parallel training techniques.
  • Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
  • Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
  • Knowledge of performance profiling and optimization tools for HPC and deep learning.
  • Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
  • Strong background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).


Join us as we embark on this journey to redefine the future of biology and medicine.

We are an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Average salary estimate

$125,000 / year (est.)
Min: $100,000
Max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About ML Engineer Large-scale AI Infrastructure, GenBio AI

At GenBio AI, we're leading the charge in revolutionizing the fields of biology and medicine with the power of Generative AI. Located in the heart of Silicon Valley, our newly established start-up is home to a diverse and talented group of scientists, engineers, and entrepreneurs dedicated to breaking the traditional paradigms in biomedicine. As a Machine Learning Engineer specializing in large-scale AI Infrastructure, you will play a crucial role in helping us design, deploy, and maintain high-performance GPU clusters. Your expertise in distributed computing and deep learning frameworks will enable parallel training across multiple GPUs, paving the way for efficient data processing and faster results. You'll collaborate deeply with our dynamic R&D team to optimize performance, troubleshoot technical issues, and develop resource management strategies to accommodate our growing computational needs. Our culture is one of innovation and inclusivity, celebrating diverse ideas and backgrounds. We are incredibly excited about the future and are looking for passionate individuals who are ready to make a substantial impact in a revolutionary way. If you’re eager to take on challenges in GPU management and deep learning frameworks while working in a supportive and engaging environment, we invite you to join our journey and contribute to life-transforming solutions.

Frequently Asked Questions (FAQs) for ML Engineer Large-scale AI Infrastructure Role at GenBio AI
What are the main responsibilities of a Machine Learning Engineer at GenBio AI?

As a Machine Learning Engineer at GenBio AI, your primary responsibilities will include designing, deploying, and maintaining high-performance GPU clusters, implementing distributed computing techniques for deep learning, and collaborating with our data scientists to integrate these capabilities into our models. You'll also be focused on performance optimization, resource management, and providing support when issues arise, ensuring everything runs smoothly in this fast-paced environment.

What qualifications are required for the Machine Learning Engineer position at GenBio AI?

The position of Machine Learning Engineer at GenBio AI requires a Master’s or Ph.D. in computer science or a related field with a concentration in High-Performance Computing, Distributed Systems, or Deep Learning. You should have at least 2 years of practical experience managing GPU clusters and a strong expertise in distributed deep learning techniques, along with proficiency in deep learning frameworks like PyTorch and programming in Python.

What kind of experience is beneficial for a Machine Learning Engineer role at GenBio AI?

For the Machine Learning Engineer role at GenBio AI, relevant experience includes managing GPU clusters, optimizing deep learning models, and using performance profiling tools. Familiarity with cloud computing platforms like AWS or GCP, as well as resource management systems like Kubernetes, will give you a strong advantage in our rapidly evolving tech landscape.

How does GenBio AI support the professional development of its Machine Learning Engineers?

At GenBio AI, we believe in fostering the continuous growth of our team members. As a Machine Learning Engineer, you will have access to resources for professional development, including seminars, courses, and conferences related to AI and machine learning. We encourage collaborations across departments that stimulate innovation and creativity in tackling challenges within the field.

What is the work environment like for a Machine Learning Engineer at GenBio AI?

GenBio AI offers an engaging and inclusive work environment where creativity thrives. Our team comprises cutting-edge professionals who are passionate about pushing the frontiers of technology in biology and medicine. You'll find opportunities for collaboration, knowledge sharing, and engagement with diverse perspectives, all contributing to meaningful advancements in our transformative mission.

Common Interview Questions for ML Engineer Large-scale AI Infrastructure
Can you describe your experience with GPU cluster management as an ML Engineer?

In answering this question, highlight specific projects where you've designed, deployed, or optimized GPU clusters. Discuss challenges you faced and how you overcame them, as well as any tools or frameworks you utilized. Mention how you ensured reliability and scalability in demanding environments.

How do you approach performance optimization for deep learning models?

Begin by discussing methodologies like profiling to identify bottlenecks. Share experiences where you've optimized computational resources or algorithm paths for efficiency, and describe specific frameworks or libraries you’ve used in your efforts to enhance performance.

What distributed computing techniques have you implemented in your previous roles?

Provide a clear example of distributed computing techniques you've applied. Discuss how you achieved parallel processing and synchronization among multiple nodes, including the tools or protocols you used, such as SLURM or custom scripts.

Can you explain a challenging technical problem you solved?

Use the STAR technique (Situation, Task, Action, Result) to describe a specific challenge in your work with GPU clusters or deep learning. Focus on the actions you took and the impact of your solutions, emphasizing problem-solving skills and technical expertise.

How do you stay current with advancements in Machine Learning and AI?

Discuss the resources you use to stay updated, such as journals, online courses, conferences, or communities. Mention how you have applied new knowledge or trends to enhance your work, demonstrating a proactive approach to professional development.

What role do deep learning frameworks like PyTorch or TensorFlow play in your work?

Share how you have integrated these frameworks into your projects, mentioning specific features or functions that you've used. Discuss examples of projects where these tools enabled you to achieve significant results or improved efficiencies.

How do you ensure effective collaboration with data scientists and other engineers?

Highlight collaboration strategies you've employed, such as regular meetings, shared documentation, or project management tools. Discuss how communication and transparency improve project outcomes and team synergy.

What strategies do you use for troubleshooting GPU cluster issues?

Discuss systematic approaches to troubleshooting, including checking logs, monitoring performance metrics, or utilizing specific troubleshooting tools. Provide examples of past experiences where you resolved issues effectively under time constraints.

Can you talk about your experience with cloud computing and containerization?

Illustrate your familiarity with platforms like AWS or GCP, emphasizing how you've deployed ML workflows in the cloud. Discuss the significance of containerization with Docker or Kubernetes in your work, detailing how it has aided in scalability and resource management.

What motivates you to work in a transformative field like generative AI?

Share your passion for AI's potential impact on humanity and innovation in healthcare or biology. Discuss personal experiences or inspirations that led you to this path and how you wish to contribute to advancements within the field.

EMPLOYMENT TYPE
Full-time, on-site
DATE POSTED
December 12, 2024