Job details

ML Engineer Large-scale AI Infrastructure

Headquartered in Silicon Valley, we are a newly established start-up where a collective of visionary scientists, engineers, and entrepreneurs is dedicated to transforming the landscape of biology and medicine through the power of Generative AI. Our team comprises leading minds and innovators in AI and biological science, pushing the boundaries of what is possible. We are dreamers imagining a new paradigm for biology and medicine.

We are committed to decoding biology holistically and enabling the next generation of life-transforming solutions. As the first mover in pan-modal Large Biological Models (LBMs), we are pioneering a new era of biomedicine, with our LBM training leading to ground-breaking advancements and a transformative approach to healthcare. Our exceptionally strong R&D team and leadership in LLMs and generative AI position us at the forefront of this revolutionary field. With headquarters in Silicon Valley, California, and a branch office in Paris, we are poised to make a global impact. Join us as we embark on this journey to redefine the future of biology and medicine through the transformative power of Generative AI.

Job Description

GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times.
Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into GenBio AI’s model development and deployment frameworks.
Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements.
Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.
Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.

Job Requirements

Master’s or Ph.D. degree in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
2+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
Strong expertise in distributed deep learning and parallel training techniques.
Proficiency in popular deep learning frameworks like PyTorch, Megatron-LM, DeepSpeed, etc.
Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
Knowledge of performance profiling and optimization tools for HPC and deep learning.
Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
Strong background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).

Join us as we embark on this journey to redefine the future of biology and medicine.

We are an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Average salary estimate

$125,000 / year (est.)

Min: $100,000
Max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About ML Engineer Large-scale AI Infrastructure, GenBio AI

At GenBio AI, we're leading the charge in revolutionizing the fields of biology and medicine with the power of Generative AI. Located in the heart of Silicon Valley, our newly established start-up is home to a diverse and talented group of scientists, engineers, and entrepreneurs dedicated to breaking the traditional paradigms in biomedicine. As a Machine Learning Engineer specializing in large-scale AI Infrastructure, you will play a crucial role in helping us design, deploy, and maintain high-performance GPU clusters. Your expertise in distributed computing and deep learning frameworks will enable parallel training across multiple GPUs, paving the way for efficient data processing and faster results. You'll collaborate deeply with our dynamic R&D team to optimize performance, troubleshoot technical issues, and develop resource management strategies to accommodate our growing computational needs. Our culture is one of innovation and inclusivity, celebrating diverse ideas and backgrounds. We are incredibly excited about the future and are looking for passionate individuals who are ready to make a substantial impact in a revolutionary way. If you’re eager to take on challenges in GPU management and deep learning frameworks while working in a supportive and engaging environment, we invite you to join our journey and contribute to life-transforming solutions.

Frequently Asked Questions (FAQs) for ML Engineer Large-scale AI Infrastructure Role at GenBio AI

What are the main responsibilities of a Machine Learning Engineer at GenBio AI?

As a Machine Learning Engineer at GenBio AI, your primary responsibilities will include designing, deploying, and maintaining high-performance GPU clusters, implementing distributed computing techniques for deep learning, and collaborating with our data scientists to integrate these capabilities into our models. You'll also be focused on performance optimization, resource management, and providing support when issues arise, ensuring everything runs smoothly in this fast-paced environment.