Software Engineer - LLM Training

About Us

We believe AI will fundamentally transform how people live and work. CentML's mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.


Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.


About the Position

We are seeking highly skilled and motivated software engineers to join our team and empower AI practitioners to develop AI models on the CentML Platform productively and affordably. If you have launched multi-node distributed training jobs before and experienced firsthand how painful and cumbersome it is to get them functional, let alone high-performing, and you want to be part of the team that builds solutions to this challenge so that other AI practitioners don’t feel the same pain you did, please come and join us!



What you’ll do
  • Design and implement highly efficient distributed training systems for large-scale deep learning models.
  • Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.
  • Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.
  • Productionize the training systems onto CentML Platform.
  • Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.
  • Contribute to the design of APIs, abstractions and UX that make it easier to scale models while maintaining usability and flexibility.
  • Profile, debug, and tune performance at the system, model, and hardware levels.
  • Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals.
  • Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.


What you’ll need to be successful
  • Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, or a related field, or equivalent working experience.
  • 3+ years of experience in software development, preferably with Python and C++.
  • Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles.
  • Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, or DeepSpeed.
  • Experience optimizing compute, memory, and communication performance in large model training workflows.
  • Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.
  • Solid grasp of deep learning fundamentals, especially as they relate to transformer-based architectures and training dynamics.
  • Experience working with cloud platforms (AWS, GCP, or Azure) and containerization tools (Docker, Kubernetes).
  • Ability to work closely with both research and engineering teams, translating evolving needs into robust infrastructure.
  • Excellent problem-solving skills, with the ability to debug complex systems.
  • A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI.


Bonus points if you have
  • Experience building tools or platforms for ML model training or fine-tuning.
  • Experience building backends (e.g., using FastAPI) and frontends (e.g., using React).
  • Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang).
  • Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.
  • Familiarity with MLOps concepts, including model versioning and serving.


Benefits & Perks

- An open and inclusive work environment

- Employee stock options

- Best-in-class medical and dental benefits

- Parental Leave top-up

- Professional development budget

- Flexible vacation time to promote a healthy work-life blend


We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected ground of discrimination under applicable human rights legislation.


CentML strives to respect the dignity and independence of people with disabilities and is committed to giving them the same opportunity to succeed as all other employees.


Inclusiveness is core to our culture at CentML, and we strive to ensure you get the most from your interview experience. CentML makes reasonable accommodations for applicants with disabilities. If a reasonable accommodation is needed to participate in the job application or interview process, please reach out to the Talent team.

Average salary estimate

$110,000 / year (est.)
Min: $90,000
Max: $130,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Software Engineer - LLM Training, CentML

At CentML, we’re on a mission to transform the way AI is developed and deployed by significantly lowering costs and making it accessible to everyone. As a Software Engineer focusing on LLM Training, you’ll be part of a passionate team of experts known for their work at major companies like Amazon, Google, and Nvidia. Your role will be pivotal in enhancing the CentML Platform, where you’ll design and implement distributed training systems for large-scale deep learning models. If you’ve ever battled the struggles of launching multi-node training jobs and want to help others avoid the same pitfalls, this is the perfect opportunity for you. You’ll dive into optimizing parallelism across numerous GPUs, develop low-level components to reduce bottlenecks, and collaborate closely with researchers to bring innovative model architectures to life. Your coding expertise will shine as you contribute to APIs that simplify scaling, profile and debug performance, and ensure our cutting-edge systems stay aligned with business objectives. This role perfectly blends technical challenges with the chance to keep up with the latest advancements in model training. If you're ready to create significant impact across the AI landscape, we can’t wait for you to join us at CentML!

Frequently Asked Questions (FAQs) for Software Engineer - LLM Training Role at CentML
What are the responsibilities of a Software Engineer - LLM Training at CentML?

As a Software Engineer - LLM Training at CentML, you will design and implement distributed training systems for large-scale deep learning models while optimizing performance across thousands of GPUs. You’ll also collaborate with researchers to productionize advanced architectures, conduct performance profiling, and participate in technical discussions to align product development with business goals.

What qualifications do I need to apply for the Software Engineer - LLM Training position at CentML?

To apply for the Software Engineer - LLM Training position at CentML, you should have a Bachelor’s, Master’s, or PhD in Computer Science, Software Engineering, or a related field, along with over three years of software development experience, particularly with Python and C++. A solid understanding of machine learning, distributed systems, and performance optimization techniques is also essential.

What programming languages and tools should I be familiar with for the Software Engineer - LLM Training role at CentML?

Candidates for the Software Engineer - LLM Training role at CentML should be proficient in Python and C++. Familiarity with machine learning frameworks such as PyTorch, DeepSpeed, and Megatron Core, as well as GPU programming, CUDA, and Docker or Kubernetes for containerization, is critical to success in this position.

How does CentML address inclusivity for its Software Engineer - LLM Training team?

CentML prioritizes inclusivity and provides equal opportunities for all employees, including those with disabilities. The company actively ensures that all applicants have the same chance to succeed, making reasonable accommodations during the application and interview process to support candidates needing assistance.

What kind of company culture can I expect at CentML as a Software Engineer - LLM Training?

At CentML, you'll experience an open and inclusive work environment that values diversity and collaboration. The culture is built on innovation and a shared passion for advancing AI. With benefits such as flexible vacation time and professional development budgets, the company fosters a healthy work-life balance and encourages continuous learning.

Common Interview Questions for Software Engineer - LLM Training
Can you describe your experience with large-scale training of deep learning models?

When answering this question, highlight specific projects where you trained large models, including the frameworks used, like PyTorch or TensorFlow. Discuss challenges faced and how you optimized training processes, focusing on performance metrics that showcase your impact.

What strategies do you employ to optimize distributed training systems?

Share your knowledge of parallel computing, data partitioning, and optimization techniques. Discuss how you’ve implemented strategies to minimize communication overhead and improve throughput in previous roles.

How do you stay up to date with advancements in machine learning and AI?

Demonstrate your interest in continuous learning by mentioning resources like academic journals, machine learning conferences, and influential online courses. Highlight any personal projects or contributions to open-source that keep you engaged in the community.

What is your familiarity with GPU programming, and why is it crucial in deep learning?

Explain your experience with GPU programming using CUDA, detailing projects where you utilized GPUs to accelerate training. Emphasize the importance of GPUs in handling computations required for deep learning due to their parallel processing capabilities.

Have you worked with cloud platforms in deploying models, and which ones do you prefer?

Discuss the cloud platforms you’ve used, such as AWS, GCP, or Azure, and the services relevant to deploying machine learning models. Mention specific features that enhance scaling and efficiency in your deployments.

Can you explain the importance of performance profiling in model training?

Discuss how performance profiling helps identify bottlenecks in training and how you’ve used specific profiling tools to analyze memory and compute efficiency. Provide examples of changes made based on profiling results to improve performance.

What techniques do you use to debug complex systems?

Outline your structured approach to debugging, including systematic logging, performance monitoring, and utilizing tools for tracking down issues in distributed systems that may arise during training.

Describe a time when you took a project from concept to production.

Share a specific example of a project where you were involved in all phases of development. Highlight your role, decisions made, and how you overcame obstacles to successfully launch the project.

How do you prioritize tasks in a fast-paced development environment?

Provide insight into your organizational skills and prioritization strategy. Discuss tools you use for project management and how you balance urgency and importance in your workload.

What excites you most about working with large language models?

Express your passion for large language models and how they are reshaping AI's capabilities. Discuss specific applications or innovations you find particularly exciting, and how you envision contributing to this evolving field.

EMPLOYMENT TYPE
Full-time, remote
DATE POSTED
April 18, 2025
