Job details

Staff Software Engineer / Tech Lead (Model Training Infrastructure)

About Anyscale:

At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We’re commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, Spotify, Instacart, Cruise, and many more have Ray in their tech stacks to accelerate the progress of AI applications into the real world.

With Anyscale, we’re building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert.

We are proud to be backed by Andreessen Horowitz, NEA, and Addition, with more than $250 million raised to date.

About The Role:

Anyscale is looking for a staff software engineer to lead the Model Training Infrastructure team.

The Model Training Infrastructure team leads the development and optimization of Ray’s distributed training libraries, focusing on enabling large-scale ML workloads. The team owns and maintains widely adopted open-source libraries like Ray Train for distributed model training and Ray Tune for distributed hyperparameter tuning.

As the technical leader for this team, you will be responsible for:

Thinking deeply about delightful, programmatic interfaces that let machine learning engineers scale model training
Building and rethinking distributed training architectures that scale seamlessly from a laptop to the cloud
Implementing and innovating on distributed training algorithms, such as elastic training, to improve model training performance
Working with and leading a robust open-source community around the Ray project
Engaging directly with ML infrastructure teams around the world to iterate on and build the best training infrastructure
Advocating for and sharing your work broadly with the ML community through talks, tutorials, and blog posts

On a day-to-day basis, you will drive the technical direction of the team, mentor engineers, and deliver high-impact projects. You’ll shape the vision for what training infrastructure looks like for enterprises around the world while remaining hands-on with the code and product development.

We’d love to hear from you if you have:

Multiple years of experience building, scaling, and maintaining complex software systems in production
Proven experience leading or mentoring engineering teams in a technical capacity
Expertise in machine learning frameworks (e.g., PyTorch, TensorFlow, XGBoost)
Hands-on experience with distributed systems and designing fault-tolerant infrastructure
Excellent communication and collaboration skills

Bonus points if you have:

Experience with Ray
Experience with cloud technologies (e.g., AWS, GCP, Kubernetes)
Experience building and operating ML training platforms in production
Contributions to or maintenance of open-source libraries
Experience leading open-source or cross-functional teams

Compensation:

At Anyscale, we take a market-based approach to compensation. We are data-driven, transparent, and consistent. The target salary for this role is $237,000 to $284,614. As market data changes over time, the target salary for this role may be adjusted.
This role is also eligible to participate in Anyscale's Equity and Benefits offerings, including the following:
Stock Options
Healthcare plans, with premiums covered by Anyscale at 99%
401k Retirement Plan
Education & Wellbeing Stipend
Paid Parental Leave
Fertility Benefits
Flexible Time Off
Commute reimbursement
100% of in-office meals covered

Anyscale Inc. is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law.

Anyscale Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
