Job details

Software Engineer, SystemML - AI Networking

In this role, you will be a member of the AI Networking Software team, part of the larger DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta Production goes through the SW stack the team owns. At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable, and high-performance distributed AI/GPU communication stack. Currently, one of the team’s focus areas is building customized features, SW benchmarks, performance tuners, and SW stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance (e.g. large-scale GenAI/LLM training), from the trainer down to the inter-GPU and network communication layer. We are seeking engineers to work on GenAI/LLM scaling reliability and performance.

Responsibilities

Tech-leading the collective communication library development on Meta's large-scale GPU training infra with a focus on GenAI/LLM scaling

Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Proven C/C++ and Python programming skills
Proven track record of leading successful projects
Effective leadership and communication skills
Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
PhD in Computer Science, Computer Engineering, or relevant technical field
Experience with NCCL and distributed GPU performance analysis on RoCE/InfiniBand
Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
Experience in AI framework and trainer development on accelerating large-scale distributed deep learning models
Experience in HPC and parallel computing
Knowledge of GPU architectures and CUDA programming
Knowledge of ML, deep learning and LLM

Average salary estimate

$135,000 / yearly (est.)
min: $110,000
max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Software Engineer, SystemML - AI Networking, Meta

Are you ready to step into the forefront of AI technology? As a Software Engineer for SystemML at Meta in beautiful Menlo Park, California, you’ll join a dynamic team dedicated to enhancing AI networking capabilities. You'll work closely with the AI Networking Software team within the larger Data Center networking organization. Your primary focus will be on developing and maintaining the software stack surrounding NCCL (NVIDIA Collective Communications Library), a vital component that allows for high-performance multi-GPU and multi-node data communication. Imagine contributing to something that powers nearly every distributed GPU-based ML workload in Meta Production! Your role will be instrumental in driving Meta-wide machine learning products, ensuring seamless performance and reliability across our expansive GPU training and inference fleet. You'll dive deep into performance tuning, crafting benchmarks, and creating custom features tailored to bolster the efficiency of NCCL and PyTorch. If you're passionate about scaling reliability and performance in the GenAI/LLM realm, we would love to meet you. This position not only demands strong C/C++ and Python skills but also a solid background in machine learning/deep learning domains. With your expertise, you'll help make our ambitious AI innovations a reality. At Meta, we believe in empowering our engineers to lead projects and work collaboratively to push the boundaries of technology. Join us to create impactful solutions and help shape the future of AI at Meta.

Frequently Asked Questions (FAQs) for Software Engineer, SystemML - AI Networking Role at Meta
What are the primary responsibilities of a Software Engineer for SystemML at Meta?

As a Software Engineer for SystemML at Meta, you'll primarily be responsible for tech-leading the development of the collective communication library tailored to our large-scale GPU training infrastructure. This includes driving improvements in GenAI/LLM scaling, enhancing our software stack's reliability and performance, and facilitating seamless communication on multi-GPU setups. Your role will also involve building customized features, conducting software benchmarks, and tuning performance for a range of distributed machine learning applications.

What qualifications are required for the Software Engineer position at Meta?

To qualify for the Software Engineer position at Meta, you should have a Bachelor's degree in Computer Science, Computer Engineering, or a similar technical field, along with solid experience in C/C++ and Python programming. A proven track record of leading successful projects is important, along with specialized knowledge in areas like distributed ML training, GPU architecture, high-performance computing, and machine learning frameworks such as PyTorch. Advanced qualifications such as a PhD in a relevant field would be a plus.

What programming skills are essential for the Software Engineer, SystemML role at Meta?

Essential programming skills for the Software Engineer, SystemML role at Meta primarily include proficiency in C/C++ and Python. These languages are critical for performance tuning and communication library development. Experience with machine learning frameworks such as PyTorch and knowledge of GPU architectures and CUDA programming will also greatly enhance your contributions to the team and our projects.

What experience is beneficial for a Software Engineer, SystemML position at Meta?

Beneficial experience for a Software Engineer, SystemML position at Meta includes hands-on work with distributed GPU performance analysis, familiarity with the NCCL library, and practice with deep learning frameworks like PyTorch, Caffe2, or TensorFlow. Additionally, understanding data and model parallel training approaches, as well as HPC and parallel computing, will give you a competitive edge in this role.

How does the Software Engineer, SystemML contribute to Meta's AI initiatives?

As a Software Engineer, SystemML, your contributions will be vital to optimizing Meta's AI initiatives by enhancing the performance and reliability of the underlying software stack. By focusing on areas such as GenAI/LLM scaling, you will directly influence the development of efficient machine learning products that leverage our robust GPU infrastructure, thus shaping the future of AI-driven innovations at Meta.

Common Interview Questions for Software Engineer, SystemML - AI Networking
Can you explain your experience with machine learning and deep learning frameworks?

When answering this question, highlight your practical experience with frameworks like PyTorch or TensorFlow. Discuss specific projects where you applied these frameworks, emphasizing your role in developing machine learning models or optimizing existing systems. Be sure to mention any performance tuning you accomplished and the impact it had on the final outcomes.

What is your understanding of NCCL and its role in distributed GPU training?

For this question, describe NCCL's function in facilitating multi-GPU and multi-node communication during distributed training. Explain how it supports efficient data transfer and synchronization across GPUs, thereby enhancing performance in large-scale machine learning tasks. If possible, share any personal experience or insights you've gained working with NCCL.
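
To make the idea of a collective concrete, here is a minimal sketch of a sum all-reduce, simulated in plain Python with lists standing in for GPU buffers. It follows the well-known ring pattern: a reduce-scatter phase followed by an all-gather phase. This is an illustration under stated assumptions only: the function name `ring_all_reduce` is hypothetical, nothing here is NCCL's API, and real libraries may pick different algorithms depending on topology and message size.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of ranks (illustrative, not NCCL).

    buffers: one equal-length list of numbers per rank.
    Returns a new list of per-rank buffers, each holding the element-wise sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    # Chunk c covers indices [bounds[c][0], bounds[c][1]).
    bounds = [(c * size // n, (c + 1) * size // n) for c in range(n)]
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: at each step, rank r sends one chunk to
    # rank (r + 1) % n, which adds it into its own copy of that chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            sends.append((c, data[r][lo:hi]))  # snapshot: sends are simultaneous
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            lo, _ = bounds[c]
            for i, value in enumerate(payload):
                data[dst][lo + i] += value

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: pass the reduced chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            sends.append((c, data[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload

    return data


# Three simulated ranks, each starting with a different buffer.
ranks = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_all_reduce(ranks)
```

In an interview answer, a sketch like this is mainly useful for explaining why every rank ends up with the same summed buffer (for example, summed gradients) and why each rank only ever talks to its ring neighbor.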

How have you approached performance optimizations in machine learning systems?

Detail specific strategies you've employed for performance optimization in machine learning systems, such as using profiling tools, analyzing bottlenecks, or applying techniques like data parallelism. Use examples from previous projects to illustrate your analytical capabilities and how your interventions improved overall system efficiency.
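
If it helps to ground the "profile first, then optimize" story, the following is a small sketch using Python's standard `cProfile` and `pstats` modules to locate a hot spot. The functions `slow_sum`, `fast_sum`, `workload`, and `profile_report` are hypothetical names invented for this example; a real ML system would typically also need GPU-aware profilers, which this sketch does not cover.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Sum of squares below n, computed with an explicit loop (the hot spot)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Same result as slow_sum, using the closed-form formula."""
    return (n - 1) * n * (2 * n - 1) // 6


def workload():
    """Hypothetical workload mixing a slow and a fast code path."""
    slow_sum(200_000)
    fast_sum(200_000)


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


report = profile_report(workload)
```

Printing `report` lists the profiled functions sorted by cumulative time, which is the kind of evidence worth citing when you describe how you identified a bottleneck before changing any code.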

Describe any leadership experiences you’ve had in software projects.

Here, outline your experience leading software projects, focusing on your ability to guide a team through the development cycle. Discuss how you motivated team members, delegated tasks, managed timelines, and ensured that project goals were met or exceeded. Emphasize your effective communication skills and the impact of your leadership on the project's success.

What strategies do you use for debugging and resolving complex software issues?

In your answer, describe a systematic approach to debugging that might include understanding the problem domain, reproducing bugs consistently, and using debugging tools effectively. Share personal experiences where your methods led to resolving significant software issues, demonstrating your problem-solving skills under pressure.

How familiar are you with GPU architectures and their implications for software development?

Discuss your understanding of GPU architectures and programming models (such as NVIDIA CUDA) and their impact on software development for machine learning. Mention how you've optimized algorithms or implemented strategies that leverage GPU capabilities, including parallel processing and memory management, to achieve better performance in ML systems.

Can you explain the differences between data parallelism and model parallelism?

This is an opportunity to demonstrate your understanding of distributed training methodologies. Explain that data parallelism splits the data across multiple GPUs while each GPU runs the same model, whereas model parallelism distributes distinct parts of the model across multiple GPUs. Use examples to illustrate situations where each would be beneficial.
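
The contrast can be sketched in a few lines of plain Python, with "workers" standing in for GPUs. This is a toy illustration, not a training framework: the helper names `grad_mse_linear` and `matvec`, the one-parameter model, and the sample numbers are all assumptions made up for the example. The model-parallel half shows only one flavor (splitting a weight matrix by rows, in the spirit of tensor parallelism).

```python
def grad_mse_linear(w, batch):
    """Gradient of mean squared error for the model y = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


# Data parallelism: every worker holds the same model (w) but a different
# shard of the batch. Averaging the local gradients (the job of an
# all-reduce) recovers the full-batch gradient when shards are equal-sized.
w = 1.5
batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
shards = [batch[0:2], batch[2:4]]
local_grads = [grad_mse_linear(w, shard) for shard in shards]
dp_grad = sum(local_grads) / len(local_grads)
full_grad = grad_mse_linear(w, batch)

# Model parallelism: every worker sees the same input but owns a different
# part of the model. Here each worker holds half the rows of one weight
# matrix and produces half of the output vector.
weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, -1.0]]
x = [2.0, 3.0]
worker_rows = [weights[0:2], weights[2:4]]
partial_outputs = [matvec(rows, x) for rows in worker_rows]
mp_output = partial_outputs[0] + partial_outputs[1]  # concatenate the pieces
full_output = matvec(weights, x)
```

A natural talking point from this sketch: data parallelism fits when the model fits on one device and the dataset is large, while model parallelism becomes necessary when the model itself is too large for a single device.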

What experience do you have with high-performance computing (HPC)?

When answering this, share your experience working in HPC environments. Discuss specific projects where you used them to train large models. Mention the tools or libraries you used, such as MPI or NCCL, to show your understanding of HPC's role in accelerating complex computations.

How do you prioritize tasks in a fast-paced development environment?

Explain your method for task prioritization, possibly citing specific frameworks like Agile or Kanban. Emphasize your ability to adapt to changing circumstances while ensuring that critical milestones are met. Detail experiences where you successfully juggled multiple priorities without sacrificing quality.

Why are you interested in the Software Engineer position for SystemML at Meta?

In your response, express genuine enthusiasm for Meta's focus on innovative AI solutions. Discuss how your skills and experiences align with the team's objectives, particularly your passion for enhancing distributed machine learning performance. Highlight your desire to be part of a company known for driving technological advancements in AI.

Meta's mission is to build the future of human connection and the technology that makes it possible.

CULTURE VALUES
Inclusive & Diverse
Rise from Within
Mission Driven
Diversity of Opinions
Work/Life Harmony
Take Risks
Collaboration over Competition
Fast-Paced
Growth & Learning
Transparent & Candid
Feedback Forward
Dare to be Different
BENEFITS & PERKS
Medical Insurance
Paid Time-Off
Maternity Leave
Mental Health Resources
Equity
Paternity Leave
Flex-Friendly
Snacks
Social Gatherings
Company Retreats
Fitness Stipend
Paid Holidays
Summer Fridays
Work Visa Sponsorship
Bias Training
Flexible Spending Account (FSA)
Health Savings Account (HSA)
Vision Insurance
Dental Insurance
Life insurance
EMPLOYMENT TYPE
Full-time, on-site
DATE POSTED
April 22, 2025
