Principal Network Development Engineer, ML Networking

DESCRIPTION

The Performance Assured Networking (PAN) organization delivers high-performance networks for running ML workloads, using specialized network products and a custom control plane to meet the scale, performance, and availability needs of such workloads. The organization owns five interrelated product portfolios. First is the ML network and the network connectivity service it provides to ML servers. Second is AWS Intent Driven Networking (AIDN), our control plane, in which network routing and forwarding behaviors, called Intents, can be programmed across an entire network using highly available APIs. AIDN uses closed-loop actors to program network devices and ensure that the network is in sync with the specified Intent. Third is SIDR (Scalable Intent Driven Routing), our only AIDN actor in production: a fabric routing protocol and network controller system that leverages the prescriptive nature of our networks, allowing topology, prefixes, and policy to be controlled using Intents. SIDR uses a multi-phase commit (MPC) mechanism with built-in rollback to distribute and atomically enable administrative changes across a single fabric. It also responds rapidly to network events within the fabric, minimizing customer impact. Fourth is a set of safety systems that ensure that changes being rolled out to the fabric will not cause customer impact. Fifth is AWACS, a set of off-the-box services that enable WCMP-based traffic engineering in existing DC fabrics to increase the effective capacity of the Clos network and provide capacity safety for shared failure domains. All of the products and services described above are operational, and each is at a different stage of expansion and capability development.

Key job responsibilities
This Principal Engineer will take ownership of ML network performance dependent on the EC2 interface, a critical capability that directly impacts our customers' ability to train and deploy ML models efficiently. In the immediate term, they'll tackle one of our most pressing challenges: building a comprehensive understanding of network performance for ML workloads in production. This means designing and implementing systems that can intelligently measure and baseline performance without direct visibility into customer applications.
Over the next 12-18 months, they'll need to transform how we approach ML networking. This starts with developing new ways to identify and classify network traffic patterns from ML training, and with building systems that can automatically tune network configurations based on observed workload characteristics. They'll architect flexible abstractions that allow us to quickly adapt to new ML training patterns while maintaining peak performance for existing workloads.
The role requires someone who can move from theoretical understanding to practical implementation. They'll need to deliver a production-grade telemetry system that provides actionable insights about network performance, develop new approaches to baseline measurements, and demonstrate concrete performance improvements for key ML workloads. Success in this role means not just solving today's performance challenges, but building systems flexible enough to handle tomorrow's ML innovations.
This PE will be the technical authority for ML networking performance at AWS, working across teams to drive adoption of their approaches and establishing best practices that will shape how we build and operate our ML infrastructure for years to come.

BASIC QUALIFICATIONS

  • A Master's degree in Computer Science or Engineering, or equivalent experience, is mandatory.
  • Excellent IP networking fundamentals and extensive experience in the application of IP protocols.
  • Expertise with major internet routing protocols; specifically, BGP, OSPF, MPLS, RSVP, and IS-IS.
  • Expertise with major router platforms; specifically, a deep technical understanding of all internal hardware components and experience with router system design.
  • Expert-level network analysis fundamentals and robust troubleshooting skills; specifically, network performance analysis.
  • Ability to lead teams of engineers to deliver large scale solutions.
  • Excellent written and verbal communication skills and an ability to interact efficiently with peers and customers are required.

PREFERRED QUALIFICATIONS

  • Deep expertise in RDMA technologies (RoCEv2, EFA, InfiniBand)
  • Strong understanding of ML training patterns and NCCL internals
  • Experience with large-scale performance measurement systems
  • Knowledge of ML frameworks and their distributed training implementations
  • Expertise in network protocol design and optimization

Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice (https://www.amazon.jobs/en/privacy_page) to know more about how we collect, use and transfer the personal data of our candidates.

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

Average salary estimate

$175,000 / year (est.)
Min: $150,000
Max: $200,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

EMPLOYMENT TYPE
Full-time, onsite
DATE POSTED
May 26, 2025
