
Pod Software Engineer

Job Summary:

We are seeking highly motivated and skilled Pod Software Engineers to join our System Software team. This team plays a critical role in developing, qualifying, and optimizing high-performance networking solutions for large-scale inference workloads. As a Pod Software Engineer, you will focus on developing and qualifying software that drives communication amongst Sohu inference nodes in multi-rack inference clusters. You will collaborate closely with kernel, platform, and telemetry teams to push the boundaries of peer-to-peer RDMA efficiency.

Key Responsibilities:

  • High-Performance Peer-to-Peer Networking: Design, develop, and implement RDMA-based network peering that supports high-bandwidth, low-latency communication across PCIe nodes within and across racks. This includes work across the operating system, kernel drivers, embedded software, and system software.

  • Test Development: Develop tests that qualify host processors (x86), NICs, TORs, and device network interfaces for high performance.

  • Burn-in Integration: Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking and extreme-load stress testing.

  • Performance/Health Telemetry Design: Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads.

Representative Projects:

  • Analyze performance deviations, optimize network stack configurations, and propose kernel tuning parameters for low-latency, high-bandwidth inference workloads.

  • Design and execute automated qualification tests for RDMA NICs and interconnects across various server configurations.

  • Identify and root-cause firmware, driver, and hardware issues that impact RDMA performance and reliability.

  • Collaborate with ODMs and silicon vendors to validate new RDMA features and enhancements.

  • Implement and validate peer RDMA support for GPU-to-GPU and accelerator-to-accelerator communication.

  • Modify kernel drivers and user-space libraries to optimize direct memory access between inference pods.

  • Profile and benchmark inter-node RDMA latency and bandwidth to improve inference job scaling.

  • Optimize NIC and switch configurations to balance throughput, congestion control, and reliability.

Must-Have Skills and Experience:

  • Proficiency in C/C++.

  • Proficiency in at least one scripting language (e.g., Python, Bash, Go).

  • Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE.

  • Experience with zero-copy networking, RDMA verbs and memory registration.

  • Familiarity with queue pairs, completion queues, and transport types.

  • Strong understanding of operating systems (Linux preferred) and server hardware architectures.

  • Ability to analyze complex technical problems and provide effective solutions.

  • Excellent communication and collaboration skills.   

  • Ability to work independently and as part of a team.

  • Experience with version control systems (e.g., Git).   

  • Experience with reading and interpreting hardware logs.

Nice-to-Have Skills and Experience:

  • Experience with networking technologies like NVLink, InfiniBand, and ML pod interconnects.

  • Experience with widely deployed Top of Rack switches (Cisco, Juniper, Arista, etc.).

  • Knowledge of server virtualization.

  • Experience with tracing tools like perf, eBPF, ftrace, etc.

  • Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.).

  • Familiarity with hardware diagnostic tools and techniques.

  • Experience with containerization technologies (e.g., Docker, Kubernetes).

  • Experience with CI/CD pipelines.

  • Experience with Rust.

Ideal Background:

  • Candidates who have worked on GPU or TPU pods, specifically in the networking domain.

  • Candidates who understand up-time challenges of very big ML deployments.

  • Candidates who have actively debugged complex network topologies, specifically dealing with cases of node dropouts/failures, route-arounds, and pod resiliency at large.

  • Candidates must understand performance implications of Pod Networking SW.

    Benefits

    • Full medical, dental, and vision packages, with 100% of premium covered

    • Housing subsidy of $2,000/month for those living within walking distance of the office

    • Daily lunch and dinner in our office

    • Relocation support for those moving to West San Jose

    How we’re different

    Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

    We are a fully in-person team in West San Jose, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.

Average salary estimate

$140,000 / yearly (est.)
Min: $120,000
Max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Pod Software Engineer, Etched

At Etched, we're looking for enthusiastic and talented Pod Software Engineers to join our System Software team in West San Jose. If you have a passion for developing networking solutions for large-scale inference workloads, this is the opportunity for you. As a Pod Software Engineer, your primary focus will be to create and qualify software that drives communication among Sohu inference nodes across multi-rack clusters. You'll work closely with the kernel, platform, and telemetry teams to improve peer-to-peer RDMA efficiency. You'll design high-bandwidth, low-latency networking systems, work with RDMA-based networking, develop tests that qualify host processors, NICs, TORs, and device network interfaces, and design telemetry to maintain availability and performance under extreme loads. The role also involves analyzing performance deviations and collaborating with ODMs and silicon vendors to validate new RDMA features. If you're excited about optimizing communication between GPUs and accelerators and enjoy solving complex technical problems, the Pod Software Engineer position at Etched could be a good fit. We value people who communicate effectively, collaborate within teams, and can also work independently. We do not draw boundaries between engineering and research, and this is a chance to contribute to both.

Frequently Asked Questions (FAQs) for Pod Software Engineer Role at Etched
What are the key responsibilities of a Pod Software Engineer at Etched?

As a Pod Software Engineer at Etched, you will be responsible for designing and implementing RDMA-based network peering that supports high-bandwidth, low-latency communication across PCIe nodes. Your role will involve developing tests to qualify host processors, NICs, TORs, and device network interfaces, and supplying burn-in teams with tests that represent real-world workloads. You will also define the telemetry needed to maintain system health under extreme workloads, while collaborating with other teams to improve RDMA performance and root-cause firmware, driver, and hardware issues.

What qualifications do I need to become a Pod Software Engineer at Etched?

To qualify for the Pod Software Engineer position at Etched, you'll need proficiency in C/C++ and a scripting language like Python or Bash. Strong experience with device-to-device networking technologies like RDMA, along with experience in zero-copy networking, is essential. A strong understanding of operating systems, particularly Linux, and the ability to analyze complex technical problems are also required. Additionally, excellent communication skills and a collaborative mindset are key to thriving in this role.

What kind of projects will I work on as a Pod Software Engineer at Etched?

At Etched, Pod Software Engineers work on projects such as analyzing network performance deviations, optimizing network stack configurations for low-latency workloads, and designing automated qualification tests for RDMA NICs. You'll also modify kernel drivers and user-space libraries to optimize direct memory access between inference pods, and benchmark inter-node RDMA latency and bandwidth to improve inference job scaling. Collaborating with ODMs and silicon vendors to validate new RDMA features is also part of the work.

What benefits can I expect as a Pod Software Engineer at Etched?

Joining Etched as a Pod Software Engineer comes with full medical, dental, and vision packages, with 100% of the premium covered. There is also a housing subsidy of $2,000/month for those living within walking distance of the office, daily lunch and dinner in the office, and relocation support for those moving to West San Jose.

How does Etched support employees in their career development as Pod Software Engineers?

At Etched, there are no boundaries between engineering and research, and all technical staff are expected to contribute to both as needed. As a Pod Software Engineer, you'll work on projects that span both areas while collaborating closely with the kernel, platform, and telemetry teams.

Common Interview Questions for Pod Software Engineer
Can you describe your experience with RDMA and how you've used it in past projects?

When answering this question, highlight your practical experience with RDMA, detailing specific projects where you implemented peer-to-peer networking or optimized performance using RDMA technologies. Discuss the challenges you faced, how you overcame them, and the measurable outcomes of your contributions.

What strategies do you employ to analyze and debug networking issues?

For this question, demonstrate your analytical approach by discussing tools and techniques you use for diagnosing network problems. Include examples from past experiences where you successfully identified the root cause of an issue and implemented a solution.

How do you ensure the performance of device-to-device networking solutions?

Explain your approach to maintaining and enhancing the performance of networking solutions. Share how you utilize benchmarking tools and performance metrics to evaluate efficiency and any methodologies you apply to optimize system performance under different workloads.

What is your experience with Linux-based operating systems in the context of networking?

Share your familiarity with Linux operating systems, highlighting any specific tools, commands, or systems you have administered. Discuss how your knowledge of Linux has aided your networking projects, and provide examples of kernel modifications or network configurations you've accomplished.

Can you provide an example of a challenging technical problem you've resolved?

Detail a specific instance where you encountered a significant challenge in your work. Explain the problem, the steps you took to address it, and the tools you used. Lay out the lessons learned and how this experience has prepared you for the Pod Software Engineer role at Etched.

How do you approach performance testing for networking applications?

Discuss the frameworks, tools, and methodologies you utilize in performance testing. Describe your process for designing tests, analyzing results, and making necessary adjustments to ensure optimal system performance under peak loads.

What role do metrics play in the success of networking projects?

Explain how you establish key performance indicators (KPIs) for networking projects. Discuss specific metrics you've found to be valuable and how they guided your decision-making processes throughout past projects.

How do you handle collaboration across different teams during a project?

Emphasize your communication and teamwork skills. Share experiences where collaboration with kernel, platform, or telemetry teams was essential for project success, illustrating how you fostered a cooperative environment and aligned diverse viewpoints toward a common goal.

What scripting languages are you proficient in, and how have you applied them to your work?

List the scripting languages you're familiar with, such as Python or Bash, and provide examples of how you've used them for automating tasks, generating tests, or processing data within your previous roles.

What is your experience with version control systems, particularly Git?

Discuss your experience using Git for version control, including how you've utilized branching, merging, and managing repositories. Share examples of projects where Git played a vital role in collaboration and code management.

By burning the transformer architecture into our chips, we’re creating the world’s most powerful servers for transformer inference.

Employment type: Full-time, on-site
Date posted: March 21, 2025
