
ML Framework Engineer

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.


Job Description:

TensorWave is seeking an ML Framework Engineer to lead the integration, optimization, and maintenance of PyTorch (and select AI libraries) on AMD ROCm GPUs. This role is critical in ensuring our AI cloud platform remains at the cutting edge of performance, stability, and compatibility by tracking upstream framework changes, debugging compatibility issues, and automating builds, testing, and benchmarking. You will be responsible for maintaining a registry of validated AI libraries, debugging low-level performance issues, and working with external maintainers to upstream fixes. You will collaborate with DevOps, MLOps, and AI researchers to ensure a seamless deployment and development experience across TensorWave’s infrastructure. This role is ideal for an engineer with deep PyTorch internals knowledge, strong GPU debugging experience, and a passion for optimizing AI workloads at the framework level.


Responsibilities
  • Framework Compatibility & Versioning: Track PyTorch and other AI framework updates, maintain a versioned registry of validated builds, and proactively handle breaking changes.
  • Kernel Debugging & Profiling: Triage and debug ROCm-related issues affecting AI workloads, handling small fixes directly and escalating complex issues to MLOps and third-party maintainers.
  • Build & CI/CD Automation: Develop and maintain automated build pipelines for AI frameworks, integrating regression testing and benchmarking, while working with DevOps for large-scale automation.
  • Performance Optimization: Profile and analyze AI workload performance on AMD GPUs, identifying bottlenecks in memory access, kernel execution, and framework overhead.
  • Third-Party Collaboration: Work with PyTorch maintainers, ROCm engineers, and external AI library contributors to improve framework compatibility and push upstream fixes when needed.
  • Container & Environment Management: Maintain and update prebuilt AI container environments, ensuring seamless integration with TensorWave’s inference and training infrastructure.
  • Documentation & Knowledge Sharing: Serve as the SME (Subject Matter Expert) for library compatibility, maintaining internal documentation on framework versions, known issues, and best practices.
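To make the "versioned registry of validated builds" idea above concrete, here is a minimal, hypothetical sketch. The package names and version strings are illustrative only, not TensorWave's actual registry or tooling:

```python
# Hypothetical sketch of a validated-build registry: a mapping from framework
# name to the exact version strings that have passed validation on ROCm,
# plus a lookup used before a build is promoted. All entries are illustrative.

VALIDATED_BUILDS = {
    "torch": {"2.2.2+rocm6.0", "2.3.0+rocm6.1"},
    "triton": {"2.3.0"},
}

def is_validated(package: str, version: str) -> bool:
    """Return True if this exact package/version pair has passed validation."""
    return version in VALIDATED_BUILDS.get(package, set())

print(is_validated("torch", "2.3.0+rocm6.1"))  # validated build
print(is_validated("torch", "2.4.0"))          # not yet validated
```

In practice such a registry would live in version control or a package index rather than in code, but the lookup-before-deploy pattern is the same.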


Essential Skills & Qualifications
  • 3+ years of experience in ML framework development, optimization, or GPU debugging.
  • Strong expertise in PyTorch internals, model execution, and AI framework architecture.
  • Experience with ROCm or CUDA development, including kernel debugging and profiling.
  • Proficiency in Python and C++, with experience in optimizing AI workloads at the framework level.
  • Familiarity with low-level GPU performance profiling tools (rocprof, Nsight, perf, VTune, etc.).
  • Hands-on experience with CI/CD for AI frameworks, including automated testing and benchmarking.
  • Strong understanding of containerization (Docker, Kubernetes) and dependency management (pip, Conda, Bazel, CMake, etc.).
  • Excellent documentation skills, with a focus on library versioning, compatibility tracking, and regression analysis.


Preferred Qualifications
  • Experience contributing to PyTorch or other open-source ML frameworks.
  • Prior experience maintaining a private pip or Conda package registry for AI software.
  • Familiarity with distributed training, model parallelism, and mixed precision training.
  • Knowledge of LLM-specific optimizations, such as quantization and tensor parallel execution.
  • Exposure to high-performance computing (HPC) environments for AI workloads.


We’re looking for resilient, adaptable people to join our team—folks who enjoy collaborating and tackling tough challenges. We’re all about offering real opportunities for growth, letting you dive into complex problems and make a meaningful impact through creative solutions. If you're a driven contributor, we encourage you to explore opportunities to make an impact at TensorWave. Join us as we redefine the possibilities of intelligent computing.


What We Bring:

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:
  • Stock Options
  • 100% paid Medical, Dental, and Vision insurance
  • Life and Voluntary Supplemental Insurance
  • Short-Term Disability Insurance
  • Flexible Spending Account
  • 401(k)
  • Flexible PTO
  • Paid Holidays
  • Parental Leave
  • Mental Health Benefits through Spring Health

Average salary estimate: $110,000 / year (est.)
Minimum: $90,000
Maximum: $130,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About the ML Framework Engineer Role at TensorWave

At TensorWave in vibrant Las Vegas, NV, we are taking the world of AI computing by storm, and we want you to be part of that journey. We’re looking for a talented ML Framework Engineer to join our innovative team. If you are passionate about optimizing AI workloads and have a solid grasp of PyTorch internals, this is the opportunity for you!

In this role, you will spearhead the integration and maintenance of PyTorch on AMD ROCm GPUs, which is crucial for keeping our AI cloud platform at the forefront of performance and stability. Your day-to-day will involve debugging compatibility issues, managing a registry of validated AI libraries, and automating builds and tests to ensure our infrastructure runs smoothly. You’ll collaborate closely with DevOps, MLOps, and AI researchers, creating seamless development and deployment experiences. We believe in pushing boundaries, so you’ll also be triaging ROCm-related issues and optimizing performance by analyzing AI workloads.

If you love working with third-party contributors to enhance framework compatibility and you're excited to dive into complex challenges, we want to hear from you. At TensorWave, we’re committed to your growth and offer a supportive environment where you can truly make an impact in the world of intelligent computing.

Frequently Asked Questions (FAQs) for ML Framework Engineer Role at TensorWave
What are the main responsibilities of an ML Framework Engineer at TensorWave?

As an ML Framework Engineer at TensorWave, your main responsibilities will include tracking updates for PyTorch and other AI frameworks, maintaining a versioned registry of validated builds, debugging ROCm-related issues affecting AI workloads, and developing automated build pipelines. You will also analyze the performance of AI workloads, work closely with third-party contributors, and serve as a subject matter expert in library compatibility.
What qualifications are essential for an ML Framework Engineer position at TensorWave?

To qualify for the ML Framework Engineer position at TensorWave, candidates should have 3+ years of experience in ML framework development or GPU debugging, strong knowledge of PyTorch internals, and familiarity with ROCm or CUDA development. Proficiency in Python and C++, as well as experience with automated testing and CI/CD for AI frameworks, is also essential.
What skills will help me thrive as an ML Framework Engineer at TensorWave?

Key skills that will help you thrive as an ML Framework Engineer at TensorWave include a strong understanding of AI framework architecture, proficiency in low-level GPU performance profiling tools, hands-on experience with containerization technologies like Docker and Kubernetes, and excellent documentation practices. Additionally, collaboration skills are vital for working with external maintainers and AI library contributors.
How does TensorWave support the growth and development of its ML Framework Engineers?

TensorWave is committed to fostering growth and development among its ML Framework Engineers. We offer competitive salaries, stock options, comprehensive health benefits, and a flexible PTO policy, along with a collaborative work culture that encourages tackling challenging projects and advancing your skills in the rapidly evolving field of AI.
What is the work environment like for an ML Framework Engineer at TensorWave?

The work environment for an ML Framework Engineer at TensorWave is dynamic and collaborative. Located in Las Vegas, NV, we prioritize teamwork and innovative problem-solving. Our culture emphasizes resilience and adaptability, allowing you to dive into complex challenges and contribute meaningfully to AI advancements in a supportive atmosphere.
Common Interview Questions for ML Framework Engineer
Can you explain your experience with PyTorch internals?

In your response, highlight specific projects where you've manipulated PyTorch internals, discussing how you've implemented custom layers or modified existing functionalities. Mention any performance improvements achieved as a result of your work.

How do you approach debugging compatibility issues in AI frameworks?

Describe your systematic approach to debugging, focusing on how you identify and isolate issues. Share any tools and techniques you use to troubleshoot compatibility issues, particularly in relation to ROCm or CUDA.

What strategies do you use for performance optimization on AMD GPUs?

Discuss your experience profiling AI workloads on AMD GPUs, mentioning specific profiling tools you've employed. Explain how you've identified bottlenecks and optimized memory access or kernel execution to enhance overall performance.

Can you describe your experience with CI/CD processes for AI frameworks?

Elaborate on your previous work with CI/CD pipelines, emphasizing how you developed automated testing and benchmarking procedures for ML frameworks. Include any challenges faced and how you overcame them.

What is your experience with collaborating with third-party maintainers?

Provide examples of your collaboration with external contributors or maintainers, detailing how you've approached upstream fixes and contributed to open-source projects. Highlight the importance of communication and teamwork in these scenarios.

How do you stay updated with the latest trends in AI framework development?

Share your approach to professional development, including resources such as conferences, online courses, or communities. Mention any particular areas of innovation in AI framework development that excite you.

What tools do you use for GPU performance profiling?

Be specific about the tools you've used, such as rocprof, Nsight, or VTune. Discuss how you've leveraged these tools to gain insights into GPU performance and optimize AI workloads effectively.

How do you document your work on AI frameworks?

Explain your documentation techniques, focusing on how you maintain records of library compatibility, versioning, and known issues. Stress the importance of clear documentation for team collaboration and future reference.

What role does containerization play in your development process?

Talk about your experience with containerization, specifically Docker and Kubernetes. Explain how they facilitate consistent development environments and streamline deployment processes for AI frameworks.

How do you handle challenging performance issues in ML workloads?

Outline a specific instance where you encountered a significant performance issue, detailing your investigation process. Discuss the steps you took to address it, including collaboration and testing methodologies.


Supercharge your large-scale PyTorch LLM workloads with our cloud powered by AMD MI300X

Employment Type: Full-time, on-site
Date Posted: March 22, 2025
