
SRE - Performance Engineering

Job Title: Site Reliability Engineering - Performance Engineer

Location: Bay Area preferred/Hybrid

Department: DevOps

At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on performance engineering. This role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.

Key Responsibilities

  • Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches like flame graphs, heatmaps, and latency histograms.

  • Perform detailed kernel and application tracing using tools based on technologies like eBPF, perf, and ftrace to gain insights into system behavior.

  • Design and implement performance dashboards to visualize key performance metrics in real time.

  • Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.

  • Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.

  • Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.

  • Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.

  • Apply profiling tools to analyze GPU utilization and kernel execution times and implement techniques to boost GPU efficiency.

  • Optimize distributed training pipelines using industry-standard frameworks.

  • Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.

  • Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.

  • Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.

  • Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.

  • Work with developers to refactor applications for performance and scalability, using profiling tools.

  • Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.

Qualifications Required:

  • Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.

  • Strong experience with AWS cloud services and their performance optimization techniques.

  • Proficiency in performance analysis tools, load testing tools, and system tracing frameworks.

  • Hands-on experience with database tuning, query analysis, and indexing strategies.

  • Expertise in GPU workload optimization and cloud-based GPU instances.

  • Familiarity with message queuing systems including performance tuning.

  • Programming experience with a focus on profiling and tuning.

  • Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.

Preferred:

  • Knowledge of distributed AI/ML training frameworks.

  • Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.

  • Expertise in optimizing AI inference pipelines.

  • Familiarity with Brendan Gregg’s methodologies for systems analysis, such as the USE Method (Utilization, Saturation, Errors) and workload characterization.

Benefits:

  • Hybrid work environment

  • Competitive salary

  • Health, dental, and vision insurance

  • 401(k) plan

  • Opportunities for professional development and growth

  • Generous vacation policy

Average salary estimate

$115,000 / year (est.)
Range: $100,000 to $130,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About SRE - Performance Engineering, WitnessAI

Join WitnessAI as a Site Reliability Engineer - Performance Engineering and immerse yourself in a collaborative, innovative environment where AI meets security in the Bay Area! As an SRE focused on performance, you'll dive deep into the intricacies of cloud-based infrastructure, turning data-driven insights into reliable and efficient systems. Your mission will involve tackling performance bottlenecks through advanced methodologies like flame graph analysis and kernel tracing. If you're passionate about optimizing complex tech stacks, including Kubernetes, databases, and GPUs, this role is tailor-made for you! You'll be responsible for designing performance dashboards, recommending improvements, and mentoring teams on best practices. With a focus on improving throughput and reducing latency across various systems and workloads, your contributions will directly influence the robustness of our AI technologies. Bring your expertise in performance analysis, cloud services tuning, and scripting to witness how your work at WitnessAI can enhance the efficiency of our applications. Let’s work together to shape the future of AI reliability and performance, while enjoying a competitive salary, hybrid work flexibility, health benefits, and ample professional growth opportunities. Get on board to transform the AI landscape and make a real impact!

Frequently Asked Questions (FAQs) for SRE - Performance Engineering Role at WitnessAI
What does a Site Reliability Engineer - Performance Engineering do at WitnessAI?

At WitnessAI, a Site Reliability Engineer - Performance Engineering focuses on analyzing and optimizing the performance of our cloud-based infrastructure. This includes conducting deep systems-level performance analysis, tuning Linux systems, and implementing real-time performance dashboards to ensure reliability and efficiency.

What skills are required for the Site Reliability Engineer - Performance Engineering position at WitnessAI?

Candidates for the Site Reliability Engineer - Performance Engineering role at WitnessAI should have deep expertise in Linux systems internals, strong experience with AWS cloud services, proficiency in performance analysis tools, and scripting skills. Familiarity with GPU workload optimization and message queuing systems would also be beneficial.

What technologies will I work with as a Site Reliability Engineer - Performance Engineering at WitnessAI?

As a Site Reliability Engineer - Performance Engineering at WitnessAI, you will engage with a wide range of technologies, including Linux, Kubernetes, OCI frameworks, performance analysis tools, databases, message queuing systems, and GPUs in a cloud environment, specifically AWS.

What are the key responsibilities of a Site Reliability Engineer - Performance Engineering at WitnessAI?

Key responsibilities include conducting root cause analysis for performance bottlenecks, tuning Linux systems, analyzing GPU usage, optimizing distributed training pipelines, and designing observability systems. You will also mentor teams on performance best practices, integrating your insights into applications.

Is the Site Reliability Engineer - Performance Engineering role at WitnessAI hybrid or remote?

The Site Reliability Engineer - Performance Engineering position at WitnessAI offers a hybrid work environment, allowing candidates to enjoy flexibility while engaging with our innovative teams and projects in the Bay Area.

What are the career growth opportunities for a Site Reliability Engineer - Performance Engineering at WitnessAI?

Working as a Site Reliability Engineer - Performance Engineering at WitnessAI allows for ample opportunities for professional development, including mentorship, training, and the chance to engage in cutting-edge projects that enhance your skills and career trajectory.

What benefits does WitnessAI offer for Site Reliability Engineers - Performance Engineering?

WitnessAI offers competitive salaries, health, dental, and vision insurance, a 401(k) plan, generous vacation policies, and opportunities for professional growth, making it an attractive choice for Site Reliability Engineers - Performance Engineering.

Common Interview Questions for SRE - Performance Engineering
How do you approach performance tuning in Linux systems?

When discussing performance tuning in Linux systems, it's important to highlight your understanding of kernel and I/O optimization. Mention specific tools like eBPF or perf, and give examples of how you have measured and improved performance in previous roles.

Can you explain your experience with AWS cloud services?

When asked about AWS cloud services, detail your practical experience optimizing cloud instances, EBS volumes, and network configurations. Provide specific examples of how these optimizations led to improved performance and cost efficiency.

What methods do you utilize for performance analysis?

Discuss your familiarity with performance analysis methods such as flame graphs, heatmaps, and latency histograms. Give examples of how you've successfully applied these methods to identify and resolve performance bottlenecks in past projects.

How do you optimize GPU workloads for AI applications?

For questions about GPU workload optimization, describe your experience with GPU profiling tools and techniques. Mention specific successes where you improved efficiency and reduced training times in AI applications.

What is your process for conducting root cause analysis for performance issues?

Explain your systematic approach to root cause analysis, discussing the tools you use and the steps you follow. Highlight your analytical skills and use of data-driven methodologies to quickly identify and resolve system issues.

How do you ensure a project adheres to performance best practices?

Mention your knowledge of established best practices in performance engineering, such as monitoring and alerting frameworks. Emphasize how you communicate these practices to teams and mentor them in their implementation to ensure project success.

What experience do you have with database tuning and query optimization?

Share any relevant experiences with database performance tuning, including query analysis and indexing strategies. Be sure to illustrate your knowledge with examples of how your optimizations led to tangible performance improvements.

Can you discuss your experience with message queuing systems?

Detail your understanding of message queuing systems like Kafka or ActiveMQ. Provide examples of how you've optimized message flows, tuned configurations, and improved overall throughput and latency in past projects.

How do you prioritize tasks while working on performance optimization?

Talk about your prioritization process when addressing performance issues, such as focusing on the impact of each task and collaborating with stakeholders to determine critical areas that need immediate attention.

How do you keep up with the latest trends in performance engineering?

Describe your methods for staying updated in the field, such as following industry blogs, participating in webinars, and being active in online communities. Highlight any recent trends you've integrated into your work as a Site Reliability Engineer.

Employment Type: Full-time, hybrid

Date Posted: November 24, 2024
