Job Title: Site Reliability Engineering - Performance Engineer
Location: Bay Area preferred/Hybrid
Department: DevOps
At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on performance engineering. This role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes Cloud Infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.
Key Responsibilities
Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches like flame graphs, heatmaps, and latency histograms.
Perform detailed kernel and application tracing using tools based on technologies like eBPF, perf, and ftrace to gain insights into system behavior.
Design and implement performance dashboards to visualize key performance metrics in real time.
Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.
Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.
Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.
Improve throughput and reduce latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.
Apply profiling tools to analyze GPU utilization and kernel execution times and implement techniques to boost GPU efficiency.
Optimize distributed training pipelines using industry-standard frameworks.
Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.
Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.
Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.
Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.
Work with developers to refactor applications for performance and scalability using profiling tools.
Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.
Qualifications
Required:
Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.
Strong experience with AWS cloud services and their performance optimization techniques.
Proficiency in performance analysis tools, load testing tools, and system tracing frameworks.
Hands-on experience with database tuning, query analysis, and indexing strategies.
Expertise in GPU workload optimization and cloud-based GPU instances.
Familiarity with message queuing systems including performance tuning.
Programming experience with a focus on profiling and tuning.
Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.
Preferred:
Knowledge of distributed AI/ML training frameworks.
Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.
Expertise in optimizing AI inference pipelines.
Familiarity with Brendan Gregg’s methodologies for systems analysis, such as the USE method (Utilization, Saturation, Errors) and workload characterization.
Benefits
Hybrid work environment
Competitive salary
Health, dental, and vision insurance
401(k) plan
Opportunities for professional development and growth
Generous vacation policy
Join WitnessAI as a Site Reliability Engineer - Performance Engineering and immerse yourself in a collaborative, innovative environment where AI meets security in the Bay Area! As an SRE focused on performance, you'll dive deep into the intricacies of cloud-based infrastructure, turning data-driven insights into reliable and efficient systems. Your mission will involve tackling performance bottlenecks through advanced methodologies like flame graph analysis and kernel tracing. If you're passionate about optimizing complex tech stacks, including Kubernetes, databases, and GPUs, this role is tailor-made for you! You'll be responsible for designing performance dashboards, recommending improvements, and mentoring teams on best practices. With a focus on improving throughput and reducing latency across various systems and workloads, your contributions will directly influence the robustness of our AI technologies. Bring your expertise in performance analysis, cloud services tuning, and scripting, and see how your work at WitnessAI can enhance the efficiency of our applications. Let’s work together to shape the future of AI reliability and performance, while enjoying a competitive salary, hybrid work flexibility, health benefits, and ample professional growth opportunities. Get on board to transform the AI landscape and make a real impact!