Job details

Lead Systems Engineer, High-Performance Computing

The IaaS Systems and Storage & Engineering (ISSE) team is part of the Operations & Infrastructure technology organization. Distributed Compute Engineering (DCE) is part of ISSE, and High-Performance Compute Platform Engineering is part of DCE. Our vision, mission, and purpose are summarized as follows:

Vision: To become a leading technical engineering professional, pioneering in the design and automation of server infrastructure. We envision creating highly secure and efficient operations environments that drive business success and technological advancement.

Mission: Our mission is to deliver high-quality server infrastructure design and automated implementation. We are committed to operating in complex, highly secure, and highly available environments, while maintaining rigorous operations, security, and procedural models.

Purpose: The purpose of this role is to utilize strong hands-on technical engineering skills to design and automate the implementation of server infrastructure based on business requirements. This role will interact with technology domain experts to maintain high security and availability in complex operational environments, thereby driving business efficiency and security.

Essential Functions:

GPU as a Service and High-Performance Compute Platform Support: Expertise in deploying, managing, and optimizing GPU as a Service (GaaS) and high-performance compute platforms to support advanced computational workloads.
Extensive Datacenter Experience: Proficient in managing complex, geographically distributed IT infrastructures to ensure high availability and performance.
Advanced Technical Knowledge: Profound understanding of high-performance, highly available, and secure computing systems utilizing x86 technologies and protocols (NVMe, GPU, PCIe).
Enterprise Server and Component Expertise: In-depth knowledge of server components (storage/network controllers, HBA, SSDs) and their functionalities, essential for maintaining high-performance compute environments.
Processor and GPU Systems Proficiency: Strong grasp of Intel/AMD architectures, GPU systems, memory hierarchy, and hardware-level security to enhance system performance and reliability.
Out-of-Band, UEFI, and BIOS Expertise: Comprehensive understanding of out-of-band management, UEFI, BIOS settings, and their impact on system performance and security in high-performance computing environments.
Hardware Lifecycle Management: Experienced in hardware lifecycle management, including firmware and OS driver certifications, to ensure the longevity and reliability of compute resources.
Infrastructure Management and Automation: Proficient in installing, configuring, supporting, and maintaining compute infrastructure management tools, with skills in Ansible for automation to streamline deployment and operational tasks.
Performance Benchmarking and Tech Evaluation: Capable of running performance benchmarks and evaluating new technologies for various platforms (Linux, Windows, containerized, and virtualized) to ensure optimal performance.
Scripting Proficiency: Advanced skills in scripting languages such as PowerShell and Python to automate and optimize infrastructure tasks.
Team and Independent Work: Highly motivated, excellent team player, capable of working independently, with strong analytical and troubleshooting abilities to resolve complex issues and mentor junior staff.

This is a hybrid position. Hybrid employees can alternate between working remotely and working in the office. Employees in hybrid roles are expected to work from the office 2-3 set days a week (determined by leadership/site), with a general guidepost of being in the office 50% or more of the time based on business needs.

Average salary estimate

$135,000 / YEARLY (est.)

min: $120,000

max: $150,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Lead Systems Engineer, High-Performance Computing, Visa

As a Lead Systems Engineer for High-Performance Computing at our Ashburn office, you'll play a pivotal role in our IaaS Systems and Storage & Engineering (ISSE) team, part of the Operations & Infrastructure technology organization. Here, you'll be at the forefront of Distributed Compute Engineering, driving innovation in our high-performance computing platform. Your mission is to design and automate server infrastructure that not only meets complex business requirements but also enhances security and availability in a fast-paced operational environment. You'll leverage your extensive experience with GPU as a Service, particularly in deploying, managing, and optimizing our computing platforms to support advanced computational workloads. We value a strategic mindset paired with strong technical expertise in datacenter management. Your familiarity with x86 technologies and with NVMe, GPU, and PCIe protocols will be vital in ensuring our systems operate efficiently. Additionally, your skills in infrastructure management and automation, especially with tools like Ansible, will streamline our deployment processes. If you're excited about performance benchmarking across multiple platforms and enjoy collaborating with cross-functional teams while also thriving in independent tasks, this hybrid position could be a perfect fit for you. Join us in our mission to become industry leaders in secure and efficient operations environments!

Frequently Asked Questions (FAQs) for Lead Systems Engineer, High-Performance Computing Role at Visa

What are the main responsibilities of a Lead Systems Engineer, High-Performance Computing, at Visa?

The Lead Systems Engineer, High-Performance Computing, is primarily responsible for the design and automation of server infrastructure. This includes deploying and managing GPU as a Service, ensuring high availability in complex IT infrastructures, and running performance benchmarks and technology evaluations to optimize performance across multiple platforms. The role requires collaboration with domain experts to maintain high security and availability while improving operational efficiency.