Job details

Linux Administrator

Seeking a Sr. Linux Operations Support Engineer to provide system monitoring, hardware troubleshooting, and hardware repair, and to collaborate with internal Engineering and Lab Management teams to ensure that system issues are resolved as quickly and efficiently as possible. The role is primarily NVIDIA GPU administration.

Interested in a great critical thinker who can provide thought leadership to continually improve the service to their customers.

Responsibilities for this role will include:
- Monitoring the health of Linux-based GPU clusters and nodes
- Interpreting system event logs and the monitoring portal to determine the cause of errors
- Coordinating with various teams worldwide to enact repairs quickly and efficiently
- Filing support tickets with other teams and following them through to completion
- Administering cluster node status to bring nodes online/offline for repair
- Troubleshooting system errors to address hardware and software problems
- Communicating with end users and platform engineers in a timely and professional manner
- Being responsible for uptime metrics during Redmond business hours

The ideal candidate for this role will have:
- Proven background in Linux systems administration and troubleshooting; 5+ years of hands-on Linux/Ubuntu system administration preferred
- Strong technical support operations background
- Hands-on experience troubleshooting and repairing multi-GPU hardware systems
- Experience parsing event log data to identify the root cause of system issues
- Strong communication, customer service, and cross-group collaboration skills
- Ability to own issues from identification through resolution
- Passion for technology and continuous learning
- Knowledge of or experience with GPU-based High Performance Computing preferred