Job details

Staff Site Reliability Engineer - Cloud Engineering

Visa’s Technology Organization is a community of problem solvers and innovators reshaping the future of commerce. We operate the world’s most sophisticated processing networks capable of handling more than 65k secure transactions a second across 80M merchants, 15k Financial Institutions, and billions of everyday people. While working with us you’ll get to work on complex distributed systems and solve massive scale problems centered on new payment flows, business and data solutions, cyber security, and B2C platforms.

 

The Opportunity:

As a Staff Site Reliability Engineer in Product Reliability Engineering, you will be part of a team that maintains and supports Visa's Data Platform and provides support for key cloud-based Big Data and Kafka platforms. You will be responsible for driving innovation for our partners and clients, within Visa and globally. You will work on open-source Big Data and Kafka clusters with a focus on cloud, ensuring their availability, performance, and reliability and improving operational efficiency.

 

The Work itself:

Essential Functions:

· Design, build and manage Big Data and Kafka infrastructure on AWS, GCP and Azure.

· Manage and optimize Apache Big Data and Kafka clusters for high performance, reliability, and scalability.

· Develop tools and processes to monitor and analyze system performance and to identify potential issues.

· Collaborate with other teams to design and implement solutions that improve the reliability and efficiency of the Big Data cloud platforms.

· Ensure security and compliance of the platforms within organizational guidelines.

· Other responsibilities include effective root cause analysis of major production incidents and the development of learning documentation. The person will identify and implement high-availability solutions for services with a single point of failure.

· The role involves planning and performing capacity expansions and upgrades in a timely manner to avoid any scaling issues and bugs. This includes automating repetitive tasks to reduce manual effort and prevent human errors.

· The successful candidate will tune alerting and set up observability to proactively identify issues and performance problems. They will also work closely with Level 3 teams in reviewing new use cases and cluster hardening techniques to build robust and reliable platforms.

· The role involves creating standard operating procedure documents and guidelines on effectively managing and utilizing the platforms. The person will leverage DevOps tools, disciplines (incident, problem, and change management), and standards in day-to-day operations.

· The individual will ensure that the platforms can effectively meet performance and service level agreement requirements. They will also perform security remediation, automation, and self-healing as required.

· The individual will concentrate on developing automations and reports to minimize manual effort. This can be done with automation tools such as shell scripting, Ansible, or Python, or with any other programming language.

 

The Skills You Bring:

· Energy and Experience: A growth mindset that is curious and passionate about technologies and enjoys challenging projects on a global scale.

· Challenge the Status Quo: Comfort in pushing the boundaries and “hacking” beyond traditional solutions.

· Language Expertise: Expertise in one or more general development languages (e.g., Java, Python).

· Builder: Experience building and deploying distributed systems.

· Learner: Constant drive to learn new technologies such as cloud technologies, Kubernetes, and MLOps.

· Partnership: Experience collaborating with Engineering, Application, and other functional teams.

 

**We do not expect that any single candidate would fulfill all these characteristics. For instance, we have awesome team members who are really focused on building scalable systems but didn’t work with payments technology or web applications before joining Visa.

This is a hybrid position. Hybrid employees can alternate between remote and office work. Employees in hybrid roles are expected to work from the office 2-3 set days a week (determined by leadership/site), with a general guidepost of being in the office 50% or more of the time based on business needs.

Average salary estimate

$140,000 / year (est.)
Min: $120,000
Max: $160,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Staff Site Reliability Engineer - Cloud Engineering, Visa

Are you ready to take your career to the next level as a Staff Site Reliability Engineer in Cloud Engineering with Visa? In this exciting position in Austin, you'll become part of Visa's Technology Organization, a dynamic community that's all about innovation and problem-solving. Here, you'll dive into complex distributed systems and tackle immense challenges around payment flows and data solutions that impact billions around the world. Your primary role will involve maintaining and supporting Visa’s cutting-edge Data Platform while optimizing cloud-based Big Data and Kafka infrastructures. Your design and management of these systems on AWS, GCP, and Azure will ensure high performance and reliability. Embracing a growth mindset, you'll collaborate with teams to enhance the efficiency of our platforms while focusing on security and compliance. You'll get hands-on with root cause analysis, capacity planning, and automating tasks to improve operational excellence. This is a unique chance to drive innovation not only within Visa but also for our partners and clients globally. Whether you are tuning alerts or setting up observability, your work will be crucial in creating seamless, reliable services. If you're excited about diverse technologies, from Java and Python to Kubernetes, and you're eager to push boundaries while collaborating with passionate peers, this role is your chance to shine and make a real difference in the world of commerce!

Frequently Asked Questions (FAQs) for Staff Site Reliability Engineer - Cloud Engineering Role at Visa
What are the responsibilities of a Staff Site Reliability Engineer at Visa?

As a Staff Site Reliability Engineer within Visa, your key responsibilities include designing and managing cloud infrastructures for Big Data and Kafka on AWS, GCP, and Azure. You'll optimize these systems for high reliability and performance while collaborating with cross-functional teams. Your role also encompasses driving automation, conducting root cause analysis of production incidents, enhancing security measures, and creating documentation for standard operating procedures.

What qualifications are required for the Staff Site Reliability Engineer position at Visa in Austin?

To be successful as a Staff Site Reliability Engineer at Visa, candidates should possess expertise in development languages such as Java or Python, and have experience with distributed systems. A passion for learning new technologies like cloud services and Kubernetes is highly valued. Additionally, familiarity with DevOps practices, incident management, and collaboration with engineering teams is essential.

How does Visa support career growth for the Staff Site Reliability Engineer role?

Visa fosters an environment of continuous learning, providing Staff Site Reliability Engineers with opportunities to explore innovative technologies and take on challenging projects. Employees are encouraged to share knowledge and collaborate across teams, paving the way for professional growth and advancement within the company.

What tools and technologies do Staff Site Reliability Engineers work with at Visa?

Staff Site Reliability Engineers at Visa work with various tools and technologies, including cloud platforms such as AWS, GCP, and Azure, along with Apache Big Data and Kafka clusters. Familiarity with automation tools like Ansible and Shell scripting, along with programming skills in languages such as Java and Python, is essential to succeed in this role.

Is the Staff Site Reliability Engineer position at Visa remote or hybrid?

The Staff Site Reliability Engineer position at Visa is hybrid, allowing employees to alternate between remote work and the office in Austin. Employees are expected to be in the office for about 2-3 set days per week, depending on leadership and business needs.

Common Interview Questions for Staff Site Reliability Engineer - Cloud Engineering
Can you explain your experience with cloud technologies as a Staff Site Reliability Engineer?

In your answer, highlight specific cloud platforms you've worked with, such as AWS or GCP. Discuss projects where you designed or managed Big Data systems, including details on performance optimizations and automation efforts you implemented.

What strategies do you use to monitor system performance and identify issues?

Explain your approach to system monitoring, including tools you utilize for metrics collection, alerting protocols, and how you analyze data to predict potential issues. Mention any experience you have with tuning alerts and enhancing observability.

Describe a challenging incident you resolved in a production environment.

Provide a concise narrative of a specific incident, focusing on the root cause analysis process, the actions you took to resolve it, and the lessons learned. Emphasize your analytical skills and commitment to improving system reliability.

How do you ensure compliance with security protocols on cloud platforms?

Discuss your understanding of security best practices, including authentication protocols, encryption methods, and regular security audits. Mention any specific experiences where you successfully implemented security measures or remediated vulnerabilities.

What is your experience with automation in Site Reliability Engineering?

Highlight your familiarity with automation tools such as Ansible, Python, or Shell scripting. Share examples of processes you have automated to reduce manual work and improve operational efficiency, detailing how this impacted team performance.

How do you prioritize tasks when managing multiple projects?

Explain your approach to task management, including any methodologies you use to prioritize based on urgency or impact. Share tools or systems you’ve employed to keep your workflow organized and efficient.

What steps do you take to optimize performance and scalability in Big Data infrastructures?

Discuss specific methods you've implemented to enhance performance, such as load balancing, tuning configurations, or choosing the right architecture. Include any experiences where your optimizations led to measurable improvements.

How do you collaborate with cross-functional teams, and what’s your strategy for effective communication?

Emphasize the importance of clear communication and active collaboration with different teams. Provide examples of successful projects where cross-team collaboration led to improved implementations or innovative solutions.

Can you discuss a time you challenged the status quo in a tech environment?

Share a specific case where you introduced a new technology or methodology that improved processes. Highlight your thought process and any resistance you faced, and how you overcame it to achieve buy-in from stakeholders.

What do you foresee as the biggest trends in Site Reliability Engineering moving forward?

Discuss potential trends such as increased automation, the adoption of AI in monitoring, or the growth of Kubernetes. Share your insights on how these trends could impact the role of Site Reliability Engineers in the technology landscape.

Visa Inc. operates as a payments technology company worldwide. The company facilitates commerce through the transfer of value and information among consumers, merchants, financial institutions, businesses, strategic partners, and government entiti...

Employment type: Full-time, hybrid
Date posted: April 2, 2025
