Job details

Site Reliability Engineer

About AION

AION is building the next generation of AI cloud platforms, transforming high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. The platform's innovative Proof of Compute Contribution (PoCC) protocol rewards contributors based on performance, creating a transparent and efficient ecosystem.

Integrated with Tether (USD₮ & USD₮0) for stability and regulatory clarity, AION eliminates volatility, ensuring predictable costs and seamless transactions. With cutting-edge partnerships and a USD-backed economy, AION is pioneering the commoditization of high-performance compute, empowering global innovation and bridging the AI wealth gap.

Led by high-pedigree founders with previous exits, AION is well-funded by major VCs with strategic global partnerships. Headquartered in the US with global presence, the company is building its initial core team in India.

Who you are

You are a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You thrive on building robust monitoring solutions and creating self-healing infrastructure. You understand the challenges of maintaining high availability across distributed systems and have experience implementing SRE best practices. You're passionate about creating production-ready environments that can scale efficiently and recover automatically from failures.

Technical Skills & Experience

  • 3-8 years of experience in Site Reliability Engineering or DevOps (exceptional candidates with different experience profiles will be considered)
  • A Tier 1 college education or previous work experience at FAANG or top startups is preferred but not required
  • Cloud Platforms: Deep expertise with AWS, GCP, or Azure infrastructure services
  • Kubernetes: Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
  • Infrastructure as Code: Strong experience with Terraform, Pulumi, or similar IaC tools
  • Observability: Expertise implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
  • Service Mesh: Experience with Istio, Linkerd, or similar service mesh technologies
  • Networking: Understanding of network architectures, DNS, load balancing, and security groups
  • CI/CD: Knowledge of automated deployment pipelines and GitOps workflows
  • Scripting: Proficiency in Bash, Python, or Go for automation scripts
  • Container Technologies: Deep understanding of Docker, containerd, and OCI specifications
  • Security: Knowledge of infrastructure security best practices and compliance requirements
  • Incident Management: Experience with incident response, post-mortems, and developing SOP documentation

Key Responsibilities

  1. Design and implement comprehensive monitoring and alerting systems across all AION platforms.
  2. Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
  3. Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
  4. Implement service mesh solutions for observability, traffic management, and security.
  5. Design and implement logging systems that provide visibility into complex distributed systems.
  6. Plan capacity and optimize resources across cloud environments.
  7. Implement CI/CD pipelines for reliable and consistent deployments across all environments.
  8. Design and build self-healing systems that automatically recover from common failure modes.
  9. Develop infrastructure for both the compute platform and data annotation services with consistent reliability practices.
  10. Design and implement disaster recovery strategies and testing procedures.
  11. Create and maintain production, staging, and development environments with appropriate isolation.
  12. Collaborate with security teams to implement infrastructure security best practices and compliance requirements.

Location

Individuals in this role are expected to relocate to Bangalore, though exceptions can be made. We offer a hybrid working setup with 3 days per week in the office. Employees also have the flexibility to work from anywhere for a few months each year.

Why Join Us

  • Be part of a mission-driven team at the intersection of web3 and AI, tackling some of the most exciting challenges in the industry.
  • Join the ground floor of an AI startup, with the opportunity to make a significant impact on the company and the industry.
  • Collaborate with top-tier talent from the tech industry.
  • Competitive salary and benefits package.
  • Flexible work environment with opportunities for professional growth and development.

If you are a skilled and motivated Site Reliability Engineer with a passion for building reliable, scalable infrastructure for cutting-edge compute systems, we would love to hear from you.

Average salary estimate

$100,000 / year (est.)
Min: $80,000
Max: $120,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Site Reliability Engineer, AION

AION is on the lookout for a talented Site Reliability Engineer to help us redefine the AI cloud platform space! We're not just transforming high-performance computing; we're also creating a decentralized AI cloud that brings powerful computing capabilities within everyone's reach. If you thrive on reliability and have expertise in cloud-native systems and infrastructure automation, you'll fit right in at AION.

In this role, you'll design and implement robust monitoring solutions, ensuring that our systems are not just efficient but also able to recover seamlessly from failures. You'll work hands-on with Terraform and Kubernetes to provision and scale infrastructure, and your command of observability tools like Prometheus and Grafana will help us maintain high availability across our systems. You'll be working on a platform whose Proof of Compute Contribution (PoCC) protocol rewards compute contributors based on performance, creating a transparent and efficient ecosystem.

At AION, you'll be surrounded by like-minded individuals who are committed to making a difference in the fast-paced world of AI. We offer a hybrid working model and opportunities for professional growth. If you're excited to work at the forefront of technology with a mission-driven company, we'd love to meet you!

Frequently Asked Questions (FAQs) for Site Reliability Engineer Role at AION
What are the main responsibilities of a Site Reliability Engineer at AION?

As a Site Reliability Engineer at AION, you will be responsible for designing and implementing comprehensive monitoring and alerting systems across all platforms. Additionally, you will work on developing automation for infrastructure provisioning using Terraform and Kubernetes, ensuring services are reliable and can scale efficiently. You'll also create runbooks for handling operational scenarios, manage CI/CD pipelines, and optimize capacity across cloud environments while implementing disaster recovery strategies.

What qualifications do I need to become a Site Reliability Engineer at AION?

To qualify for the Site Reliability Engineer position at AION, you should have 3-8 years of experience in Site Reliability Engineering or DevOps, with a strong educational background (Tier 1 college preferred). Expertise in cloud platforms like AWS, GCP, or Azure is essential, as well as advanced knowledge of Kubernetes, Infrastructure as Code tools like Terraform, and observability practices using Prometheus and Grafana.

How does AION support the professional growth of its Site Reliability Engineers?

AION is committed to the professional development of its employees, including Site Reliability Engineers. You can expect a flexible work environment and the opportunity to collaborate with top-tier talent in the tech industry. Our mission-driven approach fosters an atmosphere where continuous learning and hands-on experience are encouraged, and as the company grows, so do the opportunities for advancement.

What tech stack will a Site Reliability Engineer use at AION?

As a Site Reliability Engineer at AION, you will work with a diverse tech stack that includes cloud platforms (AWS, GCP, or Azure), Kubernetes for container orchestration, Infrastructure as Code tools like Terraform or Pulumi, and observability tools such as Prometheus and Grafana. You will also gain experience with service mesh technologies, networking structures, and CI/CD pipelines, making it a comprehensive role that enhances your skill set.

Is relocation required for the Site Reliability Engineer position at AION?

While candidates for the Site Reliability Engineer position at AION are expected to relocate to Bangalore, exceptions can be made. We offer a hybrid working setup where employees are in the office for three days a week, while also providing flexibility to work remotely for several months throughout the year.

Common Interview Questions for Site Reliability Engineer
Can you explain your experience with Kubernetes and how it relates to your role as a Site Reliability Engineer?

When answering this question, highlight specific projects where you've managed or implemented Kubernetes clusters. Discuss your responsibilities such as cluster management, troubleshooting, and scaling applications. Providing concrete examples of how you utilized Kubernetes to improve reliability or reduce downtime will showcase your skills effectively.

What monitoring tools have you implemented in your previous roles?

Be prepared to discuss specific monitoring tools like Prometheus, Grafana, or the ELK stack that you've implemented. Describe how you used these tools to create alerting systems and improved observability in production environments. Mention any challenges you faced and how you overcame them to ensure system reliability.

How do you approach incident management and postmortems?

Discuss your systematic approach to incident management, emphasizing your experience with responding to incidents, documenting processes, and conducting postmortems. Illustrate how you've leveraged learnings from incidents to develop SOPs and improve system reliability moving forward.

What strategies do you use for capacity planning in cloud environments?

Explain how you assess usage patterns, analyze metrics, and forecast resource needs based on both historical and anticipated usage. You might reference specific scenarios where you successfully optimized capacity, ensuring applications ran smoothly while minimizing costs.

Describe your experience with Infrastructure as Code tools.

Detail your experience using Infrastructure as Code tools like Terraform or Pulumi. Provide examples of how you’ve automated resource provisioning and managed application deployments through code, highlighting effectiveness and efficiency gains from this approach.

How have you implemented disaster recovery strategies in your previous positions?

Outline specific disaster recovery strategies you’ve developed, such as backup solutions or failover processes. Discuss the importance of these strategies and any successful outcomes from scenarios where you had to invoke them, emphasizing the lessons learned.

What steps do you take to ensure security compliance in cloud infrastructures?

Talk about your understanding of security best practices and compliance requirements. Provide examples of security measures you’ve implemented to safeguard cloud infrastructures, stating any relevant regulations you are familiar with, such as GDPR or HIPAA.

How do you maintain a collaborative atmosphere while working remotely?

Emphasize your communication skills and ability to leverage collaboration tools to keep teams connected. Share examples of how you've built relationships with remote colleagues, ensuring that everyone stays aligned and engaged with ongoing projects.

Can you discuss a significant challenge you faced as a Site Reliability Engineer and how you overcame it?

Choose a specific challenge that showcases your problem-solving skills. Describe the scenario, the steps you implemented to identify and resolve the issue, and the positive outcomes. This demonstrates your ability to think critically and handle pressure effectively.

What is your experience with CI/CD pipelines?

Outline your understanding of continuous integration and continuous deployment principles. Share relevant experiences where you’ve designed or maintained CI/CD pipelines, focusing on tools used, your role in the implementation, and benefits observed in terms of deployment efficiency.

Employment type: Full-time, hybrid
Date posted: April 2, 2025
