Sr. Director, Site Reliability Engineering

Company Details

Company URL: https://www.berkleytechnologyservices.com

Berkley Technology Services (BTS) is the dynamic technology solution for W. R. Berkley Corporation, a Fortune 500 Commercial Lines Insurance Company. With key locations in Urbandale, IA and Wilmington, DE, BTS provides innovative and customer-focused IT solutions to the majority of WRBC’s 60+ operating units across the globe. BTS’s wide reach ensures that ideas and opinions are considered at every level of the organization to guarantee we find the best solutions possible.

Driven by a commitment to collaboration, BTS acts as consultants to our customers and Operating Units by providing comprehensive solutions that not only address the challenge at hand, but proactively plan for the “What’s Next” in our industry and beyond.

With a culture centered on innovation and entrepreneurial spirit, BTS stands as a community of technology leaders with eyes toward the future: leaders who truly care about growing not only their team members but also themselves, and who take pride in employees who shine. BTS offers endless ways to get involved and the chance to grow your career into a wide range of roles you may never have known existed. Come join us as we push forward into the future of industry-leading technological solutions.

Berkley Technology Services: Right Team, Right Technology, Simple and Secure.

Responsibilities

The Sr. Director, Site Reliability Engineering (SRE) is responsible for developing and implementing a comprehensive site reliability strategy encompassing scalability, performance, and reliability improvements. The role will align SRE objectives with overall business goals and technology roadmaps. It will foster a spirit of continuous improvement within the SRE function and position it to support organizational objectives across the Berkley Corporation.

The person in this role is responsible for overseeing SRE team operations and ensuring the reliability and availability of key applications and their supporting infrastructure. This role will work effectively with Service Management to enforce best practices for system reliability, monitoring, capacity planning, incident response, problem management, disaster recovery, change management, and workflow automation. They will also own and administer the tools and technologies needed to provide a complete view of SRE metrics and improvement areas, including (but not limited to) monitoring, logging, notification, dashboarding, and AIOps.

This role will involve overseeing multiple teams, with the possibility of additional teams being assigned as the organization grows and evolves.

Team Performance Management:

  • Establish and build a robust SRE team over time, and integrate SRE into Berkley’s product development and operational processes.
  • Recruit, mentor, and develop a high-performing team of SRE professionals.
  • Monitor ongoing staff performance; identify and communicate opportunities for improvement.
  • Provide leadership and support to ensure projects are staffed appropriately and timelines are met.

Collaboration and Relationship Building:

  • Collaborate with the BTS IT Leadership Teams and other groups across the IT organization to drive a unified approach to site reliability that reduces downtime and minimizes the business impact of outages.
  • Foster strong relationships with delivery organization leadership to align SRE efforts with organizational goals. Work collaboratively with other business and IT leaders to ensure cross-functional problems are addressed cohesively across the organization.
  • Work cross-functionally in partnership with software development teams to guide product development in creating resilient and durable software systems.
  • Collaborate with EA to institute design patterns for resilient systems and mechanisms for scoring applications against industry-recognized configurations (including active-active, active-passive, recover-from-scratch, and data replication scenarios).

Execution, Project, and Work Management:

  • Define and track reliability and observability OKRs for infrastructure and key systems.
  • Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
  • Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
  • Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
  • Work closely with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand. Anticipate growth and scalability requirements.
  • Establish and oversee effective high-severity incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.
  • Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
  • Oversee the disaster recovery program for both on-premise and cloud-based Berkley solutions.
  • Perform other duties as assigned.

Qualifications

  • A passion for technology and innovation in the end-user computing space.
  • 8+ years of experience building and leading strong, flexible teams and managing large-scale systems used by tens or hundreds of thousands of users.
  • 8+ years of experience in Site Reliability Engineering and DevOps.
  • 4+ years of experience in Disaster Recovery and/or Business Continuity.
  • Strong understanding of cloud computing platforms (Azure preferred), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).
  • Strong understanding of, and experience with, automation tools and programming/scripting languages for developing and implementing automated system reliability and performance solutions, including infrastructure automation and configuration management tools (Ansible, Chef, Puppet).
  • Strong understanding of observability, monitoring, alerting, and logging tools and ability to design and implement effective monitoring and logging strategies.
  • Experience in designing and implementing on-premise, cloud, and hybrid resiliency solutions, disaster recovery, and business continuity planning.
  • Ability to drive critical issue resolution and system design discussions, and to mediate between multiple technology teams.
  • Solid understanding of security best practices in on-premise, cloud, and hybrid environments, along with network technologies.
  • Working knowledge of CI/CD, preferably GitHub workflows and Actions.
  • Working knowledge of IaC automation tools (Terraform, Ansible, etc.)
  • Experience with Kubernetes and other auto-scaling tools and technologies.
  • Skilled at assessing and developing IT talent across multiple time zones and multiple business domains.
  • Exceptional written and verbal communication skills.
  • Ability to work independently in a fast-paced environment.
  • Travel Requirement: Up to 25%

Average salary estimate

$175,000 / yearly (est.)
Range: $150,000 (min) to $200,000 (max)

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

Similar Jobs
Posted 12 days ago

Join Berkley Entertainment as a Business Analyst, bridging the gap between IT and business teams while enhancing entertainment insurance solutions.

Posted 7 days ago

Join Arbor's team as an Instrumentation Engineer to innovatively design and maintain critical systems in the pursuit of environmental sustainability.

Join Rula as an Engineering Manager to lead initiatives that enhance mental healthcare access and outcomes.

Posted 3 days ago

Join Fluidstack as a Principal Networking Engineer and help optimize networks for cutting-edge AI deployments.

JLL Hybrid Tallahassee, FL
Posted 10 days ago

Become an essential part of JLL as an Automation Engineer, where you'll drive automation projects and mentor talent in a supportive environment.

Apple Hybrid Sunnyvale, California, United States
Posted 3 days ago
Inclusive & Diverse
Diversity of Opinions
Work/Life Harmony
Dare to be Different
Reward & Recognition
Empathetic
Take Risks
Growth & Learning
Transparent & Candid
Mission Driven
Passion for Exploration
Feedback Forward
Medical Insurance
Dental Insurance
Vision Insurance
Mental Health Resources
Life insurance
Disability Insurance
Health Savings Account (HSA)
Flexible Spending Account (FSA)
Learning & Development
Paid Time-Off
Maternity Leave
Social Gatherings

Join Apple as an RFIC Layout Automation Engineer to optimize layout processes and integrate AI-driven techniques for cutting-edge wireless technology.

Electric Hydrogen Remote Houston, Texas, United States
Posted 9 days ago

Join Electric Hydrogen as a Staff Pipe Stress Engineer to help design the next generation of hydrogen plants for a sustainable future.

Join Kimley-Horn as a Civil Engineering Analyst and pave the way for innovative land development solutions in Orlando!

Join Assystem as an Operational Development Engineer and contribute to innovative nuclear projects that shape the future of energy.

AECOM Hybrid Murray, Utah, United States
Posted 4 days ago

Join AECOM as a Mid-Level Transportation Engineer and play a key role in designing sustainable infrastructure solutions.

Panoptyc Remote No location specified
Posted 9 days ago

Join Panoptyc as a DevOps Engineer to spearhead their migration from Heroku to AWS and enhance IT security practices.

Join Toyota's team as a Principal Engineer to lead the integration of Siemens Manufacturing Operations Management solutions in logistics.

Skyworks is looking for a skilled Sr. Equipment Engineer to contribute to innovative semiconductor solutions in a collaborative and fast-paced environment.

Posted 10 days ago

Lead and innovate as the CTO Engineering Manager at Konecranes, driving the design of cutting-edge material handling solutions.

EMPLOYMENT TYPE
Full-time, on-site
DATE POSTED
April 16, 2025