Job details

Sr. Director, Site Reliability Engineering


Company Details

Company URL: https://www.berkleytechnologyservices.com

Berkley Technology Services (BTS) is the dynamic technology solution for W. R. Berkley Corporation, a Fortune 500 Commercial Lines Insurance Company. With key locations in Urbandale, IA and Wilmington, DE, BTS provides innovative and customer-focused IT solutions to the majority of WRBC’s 60+ operating units across the globe. BTS’s wide reach ensures that ideas and opinions are considered at every level of the organization to guarantee we find the best solutions possible.

Driven by a commitment to collaboration, BTS acts as consultants to our customers and Operating Units by providing comprehensive solutions that not only address the challenge at hand, but proactively plan for the “What’s Next” in our industry and beyond.

With a culture centered on innovation and entrepreneurial spirit, BTS stands as a community of technology leaders with eyes toward the future -- leaders who truly care about growing not only their team members, but themselves, and take pride in their employees who shine. BTS offers endless ways to get involved and the chance to grow your career into a wide range of roles you never knew existed. Come join us as we push forward into the future of industry-leading technological solutions.

Berkley Technology Services: Right Team, Right Technology, Simple and Secure.

Responsibilities

The Sr. Director, Site Reliability Engineering (SRE) is responsible for developing and implementing a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements. The role will align SRE objectives with overall business goals and technology roadmaps. It will foster a spirit of continuous improvement within the SRE function and position it to advance organizational objectives across the Berkley Corporation.

The person in this role is responsible for overseeing SRE team operations, ensuring the reliability and availability of key applications and supporting infrastructure. This role will work effectively with Service Management to enforce best practices for system reliability, monitoring, capacity planning, incident response, problem management, disaster recovery, change management, and workflow automation. They will also own and administer the tools and technologies necessary to generate a complete view of SRE metrics and improvement areas, including (but not limited to) monitoring, logging, notification, dashboarding, and AIOps.

This role will involve overseeing multiple teams, with the possibility of additional teams being assigned as the organization grows and evolves.

Team Performance Management:

Instantiate and build a robust SRE team over time and integrate SRE into Berkley’s product development and operational process.
Recruit, mentor, and develop a high-performing team of SRE professionals.
Monitor ongoing staff performance; identify and communicate opportunities for improvement.
Provide leadership and support to ensure projects are staffed appropriately and timelines are met.

Collaboration and Relationship Building:

Collaborate with the BTS IT Leadership Teams and other groups across the IT organization to drive a unified approach to site reliability that reduces downtime and minimizes the business impact of outages.
Foster strong relationships with delivery organization leadership to align SRE efforts with organizational goals. Work collaboratively with other business and IT leaders to ensure cross-functional problems are addressed cohesively across the organization.
Work cross-functionally in partnership with software development teams to guide product development in creating resilient and durable software systems.
Collaborate with EA to institute design patterns for resilient systems and mechanisms for scoring applications against industry-recognized configurations (including active-active, active-passive, recover-from-scratch, and data replication scenarios).

Execution, Project, and Work Management:

Define and track reliability and observability OKRs for infrastructure and key systems.
Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
Work closely with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand. Anticipate growth and scalability requirements.
Establish and oversee effective high-severity incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.
Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
Oversee the disaster recovery program for both on-premises and cloud-based Berkley solutions.
Perform other duties as assigned.

Qualifications

A passion for technology and innovation in the end user computing space.
8+ years of experience building and leading strong, flexible teams and managing large-scale systems used by tens or hundreds of thousands of users.
8+ years of experience in Site Reliability Engineering and DevOps.
4+ years of experience in Disaster Recovery and/or Business Continuity.
Strong understanding of cloud computing platforms (Azure preferred), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).
Strong understanding of and experience with automation tools and programming/scripting languages to develop and implement automated system reliability and performance solutions, including infrastructure automation and configuration management tools (Ansible, Chef, Puppet).
Strong understanding of observability, monitoring, alerting, and logging tools and ability to design and implement effective monitoring and logging strategies.
Experience in designing and implementing on-premise, cloud, and hybrid resiliency solutions, disaster recovery, and business continuity planning.
Ability to drive critical issues and system design discussions and moderate between multiple technology teams.
Solid understanding of security best practices in on-premise, cloud, and hybrid environments along with Network technologies.
Working knowledge of CI/CD - preferably GitHub workflows and Actions.
Working knowledge of IaC automation tools (Terraform, Ansible, etc.).
Experience with Kubernetes and other auto-scaling tools and technologies.
Skilled at assessing and developing IT talent across multiple time zones and multiple business domains.
Exceptional written and verbal communication skills.
Ability to work independently in a fast-paced environment.
Travel Requirement: Up to 25%

Average salary estimate

$175,000 / year (est.)

Min: $150,000

Max: $200,000

