Site Reliability Engineering Manager

NY, United States

The platform reliability engineering team at Lifion is central to the reputation and trustworthiness of our product. We are looking for a Site Reliability Engineering (SRE) Manager to work with teams across the organization to help build and maintain our distributed platform.

RESPONSIBILITIES
  • Drive quarterly SRE objectives to completion
  • Design and implement tools to improve the reliability and efficiency of Lifion services and data stores
  • Automate infrastructure and configuration management
  • Assist with all aspects of operational security and compliance
  • Draft design documents and present solutions to stakeholders
  • Meet reliability and capacity requirements while managing costs
  • Plan disaster recovery drills
  • Participate in rotating on-call duties
  • Conduct timely post-mortems of production incidents
QUALIFICATIONS
  • Experience developing and monitoring distributed systems
  • Knowledge of when and how to apply SRE and DevOps principles
  • Understanding of the operational complexity of a microservice architecture
  • Experience with continuous integration and continuous deployment tools
  • Fluency in one or more languages, such as Python, Go, Perl, Ruby, Bash, or Java
  • Working knowledge of Ansible, Terraform, or other configuration management tools
  • Systematic approach to problem solving
  • Strong communication skills
  • Strong sense of ownership, and an ability to drive tasks to completion
  • Experience using Docker and container orchestration technologies, such as Docker Swarm, Kubernetes, or Mesos
COMPETENCIES

Execution: Ability to identify critical paths and drive plans that support one or more complex product and platform deliverables, with a commitment to quality and timeliness

Management: A true "roll up the sleeves and get it done" working approach; demonstrated success as a problem solver and as a results-oriented self-starter

Communication: Superior communication skills with the ability to present technical topics to non-technical audiences and build partnerships with business area leaders and external solution providers

Leadership: Ability to maintain high morale, both within the SRE group and externally, by inspiring trust and a sense of achievement

Judgement: Ability to identify and mitigate risks associated with the architecture and to make sound decisions with limited or incomplete data
