Site Reliability Engineer

Charlottesville, VA, United States

Site Reliability Engineering (SRE) is an engineering discipline that draws on software and systems engineering to define, measure and achieve reliability objectives. SRE embraces DevOps philosophies and uses custom code, automation, tooling, support processes and service management frameworks to meet those objectives. The SRE mindset treats reliability as a first-class feature of any service and prioritizes engineering and automation over manual intervention.

S&P Global's Site Reliability Engineering teams are responsible for keeping our products and services available to customers and employees located around the world. We achieve this through software, system and process engineering to maintain service level objectives, limit human intervention and minimize the level of effort associated with support (a.k.a. "toil"). SRE teams at S&P Global are generally responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of our products and services.

Our SRE teams value:
End-user focus: As engineers and consumers of services, we deeply value the quality of our users' experience. We recognize that a solution is only as good as the quality of service it provides. 
Passion for coding and automation: We leverage technology to improve reliability and make our lives easier. We are experienced problem-solvers and are proficient in scripting and programming languages. We look for people who enjoy problem-solving, writing code and exploring automation.
Curiosity: We compulsively search for the underlying cause of issues and ways to improve reliability.
Honesty: We value honesty and transparency over placing blame. We promote a blameless culture throughout the organization.

About You
You have 5+ years of experience in software or systems engineering.
You have experience monitoring, supporting and tuning a production application stack.
You value your time and have experience with scripting and automation frameworks.
You want to support full-stack solutions, including applications, servers, networks, data pipelines and data platforms.
You have excellent troubleshooting skills.
You demonstrate an objective, data-driven approach to problem-solving.
You demonstrate excellent collaboration and communication skills.
You take a practical and iterative approach to improvement, making small changes and testing for effect.
You have experience working across silos in a change-controlled environment.
You have experience working with a globally distributed workforce.
You have experience with cloud hosting technologies (e.g., AWS, Azure, Google Cloud).
You may have some experience with containerization platforms (e.g., Docker, Kubernetes).

Responsibilities
Develop, maintain and report on Service Level Objectives (SLOs).
Develop and support monitoring and automation to defend SLOs.
Resolve incidents (outages and service disruptions), including participation in on-call rotations.
Perform root cause analysis and write formal postmortems for service disruptions.
Perform capacity planning to ensure future reliability and efficiency as utilization grows.
Develop and test disaster recovery plans.
Implement changes and support releases in a controlled environment.
Develop and maintain runbooks, share knowledge and cross-train members of SRE and Development teams.
Consult with Development teams during service design and in advance of releases.
Conduct production readiness reviews to ensure services meet SRE onboarding requirements.

Education and Certifications
Bachelor's degree or higher in computer science, math, engineering or related disciplines.
AWS technical certifications are helpful.

Technologies Leveraged
AWS, VMware, F5 BIG-IP, HAProxy, Windows Server, Linux, IIS, Apache HTTP Server, SQL Server, Oracle, MySQL, Apache NiFi, .NET, JavaScript, Python, PowerShell, Perl, Redis, Memcached, Kafka, Active Directory, Elasticsearch, Logstash, Kibana, Google Analytics, AppDynamics, SolarWinds, Datadog, Prometheus, Grafana, Azure DevOps, Visual Studio, ServiceNow, Kubernetes, Docker, Git, Selenium, Jenkins, Ansible
