Job Details

There is a place for you at T. Rowe Price to grow, contribute, learn, and make a difference. We are a premier asset manager focused on delivering global investment management excellence and retirement services that investors can rely on today and in the future. The work we do matters. We invite you to explore the opportunity to join us and grow your career with us.

Overview

In this role as Principal Site Reliability Engineer, Infrastructure Observability, you will help establish and develop a team of Site Reliability Engineers (SREs) focused on the observability, sustainability, scalability, measurability, and recoverability of T. Rowe Price’s innovative cloud and on-prem solutions by leveraging automation and best-of-breed tools. The successful candidate will have a strong operations and engineering background, be hands-on when needed, and have expertise in cloud environments (public and private), infrastructure operations, DevOps practices, CI/CD toolchains and systems, code build and deployment, incident response, and 24x7 monitoring and support.

The candidate will also have extensive experience operating within an SRE function in a complex, distributed environment. They will have a demonstrated ability to work horizontally and vertically within an organization with diverse partners and sponsor groups.

Role summary and job responsibilities

Possesses extensive knowledge in own area of expertise and extensive in-depth knowledge of the broader portfolio for comprehensive understanding of up/downstream impacts across technology infrastructure
Designs technology solutions to prevent or minimize service disruptions
Prevents technology service disruptions through technology solution recommendations and automations
Fosters a culture of deep learning through blameless post-mortems to improve the shared goal of reliability across services
Transforms operations teams by facilitating internal change to adopt SRE standard methodologies across the organization and driving strategic growth in this area within Global Technology
Analyzes incidents impacting technology availability for high-level trends across the broad portfolio
Drives initiatives to reduce or prevent technology failures in a complex, distributed technology environment
Pulls together information from disconnected systems into cohesive views of the technology portfolio for identifying trends, redundancies, and risk
Demonstrates outstanding awareness of the complexities of the tech and asset management industries
May lead initiatives of varying degrees of complexity that span multi-functional areas
Contributes to definition of target state architecture and design of the technology environment

Requirements

10+ years of relevant technology experience
5+ years building and supporting solutions in Amazon Web Services (AWS)
5+ years of experience building and running a DevOps and/or SRE function
Experience with implementation and operation of the chaos model at scale
Strategic and program-level implementation experience
Demonstrable experience implementing new technology, tools, and platforms
System administration and scripting experience
Demonstrable experience leveraging automation to proactively prevent or quickly remediate incidents
Fluent in multiple programming languages (e.g., Python, Java, Go, Node.js, .NET Core)
Proficiency with database development (SQL Server, PostgreSQL, MySQL, etc.)
Proficiency with defining, right-sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes related to reliability
Experience with implementing and managing Error Budgets
Proficiency with understanding and explaining incident situations and their recovery plans to prevent recurrence
Knowledge/experience driving dashboard standardization across the ecosystem for observability, APM and infrastructure monitoring, and application-specific logging
Knowledge/experience with observability tools such as New Relic, Elastic Stack, Prometheus, Grafana, Splunk, and cloud native tools is desirable
Knowledge/experience with cloud management tools such as Ansible, Terraform, Vault, and Vagrant
Works independently, with guidance in only the most complex situations
Makes sound decisions with limited facts or resources
Balances strategic and pragmatic concerns when solving problems
Adjusts communication style and materials to suit a given audience
Able to clearly articulate operational principles, practices, and policies
Stays abreast of industry trends and technologies
Accountable for work of self and others; sets standards around which others will operate
Maintains a broad internal professional network and knows when to engage/activate it
Develops or mentors diverse talent on the team
Ability to be on-call and/or work during off-hours

Commitment to Diversity, Equity, and Inclusion:

We strive for equity, equality, and opportunity for all associates. When we embrace the power of diversity and create an environment where people can bring their authentic and best selves to work, our firm is stronger, and we create greater value for our clients. Our commitment and inclusive programming aim to lift the experience for each associate and build allies for our global associate community. We know that a sense of belonging is key not only to your success at the firm, but also to your ability to bring your best each day.

Benefits: We invest in our people through a wide range of programs and benefits, including:

Competitive pay and bonuses as well as a generous retirement plan and employee stock purchase plan with matching contributions
Flexible and remote work opportunities
Health care benefits (medical, dental, vision)
Tuition assistance
Wellness programs (fitness reimbursement, Employee Assistance Program)

Our policies may change as our working lives evolve. Yet, our commitment to supporting our associates’ well-being and addressing the needs of our clients, business, and communities is unwavering.

Principal Site Reliability Engineer, Infrastructure Observability (Remote Flexibility)
