Posted 5 days ago

We are a Site Reliability Engineering team supporting a variety of mission-critical Microsoft Services. We are part of a multidisciplinary engineering organization tasked with driving reliability across a suite of services that enhance the experience of our international customers.


We have adopted the Site Reliability Engineering (SRE) approach, providing dedicated teams of Site Reliability Engineers who work on key areas of service reliability in close cooperation with the product development teams. We combine our knowledge and skills across all our engagements to drive reliability and efficiency, and we scale out our impact by working closely with development teams and collaborating on the code for our services’ reliability features.


Our team is part of a product’s DRI rotation, which handles Sev2/1/0 incidents and outages. This allows us to learn from real-life production issues and to help the product team do the same, with the specific goal of improving the availability, reliability, observability, and operability of our systems.


We strive to improve reliability and efficiency fundamentals via metrics and monitoring, issue mitigation and analysis, and software engineering, preferring long-lasting platform improvements delivered as engineering projects over repetitive manual operations. We contribute to the product fundamentals and architecture, share knowledge and code, and prefer reuse over reinvention, always looking for ways to make what we build useful to multiple teams and products.


We know that the SRE discipline is evolving; we learn from our peers at Microsoft and elsewhere in the industry and aim to contribute to this evolution by innovating on SRE within our group and sharing those innovations.

Our teams have a wide variety of professional experiences, and we are interested in meeting candidates with traditional engineering backgrounds as well as those whose focus is coding-centric DevOps and live-site work. We strive to continue building our team with diversity and inclusiveness. We strongly believe that diversity, together with an environment where everyone can feel safe to contribute their own insights, is the key to making the best workplace possible. We know that the best workplace makes the best products and services: not only is it the smart thing to do, but it is also the right thing.


We are not looking for people who know it all; we are looking for people who want to learn it all.

We value the input of people who aren’t afraid to learn all the time and who embrace mistakes as they continuously improve both our services and themselves. If you are excited by this type of challenge and you love to work in groups of people who are similarly excited, come join us!


Billions of users across the world rely on our products, and to meet this demand we design and implement world-class distributed systems. 


As a Site Reliability Engineer in one of our SRE teams, you will be responsible for improving the reliability of key Microsoft Services.


Our key focus areas are:

  • Defining our systems’ reliability goals via Service Level Objectives (SLOs).
  • Improving our systems’ production posture via targeted observability and operability enhancements (telemetry, alerting, incident management, change management, safe production changes).
  • Building reusable automation to empower multiple teams to achieve their reliability and efficiency goals.
  • Influencing the product architecture and roadmap to make sure the customer-experienced reliability is always a key consideration when evolving the product.
  • Resolving Live Site and Customer incidents in our systems, as per defined and agreed Service Level Agreements (SLAs).

We are looking for engineers passionate about the above areas who are also interested in:

  • Providing technical leadership for engineers across multiple teams.
  • Mentoring engineers on SRE principles, practices, and tools.

Qualifications
Required Qualifications
  • 3+ years of software development experience in online services such as Azure, AWS, etc.
  • 3+ years of experience using programming languages such as C#, C++, etc.
  • Experience working with large-scale distributed systems (e.g., cloud computing providers, SaaS services), ideally with complex environments.
Preferred Qualifications
  • Experience working on large and unfamiliar codebases.
  • Experience as a technical lead.
  • Experience using scripting languages such as PowerShell.
  • Awareness of, and ability to reason about, modern distributed software design patterns and cloud systems architecture, including microservices, containers, load-balancing, queuing, caching.
  • B.Sc., M.Sc. or Ph.D. in Computer Engineering, Computer Science, or related fields.


Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.


Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

Site Reliability Engineer