Site Reliability Engineer

Main Location
Redmond, WA, United States

Microsoft 365 (M365) Intelligent Conversation and Communications Cloud (IC3) 

Intelligent Conversation and Communications Cloud (IC3) powers billions of real-time customer conversations across Microsoft’s first-party (Teams, Skype) and second-party (Dynamics) solutions. IC3 enables reliable, high-quality audio/video calling, meeting, and messaging services that work every time, from anywhere, seamlessly across all customer touchpoints. IC3 makes conversations on our platforms more intelligent in real time, empowering best-in-class productivity tools for the modern workplace, where every call, meeting, or chat will make the next one better.

 

About the Team 

We are the team that powers all the messaging scenarios across IC3. We develop one of the largest-scale, business-critical services at Microsoft. Our services run in every region, serving hundreds of millions of active users and processing billions of messages a day. The microservices we build must be highly scalable, highly available, and extremely performant in geo-redundant, multi-tenant systems, and must honor obligations for data sovereignty, privacy, security, and compliance.

Responsibilities

Site Reliability Engineer – IC3 Messaging Services 

We are looking for passionate Site Reliability Engineers to join our team, which manages the planet-scale infrastructure for Messaging Services. In this role, you'll be responsible for building tools to automate and streamline our processes, and for managing and improving deployment and live-site infrastructure across a portfolio of messaging services. You will have the opportunity to work with a highly collaborative and fun team in a fast-learning environment.

 

Key responsibilities:

  • Design, write, and deliver software to optimize all aspects of deployments (resources and applications) as ‘infrastructure-as-code’.
  • Optimize service release by improving Azure DevOps release pipelines. 
  • Drive services toward reliable, predictable deployments, achieving better ‘time-to-deploy’ metrics for services across Microsoft Teams.
  • Develop safe rollout plans for a portfolio of services to prevent outages. 
  • Build, run, and improve critical service environments in large scale data centers. 
  • Learn and enhance existing tools, and develop new tools to meet new scale and feature demands, with the aim of reducing manual intervention and enhancing prevention, detection, and mitigation of service impacts.
  • Manage world-wide capacity for a portfolio of services to meet the usage growth and efficiency requirements. 
  • Coordinate planning and execution with internal engineering teams, business partners and technical leaders across the division. 
  • Influence and collaborate across organizations to bring best practices, architectures, standards, and methods for large-scale distributed systems.
  • Analyze data and provide operational insights into service reliability and customer experience to Design and Product teams.
Qualifications

Essential qualifications:

  • BS/BSE in Computer Science, Management Information Systems, or another technical discipline, or equivalent education.
  • 2+ years as a Site Reliability Engineer/Developer working on large-scale distributed systems.
  • 2+ years implementing/automating using CI/CD tools.
  • Good knowledge of networking fundamentals and troubleshooting tools.
  • Proven experience creating distributed systems tools of moderate to high complexity. 
  • Ability to manage and deliver multiple project phases at the same time. 
  • Strong analytical, problem-solving, and organizational skills.
  • Excellent written and oral communication skills. 
  • Ability to deal with the ambiguity associated with working in a fast-paced and changing environment. 
  • Strong Windows OS / Linux troubleshooting experience. 

Preferred qualifications:

  • 3+ years of Azure development experience (ARM templates, Azure Monitor, PowerShell, Kubernetes, Docker, etc.).
  • 2+ years automating builds/releases using YAML. 

 

#M365Core

#IC3

 

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

 

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
