Site Reliability Engineering

Posted 12 days ago
Main Location
Redmond, WA, United States

Microsoft 365 (M365) Real time communication SRE Team
The Real-Time Communication SRE team is the centralized SRE team for the Intelligent Conversation and Communication Cloud (IC3), which powers billions of real-time customer conversations across Microsoft’s first-party (Teams, Skype) and second-party (Dynamics) solutions. IC3 enables reliable, high-quality audio/video calling, meeting, and messaging services that work every time, from anywhere, seamlessly across all customer touchpoints. IC3 makes conversations on our platforms more intelligent in real time, empowering best-in-class productivity tools for the modern workplace, where every call, meeting, or chat makes the next one better.


Site Reliability Engineer – RTE SRE
We are searching for a Software/Site Reliability Engineer to join our Skype and Teams Site Reliability Engineering Team. Skype and Teams are both part of the Microsoft 365 suite and power enterprise communications such as conference meetings and telco services. They are used as mission-critical applications by some of the most successful companies around the world. M365 is at the center of Microsoft’s cloud-first, devices-first strategy: it brings together cloud versions of our most trusted communication and collaboration products, such as Exchange, SharePoint, and Skype, with our cross-platform desktop suites and mobile apps.
We are looking for an engineer who, in addition to programming skills and knowledge of algorithms and data structures, is skilled in areas including, but not limited to, cloud computing, scaling, containers, virtualization, database administration, security, and infrastructure. The candidate should be a capable troubleshooter with good analytical skills, and a reliable engineer who can work efficiently across multiple complex and urgent tasks.

Key responsibilities 
• Design, write, and deliver software that optimizes all aspects of deployments (resources and applications) as ‘infrastructure-as-code’.
• Optimize service releases by improving Azure DevOps release pipelines.
• Drive services toward reliable, predictable deployments and better ‘time-to-deploy’ metrics for services across Microsoft Teams.
• Analyze incidents to determine root cause and mitigation plans. Drive automation into service management tasks and processes.
• Develop safe rollout plans for a portfolio of services to prevent outages.
• Learn and enhance existing tools, and develop new tools that meet new scale and feature requirements, with the aim of reducing manual intervention and improving the prevention, detection, and mitigation of service impacts.
• Manage worldwide capacity for a portfolio of services to meet usage growth and efficiency requirements.
• Be available for on-call rotations with the ability to troubleshoot and communicate outside of normal business hours.
• Coordinate planning and execution with internal engineering teams, business partners and technical leaders across the division.
• Influence and collaborate across orgs to bring best practices, architectures, standards, and methods for large-scale distributed systems.
• Analyze data and provide operational insights into service reliability and customer experience to the Design and Product teams.


Required qualifications:
• Strong experience as a Site Reliability Engineer/Developer working on large-scale distributed systems, or implementing and automating with CI/CD tools.
• Understanding of cloud applications and cloud services.
• Experience maintaining large-scale infrastructure and tooling.
• Software development experience using PowerShell, C#, Java, C++, C or other programming languages.


Preferred qualifications:
• Good knowledge of networking fundamentals and troubleshooting tools.
• Proven experience creating distributed systems tools of moderate to high complexity.
• Ability to manage and deliver multiple project phases at the same time.
• Strong Windows OS / Linux troubleshooting experience.
• Solid debugging, testing, and problem-solving skills.
• Ability to automate routine tasks.
• Azure development experience (ARM templates, Azure Monitor, PowerShell, Kubernetes, Docker etc.)
• Experience in a cloud stack and leveraging cloud architecture, applying site reliability principles and/or demonstrating sensitivity to operational concerns.



Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.


