Site Reliability Engineer II at Microsoft

Required Skills

Python, PowerShell, distributed systems, automation, monitoring, AI/ML, cloud services, debugging

About the Role

The Site Reliability Engineer II role focuses on ensuring high availability and reliability of Microsoft 365 Exchange Online services. Responsibilities include implementing proactive engineering solutions, monitoring systems, automating incident response, and integrating AI/ML for predictive analytics. The position requires strong debugging skills, experience with distributed systems, and collaboration with product engineering teams.

Key Responsibilities

  • Implement proactive engineering solutions to identify and resolve incidents with minimal disruption
  • Develop automation code and scripts for monitoring, alerting, and deployment processes at scale
  • Analyze telemetry data and develop predictive models to improve product reliability and performance
  • Respond to incidents during on-call rotations by troubleshooting complex issues and deploying fixes
  • Mentor and coach less experienced engineers and collaborate with product engineering teams

Required Skills & Qualifications

Must Have:

  • Bachelor's or Master's degree in Computer Science, Data Science, AI, or related field
  • Mid-level software development experience with a focus on automation
  • Understanding of modern software architectures including distributed systems, microservices, and failure modes
  • Strong troubleshooting skills and ability to debug complex systems and applications

Nice to Have:

  • Experience with scripting languages like Bash, Python, or PowerShell
  • Experience with compiled languages like C or C#
  • Practical experience running large-scale online systems

Benefits & Perks

  • Industry-leading healthcare