About the role

Site Reliability Engineer II at Microsoft

Required Skills

cloud services, Azure Cosmos DB, observability, automation, telemetry analysis, distributed systems, service reliability, incident management

About the Role

This Site Reliability Engineer II role on Microsoft's Azure Cosmos DB team focuses on building automated systems for root cause analysis and mitigation in order to maintain stringent Service Level Objectives (SLOs). Responsibilities include collaborating with engineering teams, enhancing tooling, and analyzing telemetry data to improve service reliability and customer experience.

Key Responsibilities

  • Collaborating closely with engineering teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLOs
  • Collaborating with customers to understand pain points around supportability and SLO attainment, and formulating strategies for addressing recurring issues
  • Communicating at a technical level and serving as the single point of contact for enterprise customers during service escalations
  • Designing and implementing changes to service telemetry so that automation can consume it, where the required signals are not already available
  • Enhancing the customer-facing experience through proactive alerting based on utilization, trends, and resource health

Required Skills & Qualifications

Must Have:

  • 4+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience
  • 3+ years of experience running large scale cloud services
  • Ability to meet Microsoft, customer and/or government security screening requirements including Microsoft Cloud Background Check

Nice to Have:

  • 2+ years of operational experience in improving Service Reliability, Availability and Performance
  • Understanding of observability and MELT (metrics, events, logs, traces) implementation patterns for large-scale services
  • Experience with Logic Apps and authoring Jupyter Notebooks
  • Experience in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems
  • Systematic problem-solving approach, coupled with communication skills and a sense of curiosity
  • Ability to deal with the ambiguity associated with working in a fast-paced environment
  • Ability to influence the product architecture and roadmap so that customer-experienced supportability is always a key consideration

Benefits & Perks

  • Industry-leading healthcare