About the role

Site Reliability Engineer - CTJ - Poly at Microsoft

Required Skills

distributed systems, automation, monitoring, cloud technologies, incident response, capacity planning, scripting, telemetry analysis

About the Role

This Site Reliability Engineer role focuses on improving reliability, performance, and scalability of large-scale distributed systems. Responsibilities include automating operations, analyzing telemetry, and participating in on-call incident response. The position requires a U.S. government Top Secret clearance with SCI and polygraph.

Key Responsibilities

  • Independently creates, tests, and deploys changes through safe deployment processes to enhance code quality and system observability
  • Writes code or scripts to automate scalable operations processes like monitoring, alerting, and deployments
  • Develops alerts and instrumentation to monitor product capacity, security risks, and resource demands
  • Engages with product engineering teams through code reviews, meetings, on-call rotations, and incident response
  • Uses tools and models to troubleshoot availability, security, reliability, and performance issues

Required Skills & Qualifications

Must Have:

  • Master's Degree in Computer Science, IT, or related field AND 1+ years of technical experience; OR Bachelor's Degree AND 2+ years of experience; OR equivalent experience
  • Active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on Single Scope Background Investigation (SSBI) with Polygraph
  • Verification of U.S. citizenship
  • Ability to pass the Microsoft Cloud background check upon hire and every two years thereafter

Nice to Have:

  • Experience working on large-scale distributed services with on-call responsibilities
  • Ability to build and influence broadly towards common goals and priorities
  • Experience with distributed database systems such as SQL and PostgreSQL

Benefits & Perks

  • Industry-leading healthcare