About the role

Senior Site Reliability Engineer at Microsoft

Required Skills

Azure, Kubernetes, Docker, GPU, InfiniBand, Linux, scripting, AI infrastructure, cloud platforms

About the Role

This Senior Site Reliability Engineer role focuses on the reliability, scalability, and performance of Azure's specialized AI systems. Responsibilities include incident management, performance optimization, and infrastructure automation for high-performance AI workloads. The role requires extensive experience with cloud platforms, distributed systems, and AI infrastructure technologies.

Key Responsibilities

  • Ensure reliability, scalability, and security of AI infrastructure supporting HPC & AI workloads
  • Lead incident response, root cause analysis, and continuous improvement to minimize downtime
  • Identify and resolve bottlenecks in compute, storage, networking, and specialized hardware
  • Develop and maintain automation tools for deployment, monitoring, and management of AI infrastructure
  • Provide technical guidance on cloud and AI infrastructure technologies and collaborate with cross-functional teams

Required Skills & Qualifications

Must Have:

  • 6+ years of technical experience in software engineering, network engineering, or systems administration (or Bachelor's + 3 years, or Master's + 2 years)
  • 5+ years of hands-on experience developing and supporting infrastructure services for AI or cloud platforms
  • 1+ years of experience with incident management and reliability engineering in cloud or AI environments
  • Proven ability to modify componentized, well-architected infrastructure software and collaborate across teams

Nice to Have:

  • 7+ years of technical experience in software engineering, network engineering, or systems administration (or Bachelor's + 4 years, or Master's + 3 years)
  • Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker, containers ecosystem)
  • Experience with GPUs, InfiniBand, or similar high-performance technologies
  • Proficiency in RDMA, MPI, and high-performance computing architecture
  • Proficiency in scripting (PowerShell, shell scripts) and deep expertise in Linux

Benefits & Perks

  • Industry-leading healthcare