About the role

Site Reliability Engineer II at Microsoft

Required Skills

Azure, Kubernetes, Docker, Linux, scripting, GPU, InfiniBand, AI infrastructure, cloud platforms

About the Role

This Site Reliability Engineer II role focuses on the reliability, scalability, and performance of AI infrastructure for generative AI workloads. Responsibilities include incident management, performance optimization, and automation of large-scale distributed systems. The role requires cross-team collaboration and customer advocacy in a cloud environment.

Key Responsibilities

  • Ensure reliability, scalability, and security of AI infrastructure
  • Lead incident response and root cause analysis
  • Identify and resolve performance bottlenecks in compute, storage, and networking
  • Develop automation tools for deployment and monitoring
  • Provide technical guidance and collaborate with cross-functional teams

Required Skills & Qualifications

Must Have:

  • 4+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree with 1+ year experience
  • Proven ability to modify infrastructure software and collaborate across teams
  • Strong technical design, analytical, and debugging skills
  • 1+ years experience with incident management and reliability engineering in cloud or AI environments

Nice to Have:

  • 5+ years technical experience OR Bachelor's Degree with 2+ years experience OR Master's Degree with 1+ year experience
  • Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker)
  • Experience with GPUs, InfiniBand, or similar high-performance technologies
  • Proficiency in RDMA, MPI, and high-performance computing architecture
  • Proficiency in scripting (PowerShell, shell scripts) and deep expertise in Linux

Benefits & Perks

  • Industry-leading healthcare