About the Role

Senior Site Reliability Engineer role focusing on AI infrastructure reliability, scalability, and performance for generative AI workloads. Responsibilities include incident management, performance optimization, and automation of large-scale distributed systems. The role requires collaboration across teams and customer advocacy in a cloud environment.

Key Responsibilities

Ensure reliability, scalability, and security of AI infrastructure
Lead incident response and root cause analysis
Identify and resolve performance bottlenecks in compute, storage, and networking
Develop automation tools for deployment and monitoring
Provide technical guidance and collaborate with cross-functional teams

Required Skills & Qualifications

Must Have:

4+ years of technical experience in software engineering, network engineering, or systems administration OR a Bachelor's degree with 1+ year of experience
Proven ability to modify infrastructure software and collaborate across teams
Strong technical design, analytical, and debugging skills
1+ years of experience with incident management and reliability engineering in cloud or AI environments

Nice to Have:

5+ years of technical experience OR a Bachelor's degree with 2+ years of experience OR a Master's degree with 1+ year of experience
Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker)
Experience with GPUs, InfiniBand, or similar high-performance technologies
Proficiency in RDMA, MPI, and high-performance computing architecture
Proficiency in scripting (PowerShell, shell scripting) and deep expertise in Linux

Benefits & Perks

Industry-leading healthcare

Site Reliability Engineer II at Microsoft