About the Role

This is a Principal Supercomputing Software Engineer role on the Microsoft Azure AI/HPC team, focused on building tools for reliable and performant cloud-native supercomputers. Responsibilities include system management, debugging, and driving architectural changes to support AI and HPC workloads.

Key Responsibilities

Be part of a comprehensive systems management team focused on operational excellence and customer success
Analyze key system metrics and telemetry to proactively identify and debug HPC system issues, build appropriate tooling, help develop processes, and ensure that solutions are responsive to emerging user needs
Partner with customers, vendors, and other teams within Azure to drive comprehensive solutions for operating world-class supercomputers in the public cloud environment
Ensure that the Azure platform is performant, scalable and resilient
Foster a test-driven engineering culture to reduce regressions and bugs in production and set a higher bar for infrastructure quality

Required Skills & Qualifications

Must Have:

Bachelor's Degree in Computer Science or related technical or scientific field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
5+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure
3+ years of specialized experience in at least one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
Ability to meet Microsoft, customer and/or government security screening requirements, including Microsoft Cloud Background Check

Nice to Have:

Master's Degree or PhD in Computer Science or related technical or scientific field
Operational experience running large-scale HPC systems or infrastructure in Cloud environments
Previous experience with running and troubleshooting machine learning workloads on GPU-based HPC systems
Expertise in Cloud Computing, Virtualization and Container Technologies
Familiarity with the HPC software stack

Benefits & Perks

Industry-leading healthcare
