Principal Supercomputing Software Engineer at Microsoft
Required Skills
Python, C++, AI/HPC, cloud infrastructure, high-speed networks, HPC storage, container technologies, system management
About the Role
The Principal Supercomputing Software Engineer joins the Microsoft Azure AI/HPC team and focuses on building tools for reliable and performant cloud-native supercomputers. Responsibilities include system management, debugging, and driving architectural changes to support AI and HPC workloads.
Key Responsibilities
- Be part of a comprehensive systems management team focused on operational excellence and customer success
- Analyze key system metrics and telemetry to proactively identify and debug HPC system issues, build appropriate tooling, help develop processes, and ensure that solutions are responsive to emerging user needs
- Partner with customers, vendors, and other teams within Azure to drive comprehensive solutions for operating world-class supercomputers in the public cloud environment
- Ensure that the Azure platform is performant, scalable and resilient
- Foster a test-driven engineering culture to reduce regressions and bugs in production, and set a higher bar for infrastructure quality
Required Skills & Qualifications
Must Have:
- Bachelor's Degree in Computer Science or related technical or scientific field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
- 5+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure
- 3+ years of specialized experience in one of the following: AI/HPC system management, High-Speed Networks, HPC Storage, or managing Cloud Infrastructure
- Ability to meet Microsoft, customer and/or government security screening requirements, including Microsoft Cloud Background Check
Nice to Have:
- Master's Degree or PhD in Computer Science or related technical or scientific field
- Operational experience running large-scale HPC systems or infrastructure situated in Cloud environments
- Previous experience running and troubleshooting machine learning workloads on GPU-based HPC systems
- Expertise in Cloud Computing, Virtualization and Container Technologies
- Familiarity with the HPC software stack
Benefits & Perks
- Industry-leading healthcare