About the role

Senior Software Engineer at Microsoft

Required Skills

Python, C++, AI/HPC, cloud infrastructure, GPU systems, high-speed networking, container technologies, system troubleshooting, supercomputing

About the Role

Senior Supercomputing Software & Systems Engineer responsible for diagnosing and troubleshooting large-scale supercomputing systems across the infrastructure stack. Develops advanced tools and implements features to ensure system reliability and performance for AI/HPC workloads on Microsoft Azure.

Key Responsibilities

  • Diagnose and troubleshoot large-scale supercomputing systems across GPU hardware, networking, datacenter, and core software
  • Develop and apply advanced tools to manage cloud-native supercomputers
  • Drive identification of dependencies and development of design documents
  • Create, implement, optimize, debug, refactor and reuse code to improve performance
  • Act as Designated Responsible Individual (DRI) to monitor systems and guide other engineers

Required Skills & Qualifications

Must Have:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 3+ years experience operating AI/HPC systems, developing/running AI/HPC applications on clusters, or operating Cloud Infrastructure
  • 2+ years specialized experience with AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to pass Microsoft Cloud Background Check upon hire and every two years thereafter

Nice to Have:

  • Bachelor's Degree in Computer Science AND 8+ years technical engineering experience OR Master's Degree AND 6+ years experience
  • 1+ year experience running and troubleshooting machine learning workloads on GPU-based HPC systems
  • 1+ year experience with cloud computing, virtualization, and container technologies

Benefits & Perks

  • Industry-leading healthcare