Software Engineer - AI/ML, AWS Neuron Distributed Training at Annapurna Labs (U.S.) Inc.
Required Skills
Python, PyTorch, TensorFlow, distributed training, machine learning, AWS, software development, large language models
About the Role
This role is for a senior software engineer in the Machine Learning Applications team for AWS Neuron, focusing on distributed training of large-scale ML models like GPT and Stable Diffusion. The engineer will build and tune distributed training solutions using PyTorch and TensorFlow on AWS Trainium and Inferentia silicon. Strong software development and ML expertise are essential.
Key Responsibilities
- Build, deliver, and maintain complex products for AWS Neuron distributed training
- Design fault-tolerant systems that run at massive scale in the AWS Cloud
- Develop, enable, and performance-tune a wide variety of ML model families, including large language models
- Lead efforts building distributed training support into PyTorch and TensorFlow using XLA and Neuron stacks
- Tune models to ensure the highest performance on AWS Trainium and Inferentia silicon
Required Skills & Qualifications
Must Have:
- 5+ years of non-internship professional software development experience
- 5+ years of programming with at least one software programming language
- 5+ years of leading design or architecture of new and existing systems
- 5+ years of full software development life cycle experience
Nice to Have:
- Bachelor's degree in computer science or equivalent
Benefits & Perks
- Inclusive team culture with employee-led affinity groups
- Work-life balance with flexible working hours
- Mentorship and career growth opportunities
- Comprehensive compensation package including medical, financial, and other benefits