Member of Technical Staff - Reinforcement Learning (Infrastructure), AGI Autonomy at Amazon.com Services LLC
Required Skills
Python, Java, C++, reinforcement learning, LLMs, distributed systems, MLSys, GPUs
About the Role
This role involves designing, building, and maintaining infrastructure for training and evaluating state-of-the-art AI agent models, focusing on large-scale reinforcement learning for LLMs. You will work closely with research teams to ensure efficient and robust ML systems, troubleshooting performance bottlenecks and conducting MLSys research.
Key Responsibilities
- Develop training infrastructure for efficient large-scale reinforcement learning on LLMs
- Work across the entire technology stack including low-level ML systems, job orchestration, and data management
- Analyze, troubleshoot, and profile complex ML systems to identify and address performance bottlenecks
- Work closely with researchers to conduct MLSys research and create new techniques and tooling
Required Skills & Qualifications
Must Have:
- PhD or Master's degree and 3+ years of applied research experience
- Experience with programming languages such as Python, Java, C++
- Experience with deep learning methods and machine learning
- Experience with training and deploying ML systems for large-scale optimizations or troubleshooting technical systems
Nice to Have:
- PhD or Master's degree with experience in various ML techniques and performance parameters
- Experience with large-scale ML systems, profiling, debugging, and understanding system performance and scalability
- Experience with distributed systems, Megatron, vLLM, Ray, and working with GPUs
- Experience with patents or publications at top-tier peer-reviewed conferences or journals
Benefits & Perks
- Base pay ranging from $255,000 to $345,000/year depending on location
- Equity, sign-on payments, and other forms of compensation may be provided
- Full range of medical, financial, and/or other benefits