About the Role

This role involves designing, building, and maintaining infrastructure for training and evaluating state-of-the-art AI agent models, focusing on large-scale reinforcement learning for LLMs. You will work closely with research teams to ensure efficient and robust ML systems, troubleshooting performance bottlenecks and conducting MLSys research.

Key Responsibilities

Develop training infrastructure for efficient large-scale reinforcement learning on LLMs
Work across the entire technology stack including low-level ML systems, job orchestration, and data management
Analyze, troubleshoot, and profile complex ML systems to identify and address performance bottlenecks
Work closely with researchers to conduct MLSys research and create new techniques and tooling

Required Skills & Qualifications

Must Have:

PhD or Master's degree and 3+ years of applied research experience
Experience with programming languages such as Python, Java, or C++
Experience with machine learning and deep learning methods
Experience training and deploying ML systems for large-scale optimization, or troubleshooting technical systems

Nice to Have:

PhD or Master's degree with experience in various ML techniques and performance parameters
Experience with large-scale ML systems, profiling, debugging, and understanding system performance and scalability
Experience with distributed systems, Megatron, vLLM, Ray, and working with GPUs
Experience with patents or publications at top-tier peer-reviewed conferences or journals

Benefits & Perks

Base pay ranging from $255,000 to $345,000/year depending on location
Equity, sign-on payments, and other forms of compensation may be provided
Full range of medical, financial, and/or other benefits

Member of Technical Staff - Reinforcement Learning (Infrastructure), AGI Autonomy at Amazon.com Services LLC