Software Engineer, SystemML - Scaling / Performance at Meta
Required Skills
Python, PyTorch, CUDA, NVIDIA NCCL, distributed ML training, GPU architecture, high-performance computing, LLMs, AI infrastructure
About the Role
This role focuses on enabling reliable, highly scalable distributed ML training on Meta's large-scale GPU infrastructure, with a particular emphasis on GenAI/LLM scaling. The engineer will work on the software stack around NCCL and PyTorch to improve distributed ML reliability and performance, from the trainer down to the inter-GPU and network communication layers.
Key Responsibilities
- Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI/LLM scaling
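To make the communication layer concrete: NCCL's core collective for gradient synchronization is ring all-reduce. The sketch below is a minimal, hypothetical pure-Python simulation of the synchronous ring algorithm (not NCCL's actual implementation, which runs on GPUs over NVLink/RoCE/InfiniBand); the function name and one-element-per-chunk layout are illustrative assumptions.

```python
def ring_allreduce(rank_vectors):
    """Simulate a synchronous ring all-reduce (the NCCL-style algorithm).

    rank_vectors: list of n equal-length lists, one per simulated rank.
    Each vector is split into n chunks (here one element per chunk for
    simplicity, so each vector must have length n). Returns the per-rank
    vectors after the all-reduce: every rank ends up holding the
    element-wise sum, using 2*(n-1) neighbor-to-neighbor ring steps
    instead of an all-to-all exchange.
    """
    n = len(rank_vectors)
    data = [list(v) for v in rank_vectors]

    # Phase 1: reduce-scatter. At step s, rank i forwards chunk
    # (i - s) % n to rank (i + 1) % n, which accumulates it. After
    # n-1 steps, rank i owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            data[(i + 1) % n][c] += data[i][c]

    # Phase 2: all-gather. Each rank forwards its most recently
    # completed chunk around the ring; receivers overwrite.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            data[(i + 1) % n][c] = data[i][c]

    return data
```

Because each rank only talks to its ring neighbor, per-rank bandwidth is constant in the number of ranks, which is why the pattern scales to large GPU clusters.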
Required Skills & Qualifications
Must Have:
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
Nice to Have:
- Knowledge of GPU architectures and CUDA programming
- Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
- Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models
- PhD in Computer Science, Computer Engineering, or relevant technical field
- Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
- Experience in HPC and parallel computing
- Knowledge of ML, deep learning, and LLMs
- Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/Infiniband
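The data-parallel strategy mentioned above (the basis of Distributed Data Parallel) rests on a simple identity: with equal shards, averaging per-worker gradients reproduces the full-batch gradient. A minimal sketch, using a hypothetical one-parameter least-squares model rather than any real training framework:

```python
def grad_mse(w, xs, ys):
    # Gradient of mean squared error of y_hat = w * x with respect to w.
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, workers):
    # Shard the batch evenly across workers, compute each worker's local
    # gradient, then "all-reduce" by averaging -- the same reduction DDP
    # performs over NCCL after the backward pass.
    shard = len(xs) // workers
    grads = [
        grad_mse(w, xs[k * shard:(k + 1) * shard],
                 ys[k * shard:(k + 1) * shard])
        for k in range(workers)
    ]
    return sum(grads) / workers
```

Model-parallel schemes (FSDP, tensor parallel, pipeline parallel) instead partition the parameters or layers themselves, trading this simple gradient reduction for more intricate communication schedules.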
Benefits & Perks
- Bonus
- Equity
- Benefits