Software Engineer, SystemML - Scaling / Performance at Meta
Required Skills
Python, PyTorch, CUDA, NVIDIA NCCL, distributed ML training, GPU architecture, high-performance computing, LLMs, AI infrastructure
About the Role
This role focuses on enabling reliable, highly scalable distributed ML training on Meta's large-scale GPU infrastructure, with a particular emphasis on GenAI/LLM scaling. The engineer will work on the software stack around NCCL and PyTorch to improve distributed ML reliability and performance, from the trainer down to the inter-GPU and network communication layers.
Key Responsibilities
- Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI/LLM scaling
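To make the communication layer concrete: NCCL's core collective for gradient synchronization is ring all-reduce. The sketch below is a minimal, hypothetical pure-Python simulation of the synchronous ring algorithm (not NCCL's actual implementation, which runs on GPUs over NVLink/RoCE/InfiniBand); the function name and one-element-per-chunk layout are illustrative assumptions.

```python
def ring_allreduce(rank_vectors):
    """Simulate a synchronous ring all-reduce (the NCCL-style algorithm).

    rank_vectors: list of n equal-length lists, one per simulated rank.
    Each vector is split into n chunks (here one element per chunk for
    simplicity, so each vector must have length n). Returns the per-rank
    vectors after the all-reduce: every rank ends up holding the
    element-wise sum, using 2*(n-1) neighbor-to-neighbor ring steps
    instead of an all-to-all exchange.
    """
    n = len(rank_vectors)
    data = [list(v) for v in rank_vectors]

    # Phase 1: reduce-scatter. At step s, rank i forwards chunk
    # (i - s) % n to rank (i + 1) % n, which accumulates it. After
    # n-1 steps, rank i owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            data[(i + 1) % n][c] += data[i][c]

    # Phase 2: all-gather. Each rank forwards its most recently
    # completed chunk around the ring; receivers overwrite.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            data[(i + 1) % n][c] = data[i][c]

    return data
```

Because each rank only talks to its ring neighbor, per-rank bandwidth is constant in the number of ranks, which is why the pattern scales to large GPU clusters.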
Required Skills & Qualifications
Must Have:
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
Nice to Have:
- Knowledge of GPU architectures and CUDA programming
- Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
- Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models
- PhD in Computer Science, Computer Engineering, or relevant technical field
- Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
- Experience in HPC and parallel computing
- Knowledge of ML, deep learning, and LLMs
- Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/Infiniband
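The data-parallel strategy mentioned above (the basis of Distributed Data Parallel) rests on a simple identity: with equal shards, averaging per-worker gradients reproduces the full-batch gradient. A minimal sketch, using a hypothetical one-parameter least-squares model rather than any real training framework:

```python
def grad_mse(w, xs, ys):
    # Gradient of mean squared error of y_hat = w * x with respect to w.
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, workers):
    # Shard the batch evenly across workers, compute each worker's local
    # gradient, then "all-reduce" by averaging -- the same reduction DDP
    # performs over NCCL after the backward pass.
    shard = len(xs) // workers
    grads = [
        grad_mse(w, xs[k * shard:(k + 1) * shard],
                 ys[k * shard:(k + 1) * shard])
        for k in range(workers)
    ]
    return sum(grads) / workers
```

Model-parallel schemes (FSDP, tensor parallel, pipeline parallel) instead partition the parameters or layers themselves, trading this simple gradient reduction for more intricate communication schedules.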
Benefits & Perks
- Bonus
- Equity
- Benefits