About This Role
A well-funded AI infrastructure company is building next-generation multimodal foundation models and a highly optimized training and serving platform. The team is looking for a GPU Kernel Engineer to push the limits of performance on modern accelerators and help power large-scale AI systems.
This role sits at the intersection of GPU programming, systems engineering, and cutting-edge AI workloads. You’ll work across the hardware–software stack, from low-level kernel development to integrating optimized operations into production ML frameworks used for training and inference at scale.
What You’ll Do
- Custom Kernel Development: Design, implement, and optimize high-performance GPU kernels using C++, CUDA, ROCm, PTX, Triton, and/or JAX Pallas.
- Performance Optimization: Profile and optimize end-to-end ML workloads, with a focus on large-scale LLM training and inference.
- Framework Integration: Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and custom runtime systems.
- Bottleneck Analysis: Build performance models, identify compute and memory bottlenecks, and deliver kernel-level improvements that meaningfully accelerate AI workloads.
- Cross-Functional Collaboration: Work closely with ML researchers, distributed systems engineers, and model-serving teams to optimize performance across the stack.
- Hardware-Aware Engineering: Collaborate with hardware vendors and stay current with evolving GPU architectures, compilers, and toolchains.
- Tooling & Reliability: Contribute to benchmarking, testing, documentation, and tooling to ensure correctness, reproducibility, and sustained performance gains.
Ideal Candidate Profile
- 5+ years of experience in GPU kernel development, high-performance computing, or systems programming
- Bachelor’s, Master’s, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related field
- Strong programming skills in C++ and Python
- Deep expertise in CUDA and/or ROCm, GPU memory models, and performance optimization
- Hands-on experience with Triton and/or JAX Pallas for custom kernel development
- Strong understanding of PTX, GPU assembly, and low-level execution models
- Proven experience integrating custom kernels into PyTorch, JAX, or similar ML frameworks
- Experience working with large-scale LLM workloads (training or inference)
Nice to Have
- Experience optimizing for AMD GPUs and ROCm
- Familiarity with JAX FFI and custom ML operator development
- Experience with efficient inference or serving frameworks (e.g., vLLM, TensorRT)
- Exposure to TPUs, XLA, or other accelerator programming environments
- Contributions to open-source ML systems, compilers, or GPU kernel libraries
Benefits
- Medical, dental, and vision insurance
- 401(k) plan
- Daily meals and snacks
- Flexible paid time off
- Relocation assistance
- Stock options
- Competitive compensation and meaningful equity
Job Type: Full-time
Pay: $190,000.00–$250,000.00 per year
Work Location: In person
Salary Context
This $190K–$250K range is above the median for MLOps Engineer roles in our dataset (median: $201K across 79 roles with salary data).