About This Role
A well-funded AI infrastructure company is building next-generation multimodal foundation models and a highly optimized training and serving platform. The team is looking for a GPU Kernel Engineer to push the limits of performance on modern accelerators and help power large-scale AI systems.
This role sits at the intersection of GPU programming, systems engineering, and cutting-edge AI workloads. You’ll work across the hardware–software stack, from low-level kernel development to integrating optimized operations into production ML frameworks used for training and inference at scale.
What You’ll Do
- Custom Kernel Development: Design, implement, and optimize high-performance GPU kernels using C++, CUDA, ROCm, PTX, Triton, and/or JAX Pallas.
- Performance Optimization: Profile and optimize end-to-end ML workloads, with a focus on large-scale LLM training and inference.
- Framework Integration: Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and custom runtime systems.
- Bottleneck Analysis: Build performance models, identify compute and memory bottlenecks, and deliver kernel-level improvements that meaningfully accelerate AI workloads.
- Cross-Functional Collaboration: Work closely with ML researchers, distributed systems engineers, and model-serving teams to optimize performance across the stack.
- Hardware-Aware Engineering: Collaborate with hardware vendors and stay current with evolving GPU architectures, compilers, and toolchains.
- Tooling & Reliability: Contribute to benchmarking, testing, documentation, and tooling to ensure correctness, reproducibility, and sustained performance gains.
Ideal Candidate Profile
- 5+ years of experience in GPU kernel development, high-performance computing, or systems programming
- Bachelor’s, Master’s, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related field
- Strong programming skills in C++ and Python
- Deep expertise in CUDA and/or ROCm, GPU memory models, and performance optimization
- Hands-on experience with Triton and/or JAX Pallas for custom kernel development
- Strong understanding of PTX, GPU assembly, and low-level execution models
- Proven experience integrating custom kernels into PyTorch, JAX, or similar ML frameworks
- Experience working with large-scale LLM workloads (training or inference)
Nice to Have
- Experience optimizing for AMD GPUs and ROCm
- Familiarity with JAX FFI and custom ML operator development
- Experience with efficient inference or serving frameworks (e.g., vLLM, TensorRT)
- Exposure to TPUs, XLA, or other accelerator programming environments
- Contributions to open-source ML systems, compilers, or GPU kernel libraries
Benefits
- Medical, dental, and vision insurance
- 401(k) plan
- Daily meals and snacks
- Flexible paid time off
- Relocation assistance
- Stock options
- Competitive compensation and meaningful equity
Job Type: Full-time
Pay: $190,000.00–$250,000.00 per year
Work Location: In person
Salary Context
This $190K–$250K range is above the median for MLOps Engineer roles in our dataset (median: $201K across 79 roles with salary data).