Overview
A well-funded AI infrastructure company is building a next-generation model serving platform to power real-time, multimodal foundation models at scale. The team is looking for a Lead Software Engineer to help architect, build, and lead the core systems that enable low-latency, high-throughput AI inference in production.
This is a senior, hands-on leadership role at the center of the AI stack, spanning GPU execution, distributed inference, scheduling, and developer-facing APIs. You’ll both build critical components yourself and guide other engineers, shaping technical direction, standards, and execution quality.
What You’ll Do
- Technical Leadership: Own the architecture and technical direction of the model serving platform, guiding design decisions and execution across the team.
- Core Serving Systems: Build high-performance inference systems, including execution runtimes, batching, scheduling, and distributed serving.
- Performance-Critical Engineering: Develop optimized components in C++ and CUDA/HIP, including memory-efficient runtimes and custom GPU kernels where needed.
- Research-to-Production: Partner closely with ML researchers to productionize new multimodal models with strict latency, reliability, and scalability requirements.
- APIs & Services: Build Python APIs and backend services that expose model capabilities to downstream products and applications.
- Mentorship & Quality: Mentor engineers through code reviews, design discussions, and hands-on technical guidance while driving engineering best practices.
- Observability & Reliability: Lead profiling, benchmarking, monitoring, and troubleshooting across GPU, runtime, and service layers.
Ideal Candidate Profile
- Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
- 5+ years of experience building scalable backend systems or distributed infrastructure
- Strong understanding of LLM inference mechanics (prefill vs. decode, batching strategies, KV cache)
- Experience with Kubernetes, Ray, and containerized systems
- Strong proficiency in C++ and Python
- Deep debugging, profiling, and performance optimization skills at the systems level
- Ability to translate research or runtime requirements into production-grade systems
- Strong communication skills and experience leading technical discussions and mentoring engineers
- Comfortable working on-site in a fast-moving, high-ownership environment
Nice to Have
- Experience with ML systems engineering or distributed GPU scheduling
- Familiarity with open-source inference engines (e.g., vLLM, SGLang, TRT-LLM)
- Experience building large-scale ML or MLOps infrastructure
- Proficiency with CUDA or ROCm and GPU profiling tools
- Background at an AI/ML startup, research lab, or large-scale infrastructure team
- Familiarity with multimodal models or efficient inference techniques
- Contributions to open-source ML, systems, or HPC infrastructure
Job Type: Full-time
Pay: $230,000.00 - $300,000.00 per year
Benefits:
- 401(k)
- Dental insurance
- Flexible schedule
- Health insurance
- Paid time off
- Relocation assistance
- Vision insurance
Work Location: In person
Salary Context
This $230K-$300K range is above the 75th percentile for MLOps Engineer roles in our dataset (median: $201K across 79 roles with salary data).