Senior Machine Learning Engineer

$188K - $282K | Palo Alto, CA, US | Senior | AI/ML Engineer

Skills & Technologies

Azure, Instantly, Python, PyTorch, RLHF, Vertex AI

About This Role

About the Team \& Role:
===========================

We're building SAGE, Rubrik's Semantic AI Governance Engine, which is the first system designed to monitor, govern, and remediate autonomous AI agents in real time. SAGE powers Rubrik Agent Cloud: enterprises define governance policies in natural language, and SAGE's custom small language models act as judges on every agent action. These models are fast enough to sit in the live request path and accurate enough that customers trust them with allow/block decisions on production traffic.

At its core, SAGE is "LLM\-as\-judge" applied to AI governance, utilizing the same technique most teams use for offline evaluation but productionized for real\-time enforcement at enterprise scale. Our first\-generation SLM Policy Guard already outperforms the larger frontier models we've benchmarked against on accuracy while running approximately 5x faster on the same workload. We're hiring to push that lead even further.

As an Applied ML Engineer on the SAGE team, you'll work end\-to\-end across the model lifecycle: curating data, training small models, serving them at production latency, and closing the feedback loop with real customer signals. The models you build don't just enforce policies in the live request path; they will also drive Agent Rewind, Rubrik's capability to instantly and precisely undo destructive autonomous\-agent actions and restore the affected data to a trusted state.

We're a collaborative, applied team that ships models to enterprise customers within weeks, and we're passionate about proving that small, specialized models can outperform frontier LLMs at the problems that matter most for AI safety and governance.

Nature of the Specialized Duties
------------------------------------

### Training, Fine\-Tuning, and Distilling Production Small Language Models and Classifiers (25% of time)

  • Owning the full training lifecycle for the SLMs and classifiers in SAGE's real\-time enforcement path, including base\-model selection, supervised fine\-tuning, preference optimization (DPO/RLAIF), and distillation from frontier teacher models.
  • Training anomaly and action\-severity models that catch novel agent\-side attack patterns at real\-time decision latency, such as supply\-chain compromises or emergent destructive behaviors not covered by any explicit policy. Severity scores route the highest\-impact events to Agent Rewind for precise remediation.
  • Designing adversarial training pipelines like purpose\-built adversarial agents and automated red\-teams whose outputs feed directly into the next training run, turning every discovered weakness into a permanent model improvement.
  • Pushing the Pareto frontier of accuracy, latency, and cost for governance\-specific tasks through deliberate post\-training choices (LoRA, quantization\-aware training, distillation recipes, GRPO, etc.) and validating the wins on production traffic patterns.

### Engineering High\-Performance Model Serving and Inference Infrastructure (25% of time)

  • Designing multi\-stage inference pipelines that handle both real\-time enforcement (inline prompt, response, and tool\-call blocking) and high\-throughput batch workloads (offline scoring, back\-testing, corpus mining) while processing billions of tokens daily across Global 2000 customer agent fleets.
  • Optimizing live deployments through shared GPU pools, KV\-cache\-aware routing, continuous batching, FP8/INT8 quantization, and speculative decoding to minimize inference cost while holding sub\-second P99 SLOs.
  • Building serving\-layer infrastructure that lets SAGE block agent prompts, responses, and tool calls in real time without becoming a latency bottleneck. This includes model gateway design, request routing, and graceful degradation.
  • Owning canary, shadow, and A/B traffic patterns so new model variants are validated against live customer traffic before they take enforcement decisions.

### Building Synthetic Data Pipelines and Online \+ Offline Evaluation Frameworks (20% of time)

  • Designing automated data curation pipelines that mine live customer environments (with privacy and tenancy guarantees) for high\-value, per\-tenant training examples, such as long\-tail violations, near\-miss policy edges, or novel agent behaviors, and routing them back into the training loop for each customer.
  • Building automated policy back\-testing by replaying historical agent traffic against new model and policy versions to catch regressions and recommend policy improvements before customer\-visible deployment.
  • Building online evaluation systems for live model decisions, including shadow scoring, drift detection, calibration monitoring, and policy\-coverage gap analysis, ensuring quality regressions surface in minutes rather than weeks.
  • Generating synthetic data using frontier teachers (adversarial prompts, policy\-edge cases, multi\-turn interactions) with evaluation that confirms synthetic data improves downstream quality, not just dataset size.

### Insights Mining, Failure Diagnosis, and Adaptive Model Improvement (15% of time)

  • Building memory and context harnesses that fuse data sensitivity, identity, and historical agent behavior into real\-time enforcement decisions to ensure SAGE reasons from each customer's specific context.
  • Mining agent insights across millions of sessions to surface security gaps, which are then turned into new policy proposals, refinements to existing policies, and signals about upstream issues across the agent ecosystem (Google ADK, Azure AI Foundry, Vertex AI, and others).
  • Building feedback loops that turn production decisions, customer\-flagged false positives, and missed violations into one\-click natural\-language policy refinements to drive false\-positive rates down without sacrificing recall.
  • Diagnosing model failures end\-to\-end and distinguishing data, training\-recipe, architecture, and serving\-layer root causes so fixes land in the right layer the first time.

### Cross\-Functional Collaboration and Translating Customer Reality into Modeling Problems (15% of time)

  • Providing technical leadership on a pillar of the SAGE model stack (training infrastructure, eval methodology, serving architecture, or insights pipeline), mentoring engineers ramping into ML, and shaping the team's technical roadmap.
  • Partnering with Product Management, customer\-facing teams, and security analysts to translate customer agent\-governance requirements into well\-scoped modeling problems, and pushing back when ML is the wrong tool.
  • Communicating model behavior, tradeoffs, and limitations clearly to non\-ML stakeholders, such as product managers and enterprise security leaders, so model decisions are made with full context.
  • Collaborating with Agent Cloud platform, security engineering, and AI research teams to integrate new SLMs into the real\-time enforcement path with the right latency, observability, rollback, and tenancy guarantees.

Minimum Requirements for the Position
-----------------------------------------

Education: A Bachelor's degree (or higher) in Computer Science, Machine Learning, Computer Engineering, Statistics, or a closely related technical field is required. Designing production SLM training and serving systems requires a deep theoretical understanding of modern deep learning, optimization, and systems performance.

### Specialized Technical Knowledge:

  • 2\+ years of professional ML experience with demonstrable end\-to\-end production ownership; you have taken models from training to serving real customer traffic and stayed accountable for them through post\-launch iteration.
  • Proficiency in Python and PyTorch (or equivalent) for production\-grade training and evaluation.
  • Hands\-on experience training, fine\-tuning, or distilling language models or classifiers in a production setting, including SFT and at least one preference\-optimization technique (DPO, RLAIF, or RLHF).
  • Production experience with serving frameworks (vLLM, SGLang, TensorRT\-LLM, or equivalent), including optimization involving continuous batching, KV\-cache strategy, and inference\-time quantization.
  • Experience designing closed\-loop ML systems, including the eval, telemetry, data\-curation, and synthetic\-data infrastructure that turns production signals back into training data and the next model release. You have built (not just used) at least one such loop.
  • Comfort operating at production scale, including debugging models that handle high QPS in safety\-critical request paths where errors have customer\-visible consequences.

### Preferred Qualifications:

  • Deep background in AI safety and red\-teaming, including hands\-on experience with adversarial ML, prompt injection defense strategies, and automated evaluation suites for enterprise\-grade LLM safety.
  • Expertise in model evaluation methodology, specifically building "LLM\-as\-judge" pipelines, calibration monitoring, and adversarial benchmarks that surface the subtle failure modes static metrics often overlook.
  • Experience with context\-fusion and retrieval systems that synthesize disparate signals \- such as data sensitivity, user identity, and behavioral history \- into high\-fidelity model decisions.
  • Production experience with low\-latency inference for streaming or safety\-critical request paths where model throughput and P99 SLOs are paramount.
  • Mastery of label\-efficient training and data mining, utilizing weak supervision, active learning, and embedding\-based retrieval to surface the production examples that drive the most significant quality improvements.
  • Hands\-on knowledge distillation experience, successfully transferring capabilities from frontier teacher models to specialized, small\-scale student models for production serving.
  • Familiarity with the agentic ecosystem, including tool\-use frameworks, model gateway architectures (MCP, LiteLLM, or equivalent), and autonomous agent patterns.
  • Active open\-source contributions to mainstream ML training, serving, or evaluation libraries.

The minimum and maximum base salaries for this role are posted below; additionally, the role is eligible for bonus potential, equity and benefits. The range displayed reflects the minimum and maximum target for new hire salaries for the role based on U.S. location. Within the range, the salary offered will be determined by work location and additional factors, including job\-related skills, experience, and relevant education or training.

US Pay Range

$188,500 - $282,700 USD

Join Us in Securing and Accelerating the World's AI Transformation
----------------------------------------------------------------------

Rubrik (RBRK), the Security and AI Operations Company, leads at the intersection of data protection, cyber resilience, and enterprise AI acceleration. Rubrik Security Cloud delivers complete cyber resilience by securing, monitoring, and recovering data, identities, and workloads across clouds. Rubrik Agent Cloud accelerates trusted AI agent deployments at scale by monitoring and auditing agentic actions, enforcing real\-time guardrails, fine\-tuning for accuracy and undoing agentic mistakes.

Inclusion @ Rubrik
----------------------

At Rubrik, we are dedicated to fostering a culture where people from all backgrounds are valued, feel they belong, and believe they can succeed. Our commitment to inclusion is at the heart of our mission to secure the world’s data.

Our goal is to hire and promote the best talent, regardless of background. We continually review our hiring practices to ensure fairness and strive to create an environment where every employee has equal access to opportunities for growth and excellence. We believe in empowering everyone to bring their authentic selves to work and achieve their fullest potential.

### Our inclusion strategy focuses on three core areas of our business and culture:

  • Our Company: We are committed to building a merit\-based organization that offers equal access to growth and success for all employees globally. Your potential is limitless here.
  • Our Culture: We strive to create an inclusive atmosphere where individuals from all backgrounds feel a strong sense of belonging, can thrive, and do their best work. Your contributions help us innovate and break boundaries.
  • Our Communities: We are dedicated to expanding our engagement with the communities we operate in, creating opportunities for underrepresented talent and driving greater innovation for our clients. Your impact extends beyond Rubrik, contributing to safer and stronger communities.

Equal Opportunity Employer/Veterans/Disabled
------------------------------------------------

Rubrik is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.

Rubrik provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics. In addition to federal law requirements, Rubrik complies with applicable state and local laws governing nondiscrimination in employment in every location in which the company has facilities. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.

Federal law requires employers to provide reasonable accommodation to qualified individuals with disabilities. Please contact us at [email protected] if you require a reasonable accommodation to apply for a job or to perform your job. Examples of reasonable accommodation include making a change to the application process or work procedures, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment.

Salary Context

This $188K-$282K range is above the 75th percentile for AI/ML Engineer roles in our dataset (median: $180K across 2,130 roles with salary data).

Role Details

Company: Rubrik
Title: Senior Machine Learning Engineer
Location: Palo Alto, CA, US
Category: AI/ML Engineer
Experience: Senior
Salary: $188K - $282K
Remote: No

About This Role

AI/ML Engineers build and deploy machine learning models in production. They work across the full ML lifecycle: data pipelines, model training, evaluation, and serving infrastructure. The role has evolved significantly over the past two years. Where ML Engineers once spent most of their time on model architecture, the job now tilts heavily toward inference optimization, cost management, and integrating LLM capabilities into existing systems. Companies want engineers who can ship production systems, and the experimenter-only role is fading fast.

Day-to-day, you're writing training pipelines, debugging data quality issues, setting up evaluation frameworks, and figuring out why your model performs differently in staging than it did on your dev set. The best ML engineers are obsessive about reproducibility and measurement. They instrument everything. They know that a model is only as good as the data feeding it and the infrastructure serving it.

Across the 4,133 AI roles we're tracking, AI/ML Engineer positions make up 69% of the market. At Rubrik, this role fits into their broader AI and engineering organization.

Demand for AI/ML Engineers has been strong and consistent. Unlike some AI roles that spike with hype cycles, ML engineering is a foundational need. Every company deploying AI models needs people who can keep them running, and the gap between research prototypes and production systems keeps growing.

What the Work Looks Like

A typical week might include: debugging a data pipeline that's silently dropping 3% of training examples, running A/B tests on a new model version, writing documentation for a feature flag system that lets you roll back model deployments, and reviewing a junior engineer's PR for a new evaluation metric. Meetings tend to be cross-functional since ML touches product, engineering, and data teams.

Skills Required

Azure (24% of roles), Instantly, Python (51% of roles), PyTorch (16% of roles), RLHF (1% of roles), Vertex AI (4% of roles)

Python and PyTorch dominate the requirements. Most roles expect experience with cloud platforms (AWS, GCP, or Azure) and familiarity with ML frameworks like TensorFlow or JAX. RAG (Retrieval-Augmented Generation) has become a top-3 skill requirement as companies integrate LLMs into their products. Docker and Kubernetes show up in about a third of postings, reflecting the production focus of the role.

Beyond the core stack, employers increasingly want experience with experiment tracking tools (MLflow, Weights & Biases), feature stores, and vector databases. Fine-tuning experience is valuable but less common than you'd think from reading Twitter. Most production LLM work is RAG and prompt engineering, not fine-tuning. If you have both, you're in a strong position.

Companies that are serious about AI/ML hiring tend to post specific infrastructure details in the job description: the frameworks they use, their model serving stack, their data pipeline tools. Vague postings that just say 'ML experience required' without specifics are often companies that haven't figured out what they need yet.

Compensation Benchmarks

AI/ML Engineer roles pay a median of $185,000 based on 13,200 positions with disclosed compensation. Senior-level AI roles across all categories have a median of $227,400. This role's midpoint ($235K) sits 27% above the category median. Disclosed range: $188K to $282K.
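The midpoint and the 27% figure can be checked with a few lines of arithmetic. This is a minimal sketch using the rounded figures quoted above; the helper names `midpoint` and `pct_above` are illustrative, not part of any published methodology.

```python
# Rounded figures quoted in the paragraph above, in USD.
RANGE_LOW = 188_000
RANGE_HIGH = 282_000
CATEGORY_MEDIAN = 185_000


def midpoint(low, high):
    """Return the midpoint of a salary range."""
    return (low + high) / 2


def pct_above(value, baseline):
    """Return how far value sits above baseline, as a percentage."""
    return (value - baseline) / baseline * 100


mid = midpoint(RANGE_LOW, RANGE_HIGH)
print(f"midpoint: ${mid:,.0f}")
print(f"above category median: {pct_above(mid, CATEGORY_MEDIAN):.0f}%")
```

With the rounded inputs this gives a midpoint of $235,000, about 27% above the $185,000 category median, which matches the stated figures.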

Across all AI roles, the market median is $200,700. Top-quartile compensation starts at $254,000. The 90th percentile reaches $307,500. For comparison, the highest-paying categories include AI Safety ($274,200) and AI Engineering Manager ($268,700). By seniority level: Entry: $97,760; Mid: $165,778; Senior: $227,400; Director: $250,000; VP: $250,000.

Rubrik AI Hiring

Rubrik has 1 open AI role right now, in the AI/ML Engineer category, based in Palo Alto, CA, US. Compensation range: $188K - $282K.

Location Context

Across all AI roles, 14% (583 positions) offer remote work, while 3,532 require on-site attendance. Top AI hiring metros: New York (2,760 roles, $211,000 median); San Francisco (2,258 roles, $253,000 median); Los Angeles (1,841 roles, $195,000 median).

Career Path

Common paths into AI/ML Engineer roles include Data Scientist, Software Engineer, and Research Engineer.

From here, career progression typically leads toward ML Architect, AI Engineering Manager, or Principal ML Engineer.

The fastest path into ML engineering is through software engineering with a self-directed ML education. A CS degree helps, but production engineering skills matter more than academic credentials. Build something that works, deploy it, and measure it. That portfolio project is worth more than a Coursera certificate. For career growth, the fork comes around the senior level: go deep on technical complexity (staff/principal track) or move into managing ML teams.

What to Expect in Interviews

Expect system design questions around ML pipelines: how you'd build a training pipeline for a specific use case, handle data drift, or design A/B testing infrastructure for model deployments. Coding rounds typically involve Python, with emphasis on data manipulation (pandas, numpy) and algorithm implementation. Take-home assignments often ask you to build an end-to-end ML pipeline from raw data to deployed model.

AI Hiring Overview

The AI job market has 4,133 open positions tracked in our dataset. By seniority: 106 entry-level, 1,901 mid-level, 1,663 senior, and 463 leadership roles (Director, VP, C-Level). Remote roles make up 14% of the market (583 positions). The remaining 3,532 roles require on-site or hybrid attendance.

The market median for AI roles is $200,700. Top-quartile compensation starts at $254,000. The 90th percentile reaches $307,500. Highest-paying categories: AI Safety ($274,200 median, 57 roles); AI Engineering Manager ($268,700 median, 42 roles); Research Engineer ($260,000 median, 442 roles).

The AI Job Market Today

The AI job market spans 4,133 open positions across 15 role categories. The largest categories by volume: AI/ML Engineer (2,865), Data Scientist (339), AI Software Engineer (313). These three account for the majority of open positions, though smaller categories often have higher per-role compensation because of specialized skill requirements.

The seniority mix tells a story about where AI teams are in their maturity. Entry-level roles (106) are outnumbered by mid-level (1,901) and senior (1,663) positions, reflecting that most companies are past the 'build a team from scratch' phase and need experienced engineers who can ship production systems. Leadership roles (Director, VP, C-Level) total 463 positions, representing the bottleneck between technical execution and organizational strategy.

Remote work availability sits at 14% of all AI roles (583 positions), with 3,532 requiring on-site or hybrid attendance. The remote share has stabilized after the post-pandemic correction. Senior and specialized roles (Research Scientist, ML Architect) are more likely to be remote-eligible than entry-level positions, partly because experienced hires have more negotiating power and partly because these roles require less hands-on mentorship.

AI compensation is structured in clear tiers. The market median sits at $200,700. Top-quartile roles start at $254,000, and the 90th percentile reaches $307,500. These figures reflect base salary only, for roles with disclosed compensation. Total compensation (including equity, bonuses, and sign-on) runs 20-40% higher at companies that offer those components.

Category matters for compensation. AI Safety roles lead at $274,200 median, while Prompt Engineer roles sit at $140,000. The spread between highest and lowest-paying categories reflects the premium on specialized technical skills versus broader analytical roles.

The most in-demand skills across all AI postings: Python (2,128 postings), AWS (1,324 postings), Azure (1,003 postings), RAG (916 postings), GCP (817 postings), PyTorch (655 postings), Prompt Engineering (639 postings), Claude (571 postings). Python leads, appearing in about half of the 4,133 tracked role descriptions regardless of category. Cloud platform experience (AWS, GCP, Azure) is the second most common requirement. The newer entrants to the top skills list (RAG, vector databases, LLM APIs) reflect the shift from traditional ML toward generative AI applications.

Frequently Asked Questions

Based on 13,200 roles with disclosed compensation, the median salary for AI/ML Engineer positions is $185,000. Actual compensation varies by seniority, location, and company stage.
About 14% of the 4,133 AI roles we track offer remote work. Remote availability varies by company and seniority level, with senior and leadership roles more likely to offer location flexibility.
Rubrik is among the companies actively hiring for AI and ML talent. Check our company profiles for detailed breakdowns of open roles, salary ranges, and hiring trends.
Common next steps from AI/ML Engineer positions include ML Architect, AI Engineering Manager, and Principal ML Engineer. Progression depends on whether you lean toward technical depth, people management, or product strategy.
