AI-Powered Test Automation: 30% Faster Fixes at Salesforce

AI-powered Test Failure Triage Agents

Aug 25, 2025

TL;DR

Salesforce’s Platform Quality Engineering team built an AI-powered Test Failure (TF) Triage Agent to handle 150K+ monthly test failures across 6M daily tests. By pairing FAISS-based semantic search with LLM reasoning and combining AI insights with historical fix data, they reduced failure resolution time by 30% and scaled from a 20-person pilot to 500+ engineers.

150K Failures and Developer Burnout

Imagine shipping code in an environment with 6M tests running daily across 78B test combinations.

Before this project, engineers burned hours sifting through logs, changelogs, and failure patterns to answer basic questions:

Is this a flaky test?
Did another team’s change break it?
Should I wait, retry, or fix?

With 30K engineers pushing code, failures piled up faster than teams could triage them. Average resolution time? Seven days. Developer trust? Low. Burnout? Rising.

AI With Context, Not Guesswork

The team didn’t just drop in a generic LLM and hope for the best.

They built asynchronous pipelines to process test failure data in real time without slowing CI/CD. At the core:

FAISS-based semantic search over historical test failures for sub-30s lookups
Contextual embeddings of stack traces, code snippets, and changelists
LLM reasoning layered on top to narrow down the likely fixes
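
To make the retrieval step in the list above concrete, here is a minimal sketch of similarity search over past failures. It is not Salesforce's implementation: it uses a toy hashed bag-of-words embedding and a brute-force cosine scan in plain Python as a stand-in for learned embeddings and a FAISS index, and the names embed, FailureIndex, and DIM are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical embedding size for this toy example


def embed(text: str) -> list[float]:
    """Toy hashed bag-of-words embedding, L2-normalized.

    A real system would use a learned embedding model instead.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[A-Za-z_]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class FailureIndex:
    """Brute-force cosine search; a stand-in for a real vector index."""

    def __init__(self) -> None:
        self._vectors: list[list[float]] = []
        self._records: list[dict] = []

    def add(self, failure_text: str, fix_note: str) -> None:
        """Store a historical failure together with the fix that resolved it."""
        self._vectors.append(embed(failure_text))
        self._records.append({"failure": failure_text, "fix": fix_note})

    def search(self, failure_text: str, k: int = 3) -> list[tuple[float, dict]]:
        """Return the k most similar past failures as (score, record) pairs."""
        query = embed(failure_text)
        scored = [
            # Vectors are unit length, so the dot product is the cosine similarity.
            (sum(q * v for q, v in zip(query, vec)), rec)
            for vec, rec in zip(self._vectors, self._records)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

The brute-force scan is linear in the number of stored failures, which is fine for a demo. At the volumes described here, a dedicated vector index is what keeps lookups fast.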

Instead of leaving developers to ask, “Why did this fail?”, the system surfaced the exact file, feature, and recent changes involved. Developers saw context-driven suggestions tied directly to past fixes, avoiding the usual AI “hallucination” problem.
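
One way to picture "suggestions tied directly to past fixes" is a prompt that contains only retrieved evidence. The sketch below is an assumption about how such grounding could look, not the team's actual prompt. PastFix, build_triage_prompt, and the 0.5 similarity threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PastFix:
    failure_summary: str
    fix_summary: str
    similarity: float  # 0.0 to 1.0, as returned by the retrieval step


def build_triage_prompt(
    test_name: str,
    stack_trace: str,
    recent_changes: list[str],
    past_fixes: list[PastFix],
    min_similarity: float = 0.5,  # hypothetical cutoff for "relevant enough"
) -> str:
    """Assemble an LLM prompt grounded in retrieved context only."""
    # Drop weak matches so the model is not nudged toward unrelated fixes.
    relevant = [fix for fix in past_fixes if fix.similarity >= min_similarity]
    lines = [
        "You are triaging a failed test. Use only the context below.",
        f"Test: {test_name}",
        "Stack trace:",
        stack_trace.strip(),
        "Recent changes touching this area:",
    ]
    lines += [f"- {change}" for change in recent_changes] or ["- (none found)"]
    lines.append("Similar past failures and how they were fixed:")
    if relevant:
        lines += [
            f"- [{fix.similarity:.2f}] {fix.failure_summary} -> {fix.fix_summary}"
            for fix in relevant
        ]
    else:
        lines.append("- (none above threshold; say so instead of guessing)")
    lines.append(
        "Answer with: flaky / caused by a listed change / needs a fix, "
        "and cite the evidence used."
    )
    return "\n".join(lines)
```

The design choice worth noting is the explicit fallback line: when retrieval finds nothing similar, the prompt tells the model to say so, which is one plausible way to keep suggestions anchored to real history.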

Building Trust: Incremental Rollout & Human-in-the-Loop

AI tools live or die by developer trust. Too many false positives, and people revert to manual debugging.

Salesforce started small:

20-person pilot → measured accuracy & adoption
Focused scrum teams → validated improvements
500+ engineers → full AI Application Development Cloud rollout

Developers reported the highest confidence in features that surfaced the most likely changelist causing the break, letting them skip log-hunting entirely.
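
The post does not say how the likely changelist is chosen. As a purely illustrative heuristic, one could rank recent changelists by how many files they share with the failing stack trace. Changelist, files_in_trace, and rank_changelists are hypothetical names, and a production system would presumably weigh many more signals.

```python
import re
from dataclasses import dataclass


@dataclass
class Changelist:
    cl_id: str
    files: list[str]  # paths touched by the change


def files_in_trace(stack_trace: str) -> set[str]:
    """Extract file names such as 'AccountService.java' from a stack trace."""
    return set(re.findall(r"\b[\w-]+\.(?:java|py|ts|js)\b", stack_trace))


def rank_changelists(
    stack_trace: str, changelists: list[Changelist]
) -> list[tuple[int, str]]:
    """Rank changelists by the number of touched files that appear in the trace.

    Returns (overlap_count, cl_id) pairs, highest overlap first.
    Changelists with no overlap are omitted.
    """
    trace_files = files_in_trace(stack_trace)
    scored = []
    for cl in changelists:
        # Compare on base names, since traces rarely contain full repo paths.
        touched = {path.rsplit("/", 1)[-1] for path in cl.files}
        overlap = trace_files & touched
        if overlap:
            scored.append((len(overlap), cl.cl_id))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

Even a crude ranking like this shows why the feature saves time: the developer starts from a short list of suspect changes instead of raw logs.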

The Results: 30% Faster Resolution Times
