February 28, 2026

Four AI Agents for a Team of Eight: What Worked and What Did Not

Written with Claude

By mid-2024, the support team I led at RealPage was handling a growing volume of technical cases across four product domains. Eight analysts. Three SLA tiers. A quality review process that required reading every case resolution before it closed.

The quality review was the bottleneck. Reading a case resolution, checking it against the quality rubric, flagging issues, and sending feedback — that was 15 to 20 minutes per case, and with the volume we were running, it was consuming analyst capacity that should have been going toward actual case work.

The obvious answer was to automate parts of the review. The less obvious part was figuring out which parts were actually automatable without degrading quality — and being honest about the parts that were not.

Over six months, I built four AI agents using LLM APIs. Here is an account of each one.


Agent 1: The Resolution Writer Coach

The problem it solved:

Client-facing case resolutions were inconsistent. Some analysts wrote technical resolutions that were accurate but difficult for non-technical clients to understand. Others wrote empathetic resolutions that missed the key information. Quality reviews frequently flagged tone or clarity issues that had nothing to do with whether the analyst had actually solved the problem.

What it did:

Given a draft resolution, the agent rewrote it in plain English following a specific structure: acknowledgment of the issue, explanation of what was found, explanation of what was done, confirmation that the issue is resolved, and next steps if any.

The agent did not evaluate whether the resolution was technically correct. That remained a human judgment. It only reformatted and reworded.

What worked:

This was the most successful of the four agents. Clarity issues in quality reviews dropped significantly. Newer analysts used it as a learning tool — comparing their draft to the rewrite to understand what "better" looked like.

Two interns who used this consistently during their tenure were converted to full-time roles partly on the strength of their communication quality.

What did not work:

The agent occasionally over-softened resolutions. In cases where the correct answer was "this is working as designed and we will not be changing it," the rewrite sometimes buried that conclusion in polite language that left clients expecting a different outcome. We added a prompt instruction to preserve the core decision clearly, which helped, but this remained a category to watch.
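As a rough illustration of the rewrite structure and the "preserve the core decision" instruction described above, a prompt builder might look like the sketch below. The five section names come from this post. The function name, the constant, and the exact rule wording are assumptions made for the example and are not the prompt we actually ran.

```python
# Hypothetical prompt builder. Only the five section names are from the post;
# the function name and rule wording are illustrative assumptions.
RESOLUTION_SECTIONS = [
    "Acknowledgment of the issue",
    "What was found",
    "What was done",
    "Confirmation that the issue is resolved",
    "Next steps, if any",
]


def build_rewrite_prompt(draft: str) -> str:
    """Assemble a rewrite prompt that fixes the structure and protects the decision."""
    numbered = "\n".join(
        f"{i}. {name}" for i, name in enumerate(RESOLUTION_SECTIONS, start=1)
    )
    return (
        "Rewrite the draft case resolution below in plain English.\n"
        "Use exactly this structure:\n"
        f"{numbered}\n\n"
        "Rules:\n"
        "- Do not add, remove, or change technical facts.\n"
        "- State the core decision clearly, even when it is unwelcome.\n"
        "- Do not soften the decision into wording that implies a different outcome.\n\n"
        f"Draft:\n{draft.strip()}\n"
    )
```

The returned string would be sent to whichever LLM API is in use. Keeping the rules in one place made it easier to adjust the wording when a new failure category showed up.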


Agent 2: The Case Triage Assistant

The problem it solved:

Incoming cases needed to be categorized by product domain, priority tier, and routing destination. A rotating triage analyst did this manually, which was time-consuming and inconsistent.

What it did:

Given the case title, initial description, and client tier, the agent assigned a domain category and a recommended SLA tier, with a confidence score. Cases above 90% confidence were auto-routed. Cases below that threshold went to the human triage analyst.
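The routing rule is simple enough to sketch in a few lines. The 90% threshold is from the description above. The class, field names, and queue labels are hypothetical, and the sketch assumes a case at exactly 90% goes to the human queue.

```python
from dataclasses import dataclass

# From the post: cases above 90% confidence are auto-routed.
AUTO_ROUTE_THRESHOLD = 0.90


@dataclass
class TriageResult:
    """Hypothetical shape of the agent's output for one case."""
    domain: str        # recommended product domain
    sla_tier: str      # recommended SLA tier
    confidence: float  # 0.0 to 1.0


def route(result: TriageResult) -> str:
    """Return the queue a case should go to."""
    if result.confidence > AUTO_ROUTE_THRESHOLD:
        return f"auto:{result.domain}"
    # Everything else, including exactly 90% (an assumption), goes to a human.
    return "human_triage"
```

For example, `route(TriageResult("leasing", "tier-1", 0.95))` returns `"auto:leasing"`, while the same case at 0.62 confidence returns `"human_triage"`.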

What worked:

High-confidence cases — which were roughly 70% of volume — were routed correctly at a rate that matched the human baseline. The triage analyst's time was freed up for the ambiguous 30%, which is where human judgment actually added value.

What did not work:

The agent struggled with cases that straddled multiple domains — a lease issue that was also a reporting issue, or a payment case with a maintenance workflow component. Confidence scores for these were appropriately low, which meant they hit the human queue, but the agent's suggested categorization sometimes anchored the analyst toward the wrong domain anyway. We learned to show the full confidence breakdown rather than just the top recommendation.
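Showing the full breakdown instead of a single label can be as simple as the sketch below. It assumes the agent returns a score per domain. The function name and the domain names in the usage note are illustrative only.

```python
def format_breakdown(scores: dict[str, float]) -> str:
    """Render every domain score, highest first, so a close second is visible."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(f"{domain}: {score:.0%}" for domain, score in ranked)
```

Called with `{"leasing": 0.45, "reporting": 0.40, "payments": 0.15}`, this lists leasing at 45% directly above reporting at 40%. That makes it obvious to the triage analyst that the top recommendation is barely ahead of the runner-up.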


Agent 3: The Quality Checklist Agent

The problem it solved:

Quality reviews followed a 14-point rubric: correct categorization, accurate root cause identification, client communication quality, SLA compliance, documentation completeness, and so on. Checking all 14 points per case was the time-intensive part of the review process.

What it did:

Given the completed case — including the resolution, communication history, and metadata — the agent checked each rubric item and returned a pass/fail with a brief rationale for any failure.

What worked:

Mechanical checks — SLA compliance, documentation fields populated, required tags applied — were handled reliably. These alone saved several minutes per case.

What did not work:

Judgment-based checklist items were unreliable. "Was the root cause accurately identified?" requires understanding the technical problem, the platform, and the client's context. The agent frequently passed cases on this dimension that a human reviewer would have flagged. We stopped using the agent's judgment on those items and kept human review for anything requiring contextual understanding.

The honest summary: this agent was good at checking that things were done, poor at checking whether things were done correctly.
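One way to encode that split is to run the mechanical checks as plain code and mark judgment items for a human rather than asking the agent to score them. The sketch below assumes a case is a dictionary. Every field name and rubric key here is hypothetical, and only three of the mechanical items are shown.

```python
from typing import Callable


def sla_met(case: dict) -> bool:
    """SLA compliance: resolved within the allowed hours."""
    return case["resolved_hours"] <= case["sla_hours"]


def docs_complete(case: dict) -> bool:
    """Documentation completeness: required fields are populated."""
    return all(case.get(field) for field in ("root_cause", "resolution", "category"))


def tags_applied(case: dict) -> bool:
    """Required tags: at least one tag is present."""
    return bool(case.get("tags"))


# Items a rule can decide on its own.
MECHANICAL_CHECKS: dict[str, Callable[[dict], bool]] = {
    "sla_compliance": sla_met,
    "documentation_complete": docs_complete,
    "required_tags": tags_applied,
}

# Items that need contextual understanding stay with a human reviewer.
JUDGMENT_ITEMS = ["root_cause_accuracy"]


def review(case: dict) -> dict[str, str]:
    """Pass/fail the mechanical items and flag judgment items for a human."""
    report = {
        name: "pass" if check(case) else "fail"
        for name, check in MECHANICAL_CHECKS.items()
    }
    for item in JUDGMENT_ITEMS:
        report[item] = "needs_human_review"
    return report
```

The point of the design is that the report never claims a pass on an item the automation cannot actually judge.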


Agent 4: The Internal Knowledge Assistant

The problem it solved:

Analysts spent significant time searching internal documentation for product behavior specifications, known issues, and troubleshooting procedures. The documentation was extensive and not well-organized.

What it did:

It was a retrieval-augmented agent that searched an index of internal documentation: product specs, known issue logs, troubleshooting guides, and historical case resolutions for common issues. Analysts could ask questions in plain language and get referenced answers with source links.

What worked:

For well-documented issues, retrieval quality was good and the time saving was real. New analysts used this heavily during onboarding and ramped up faster than previous cohorts.

What did not work:

Documentation quality determined agent quality. Where our internal docs were outdated or ambiguous — which was common for edge cases — the agent confidently returned outdated information. We added a staleness warning for documents older than 90 days, which helped, but the underlying documentation debt remained a problem the agent surfaced rather than solved.
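The staleness warning amounts to a date comparison attached to each retrieved source. A minimal sketch follows, assuming each document carries a last-updated date. The function name and the message wording are made up for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# From the post: warn on documents older than 90 days.
STALE_AFTER = timedelta(days=90)


def staleness_warning(last_updated: date, today: date) -> Optional[str]:
    """Return a warning string for stale documents, or None if still fresh."""
    age = today - last_updated
    if age > STALE_AFTER:
        return (
            f"Warning: this document was last updated {age.days} days ago "
            "and may be outdated."
        )
    return None
```

Passing `today` in explicitly rather than calling `date.today()` inside the function keeps the check easy to test. The warning only tells the analyst to double-check; it does not make the underlying document any more accurate.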


The Overall Result

With all four agents in consistent use by eight analysts, manual review time dropped by roughly 70%. Quality scores, measured on the same 14-point rubric, held at 100%.

The 70% reduction came almost entirely from automating the mechanical parts of the workflow. The quality score held because the judgment-intensive parts stayed with humans.

That is the pattern I would generalize: AI agents are good at repetitive, rule-based tasks with clear right answers. They are unreliable at tasks that require contextual judgment, domain depth, or understanding of things that are not in their training data. The teams that get value from these tools are the ones who are honest about that boundary and design accordingly.

The agents did not replace any analysts. They made the team faster at the parts of the job that did not require human judgment — which freed up more time for the parts that did.