Reliability & Eval
16 articles
Action Caching & Replay Pattern (emerging)
LLM-based agent execution is expensive (in both cost and latency) and non-deterministic. Running the same workflow multiple times yields different results and incurs repeated LLM costs. This creates…
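A minimal sketch of the replay mechanic, assuming a caller-supplied `llm_fn` and an illustrative on-disk cache layout: identical calls are served from disk, so reruns become deterministic and free.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".agent_cache")  # illustrative cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached_llm_call(prompt: str, model: str, llm_fn):
    """Replay a recorded response when this exact call has been seen before."""
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, "model": model}, sort_keys=True).encode()
    ).hexdigest()
    entry = CACHE_DIR / f"{key}.json"
    if entry.exists():
        # Replay path: zero cost, zero latency variance, fully deterministic.
        return json.loads(entry.read_text())["response"]
    response = llm_fn(prompt, model)  # live call: expensive and non-deterministic
    entry.write_text(json.dumps({"response": response}))
    return response
```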
Adaptive Sandbox Fan-Out Controller (emerging)
Parallel sandboxes are intoxicating: you can spawn 10... 100... 1000 runs. But two things break quickly:

1. **Diminishing returns:** After some N, you're mostly paying for redundant failures or near-…
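One way to implement the cutoff, sketched under two assumptions: `run_once` launches a single sandboxed attempt and returns a result with an `ok` flag, and sequential batches stand in for genuinely parallel dispatch. Fan-out stops growing once a batch's marginal success rate falls below a floor.

```python
def adaptive_fan_out(run_once, batch_size=8, max_runs=128, min_gain=0.1):
    """Grow the number of sandbox runs in batches; stop at diminishing returns."""
    results = []
    while len(results) < max_runs:
        batch = [run_once() for _ in range(batch_size)]  # dispatch in parallel in practice
        results.extend(batch)
        marginal = sum(1 for r in batch if r.ok) / batch_size
        if marginal < min_gain:
            break  # mostly redundant failures or near-duplicates: stop paying
    return results
```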
Anti-Reward-Hacking Grader Design (emerging)
During reinforcement learning training, models actively search for ways to maximize reward. If your grader has edge cases or loopholes, the model will find and exploit them:

- **Gaming the metric**: …
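A sketch of one hardening tactic, assuming a pytest-based task and a pre-registered digest of the test suite: any run that touches the tests scores a hard zero, so rewriting the grader's own checks never pays off.

```python
import hashlib
import subprocess
from pathlib import Path

def tests_digest(repo: Path) -> str:
    """Fingerprint the test suite so tampering is detectable."""
    files = sorted((repo / "tests").rglob("*.py"))
    return hashlib.sha256(b"".join(p.read_bytes() for p in files)).hexdigest()

def grade(repo: Path, expected_digest: str) -> float:
    if tests_digest(repo) != expected_digest:
        return 0.0  # gaming attempt: the model edited the checks themselves
    proc = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return 1.0 if proc.returncode == 0 else 0.0
```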
Asynchronous Coding Agent Pipeline (proposed)
Synchronous execution of coding tasks—where the agent must wait for compilation, testing, linting, or static analysis—creates **compute bubbles** and **idle resources**. When a coding agent issues a t…
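A minimal asyncio sketch of the shape: slow checks run as background subprocesses while the agent keeps working, and results are joined only when needed. `plan_next_edit` is a hypothetical stand-in for the agent's next LLM-planned action.

```python
import asyncio

async def run_check(*cmd: str) -> int:
    proc = await asyncio.create_subprocess_exec(*cmd)
    return await proc.wait()

def plan_next_edit() -> str:
    return "refactor parser"  # hypothetical: the agent's next planned action

async def agent_step():
    # Fire off slow validation without blocking the agent's next action.
    tests = asyncio.create_task(run_check("pytest", "-q"))
    lint = asyncio.create_task(run_check("ruff", "check", "."))
    edit = plan_next_edit()  # useful work fills the compute bubble
    test_rc, lint_rc = await asyncio.gather(tests, lint)  # join before committing
    return edit, test_rc, lint_rc

# Usage: asyncio.run(agent_step())
```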
CriticGPT-Style Code Review (validated-in-production)
As AI-generated code becomes more sophisticated, it becomes increasingly difficult for human reviewers to catch subtle bugs, security issues, or quality problems. Traditional code review processes may…
Extended Coherence Work Sessions (rapidly-improving)
Early AI agents and models often suffered from a short "coherence window," meaning they could only maintain focus and context for a few minutes before their performance degraded significantly (e.g., l…
Failover-Aware Model Fallback (validated-in-production)
AI model requests fail for varied and often opaque reasons. Simple retry logic fails to distinguish between:

- **Transient failures** (timeouts, rate limits) that benefit from retry with backoff
- **…
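A compact sketch of the dispatch logic. Classifying failures by exception message is purely illustrative; real providers expose typed errors and status codes worth matching on instead.

```python
import time

TRANSIENT_MARKERS = ("timeout", "rate limit", "overloaded", "503")

def call_with_fallback(prompt: str, models: list[str], call, max_retries: int = 3):
    """Retry transient failures with backoff; fail over to the next model otherwise."""
    for model in models:
        for attempt in range(max_retries):
            try:
                return call(model, prompt)
            except Exception as exc:
                if any(m in str(exc).lower() for m in TRANSIENT_MARKERS):
                    time.sleep(2 ** attempt)  # transient: back off, retry same model
                else:
                    break  # permanent for this model: fail over to the next one
    raise RuntimeError("all models in the fallback chain failed")
```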
Lethal Trifecta Threat Model (best-practice)
Combining three agent capabilities—

1. **Access to private data**
2. **Exposure to untrusted content**
3. **Ability to externally communicate**

—creates a straightforward path for prompt-injection at…
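The model is easy to enforce mechanically. A sketch, assuming a hypothetical tool registry where each tool is tagged with the capabilities it grants: any session that would light up all three legs is refused.

```python
# Hypothetical capability tags for each tool an agent might be granted.
TOOL_CAPS = {
    "read_inbox": {"private_data"},
    "fetch_url":  {"untrusted_content"},
    "send_http":  {"external_comms"},
}

TRIFECTA = {"private_data", "untrusted_content", "external_comms"}

def violates_trifecta(enabled_tools: list[str]) -> bool:
    granted = set().union(*(TOOL_CAPS[t] for t in enabled_tools))
    return TRIFECTA <= granted

assert violates_trifecta(["read_inbox", "fetch_url", "send_http"])
assert not violates_trifecta(["read_inbox", "fetch_url"])  # drop one leg, break the path
```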
LLM Observability (proposed)
Agents introduce **non-determinism**—the same input can produce different outputs. When agents do something sub-optimal, users flag it as a "bug" even if it's just prompt ambiguity. Debugging these is…
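A bare-bones sketch of the first step: wrap every LLM call so its inputs, outputs, latency, and errors land in a structured trace. `print` stands in for whatever trace sink (OpenTelemetry, a vendor, a log file) is actually in use.

```python
import functools
import json
import time
import uuid

def traced(fn):
    """Emit a structured span for every call: inputs, output, latency, errors."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"span_id": uuid.uuid4().hex, "fn": fn.__name__,
                "input": repr((args, kwargs))[:200]}
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            span["output"] = repr(result)[:200]
            return result
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["latency_s"] = round(time.monotonic() - start, 3)
            print(json.dumps(span))  # stand-in for a real trace sink
    return wrapper
```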
Merged Code + Language Skill Model (emerging)
Building a **unified model** that excels both at **natural language tasks** (e.g., summarization, documentation generation) and **code generation/reasoning** typically requires a massive centralized t…
No-Token-Limit Magic (experimental-but-awesome)
Aggressive prompt compression to save tokens stifles reasoning depth and self-correction.
RLAIF (Reinforcement Learning from AI Feedback) (emerging)
Traditional Reinforcement Learning from Human Feedback (RLHF) requires extensive human annotation for preference data, which is expensive (often $1+ per annotation), time-consuming, and difficult to s…
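The core move is swapping the human annotator for a judge model. A toy sketch, where `judge` is any callable LLM: pairwise comparisons become preference labels at near-zero marginal cost, which can then feed a reward model or DPO-style training.

```python
def ai_preference(judge, prompt: str, response_a: str, response_b: str) -> int:
    """Return 0 if the judge model prefers A, 1 if it prefers B."""
    verdict = judge(
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1
```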
Schema Validation Retry with Cross-Step Learning (emerging)
LLMs don't always produce valid structured output matching the expected schema. Single-attempt validation leads to task failures even when retry would succeed. The issues compound in multi-step workf…
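A sketch of the retry loop using pydantic for validation, with `call_llm` as a hypothetical model client: validation errors are fed back into the next attempt, and the accumulated hints carry over to later steps instead of being rediscovered each time.

```python
from pydantic import BaseModel, ValidationError

class StepOutput(BaseModel):
    action: str
    confidence: float

learned_hints: list[str] = []  # validation lessons carried across workflow steps

def parse_with_retry(call_llm, prompt: str, retries: int = 2) -> StepOutput:
    suffix = " ".join(learned_hints)
    for _ in range(retries + 1):
        raw = call_llm(f"{prompt}\n{suffix}".strip())
        try:
            return StepOutput.model_validate_json(raw)
        except ValidationError as exc:
            # Feed the validator's complaint back in, and keep it for later steps.
            suffix = f"Previous output was invalid ({exc.errors()[0]['msg']}). Return strict JSON only."
            learned_hints.append(suffix)
    raise ValueError("schema validation failed after retries")
```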
Structured Output Specification (established)
Free-form agent outputs are difficult to validate, parse, and integrate with downstream systems. When agents return unstructured text, you face:

- Unpredictable output formats requiring complex parsi…
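A minimal illustration with pydantic (the model class and field names are invented for the example): the schema travels with the request so the model sees the exact contract, and the response is validated on the way back so downstream code never parses free-form text.

```python
from pydantic import BaseModel

class ReviewFinding(BaseModel):
    file: str
    line: int
    severity: str  # e.g. "low" | "medium" | "high"
    message: str

# Sent alongside the prompt so the model knows the exact output contract.
request_schema = ReviewFinding.model_json_schema()

# Validated on the way back; downstream code only ever sees typed objects.
finding = ReviewFinding.model_validate_json(
    '{"file": "app.py", "line": 3, "severity": "low", "message": "unused import"}'
)
```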
Versioned Constitution Governance (emerging)
When an agent rewrites its own "constitution," it may accidentally violate safety or regress on alignment objectives if changes aren't reviewed.
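One sketch of the review gate, assuming approvals are recorded out-of-band as digests of the exact diff: the agent may propose rewrites of its constitution, but nothing takes effect until a reviewer has signed off on that specific change.

```python
import difflib
import hashlib

def apply_constitution_update(current: str, proposed: str, approved: set[str]) -> str:
    """Self-edits take effect only after a reviewer approves this exact diff."""
    diff = "\n".join(difflib.unified_diff(current.splitlines(), proposed.splitlines()))
    digest = hashlib.sha256(diff.encode()).hexdigest()
    if digest not in approved:
        raise PermissionError(f"constitution change {digest[:8]} is awaiting review")
    return proposed  # keep `current` in version history for rollback
```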
Workflow Evals with Mocked Tools (emerging)
Unit tests, linters, and typecheckers validate individual components but don't test agent workflows end-to-end. It's easy to create prompts that don't work well despite all underlying pieces being cor…
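A sketch using `unittest.mock`, with `run_agent` as a hypothetical entry point for the agent loop: the tools return canned data, so the eval exercises the prompt and orchestration end-to-end without real side effects.

```python
from unittest.mock import MagicMock

def test_refund_workflow_end_to_end():
    tools = {
        "lookup_order": MagicMock(return_value={"id": "o1", "status": "delivered"}),
        "issue_refund": MagicMock(return_value={"ok": True}),
    }
    # run_agent is a hypothetical entry point: prompt in, tool-using loop, text out.
    result = run_agent("Please refund order o1", tools)
    tools["issue_refund"].assert_called_once()  # checks the workflow, not just the pieces
    assert "refund" in result.lower()
```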