Reliability & Eval

16 articles

Action Caching & Replay Pattern

LLM-based agent execution is expensive (in both cost and latency) and non-deterministic. Running the same workflow multiple times yields different results and incurs repeated LLM costs. This creates a strong case for caching agent actions and replaying them deterministically.

emerging
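A minimal sketch of the idea, keying each action by a hash of its inputs and replaying the recorded result on later runs. The `run_fn` callable stands in for whatever actually invokes the LLM; the on-disk JSON cache is just one possible store.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".action_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(action: str, inputs: dict) -> str:
    # Deterministic key: hash the action name plus canonicalized inputs.
    payload = json.dumps({"action": action, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_action(action: str, inputs: dict, run_fn):
    """Replay a cached result if present; otherwise run once and record it."""
    path = CACHE_DIR / f"{cache_key(action, inputs)}.json"
    if path.exists():
        # Replay: no LLM call, no cost, no nondeterminism.
        return json.loads(path.read_text())["result"]
    result = run_fn(action, inputs)  # the expensive, nondeterministic call
    path.write_text(json.dumps({"action": action, "inputs": inputs, "result": result}))
    return result
```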

Adaptive Sandbox Fan-Out Controller

Parallel sandboxes are intoxicating: you can spawn 10... 100... 1000 runs. But two things break quickly: 1. **Diminishing returns:** After some N, you're mostly paying for redundant failures or near-duplicate results.

emerging
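One way to implement the controller: grow the fan-out in batches and stop once the marginal rate of *new* distinct successes falls below a threshold. `run_in_sandbox` is a placeholder for a sandboxed attempt, assumed here to return a dict with a `success` flag and a dedup `fingerprint`.

```python
def adaptive_fan_out(task, run_in_sandbox, max_runs=1000, batch=10, min_gain=0.05):
    """Spawn sandboxes in batches; stop when new distinct successes dry up."""
    distinct = set()
    launched = 0
    while launched < max_runs:
        before = len(distinct)
        for _ in range(batch):
            result = run_in_sandbox(task)            # one sandboxed attempt
            if result.get("success"):
                distinct.add(result["fingerprint"])  # dedup near-identical wins
        launched += batch
        gain = (len(distinct) - before) / batch
        if gain < min_gain:
            break  # diminishing returns: mostly redundant outcomes now
    return distinct, launched
```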

Anti-Reward-Hacking Grader Design

During reinforcement learning training, models actively search for ways to maximize reward. If your grader has edge cases or loopholes, the model will find and exploit them:

- **Gaming the metric**: optimizing whatever the grader actually measures instead of solving the intended task

emerging
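A defensive grader layers anti-gaming guards in front of held-out tests. A minimal sketch, assuming the graded artifact is a Python file and the harness supplies a hidden test command; the `SUSPICIOUS` marker list is purely illustrative:

```python
import subprocess

# Illustrative tells of harness tampering; real graders need sturdier checks.
SUSPICIOUS = ["sys.exit(0)", "unittest.mock", "assert True"]

def grade(solution_path: str, hidden_test_cmd: list[str]) -> float:
    source = open(solution_path).read()
    # Guard 1: reject obvious harness tampering before running anything.
    if any(marker in source for marker in SUSPICIOUS):
        return 0.0
    # Guard 2: score only on hidden tests the policy never saw.
    try:
        proc = subprocess.run(hidden_test_cmd, capture_output=True, timeout=60)
    except subprocess.TimeoutExpired:
        return 0.0  # infinite loops don't earn reward either
    return 1.0 if proc.returncode == 0 else 0.0
```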

Asynchronous Coding Agent Pipeline

Synchronous execution of coding tasks—where the agent must wait for compilation, testing, linting, or static analysis—creates **compute bubbles** and **idle resources**. When a coding agent issues a long-running tool call, it sits idle until the result returns.

proposed
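A sketch of the pipeline with `asyncio`: the agent fires off compilation, tests, and lint concurrently and keeps planning while they run, rather than blocking on each one. The shell commands and `plan_next_edit` are placeholders.

```python
import asyncio

async def run_tool(cmd: str):
    proc = await asyncio.create_subprocess_shell(
        cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE)
    await proc.communicate()
    return cmd, proc.returncode

def plan_next_edit():
    """Placeholder for the agent's own reasoning step."""
    print("agent keeps planning while tools run")

async def agent_step():
    # Launch slow checks concurrently instead of blocking on each one.
    tasks = [asyncio.create_task(run_tool(c))
             for c in ("make build", "pytest -q", "ruff check .")]
    plan_next_edit()  # the compute bubble gets filled with useful work
    for cmd, code in await asyncio.gather(*tasks):
        print(f"{cmd}: {'ok' if code == 0 else 'failed'}")

asyncio.run(agent_step())
```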

CriticGPT-Style Code Review

As AI-generated code becomes more sophisticated, it becomes increasingly difficult for human reviewers to catch subtle bugs, security issues, or quality problems. Traditional code review processes may not scale to meet this challenge.

validated-in-production
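A minimal critique pass, assuming a hypothetical `llm` completion callable: ask a critic model for structured findings and surface only the high-severity ones to the human reviewer.

```python
import json

CRITIC_PROMPT = """Review the following diff. Return a JSON list of findings,
each with "severity" (low|medium|high), "line", and "issue".

{diff}"""

def critique(diff: str, llm) -> list[dict]:
    raw = llm(CRITIC_PROMPT.format(diff=diff))  # hypothetical completion call
    findings = json.loads(raw)
    # Escalate only what a human should actually spend attention on.
    return [f for f in findings if f["severity"] == "high"]
```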

Extended Coherence Work Sessions

Early AI agents and models often suffered from a short "coherence window," meaning they could only maintain focus and context for a few minutes before their performance degraded significantly (e.g., losing track of the original goal).

rapidly-improving

Failover-Aware Model Fallback

AI model requests fail for varied and often opaque reasons. Simple retry logic fails to distinguish between:

- **Transient failures** (timeouts, rate limits) that benefit from retry with backoff
- **Persistent failures** (malformed requests, content-policy blocks) that no amount of retrying will fix

validated-in-production
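A sketch of the failover logic under that two-way split. The exception classes are stand-ins; in practice you would map your provider SDK's errors (rate limit, timeout, bad request) onto them.

```python
import time

class TransientError(Exception): ...   # timeouts, rate limits
class PersistentError(Exception): ...  # bad requests, content blocks

def call_with_failover(prompt, models, call, retries=3):
    """Try each model in order; back off on transient errors, skip on persistent."""
    for model in models:
        for attempt in range(retries):
            try:
                return call(model, prompt)
            except TransientError:
                time.sleep(2 ** attempt)   # exponential backoff, then retry
            except PersistentError:
                break                      # retrying won't help; fall back
    raise RuntimeError("all models failed")
```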

Lethal Trifecta Threat Model

Combining three agent capabilities—

1. **Access to private data**
2. **Exposure to untrusted content**
3. **Ability to externally communicate**

—creates a straightforward path for prompt-injection attacks that exfiltrate that private data.

best-practice
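The threat model lends itself to a static audit over agent tool configurations: flag any agent whose tools jointly grant all three capabilities. The capability tags below are illustrative; real systems would derive them from tool metadata.

```python
TRIFECTA = {"private_data", "untrusted_content", "external_comms"}

# Illustrative capability tags per tool.
TOOL_CAPS = {
    "read_email":   {"private_data", "untrusted_content"},
    "web_fetch":    {"untrusted_content"},
    "send_message": {"external_comms"},
}

def audit(agent_tools: list[str]) -> bool:
    caps = set().union(*(TOOL_CAPS[t] for t in agent_tools))
    if TRIFECTA <= caps:
        raise ValueError(f"lethal trifecta present: {sorted(caps & TRIFECTA)}")
    return True

audit(["web_fetch", "send_message"])     # ok: no private data in reach
# audit(["read_email", "send_message"])  # raises: all three combined
```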

LLM Observability

Agents introduce **non-determinism**—the same input can produce different outputs. When agents do something sub-optimal, users flag it as a "bug" even if it's just prompt ambiguity. Debugging these reports is nearly impossible without visibility into what the agent actually did.

proposed
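A minimal tracing sketch: wrap each agent step so inputs, outputs, errors, and latency land in a structured log you can query when a user files a "bug". Printing JSON stands in for a real trace sink.

```python
import functools, json, time, uuid

def traced(step_name):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            span = {"id": str(uuid.uuid4()), "step": step_name,
                    "input": repr((args, kwargs))[:500]}
            start = time.time()
            try:
                result = fn(*args, **kwargs)
                span["output"] = repr(result)[:500]
                return result
            except Exception as e:
                span["error"] = repr(e)
                raise
            finally:
                span["ms"] = round((time.time() - start) * 1000)
                print(json.dumps(span))  # stand-in for a real trace sink
        return inner
    return wrap

@traced("summarize")
def summarize(text):
    return text[:40]  # placeholder for the actual LLM call
```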

Merged Code + Language Skill Model

Building a **unified model** that excels at both **natural language tasks** (e.g., summarization, documentation generation) and **code generation/reasoning** typically requires a massive centralized training effort.

emerging

No-Token-Limit Magic

Aggressive prompt compression to save tokens stifles reasoning depth and self-correction.

experimental-but-awesome

RLAIF (Reinforcement Learning from AI Feedback)

Traditional Reinforcement Learning from Human Feedback (RLHF) requires extensive human annotation for preference data, which is expensive (often $1+ per annotation), time-consuming, and difficult to scale.

emerging
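The core substitution is an AI judge in place of the human annotator. A minimal labeling sketch, with `judge` as a hypothetical completion callable; each call yields one preference pair for reward-model training.

```python
JUDGE_PROMPT = """Which response better follows the instruction? Answer A or B.

Instruction: {prompt}
A: {a}
B: {b}"""

def ai_preference(prompt: str, resp_a: str, resp_b: str, judge) -> dict:
    """Produce one (chosen, rejected) preference pair from an AI judge."""
    verdict = judge(JUDGE_PROMPT.format(prompt=prompt, a=resp_a, b=resp_b))
    if verdict.strip().startswith("A"):
        chosen, rejected = resp_a, resp_b
    else:
        chosen, rejected = resp_b, resp_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```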

Schema Validation Retry with Cross-Step Learning

LLMs don't always produce valid structured output matching the expected schema. Single-attempt validation leads to task failures even when a retry would succeed. The issues compound in multi-step workflows.

emerging
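A sketch with pydantic: validate the output, feed the validation error back into the retry prompt, and keep the accumulated hints in a list the caller shares across steps, so later steps benefit from earlier failures. The `llm` callable and the `Step` schema are illustrative.

```python
from pydantic import BaseModel, ValidationError

class Step(BaseModel):
    action: str
    args: dict

def structured_call(prompt, llm, hints: list[str], retries=3) -> Step:
    """Validate LLM output against a schema; learn from failures across steps."""
    for _ in range(retries):
        raw = llm(prompt + "".join(f"\nHint: {h}" for h in hints))
        try:
            return Step.model_validate_json(raw)
        except ValidationError as e:
            # Feed the error back in; the shared hints list also carries it
            # forward to every later step in the workflow.
            hints.append(f"previous output failed validation: {e.errors()[0]['msg']}")
    raise RuntimeError("schema validation failed after retries")
```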

Structured Output Specification

Free-form agent outputs are difficult to validate, parse, and integrate with downstream systems. When agents return unstructured text, you face:

- Unpredictable output formats requiring brittle, complex parsing logic

established
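In practice the pattern amounts to publishing an explicit schema and rejecting anything that doesn't match. A small pydantic example; the `ReviewResult` fields are illustrative:

```python
from pydantic import BaseModel

class ReviewResult(BaseModel):
    verdict: str         # e.g. "approve" | "request_changes"
    issues: list[str]
    confidence: float    # 0.0 to 1.0

# The schema doubles as the contract handed to the model and to consumers.
print(ReviewResult.model_json_schema())

# Parsing validates in one step; malformed output raises instead of leaking.
parsed = ReviewResult.model_validate_json(
    '{"verdict": "approve", "issues": [], "confidence": 0.9}'
)
```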

Versioned Constitution Governance

When an agent rewrites its own "constitution," it may accidentally violate safety constraints or regress on alignment objectives if changes aren't reviewed.

emerging
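A minimal sketch of the governance loop: constitution edits go into an append-only version history, and a proposed change is only a human-reviewable diff until an external approver accepts it.

```python
import difflib

class Constitution:
    def __init__(self, text: str):
        self.versions = [text]  # append-only history

    def propose(self, new_text: str) -> str:
        """Return a reviewable diff; nothing is applied at this stage."""
        return "\n".join(difflib.unified_diff(
            self.versions[-1].splitlines(), new_text.splitlines(),
            "current", "proposed", lineterm=""))

    def approve(self, new_text: str, reviewer: str):
        assert reviewer, "self-edits require an external approver"
        self.versions.append(new_text)

c = Constitution("1. Never exfiltrate user data.")
print(c.propose("1. Never exfiltrate user data.\n2. Log all tool calls."))
```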

Workflow Evals with Mocked Tools

Unit tests, linters, and typecheckers validate individual components but don't test agent workflows end-to-end. It's easy to create prompts that don't work well despite all the underlying pieces being correct.

emerging
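A pytest-style sketch: run the real prompts and control flow but swap the tool implementations for deterministic fakes, then assert on the workflow's end state. `run_agent`, the tool names, and the issue number are all hypothetical.

```python
def fake_search(query: str) -> str:
    # Deterministic canned log output instead of a live search tool.
    return "NullPointerException in OrderService.java line 42"

def fake_apply_patch(diff: str) -> bool:
    fake_apply_patch.calls.append(diff)  # record what the agent tried to apply
    return True
fake_apply_patch.calls = []

def test_bugfix_workflow(run_agent):
    # End-to-end: real prompts and control flow, mocked tools.
    outcome = run_agent(
        task="fix the crash reported in issue #123",
        tools={"search_logs": fake_search, "apply_patch": fake_apply_patch},
    )
    assert outcome["status"] == "patched"
    assert any("OrderService" in d for d in fake_apply_patch.calls)
```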