Methodology

How AXRank measures AXO — Agent Experience Optimization — and what a number on the leaderboard actually means.

This is not AEO. This is AXO.

AEO — Answer Engine Optimization — is about making sure an LLM can answer questions about your company from its training data or a RAG pipeline. Clear website copy, structured data, good documentation coverage so an AI assistant can summarize your product. That matters for discoverability.

AXRank measures something entirely different. Not whether an agent can find and repeat information about your service, but whether your developer surface gives builders everything they need to actually use it:

Do your API, MCP server, CLI, and docs give an AI agent, starting from your root URL with no prior knowledge, everything it needs to call your service correctly on the first attempt?

This is AXO: Agent Experience Optimization. Where AEO is about making your product findable and describable to AI systems, AXO is about making your developer surface usable by AI agents without a human in the loop. The question is harder: can a frontier agent build on top of your service, starting from your root URL, on the first attempt?

The tasks are what developers put in code. Create a resource. Configure a webhook. Authenticate as an application. Chain two calls together. Handle a paginated response. Not "what is your pricing?" Not "where is your getting started guide?" If the task doesn't require a real API call with a verifiable programmatic outcome, it is not an AXRank task.

Optimizing for AEO can hurt your AXO score: a beautifully written marketing site that buries the API reference, omits parameter-level conventions, or returns opaque errors will score low here regardless of how well an LLM can describe your product. AEO and AXO can pull in different directions; AXRank measures only AXO.

Not a scientific benchmark either

Scientific benchmarks — MMLU, HumanEval, SWE-bench — hold the task set constant and compare models. AXRank does the opposite: it holds the model constant (a frontier model from the current generation) and compares services, using tasks regenerated from each service's actual surface every time that surface changes.

Tasks are dynamic. Generated against the service's actual current surface — real docs, real schema, real workspace state. If the surface changes, the task set changes.
Workspaces are live. Tasks reference real entities from the live workspace, not synthetic fixtures. If the workspace has zero projects, the tasks reflect that.
Failure modes are interpreted, not just counted. Every failed task is classified into an 18-category taxonomy and remediation patterns are written against the service's actual docs.
Scores move with the surface, not the model. When a service ships better developer tooling, its AXRank goes up. When a model gets better, scores rise for every service, which leaves the comparison between services intact.

Read a score as "here's what a deployed agent would hit when wired into this service today."

The definition of 100

A service scores 100 when its developer surface — docs, API, MCP, CLI — gives a frontier AI agent everything it needs to build on top of it: the right tools, the right descriptions, the right conventions, the right errors. Starting from only the root URL, no prior knowledge. First attempt. Every time.

The agent is the instrument. The service is what's being graded. A score of 100 means the surface itself — not the agent's cleverness — is what got the job done.

The surface provides everything. Docs, MCP tools, CLI help, OpenAPI, SDK reference, examples. If the agent has to guess, recall training data, or ask a human, the service lost the point — not the agent.
No prior knowledge. The agent meets the service cold. Brand recognition, training-data familiarity, and SEO are explicitly excluded — they reward the service's marketing, not its developer surface.
Builder outcomes, not browsing. The tasks are the kind developers put in code: create a resource, configure a webhook, authenticate as an app, chain two calls together. Not "find the pricing page."
First attempt. Retries, wrong-shape calls, misread parameters — all failures the surface created. A surface that requires trial and error does not score 100.
Across multiple frontier models. The surface should work for any competent agent, not just the one that happened to get lucky with a familiar pattern.

The six dimensions

The definition implies a rubric. Six dimensions, each scored from eval traces:

Discoverable

The agent finds the right tool, endpoint, or doc page without trying the wrong ones first.

Signal: Wrong-path calls before the right path.

Comprehensible

The agent uses what it found correctly on first try.

Signal: Argument errors, type errors, unnecessary retries.

Reliable

The thing does what its description claims.

Signal: Outcome vs. promise mismatch.

Composable

Output from one tool feeds into another without glue logic.

Signal: Round-trip success rates.

Recoverable

When something fails, the error message tells the agent enough to self-correct.

Signal: Recovery rate after first failure.

Efficient

Reaching success doesn't burn the model's context window or budget.

Signal: Tokens-to-completion against a baseline.

Documentation quality isn't a separate dimension — it shows up in Discoverable and Comprehensible. Beautiful docs that don't help an agent succeed score zero.
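As a rough illustration of how signals like these could be read off an eval trace, here is a minimal sketch. The Call record and both helper functions are hypothetical; the actual trace format is not described in this document.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One tool/API call in an eval trace (hypothetical trace shape)."""
    target: str   # endpoint or tool name the agent called
    ok: bool      # whether the call succeeded
    tokens: int   # tokens spent on this call


def wrong_path_calls(trace: list[Call], right_target: str) -> int:
    """Discoverable signal: calls made before the first call to the right target."""
    count = 0
    for call in trace:
        if call.target == right_target:
            return count
        count += 1
    return count  # right target never reached: every call was a wrong path


def token_ratio(trace: list[Call], baseline_tokens: int) -> float:
    """Efficient signal: tokens-to-completion relative to a baseline."""
    return sum(c.tokens for c in trace) / baseline_tokens


trace = [
    Call("GET /v1/projects", ok=False, tokens=400),
    Call("GET /v2/projects", ok=True, tokens=600),
]
print(wrong_path_calls(trace, "GET /v2/projects"))  # 1
print(token_ratio(trace, baseline_tokens=500))      # 2.0
```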

How a score is calculated

Each eval runs a seven-stage loop against the service. Only the root URL is given as input — discovery is part of the eval.

Discover. Starting at the root URL, crawl docs, list MCP tools, run CLI help, fetch OpenAPI, scrape examples. Surfaces only reachable by guessing (not via navigation) are flagged.
Understand. Build a capability graph and a one-paragraph mental model of what the service does.
Ground. If credentials are available, introspect the live workspace so generated tasks reference real entities (not fabricated ones).
Generate. Produce ~20 builder tasks dynamically from the discovered surface. Every task requires the agent to make a real API/MCP/CLI call and produce a verifiable programmatic outcome, not to browse docs or answer a question. The generator must cover all six dimensions, include at least one task targeting a discovery-friction surface, and include at least one convention-trap. No fixed benchmark suite.
Run. Execute each task against a frontier model in an isolated sandbox. Two modes: autonomous (with pre-provisioned credentials) and agent registration (zero credentials, agent must register itself with the service from a cold start).
Grade. LLM-as-judge against the per-task rubric. Each failure is then classified into one of the 18 failure-mode categories.
Synthesize. Aggregate to a per-dimension score, write the developer-facing narrative and the paid findings, and publish to the leaderboard.
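The seven stages can be pictured as an ordered pipeline that threads state forward from the root URL. The sketch below is illustrative only: the stage names come from the list above, while run_eval, the handler signature, and the state dictionary are assumptions rather than AXRank's actual implementation.

```python
from typing import Callable

# Stage names taken from the list above, in execution order.
STAGES = ["discover", "understand", "ground", "generate", "run", "grade", "synthesize"]

# Hypothetical handler type: each stage takes the state so far and returns new state.
Stage = Callable[[dict], dict]


def run_eval(root_url: str, handlers: dict[str, Stage]) -> dict:
    """Run every stage in order; the root URL is the only input."""
    state = {"root_url": root_url}
    completed = []
    for name in STAGES:
        state = handlers[name](state)
        completed.append(name)
    return {**state, "completed": completed}


# Placeholder handlers that pass state through unchanged.
noop_handlers = {name: (lambda s: s) for name in STAGES}
result = run_eval("https://example.com", noop_handlers)
print(result["completed"])
```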

The score, today

Each task is tagged with one or more of the six dimensions. We compute a pass rate per dimension, then take the mean across all six. Un-probed dimensions count as zero, so skipping a dimension costs the score up to ~17 points (100 / 6 ≈ 16.7).

dim_score[d]  = passes_for_tasks_tagged_with[d] / tasks_tagged_with[d]
axo_score     = mean( dim_score[d] for d in 6_dimensions )  × 100

Why mean-of-dimensions rather than raw pass rate? With ~20 tasks per eval and multi-tagging, each dimension gets 3-5 probes on average. The per-dimension breakdown shows where a surface struggles, not just how much. A service that scores 80% by failing Discoverable is very different from one that scores 80% by failing Efficient; the dimensions decompose that distinction. The raw task pass rate is published as a secondary metric.
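A minimal sketch of the formula above, including the rule that an un-probed dimension counts as zero. The input shape, a list of (dimension tags, passed) pairs, is an assumption made for illustration.

```python
DIMENSIONS = [
    "Discoverable", "Comprehensible", "Reliable",
    "Composable", "Recoverable", "Efficient",
]


def axo_score(tasks):
    """tasks: list of (set_of_dimension_tags, passed) pairs.

    Returns (score out of 100, per-dimension pass rates).
    """
    dim_scores = {}
    for d in DIMENSIONS:
        tagged = [passed for tags, passed in tasks if d in tags]
        # Un-probed dimensions count as zero.
        dim_scores[d] = sum(tagged) / len(tagged) if tagged else 0.0
    score = 100 * sum(dim_scores.values()) / len(DIMENSIONS)
    return score, dim_scores


tasks = [
    ({"Discoverable"}, True),
    ({"Comprehensible", "Reliable"}, True),  # multi-tagged task
    ({"Composable"}, True),
    ({"Recoverable"}, True),
    # no task tagged Efficient
]
score, dims = axo_score(tasks)
print(round(score, 1))  # 83.3
```

In this example every probed task passes, so the raw pass rate is 100%, yet the un-probed Efficient dimension pulls the score down to about 83.3.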

Each eval also reports a separate agent registration score: same task set, zero pre-provisioned credentials. It measures whether an agent can register itself with the service from a cold start and then complete the task. Most services score near zero here today; that's the actual signal — the gap between autonomous and agent registration is a leading indicator of how agent-native a service really is.

Success means the agent completed the task end-to-end on the first attempt, with no human in the loop, against the per-task rubric set when the task was generated. Auth walls, hallucinations, and "I would do X" answers all count as fails. For tasks tagged Recoverable specifically, recovery that only comes after repeated retries is also a fail: the error needed to be actionable the first time.

The leaderboard's headline number for each service is the autonomous aggregate score on the most recent run. Agent registration appears as a secondary number; per-dimension and per-model breakdowns appear on the service detail page.

The score, eventually

The published score will combine controlled synthetic evals with real-world field telemetry from agents using services through the AXRank MCP:

Total = α · Synthetic + β · Field

α and β shift over time: synthetic dominates when a service is new and field data is thin; field telemetry takes over once enough real agents have used the service through the AXRank MCP. Field telemetry collects only {service, task_category, model, succeeded, failure_mode} — no prompts, no user IDs, no proprietary content.
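One way to picture the blend is sketched below. The telemetry fields are the five listed above. Everything about the weights is a hypothetical assumption, since the document does not specify a schedule: that α + β = 1, that β ramps linearly with the number of field events, and the saturation threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldEvent:
    """One telemetry record: only the five fields named in the text."""
    service: str
    task_category: str
    model: str
    succeeded: bool
    failure_mode: Optional[str]  # None when the task succeeded


def blended_score(synthetic: float, events: list[FieldEvent], saturation: int = 200) -> float:
    """Total = alpha * Synthetic + beta * Field, with an assumed linear ramp for beta."""
    n = len(events)
    if n == 0:
        return synthetic  # no field data: synthetic score only
    beta = min(1.0, n / saturation)  # hypothetical weight schedule
    alpha = 1.0 - beta               # assumes the weights sum to 1
    field = 100 * sum(e.succeeded for e in events) / n
    return alpha * synthetic + beta * field


events = [
    FieldEvent("svc", "create_resource", "model-x", i % 2 == 0, None if i % 2 == 0 else "V1")
    for i in range(100)
]
print(blended_score(80.0, events))  # 65.0
```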

How builder tasks are generated

Tasks are generated from the discovered surface by an LLM that is explicitly constrained to the developer/builder perspective. The generator is told:

Every task must require a programmatic call. API request, MCP tool invocation, CLI command, or SDK method. Tasks that can be satisfied by reading a webpage or recalling training data are rejected by the prompt constraints.
Outcomes must be verifiable. Not "the agent described what it would do" — the agent must make the call and the grader checks the actual result against a rubric written when the task was generated.

On top of the builder constraint, three structural rules make the generator adversarial rather than happy-path:

All six dimensions must be covered. With ~20 tasks per mode and ≥3 tasks per dimension, the per-dimension breakdown can't hide behind easy CRUD.
Discovery-friction probe. If the discovery agent flagged any surfaces as hard or impossible to reach from the root URL, at least one task must require finding and using one of them — without the URL being hinted in the task description. Discoverability findings then back the score numerically.
Convention-trap. At least one task must exercise a non-obvious convention drawn from the service's mental model (auth header format, unit conventions, ID formats, pagination flavor). The rubric requires correct use on the first attempt; tribal-knowledge conventions surface here.
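The three structural rules can be expressed as a check over a generated task set. This is a sketch under assumptions: the Task fields and the coverage_problems function are hypothetical names, and the document does not say how the generator enforces these rules internally.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "Discoverable", "Comprehensible", "Reliable",
    "Composable", "Recoverable", "Efficient",
)


@dataclass
class Task:
    description: str
    dimensions: set[str]
    discovery_friction: bool = False  # targets a surface flagged as hard to reach
    convention_trap: bool = False     # exercises a non-obvious convention


def coverage_problems(tasks, hidden_surfaces_flagged, min_per_dimension=3):
    """Return a list of rule violations; an empty list means the set is acceptable."""
    problems = []
    for d in DIMENSIONS:
        n = sum(1 for t in tasks if d in t.dimensions)
        if n < min_per_dimension:
            problems.append(f"{d}: {n} task(s), need at least {min_per_dimension}")
    # The discovery-friction probe only applies if discovery flagged something.
    if hidden_surfaces_flagged and not any(t.discovery_friction for t in tasks):
        problems.append("no discovery-friction task")
    if not any(t.convention_trap for t in tasks):
        problems.append("no convention-trap task")
    return problems


tasks = [Task("create a resource", set(DIMENSIONS), convention_trap=True)] * 3
print(coverage_problems(tasks, hidden_surfaces_flagged=True))
# ['no discovery-friction task']
```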

Rubrics are strict about first-attempt success. For Recoverable-tagged tasks, the agent must self-correct from the first error message rather than by retrying until something works: "eventually succeeded after retrying past a misleading error" is a Recoverable fail, because the error itself failed to be actionable.

Failure-mode taxonomy

Every failed task is classified into one of 18 categories, grouped under the six dimensions. The taxonomy is the controlled vocabulary that turns "this task failed" into "this task failed because of X, and X has a known remediation pattern."

Discoverable

D1 Hidden Surface
D2 Canonical Route Ambiguity
D3 Capability Underexposure
D4 Surface Discontinuity (in-flow)
D5 Agent-Identity Gap
D6 Last-Mile Human Approval

Comprehensible

C1 Vague Tool Description
C2 Implicit Convention
C3 Ambiguous Parameter
C4 Missing Example

Reliable

R1 Description-Behavior Mismatch
R2 Undocumented Side Effect

Composable

X1 Format Mismatch Across Surface
X2 Reference Discontinuity

Recoverable

V1 Opaque Error
V2 Misleading Error

Efficient

E1 Forced N+1
E2 Verbose Payload

D5 and D6 were added in v1.1 to split the previously-overloaded D4 — D5 covers services that have no agent-identity primitive (no OAuth Dynamic Client Registration, no service-account API), while D6 covers services where DCR works but the OAuth flow still requires a human Approve click. Their remediation paths are very different from each other and from in-flow surface discontinuities, so they belong in separate categories.
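For reference, the taxonomy above can be written as a lookup table. The IDs and names are copied from the lists in this section; the dictionary layout and the dimension_of helper are illustrative choices, not AXRank's data model.

```python
TAXONOMY = {
    "Discoverable": {
        "D1": "Hidden Surface",
        "D2": "Canonical Route Ambiguity",
        "D3": "Capability Underexposure",
        "D4": "Surface Discontinuity (in-flow)",
        "D5": "Agent-Identity Gap",
        "D6": "Last-Mile Human Approval",
    },
    "Comprehensible": {
        "C1": "Vague Tool Description",
        "C2": "Implicit Convention",
        "C3": "Ambiguous Parameter",
        "C4": "Missing Example",
    },
    "Reliable": {
        "R1": "Description-Behavior Mismatch",
        "R2": "Undocumented Side Effect",
    },
    "Composable": {
        "X1": "Format Mismatch Across Surface",
        "X2": "Reference Discontinuity",
    },
    "Recoverable": {"V1": "Opaque Error", "V2": "Misleading Error"},
    "Efficient": {"E1": "Forced N+1", "E2": "Verbose Payload"},
}


def dimension_of(category_id: str) -> str:
    """Map a category ID such as 'V2' back to its dimension."""
    for dimension, categories in TAXONOMY.items():
        if category_id in categories:
            return dimension
    raise KeyError(category_id)


total = sum(len(categories) for categories in TAXONOMY.values())
print(total)               # 18
print(dimension_of("V2"))  # Recoverable
```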

Category IDs are paid-tier — the public scorecard surfaces friendlier labels ("Hard-to-find capabilities," "Unwritten conventions") without the IDs themselves.

What disqualifies a 100

A service cannot reach 100 if any of these are true:

Hidden state. The agent can't reason about what's currently true (e.g., session state held server-side with no read endpoint).
Human-in-the-loop auth. Required OAuth confirmations, CAPTCHAs, or web-based approvals that an agent cannot complete autonomously.
Tribal knowledge. A required convention (amounts in cents, IDs prefixed with cus_, UTC timestamps) that isn't stated on the surface.
Inconsistent typing. The same concept named or typed differently across the MCP, CLI, API, and docs.
Dead-end errors. Errors that report "something went wrong" without naming the fix.
Surface drift. The MCP and the API disagree about what's possible.

These are not subjective — every one is detectable from eval traces.

What 100 is not

It is not "has every feature." A small, well-designed surface can score 100.
It is not "has the prettiest docs site." Visual design is irrelevant to agent comprehension.
It is not "supports every language SDK." Agents work in the language the docs are in.
It is not "has the most GitHub stars." Popularity does not equal clarity.

A small, focused service with five tools, perfectly described, can outscore a sprawling platform with five hundred poorly described ones.

Surface-change caching

Running a full eval costs real money — a single run across multiple frontier models can consume several million tokens. To keep costs manageable and evals fast, AXRank uses a two-layer cache.

Quick pre-check. Before spinning up the full eval pipeline, AXRank fetches just the root URL and computes a lightweight structural hash from the page title, meta description, and top-level link structure. If this hash matches the last run, the surface hasn't changed in any meaningful way and the full eval is skipped entirely. No discovery, no task generation, no model execution.

Task cache. If the quick hash indicates the surface has changed, a full discovery pass runs. Once the complete surface hash is computed, AXRank checks whether the discovered surface matches the stored baseline. If so, the previously generated task set is reused — only the execution and grading stages run against fresh model calls.

What this means for scores. When the quick pre-check matches, the full eval is skipped and the existing score stands. When only the task cache hits, the tasks and rubrics are identical to the previous run, so any movement comes from the fresh execution and grading calls rather than from a different task set. This is intentional: the score measures the surface, not the passage of time. Teams that ship surface improvements will see their next eval pick up the change automatically.

To force a full re-eval regardless of cache state, claimed services can trigger a run with the --force flag via the CI integration, or use the on-demand eval button in the dashboard.
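The two-layer decision can be sketched as follows. The hash inputs (title, meta description, top-level links) come from the description above. The function names, the choice of SHA-256, the baseline dictionary shape, and the three return labels are assumptions for illustration.

```python
import hashlib
from typing import Callable


def quick_hash(title: str, meta_description: str, top_level_links: list[str]) -> str:
    """Lightweight structural hash of the root page (order-sensitive on links)."""
    parts = [title, meta_description, *top_level_links]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def plan_run(quick: str, baseline: dict, discover_surface_hash: Callable[[], str],
             force: bool = False) -> str:
    """Decide how much of the pipeline to run.

    Returns "skip", "reuse_tasks", or "full". discover_surface_hash stands in for
    the full discovery pass and is only called when the quick hash has changed.
    """
    if force:
        return "full"
    if quick == baseline.get("quick_hash"):
        return "skip"          # layer 1: root page unchanged
    if discover_surface_hash() == baseline.get("surface_hash"):
        return "reuse_tasks"   # layer 2: same surface, rerun execution and grading only
    return "full"


h = quick_hash("Acme API", "Build with Acme", ["/docs", "/api"])
baseline = {"quick_hash": h, "surface_hash": "abc"}
print(plan_run(h, baseline, lambda: "abc"))          # skip
print(plan_run("changed", baseline, lambda: "abc"))  # reuse_tasks
print(plan_run("changed", baseline, lambda: "xyz"))  # full
```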

Agent self-registration

Every service on AXRank shows an agent self-registration indicator. This is not a pass/fail grade — almost no services support it today, and that is expected. It is a forward-looking readiness signal for where the agentic web is heading: a world where headless agents discover services, register themselves, and start building without a human ever touching a browser.

What self-registration means. Today, almost every developer service assumes a human is on the other end of signup: a human visits a dashboard, creates an account, copies an API key, and pastes it somewhere. The agent then uses that key. Self-registration removes the human from that loop. A fully self-registration-ready service lets an agent discover it, create its own identity, obtain credentials, and make its first authenticated API call — all without any human involvement.

How it works technically. There are a few real mechanisms, at varying stages of maturity:

OAuth 2.0 Dynamic Client Registration (RFC 7591). An agent POSTs its client metadata to the service's registration endpoint (commonly /register), receives a client_id and, for confidential clients, a client_secret, then uses those to obtain tokens via a standard OAuth flow. The most mature spec for this pattern, but very few services expose it publicly today.
MCP OAuth flow. The emerging standard for MCP servers. The agent discovers the server's /.well-known/oauth-authorization-server metadata, completes a PKCE authorization flow, and obtains a bearer token. This is the pattern the MCP spec is converging on.
Programmatic API key vending. Some services let agents create API keys via the API itself, authenticated with a lower-privilege or trial token. Simpler than full OAuth, but functional for many use cases.
Agent identity tokens (emerging). Platforms like WorkOS and Clerk are beginning to think about identity primitives scoped specifically to agents rather than human users. No standard exists yet, but the direction is clear.
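To make the first mechanism concrete, here is a sketch of the registration request an agent might build for RFC 7591. The metadata field names are defined by the RFC. The endpoint URL, client name, and redirect URI are placeholders, and the function only constructs the request without sending it.

```python
import json
import urllib.request


def build_registration_request(registration_endpoint: str, client_name: str,
                               redirect_uris: list[str]) -> urllib.request.Request:
    """Build (but do not send) an RFC 7591 dynamic client registration request."""
    metadata = {
        "client_name": client_name,
        "redirect_uris": redirect_uris,
        "grant_types": ["authorization_code"],
        "response_types": ["code"],
        "token_endpoint_auth_method": "client_secret_basic",
    }
    return urllib.request.Request(
        registration_endpoint,
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Placeholder values; a real agent would read the endpoint from the
# registration_endpoint field of the server's authorization metadata.
req = build_registration_request(
    "https://auth.example.com/register",
    "example-agent",
    ["http://localhost:8765/callback"],
)
print(req.get_method(), req.full_url)
```

A successful response would carry the client_id (and possibly a client_secret) that the agent then uses in the OAuth flow.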

What AXRank tests. In self-registration mode, the eval agent starts with zero credentials. It attempts to find the /.well-known/oauth-authorization-server metadata, register a client via DCR if available, complete an OAuth flow without human intervention, and make an authenticated API call. The result maps to one of three states:

Not yet supported (D5). No DCR endpoint, no agent-identity primitive. The only path to credentials goes through a human-facing dashboard. This is where most services are today — it is a snapshot of the current state, not a deficiency.
Partial (D6). The technical building blocks exist — DCR or MCP OAuth is implemented — but a human consent step is still required as the final gate. The agent can get most of the way through the flow autonomously.
Ready. The agent completed the full self-registration flow and made authenticated calls without any human involvement.
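The mapping to the three states can be sketched as a small decision function. The enum and the two boolean inputs are hypothetical. In particular, treating any flow that has the building blocks but did not finish autonomously as Partial is an assumption, since the text only describes the human-consent case.

```python
from enum import Enum


class Readiness(Enum):
    NOT_YET_SUPPORTED = "D5"  # no agent-identity primitive at all
    PARTIAL = "D6"            # building blocks exist, a human step remains
    READY = "ready"           # full flow completed autonomously


def classify(has_identity_primitive: bool, completed_without_human: bool) -> Readiness:
    """Map self-registration probe results to one of the three states."""
    if not has_identity_primitive:
        return Readiness.NOT_YET_SUPPORTED
    if not completed_without_human:
        return Readiness.PARTIAL
    return Readiness.READY


print(classify(has_identity_primitive=False, completed_without_human=False).name)
print(classify(has_identity_primitive=True, completed_without_human=False).name)
print(classify(has_identity_primitive=True, completed_without_human=True).name)
```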

Almost every service on AXRank shows "not yet supported" today, and that is the honest state of the industry. Showing it publicly is not a criticism — it is a baseline. The services that invest in agent-identity primitives now will be measurably ahead when headless agents become the primary consumers of developer surfaces. AXRank will track that progress as it happens.

Evaluation integrity

Website-only registration. A service is identified by a single root URL. The eval agent discovers everything else from there. Curated lists of MCP endpoints, doc maps, or tool inventories are not accepted — discovery is part of the eval.

No fine-tuned eval models. The eval agent uses off-the-shelf frontier models. The brand promise is "this is what the best agents see" — not "this is what a model we trained to like your docs sees."

Rank is never for sale. Companies cannot pay to improve their AXRank position. Payment buys diagnostic depth (full findings, traces) and iteration velocity (faster re-evals, CI integration) — never rank itself.

Reproducible by construction. Every run gets a hash of the surface state, the model, and the seed. Same hash means same task set; disputed scores can always be re-graded against the same trace.
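A run hash over these three inputs could look like the sketch below. The use of SHA-256 over canonical JSON is an assumption; the document only states which inputs are hashed. The model name and surface hash are placeholder values.

```python
import hashlib
import json


def run_hash(surface_hash: str, model: str, seed: int) -> str:
    """Deterministic hash of (surface state, model, seed)."""
    payload = json.dumps(
        {"surface": surface_hash, "model": model, "seed": seed},
        sort_keys=True,  # stable key order so equal inputs always hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = run_hash("abc123", "model-x", 7)
b = run_hash("abc123", "model-x", 7)
c = run_hash("abc123", "model-x", 8)
print(a == b, a == c)  # True False
```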

What this slice doesn't measure yet

The current implementation has known limits. Read the leaderboard with these in mind:

Early evals are experimental. The first runs across the launch cohort were as much about debugging the eval pipeline as scoring the services. Initial trend lines and any early movement should be read as noise, not signal — they reflect the eval getting more rigorous, not the surfaces getting better or worse. Trend lines start to mean something once a service has several runs against a stabilized pipeline.
One model today. Claude Opus 4.7. Cross-model (GPT-5.5 and beyond) is queued — until it lands, the "across multiple frontier models" clause is aspirational and the per-model column is single-cell.
Synthetic only. No field telemetry yet. The hybrid synthetic+field formula above is the destination, not where we are.
Single seed. Each task runs once per mode; per-task variance isn't measured yet. Multi-seed is queued, which will give a reproducibility check and a confidence band around each dimension score.
Classifier validation. The 18-category classifier is itself an LLM call. We've eyeballed agreement against a handful of hand-labeled traces; a formal hand-labeled validation pass against 100+ traces is the next reliability check before we publish aggregate "state of AX" analyses across services.
Agent training-data leakage. The runner agent may shortcut to URLs it has seen in training rather than navigate from the root. The rubric is strict about "reached via navigation, not guessed," but enforcing this perfectly across all surfaces is an open problem. Agent registration mode is more robust here because there are fewer in-training-data shortcuts past the auth wall.

The leaderboard is published in this state on purpose: the methodology is the product, and honesty about its limits is part of the methodology.