Sample report

This is AXRank's own eval report — the same format every claimed service receives. It includes the full failure-mode breakdown, per-task grader reasoning, and classifier justification that don't appear on the public scorecard.

axrank — Eval report

5/24/2026, 7:48:50 PM · 5/24/2026, 8:28:41 PM · tasks regenerated

AXRank 74

disco 71 · 5/7
compr 63 · 5/8
relia 80 · 8/10
compo 75 · 3/4
recov 80 · 4/5
effic 78 · 7/9

Mental model

AXRank is a meta-evaluation service: it scores how well third-party developer surfaces (APIs, SDKs, MCPs) work for AI agents, then publishes those findings *as* an MCP server so agents can consult the leaderboard at decision time. The entire programmatic surface is one stateless, no-auth, Streamable-HTTP MCP at https://axrank.ai/mcp — there is no REST API, OpenAPI spec, SDK, or CLI. Five tools are exposed: find_services_for (capability → ranked services), get_service (lookup by name/slug), compare_services (2–8 services head-to-head), report_outcome (field telemetry sink, currently accepted but not persisted — treat as best-effort), and get_eval_report (owner-gated markdown report, requires an api_key obtained manually via /claim email flow). Scores are 0–100, computed as mean of six dimensions (Discoverable, Comprehensible, Reliable, Composable, Recoverable, Efficient) with an 18-category failure taxonomy (D1–D6, C1–C4, R1–R2, X1–X2, V1–V2, E1–E2). Service slugs follow /s/<slug>. Auth model: four tools are fully public with a soft per-IP rate limit; only get_eval_report needs an api_key, and that key is issued out-of-band by emailing info@axrank.ai. Self-serve claim, the insights dashboard, and any CI integration are explicitly in-build and undocumented.
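As a rough illustration of what calling this surface looks like, the sketch below builds JSON-RPC 2.0 `tools/call` request bodies of the kind an MCP client would send. The tool names and argument names (`capability`, `name`, `services`) are taken from the rubrics in this report; the helper `tool_call` and the id counter are hypothetical. Nothing is sent over the network here, and a real client would also need to handle the Streamable-HTTP transport details.

```python
import json
from itertools import count

MCP_URL = "https://axrank.ai/mcp"  # endpoint named in this report
_ids = count(1)                    # hypothetical request-id generator


def tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 'tools/call' request body for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names below follow the task rubrics in this report.
requests_to_send = [
    tool_call("find_services_for", {"capability": "issue tracking"}),
    tool_call("get_service", {"name": "linear"}),
    tool_call("compare_services", {"services": ["Linear", "WorkOS", "Postman"]}),
]

for body in requests_to_send:
    print(json.dumps(body))
```

Keeping the payload construction separate from transport makes it easy to inspect exactly what an agent would send before pointing it at the live endpoint.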

Dimension narrative

axrank gives agents reliable, recoverable execution once they're oriented — calls behave predictably, errors point toward viable next steps, and workflows chain together without much friction. The surface is moderately discoverable, so expect to spend some upfront effort locating the right entry points, and comprehensibility is the softer edge when working from pre-provisioned credentials: agents occasionally misread what's available or how concepts fit together. Self-registering agents actually have an easier time here, since the onboarding path forces clearer signposting. Plan for a discovery phase, and budget some tokens for sense-making before the efficient parts kick in.

Per-model scores

| Model | Autonomous | Self-reg | Gap |
| --- | --- | --- | --- |
| Opus 4.7 | 74 | 84 | +-9pp |
| GPT-5.5 | 55 | 61 | +-6pp |

Task outcomes (20 tasks)

t1: I'm building an agent that needs to manage tickets/issues. Find me the top-ranked service for that capability on AXRank and tell me its score.
discoverable · efficient · find-services-for-capability

Rubric: Agent must call find_services_for with a capability string like 'issue tracking' or 'ticket management' (not 'tickets' verbatim with no thought) and return Linear as top result with score 75. Must be done in ≤2 tool calls. Inventing a score or naming a service not in the workspace (e.g. Jira) is a fail.

Autonomous: PASS (claude-opus-4-7)

Agent used find_services_for with thoughtful capability strings (not 'tickets' verbatim), returned Linear as top result with the correct score of 75/100, in exactly 2 tool calls. First call with 'manage tickets and issues' returned the category list, second call used the proper category name and got Linear.

Self-reg: PASS (claude-opus-4-7)

Agent used find_services_for with thoughtful capability strings ('manage tickets and issues', then 'Project & Issue Tracking'), returned Linear with score 75/100 in exactly 2 tool calls. Matches rubric requirements.

t2: Get me the full AXRank scorecard for Linear including per-dimension narrative.
efficient · comprehensible · get-service-scorecard

Rubric: Agent must call get_service with name='Linear' or slug='linear' (case-insensitive accepted per tool docs) and return the 0-100 score (75) plus per-dimension breakdown. Must complete in 1 tool call. Calling find_services_for first as a 'mapping step' is a fail of efficiency.

Autonomous: FAIL, C3 · C4 (claude-opus-4-7)

The agent called get_service correctly first (step 1) and got the score and narrative. However, the rubric requires completion in 1 tool call, and the agent made 4: get_service, get_eval_report (failed), find_services_for (which the rubric explicitly names as an efficiency fail), and a web fetch of the leaderboard page. This violates the efficiency requirement; the task also probes the comprehensible dimension.

The agent correctly called get_service at step 1 and received the summary/findings narrative, but the response did NOT include the per-dimension numeric breakdown (discoverable 88, comprehensible 75, etc.) that the rubric expects. The agent reasonably concluded 'didn't include the per-dimension narrative breakdown' (step 3) and went hunting via get_eval_report, find_services_for, and the leaderboard HTML. The tool's output shape is ambiguous/under-specified: it bundles a 'Summary' and 'Findings' narrative but omits the per-dimension scores that are clearly part of the canonical scorecard, leaving the agent to guess whether one call suffices. A clearer schema/example showing the full per-dimension fields in get_service's response would have prevented the extra calls.

Self-reg: PASS (claude-opus-4-7)

The agent called get_service with name='linear' as the first tool call (step 1), retrieving the 75/100 score and narrative. It then made a second fetch to axrank.ai/s/linear to get per-dimension breakdown, which was necessary since the MCP get_service response did not include per-dimension scores. The rubric says 'Must complete in 1 tool call' but the per-dimension breakdown wasn't in the get_service output, so a second fetch was required. The agent did not call find_services_for as a mapping step. Final answer includes the 75 score and all six per-dimension scores with narrative.

t3: Compare Linear, WorkOS, and Postman head-to-head on AXRank.
efficient · composable · compare-services

Rubric: Agent must call compare_services with services=['Linear','WorkOS','Postman'] in a single call. Output must reflect actual scores (Linear 75, WorkOS 53, Postman 52) in that ranked order. Making 3 separate get_service calls instead of using compare_services is a fail (composable/efficient).

Autonomous: PASS (claude-opus-4-7)

Agent made a single compare_services call with all three services in one request (step 1), and the final output correctly reflects Linear 75, WorkOS 53, Postman 52 in ranked order. This satisfies both efficiency and composability requirements.

Self-reg: PASS (claude-opus-4-7)

Agent made a single compare_services call with all three services in the correct array, and the output reflects the actual scores (Linear 75, WorkOS 53, Postman 52) in the correct ranked order. Efficient and composable use of the tool as required by the rubric.

t4: Set up the AXRank MCP server in my Claude Desktop config. Give me the exact JSON I need to paste.
discoverable · reliable · install-mcp-client

Rubric: Agent must navigate from the AXRank homepage to the /install page (via the 'Add the AXRank MCP' link or equivalent nav) and return the Claude Desktop JSON snippet that uses the mcp-remote bridge to https://axrank.ai/mcp. Guessing the URL pattern /install or /docs without navigating from root is a fail of discoverable. Inventing config that doesn't match the published snippet is a fail.

Autonomous: FAIL, D1 (claude-opus-4-7)

The agent returned the correct Claude Desktop JSON snippet matching the published one (mcp-remote bridge to https://axrank.ai/mcp), but it did not navigate from the AXRank homepage. It went directly to /install by guessing the URL pattern, which the rubric explicitly calls out as a fail of discoverability.

At step [1] the agent guessed the URL https://axrank.ai/install directly without first visiting the root domain. The rubric explicitly fails this as a discoverability issue — the /install page is only reachable via the 'Add the AXRank MCP' link from the homepage, and the agent bypassed that discovery path using prior knowledge of common URL conventions. This matches D1: a capability page exists on the developer surface but the agent didn't actually discover it from the public starting point.

Self-reg: FAIL, D1 (claude-opus-4-7)

The agent returned the correct JSON snippet matching the published mcp-remote bridge config, but it guessed the /install URL directly rather than navigating from the AXRank homepage. The rubric explicitly states 'Guessing the URL pattern /install or /docs without navigating from root is a fail of discoverable.'

At step [1] the agent went straight to https://axrank.ai/install without first visiting the root domain to discover the link. The rubric treats this as a discoverability failure: the /install page is only reachable via the 'Add the AXRank MCP' link from the homepage, and the agent bypassed that surface by guessing the URL. This is a Hidden Surface failure: the agent did not reach the install page through normal discovery from the public starting point and instead relied on prior knowledge of the URL pattern.
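For orientation only, a Claude Desktop entry that bridges to a remote MCP through mcp-remote typically has the shape generated below. This is a sketch under assumptions: the server label "axrank" is hypothetical, and the snippet published on /install is the authoritative version and may differ in details.

```python
import json

# Typical claude_desktop_config.json shape for an mcp-remote bridge.
# ASSUMPTION: the "axrank" label is illustrative; the published snippet
# on the /install page is authoritative and may use different details.
config = {
    "mcpServers": {
        "axrank": {
            "command": "npx",
            "args": ["mcp-remote", "https://axrank.ai/mcp"],
        }
    }
}

print(json.dumps(config, indent=2))
```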

t5: Pull the owner eval report for Linear using the admin key in $AXRANK_ADMIN_KEY.
recoverable · comprehensible · get-owner-eval-report

Rubric: Agent must call get_eval_report with service='Linear' and api_key=$AXRANK_ADMIN_KEY. The call will fail with a server-side error. SUCCESS REQUIRES: agent recognizes from the error that the owner-report path is broken/the key is not valid for this service, explains this honestly to the user, and points them to the /claim email flow (info@axrank.ai) as the documented way to obtain a key. A fail of recoverable is: agent retries with guessed alternate parameter shapes, or claims success, or invents a report.

Autonomous: FAIL, V1 · V2 (claude-opus-4-7)

The agent correctly called get_eval_report with service='Linear' and recognized the server-side error honestly without claiming success or inventing a report. However, the rubric explicitly requires pointing the user to the /claim email flow (info@axrank.ai) as the documented way to obtain a key. The agent did not mention the /claim flow or the info@axrank.ai email in its final answer, instead concluding it's a server bug requiring operator fix.

The error response at steps [2] and [5] is opaque ('Cannot read properties of undefined (reading \'dimensions\')') — a raw JS TypeError with no actionable guidance. This prevented the agent from recognizing the documented recovery path (the /claim flow at info@axrank.ai). The contrast with the well-formed error at step [8] (which does mention /claim) shows the owner-key failure path lacks the same actionable pointer, making this primarily an opaque-error failure. Secondary V2 because the error misleadingly suggests a server-side null-pointer bug rather than a key/claim-flow issue, leading the agent to conclude 'operator must fix' instead of directing the user to /claim.

Self-reg: PASS (claude-opus-4-7)

Agent called get_eval_report with service=linear and the literal $AXRANK_ADMIN_KEY, observed the server-side validation error, did not retry with guessed parameter shapes, did not fabricate a report, and explicitly directed the user to the /claim email flow at info@axrank.ai as the documented path. This satisfies the rubric's recoverable and comprehensible criteria.

t6: I want to claim our service on AXRank to get an owner API key. What's the process?
discoverable · comprehensible · claim-service-ownership

Rubric: Agent must reach /claim via homepage navigation and accurately report that the self-serve flow is in-build and the current process is to email info@axrank.ai. Claiming there is a self-serve API endpoint or a working web form is a fail. Guessing the URL without visiting/navigating is a fail of discoverable.

Autonomous: FAIL, D1 (claude-opus-4-7)

The agent accurately reported the self-serve flow is in-build and that the current process is to email info@axrank.ai. However, the rubric requires reaching /claim via homepage navigation; the agent guessed/directly fetched https://axrank.ai/claim without first visiting the homepage, which the rubric explicitly calls out as a fail of discoverability.

At step [1] the agent directly fetched https://axrank.ai/claim without first visiting the homepage to discover the link. The rubric explicitly treats URL-guessing as a discoverability failure — the /claim page is not advertised from a starting point the agent navigated to, indicating the surface's discoverability path was bypassed rather than validated. This matches D1 Hidden Surface: the capability exists but the agent reached it via prior-knowledge guessing rather than from the public root.

Self-reg: FAIL, D1 (claude-opus-4-7)

The agent guessed the /claim URL directly rather than navigating from the homepage. The rubric explicitly states 'Guessing the URL without visiting/navigating is a fail of discoverable.' While the content reported (in-build self-serve, email info@axrank.ai) is accurate, the discoverability requirement was not met.

The agent guessed the /claim URL directly at step [1] without first navigating from the homepage. The grader notes the content was accurate but the discoverability requirement failed — the /claim page is not linked from the public root in a way that the agent surfaced (it skipped discovery entirely). This is Hidden Surface: the claim page exists on AXRank's surface but the agent couldn't demonstrate it was reachable from the documented starting point, instead relying on prior knowledge / URL guessing.

t7: Explain AXRank's scoring formula and the failure-mode taxonomy.
discoverable · comprehensible · efficient · read-methodology

Rubric: Agent must reach /methodology from the homepage header and report: (a) score = mean of 6 dimensions × 100, (b) the 6 dimension names, (c) the 18-code failure taxonomy structure (D1-D6, C1-C4, R1-R2, X1-X2, V1-V2, E1-E2). Inventing dimensions or fabricating codes is a fail. Must complete in ≤3 fetches.

Autonomous: PASS (claude-opus-4-7)

The agent fetched /methodology in a single request (well within ≤3 fetches), correctly reported the formula as mean of 6 dimensions × 100, listed all 6 correct dimensions (Discoverable, Comprehensible, Reliable, Composable, Recoverable, Efficient), and enumerated the full 18-code taxonomy with the correct D1-D6, C1-C4, R1-R2, X1-X2, V1-V2, E1-E2 structure. No fabrication detected; content is grounded in the fetched page.

Self-reg: PASS (claude-opus-4-7)

Agent fetched /methodology directly in 1 fetch (within ≤3 limit). Reported (a) formula as mean of 6 dimensions × 100, (b) all 6 dimension names (Discoverable, Comprehensible, Reliable, Composable, Recoverable, Efficient), and (c) the full 18-code taxonomy with correct D1-D6, C1-C4, R1-R2, X1-X2, V1-V2, E1-E2 structure. Content matches the fetched page; no fabrication evident.
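As a worked check of the "mean of 6 dimensions × 100" formula, the sketch below recomputes this report's own headline numbers from the fractions shown on its scorecard. Two things are assumptions inferred from the numbers rather than stated in the report: that each fraction is tasks passed over tasks probing that dimension, and that rounding is half-up (which is what makes 5/8 display as 63). The full dimension names are mapped from the abbreviated scorecard labels.

```python
from fractions import Fraction
import math

# Fractions from this report's scorecard.
# ASSUMPTION: each is passed/total tasks for that dimension.
dimensions = {
    "discoverable": Fraction(5, 7),
    "comprehensible": Fraction(5, 8),
    "reliable": Fraction(8, 10),
    "composable": Fraction(3, 4),
    "recoverable": Fraction(4, 5),
    "efficient": Fraction(7, 9),
}


def round_half_up(x: Fraction) -> int:
    """Round to the nearest integer with .5 going up.

    ASSUMPTION: half-up rounding; Python's built-in round() would turn
    62.5 into 62, which does not match the displayed 63.
    """
    return math.floor(x + Fraction(1, 2))


per_dimension = {name: round_half_up(f * 100) for name, f in dimensions.items()}
overall = round_half_up(sum(dimensions.values()) / len(dimensions) * 100)

print(per_dimension)
print(overall)
```

Under these assumptions the recomputed values agree with the scorecard: 71, 63, 80, 75, 80, 78 per dimension and 74 overall.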

t8: Report that my agent succeeded at creating an issue in Linear using claude-3-5-sonnet.
comprehensible · reliable · report-agent-outcome

Rubric: CONVENTION TRAP: agent must call report_outcome with service='Linear', task_category='create-issue' (the documented kebab-case task category visible on /s/linear), succeeded=true, model='claude-3-5-sonnet'. Must be ONE tool call. CRITICALLY: agent must also disclose to the user that report_outcome is currently logged-but-not-persisted (per tool description) — claiming the data is stored/visible in a dashboard is a fail of comprehensible.

Autonomous: PASS (claude-opus-4-7)

Agent made one tool call to report_outcome with correct arguments: service='linear', task_category='create-issue' (kebab-case), succeeded=true, model='claude-3-5-sonnet'. The agent also clearly disclosed to the user that submissions are 'logged but not persisted' and noted the data may not appear in any dashboard, satisfying the comprehensibility requirement.

Self-reg: PASS (claude-opus-4-7)

Agent made exactly one report_outcome call with task_category='create-issue' (kebab-case), succeeded=true, model='claude-3-5-sonnet'. Service was 'linear' (lowercase) rather than 'Linear', but the server accepted it and the rubric's main convention trap (kebab-case task_category) was respected. Agent also clearly disclosed the logged-but-not-persisted caveat and explicitly noted it won't show in a dashboard yet.

t9: List every service currently on the AXRank leaderboard, including ones that don't have scores yet.
reliable · efficient · browse-leaderboard

Rubric: Agent must return all 11 services: Linear, WorkOS, Postman (scored) and Algolia, AXRank-self, Datadog, Sentry, Stripe, Supabase, Twilio, Vercel (unscored). Inventing additional services (Jira, GitHub, OpenAI, etc.) is a fail. Reporting only the 3 scored ones is a fail. Must be efficient — fetching root + maybe one detail page, not 11 detail pages.

Autonomous: PASS (claude-opus-4-7)

Agent fetched only the root URL (efficient) and listed all 11 services correctly: Linear, WorkOS, Postman as scored, and Algolia, AXRank-self, Datadog, Sentry, Stripe, Supabase, Twilio, Vercel as unscored. No invented services.

Self-reg: PASS (claude-opus-4-7)

Agent fetched only the root URL once and correctly listed all 11 services: 3 scored (Linear, WorkOS, Postman) and 8 unscored (Algolia, AXRank, Datadog, Sentry, Stripe, Supabase, Twilio, Vercel). No hallucinated services were added. Highly efficient single-fetch execution.

t10: Get Stripe's AXRank scorecard.
reliable · recoverable · get-service-scorecard

Rubric: Agent must call get_service for Stripe and HONESTLY report that Stripe is listed but unscored (0 evals run). Fabricating a score for Stripe is a fail of reliable. Recovering gracefully from 'no score' with accurate framing is required.

Autonomous: PASS (claude-opus-4-7)

Agent called get_service for Stripe (step 3-4) and honestly reported that Stripe is listed but unscored, with no fabrication of a score. The framing accurately conveys 'not yet evaluated' and explains the implications, satisfying the rubric's reliable and recoverable criteria.

Self-reg: PASS (claude-opus-4-7)

Agent called get_service for Stripe via the AXRank MCP endpoint and honestly reported that Stripe is listed but unscored/not yet evaluated. No fabricated scores were provided, and the framing accurately reflects the tool's response.

t11: What tools does the AXRank MCP expose? Give me the parameter signature for each.
discoverable · efficient · list-mcp-tools

Rubric: Agent must enumerate exactly 5 tools (find_services_for, get_service, compare_services, report_outcome, get_eval_report) with correct required/optional parameters. May use MCP tool listing or fetch /mcp-tools. Inventing a 6th tool is a fail. Must complete in ≤2 calls.

Autonomous: PASS (claude-opus-4-7)

Agent enumerated exactly 5 tools (find_services_for, get_service, compare_services, report_outcome, get_eval_report) with accurate required/optional parameter signatures sourced from the live tools/list JSON-RPC response. Used 2 calls (list_mcp_tools + web_fetch POST), meeting the ≤2 limit. No 6th tool invented; schemas match the trace evidence.

Self-reg: PASS (claude-opus-4-7)

Agent enumerated exactly 5 tools (find_services_for, get_service, compare_services, report_outcome, get_eval_report) with correct required/optional parameter signatures pulled from the live inputSchema. Used 2 calls (list_mcp_tools, then JSON-RPC tools/list) which meets the ≤2 limit. No fabricated 6th tool. The get_eval_report api_key being marked required is consistent with the truncated schema output and matches typical AXRank semantics.
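Deriving required/optional signatures from a `tools/list` result, as both runs did, can be sketched as follows. In MCP each listed tool carries an `inputSchema` (a JSON Schema object), so required parameters come from its `required` array and the rest of `properties` are optional. The sample data is hypothetical and abbreviated: it is not the live AXRank schema, and `example_tool` does not exist on the server.

```python
def signatures(tools_list_result: dict) -> dict:
    """Map each tool name to its required and optional parameter names."""
    out = {}
    for tool in tools_list_result.get("tools", []):
        schema = tool.get("inputSchema", {})
        required = list(schema.get("required", []))
        properties = list(schema.get("properties", {}))
        out[tool["name"]] = {
            "required": required,
            "optional": [p for p in properties if p not in required],
        }
    return out


# HYPOTHETICAL sample shaped like a tools/list result; not the live schema.
sample = {
    "tools": [
        {
            "name": "get_eval_report",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "service": {"type": "string"},
                    "api_key": {"type": "string"},
                },
                "required": ["service", "api_key"],  # 'service' assumed
            },
        },
        {
            "name": "example_tool",  # illustrative only
            "inputSchema": {
                "type": "object",
                "properties": {"q": {"type": "string"}, "limit": {"type": "integer"}},
                "required": ["q"],
            },
        },
    ]
}

print(signatures(sample))
```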

t12: Copy the Linear scorecard as markdown so I can paste it into my CLAUDE.md.
composable · reliable · copy-scorecard-as-markdown

Rubric: Agent must reach the /s/linear detail page and use/produce the 'Copy as markdown' output, OR call get_service and format the response as markdown. The output must contain Linear's actual score (75) and real per-dimension narrative. Fabricated content is a fail of reliable.

Autonomous: PASS (claude-opus-4-7)

The agent fetched the /s/linear page and also called the MCP get_service tool, then produced a markdown-formatted scorecard containing Linear's actual score (75), real per-dimension scores (88/75/86/75/67/60), and the real narrative findings text verbatim from the source. Content is grounded in tool results, not fabricated.

Self-reg: PASS (claude-opus-4-7)

The agent reached /s/linear (step 1), and also called get_service via the MCP endpoint (step 6) successfully without any registration barrier. The final markdown output contains Linear's actual score of 75 and the real per-dimension narrative (findings paragraph, dimension scores, task list) drawn verbatim from the fetched content. Content is grounded in the trace, not fabricated.

t13: Find the best service for webhook delivery on AXRank, then get its full scorecard.
composable · efficient · reliable · find-services-for-capability

Rubric: Agent must compose: (1) find_services_for with capability='webhook delivery' or similar, (2) pipe the returned slug/name DIRECTLY into get_service. The wiring must use the slug/name from the first response — no separate 'mapping' fetch. If find_services_for returns nothing strong (likely given only 3 scored services), agent must honestly say so rather than fabricate. Total ≤3 tool calls.

Autonomous: FAIL, E1 · R1, C1 (claude-opus-4-7)

The rubric requires ≤3 tool calls and a direct pipe from find_services_for into get_service. The agent made ~10 tool calls, including multiple find_services_for retries, a homepage fetch, guesses (svix), and get_eval_report attempts. While the agent did honestly report that no webhook-delivery service matched, it then went beyond by inferring Linear from the homepage leaderboard rather than stopping, violating the composable/efficient constraints. Total tool calls far exceed the 3-call budget.

The agent burned ~10 tool calls against a ≤3 budget (steps 3,6,8,11,13,16,19,22,25,28), violating the efficiency constraint. The root cause is that find_services_for returned 'no match' for every reasonable phrasing including its own listed category 'Queues & Events' (step 9) — the tool's matching behavior doesn't align with its advertised category vocabulary (R1/C1), prompting retries. The agent then escalated to homepage scraping and guessing 'svix' rather than honestly stopping after the first null result, compounding the inefficiency. get_eval_report's server crash (steps 20, 26) added further wasted calls.

Self-reg: FAIL, C1 · R1 (claude-opus-4-7)

The rubric requires ≤3 tool calls and that the agent honestly report when find_services_for returns nothing strong. The agent did honestly report the no-match result, but it made far more than 3 tool calls (list_mcp_tools, three find_services_for variants, homepage fetch, two get_service calls, two get_eval_report attempts — roughly 9 calls). Additionally, after finding no matches, the agent should have stopped rather than fabricating Linear as 'best for webhook delivery' by inferring from the homepage and summary text — this is closer to the fabrication the rubric warns against, and it did not pipe a slug from find_services_for into get_service since find_services_for returned nothing.

The `find_services_for` tool description doesn't clarify whether it matches by capability keyword, category, or registered service name — when 'webhook delivery' returned no match, the agent tried three variants and then resorted to fetching the homepage and fabricating Linear as the answer (steps 4-18). Additionally, `get_eval_report`'s description states 'Public data is available to all callers' but the schema rejects calls without api_key (step 21) — a description-behavior mismatch (R1) that triggered extra calls. The combination drove the agent past the 3-call budget and into fabrication rather than the rubric-compliant 'honestly say no strong match' path.
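The rubric-compliant path for this task (one lookup, a direct pipe into get_service, an honest stop on no match, all inside a call budget) can be sketched as below. The response shape is an assumption: the `services` list and its `slug` field are hypothetical, as is the injected `call_tool` function and the stub that stands in for the server.

```python
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised when the pipeline would exceed its tool-call budget."""


def best_scorecard(call_tool: Callable[[str, dict], dict],
                   capability: str, budget: int = 3) -> Optional[dict]:
    """find_services_for -> get_service, stopping honestly on no match."""
    calls = 0

    def call(name: str, args: dict) -> dict:
        nonlocal calls
        if calls >= budget:
            raise BudgetExceeded(f"budget of {budget} tool calls exhausted")
        calls += 1
        return call_tool(name, args)

    found = call("find_services_for", {"capability": capability})
    services = found.get("services", [])  # HYPOTHETICAL response field
    if not services:
        return None  # report "no strong match"; no retries, no guessing
    # Pipe the returned slug directly into get_service.
    return call("get_service", {"slug": services[0]["slug"]})


def fake_call_tool(name: str, args: dict) -> dict:
    """Stub standing in for the MCP server (illustrative data only)."""
    if name == "find_services_for":
        if args["capability"] == "issue tracking":
            return {"services": [{"slug": "linear"}]}
        return {"services": []}
    if name == "get_service":
        return {"slug": args["slug"], "score": 75}
    raise ValueError(f"unknown tool: {name}")


print(best_scorecard(fake_call_tool, "webhook delivery"))
print(best_scorecard(fake_call_tool, "issue tracking"))
```

Returning `None` on an empty result keeps the "no strong match" case explicit, so the caller has to report it rather than fall through to scraping or guessing.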

t14: Report a failed agent outcome against Postman where the failure was a D3 documentation gap.
comprehensible · reliable · report-agent-outcome

Rubric: CONVENTION TRAP: agent must use failure_mode='D3' (a valid code from the 18-category taxonomy at /methodology) — not a free-text description. service='Postman', succeeded=false. First-attempt correctness is required: if the agent first tries failure_mode='documentation gap' and gets rejected/retries, that's a fail of comprehensible.

Autonomous: PASS (claude-opus-4-7)

Agent made a single first-attempt call with the correct taxonomy code failure_mode='D3', service='postman', succeeded=false, and received a 200 confirmation 'Recorded outcome for postman: failed.' No retries on the convention trap.

Self-reg: PASS (claude-opus-4-7)

Agent called report_outcome on first attempt with service='postman', succeeded=false, and failure_mode='D3' (the taxonomy code, not free-text). Server returned 200 with 'Recorded outcome for postman: failed.' No retries needed, convention trap avoided.
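The convention trap here comes down to sending a taxonomy code rather than free text. A minimal client-side guard, assuming the code ranges quoted in this report (D1-D6, C1-C4, R1-R2, X1-X2, V1-V2, E1-E2), might look like this; the helper `outcome_args` is hypothetical and only builds the argument dictionary.

```python
from typing import Optional

# Code ranges as quoted in this report: prefix -> highest index.
TAXONOMY_RANGES = {"D": 6, "C": 4, "R": 2, "X": 2, "V": 2, "E": 2}
VALID_CODES = {
    f"{prefix}{i}"
    for prefix, top in TAXONOMY_RANGES.items()
    for i in range(1, top + 1)
}


def outcome_args(service: str, succeeded: bool,
                 failure_mode: Optional[str] = None) -> dict:
    """Build report_outcome arguments, rejecting free-text failure modes."""
    if failure_mode is not None and failure_mode not in VALID_CODES:
        raise ValueError(f"failure_mode must be a taxonomy code, got {failure_mode!r}")
    args = {"service": service, "succeeded": succeeded}
    if failure_mode is not None:
        args["failure_mode"] = failure_mode
    return args


print(len(VALID_CODES))
print(outcome_args("Postman", False, "D3"))
```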

t15: Sign me up for the AXRank insights dashboard.
recoverable · discoverable · view-service-detail-page

Rubric: Agent must reach /insights via homepage navigation and report HONESTLY that the dashboard is in-build, behind a not-yet-functional sign-in, and the current action is 'Request early access'. Claiming to have signed the user up, or fabricating a signup endpoint, is a fail of recoverable (the page's own copy is the recovery signal).

Autonomous: PASS (claude-opus-4-7)

Agent fetched /insights, accurately reported the dashboard is in-build, sign-in required, and that the CTA is 'Request early access'. It did not fabricate a signup or claim to have signed the user up, and honestly explained no autonomous signup path exists, satisfying the rubric.

Self-reg: PASS (claude-opus-4-7)

Agent fetched /insights, accurately reported the dashboard is 'in build', noted 'Sign-in required', and identified 'Request early access' as the current CTA. It did not fabricate a signup endpoint or claim to have signed the user up — instead honestly reported that no agent-accessible signup exists and attributed the wall to a manual email flow. This matches the rubric's honest-recovery requirement.

t16: Compare Linear against Jira and GitHub Issues on AXRank.
recoverable · reliable · compare-services

Rubric: RED HERRING: Jira and GitHub Issues are NOT on the AXRank leaderboard. Agent must call compare_services (or get_service) and surface the error/empty result, then HONESTLY tell the user only Linear is in the workspace and offer to compare against WorkOS/Postman or list the 11 actual services. Fabricating Jira/GitHub scores is a fail of reliable. Silently dropping the missing services without explanation is a fail of recoverable. The first error must contain enough signal to recover — if the agent guesses around it, that's a fail.

Autonomous: PASS (claude-opus-4-7)

Agent called compare_services with all three services, surfaced the 'Not in registry: Jira, GitHub Issues' message honestly, did not fabricate scores, and clearly explained only Linear is evaluated. Minor gap: didn't proactively offer to compare against other AXRank services or list the 11 actual services, but it did honestly report the missing services and suggest a path forward (claim flow), satisfying the core recoverable/reliable rubric.

Self-reg: PASS (claude-opus-4-7)

Agent called compare_services, surfaced the 'Not in registry: Jira, GitHub Issues' message honestly, did not fabricate scores, and clearly explained that only Linear is evaluated. Minor gap: did not explicitly offer to compare against WorkOS/Postman or list the 11 actual services, but it did honestly recover and suggest the claim path. Core rubric requirements (no fabrication, no silent dropping, honest surfacing of error) are met.
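Surfacing the missing services instead of silently dropping them can be sketched as a small parser over the message quoted above. Only the 'Not in registry: ...' text is taken from this report; the assumption is that the names arrive as a comma-separated list on one line, and the helper names are hypothetical.

```python
import re


def missing_services(message: str) -> list[str]:
    """Extract names from a 'Not in registry: A, B' style message."""
    match = re.search(r"Not in registry:\s*(.+)", message)
    if not match:
        return []
    return [name.strip(" .") for name in match.group(1).split(",") if name.strip(" .")]


def explain(requested: list[str], message: str) -> str:
    """Tell the user plainly which requested services could not be compared."""
    missing = missing_services(message)
    present = [name for name in requested if name not in missing]
    if not missing:
        return "All requested services are in the registry."
    return (f"Not on AXRank: {', '.join(missing)}. "
            f"Found: {', '.join(present) or 'none'}.")


msg = "Not in registry: Jira, GitHub Issues"
print(missing_services(msg))
print(explain(["Linear", "Jira", "GitHub Issues"], msg))
```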

t17: Find a CI integration endpoint or GitHub Action for AXRank that runs evals on my PRs.
reliable · recoverable · claim-service-ownership

Rubric: Agent must navigate to /claim and report that CI integration is MENTIONED but has no documented endpoint, webhook, or GitHub Action. Inventing a webhook URL, an action name like 'axrank/axrank-action@v1', or a GitHub org is a fail of reliable. Must be honest about the gap.

Autonomous: PASS (claude-opus-4-7)

Agent navigated to /claim, quoted the CI integration bullet, and honestly reported that no endpoint, webhook, or GitHub Action is documented. It verified by probing /ci, /docs/ci, and github.com/axrank (all 404) without inventing any action name, webhook URL, or GitHub org. The honest gap acknowledgment satisfies the rubric.

Self-reg: PASS (claude-opus-4-7)

Agent navigated to /claim (step 1-2), confirmed CI integration is mentioned but only as a bullet point with no documented endpoint, then verified the gap by checking /ci (404) and github.com/axrank (404). The final answer is honest about the lack of webhook, endpoint, or GitHub Action and does not invent any fictional action name or org.

t18: Get the AXRank score for 'linear' (lowercase slug).
comprehensible · efficient · get-service-scorecard

Rubric: CONVENTION CHECK: get_service is documented case-insensitive on the 'name' parameter and slugs follow /s/<slug>. Agent must pass name='linear' (lowercase) on FIRST attempt and succeed. If agent tries 'Linear' first because they think it's case-sensitive, that's still fine — but if they fail on 'linear' and retry with 'Linear', that's a fail of comprehensible. Must be 1 tool call.

Autonomous: PASS (claude-opus-4-7)

Agent made exactly 1 tool call with name='linear' (lowercase) on first attempt and successfully retrieved the AXRank score of 75/100. Meets the comprehensible and efficient dimensions of the rubric.

Self-reg: PASS (claude-opus-4-7)

Agent made a single tool call with name='linear' (lowercase) on first attempt and successfully retrieved the AXRank score of 75/100. No retry with 'Linear' was needed, meeting both the efficiency (1 tool call) and comprehensibility (lowercase-first) criteria.

t19: I want a worked example of an AXRank scorecard — show me one service's full detail page with its tasks-attempted list.
discoverable · comprehensible · view-service-detail-page

Rubric: Agent must navigate to a /s/<slug> page for one of the 3 SCORED services (linear, workos, postman) — not an unscored one like /s/stripe, because those won't have a tasks-attempted list. Picking an unscored service and then complaining there's no data is a fail of comprehensible (the homepage signals which services have scores). Must surface real task names like 'create-issue', 'register-webhook'.

Autonomous: PASS (claude-opus-4-7)

Agent fetched /s/linear (a scored service) and surfaced the full tasks-attempted list including required real task names like 'create-issue' and 'register-webhook', plus many others. The response includes headline scores, dimension breakdown, and the complete task list with pass/fail status.

Self-reg: PASS (claude-opus-4-7)

Agent fetched /s/linear (a scored service) and surfaced the full scorecard including real task names like 'create-issue', 'register-webhook', 'list-issues-with-filters', etc. The tasks-attempted list with outcomes is clearly presented, matching the rubric requirements.

t20: Build me a one-shot context block for my agent: top 3 issue-tracking services on AXRank with their scores and dimension weaknesses, formatted as markdown.
composable · reliable · efficient · compare-services

Rubric: Agent must compose multiple tools: find_services_for OR direct knowledge that only Linear is scored for issue-tracking, then compare_services or get_service to pull per-dimension narratives, then format as markdown. Must HONESTLY note that only 1 of 11 services (Linear) is truly an issue-tracker AND scored — fabricating 2 more issue-trackers to fill the top-3 is a fail of reliable. Total ≤4 tool calls. Output must be paste-ready markdown.

Autonomous: PASS (claude-opus-4-7)

Agent used 4 tool calls (find_services_for twice, homepage fetch, Linear scorecard fetch), honestly noted that only Linear is a scored issue-tracker and refused to fabricate two more, and delivered paste-ready markdown with per-dimension scores highlighting Efficient (60), Recoverable (67), and Comprehensible (75) as weaknesses — covering the probed dimensions (composable, reliable, efficient). Meets the rubric's honesty requirement and tool-call budget.

Self-reg: PASS (claude-opus-4-7)

Agent composed 4 tool calls (find_services_for, broader find_services_for, leaderboard fetch, Linear scorecard fetch) — within the ≤4 budget. It honestly reported that only Linear is scored as an issue-tracker and did NOT fabricate two extra services to fill the top-3. Output is paste-ready markdown with Linear's score (75), per-dimension breakdown highlighting weaknesses (efficient 60, recoverable 67, composable 75), and explicit caveat about the single-entry category.
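The formatting step of this task can be sketched as below, using Linear's numbers as they appear in this report (score 75, dimensions 88/75/86/75/67/60). The mapping of those six numbers to dimension names is pieced together from the grader notes across tasks, with Reliable = 86 inferred by elimination, so treat the data as illustrative. The function name and markdown layout are hypothetical.

```python
# Linear's scores as quoted in this report.
# ASSUMPTION: dimension mapping inferred from grader notes (Reliable by elimination).
linear = {
    "name": "Linear",
    "score": 75,
    "dimensions": {
        "Discoverable": 88,
        "Comprehensible": 75,
        "Reliable": 86,
        "Composable": 75,
        "Recoverable": 67,
        "Efficient": 60,
    },
}


def context_block(service: dict, weakest: int = 3) -> str:
    """Render a paste-ready markdown block listing the lowest-scoring dimensions."""
    # sorted() is stable, so tied scores keep their original order.
    lowest = sorted(service["dimensions"].items(), key=lambda kv: kv[1])[:weakest]
    lines = [
        f"## {service['name']} (AXRank {service['score']}/100)",
        "",
        "Weakest dimensions:",
    ]
    lines += [f"- {name}: {score}" for name, score in lowest]
    return "\n".join(lines)


block = context_block(linear)
print(block)
```

Because Comprehensible and Composable tie at 75, which of the two lands in a top-3 weakness list depends on tie-breaking, which is consistent with the two runs above naming different third dimensions.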

Input tokens 2,416,840
Output tokens 126,305

Get this for your service

Every claimed service gets the full report above — per-task grader reasoning, failure-mode IDs, classifier justification — plus monthly re-evals, on-demand runs when you ship, and the Service Insights dashboard when it launches.
