Faceless. Independent. Nobody pays for a verdict.

Before you pay for an AI tool, find out if it passes the acid test.

I put every tool in a category through the same battery, score it out of 100, and publish a dated PASS, BORDERLINE or FAIL. The inputs are public, so you can run the test yourself and tell me where I'm wrong. A new one lands every couple of weeks.

✓ You're on the list. The next battery lands in your inbox.

A new test every couple of weeks. No spam, unsubscribe anytime. By subscribing you agree to receive the newsletter and to our Privacy Policy.

Same published inputs, dated and re-runnable. No sponsorships; the verdict isn't for sale.

PASS
protocol v1.0 · running battery

The public scorecard

Every verdict stays on the record.

Each test adds a row. A re-test gets its own dated row, so nothing quietly gets overwritten. The back catalogue is the proof I'm not just chasing whatever launched this week.

RESULTS · scored out of 100 · updated each battery
Tool · Category · As of · Score · Verdict · Cost / result · Buy or skip

Rows marked pending are queued for the next run. Bands: PASS 75 and up, BORDERLINE 55 to 74, FAIL under 55.
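
For anyone re-running a battery, here is a minimal Python sketch of those bands. The function name is hypothetical, and treating a fractional score such as 74.5 as BORDERLINE is an assumption; the bands above are only stated for whole numbers.

```python
def verdict(score: float) -> str:
    """Map a 0-100 score to its band: PASS 75 and up, BORDERLINE 55 to 74, FAIL under 55."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 75:
        return "PASS"
    if score >= 55:
        # Assumption: anything from 55 up to (but not including) 75 is BORDERLINE.
        return "BORDERLINE"
    return "FAIL"


for s in (82, 74, 54):
    print(s, verdict(s))
```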

How the acid test works

One protocol, run the same way every time.

Every tool in a category meets the same battery and the same scoring. The number I care about most is the cost of a result you can actually use.

01

Same battery for every tool

One task set, one set of inputs, run identically. No friendly demos and no improvising to flatter a particular product.

02

Scored across seven things

Quality, reliability, speed, setup friction, cost per result, workflow fit, and the limits nobody advertises. It all collapses into one number out of 100.
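
One way seven sub-scores could collapse into a single number is a weighted average. The sketch below is illustrative only: the criterion keys are my shorthand for the seven things listed above, and the equal weights are an assumption, not the published weighting.

```python
from typing import Dict, Optional

# Shorthand keys for the seven criteria (hypothetical names).
CRITERIA = (
    "quality",
    "reliability",
    "speed",
    "setup_friction",
    "cost_per_result",
    "workflow_fit",
    "hidden_limits",
)


def overall_score(subscores: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of seven 0-100 sub-scores, giving one number out of 100."""
    if weights is None:
        # Assumption: every criterion counts equally.
        weights = {name: 1.0 for name in CRITERIA}
    missing = [name for name in CRITERIA if name not in subscores]
    if missing:
        raise ValueError(f"missing sub-scores: {missing}")
    total_weight = sum(weights[name] for name in CRITERIA)
    weighted_sum = sum(subscores[name] * weights[name] for name in CRITERIA)
    return weighted_sum / total_weight


# Made-up sub-scores, purely to show the arithmetic.
example = {
    "quality": 90, "reliability": 80, "speed": 70, "setup_friction": 60,
    "cost_per_result": 80, "workflow_fit": 90, "hidden_limits": 90,
}
print(round(overall_score(example), 1))
```

Passing a different `weights` dictionary changes how much each criterion counts without touching the rest of the calculation.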

03

The cost of a usable result

Sticker price lies. I track what one output you can trust actually costs, once the retries and the re-dos are in.
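
The arithmetic behind that, as a rough Python sketch. The function name and the numbers are made up for illustration; the idea is simply total spend divided by the outputs you would actually trust, rather than by every attempt.

```python
def cost_per_usable_result(total_spend: float, usable_outputs: int) -> float:
    """Total spend (retries and re-dos included) divided by the outputs you can trust."""
    if total_spend < 0:
        raise ValueError("total_spend cannot be negative")
    if usable_outputs <= 0:
        # Nothing usable came out, so the cost of a usable result is unbounded.
        return float("inf")
    return total_spend / usable_outputs


# Hypothetical month: a 20.00 plan, 40 attempts, 25 of them usable.
spend, attempts, usable = 20.0, 40, 25
sticker_cost = spend / attempts                      # cost per attempt
real_cost = cost_per_usable_result(spend, usable)    # cost per usable result
print(f"per attempt: {sticker_cost:.2f}, per usable result: {real_cost:.2f}")
```

In this made-up case the per-attempt price is 0.50 but a usable result costs 0.80, because the 15 failed attempts still have to be paid for.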

04

Dated, and you can repeat it

Every verdict is pinned to a version and a date, with the inputs published so you can run it yourself and check me.

Re-run it yourself

The exact inputs — public.

First run: AI meeting notetakers, Otter vs Granola (free plans, as of 2026-07-01). This is the same meeting audio both tools received. Download it, feed it to any notetaker, and score it against the fixed set below — written before testing.

Download the audio (MP3) · three speakers, ~2:16 · voices generated with ElevenLabs

20 terms scored: Northwind, Postgres, MySQL, RDS, Kubernetes, HPA, EC2, CloudWatch, TikTok, UTM, Google Analytics, Mailchimp, Datadog, HubSpot, Salesforce, Notion, LinkedIn, SOC 2, Jira, AWS.

10 action items scored: book the Saturday maintenance window (post in Slack by Thursday); open an AWS support ticket with the CloudWatch logs; grant Carla Google Analytics edit access; send three Mailchimp subject-line options for a vote; tune the Datadog latency alert to the 95th percentile; write a dedup script and test it in the Salesforce sandbox; put the $12k budget proposal in Notion; email the SOC 2 auditor to confirm the August date; everyone update Jira tickets before standup; and the Google Analytics access request again, as restated by the PM.

Result: both caught 9/10 action items. Terms: Otter 17/20 vs Granola 15/20. Speaker labels: Otter 73% vs Granola 0% (Granola doesn't separate speakers by design). Among their misses, both dropped the same two terms, RDS and EC2, from the quietest, most accented voice. This is a mini-run (two core tasks plus speaker labels); the full battery adds the rest.
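
If you want to score your own run against the term list, a small Python sketch like this would do the counting. The matching rule (case-insensitive, whole term only) is an assumption, not necessarily how I marked the transcripts by hand, and the sample sentence is invented rather than taken from the audio.

```python
import re
from typing import Iterable, List, Tuple


def score_terms(transcript: str, terms: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split the fixed term list into terms found in the transcript and terms missed."""
    hits: List[str] = []
    misses: List[str] = []
    for term in terms:
        # Whole-term match, so "RDS" is not counted inside a word like "words".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, transcript, flags=re.IGNORECASE):
            hits.append(term)
        else:
            misses.append(term)
    return hits, misses


# Invented one-line transcript and a five-term subset of the list, for illustration.
sample_terms = ["Northwind", "Postgres", "MySQL", "RDS", "AWS"]
sample_transcript = "We moved Northwind from MySQL to Postgres on AWS. In other words, done."
hits, misses = score_terms(sample_transcript, sample_terms)
print(f"{len(hits)}/{len(sample_terms)} terms caught, missed: {misses}")
# prints: 4/5 terms caught, missed: ['RDS']
```

Swap in the full 20-term list and your notetaker's transcript to get a count you can compare with the scores above. A strict string match will not credit near-misses or misspellings, so expect it to be harsher than a human reader.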

Dip it, read the verdict

A litmus strip doesn't care about the marketing, and neither does the score. Drag it: the same paper gives every tool the same reading.

82/100 PASS
0 10 20 30 40 50 60 70 80 90 100
PASS 75+ means take it. BORDERLINE 55–74, only for a specific job. FAIL under 55, skip it.

Get the verdict before you buy

See the receipts, not the marketing.

You get the scorecard, the raw inputs, and a straight buy-or-skip call, every couple of weeks.
