AI Acid Test — pass/fail tests of AI tools, with the receipts

How the acid test works

One protocol, run the same way every time.

Every tool in a category meets the same battery and the same scoring. The number I care about most is the cost of a result you can actually use.

Same battery for every tool

One task set, one set of inputs, run identically. No friendly demos and no improvising to flatter a particular product.

Scored across seven things

Quality, reliability, speed, setup friction, cost per result, workflow fit, and the limits nobody advertises. It all collapses into one number out of 100.

The cost of a usable result

Sticker price lies. I track what one output you can trust actually costs, once the retries and the re-dos are in.
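The arithmetic behind that can be sketched in a few lines. This is an assumed formula, not the published scoring: total spend across every attempt, divided by the outputs you would actually trust. The function name and the example figures are hypothetical.

```python
def cost_per_usable_result(price_per_run, runs, usable):
    """Total spend across all attempts divided by the outputs you kept.

    Assumed formula for illustration; the figures passed in are hypothetical.
    """
    if usable < 0 or runs < usable:
        raise ValueError("usable must be between 0 and runs")
    if usable == 0:
        return float("inf")  # nothing usable, so no finite cost per result
    return price_per_run * runs / usable


# Hypothetical tool: $0.50 a run, 10 runs, only 4 outputs worth keeping.
sticker = 0.50
effective = cost_per_usable_result(sticker, runs=10, usable=4)
print(f"sticker ${sticker:.2f}, effective ${effective:.2f}")
```

With those made-up numbers the effective cost is $1.25 per usable output, two and a half times the sticker price.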

Dated, and you can repeat it

Every verdict is pinned to a version and a date, with the inputs published so you can run it yourself and check me.

Re-run it yourself

The exact inputs — public.

First battery: AI meeting notetakers (Otter, tl;dv, Fireflies, Granola and Fathom; free plans, tested 2026-07-14). Below is the exact meeting audio every tool received (Generation 1). Download it, feed it to any notetaker, and score it against the fixed term and action-item lists below, which were written before the audio was generated.

Download Generation 1 — the scored file · three speakers, 2:16 · also the alternate take (Gen 2) · voices generated with ElevenLabs

ElevenLabs rewrites the container metadata on every download, so byte hashes of the same take won’t match between downloads — compare the decoded PCM, not the raw file.
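One way to make that comparison, as a minimal sketch: hash the audio frames instead of the file bytes. It assumes you have already decoded each download to WAV (for example with ffmpeg) and uses only Python's standard `wave` and `hashlib` modules. The function names are mine, not part of any published tooling. If the download is in a lossy format, decode both copies with the same decoder, since different decoders may not produce bit-identical samples.

```python
import hashlib
import wave


def file_digest(path):
    """SHA-256 of the raw file bytes, container metadata included."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def pcm_digest(path):
    """SHA-256 of the audio format plus the decoded PCM frames only."""
    h = hashlib.sha256()
    with wave.open(path, "rb") as w:
        # Fold in the format so the same bytes at another sample rate differ.
        fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
        h.update(repr(fmt).encode("ascii"))
        while True:
            frames = w.readframes(65536)
            if not frames:
                break
            h.update(frames)
    return h.hexdigest()
```

Two downloads of the same take should then differ on `file_digest` and agree on `pcm_digest`.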

21 terms scored: Northwind, Postgres, HubSpot, TikTok, MySQL, RDS, Slack, Kubernetes, HPA, EC2, AWS, CloudWatch, Datadog, UTM, Google Analytics, Mailchimp, Salesforce, Notion, LinkedIn, SOC 2, Jira. Each is scored twice — heard (word recognised, any spelling) and canonical (exact spelling and case) — because a lowercase brand name is a different failure from a misheard word.
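The two-way split can be approximated mechanically. The sketch below is an assumption about how to automate it, not the procedure used for the published scores: "canonical" is an exact, case-sensitive whole-word match, and "heard" is the same match ignoring case and allowing the space inside a term to be dropped or hyphenated. Phonetic respellings such as "Hub Spot" would still need a human to judge.

```python
import re


def _pattern(term, loose):
    # Whole-word match; in loose mode an internal space may be missing or a hyphen.
    parts = [re.escape(p) for p in term.split()]
    sep = r"[\s-]?" if loose else " "
    return r"(?<!\w)" + sep.join(parts) + r"(?!\w)"


def score_term(term, transcript):
    """Return heard/canonical flags for one term in one transcript."""
    canonical = re.search(_pattern(term, loose=False), transcript) is not None
    heard = re.search(_pattern(term, loose=True), transcript, re.IGNORECASE) is not None
    return {"heard": heard, "canonical": canonical}


# Illustrative transcript lines, not taken from any tool's real output.
print(score_term("HubSpot", "We moved the leads into hubspot."))
print(score_term("EC2", "Spin up another easy to instance."))
```

The first call counts as heard but not canonical; the second fails both, which is the "easy to" case.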

10 action items scored: book the Saturday maintenance window (Slack by Thursday); open an AWS support ticket with the CloudWatch logs; tune the Datadog latency alert to the 95th percentile; grant Carla Google Analytics edit access; send three Mailchimp subject-line options for a vote; write a dedup script and test it in the Salesforce sandbox; put the $12k budget proposal in Notion; email the SOC 2 auditor about the August date; everyone update Jira tickets before standup; write a CloudWatch retention policy. Each is scored on capture, correct owner, and whether the tool invented a task that was never said.
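Those three checks can be tallied like this. It is a sketch under assumptions: a human has already matched each task the tool reported to an item in the fixed list (or marked it as unmatched), and the item ids and speaker labels below are hypothetical placeholders, not the real answer key.

```python
def score_action_items(expected, reported):
    """Tally capture, owner accuracy and invented tasks.

    expected: {item_id: owner} from the fixed list.
    reported: list of (matched_item_id_or_None, owner) from the tool's notes.
    """
    captured = {}
    invented = 0
    for matched_id, owner in reported:
        if matched_id is None or matched_id not in expected:
            invented += 1  # the tool listed a task that was never said
        else:
            captured.setdefault(matched_id, owner)  # first mention wins
    correct_owner = sum(1 for i, o in captured.items() if expected[i] == o)
    return {
        "total": len(expected),
        "captured": len(captured),
        "correct_owner": correct_owner,
        "invented": invented,
    }


# Hypothetical three-item excerpt with placeholder owners.
expected = {"maintenance-window": "Speaker A", "aws-ticket": "Speaker B", "datadog-alert": "Speaker C"}
reported = [("maintenance-window", "Speaker A"), ("aws-ticket", "Speaker C"), (None, "Speaker B")]
print(score_action_items(expected, reported))
```

Here two of three items are captured, one has the right owner, and one task is invented.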

Result (composite = terms 40% + action items 40% + speakers 20%): Otter 98 · tl;dv 90 · Fireflies 87 · Granola 59. Every tool heard essentially every term — transcription is close to solved. They split on understanding the conversation: Otter labelled all three speakers and every task owner correctly; tl;dv and Fireflies heard flawlessly but confused speakers; Granola doesn’t separate speakers at all and compressed a whole sentence into two words; Fireflies invented a task nobody said.
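The composite is a plain weighted sum. As a sketch, assuming each of the three parts is first scored on a 0 to 100 scale and the total is rounded to a whole number (neither detail is stated above), it looks like this. The sub-scores in the example are invented, not any tool's real breakdown.

```python
# Weights in percent, as given in the result line above.
WEIGHTS = {"terms": 40, "action_items": 40, "speakers": 20}


def composite(subscores):
    """Weighted sum of three 0-100 sub-scores, rounded to an integer."""
    if set(subscores) != set(WEIGHTS):
        raise ValueError(f"expected exactly these keys: {sorted(WEIGHTS)}")
    total = sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
    return round(total / 100)


# Hypothetical sub-scores for illustration only.
print(composite({"terms": 90, "action_items": 80, "speakers": 50}))
```

That example works out to 36 + 32 + 10 = 78.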

The mistakes followed the quietest voice. The same term was spoken by different people on purpose. Five of six errors landed on the one speaker measured 3 dB quieter than the others — the identical words came through cleanly from the louder voices. (Three tools heard “EC2” as “easy to.”) The voices are AI-generated, so read this as a reproducible probe, not proof of bias in the wild — download the file and try your own recording.
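If you try this on your own recording, you will want to know how much quieter one voice is. A minimal sketch, assuming you have already cut out each speaker's segments and loaded them as floats between -1 and 1: compare RMS levels in decibels. This is one common way to measure level and may not be how the 3 dB figure above was obtained.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to full scale, for floats in [-1, 1]."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


# Synthetic check: the same tone, once at full level and once scaled down 3 dB.
loud = [math.sin(2 * math.pi * i / 50) for i in range(1000)]
quiet = [x * 10 ** (-3 / 20) for x in loud]
print(round(rms_db(loud) - rms_db(quiet), 2))
```

A 3 dB drop corresponds to multiplying the amplitude by about 0.71.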

Inputs weren’t identical for every tool. Otter and Fireflies take a file directly; tl;dv, Granola and Fathom only join a live call, so they were fed the same file played into a meeting, a lossier capture that handicaps them. tl;dv still heard every term. Fathom is not scored: it labels speakers by meeting participant, so on a played-in file it assigned the whole conversation to one person. That’s a limitation, not a verdict. This was a mini-run: the two core tasks (terms and action items) plus speaker labels. The full battery adds the rest.

Dip it, read the verdict

A litmus strip doesn't care about the marketing, and neither does the score. Drag it: the same paper gives every tool the same reading.

Example reading: 82/100, PASS (scale runs from 0 to 100)

PASS 75+ means take it. BORDERLINE 55–74, only for a specific job. FAIL under 55, skip it.
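Those bands translate directly into a lookup. A small sketch, assuming whole-number scores from 0 to 100; the function name is mine.

```python
def verdict(score):
    """Map a 0-100 score to the PASS / BORDERLINE / FAIL bands above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 75:
        return "PASS"
    if score >= 55:
        return "BORDERLINE"
    return "FAIL"


for s in (82, 74, 54):
    print(s, verdict(s))
```

An 82 reads PASS, a 74 reads BORDERLINE, and a 54 reads FAIL.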

Before you pay for an AI tool, find out if it passes the acid test.

Every verdict stays on the record.

See the receipts, not the marketing.