Methodology: How We Test and Score AI Tools

Last updated: 4 July 2026

Why this page exists

Every review and buying guide on this site includes a score, a verdict, or both. This page explains exactly how those numbers are produced, what they weigh, and where the evidence behind them comes from, so you can judge how much weight to put on them yourself.

How we test tools

We test tools against the kinds of tasks a real small business actually does: drafting a client email, chasing an invoice, summarising a meeting, checking a policy question. If a business with no dedicated IT department couldn't pick up a tool and use it on Monday morning, we say so plainly.

We assess usability, implementation effort, pricing, support, and privacy considerations. We then weigh real-world value against what the tool actually costs a whole team, not just a per-seat price that means little until you multiply it out.
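To show what "multiplying it out" means in practice, here is a minimal sketch of the arithmetic. The function name, the price, and the seat count are all hypothetical figures chosen for illustration, not taken from any review, and a real comparison would also need to account for setup and training time.

```python
def annual_team_cost(per_seat_monthly: float, seats: int) -> float:
    """Turn a per-seat monthly price into a yearly cost for the whole team.

    Illustrative only: ignores discounts, setup fees, and staff time.
    """
    if per_seat_monthly < 0 or seats < 0:
        raise ValueError("price and seat count must be non-negative")
    return per_seat_monthly * seats * 12


# Hypothetical example: a tool listed at 25 per seat per month, for 8 staff.
print(annual_team_cost(25.0, 8))  # 2400.0
```

A listed price of 25 per seat looks small on a pricing page; for eight people over a year it is 2,400, which is the number a small business actually has to justify.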

The NTK Score

Reviews scored under the NTK Score system are rated across five pillars, each worth up to 20 points, for a total out of 100:

Relevance: how well the tool fits the problem a small business is actually trying to solve.
Effort: how much setup, training, and ongoing admin it takes to get value from the tool.
Adoption: how likely staff are to actually use it day to day, not just trial it once.
Commercial value: what it costs against what it saves or earns, at small-business scale.
Trust: vendor stability, data handling transparency, pricing stability, support reputation, and track record. For tools aimed at regulated industries such as healthcare, legal, or financial services, this pillar also weighs compliance-relevant criteria specific to that industry.

Each pillar carries a one-sentence rationale explaining the score, an evidence tier (a named primary source, a vendor's own published statement, or a public signal such as user reviews), and a confidence rating. A Trust score below 12 out of 20 caps the overall verdict at "Pilot first", regardless of the total score. This is deliberate: a tool can score well everywhere else and still need a trial before we'd recommend rolling it out, if there are real concerns about the vendor or its data handling.

The four verdict tiers, based on the total score out of 100:

Recommended: 80 or above.
Recommended with caveats: 65 to 79.
Pilot first: 50 to 64, or any higher score where the Trust cap applies.
Not recommended: below 50.
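The scoring rules above can be sketched in a few lines of code. This is an illustration of the arithmetic as described on this page, not the tooling we use internally; the names ntk_score, NtkResult, and the pillar keys are hypothetical. It also assumes the Trust cap only ever lowers a verdict, so a total below 50 stays "Not recommended" even when the cap applies.

```python
from dataclasses import dataclass

# Pillar keys are illustrative names for the five pillars described above.
PILLARS = ("relevance", "effort", "adoption", "commercial_value", "trust")
MAX_PER_PILLAR = 20
TRUST_CAP_THRESHOLD = 12  # a Trust score below this caps the verdict


@dataclass(frozen=True)
class NtkResult:
    total: int
    verdict: str
    trust_capped: bool


def verdict_for_total(total: int) -> str:
    """Map a total out of 100 to one of the four verdict tiers."""
    if total >= 80:
        return "Recommended"
    if total >= 65:
        return "Recommended with caveats"
    if total >= 50:
        return "Pilot first"
    return "Not recommended"


def ntk_score(scores: dict[str, int]) -> NtkResult:
    """Sum the five pillar scores and apply the Trust cap."""
    missing = [name for name in PILLARS if name not in scores]
    if missing:
        raise ValueError(f"missing pillar scores: {missing}")
    for name in PILLARS:
        if not 0 <= scores[name] <= MAX_PER_PILLAR:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_PILLAR}")

    total = sum(scores[name] for name in PILLARS)
    verdict = verdict_for_total(total)

    # Assumption: the cap only lowers verdicts that would otherwise be
    # better than "Pilot first"; it never raises a "Not recommended".
    trust_capped = scores["trust"] < TRUST_CAP_THRESHOLD and total >= 65
    if trust_capped:
        verdict = "Pilot first"
    return NtkResult(total=total, verdict=verdict, trust_capped=trust_capped)


# Hypothetical tool: strong everywhere except Trust.
example = ntk_score({
    "relevance": 19,
    "effort": 18,
    "adoption": 18,
    "commercial_value": 18,
    "trust": 11,
})
print(example.total, example.verdict)  # 84 Pilot first
```

In the example, a total of 84 would normally earn "Recommended", but the Trust score of 11 pulls the verdict down to "Pilot first".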

Star ratings on newer reviews

Not every review has been through the full NTK Score process yet. New reviews are first published with an interim star rating across six dimensions (functionality, ease of use, value, trust, support, and integration), each scored out of 10, based on the same testing approach described above. This is a same-session editorial score, not yet backed by the evidence-tier research the NTK Score requires. When a product goes through the full NTK Score process, its star rating is replaced with the NTK Score on that article.

What we don't do

We don't write vendor marketing with a different logo on it. We're not paid to rank a tool higher, and if a tool isn't good enough, we say so, even when we'd rather it wasn't true. We don't dress up enterprise software as something a five-person team should buy. And we don't pad reviews with features nobody in a small business will ever touch.

Independence and corrections

We are not affiliated with any AI vendor, software company, or consultancy, and we do not accept payment in exchange for reviews or editorial coverage. Where affiliate links appear, they are disclosed clearly. See our Independence and Disclosure page for the full policy, including how to report a correction.