Methodology

Last updated: 2026-05-20

Why this page exists

The product's claim is "Verified, not promised." That claim only holds if you can see, in plain language, exactly what we check, exactly how, and what we deliberately do not check yet. This page is that record. It is kept in step with the code in src/extensions/crawler/ — when the code changes, this page changes.

The two signals we verify today

For every submission you mark as submitted, our crawler currently records two binary signals about the directory's public listing page:

Signal 1: Live

We fetch the directory's listing page (or, when your link is not found there, a small set of secondary URLs the directory commonly uses) and scan every <a> tag. We compare each link's resolved absolute URL against your product URL using host-without-www + pathname-without-trailing-slash matching. A single match anywhere on the page sets the status to live. If no match has been found by the time the submission expires, the status becomes dead; until then it stays pending.

Signal 2: Dofollow

When the live match is found, we read the matched <a>'s rel attribute. If rel contains "nofollow", dofollow is false; otherwise true. We store this as submissions.isDofollow, which is what your dashboard and the refund logic read.

That is the entire signal surface today: two pure-HTML checks on the actual listing page, derived from the directory's own response.
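The two checks above can be sketched roughly as follows. This is a minimal illustration, not the contents of detect.ts: the names AnchorTag, normalizeForMatch, findProductLinkSketch and resolveStatus are hypothetical, the anchors are assumed to be already extracted from the parsed HTML, and splitting rel into whitespace-separated tokens is an assumption about how "contains nofollow" is evaluated.

```typescript
// Hypothetical shape for an <a> tag already extracted from the parsed HTML.
interface AnchorTag {
  href: string;
  rel?: string;
}

interface LinkMatch {
  live: boolean;
  isDofollow: boolean | null; // null when no matching link was found
}

type SubmissionStatus = "live" | "dead" | "pending";

// Reduce a URL to "host without www" + "pathname without trailing slash".
// Relative hrefs are resolved against baseUrl; unparseable input yields null.
function normalizeForMatch(rawUrl: string, baseUrl?: string): string | null {
  try {
    const url = new URL(rawUrl, baseUrl);
    const host = url.hostname.toLowerCase().replace(/^www\./, "");
    const path = url.pathname.replace(/\/+$/, "");
    return host + path;
  } catch {
    return null;
  }
}

// Signal 1 and Signal 2: the first anchor whose normalized URL equals the
// normalized product URL decides both "live" and "dofollow".
function findProductLinkSketch(
  anchors: AnchorTag[],
  pageUrl: string,
  productUrl: string
): LinkMatch {
  const target = normalizeForMatch(productUrl);
  if (target === null) return { live: false, isDofollow: null };
  for (const a of anchors) {
    if (normalizeForMatch(a.href, pageUrl) !== target) continue;
    // ASSUMPTION: rel is treated as a case-insensitive, space-separated token list.
    const relTokens = (a.rel || "").toLowerCase().split(/\s+/);
    return { live: true, isDofollow: relTokens.indexOf("nofollow") === -1 };
  }
  return { live: false, isDofollow: null };
}

// A match means live; no match stays pending until the expiry passes, then dead.
function resolveStatus(matchFound: boolean, now: Date, expiresAt: Date): SubmissionStatus {
  if (matchFound) return "live";
  return now.getTime() >= expiresAt.getTime() ? "dead" : "pending";
}
```

Under this normalization, https://www.product.example/ and https://product.example compare as equal, while a tracker-wrapped link such as /out?u=https://product.example resolves to the directory's own host and does not match.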

The fetch stack

Getting the listing page itself is harder than it looks, because directories vary from plain static HTML to fully JavaScript-rendered single-page apps. Our crawler handles them in two layers:

Direct fetch — Node undici with a rotating User-Agent and a 15-second timeout. This covers the majority of directories that render their listings server-side.
ScrapingBee fallback — when the direct fetch returns 200 but the page has no useful HTML (because the listings are rendered by JavaScript on the client), we re-fetch through ScrapingBee with JS rendering enabled. This costs credits, so we use it as a fallback, not as the default. If SCRAPINGBEE_API_KEY is not configured for the deployment, this layer is simply skipped and the submission stays pending.
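The two-layer flow might look like the sketch below. It is an illustration under stated assumptions rather than the crawler's actual code: HtmlFetcher, FetchLayers, hasUsefulHtml and fetchListingHtml are hypothetical names, both layers are passed in as functions so no real network or ScrapingBee call appears here, "no useful HTML" is approximated as "no <a href> at all", and a non-200 direct response is assumed to leave the submission pending.

```typescript
// Both layers share one shape so either can be replaced by a stub.
type HtmlFetcher = (url: string) => Promise<{ status: number; html: string }>;

interface FetchLayers {
  direct: HtmlFetcher; // e.g. a plain HTTP request with a 15-second timeout
  rendered?: HtmlFetcher; // e.g. a JS-rendering proxy; absent when no API key is configured
}

type FetchOutcome =
  | { kind: "html"; html: string; via: "direct" | "rendered" }
  | { kind: "skipped" }; // nothing usable yet, so the submission stays pending

// ASSUMPTION: a page with no <a href=...> at all is treated as a client-rendered shell.
function hasUsefulHtml(html: string): boolean {
  return /<a\s[^>]*href\s*=/i.test(html);
}

async function fetchListingHtml(url: string, layers: FetchLayers): Promise<FetchOutcome> {
  const first = await layers.direct(url);
  if (first.status === 200 && hasUsefulHtml(first.html)) {
    return { kind: "html", html: first.html, via: "direct" };
  }
  // The fallback is described only for the "200 but empty shell" case.
  // ASSUMPTION: any other direct result, or a missing fallback layer, is skipped.
  if (first.status !== 200 || !layers.rendered) {
    return { kind: "skipped" };
  }
  const second = await layers.rendered(url);
  if (second.status === 200 && hasUsefulHtml(second.html)) {
    return { kind: "html", html: second.html, via: "rendered" };
  }
  return { kind: "skipped" };
}
```

Passing the layers in as parameters keeps the paid fallback off the default path and makes the "skip when unconfigured" behaviour a matter of leaving rendered undefined.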

The parsed HTML is fed to a small pure function (detect.ts:findProductLink) that performs the two signal checks above. There is no AI / LLM step in the verification path — it is deterministic, repeatable, and inspectable.

The daily cadence

A Vercel cron triggers /api/cron/dispatch on a schedule, which drains a Postgres-backed work queue (crawl_jobs) of pending check jobs. A second cron at /api/cron/reconcile sweeps for orphaned jobs (created but never finished, e.g. a deployment killed mid-check) and re-enqueues them.

Current schedule (soft-launch stopgap): dispatch runs once daily at 02:00 UTC; reconcile runs once daily at 03:00 UTC. This is a Vercel Hobby plan constraint: Hobby caps cron frequency at one run per day. The product is designed for the previous 5-minute / hourly cadence that Vercel Pro unlocks, and we will restore that once paying-user volume justifies the upgrade. While on the daily schedule, the worst-case delay between you marking a submission as submitted and its first check is about 24 hours. That delay does not change the refund window: the 30-day refund clock starts at purchase, not at first check.

Once a check runs, the next check is scheduled by schedule.ts:nextCheckAt: every day for the first week after the submission's baseline, every 2 days through day 30, every 3 days through day 90, and weekly after that. This cadence is what the code says today; the same function decides when a job is re-enqueued.
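As a rough illustration of that cadence, the tiers can be written as a small pure function. This is a sketch, not the contents of schedule.ts: checkIntervalDays and nextCheckAtSketch are hypothetical names, age is assumed to be measured in whole days since the baseline at the time the check ran, and treating day 30 and day 90 as still belonging to the shorter interval is one reading of "through day 30" and "through day 90".

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Interval (in days) until the next check, given how many whole days have
// passed since the submission's baseline when the current check ran.
function checkIntervalDays(ageDays: number): number {
  if (ageDays < 7) return 1; // daily during the first week
  if (ageDays <= 30) return 2; // ASSUMPTION: day 30 itself still uses the 2-day interval
  if (ageDays <= 90) return 3; // ASSUMPTION: day 90 itself still uses the 3-day interval
  return 7; // weekly afterwards
}

// Next check time = time of the check that just ran + the interval for its age tier.
function nextCheckAtSketch(baseline: Date, lastCheck: Date): Date {
  const ageDays = Math.floor((lastCheck.getTime() - baseline.getTime()) / DAY_MS);
  return new Date(lastCheck.getTime() + checkIntervalDays(ageDays) * DAY_MS);
}
```

For example, a check that runs nine days after the baseline falls in the 2-day tier, so the following check would be scheduled two days later.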

How this drives the refund

The refund logic in src/core/stripe/refund-eligibility.ts looks at two counts that come from this verification process: how many submissions you have marked as submitted (markedSubmittedCount), and how many of those reached status live (liveCount). If, within 30 days of purchase and after at least 20 marked submissions, fewer than 50% went live, the Audit refund applies automatically. See the Refund Policy for the full conditions. The point worth saying here is structural: the refund threshold is computed only from signals we actually check. Anything we do not verify (see the roadmap below) cannot be a refund condition — that is a deliberate design choice, not a workaround.
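The threshold described above reduces to a short calculation. The sketch below is illustrative only and is not the code in refund-eligibility.ts: RefundInputs and isAuditRefundEligible are hypothetical names, only markedSubmittedCount and liveCount come from this page, and the full conditions in the Refund Policy are not modelled.

```typescript
interface RefundInputs {
  purchasedAt: Date;
  now: Date;
  markedSubmittedCount: number; // submissions the user marked as submitted
  liveCount: number; // how many of those reached status live
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const REFUND_WINDOW_DAYS = 30;
const MIN_MARKED_SUBMISSIONS = 20;
const LIVE_RATE_THRESHOLD = 0.5;

function isAuditRefundEligible(input: RefundInputs): boolean {
  const elapsedMs = input.now.getTime() - input.purchasedAt.getTime();
  // ASSUMPTION: the window is inclusive of the 30th day and excludes dates before purchase.
  const withinWindow = elapsedMs >= 0 && elapsedMs <= REFUND_WINDOW_DAYS * MS_PER_DAY;
  if (!withinWindow) return false;
  if (input.markedSubmittedCount < MIN_MARKED_SUBMISSIONS) return false;
  // "Fewer than 50% went live": exactly 50% does not qualify.
  return input.liveCount / input.markedSubmittedCount < LIVE_RATE_THRESHOLD;
}
```

With 20 marked submissions, 9 live qualifies (45%) and 10 live does not (exactly 50%); both inputs come only from the two signals the crawler records.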

Where the directory metadata comes from

The numbers shown on each /directories/[slug] page — Domain Rating, monthly visits, audit time, pricing — are not signals our crawler produces. They come from two clearly attributed sources:

Ahrefs: Domain Rating, referring domains, backlinks. Sampled manually via Ahrefs' site explorer; the values you see were last reviewed on each directory page's lastEditorialReview date (also shown in the sitemap as that page's lastmod).
SimilarWeb: estimated monthly visits, traffic source breakdown, top countries. We pull these via SimilarWeb's public data endpoint through a residential proxy because the consumer site blocks scraping; the values are third-party estimates, not first-party logs from the directory.

We never derive these from the directory's own self-reported marketing numbers, and we will state "—" rather than guess when a value is missing.

What we deliberately do not check (yet)

Two signals are commonly assumed to be part of a verification product but are not implemented in our crawler today. We list them here so there is no ambiguity:

Google indexing: whether the directory's listing of your product has actually been indexed by Google. Reliably checking this requires either SerpAPI / ValueSERP-class paid third-party access (Google blocks scraping of site: queries directly) or per-domain Search Console verification, which only works for sites you control. We are evaluating SerpAPI for a future release; until that ships, our pages and refund logic do not reference an "indexed" signal.
Anchor-text fidelity: whether the directory used the anchor text you asked for, or rewrote it. Implementing this needs a small extension to the submission model (storing your intended anchor text) and a comparison in the crawler. Planned, not shipped.

These are honest gaps, not hidden caveats. Earlier copy on the site claimed both signals; we have corrected the copy and surfaced the gap here. When we ship either one, this page is the place it is announced first, and the refund eligibility will be updated explicitly if its semantics change.

Known limitations

HTML detection compares URLs by host and pathname. Directories that route through an intermediary redirect or wrap your URL in a tracker may not match on the first try. We retry across a small set of plausible secondary listing URLs before concluding "not live."
ScrapingBee credits are not unlimited. If our budget is exhausted we will queue and retry the next day rather than mark a submission dead.
The daily-only cron during soft-launch means a listing that goes live and then disappears within the same 24-hour window between checks would not be observed. Restoring the original 5-minute / hourly cadence would shrink that window accordingly.
Anything sitting behind a hard login wall (private dashboard, paid newsletter list) cannot be verified from a public crawler. We do not attempt to log in.

Updates

When the code or schedule changes materially, this page's "Last updated" date moves and the change is summarised. For questions or specific examples of a check you would like explained, email hi@submitaitool.net or see /contact.