How AI Visibility Tools Actually Know What People Are Asking ChatGPT

If you’ve used any AI visibility tool (Semrush’s AI Visibility Toolkit, Ahrefs Brand Radar, Otterly, Peec, SE Ranking, HubSpot AEO, Profound, Promptmonitor), you’ve probably seen claims like “13.5 million prompts tracked” or “239 million prompts in our database.” Most SEOs accept these numbers at face value without asking the obvious question.

Where do those prompts come from? How does a tool know what real people are asking ChatGPT in private sessions?

The answer involves clickstream data, third-party panels, and an infrastructure that’s been quietly powering SEO tools for over a decade. The methodology determines what the data actually means, and which tool you should trust for which question.

What Clickstream Data Is

Clickstream data is the chronological record of every action a user takes online. Pages visited. Time on page. Clicks. Search queries entered. Results clicked. The path through a session from start to exit.

The term goes back to the early web when “clicks” described most of what users did. TechTarget and Matomo both define it in roughly the same way: a record of user activity that, when aggregated, reveals behavioral patterns.

DataForSEO splits clickstream data into two forms. Aggregated data shows totals over time periods. Unaggregated data shows individual user journeys, click sequences, and visit durations. SEO tools mostly use the aggregated form, processed through algorithms that strip personally identifying information.
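To make the two forms concrete, here is a minimal Python sketch. The event fields (user_id, ts, url) and the sample rows are invented for illustration; they are not DataForSEO's schema or any provider's real format.

```python
from collections import Counter, defaultdict

# Hypothetical unaggregated events: one row per action, tied to a panel user.
events = [
    {"user_id": "u1", "ts": 1, "url": "example.com/pricing"},
    {"user_id": "u1", "ts": 2, "url": "example.com/signup"},
    {"user_id": "u2", "ts": 1, "url": "example.com/pricing"},
    {"user_id": "u2", "ts": 3, "url": "example.com/blog"},
    {"user_id": "u3", "ts": 2, "url": "example.com/pricing"},
]


def user_journeys(rows):
    """Unaggregated view: each user's click sequence in time order."""
    journeys = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["user_id"], r["ts"])):
        journeys[row["user_id"]].append(row["url"])
    return dict(journeys)


def aggregate_visits(rows):
    """Aggregated view: visit totals per URL, with user identifiers dropped."""
    return Counter(row["url"] for row in rows)


journeys = user_journeys(events)
totals = aggregate_visits(events)
print(journeys["u1"])        # one user's path through the site
print(totals.most_common())  # totals only, no per-user detail
```

The aggregated output is the shape that typically reaches an SEO tool: counts per page or query, with nothing left that points to an individual session.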

How Clickstream Data Is Collected

There are two collection categories, and the second one is where SEO and AI visibility tools get their data.

First-party clickstream data is collected by the site owner. You install tracking on your own site through Google Analytics, Hotjar, Amplitude, Matomo, or server log analysis, and you see what your own users do. This is the kind of data you have direct access to and complete control over.

Third-party clickstream data is collected by data providers who recruit panels of users willing to have their browsing observed. The user installs some piece of software, agrees to data collection in the terms of service (sometimes prominently, sometimes not), and their activity gets aggregated into a panel that data providers sell to third parties.

The software users install typically falls into a few categories:

  • Browser extensions, often free utilities like coupon finders, ad blockers, or tab managers
  • Free or freemium antivirus and security software
  • Free VPN services
  • Free toolbars
  • Paid research panels with explicit opt-in
  • Less commonly today, ISP-level partnerships

Victorious explicitly notes that “SEO tools like Ahrefs and Semrush typically obtain clickstream data by purchasing it from these third-party data providers.” The tools themselves don’t run the panels. They buy the data.

How SEO Tools Have Used This Data for Over a Decade

This part is worth understanding because it sets up the AI visibility section. Clickstream data isn’t a new ingredient in SEO tools. It’s been powering features SEOs interact with every day for years.

The list includes keyword search volume estimates (especially after Google obscured the real numbers in Keyword Planner), competitor traffic estimates in Semrush Traffic Analytics, SimilarWeb, and similar tools, SERP click-through rate data, keyword difficulty scoring, and audience demographics.

Semrush’s own KB article is explicit about the source: “The data in our Traffic & Market toolkit comes from our panel of over 200 million real, anonymized internet users across more than 190 countries and regions. We partner with hundreds of clickstream data providers to build this panel, which records billions of events on the internet each month.”

Shahid Shahmiri’s breakdown of Ahrefs’ data sources explains that Ahrefs runs three data pipelines in parallel: their own crawler (AhrefsBot) for link data, third-party clickstream panels for behavioral data, and Google Keyword Planner for keyword existence. The clickstream layer is what powers their traffic and volume estimates.

Every time you’ve looked at a search volume number in Ahrefs or Semrush, you’ve been looking at clickstream-derived data. You just may not have known it.

A Brief Word on Avast

In January 2020, a joint Vice and PCMag investigation revealed that Avast antivirus, with over 100 million users, was selling clickstream data through its subsidiary Jumpshot. The data was detailed enough to identify individuals despite being marketed as anonymized. Customers included major retailers, analytics firms, and SEO platforms. Avast announced it was shutting Jumpshot down within days of the investigation being published.

This matters because it disrupted the clickstream data market for years. The supply hasn’t disappeared, but it’s been consolidated and diversified. Tools that depend on this data now talk about partnering with “hundreds of providers” rather than a single source, which is partly a hedge against any single provider blowing up the same way Jumpshot did.

Larry Ludwig’s piece on clickstream data providers covers this history and is worth reading if you want the full context. The short version: the clickstream pipeline that powers SEO tools is real, but the sourcing is often deliberately opaque because the industry learned a lesson from Avast.

How AI Visibility Tools Use Clickstream Data: Method 1

The first approach used by AI visibility tools is capturing real prompts from clickstream panel members who use AI platforms. When a panel member opens ChatGPT, Perplexity, Gemini, or Claude and types a prompt, that prompt, the response, and any cited sources get captured by the panel software and aggregated into the tool’s database.

Semrush’s AI Visibility Toolkit KB article states the methodology directly. The exact quote: “We source billions of real prompts from AI search clickstream data and Google’s keyword dataset for AI Overviews.” The toolkit has 239 million prompts and responses across ChatGPT, Gemini, Google AI Overviews, and AI Mode.

Ahrefs Brand Radar uses the same model. Their database currently sits at 13.5 million existing prompts. From Ahrefs’ own piece: “You can track your ChatGPT visibility across 13.5 million existing prompts inside Ahrefs Brand Radar database.”

What this means in practice. When these tools show you “Topics” or “what people are asking” reports, they’re showing aggregated prompts from real users in their panels who happened to use AI platforms during the data collection window. It’s not synthetic. It’s not Google search data dressed up to look like prompt data. It’s actual prompts from a panel large enough to be statistically meaningful at scale.

How AI Visibility Tools Use Clickstream Data: Method 2

The second approach is different. Many AI visibility tools don’t have access to clickstream data at all, or they use it as a supplementary source. Instead, they rely on running prompts that users (or the tool’s AI suggestions) define, on a schedule, and capturing the responses.

Otterly describes the methodology in their own words: “An AI visibility tracker works by automatically sending queries (search prompts) to AI search engines like ChatGPT, Perplexity, Google AI Overviews, and AI Mode, and analyzing the responses for brand mentions, citations, and source links.”

Peec AI runs prompts “once every 24 hours across your selected AI models.” SE Ranking’s ChatGPT Visibility Tracker “scans ChatGPT answers for your target keywords and analyzes which of them end up with your brand being mentioned.” Promptmonitor lets users “track specific prompts or questions in AI optimization.” HubSpot AEO suggests prompts based on company data, then tracks visibility across them.

The methodology is essentially rank tracking applied to AI platforms. The user (or the tool) defines prompts. The tool runs them through APIs or by scraping the AI interface. The tool captures and analyzes the responses on a schedule.
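The loop described above can be sketched in a few lines of Python. This is a rough illustration, not how any named tool is implemented: fake_ask_model is a hypothetical stand-in for a real API call or interface scrape, and the prompts, brands, and canned answers are made up.

```python
import re
from dataclasses import dataclass


@dataclass
class PromptResult:
    prompt: str
    mentioned: bool


def fake_ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an AI platform call; returns canned text."""
    canned = {
        "best crm for startups": "Popular options include Acme CRM and Globex.",
        "crm with free tier": "Globex and Initech both offer free plans.",
    }
    return canned.get(prompt, "")


def mentions_brand(text: str, brand: str) -> bool:
    """Case-insensitive whole-phrase match for the brand name."""
    pattern = rf"\b{re.escape(brand)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def run_tracking(prompts, brand, ask=fake_ask_model):
    """One scheduled run: send each prompt, record whether the brand appears."""
    return [PromptResult(p, mentions_brand(ask(p), brand)) for p in prompts]


results = run_tracking(["best crm for startups", "crm with free tier"], "Acme CRM")
for r in results:
    print(r.prompt, "->", "mentioned" if r.mentioned else "not mentioned")
```

A real tracker would run something like run_tracking on a schedule and store each run with a timestamp, which is what makes trend lines possible. It would also need to handle the variability in answers across models, locations, and run times.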

The strength of this approach is precision. You know exactly what was prompted because the tool prompted it. You can monitor specific prompts you care about over time. You can set up brand-specific or competitor-specific tracking and watch trends.

The limitation is that you’re tracking prompts you defined, not necessarily prompts real users are entering. If you assumed the wrong prompts mattered, you’re tracking the wrong data.

Surfer SEO’s overview of AI visibility tools makes the methodology distinction explicit: “Some rely on API responses, while others track what a real user sees in the interface. On top of that, AI answers vary depending on the model, the user’s location, language settings, and even the day/time the prompt is run.”

Why Both Methodologies Have Value

Each approach answers different questions.

Real prompt data (Method 1) tells you what real people are actually asking. This is useful for content strategy: discovering prompts you didn’t know existed, understanding the actual language users employ when talking to AI, identifying topics where there’s measurable search behavior. The limitation is that you’re seeing what was asked across the panel, not necessarily by your specific audience.

Synthesized prompt data (Method 2) tells you how AI platforms answer specific prompts you care about. This is useful for monitoring: tracking whether your brand appears when potential customers ask specific questions, watching trends over time on defined prompts, comparing your visibility to competitors on the same prompts. The limitation is that you’re tracking prompts you assumed mattered, which may or may not reflect real user behavior.

Most mature AI visibility tools combine both. Semrush, Ahrefs, and HubSpot AEO all offer “real prompt” databases for discovery alongside synthesized prompt tracking for monitoring. Smaller or newer tools tend to rely entirely on Method 2 because they don’t have access to clickstream panels at the scale required for meaningful Method 1 data.

The practical implication for you. When you see numbers from these tools, ask which methodology they reflect. If a tool says “We tracked 50 prompts and you appeared in 12,” that’s Method 2. If a tool says “Across our database of 239 million prompts, you were mentioned in 1.2%,” that’s Method 1. They’re measuring different things, and conflating them leads to bad strategic decisions.
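The arithmetic behind those two example statements shows why they aren't comparable. A small Python sketch, using only the numbers quoted above (the function names are made up for illustration):

```python
def tracked_visibility(appearances: int, prompts_tracked: int) -> float:
    """Method 2: share of the prompts you defined where the brand appeared."""
    if prompts_tracked <= 0:
        raise ValueError("prompts_tracked must be positive")
    return appearances / prompts_tracked


def panel_mention_count(mention_rate: float, database_size: int) -> int:
    """Method 1: approximate number of panel prompts behind a mention rate."""
    return round(mention_rate * database_size)


method_2 = tracked_visibility(12, 50)
method_1 = panel_mention_count(0.012, 239_000_000)

print(f"Method 2: {method_2:.0%} of 50 tracked prompts")
print(f"Method 1: about {method_1:,} of 239,000,000 panel prompts")
```

The first figure works out to 24% of a hand-picked set of 50 prompts. The second is roughly 2.87 million prompts out of a panel-wide database. The denominators describe different populations, so a higher percentage in one says nothing about the other.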

The Honest Caveats

Two worth flagging.

Privacy and sourcing transparency. SEO tools are sometimes vague about their exact data sources, and some of that vagueness is deliberate. The Avast situation taught the industry that aggressive data collection can blow up publicly. Tools that buy from “hundreds of providers” have plausible deniability about any single provider’s practices. This isn’t necessarily wrong, but it’s worth knowing that the sausage-making is more complicated than the marketing copy suggests.

Sample bias. Clickstream panels skew toward certain users. People who install free antivirus software, free VPNs, or browser extensions aren’t a representative sample of the entire internet. They tend to be more price-sensitive, more technically casual, and over-indexed in certain demographics and geographies. The data is meaningful but not perfectly representative. This applies to traditional clickstream-derived metrics (keyword volumes, traffic estimates) as much as it applies to AI prompt data. None of these numbers are precise. They’re best understood as directional intelligence, not absolute truth.

The Takeaway

The infrastructure powering AI visibility tools isn’t new. It’s the same clickstream data pipeline that’s been informing SEO traffic estimates and keyword volumes for over a decade, repurposed for a new job. Understanding where the data comes from helps you read it more accurately.

Tools that show you “real prompts” are showing you panel-derived data with the same strengths and limitations as Semrush’s traffic estimates or Ahrefs’ search volume numbers. Tools that show you tracked prompts are showing you a rank-tracking methodology applied to AI platforms.

Both are useful. Neither is magic. And the difference matters when you’re deciding which tool to trust for which question.
