Evaluating How LLMs Handle Education and Workforce Pathway Data: An Expert-Led Evaluation of Responses from Three Models Across Six User Personas and Ten Topics

Erika Tyagi; Kristin Blagg; Emily Gutierrez



May 12, 2026


AI tools are rapidly becoming one of the primary ways people find and interpret information, including answers to questions about education and workforce pathway data. These tools have the potential to make public data more accessible, but how well do they handle questions grounded in it? Can they find the right sources, use the data correctly, and communicate information in ways that serve their users? To find out, we conducted a structured evaluation in which subject-matter experts assessed how three leading large language models (LLMs) responded to 100 curated questions about education and workforce pathway data.

Why This Matters

Most public data systems were built to serve human users navigating websites and documentation, not to be discovered and used by AI systems. When that gap goes unaddressed, models often return answers that look authoritative but aren't—and by attributing incorrect information to specific tools and sources, they risk eroding trust in the data infrastructure that does have the right answers. Understanding where and how models fall short is a necessary first step toward making public data AI-ready, and the patterns we found are relevant to data providers well beyond education.

What We Found

Subject-matter experts scored all 300 responses, and four key patterns emerged:

Response quality varied sharply by task type. Models performed well on general questions, but quality dropped significantly when prompts required retrieving specific data and dropped further for code generation. Of all responses, only 16 percent had no accuracy issues, and 34 percent had critical errors. Accuracy problems were especially pronounced for specific retrieval tasks (58 percent critical) and code generation (79 percent critical).
Errors were common, hard to detect, and difficult to verify. Inaccurate statistics were presented with the same precision and authority as accurate ones, and more than 60 percent of responses had significant or critical citation issues—making independent verification time-consuming even for experts.
Models often answered a different question or omitted essential context. When asked about specific places or institutions, models frequently defaulted to national trends—in one case reporting declining enrollment at a community college whose enrollment was actually growing. Responses also often omitted caveats essential to interpreting data correctly, or produced answers when the right response was that the data did not exist or was not yet available.
Pointing models to Urban's data tools didn't improve results. Models recognized the Education Data Portal and the Education-to-Workforce Framework Data Tool but rarely accessed them correctly—fabricating API endpoints, citing nonexistent packages, and reporting figures that didn't come from the sources they named.

These findings are directly informing improvements to both tools. We are developing Model Context Protocol (MCP) servers to give AI systems structured pathways to find and query our data directly, enriching metadata and documentation layers to help models interpret data correctly and surface appropriate caveats, and scaling infrastructure to handle rapidly growing AI-driven demand. Once these updates are in place, we plan to conduct a follow-up evaluation to measure what works and share what we learn with the broader community of public data providers.

How We Did It

We developed 100 prompts across ten education and workforce topics, written from the perspective of six user personas representing the range of people who ask these kinds of questions. Each prompt was sent as a standalone query to three models (GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Flash) with no custom configuration, approximating the default experience most users encounter today and yielding 300 responses in total. Subject-matter experts scored every response using a two-layer rubric: an overall quality rating, plus diagnostic tags identifying specific failure types.
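The design described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the evaluation code used for the brief: the severity labels, the task-type names, the `ScoredResponse` record, and both helper functions are hypothetical, since the brief does not publish its rubric schema. Only the model names and the 100-prompt, three-model structure come from the text.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Model names are taken from the brief.
MODELS = ["GPT-5.2", "Claude Sonnet 4.5", "Gemini 3 Flash"]


@dataclass
class ScoredResponse:
    """One expert-scored response (hypothetical schema)."""
    prompt_id: int
    model: str
    task_type: str     # assumed labels, e.g. "general", "retrieval", "code"
    accuracy: str      # layer 1: assumed scale "none" < "minor" < "significant" < "critical"
    tags: tuple = ()   # layer 2: diagnostic tags naming specific failure types


def build_grid(n_prompts, models):
    """Pair every prompt with every model: one standalone query per pair."""
    return list(product(range(1, n_prompts + 1), models))


def severity_shares(responses, level):
    """Share of responses within each task type whose accuracy rating equals `level`."""
    totals = Counter(r.task_type for r in responses)
    hits = Counter(r.task_type for r in responses if r.accuracy == level)
    return {task: hits[task] / totals[task] for task in totals}


grid = build_grid(100, MODELS)  # 100 prompts x 3 models = 300 query pairs

# Toy records for illustration only; these are not results from the evaluation.
sample = [
    ScoredResponse(1, "GPT-5.2", "general", "none"),
    ScoredResponse(2, "GPT-5.2", "code", "critical", ("fabricated_endpoint",)),
    ScoredResponse(2, "Gemini 3 Flash", "code", "minor"),
]
shares = severity_shares(sample, "critical")
```

Keeping the overall rating and the diagnostic tags as separate fields mirrors the two-layer idea: the first supports headline shares by task type, while the second can be counted independently to show which failure types drive those shares.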
