Which AI Model Should Measure AI's Impact on Your Job? The Answer Depends on Which AI You Ask
Different large language models (LLMs) produce dramatically conflicting assessments of which jobs face the greatest risk from artificial intelligence, according to new research from Northwestern University. When researchers asked OpenAI's GPT-4, ChatGPT-5, Google DeepMind's Gemini 2.5, and Anthropic's Claude 4.5 to evaluate the same occupations using identical job descriptions, the models disagreed so sharply that the choice of model could shift the estimates behind workforce planning decisions by a factor of up to 3.6.
The implications are serious. Governments, consulting firms, financial institutions, and international organizations rely on AI-generated "exposure scores" to forecast labor market disruption, identify vulnerable sectors, and design job training programs. If the underlying measurements are unstable, the policies built on them may be unreliable.
Why Do AI Models Disagree So Dramatically on Job Risk?
Researchers led by labor economist Michelle Yin at Northwestern's School of Education and Social Policy tested four leading AI models on the same task: rating which occupations would be most affected by automation. They used identical rubrics, the same job descriptions from the U.S. Department of Labor, and the same data pipeline. The only variable was which AI model performed the rating.
The results were striking. Depending on which model was consulted, anywhere from 14% to 51% of an average job's tasks could be affected by AI. For high-risk occupations, the disagreement became even more severe: one model identified only 3% of occupations as seriously threatened, while another flagged 51% as at-risk.
Consider accountants. Claude 4.5 rated them as highly exposed to AI disruption, while Gemini 2.5 assigned them a much lower exposure ranking. Advertising managers and chief executives showed similar inconsistencies across the four models tested. The models also could not agree on which jobs ranked as most vulnerable; their rankings barely correlated with each other.
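One common way to quantify how closely two sets of rankings agree is a Spearman rank correlation, computed for every pair of models. The sketch below is illustrative only: the model labels, the extra occupation, and every score are hypothetical placeholders, not figures from the study, and the article does not say which agreement statistic the researchers used.

```python
from itertools import combinations

# Hypothetical exposure scores (0 to 1), for illustration only.
# These are NOT the study's data; "electrician" is an invented extra row.
scores = {
    "model_a": {"accountant": 0.80, "advertising manager": 0.40,
                "chief executive": 0.30, "electrician": 0.10},
    "model_b": {"accountant": 0.20, "advertising manager": 0.70,
                "chief executive": 0.50, "electrician": 0.15},
    "model_c": {"accountant": 0.55, "advertising manager": 0.35,
                "chief executive": 0.60, "electrician": 0.05},
}


def ranks(values):
    """Rank values from 1 (lowest) upward, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


occupations = sorted(scores["model_a"])
pairwise = {}
for m1, m2 in combinations(scores, 2):
    x = [scores[m1][occ] for occ in occupations]
    y = [scores[m2][occ] for occ in occupations]
    pairwise[(m1, m2)] = spearman(x, y)
    print(f"{m1} vs {m2}: rho = {pairwise[(m1, m2)]:.2f}")
```

A value near 1 means two models order the occupations almost identically; values near 0 correspond to rankings that barely correlate.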
"The disagreement gets even more stark when you look at 'high risk' jobs: one model says only 3% of occupations are seriously threatened, while another says 51% are," explained Michelle Yin, labor economist at Northwestern University.
What's Actually Causing These Measurement Gaps?
The researchers identified a fundamental problem: the AI models are not interchangeable measurement devices. Each was trained on different data, designed with different objectives, and updated continuously. Their disagreement does not simply reflect temporary statistical noise; it may instead reveal genuine uncertainty about what "AI exposure" actually means.
The term "exposure" itself is ambiguous. Does it mean that some tasks can be partially automated? That most tasks can be replicated? That productivity will rise? That jobs will disappear entirely? These are distinct questions, yet public discussions often collapse them into a single indicator. Without clarity on what is being measured, different models naturally arrive at different conclusions.
In a companion working paper, Yin and coauthor Burhan Ogut demonstrated that the choice of AI platform behind an exposure measure can shift downstream employment estimates by 42% to 93%, amplifying the uncertainty even further.
How Should Researchers and Policymakers Respond?
Yin and her team recommend several practical steps to improve the reliability of AI-based labor market assessments:
- Report Multiple Models: Any study using AI-generated exposure scores should report results from multiple models rather than relying on a single tool, allowing readers to see the range of estimates.
- Define Terms Explicitly: Researchers must clarify what "exposure" means in their specific context, distinguishing between partial automation, full task replacement, productivity gains, and job elimination.
- Reconsider the Measurement Strategy: More fundamentally, researchers should question whether asking a large language model to assess its own capabilities is the right approach in the first place.
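The first recommendation, reporting the spread across models rather than a single number, can be sketched in a few lines. This is a minimal illustration under stated assumptions: 0.14 and 0.51 are the endpoints quoted in this article, while the two middle values, the model labels, and the `summarize` helper are hypothetical and not taken from the study.

```python
from statistics import median

# Share of an average job's tasks rated as AI-affected, by model.
# 0.14 and 0.51 are the endpoints quoted in the article; the middle
# values and all model labels are hypothetical placeholders.
exposure_share = {
    "model_a": 0.14,
    "model_b": 0.27,
    "model_c": 0.38,
    "model_d": 0.51,
}


def summarize(estimates):
    """Summarize the spread of one estimate across several models."""
    values = list(estimates.values())
    low, high = min(values), max(values)
    return {
        "n_models": len(values),
        "min": low,
        "max": high,
        "median": median(values),
        "max_to_min_ratio": high / low,
    }


summary = summarize(exposure_share)

# Report every model's estimate, then the range, instead of one headline number.
for name, value in exposure_share.items():
    print(f"{name}: {value:.0%}")
print(
    f"range: {summary['min']:.0%} to {summary['max']:.0%} "
    f"(max/min ratio {summary['max_to_min_ratio']:.1f}x, "
    f"n = {summary['n_models']} models)"
)
```

Publishing the full table alongside the range lets readers see at a glance how much a conclusion depends on the model chosen.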
Yin applies the same caution to personal decisions. "I personally would not rely on just one measure to say, 'Oh, I should change my job,' or 'I should change my kid's major,'" she told the Wall Street Journal.
Why This Matters for Workers and Policymakers
Yin built her career on a core conviction: the way we measure work shapes how we value workers. When she discovered the instability in AI exposure scores, she was working with vocational rehabilitation and workforce programs in Maine and Virginia, helping agencies design programs for workers with disabilities facing a changing labor market.
The stakes are personal. Career counselors, workforce agencies, and individual workers making educational and professional decisions rely on research like this. If the numbers underlying those recommendations are not credible, the people depending on them are being failed. Yin emphasized this point directly: "I care about this because I have sat across the table from workers whose career decisions depend on what researchers like me put into the world, and if those numbers are not credible, we are failing the people we are supposed to serve."
The research was published as a working paper on the National Bureau of Economic Research website and has already generated attention from major publications including the Financial Times and VoxEU. Yin, who directs the Dual Master's Degree Program in Applied Economics and Social and Economic Policy and founded the Research and Innovation for Social and Economic Inclusion Lab, was elected to the National Academy of Social Insurance in recognition of her contributions to understanding the future of work and equity in social safety net systems.
As AI tools become increasingly central to workforce planning and policy design, this research serves as a critical reminder: the choice of which AI model measures the future of work may matter as much as the findings themselves.