Structured Interviews: Why Schmidt & Hunter Still Matter in 2026

10 min read · Intrvio Team

Schmidt & Hunter (1998) gave us the most-cited validity table in industrial-organizational psychology: a structured interview correlates with job performance at r = .51, an unstructured interview at .38.[1] Sackett, Zhang, Berry, & Lievens (2022) re-ran the meta-analysis with a more conservative range-restriction correction and produced revised estimates roughly .10–.20 lower — but structured interviews emerged as the top-ranked single procedure even in the corrected ranking.[2][4]

The research is unambiguous. The practice gap is huge. This piece walks through what the numbers actually say, why unstructured interviews still dominate practice anyway, and what an AI interviewer changes about the operational equation.

The replication-crisis context, briefly

Most of psychology has spent the last decade re-running its core findings under stricter rules: pre-registered hypotheses, larger samples, more transparent analysis pipelines. Many cherished effects have shrunk or disappeared. So when an industrial-organizational psychologist points at a 1998 paper and says “structured interviews predict performance better than unstructured ones,” it is fair for an engineering leader to ask whether the result has held up.

The short answer is yes — with calibration. The Schmidt & Hunter (1998) numbers were already meta-analytic, summarising 85 years of research findings rather than relying on a single fragile study.[1][5] They have been re-examined twice in major published work since, and the rank order has held even when the absolute numbers have come down. Unlike the worst-hit areas of social psychology, hiring-validity research has weathered the audit.

Schmidt & Hunter (1998) — what they found

The original paper compiled correlations between selection procedures and supervisor performance ratings, corrected for measurement error and range restriction, across roughly 85 years of accumulated studies. The headline numbers from Table 1:[1]

Selection procedure                 Validity (r)
Work sample tests                   .54
General mental ability (GMA)        .51
Structured employment interview     .51
Job knowledge tests                 .48
Integrity tests                     .41
Unstructured employment interview   .38
Job experience (years)              .18
Years of education                  .10
Graphology                          .02

Two findings deserve emphasis. First, structured interviews tied with general mental ability tests at .51, dramatically outperforming unstructured interviews at .38 — a .13-point gap that translates into materially better hires across a large pipeline.[1] Second, the combination of GMA + structured interview reached a multivariate validity of .63, which is among the most predictive feasible hiring systems on record.
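
The .63 figure follows from the standard two-predictor multiple-correlation formula. A minimal sketch, assuming an interview–GMA intercorrelation of about .30 (an illustrative input; the exact figure used in the original analysis is not reproduced here):

    import math

    def multiple_r(r_y1, r_y2, r_12):
        """Multiple correlation of two predictors with a criterion."""
        r_sq = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
        return math.sqrt(r_sq)

    # GMA at .51 and structured interview at .51, assumed to correlate .30:
    print(round(multiple_r(0.51, 0.51, 0.30), 2))  # -> 0.63

The less the two predictors correlate with each other, the more incremental validity the second one adds; that is the whole argument for pairing an interview with a different kind of measure.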

Sackett, Zhang, Berry & Lievens (2022) — the correction

In November 2022, the Journal of Applied Psychology published a 28-page re-examination of the 1998 estimates. The authors’ central claim: prior meta-analyses systematically over-corrected for range restriction (the statistical artefact where you only observe post-selection data, so variance is compressed and observed correlations shrink; correcting for this scales estimates back up, and over-correcting inflates them).[2][4]
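
To see why the size of the correction matters so much, here is the textbook Thorndike Case II correction for direct range restriction. The u values below are illustrative, not figures from either paper:

    import math

    def correct_range_restriction(r_obs, u):
        """Correct an observed correlation for direct range restriction.
        u = SD(applicant pool) / SD(hired sample), >= 1."""
        return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

    # The same observed r under two assumptions about how much selection
    # compressed the variance:
    print(round(correct_range_restriction(0.30, 1.3), 2))  # modest:     0.38
    print(round(correct_range_restriction(0.30, 2.0), 2))  # aggressive: 0.53

The corrected estimate is driven as much by the assumed u as by the data, which is exactly the over-correction Sackett and colleagues argue happened.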

They proposed a more conservative correction and reported revised validity estimates. The key results, in their Table 3:

  • Most procedures dropped by .10 to .20 points.
  • Structured interviews emerged as the top-ranked single selection procedure in the revised rankings.
  • GMA dropped more sharply than interviews because it had relied more heavily on the most aggressive range-restriction correction.
  • The validity-vs-diversity tradeoff was made explicit: structured interviews show smaller mean Black–White subgroup differences than cognitive ability tests, making them attractive on both predictive and adverse-impact grounds.[2]

The takeaway is calibration, not invalidation. If you used to quote .51 on slides, the defensible 2026 number is closer to .40–.44. Structured interviews still beat unstructured by a meaningful margin; the relative ordering is intact.

Why unstructured interviews still dominate practice

Despite three decades of research, most hiring still happens through unstructured panel conversations. There are three reasons, and only one of them is irrational.

  1. Confidence asymmetry. Hiring managers feel highly competent at “reading” candidates and rate their own judgment highly. The research literature consistently shows interviewer self-assessment is uncorrelated with actual predictive validity. This is the irrational reason; awareness alone does not fix it.
  2. Decentralization. Once a company has dozens of people running interviews, enforcing rubric discipline at scale is genuinely hard. People skip the script when they think they are building rapport. Structure decays unless someone owns enforcement.
  3. Increment masking. The .13-point validity gain only shows up across a large pipeline; for any individual hire the outcome looks like a coin flip either way, so feedback never accumulates for the panel. The simulation sketch after this list makes the effect concrete.
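
A quick way to see increment masking is to simulate it. The sketch below is illustrative rather than drawn from either paper: it hires the top 10% of a large applicant pool by interview score and compares the mean true performance of hires under the two validities.

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_hire_quality(r, n=100_000, top_frac=0.10):
        """Hire the top fraction by interview score; return the mean true
        performance (in SD units) of the people hired."""
        x = rng.standard_normal(n)                               # interview score
        y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)   # performance
        cutoff = np.quantile(x, 1 - top_frac)
        return y[x >= cutoff].mean()

    print(mean_hire_quality(0.38))  # ~0.67 SD above the applicant mean
    print(mean_hire_quality(0.51))  # ~0.90 SD above the applicant mean

In aggregate that is a large difference in hire quality; for any single candidate the two outcome distributions overlap almost completely, which is why panels never feel the gap.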

A counterintuitive result from Schmidt & Zimmerman (2004): three to four independent unstructured interviews can match a single structured interview’s validity, simply because aggregation averages out individual interviewer bias.[3] In other words, even a panel of loosely run interviews can reach the structured-interview floor. But the cost is high (three or four hour-long interviews instead of one forty-five-minute session) and the candidate experience is worse. Structure is the cheaper path.
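
The aggregation effect can be approximated with a standard composite-validity formula: averaging k parallel ratings raises validity as long as the raters do not agree perfectly with each other. A minimal sketch; the .45 inter-interviewer agreement below is an illustrative assumption, not a figure from the paper:

    import math

    def pooled_validity(r_single, k, rho):
        """Validity of the average of k interviewers' ratings, where each
        rating has validity r_single and interviewers' ratings intercorrelate
        at rho (a Spearman-Brown-style composite)."""
        return r_single * math.sqrt(k) / math.sqrt(1 + (k - 1) * rho)

    for k in (1, 2, 3, 4):
        print(k, round(pooled_validity(0.38, k, 0.45), 2))
    # 1 -> 0.38, 4 -> ~0.5: roughly the structured-interview level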

Behavioral vs situational — head-to-head

Within structured interviews there are two main flavours. Behavioral interviews ask about past behaviour: “Tell me about a time you had to handle a customer escalation that required a system change.” Situational interviews pose hypotheticals: “If a customer demanded a refund that violated our policy and threatened to go to social media, what would you do?”

The meta-analytic verdict: both work, with overlapping confidence intervals. Situational interviews edge ahead very slightly for entry-level roles where the candidate has limited past behaviour to draw on; behavioral interviews edge ahead very slightly for experienced hires. The lever that matters is whether the interview is structured at all, not which structure flavour you pick. Picking either is fine; mixing the two within the same interview is also fine if every candidate sees the same mix.

What AI interviewers actually change

The research case for structure has been settled for decades. The missing piece has always been operational: how do you actually run structured interviews at scale, with consistency across hundreds of panelists, without the rubric decaying into “close enough” ratings? AI interviewers solve a specific failure mode.

When the AI is asking the questions, the script does not drift. Every candidate hears the same opener, the same probes, the same follow-up logic. When the AI is scoring, the rubric does not soften into vibes — every answer maps to anchored behavioural exemplars (Behaviorally Anchored Rating Scales / BARS) with the same definitions across candidates. Cross-candidate comparisons become meaningful because the measurement instrument has been the same instrument the whole time.
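
To make “anchored behavioural exemplars” concrete, here is a minimal sketch of a single BARS dimension. The dimension name and anchor texts are hypothetical, not taken from any published scale:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Anchor:
        level: int      # 1 (poor) to 4 (excellent)
        exemplar: str   # a concrete behavioural description of this level

    # Hypothetical "ownership" dimension with one anchor per level.
    OWNERSHIP = [
        Anchor(1, "Describes the problem but names no action they took"),
        Anchor(2, "Acted only after the task was explicitly assigned"),
        Anchor(3, "Identified the issue and drove a fix within their team"),
        Anchor(4, "Identified the issue, aligned other teams, and closed the loop"),
    ]

    # Level 3 means the same thing for every candidate because the
    # anchors never change between interviews.
    print(OWNERSHIP[2].exemplar)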

That does not by itself raise validity above what a well-run human structured interview achieves. The gain is reliability and scale: you eliminate the “I’ll deviate from the script because I have a hunch” failure that is endemic in human panels. Combined with a downstream technical screen and a final human conversation, you reach the GMA-plus-structured-interview combination that both the 1998 and 2022 meta-analyses rank at the top.

Common pitfalls when switching from unstructured to structured

  1. Writing too many questions. A 60-minute interview should have 5–8 substantive questions plus probes, not 15. More questions reduce depth per question and compress scoring variance.
  2. Skipping the rubric design. The questions are the visible part; the scoring rubric is the load-bearing part. Without anchored examples per level, two interviewers will score the same answer differently and the structure collapses.
  3. Allowing “general impression” ratings. A composite “overall fit” column re-introduces the unstructured failure mode. The composite must be a transparent function of the rubric scores; see the sketch after this list.
  4. Letting interviewers pick their own questions. If the question bank is “ask 3 of these 12,” you have a pseudo-structured interview where every candidate gets a different instrument. That collapses the comparison.
  5. Not training panel members on the rubric. Calibration sessions — three panelists rating the same recorded answer and discussing the difference — are the cheapest way to tighten inter-rater agreement.
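
On pitfall 3, “transparent function” can be as simple as a fixed weighted sum declared before interviewing begins. A minimal sketch with hypothetical dimensions and weights:

    def composite(scores: dict[str, int], weights: dict[str, float]) -> float:
        """Overall score as a fixed, auditable function of per-dimension
        rubric levels -- no free-floating 'general impression' input."""
        assert scores.keys() == weights.keys()
        return sum(weights[d] * scores[d] for d in scores)

    # Weights fixed up front; every candidate is scored the same way.
    weights = {"ownership": 0.4, "communication": 0.3, "technical_depth": 0.3}
    print(composite({"ownership": 3, "communication": 4, "technical_depth": 2},
                    weights))  # -> 3.0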

Sources

  1. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274. https://onthewards.org/wp-content/uploads/2016/09/The_Validity_and_Utility_of_Selection_Methods_in_Personnel_Psychology_-_Schmidt_KeyLIME.pdf
  2. Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology, 107(11), 2040–2068. https://gwern.net/doc/statistics/meta-analysis/2021-sackett.pdf
  3. Schmidt, F. L., & Zimmerman, R. D. (2004). A counterintuitive hypothesis about employment interview validity and some supporting evidence. Journal of Applied Psychology. https://pubmed.ncbi.nlm.nih.gov/15161412/
  4. Sackett, Zhang, Berry, & Lievens (2022), full PDF mirror at SMU Knowledge. https://ink.library.smu.edu.sg/lkcsb_research/6894/
  5. Schmidt & Hunter (1998), APA PsycNet record (canonical citation). https://psycnet.apa.org/record/1998-10661-006

Intrvio platform

Run structured interviews at scale, with the rubric enforced.

GAIA asks the same questions in the same order with the same probes, and scores against an anchored rubric every time.