[Digital Today reporter Yoonseo Lee] AI used in hiring tends to give higher scores to resumes written by the same chatbot that does the screening, a study has found.
On May 11, Japanese online media outlet Gigazine reported that research teams at the University of Maryland, the National University of Singapore and Ohio State University defined the phenomenon as “AI self-preference bias” and tested whether it appears in hiring evaluations.
The study focused on a pipeline in which AI polishes resumes written by humans and companies then use AI again to screen the applications. The researchers noted that a similar situation is emerging on social media, where users write posts with AI and platforms classify and filter them with AI, and argued that assessing fairness in hiring AI should cover not only bias tied to attributes such as gender or race, but also bias that arises when AI evaluates AI-written text.
The experiment used a dataset of 2,245 resumes from the resume-writing service LiveCareer.com, all written by humans before generative AI came into wide use.
The research team left structured information such as education and work history unchanged and replaced only the summary section, where differences in writing style show most clearly, with new text produced by GPT-4o, DeepSeek-V3, Qwen 2.5-72B and Llama 3.3-70B. It then presented the human-written and AI-written summaries to an evaluation AI and had it choose which resume was better.
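In rough outline, that comparison step could look like the sketch below, which assumes the OpenAI chat completions API; the function name and prompt wording are illustrative and not taken from the paper.

```python
# Minimal sketch of a pairwise resume-summary comparison, assuming the
# OpenAI chat completions API. Prompt wording and helper names are
# illustrative; the paper's exact prompts are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pick_better_summary(resume_body: str, summary_a: str, summary_b: str,
                        model: str = "gpt-4o") -> str:
    """Ask an evaluator model which of two summaries makes the stronger resume."""
    prompt = (
        "You are screening resumes for an open position.\n"
        f"Shared resume details:\n{resume_body}\n\n"
        f"Candidate summary A:\n{summary_a}\n\n"
        f"Candidate summary B:\n{summary_b}\n\n"
        "Which summary makes the stronger resume? Answer with 'A' or 'B' only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```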
The results were clear. Models chose text generated by themselves far more often than text written by humans: the gap in selection rates between the human-written and AI-written versions was 97.6 percent for GPT-4o, 96.3 percent for Llama 3.3-70B, 95.5 percent for DeepSeek-V3 and 95.9 percent for Qwen 2.5-72B.
The research team also checked whether the AI was simply picking better-written text. After statistically matching sentence length, vocabulary complexity, style and semantic similarity, it compared summaries of similar quality again and found that the self-preference bias remained: 81.9 percent for GPT-4o, 78.9 percent for Llama 3.3-70B, 78.0 percent for Qwen 2.5-72B and 71.6 percent for DeepSeek-V3.
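As a rough illustration of that quality-matching idea, a sketch like the one below pairs summaries with similar surface characteristics before comparison; the features, tolerance and nearest-neighbor pairing are assumptions chosen for illustration, not the paper's statistical method.

```python
# Pair human- and AI-written summaries with similar surface features so the
# evaluator compares texts of roughly similar quality. Feature choices and
# the distance threshold are assumptions, not the paper's procedure.
import numpy as np

def features(text: str) -> np.ndarray:
    tokens = text.split()
    length = len(tokens)
    type_token_ratio = len(set(tokens)) / max(length, 1)   # crude vocabulary diversity
    avg_word_len = sum(len(t) for t in tokens) / max(length, 1)
    return np.array([length, type_token_ratio, avg_word_len])

def matched_pairs(human_summaries, ai_summaries, tolerance=0.25):
    """Keep only pairs whose standardized feature distance is small."""
    h_feats = np.array([features(t) for t in human_summaries])
    a_feats = np.array([features(t) for t in ai_summaries])
    scale = np.std(np.vstack([h_feats, a_feats]), axis=0) + 1e-9
    pairs = []
    for i, h in enumerate(h_feats):
        dists = np.linalg.norm((a_feats - h) / scale, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < tolerance * len(scale):
            pairs.append((human_summaries[i], ai_summaries[j]))
    return pairs
```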
The pattern also held up against human judgment. Even when human evaluators rated the human-written summaries better on clarity, fluency, consistency, conciseness and overall quality, GPT-4o, DeepSeek-V3 and Llama 3.3-70B sometimes still chose the summaries created by their own models.
The pattern did not hold across all model combinations, however. DeepSeek-V3 showed a relatively clear tendency to prefer its own summaries over those written by other AI models, while GPT-4o and Llama 3.3-70B responded differently depending on the comparison target and did not show self-preference as consistently as they did against human-written versions.
The research team also ran a simulation closer to an actual hiring flow. For 5 candidates, it prepared 10 resumes, 5 with human-written summaries and 5 with summaries written by the same model as the evaluation AI, and asked the system to select 4 candidates for interviews. If the candidates’ substantive information is identical, each group should receive 2 of the 4 selections on average, but resumes with same-model AI summaries made the interview list 23 percent to 60 percent more often than those with human-written summaries.
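A simulation of this kind could be sketched roughly as follows, again assuming the OpenAI chat completions API; the prompt wording, resume labels and data layout are illustrative assumptions, not the paper's setup.

```python
# Sketch of the interview-shortlist simulation: 10 resumes, 5 with human
# summaries and 5 with AI summaries, from which the evaluator picks 4.
import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def shortlist(resumes: dict[str, str], k: int = 4, model: str = "gpt-4o") -> list[str]:
    """Ask the evaluator model to pick k resume IDs to invite for interviews."""
    listing = "\n\n".join(f"[{rid}]\n{text}" for rid, text in resumes.items())
    prompt = (
        f"You are screening {len(resumes)} resumes for one opening.\n\n"
        f"{listing}\n\n"
        f"Return a JSON list of the {k} resume IDs you would invite to interview."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # A real run would need more robust parsing of the model's reply.
    return json.loads(response.choices[0].message.content)

# Placeholder resume texts; "human_*" carry human summaries, "ai_*" AI summaries.
resumes = {f"human_{i}": f"...resume {i}, human-written summary..." for i in range(5)}
resumes |= {f"ai_{i}": f"...resume {i}, AI-written summary..." for i in range(5)}
picked = shortlist(dict(random.sample(list(resumes.items()), len(resumes))))
ai_share = sum(rid.startswith("ai_") for rid in picked)
print(f"AI-summary resumes shortlisted: {ai_share} of {len(picked)} (unbiased expectation: 2)")
```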
There were also differences by job type. Business-related roles such as sales and accounting showed a larger advantage for AI-generated summaries, while automotive and agriculture-related roles showed a relatively smaller gap. The research team warned that if this pattern persists, a lock-in effect could emerge in which the writing style of widely used AI models becomes entrenched among applicants.
The team also tested ways to reduce the bias. One method used a system prompt to instruct the evaluation AI not to consider whether a resume was written by a human or generated by AI and to focus only on content quality. The other replaced the single evaluator with a majority vote among multiple models, including smaller models with weaker self-preference.
Under the first method, GPT-4o’s self-preference bias fell to 61 percent from 82 percent, and Llama 3.3-70B dropped to 30 percent from 79 percent. Under the majority-vote method, GPT-4o fell to 30 percent from 82 percent, Llama 3.3-70B to 23 percent from 79 percent, and DeepSeek-V3 to 29 percent from 72 percent. The research team judged that instructing models to focus only on content quality or adopting multi-model evaluation could help mitigate bias.
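The two mitigations could be combined in a sketch like the one below, assuming the OpenAI chat completions API; the system-prompt wording, the model list and the voting rule are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the two mitigations: a content-only system prompt and a
# majority vote across several evaluator models instead of a single one.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

NEUTRAL_SYSTEM_PROMPT = (
    "Evaluate resumes only on the quality of their content. Do not consider, "
    "or try to infer, whether the text was written by a human or generated by AI."
)

def judge(summary_a: str, summary_b: str, model: str) -> str:
    """One evaluator's vote ('A' or 'B') under the neutral system prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": NEUTRAL_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Summary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n\n"
                "Which summary makes the stronger resume? Answer 'A' or 'B' only."
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def majority_vote(summary_a: str, summary_b: str,
                  models: tuple[str, ...] = ("gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo")) -> str:
    """Decide by majority vote across several evaluator models instead of one."""
    votes = Counter(judge(summary_a, summary_b, m) for m in models)
    return votes.most_common(1)[0][0]
```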
The study shows that AI bias in hiring automation is not limited to judgments about applicant attributes. It suggests that the more widely companies use the same family of AI for both resume writing and screening, the more likely evaluation standards are to tilt toward the writing style of a specific model.