Author disambiguation – OpenAlex

How OpenAlex identifies and links authors across millions of scholarly works.

The challenge

Scholarly works list author names in all sorts of ways. "J. Smith," "John Smith," and "John A. Smith" might all be the same person — or three different people. OpenAlex uses machine learning to figure out which works belong to the same real-world author, even when names vary.

How it works

Our disambiguation algorithm considers six factors when deciding whether two authorship records belong to the same person:

Name similarity — String matching across different name variants
Co-author patterns — Shared collaborators across papers
Institutional affiliations — Consistent workplace signals
Research topics — Whether the publication record is topically coherent
Citation patterns — Self-citation and reference overlap
ORCID — When available, this provides an authoritative identity signal

So if "J. Schmidt" and "John Jacob Jingleheimer Schmidt" both write about 19th-century ketchup production at the same university, we'll treat them as one author. But we won't lump in the J.J.J. Schmidt who writes about weasel migration, even though the names match.
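To make the interplay of these signals concrete, here is a toy sketch in Python. It is not the OpenAlex model: the function names, the record fields, and the equal-weight averaging are all assumptions made for illustration, and citation patterns are left out for brevity. It only shows how a compatible name can be necessary without being sufficient.

```python
def name_tokens(name):
    # Lowercase, treat periods as separators, split on whitespace.
    return name.replace(".", " ").lower().split()


def names_compatible(a, b):
    """Hypothetical check: same surname, and given names agree position by
    position, where an initial matches any name that starts with it."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    return all(x.startswith(y) or y.startswith(x)
               for x, y in zip(ta[:-1], tb[:-1]))


def jaccard(a, b):
    # Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical).
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_author_score(rec_a, rec_b):
    """Toy score in [0, 1]; the equal weighting is an arbitrary assumption."""
    # When both records carry an ORCID, let it decide outright.
    if rec_a.get("orcid") and rec_b.get("orcid"):
        return 1.0 if rec_a["orcid"] == rec_b["orcid"] else 0.0
    if not names_compatible(rec_a["name"], rec_b["name"]):
        return 0.0
    signals = [
        jaccard(rec_a["coauthors"], rec_b["coauthors"]),
        jaccard(rec_a["institutions"], rec_b["institutions"]),
        jaccard(rec_a["topics"], rec_b["topics"]),
    ]
    return sum(signals) / len(signals)


# Made-up records mirroring the example in the text.
ketchup_a = {"name": "J. Schmidt", "orcid": None,
             "coauthors": {"A. Heinz"},
             "institutions": {"Example University"},
             "topics": {"ketchup production"}}
ketchup_b = {"name": "John Jacob Jingleheimer Schmidt", "orcid": None,
             "coauthors": {"A. Heinz", "B. Tomato"},
             "institutions": {"Example University"},
             "topics": {"ketchup production", "food history"}}
weasel = {"name": "J.J.J. Schmidt", "orcid": None,
          "coauthors": {"C. Stoat"},
          "institutions": {"Another Institute"},
          "topics": {"weasel migration"}}

print(same_author_score(ketchup_a, ketchup_b))  # high: shared context
print(same_author_score(ketchup_a, weasel))     # zero: names match, nothing else does
```

In this sketch all three names pass the compatibility check, so the decision rests entirely on the contextual signals, which is the point of the ketchup and weasel example.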

Our author data comes from Crossref, PubMed, ORCID, publisher websites, and the legacy Microsoft Academic Graph.

The July 2023 upgrade

In July 2023, OpenAlex switched to a significantly improved disambiguation system. The upgrade included:

A better machine learning model for clustering
Smarter assignment strategies for newly published works
Deeper integration with ORCID data

As part of that switch, we deprecated all of the old OpenAlex Author IDs and assigned new Author IDs to all authors. You can find the old Author IDs, along with their associated works, as a data dump here. In every new Author ID, the numeric part is greater than 5000000000. The new Author IDs have been in use since late July 2023 and appear in the data snapshots starting in August 2023.
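If you hold a mix of stored IDs and want to know which ones predate the upgrade, the numeric threshold above can be checked directly. The following is a minimal sketch; the helper names are hypothetical, and it assumes IDs arrive either in short form (A followed by digits) or as a full openalex.org URL ending in that form.

```python
import re

# Threshold stated in this section: new IDs have a numeric part above this.
NEW_ID_THRESHOLD = 5_000_000_000


def author_id_number(author_id):
    """Return the numeric part of an Author ID given in short or URL form."""
    match = re.search(r"[Aa](\d+)$", author_id.strip())
    if match is None:
        raise ValueError(f"not an OpenAlex Author ID: {author_id!r}")
    return int(match.group(1))


def is_post_upgrade_id(author_id):
    """True if the ID falls in the range assigned since the July 2023 upgrade."""
    return author_id_number(author_id) > NEW_ID_THRESHOLD


# Illustrative IDs only, not real author lookups.
for candidate in ["https://openalex.org/A5023888391", "A2208157607"]:
    print(candidate, is_post_upgrade_id(candidate))
```

Note that the special NULL author ID A9999999999 also exceeds the threshold, so handle it separately if you need to tell it apart from real profiles.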

NULL authors (A9999999999)

You might occasionally see the special author ID A9999999999. This represents authorships that didn't go through the disambiguation process. This typically happens when:

No author name was received from the data source
The name was too short or too long to disambiguate reliably
The name matched an ignored phrase like "Unknown Author"

These records are grouped under the single NULL author rather than being assigned to real author profiles. See this article for more information.
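When counting or grouping by author, you will usually want to set these authorships aside so that unrelated people are not treated as one very prolific author. Here is a minimal sketch, assuming a work is a dictionary with an authorships list whose entries each hold an author dictionary with id and display_name; the helper names and the sample work are made up.

```python
NULL_AUTHOR_ID = "A9999999999"


def short_id(openalex_id):
    # Accept both "A123" and "https://openalex.org/A123".
    return openalex_id.rsplit("/", 1)[-1]


def split_authorships(work):
    """Split a work's authorships into (disambiguated, null_author) lists."""
    resolved, unresolved = [], []
    for authorship in work.get("authorships", []):
        author = authorship.get("author") or {}
        if short_id(author.get("id") or "") == NULL_AUTHOR_ID:
            unresolved.append(authorship)
        else:
            resolved.append(authorship)
    return resolved, unresolved


# Made-up work record for illustration.
sample_work = {
    "authorships": [
        {"author": {"id": "https://openalex.org/A5023888391",
                    "display_name": "Example Author"}},
        {"author": {"id": "https://openalex.org/A9999999999",
                    "display_name": "Unknown Author"}},
    ]
}

resolved, unresolved = split_authorships(sample_work)
print(len(resolved), "disambiguated;", len(unresolved), "under the NULL author")
```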

Fixing errors

Disambiguation isn't perfect. Sometimes authors get incorrectly split into multiple profiles, or works from different people get merged into one profile. Author profile attributes like alternative names, institutions, metrics, and topics are all derived from linked publications, so they can't be edited directly.

If you notice errors in an author profile, you can submit a correction request through the OpenAlex help center or use our author curation form.

Code, data, and methods

Our methods, code, and trained models are fully open source:

openalex-name-disambiguation — Python code, methods, and training data
Live disambiguation code — Production disambiguation pipeline

For more on the Author object and available filters, see the Author API documentation.
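As a starting point for exploring author profiles, the sketch below builds two request URLs without sending them. It assumes the public API base https://api.openalex.org, the search parameter on the authors endpoint, and the author.id filter on the works endpoint; the function names are hypothetical, and the Author API documentation remains the authority on the available parameters.

```python
from urllib.parse import urlencode

API_BASE = "https://api.openalex.org"


def author_search_url(name):
    """URL that searches author profiles by name."""
    return f"{API_BASE}/authors?{urlencode({'search': name})}"


def works_by_author_url(author_id):
    """URL that lists the works linked to one author profile."""
    short = author_id.rsplit("/", 1)[-1]  # accept short or URL form
    query = urlencode({"filter": f"author.id:{short}"}, safe=":")
    return f"{API_BASE}/works?{query}"


print(author_search_url("John A. Smith"))
print(works_by_author_url("https://openalex.org/A5023888391"))
```

Inspecting the works returned for a profile is a quick way to spot the split or merged profiles described under "Fixing errors" before submitting a correction request.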