This page collects the information related to our institution parsing. It will be updated as new information becomes available.
Technical documentation and overview
Institutions are universities and other organizations to which authors claim affiliations.
We work closely with ROR, so every OpenAlex institution has a corresponding entry in ROR.
Our information about institutions comes from several sources: Crossref, PubMed, ROR, MAG, and publisher websites. In order to link institutions to works, we parse every affiliation listed by every author. These affiliation strings can be quite messy, so we’ve trained an algorithm to interpret them and extract the actual institutions with reasonably high reliability.
For a simple example: we will treat both “MIT, Boston, USA” and “Massachusetts Institute of Technology” as the same institution (https://ror.org/042nb2s44).
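To make the idea concrete, here is a minimal sketch of what "two different strings, one institution" means in practice. It is only an illustration: the alias table and the `resolve_affiliation` helper are hypothetical, and the real system uses a trained model rather than a lookup table.

```python
from typing import Optional

MIT_ROR = "https://ror.org/042nb2s44"

# Hypothetical alias table, for illustration only.
KNOWN_ALIASES = {
    "mit": MIT_ROR,
    "massachusetts institute of technology": MIT_ROR,
}


def resolve_affiliation(raw: str) -> Optional[str]:
    """Return the ROR ID of the first comma-separated segment that is a known alias."""
    for segment in raw.split(","):
        key = segment.strip().lower()
        if key in KNOWN_ALIASES:
            return KNOWN_ALIASES[key]
    return None  # no known institution found in the string


for raw in ["MIT, Boston, USA", "Massachusetts Institute of Technology"]:
    print(raw, "->", resolve_affiliation(raw))
```

Both strings resolve to the same ROR ID even though they share no words, which is the behavior the parsing model is trained to reproduce on much messier input.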
You can find more information about OpenAlex institutions in our technical documentation.
Super systems
We mark certain institutions as "super systems". These include large university systems such as the University of California System, as well as some governments and multinational companies. These are excluded from the results when doing analyses such as Identifying collaborating institutions. You can learn more in the technical documentation here.
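If you run your own analysis on institution records, you may want to apply the same exclusion. The sketch below assumes each record is a dictionary carrying a boolean `is_super_system` flag; that field name and the sample records are assumptions made for illustration, so check the technical documentation for the actual schema.

```python
from typing import Dict, Iterable, List


def drop_super_systems(institutions: Iterable[Dict]) -> List[Dict]:
    """Keep only institutions that are not flagged as super systems.

    Records without the flag are treated as ordinary institutions.
    """
    return [inst for inst in institutions if not inst.get("is_super_system", False)]


# Hypothetical sample records.
sample = [
    {"display_name": "University of California System", "is_super_system": True},
    {"display_name": "Example University", "is_super_system": False},
]

for inst in drop_super_systems(sample):
    print(inst["display_name"])
```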
Institution Parsing
OpenAlex has to parse a large number of raw affiliation strings in order to link authors to institutions. To do this effectively, we created a deep learning model that takes a raw string and assigns one or more institutions to it. If you would like to learn more about this model and how it was created and trained, see this Google Doc, which goes into much more detail.
Overall, institution parsing is done in three steps:
1. String parsing using the deep learning model developed by OpenAlex
2. String matching, run once per month, to fix common model prediction errors (adding or removing institutional affiliations based on the raw affiliation string)
3. A matching process developed by ROR (see the code here)
Steps 2 and 3 were added to fill the gaps observed in the deep learning model, which has not been updated since April 2023. Any institution added to OpenAlex or ROR after that date cannot be predicted by the model, so additional methods are needed. The string matching code can be found in the OpenAlex Databricks repo, while the ROR matcher has been integrated into our main code base.
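The three steps can be sketched roughly as follows. This is a simplified illustration, not our production code: every function and parameter name is hypothetical, the model and the ROR matcher are passed in as plain callables, and the way the sketch merges the three outputs (apply the string rules to the model's predictions, then add whatever the ROR matcher finds) is an assumption.

```python
from typing import Callable, Dict, Set

# Hypothetical type: takes a raw affiliation string, returns institution IDs.
Matcher = Callable[[str], Set[str]]


def parse_affiliation(
    raw: str,
    model_predict: Matcher,        # step 1: deep learning model (stand-in)
    add_rules: Dict[str, str],     # step 2: lowercase substring -> ID to add
    remove_rules: Dict[str, str],  # step 2: lowercase substring -> ID to remove
    ror_match: Matcher,            # step 3: ROR matching process (stand-in)
) -> Set[str]:
    # Step 1: start from the model's predictions.
    ids = set(model_predict(raw))

    # Step 2: correct common prediction errors based on the raw string.
    lowered = raw.lower()
    for substring, inst_id in add_rules.items():
        if substring in lowered:
            ids.add(inst_id)
    for substring, inst_id in remove_rules.items():
        if substring in lowered:
            ids.discard(inst_id)

    # Step 3: add matches for institutions the model may not know about.
    ids |= ror_match(raw)
    return ids
```

Keeping steps 2 and 3 outside the model means newly added institutions can be picked up without retraining, which is the reason given above for adding them.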
Code/Training Data/Benchmarks
If you are interested in setting up the institution parsing model on your own, going through the code, looking at the training data, or viewing the benchmark data, the institution-parsing GitHub repo is the best place to find more details about our parsing system. From that page, you can do the following:
- Set up the institution parsing model on your own computer (requires semi-advanced knowledge of Python and coding)
- View the code used to develop, train, test, and deploy the model
- Get the model artifacts
- Download the training data
- View the benchmark data used to test the model
Works-Magnet Tool
In order to give users and institutions the ability to change affiliations in OpenAlex, the works-magnet tool was created by our friends at the Ministère de l'enseignement supérieur et de la recherche (MESR). Please see this article for more information.