Data: 9 May 2026

Methodology

Technical documentation of the entity resolution pipeline used to match RMSANZ members to their academic identifiers and publications.

Pipeline Overview

The RMSANZ Entity Resolution Pipeline (v8.0) identifies rehabilitation medicine physicians and links them to their academic publication records across multiple open-access databases. The pipeline runs in four sequential rounds: Rounds 1 to 3 each add verification and discovery capabilities, and Round 4 corrects metrics that Round 3 inflated.

Input: 939 RMSANZ members with name, state, country, membership status, and fellowship status.
Sources: OpenAlex API, ORCID Public API, Crossref API.
Output: Matched academic IDs, publication lists, citation metrics, H-index, FWCI, collaboration networks, and funding records.

Round 1: Initial Entity Resolution

The initial matching pass uses an 8-signal scoring algorithm to identify the most likely OpenAlex author profile for each member.

Signals Used:

  1. Name match: Exact, fuzzy, and initial-based matching (Levenshtein distance, Jaro-Winkler)
  2. Institution affiliation: Australian/NZ institution presence in OpenAlex author record
  3. Country filter: Author must have at least one AU/NZ affiliated work
  4. Topic relevance: Rehabilitation medicine keywords in publication topics
  5. ORCID cross-reference: If member has known ORCID, verify it matches
  6. Publication count plausibility: Clinical researchers typically have 1-200 papers
  7. Career timeline: Publication years should be plausible given career stage
  8. Disambiguation signals: For common names, require multiple corroborating signals
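
The name-match signal can be illustrated with a minimal sketch. It assumes plain-string names and covers exact, initial-based, and Levenshtein-based fuzzy matching only; Jaro-Winkler is omitted for brevity. The function names, the returned labels, and the `max_edits` threshold are hypothetical and not taken from the pipeline.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and full stops, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().replace(".", " ").split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_match(member: str, candidate: str, max_edits: int = 2) -> str:
    """Return 'exact', 'initial', 'fuzzy', or 'none' (hypothetical labels)."""
    m, c = normalize(member), normalize(candidate)
    m_parts, c_parts = m.split(), c.split()
    if not m_parts or not c_parts:
        return "none"
    if m == c:
        return "exact"
    # Initial-based: same surname, same first letter, one side is a bare initial.
    same_surname = m_parts[-1] == c_parts[-1]
    same_initial = m_parts[0][0] == c_parts[0][0]
    bare_initial = len(m_parts[0]) == 1 or len(c_parts[0]) == 1
    if same_surname and same_initial and bare_initial:
        return "initial"
    if levenshtein(m, c) <= max_edits:
        return "fuzzy"
    return "none"
```

Under these assumptions, `name_match("Jane Smith", "J. Smith")` is classed as an initial-based match, while a one-letter spelling difference in the surname falls through to the fuzzy branch.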

Confidence Tiers:

  • HIGH: 3+ corroborating signals, no conflicting evidence
  • REVIEW: Partial match, some ambiguity, or common name
  • NOT_FOUND: No matching author profile identified
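
One possible reading of the tier definitions above, written as a small sketch. The `Evidence` structure and its field names are hypothetical, and treating a common name as always requiring review is an assumption; the pipeline's actual rule may differ.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Evidence:
    """Signals gathered for one candidate author profile (hypothetical shape)."""
    corroborating: Set[str] = field(default_factory=set)
    conflicting: Set[str] = field(default_factory=set)
    common_name: bool = False


def assign_tier(evidence: Optional[Evidence]) -> str:
    """Map collected evidence to HIGH, REVIEW, or NOT_FOUND."""
    if evidence is None or not evidence.corroborating:
        return "NOT_FOUND"
    enough_signals = len(evidence.corroborating) >= 3
    # Assumption: any conflict or a common name sends the match to review.
    if enough_signals and not evidence.conflicting and not evidence.common_name:
        return "HIGH"
    return "REVIEW"
```

For example, a candidate supported by name, institution, and topic signals with no conflicts would be HIGH, while the same candidate with a common name would drop to REVIEW.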

Round 2: Multi-Source Verification

Cross-validates Round 1 matches against ORCID and Crossref APIs. Identifies and removes incorrect matches.

Verification Steps:

  1. Query OpenAlex author record: verify name, institution, works count
  2. Query ORCID API: check if ORCID record exists and matches name/affiliation
  3. Query Crossref: search for publications by name, verify DOI overlap
  4. Geographic validation: confirm at least one AU/NZ affiliation in works
  5. Topic validation: check if publication topics align with rehabilitation medicine
  6. Flag mismatches: if record shows different country or unrelated field, mark as REMOVED

Result: 78 incorrect matches identified and removed. 30 new matches discovered for previously NOT_FOUND members.
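
Steps 3 to 6 can be sketched as follows, assuming the DOIs and country codes have already been fetched from the APIs. The function names, the overlap definition (shared DOIs divided by the size of the smaller set), and the `VERIFIED` label are hypothetical; only `REMOVED` comes from the description above.

```python
from typing import Iterable

AU_NZ = {"AU", "NZ"}


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def doi_overlap(openalex_dois: Iterable[str], crossref_dois: Iterable[str]) -> float:
    """Share of the smaller DOI set that also appears in the other set."""
    a = {normalize_doi(d) for d in openalex_dois if d}
    b = {normalize_doi(d) for d in crossref_dois if d}
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def verify_match(country_codes: Iterable[str], has_rehab_topics: bool) -> str:
    """Mark a match REMOVED on a country or field mismatch."""
    if not (AU_NZ & set(country_codes)):
        return "REMOVED"   # no AU/NZ affiliation in any work
    if not has_rehab_topics:
        return "REMOVED"   # publication topics unrelated to the field
    return "VERIFIED"      # hypothetical label for a surviving match
```

A low `doi_overlap` value on its own does not remove a match in this sketch; it is left as a score that a reviewer or a later rule could act on.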

Round 3: Works-Level Search & Profile Merging

Addresses the "author fragmentation" problem where OpenAlex creates multiple separate profiles for the same person.

The Problem:

OpenAlex often splits a single researcher into multiple author entities when they publish infrequently, change institutions, lack ORCID linkage, or have no coauthor overlap between papers.

The Solution: Works-Level Search

  1. Search OpenAlex works endpoint by raw_author_name
  2. For each work found, extract the associated author_id
  3. Apply geographic filter: only retain works with AU/NZ institutional affiliations
  4. Apply name matching: verify author name on the work matches the member name
  5. Cluster all discovered author IDs into a unified profile
  6. Merge publication lists and recompute metrics (H-index, citations, FWCI)

Result: 354 members gained additional merged author IDs. 86 previously NOT_FOUND members now have matches. Publication counts corrected across 360 members.
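
Steps 2 to 5 can be sketched on work records that have already been downloaded. The dictionary layout below (an `authorships` list carrying `raw_author_name`, `author.id`, and `institutions[].country_code`) is assumed to mirror OpenAlex work objects; `collect_author_ids` and the `name_matches` callback are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

AU_NZ = {"AU", "NZ"}


def collect_author_ids(works: List[dict], member_name: str,
                       name_matches: Callable[[str, str], bool]) -> Dict[str, Set[str]]:
    """Map each candidate author ID to the IDs of the works that support it."""
    supporting: Dict[str, Set[str]] = defaultdict(set)
    for work in works:
        for authorship in work.get("authorships", []):
            raw_name = authorship.get("raw_author_name") or ""
            if not name_matches(member_name, raw_name):
                continue  # name on the work does not match the member
            countries = {inst.get("country_code")
                         for inst in authorship.get("institutions", [])}
            if not (countries & AU_NZ):
                continue  # geographic filter: keep AU/NZ affiliations only
            author_id = (authorship.get("author") or {}).get("id")
            if author_id:
                supporting[author_id].add(work["id"])
    return dict(supporting)


def merged_work_ids(supporting: Dict[str, Set[str]]) -> Set[str]:
    """Union of works across all clustered author IDs."""
    merged: Set[str] = set()
    for work_ids in supporting.values():
        merged |= work_ids
    return merged
```

Note that a name match plus an AU/NZ affiliation does not establish that two profiles belong to the same person, so any cluster produced this way still needs disambiguation before its metrics are trusted.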

Round 4: Authoritative Metrics Fix

Addresses a critical bug in Round 3 where name-matched profiles were incorrectly merged without disambiguation, inflating publication counts for members with common names.

Root Cause:

Round 3's works-level search found ALL OpenAlex author profiles matching a member's name and merged their publications without verifying they were the same person. For common names (e.g., "Chris Chan" matched 64 profiles, "Michael Smith" matched 29), this produced impossibly high publication counts (5,000+).

Fix Applied:

  1. Fetched authoritative metrics directly from each member's verified primary OpenAlex ID
  2. Replaced inflated h-index, citation count, and publication count with primary-ID-only values
  3. Trimmed each publication list so that it does not exceed the publication count reported by the primary ID
  4. Added "Suspicious Merge" flag for non-HIGH-confidence members with >5 merged IDs
  5. HIGH-confidence members (verified in R1/R2) are exempt from the "Suspicious Merge" flag because their primary ID was already validated

Result: 94 profiles flagged as suspicious merges requiring manual review. Metrics now reflect primary OpenAlex ID only. Previously inflated H-indices (e.g., 209 → 15 for "Chris Chan") corrected.
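
The fix can be sketched as a single function over plain dictionaries. All field names (`tier`, `merged_ids`, `publications`, `works_count`, and so on) are hypothetical, and capping the list by simple truncation is an assumption about how step 3 was carried out.

```python
MAX_MERGED_IDS = 5  # threshold from step 4 above


def apply_primary_metrics(member: dict, primary: dict) -> dict:
    """Return a copy of `member` with metrics taken from the primary ID only."""
    fixed = dict(member)
    # Steps 1-2: overwrite merged metrics with primary-ID values.
    fixed["h_index"] = primary["h_index"]
    fixed["citations"] = primary["cited_by_count"]
    fixed["publication_count"] = primary["works_count"]
    # Step 3: cap the publication list at the primary ID's count.
    fixed["publications"] = member["publications"][: primary["works_count"]]
    # Steps 4-5: flag large merges unless the member is HIGH confidence.
    fixed["suspicious_merge"] = (
        member["tier"] != "HIGH" and len(member["merged_ids"]) > MAX_MERGED_IDS
    )
    return fixed
```

With this rule, a REVIEW-tier member with six merged IDs is flagged, while a HIGH-tier member with the same six IDs is not.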

Computed Metrics

Metric          Definition                                                    Source
H-index         Largest h such that h papers have at least h citations each   Computed from OpenAlex citation data
FWCI            Field-Weighted Citation Impact: actual / expected citations   Computed using OpenAlex topic classification
Rehab-relevant  Keyword-based rehabilitation relevance flag                   Keyword classifier (indicative only)
Collaboration   Co-authorship edges between RMSANZ members                    OpenAlex author co-occurrence
Funding         Research grants linked to member names                        NHMRC, ARC public databases
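
The two citation metrics can be illustrated with a short sketch. `h_index` follows the standard definition; `fwci` is written here as the mean of per-paper actual/expected ratios, which is an assumption about how the ratio is aggregated, and the expected values are taken as given inputs rather than derived from topic classification.

```python
from typing import List, Optional


def h_index(citations: List[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def fwci(actual: List[int], expected: List[float]) -> Optional[float]:
    """Mean of actual/expected citation ratios; None when nothing is comparable."""
    ratios = [a / e for a, e in zip(actual, expected) if e > 0]
    return sum(ratios) / len(ratios) if ratios else None
```

For a member whose papers have 10, 8, 5, 4, and 3 citations, `h_index` returns 4: four papers have at least four citations, but there are not five papers with at least five.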

Known Limitations

  • Common names (partially fixed): Round 3 incorrectly merged multiple people with the same name. Round 4 corrects this by using primary-ID-only metrics, but 94 profiles remain flagged for manual review where the primary ID itself may be incorrect.
  • Name variants: Members publishing under significantly different name variants may not be fully captured.
  • Google Scholar: Indexes some publications not in OpenAlex (book chapters, conference papers). No public API exists.
  • PubMed: E-utilities API was inaccessible during pipeline execution. Future iterations should include PubMed.
  • Funding completeness: Only publicly available NHMRC and ARC data included. Hospital, industry, and international funding not captured.
  • Rehab-relevance: Keyword-based only. May include false positives and miss non-standard terminology.
  • Temporal lag: OpenAlex data may lag 1-4 weeks behind actual publication.