DH Blog

Deduplication of Clinical Data Without Data Loss: Critical to Meeting User Needs


Don Burt
Solutions Consultant

In today’s increasingly connected healthcare system, clinicians often receive data from multiple sources into their EHR. In a 2017 survey conducted by HL7 International, clinicians were asked, “Do you prefer to manage Transfer of Care content by receiving more information and having better presentation and incorporation capability in your EHR?” Contrary to what one might expect, 61% of respondents wanted to receive more information if they had better display and incorporation capability[i]. In other words, respondents didn’t want data to be omitted; in fact, they rated “deduplication detection” as the most important area to address.

Deduplication clearly matters to clinicians, but it is a complex challenge, complicated by factors inherent in the capture, transmission, and translation of clinical data:

  • Data captured from even a single source often mixes codes from multiple coding standards that mean the same thing, or uses codes incorrectly altogether; these problems are compounded in a multi-sourced clinical data ecosystem
  • Difficulty in translating coded values or maintaining complicated mapping tables
  • Reliance on exact matching of textual values and codes, on a per-facility basis

Put differently, deduplication is thwarted by clinical data concepts that mean the same thing but are syntactically different, and thus are not exact duplicates. For example, the ICD-10 and SNOMED coding systems use different codes for “heart failure.”
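The gap between exact matching and concept-level matching can be sketched in a few lines. This is a minimal illustration, not a description of any product's logic: the crosswalk table, the concept key, and the function names are assumptions made for the example, and the two codes shown (ICD-10-CM I50.9 and SNOMED CT 84114007) should be verified against current terminology releases before any real use.

```python
# Toy crosswalk: (code system, code) -> shared concept key.
# Illustrative only; a real crosswalk would come from a maintained
# terminology source, not a hand-written dict.
CONCEPT_MAP = {
    ("ICD-10-CM", "I50.9"): "heart-failure",
    ("SNOMED CT", "84114007"): "heart-failure",
}

def exact_duplicate(a, b):
    """Exact matching: system and code must be identical."""
    return a == b

def same_concept(a, b):
    """Concept matching: both codes must map to the same known concept."""
    concept_a = CONCEPT_MAP.get(a)
    concept_b = CONCEPT_MAP.get(b)
    return concept_a is not None and concept_a == concept_b

icd = ("ICD-10-CM", "I50.9")
snomed = ("SNOMED CT", "84114007")

print(exact_duplicate(icd, snomed))  # False
print(same_concept(icd, snomed))     # True
```

Unmapped codes deliberately never match each other here, so a gap in the crosswalk leaves two entries side by side rather than silently merging them.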

Effective deduplication is the final result of a three-step process:

  • Normalization and enrichment – Improving and standardizing data expected to be recorded at the source (e.g. associating disparate codes with a standard SNOMED code), and adding data that is helpful for analytics and usability but not expected to come from the source (e.g. adding a drug class, decomposing a SIG, or adding a reference range); this step requires correcting vocabulary and syntax mistakes, translating codes and units to a common information model, managing terminology for standard national vocabularies, and structuring and codifying relevant data from clinical notes
  • Organization – Organizing the available data logically by topic while maintaining the provenance of the source data; for example, medications that appear in multiple places in the clinical record (medications, medications administered, admission medications, discharge medications) should be logically grouped while leaving the source data intact
  • Deduplication – Eliminating duplicate entries of the same information (e.g. not listing Atorvastatin multiple times when it appears on multiple medication lists across many encounters), using either the normalized or the original form
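The three steps can be sketched end to end on a handful of medication entries. Everything named here is hypothetical and chosen only for illustration: the site-specific codes, the `CROSSWALK` and `DRUG_CLASS` tables, the section-to-topic mapping, and the entry fields. A real implementation would draw on maintained terminologies rather than hand-written tables.

```python
from collections import defaultdict

# Hypothetical crosswalk from source-specific codes to one normalized key.
CROSSWALK = {
    ("SITE_A", "MED-001"): "atorvastatin",
    ("SITE_B", "ATORVA"): "atorvastatin",
    ("SITE_A", "MED-002"): "metformin",
}
# Hypothetical enrichment table: data not expected to come from the source.
DRUG_CLASS = {"atorvastatin": "statin", "metformin": "biguanide"}
# Record sections that all belong to the same logical topic.
TOPIC_BY_SECTION = {
    "medications": "medications",
    "medications administered": "medications",
    "admission medications": "medications",
    "discharge medications": "medications",
}

def normalize(entry):
    """Step 1: attach a normalized key and an enrichment; keep original fields."""
    key = CROSSWALK.get((entry["system"], entry["code"]))
    return {**entry, "normalized": key, "drug_class": DRUG_CLASS.get(key)}

def organize(entries):
    """Step 2: group entries by topic without altering them."""
    topics = defaultdict(list)
    for e in entries:
        topics[TOPIC_BY_SECTION.get(e["section"], "other")].append(e)
    return topics

def deduplicate(entries):
    """Step 3: one item per concept, with every source entry retained."""
    merged = {}
    for e in entries:
        # Fall back to the original form when no normalized key exists.
        key = e["normalized"] or (e["system"], e["code"])
        item = merged.setdefault(
            key, {"concept": key, "drug_class": e["drug_class"], "sources": []}
        )
        item["sources"].append(e)
    return list(merged.values())

raw = [
    {"system": "SITE_A", "code": "MED-001", "section": "admission medications", "encounter": "E1"},
    {"system": "SITE_B", "code": "ATORVA", "section": "discharge medications", "encounter": "E2"},
    {"system": "SITE_A", "code": "MED-002", "section": "medications", "encounter": "E1"},
]
topics = organize(normalize(e) for e in raw)
meds = deduplicate(topics["medications"])
for m in meds:
    print(m["concept"], m["drug_class"], len(m["sources"]))
# atorvastatin statin 2
# metformin biguanide 1
```

Atorvastatin is listed once, yet both source entries (with their sections and encounters) remain attached to it, which is the sense in which deduplication proceeds without data loss.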

Today, there are hundreds of EHR vendors and millions of clinicians exchanging hundreds of millions of clinical documents annually, resulting in highly variable clinical data content. Relying on manual methods to manage this process is impractical and does not scale. More effective results have been demonstrated using technology that incorporates sophisticated inference processing and clinical natural language processing (NLP). Internal parsing logic can accept expected codes and translate multiple code types to national code systems. Targeted NLP can decompose clinical text and locate national codes, while cross-referencing machine-readable text to ensure accuracy. An effective solution must automate the process of achieving semantic interoperability in order to apply complex deduplication logic across all clinical content areas, and then provide configurable options for outbound delivery of the enriched clinical document.

Finally, comprehensive deduplication must also reconcile other properties that may differ by source. The source, onset date, and status must be handled according to the preferences and requirements of disparate facilities and vendors. To meet the needs identified in the HL7 survey discussed above, outbound data should be presented to users according to their ingestion and display capabilities while ensuring no loss of data provenance.
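As a rough sketch of what such reconciliation could look like, the example below merges two duplicate problem entries whose onset date and status disagree. The policy is an assumption made purely for illustration (status taken from the highest-priority source, earliest known onset date shown), as are the field names and the `source_priority` parameter; actual preferences would be configured per facility or vendor.

```python
from datetime import date

def reconcile(duplicates, source_priority):
    """Merge duplicate entries into one display record.

    Assumed policy (illustrative only): status comes from the
    highest-priority source, onset is the earliest known date, and
    every original entry is kept under "provenance".
    """
    rank = {source: i for i, source in enumerate(source_priority)}
    # Sources missing from the priority list sort last.
    preferred = min(duplicates, key=lambda d: rank.get(d["source"], len(rank)))
    onsets = [d["onset"] for d in duplicates if d["onset"] is not None]
    return {
        "concept": preferred["concept"],
        "status": preferred["status"],
        "onset": min(onsets) if onsets else None,
        "provenance": list(duplicates),
    }

dups = [
    {"concept": "heart failure", "source": "Hospital A",
     "onset": date(2016, 3, 1), "status": "active"},
    {"concept": "heart failure", "source": "Clinic B",
     "onset": date(2015, 11, 12), "status": "resolved"},
]
result = reconcile(dups, source_priority=["Hospital A", "Clinic B"])
print(result["status"], result["onset"], len(result["provenance"]))
# active 2015-11-12 2
```

Reversing `source_priority` changes the displayed status to "resolved" without discarding either original entry, so the display preference stays configurable while provenance is untouched.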

[i] HL7 International, HL7 CDA® R2 Implementation Guide: Clinical Summary Relevant and Pertinent Data, Release 1

