Healthy Data: improving data integrity for health information

Research student: Obinwa Ozonze
Supervisors: Adrian Hopgood, Philip Scott
Project started: February 2019

Project summary

This project concerns the use of artificial intelligence (AI) to correct errors in individual medical records. There has been a great deal of interest in the application of data analytics (“big data”) to such records. With millions of healthcare records potentially available, there is clear interest in spotting patterns between symptoms, diagnoses, and treatments. However, there is plenty of evidence that a large proportion of individual medical records contain errors, rendering the data analytics prone to “garbage in, garbage out”.

NHS Digital have investigated whether data quality failures could be detected in national data returns, using the diagnosis of dementia as an example. The data show that, of the 317,000 patients diagnosed with dementia between April 2010 and March 2015, only 51% had a recorded diagnosis of dementia when admitted to hospital during the year 2014/15. Clearly the dementia had not gone away, so the records were flawed. A separate study of emergency admissions data by Dr Tom Hughes of John Radcliffe Hospital found that 40% of patients had no recorded diagnosis at all and that, of the diagnoses that were recorded, nearly half were meaningless, vague or merely a symptom.
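The kind of consistency check described above can be sketched in a few lines. This is a minimal illustration only, not the method used by NHS Digital: the `Admission` record, the `first_diagnosed` lookup, and the list of ICD-10 code prefixes treated as dementia are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Assumed for illustration: ICD-10 code prefixes treated as "dementia".
# This is not the definition used in the NHS Digital investigation.
DEMENTIA_PREFIXES = ("F00", "F01", "F02", "F03", "G30")


@dataclass
class Admission:
    """Hypothetical, simplified hospital admission record."""
    patient_id: str
    admitted: date
    codes: tuple[str, ...]  # diagnosis codes recorded for this admission


def has_dementia_code(codes: tuple[str, ...]) -> bool:
    """True if any recorded code starts with one of the assumed prefixes."""
    return any(code.startswith(DEMENTIA_PREFIXES) for code in codes)


def missing_dementia(
    first_diagnosed: dict[str, date], admissions: list[Admission]
) -> list[Admission]:
    """Return admissions that follow a dementia diagnosis but do not record it."""
    flagged = []
    for adm in admissions:
        diagnosed_on = first_diagnosed.get(adm.patient_id)
        if diagnosed_on is None or adm.admitted < diagnosed_on:
            continue  # no known diagnosis, or admitted before it was made
        if not has_dementia_code(adm.codes):
            flagged.append(adm)
    return flagged
```

Each admission returned by `missing_dementia` is a record in which a previously diagnosed dementia appears to have vanished, which marks it as a candidate for repair rather than as proof of an error in that particular field.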

In this context, big data approaches are important but insufficient for extracting useful information. The data need to be checked for inconsistencies and repaired. This is a challenging problem for an AI system. While some data records may be impossible, such as the ‘cured’ dementia, many others will lie on a spectrum from unusual to implausible. Despite the scale of the challenge, there is recent work to draw upon in correcting both free-text data (primarily for spelling) and structured data records. The data records for this project will come from research-community sources such as the MIMIC-III database, managed by MIT.

The proposed approach is multi-stage (and possibly multi-agent): repair the free text, repair the structured data, and finally cross-check the two forms of data against each other. Data-driven and knowledge-driven approaches will be explored. In a data-driven approach, any record that shows a unique pattern in a dataset of millions would be considered suspect. In a knowledge-driven approach, fuzzy rules might propose likely combinations of symptoms and diagnoses, based on medical knowledge, and recognise any improbable combinations in the data records.
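The two approaches can be contrasted with a small sketch. It is illustrative only and does not represent the project's implementation: the `Record` shape, the function names, the threshold, and in particular the membership values in `FUZZY_RULES` are invented placeholders, not medical knowledge.

```python
from collections import Counter

# Hypothetical simplification: a record is (set of symptoms, diagnosis).
Record = tuple[frozenset[str], str]


def suspect_by_rarity(records: list[Record], max_count: int = 1) -> list[Record]:
    """Data-driven: flag patterns seen at most max_count times in the dataset."""
    counts = Counter(records)
    return [rec for rec, n in counts.items() if n <= max_count]


# Placeholder degrees in [0, 1] for how strongly a symptom supports a diagnosis.
# These numbers are made up for the example and carry no clinical meaning.
FUZZY_RULES: dict[str, dict[str, float]] = {
    "migraine": {"headache": 0.9, "nausea": 0.6},
    "influenza": {"fever": 0.9, "cough": 0.7, "headache": 0.4},
}


def plausibility(record: Record) -> float:
    """Fuzzy OR (maximum) over the support each recorded symptom gives the diagnosis."""
    symptoms, diagnosis = record
    degrees = FUZZY_RULES.get(diagnosis, {})
    return max((degrees.get(s, 0.0) for s in symptoms), default=0.0)


def suspect_by_rules(records: list[Record], threshold: float = 0.3) -> list[Record]:
    """Knowledge-driven: flag combinations whose plausibility is below threshold.

    Diagnoses with no rule are skipped, since no knowledge is available for them.
    """
    distinct = dict.fromkeys(records)  # de-duplicate while keeping first-seen order
    return [
        rec for rec in distinct
        if rec[1] in FUZZY_RULES and plausibility(rec) < threshold
    ]
```

The data-driven check needs no domain knowledge but can only say that a pattern is rare, whereas the rule-based check can say why a combination is improbable but covers only the diagnoses for which rules exist. A record flagged by both would be a stronger candidate for repair.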

Owner:
  Adrian Hopgood
  March 2019