The increasing use of technology in health has enabled the collection and storage of data at an exponential rate. Data is at the heart of “Health Analytics”, where it can be used to understand the current system, project the impact of changes, or evaluate an intervention. Healthcare data are scattered and fragmented. They come from various sources (governmental or non-governmental health organizations, commercial or non-commercial, nationwide or statewide); they are recorded at various levels of temporal aggregation (event-level, monthly or yearly) and spatial aggregation (census tract, zip code, county, state or national); and they vary in accuracy. Some data can be called “Big Data”, which is characterized by large Volume (from Terabytes to Exabytes) and Complexity (heterogeneity, depth, dimensionality, dependencies).
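To make the idea of aggregation levels concrete, the following minimal sketch rolls event-level records up to monthly, county-level counts. Everything in it is an assumption made for illustration: the records, the ZIP codes, the county names, the crosswalk and the `aggregate` function are hypothetical and do not come from any real data source.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event-level records: (event date, ZIP code, number of visits).
events = [
    (date(2020, 1, 5), "00001", 1),
    (date(2020, 1, 20), "00002", 1),
    (date(2020, 2, 3), "00001", 1),
    (date(2020, 2, 14), "00003", 1),
]

# Hypothetical ZIP-to-county crosswalk. Real ZIP codes can span several
# counties, so an actual crosswalk is usually many-to-many.
zip_to_county = {"00001": "County A", "00002": "County A", "00003": "County B"}


def aggregate(events, zip_to_county):
    """Roll event-level records up to (year-month, county) counts."""
    totals = defaultdict(int)
    for day, zip_code, n in events:
        key = (day.strftime("%Y-%m"), zip_to_county[zip_code])
        totals[key] += n
    return dict(totals)


monthly_county = aggregate(events, zip_to_county)
```

Aggregation in this direction is straightforward, but it cannot be undone: once only monthly county counts are stored, the event-level detail is no longer recoverable.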

In our work, we use a variety of data sources (often combined), along with robust methodological tools from the mathematical sciences, to move along a continuum from Data to Information to Knowledge to Decisions. Data can be useful in understanding the 'Who, Where, What, or When' of the system.

Medical Claims Data

Claims data consist of person-level data on eligibility, service utilization and payments. These data can be used to understand careflow, utilization and cost at the system, organization, provider and patient levels. A large, central repository of claims data is maintained by the Centers for Medicare & Medicaid Services (CMS); it includes claims for all Medicare- and Medicaid-insured patients across multiple years. These data are developed to support research and policy analysis initiatives for Medicaid and other low-income populations, such as analyzing provider payments, conducting quality or access-to-care studies, and conducting statistical analysis for public reporting.
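As a rough illustration of how person-level claims can be summarized at different levels, the sketch below computes utilization (number of claims) and cost (total paid) per patient and per provider. The claim records, the field names (`patient`, `provider`, `paid`) and the `summarize` function are hypothetical; they do not reflect the layout of any actual CMS file.

```python
from collections import defaultdict

# Hypothetical claim lines: patient ID, provider ID, paid amount in dollars.
claims = [
    {"patient": "P1", "provider": "D1", "paid": 120.0},
    {"patient": "P1", "provider": "D2", "paid": 80.0},
    {"patient": "P2", "provider": "D1", "paid": 200.0},
]


def summarize(claims, level):
    """Return {id: (claim count, total paid)} at the chosen level."""
    summary = defaultdict(lambda: [0, 0.0])
    for claim in claims:
        entry = summary[claim[level]]
        entry[0] += 1              # utilization: one more claim
        entry[1] += claim["paid"]  # cost: add the paid amount
    return {key: (count, total) for key, (count, total) in summary.items()}


by_patient = summarize(claims, "patient")
by_provider = summarize(claims, "provider")
```

The same records answer different questions depending on the grouping key, which is why a single claims file can support analysis at the patient, provider, organization and system levels.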

Electronic Health Records

EHRs are integrated medical, clinical, administrative and patient-detailed records that could be accessed readily at a wellness visit, as well as at an ER visit, without breaching privacy and confidentiality. Currently, only a few countries have adopted centralized EHR systems. In 2004, President George W. Bush established a national goal of universal adoption of electronic health records and health information exchanges by 2014. To date, however, the EHR system remains fragmented and varies widely in levels of information, interoperability and accessibility from one health organization to another. One example that could serve as a benchmark for integrated EHRs in the U.S. is the computer system connecting pharmacies with health plans: nearly all pharmacies connect electronically to health plans when they enter a patient’s prescription into their computer system.

Disease Registries

Organizations like the Cystic Fibrosis Foundation have established disease registries, which contain patient-level data across many years of service at accredited Cystic Fibrosis clinics. Disease registries are useful in understanding specific diseases and patient outcomes over time.

National & State Databases

Many exist, but the most commonly researched databases are:

Medical Surveys

  • The Framingham Heart Study was started in 1948 under the direction of the National Heart Institute. Its initial purpose was to identify common factors that contributed to the onset and progression of cardiovascular disease. Later on, the data from the study came to be used for many other studies and analyses.
  • The Wisconsin Diabetes Registry Study was funded by the National Institute of Diabetes and Digestive and Kidney Diseases (part of the National Institutes of Health) starting in 1987 to understand the complications and co-morbidities associated with diabetes.

Other Data Sources

  • Census Bureau data and the American Community Survey are invaluable for Health Analytics.
  • The National Center for Biotechnology Information (NCBI) provides many genome databases to the genomics community.
  • Medical technologies (e.g., EEG, CT scan, MRI) are widespread sources of patient monitoring and diagnostic data.
  • Patient-generated data (e.g., self-reported, tracked through mobile devices, virtual communities) are becoming more common.
  • ICD-9 Codes are diagnosis codes that provide information on the primary and secondary conditions associated with a healthcare visit.
  • Diagnosis Related Groups (DRGs) characterize patients by their expected utilization of resources, and they are used by many payors to reimburse providers.
  • Many others that we do not describe here.
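
The distinction between primary and secondary diagnosis codes noted in the list above can be sketched as follows. The visit records and the field names (`primary`, `secondary`) are hypothetical, and the code strings are used purely as illustrative ICD-9-style labels, not as data from any real source.

```python
from collections import Counter

# Hypothetical visit records: one primary diagnosis code per visit and
# zero or more secondary codes. The code strings are illustrative only.
visits = [
    {"primary": "250.00", "secondary": ["401.9"]},
    {"primary": "401.9", "secondary": []},
    {"primary": "250.00", "secondary": ["401.9", "428.0"]},
]

# How often each code is the main reason for a visit.
primary_counts = Counter(v["primary"] for v in visits)

# How often each code appears in any position (primary or secondary).
any_position = Counter(
    code for v in visits for code in [v["primary"], *v["secondary"]]
)
```

Counting a condition only when it is the primary diagnosis can understate how common it is among patients, so the two tallies are often worth reporting side by side.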
Data is the new resource, and unlike money or oil sharing doesn't deplete it.
Kairos Future