July 2, 2018

New Data Mining Method Offers Easier Access to Epic’s Massive Data Trove

Streamlining the process so it's as painless as possible


To do studies without wasting precious time and money, researchers need to efficiently explore data housed within their institution’s set of electronic health records (EHRs). However, digging into patient data to extract the precise information needed often proves a major headache.



Now a team of Cleveland Clinic scientists is helping their fellow researchers by devising a better way to extract and utilize health data from the Epic EHR. They are using statistical methods that fall under the umbrella of natural language processing to create a wide array of research-ready data tables.

“Researchers want to mine the data to figure out what works in what sort of patient,” says Michael W. Kattan, PhD, MBA, coauthor along with system analyst Alex Milinovich of a recent paper on this subject. “The question becomes how to make that data mining as painless as possible because data collection takes time, and it’s time that is not reimbursed,” he says. “Everyone groans about this.”

The paper, “Extracting and utilizing electronic health data from Epic for research,” appeared in the Annals of Translational Medicine.

Epic … not so research friendly

Cleveland Clinic was an early Epic adopter about 20 years ago and now possesses more than 35 billion individual data points for more than 4 million patients. In fact, Cleveland Clinic is the second-largest installation of Epic; Kaiser Permanente is the largest.

Epic works well for taking care of patients, Dr. Kattan says, but it was not developed with research in mind. In fact, Epic makes life difficult for researchers. “It has a gazillion tables where data is housed,” he notes. “It stores information all over the place.” Even worse, a researcher seeking a specific sort of data in Epic will often find that the needed answers are buried in prose notes dictated by clinicians.


“At Cleveland Clinic, less than 5 percent of the EHR data are codified variables [of the sort needed for research],” the team wrote. Ninety-five percent are identifiers, dates and free-text entries.

“I can’t work with a paragraph of text the doctor types,” Dr. Kattan says. “I am looking for a test result, for a number in that paragraph. I want to know if the patient had asthma, yes or no.”
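To make the problem concrete, here is a minimal sketch of pulling a test result and a yes/no condition flag out of a free-text note. It is a naive, rule-based illustration only, not the statistical approach the team describes; the sample note, the function names and the short list of negation cues are all assumptions invented for this example.

```python
import re

# Hypothetical note text, invented for illustration.
NOTE = "Patient reports wheezing. History of asthma. HbA1c 7.2 % on last visit."

# Assumed, deliberately tiny list of negation cues.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)


def extract_value(note, test_name):
    """Return the first number that follows test_name within 20 characters, or None."""
    pattern = re.compile(
        rf"\b{re.escape(test_name)}\b\D{{0,20}}?(\d+(?:\.\d+)?)",
        re.IGNORECASE,
    )
    match = pattern.search(note)
    return float(match.group(1)) if match else None


def mentions_condition(note, condition):
    """Return True if the first sentence naming the condition has no negation cue."""
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        if re.search(rf"\b{re.escape(condition)}\b", sentence, re.IGNORECASE):
            return NEGATION.search(sentence) is None
    return False


hba1c = extract_value(NOTE, "HbA1c")
has_asthma = mentions_condition(NOTE, "asthma")
```

Real clinical notes are far messier than this sample (abbreviations, misspellings, family history, hedged language), which is part of why simple rules like these tend to fall short.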

Extracting gold from the EHR hills

To mine the raw Epic EHR and then use it to build robust datasets for statistical analysis, the team uses a number of statistical techniques to clean, parse and map the data. The cleaned, standardized data is then ready to be deposited into a registry of Cleveland Clinic clinical research data.

The statistical techniques used include calculations of similarity and relationships between terms. “For example, the term of ‘Heart Failure’ (C0018801) has relationships to various medications that may treat heart failure, finding sites of heart and myocardium as well as child diagnoses such as congestive heart failure and left-sided heart failure,” the team wrote. They stated that these relationships make querying the EHR easier. Researchers can, for example, identify top-level terms and then identify any pediatric or related terms that suit their population of interest.
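A rough sketch of the two ideas, term similarity and parent-child relationships, might look like the following. It is not the team's implementation: only the Heart Failure identifier C0018801 comes from the paper, while the child identifiers, the lookup tables and the function names are hypothetical placeholders, and Python's standard difflib stands in for whatever similarity measure the team actually uses.

```python
from collections import deque
from difflib import get_close_matches

# Hypothetical parent -> children map. Only C0018801 appears in the article;
# "X-CHF" and "X-LHF" are made-up placeholder identifiers.
CHILDREN = {
    "C0018801": ["X-CHF", "X-LHF"],
    "X-CHF": [],
    "X-LHF": [],
}

# Hypothetical label -> identifier lookup.
LABELS = {
    "heart failure": "C0018801",
    "congestive heart failure": "X-CHF",
    "left-sided heart failure": "X-LHF",
}


def expand_term(root, children=CHILDREN):
    """Return the root term plus all of its descendants, breadth-first."""
    seen = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def map_free_text(text, labels=LABELS, cutoff=0.6):
    """Map a possibly misspelled term to the closest known label's identifier."""
    matches = get_close_matches(text.lower(), list(labels), n=1, cutoff=cutoff)
    return labels[matches[0]] if matches else None


all_heart_failure_terms = expand_term("C0018801")
mapped = map_free_text("Congestiv heart failure")
```

A query for the top-level term can then be widened to every identifier in `all_heart_failure_terms`, which is the kind of shortcut the relationships are meant to provide.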

“Approximately 185 tables from different data sources are condensed into 18 research-ready tables in the data repository,” the team wrote. These tables are updated automatically, on a weekly basis.


With this approach, the authors state, “Cleveland Clinic can do live population exploration as well as produce datasets for analysis faster than it takes most organizations to simply identify their base population.”

Simplifying fulfillment of the mission

Doing research fulfills one of the three pillars of the Cleveland Clinic’s mission, which includes providing “better care of the sick, investigation into their problems, and further education of those who serve.”

“If research is part of your mission, you have to do it somehow,” says Dr. Kattan, and that involves tapping the immense research resources held within the Epic EHR. “We are constantly developing processes to clean the data up, to define things, and to make rules,” he says. “Epic never sleeps.”
