Data Lake Opportunities in Healthcare

As the Internet of Things (IoT) continues to grab headlines across all industries, healthcare has seen an explosion of data over the last few years as more entities become technology-enabled through Electronic Medical Records (EMRs). With enormous amounts of healthcare data now available in the majority of healthcare organizations, such as providers, payers, pharmaceutical companies, and third-party vendors, there are unique opportunities to leverage these sources of information for analytical insight. This data can be used to improve quality of care and prevent resource waste. The complexity of these sometimes new and emerging data sources presents challenges in achieving strategic analytics and in applying the results in a clinical environment.

Healthcare Data Sources

In today’s healthcare environment, there are two main sources of data: claims data and clinical data.

Claims data has been the primary data source for the healthcare industry for decades. Because claims data is based on uniform and broadly available information, it is useful for finding overarching care patterns. Claims data helps answer questions about the types of people who receive care, their demographics, the care settings in which they receive the care, the categories of care they receive, and the general type of care that is being delivered.

This type of data provides great insight for population health discovery, as well as for research studies. Also, since claims data is originally meant for reimbursement, the data is an excellent source for chronicling the cost of care. Claims data helps organizations find retrospective patterns in care, and is useful for seeing the spectrum of care received by a particular patient.

While it provides care summaries and is easy to analyze, claims data also has significant drawbacks. Foremost, the data is abstracted and summarized from the original medical records, so it highlights the condition for which the provider is being paid. Not all conditions need to be listed on the claim and, consequently, they generally aren’t. Claims often contain only a very general diagnosis, and they have significant lag: they are not compiled until days after discharge and are most often made available several weeks after that.

Unlike other industries, such as banking and retail, which have been using sophisticated data solutions to drive their business needs for decades, the healthcare industry has lagged behind in using technology to electronically document day-to-day operations related to patient care. It was not uncommon in the early 21st century to walk into a hospital or clinic and observe physicians writing notes on thick paper charts. As a matter of fact, healthcare is the last major industry to be computerized in this country. The Budget Control Act of 2011 was a major milestone in incentivizing care providers to use EMRs. The effective date for Medicare reimbursement reform was April 1, 2013, and non-compliant providers faced a 2% reduction under Medicare programs. These regulatory changes finally pushed the healthcare industry into the modern age.

Critical information such as orders, vital signs, and medication administrations no longer exists only on paper. Delivery systems now have accessible electronic data that can round out the patient picture and help provider organizations improve quality of care, control costs, and increase patient satisfaction. Such clinical data is found in EMRs. EMRs make a rich store of data available for analysis and are found throughout the continuum of care, including the emergency department, hospital inpatient records, outpatient/ambulatory, physical therapy, and radiology areas. Unlike claims data, they are timely, longitudinal, and reflect how medicine is actually practiced. Building this complete view of the patient is commonly known as a Patient 360 approach.

Both types of data are critical for facilitating healthcare data analytics. Combining all the benefits of EMRs with claims information will open a completely new world to gain greater insights and better understand the whole dynamic of modern patient care.

Healthcare Data Analytics

The vast majority of data used in the healthcare industry is still in some form of relational structure, or ultimately will be pulled out into a relational form. The sweet spot between the worlds of traditional relational data warehouses (structured data) and Big Data (large-scale and unstructured data) is the data lake. The data lake, with its flat architecture and its capability to hold a vast amount of raw data in its native format until it is needed, provides flexibility by not defining the data until it is queried. This late modeling/binding approach lets people pull in structured datasets along with unstructured content such as clinical notes and free text.
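
As a rough illustration of late binding, the sketch below keeps records as raw JSON strings and applies a schema only when a question is asked. The record layouts, field names, sample values, and the bind helper are all hypothetical and chosen only for this example; a real lake would sit on distributed storage rather than a Python list.

```python
import json

# Raw records stored exactly as received; the layouts are hypothetical.
RAW_RECORDS = [
    '{"source": "claims", "patient_id": "P1", "dx_code": "E11.9", "paid": "1250.00"}',
    '{"source": "emr", "patient_id": "P1", "note": "Patient reports improved glucose control."}',
    '{"source": "claims", "patient_id": "P2", "dx_code": "I10", "paid": "310.50"}',
]

# A schema here is just field name -> converter, chosen by the analyst at query time.
CLAIMS_COST_SCHEMA = {"patient_id": str, "paid": float}


def bind(raw_records, source, schema):
    """Parse raw records at read time; keep rows from `source` that have every schema field."""
    rows = []
    for raw in raw_records:
        record = json.loads(raw)
        if record.get("source") != source:
            continue
        if not all(field in record for field in schema):
            continue
        rows.append({field: convert(record[field]) for field, convert in schema.items()})
    return rows


claims_costs = bind(RAW_RECORDS, "claims", CLAIMS_COST_SCHEMA)
print(claims_costs)
```

The same raw records could later be bound with a different schema, for example one that reads the free-text note field, without reloading or remodeling anything.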

On the other hand, judging by the analytic use cases and the analytic maturity of the industry right now, many requests from business and operations users are still for dashboards, KPIs, and regulatory reports, so automating the entire process as much as possible is very high on business users’ wish lists. However, some of the leading institutions have started to analyze both structured and unstructured data to tackle their challenges. For example, Mayo Clinic’s collaboration with UnitedHealth Group led to the creation of Optum Labs, which has since been joined by other leading hospitals, insurance companies, and global pharmaceutical manufacturers. Mount Sinai School of Medicine started the long-promised journey of using patients’ genomic profiles to guide physicians toward the optimal therapeutic approaches in real time.

Technically, Natural Language Processing (NLP) in combination with the Systematized Nomenclature of Medicine (SNOMED) has been applied to gain insight from unstructured content such as physician and nursing notes and radiology and pathology results. Call center notes are processed, along with a patient’s history, to predict the likelihood of emergency room visits and readmissions. Other data, such as socioeconomic factors, is also being explored to try to improve the quality of care.
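
To give a feel for what mapping free text to a controlled vocabulary involves, here is a deliberately simple dictionary lookup. It is only a sketch: the term list and the concept labels are hypothetical placeholders rather than real SNOMED identifiers, and production clinical NLP also has to handle negation, abbreviations, and context, which this does not.

```python
import re

# Hypothetical term -> concept mapping. The labels are placeholders,
# NOT real SNOMED CT identifiers.
TERM_TO_CONCEPT = {
    "diabetes": "CONCEPT_DIABETES",
    "heart failure": "CONCEPT_HEART_FAILURE",
    "copd": "CONCEPT_COPD",
}


def tag_concepts(note):
    """Return the sorted placeholder concepts whose terms appear as whole words in the note."""
    text = note.lower()
    found = set()
    for term, concept in TERM_TO_CONCEPT.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(concept)
    return sorted(found)


note = "Pt with COPD and diabetes, admitted for shortness of breath."
concepts = tag_concepts(note)
print(concepts)
```

Note that a phrase such as "no history of diabetes" would still be tagged here, which is one reason real pipelines go well beyond keyword matching.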

The Data Lake

When pulling a data source into the late-binding data lake, the data is in a form that looks and feels much like the original source system. Each data element in the lake is assigned a unique identifier and tagged with a set of extended metadata tags. Obviously a few minor modifications to the data occur during the data modeling, such as flattening (which will de-normalize it slightly), but for the most part, that data looks like the data that was contained in the source system, which is a characteristic of a Hadoop data lake. The key benefit is to retain the binding and the fidelity of the data as it appeared in the source systems.
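
The idea of assigning each element a unique identifier and a set of metadata tags can be sketched as follows. This is a minimal in-memory stand-in, not a Hadoop implementation; the ingest function, the tag names, and the sample payloads are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone


def ingest(lake, payload, source_system, tags):
    """Store a payload unchanged, wrapped with a unique ID and metadata tags."""
    element_id = str(uuid.uuid4())
    lake[element_id] = {
        "payload": payload,  # kept in its source form, not remodeled
        "metadata": {
            "source_system": source_system,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "tags": set(tags),
        },
    }
    return element_id


lake = {}  # hypothetical lake: element ID -> stored element
vitals_id = ingest(lake, {"hr": 72, "bp": "120/80"}, "emr", {"vitals", "inpatient"})
claim_id = ingest(lake, "CLM|P1|E11.9|1250.00", "claims", {"claim", "diabetes"})
print(len(lake), "elements stored")
```

Because the payload is stored untouched, a structured EMR reading and a delimited claims line can sit side by side, distinguished only by their metadata.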

In contrast to that approach, many vendors in healthcare first remap the data from the source systems into an enterprise data model. But mapping data from the source systems into a new relational data model inherently involves compromises in the way the data is modeled, represented, named, and related. As a result, a lot of fidelity is lost, users lose familiarity with the data, and the process is time-consuming. It’s not unusual for that early-binding, monolithic data model approach to take 18 to 24 months to deploy a basic data warehouse.

When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question. The data lake approach can speed up the process by deploying content quickly from multiple sources and exposing it to users for data exploration; the data can then be ‘checked out’ into an analytic sandbox for further modeling and analysis using the users’ familiar tools. There are several different places where users can bind data to vocabulary or relationships as it flows from the source systems out to the analytic visualization layer.
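
The query-then-check-out flow described above might look something like this in miniature. The LAKE contents, the tag-based selection, and the check_out helper are hypothetical; the point is only that the sandbox receives a copy of the relevant subset, so exploratory changes never touch the lake itself.

```python
import copy

# A tiny in-memory stand-in for a lake: element ID -> payload plus tags (all hypothetical).
LAKE = {
    "e1": {"payload": {"patient_id": "P1", "event": "discharge"}, "tags": {"emr", "inpatient"}},
    "e2": {"payload": {"patient_id": "P1", "event": "claim"}, "tags": {"claims"}},
    "e3": {"payload": {"patient_id": "P2", "event": "admission"}, "tags": {"emr", "inpatient"}},
}


def check_out(lake, required_tags):
    """Copy every element carrying all required tags into a separate sandbox dict."""
    required = set(required_tags)
    return {
        element_id: copy.deepcopy(element)
        for element_id, element in lake.items()
        if required <= element["tags"]
    }


sandbox = check_out(LAKE, {"emr", "inpatient"})
sandbox["e1"]["payload"]["event"] = "discharge (reviewed)"  # the edit stays in the sandbox
print(sorted(sandbox), LAKE["e1"]["payload"]["event"])
```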

Impact of EMRs

The Budget Control Act of 2011 dramatically changed the EMR landscape as vendors large and small started to push products aimed at easing the transition. The fierce competition, especially in the large-hospital EMR market, resulted in significant consolidation of the once-crowded field of vendors, with the top 10 vendors capturing about 90% of the market in 2013.

In the meantime, the persistent issues preventing EMR interoperability among vendors only make the data lake a more viable approach. For example, Epic Systems, the leading EMR vendor for both hospital and ambulatory settings, faces interoperability issues even among its own customers because of extensive customizations based on clients’ specific needs. As a result, most Health Information Exchanges (HIEs) settle for a limited set of data elements among their clients, or simply set up templates and ask client hospitals to transform the necessary data before sending it to the exchange. Even with recent government intervention on interoperability, the dream of “interchanging healthcare information anywhere/everywhere” is still elusive.

Given the current landscape of the EMR market, any organization will be hard-pressed to model multiple vendors’ data into a traditional data warehouse. This is especially true for firms such as healthcare insurance companies, which increasingly need not only claims data but also clinical data from providers to compete in today’s value-based healthcare model. This makes the Hadoop-based data lake an even more viable approach.

The Healthcare Transition and the Need for More Data

With the healthcare industry transforming from a volume-based (fee-for-service, or FFS) model to a value-based model, which drives Population Health Management (PHM) and other initiatives, it is not surprising that demand from all parties for analytics based on comprehensive data sources is skyrocketing.

A growing number of organizations are joining the march to value-based care to provide higher-quality care at a lower cost. For example, since the passage of the Affordable Care Act, more than 360 Medicare Accountable Care Organizations (ACOs) have been established, serving about 5.3 million Americans with Medicare in 2013. Motivated by this shared-savings model to ensure that patients receive appropriate, cost-efficient care, organizations are using PHM strategies to optimize health system performance.


Through PHM, organizations are gradually transitioning from acute, episodic care to a more coordinated, long-term approach. By investing more resources in preventive care, they’re helping patients stay healthier while controlling costs. However, this transition must be well planned. Provider groups consistently cite the same four steps of critical focus for success in PHM:

  1. Optimizing Network Management: In value-based contracts, there are networks that physicians (especially primary care physicians or PCPs) use for referring patients to specialists for coordinated care. It is in a physician’s best interest to refer to specialists who provide high-quality, cost-effective care. By analyzing their patients’ claims and clinical data, physicians can identify specialists who provide the best care at the best value.
  2. Managing Care Transitions: In the new healthcare model, providers are incentivized to ensure that patients who go home or move to a skilled nursing facility get the support they need, both inside and outside of the facility, to reduce readmissions. Given that more than 90% of healthcare costs are hospitalizations, analytics can strengthen care management programs by helping predict which patients need additional support. Applied to claims, clinical, and abstracted data, this technology can spot care patterns that may be contributing to readmissions and identify clinical gaps, as well as provide outcomes data that shows the results of a transitions program.
  3. Investing In In-Home Intervention: High-acuity patients who suffer from chronic illnesses such as diabetes, Congestive Heart Failure (CHF), Chronic Obstructive Pulmonary Disease (COPD), and kidney disease are at high risk for admissions and readmissions. It’s clear that such high-acuity patients need more than discharge instructions and hoping for the best. In a value-based environment, they should be closely monitored post-discharge and targeted for intervention to keep them on the road to recovery. With limited resources available, finding the right patients to target with in-home interventions begins with applying analytics to data, often through a risk-scoring methodology. Organizations need to find the individuals who are driving a disproportionate share of the cost of care. This often necessitates applying highly accurate predictive models to the appropriate data.
  4. Expanding Chronic Disease Management: Chronic conditions are the leading cause of death and disability, and the healthcare system spends between 80% and 85% of its resources on treatment of chronic diseases such as heart disease, stroke, cancer, and diabetes. Analytics are particularly important when implementing chronic disease management programs within value-based care settings, helping providers account for and monitor their entire patient population. Running every patient through analytics allows managers to identify their highest-cost patients and processes, find unnecessary care pattern variations, and identify gaps in care.
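
Finding the individuals who drive a disproportionate share of cost, as described in the steps above, can be approximated with a simple ranking. The sketch below is not a predictive model or a validated risk score; it only ranks patients by past cost, and the top_cost_drivers function and all figures are made up for illustration.

```python
def top_cost_drivers(cost_by_patient, share=0.5):
    """Return the smallest set of highest-cost patients covering `share` of total cost."""
    total = sum(cost_by_patient.values())
    ranked = sorted(cost_by_patient.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for patient_id, cost in ranked:
        if running >= share * total:
            break  # the target share is already covered
        selected.append(patient_id)
        running += cost
    return selected


# Made-up annual costs per patient, for illustration only.
costs = {"P1": 82000.0, "P2": 1500.0, "P3": 900.0, "P4": 40000.0, "P5": 2600.0}
targets = top_cost_drivers(costs, share=0.8)
print(targets)
```

In this toy data, two of five patients account for more than 80% of spending, which is the kind of concentration that makes targeted in-home intervention worth considering.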

PHM must be built on a foundation of data, and each of the four steps of PHM relies on data — timely, relevant, and comprehensive data that can help organizations make better decisions.


Healthcare is a vast land of heterogeneous data, much of it unstructured and “unclean.” Initiatives like CMS’s Virtual Research Data Center (VRDC), a federally funded Big Data project that will allow researchers to access massive amounts of health data from their own computers, take aim at making this data useful. This is where data lakes come in. The massive environment will house data in its rawest form, giving analysts the option to format and standardize it when needed to make it machine-readable and easy to use. Of course, large stores of data like these carry risks, especially in the healthcare industry. Hypothetically, they could mean access to patient-level information over a patient’s entire life. Access would have to be closely monitored to maintain safety and security and to comply with HIPAA.

Keeping data in its native format provides multiple benefits. It helps maintain data provenance and fidelity, allowing for analyses to be performed using different contexts. It also makes different data analysis projects possible — in the case of healthcare, that means functions like predicting the likelihood of readmissions, giving a facility the chance to take advanced measures to reduce that number.
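
Before readmission likelihood can be predicted, historical stays have to be labeled as readmissions or not. The sketch below shows one way that labeling might be done on raw admission records; the 30-day window, the record layout, and the flag_readmissions helper are assumptions for illustration, and the dates are invented.

```python
from datetime import date


def flag_readmissions(stays, window_days=30):
    """Mark each stay that began within `window_days` of the same patient's prior discharge."""
    flagged = []
    last_discharge = {}
    for stay in sorted(stays, key=lambda s: (s["patient_id"], s["admit"])):
        previous = last_discharge.get(stay["patient_id"])
        is_readmit = previous is not None and (stay["admit"] - previous).days <= window_days
        flagged.append({**stay, "readmission": is_readmit})
        last_discharge[stay["patient_id"]] = stay["discharge"]
    return flagged


# Invented stays; results come back ordered by patient, then admission date.
stays = [
    {"patient_id": "P1", "admit": date(2018, 1, 2), "discharge": date(2018, 1, 6)},
    {"patient_id": "P2", "admit": date(2018, 2, 1), "discharge": date(2018, 2, 3)},
    {"patient_id": "P1", "admit": date(2018, 1, 20), "discharge": date(2018, 1, 23)},
    {"patient_id": "P1", "admit": date(2018, 4, 15), "discharge": date(2018, 4, 18)},
]
labeled = flag_readmissions(stays)
print([(s["patient_id"], s["readmission"]) for s in labeled])
```

Labels like these could then serve as the outcome variable for whatever predictive model an organization chooses to build.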

We’re just scratching the surface of the application of data lakes within healthcare here. There is more to consider in the intricacies of implementation and use, including the basic Big Data architecture for scalable data lake infrastructure, the basic function of data lakes, how they solve enterprise-level issues of data accessibility and integration, how data flows through a lake, and how a lake matures with increasing usage across the enterprise. Fundamentally, however, the healthcare industry has a lot to gain by using the data lake architecture to derive cost- and life-saving insights from the massive volumes of data becoming available. The future of healthcare looks bright, and data is the shining star.

New York, NY • Warren, NJ • Boston, MA • Toronto, Canada
©2018 Knowledgent Group Inc. All rights reserved.