Hadoop for Health Insurance: Driving Innovation in Data and Analytics


The future of the healthcare industry rests in the promise of collecting, analyzing and taking action on the output of larger amounts of information. Through the advancements in big data, machine learning, and advanced analytics, healthcare organizations can leverage and manipulate data to improve overall member health, reduce costs, improve quality, and manage clinical and financial risk. In order to improve their ability to store and analyze their data, healthcare organizations are required to implement a robust data management strategy. In today’s world, these strategies must take into consideration large volumes of data coming in at increasingly high velocities with a large variety in the type of data being stored. This is known as a big data problem, which is what all healthcare organizations are facing today.

The big data problem being faced provides a different challenge for each organization. There is no quantifiable measure of ‘what is big’ when it comes to big data. For this reason, the Hadoop Ecosystem serves as a universal framework that allows an organization to manage their big data with a company-specific data management strategy. This is evidenced by the number of components and technologies that make up the ecosystem and that can be integrated to address specific business needs.


The Hadoop ecosystem provides enterprises with an infrastructure/environment utilizing commodity hardware. The Hadoop ecosystems provides three benefits: optimal storage of large of volume datasets (HDFS), a framework for rapid and robust data processing (MapReduce), and a multitude of native and integrated technologies developed by companies such as Facebook and Yahoo. These technologies provide avenues for data ingestion, algorithms for machine learning, data querying (SQL on Hadoop), and engines for executing data flows. Along with the commodity infrastructure and tools for ingesting large volumes of disparate data quickly at low costs, the Hadoop ecosystem also presents a space for enterprises to get into the realm of advanced analytics.

Healthcare organizations are already using Hadoop infrastructures to perform advanced analytics use cases, including machine learning, natural language processing, and enterprise search. In the future, Hadoop’s next-generation data infrastructure will enable healthcare organizations to derive insights from disparate Internet of Things data sources, feed this data into more powerful natural language processing algorithms and machine learning libraries (engaging in cognitive computing and deep learning), and provide the foundation for a blockchain infrastructure that will securely distribute data and analytic outputs to all stakeholders.

Market Leaders are Performing Advanced Analytics on Hadoop

For early-adopting enterprises, the value behind Hadoop’s next- generation architecture rests in the fact that it stages and enables big data to be used for advanced analytic use cases. As Data Science matures, more and more enterprises are turning towards machine learning techniques, commonly known as predictive and prescriptive analytics, to automate and improve analytical decision making. Machine Learning analytics allow organizations to correlate large and valuable data sets with future outcomes, use current and historical data to predict which outcomes will occur next, and suggest the optimal action that should be undertaken based upon the predicted outcomes and the costs and constraints of each action.

Data Scientists glean these game-changing insights from the data by building models containing advanced machine learning algorithms on top of large-scale datasets. Healthcare organizations are able to build Data Science models within a Hadoop ecosystem on their stores of Clinical, Demographic and Financial data to perform predictive and prescriptive analytic use cases across the Care Delivery Value Chain, such as:


Knowledgent and Hortonworks have partnered on numerous programs to enable healthcare organizations to drive advanced analytic insights from their large stores of data. These projects have included the instantiation of a Hortonworks Hadoop Data Lake along with other components of the big data ecosystem. The success of these projects is largely due to the effectiveness of the Hadoop ecosystem in enabling advanced analytics. Two Data Science projects, listed below, were performed by Knowledgent at top-3 health insurance companies on a Hortonworks Hadoop cluster.

  • Increasing Risk-Adjusted Revenue: One of the nation’s largest payers sought to improve their Risk-Adjusted (RAPS) Revenue by understanding which members have an unreported disease condition that would increase their Risk Score. Using machine learning analytics, Knowledgent narrowed down the client’s substantial pool of Medicare Advantage members into the small subset with the highest ROI for RAPS chart chasing pursuits, uncovering a potential 8-figure revenue lift.
  • Member Segmentation: One of the most powerful and well-known use cases of machine learning technology is the ability to segment members into cohorts, or groups sharing similar characteristics. Knowledgent has worked with a number of clients to segment their members into cohorts, and then to identify the optimal channel and messaging for intervention for each member segment. By targeting the intervention, resources spent on outreach for marketing, customer experience, and care delivery are optimized. These programs have succeeded in increasing Member engagement rates by over 99%.

Enterprise search is another realm of advanced analytics that has been innovatively developed in recent years due to its ability to leverage Hadoop’s next-generation data architecture. Healthcare companies are now able to join internal and external structured and unstructured data sources in a Hadoop Data Lake environment, allowing for searches to span across all available enterprise data files. Text, photos, and videos are all being indexed in enterprise search solutions. Advanced analytics in enterprise search focuses on semantic and statistical analytics to ingeniously broaden text analytics, all powered through natural language processing and machine learning libraries. Additionally, enterprise search has been addressing the business need of external collaboration by empowering enterprises to leverage external datasets and merge them with internal data sources for more holistic data availability. In order to benefit from the advancements in enterprise search, a scalable and robust environment such as Hadoop is required in order to meet data storage, processing, and resiliency needs.

The current employment of Hadoop next-generation architecture has solved many of the industry’s current advanced analytics business needs, empowering enterprises to derive value from their data. In addition to addressing today’s analytics needs, the current state of the architecture also enables the future state of advanced analytics.

The Future of Healthcare IT will be built on Hadoop

The future state of Healthcare IT will involve deriving value from streaming data feeds from the Internet of Things, the expansion of machine learning to foster deep learning through neural networks and cognitive computing, and secure, transactional blockchain technologies that provide the necessary balance between security and distribution of personal health data. Each of these future state innovations will exponentially increase the volume of data being analyzed, the speed in which the data is coming in, and the processing power that is required to compute the data.


Increased consumer adoption of Internet of Things (IoT) technologies will facilitate the production of immense amounts of data with the potential to inform clinical decision making. A large benefit of IoT for healthcare companies is the collection of patient data from wearable and fitness devices. This will be beneficial for many healthcare organizations for monitoring patients inside their facilities and post-discharge. For example, within the hospital setting, wearable devices could produce critical data for patients in emergency rooms, allowing for hospitals to continuously stream and integrate patient data from various devices. Because IoT technologies are continuously collecting and analyzing large volumes of data, a great deal of storage space and processing power will be required to glean value from the data being collected in real-time. A Hadoop Data Lake environment will be necessary in this future. Moreover, just as the IoT will require an environment that resembles the storage capacity and processing power of a Data Lake, so will cognitive computing.

Cognitive computing is based on the concept of machine learning, allowing for computers to participate in what is being termed deep learning. In the future state, cognitive computing will leverage machine learning libraries and algorithms for predictive analytics and pattern recognition. But it will also go much deeper than this. The goal of cognitive computing is to re-create neural networks – in algorithms – that can simulate human thought processes, mimicking the way the human brain works. In order for an enterprise to engage in cognitive computing and create the neural pathways, massive volumes of data will have to be mined and stored in an environment that can handle its size and velocity. Additionally, in order to imitate how the brain works, powerful processing will be required for large-scale pattern recognition and natural language processing. Multiple machines will also have to be arranged as “nodes”, resembling the various lobes of the brain. These nodes are each weighted with assigned tasks, such as storing input data and distributing output data. Deep learning and the Hadoop ecosystem complement each other. Cognitive computing can leverage the multiple nodes of a Hadoop cluster as the neural nodes required for deep learning. All of these requirements are satisfied by the cluster-computing paradigm on which Hadoop resides, making the interaction between cognitive computing and Hadoop clusters symbiotic. While the IoT movement and cognitive computing will address future state business needs for achieving advanced analytics, blockchain infrastructure will provide the security necessary to distribute sensitive transactional data to all stakeholders through a secure channel.

Blockchain infrastructure is a decentralized, distributed ledger technology that focuses on storing large volumes of transactional data and information, in real time, in a shared database. The blockchain infrastructure provides an environment for data to be distributed through secure peer networks, with every new entry being validated and agreed upon by the entire network. Every transaction in the database is essentially immutable. Nevertheless, because of the high volume of transactional data coming in real time at high velocities, blockchain infrastructure requires a Hadoop environment to accommodate all storage, processing, and distribution needs. Distribution of this information can be programmatically triggered based off of certain conditions. For example, a blockchain can be triggered to distribute the medical records of a patient recently deceased to a legal executor. Sensitive clinical trials data and information can also be triggered to be distributed to all stakeholders during various phases of the trial. In the future state, blockchains will ensure better protection against cyber-attacks, unauthorized distribution of sensitive patient information, and data loss and data fraud. In addition, blockchains will also foster a greater sense of internal and external collaborative consumption, allowing for multiple approved parties to get conditional access to distributed ledgers.


Today’s applications of Hadoop’s next-generation infrastructure largely revolve around storage of and advanced analytics on large volumes of data. Data Science is providing business-critical insights into the strategy and operations of large organizations as they navigate the rapidly-changing healthcare landscape. Tomorrow’s applications of Hadoop, however, will take data size, velocity and variety to an entirely new level that few organizations are ready for today.

Streaming data measuring member health, behavior, and daily interactions. Algorithms inspired by the human brain’s neural networks that recognize patterns among millions of data points, trained by historical outcomes and learning as new outcomes occur. A decentralized, secure transactional infrastructure in which business rules execute contracts, eliminating middlemen and allowing for agreed-upon distribution of medical information.

The Internet of Things, cognitive computing, and blockchain infrastructures all have their place in the future of healthcare. These innovations will steal the headlines as they rise, reinventing business processes and business models, fueling the rapid growth of forward-thinking companies while deeply impacting the competitive positions of all players in the healthcare industry. Lost in the fury of the race to innovate, improve, and disrupt, a footnote in the future real-time, analytic decision-making, is the data infrastructure that is critical for this to occur. Today Hadoop and its complementary technologies enable Data Science and advanced analytics to be performed, but tomorrow they will be the bedrock of the future of healthcare, the data repository underlying systems that will change the industry as we know it today.

Download White Paper

Never miss an update:

Subscribe to our newsletter!

Newsletter Sign Up Form

  • This field is for validation purposes and should be left unchanged.

New York, NY • Warren, NJ • Boston, MA • Toronto, Canada
©2018 Knowledgent Group Inc. All rights reserved.