The global healthcare industry is under significant pressure to reduce costs and manage resources more efficiently while improving patient care. Rising costs, chronic illness, an aging population and a shortage of professionals are forcing massive changes in the healthcare industry. To gain insight into how they can improve service while reducing costs, healthcare payers and providers are turning to data and analytics. Big data in healthcare is today a hot issue. Therefore, this communication briefly presents definitions, sources and characteristics of big data, and the architectural framework of big data analytics required in healthcare applications. Applications of big data analytics in healthcare reported in the literature are highlighted, particularly concerning clinical data, pharmaceutical data, patient behavior and sentiment data, and viral and global infectious disease surveillance. Lastly, the challenges are identified, followed by future directions and advantages of big data analytics in healthcare.
Nowadays hospitals capture large amounts of data daily about their customers or patients, suppliers, and operations. Health insurance organizations also hold large volumes of claims data, which qualify as big data: large pools of data that can be captured, communicated, aggregated, stored, and analyzed. Big health data can be analyzed to detect signals that are useful for patients and healthcare service management, although the data has quality problems. It is increasingly the case that healthcare innovation and growth can take place through predictions drawn from big data. The main objective of this chapter is to introduce healthcare analysts and practitioners to advancements in the computing field that help them effectively handle and make inferences from voluminous and heterogeneous healthcare data.
A big-data revolution is under way in health care. Start with the vastly increased supply of information. Big data is varied; it is growing; it is moving fast; and it is very much in need of smart management. Data, cloud and engagement are energizing organizations across multiple industries and present an enormous opportunity to make organizations more agile, more efficient and more competitive. To capture that opportunity, organizations require a modern information management architecture. Although big data has demonstrated many useful ways of thinking and application cases through innovative methods, many issues remain around its collection and analysis. It is a reality that health-care costs are rising; in big data's rise, clinical trends also play a role. Physicians have traditionally used their judgment when making treatment decisions, but in the last few years there has been a move toward evidence-based medicine, which involves systematically reviewing clinical data and making treatment decisions based on the best available information. Aggregating individual data sets into big-data algorithms often provides the most robust evidence, since nuances in subpopulations may be so rare that they are not readily apparent in small samples.
Healthcare organizations are challenged by pressures to improve clinical quality of care and patient safety, lower costs, reduce medical errors, and provide more patient-centred service as well as evidence-based practice. Healthcare costs can easily spin out of control, and misallocation of resources can quickly bring down quality of care; the evidence for this is sharply increasing. It is therefore a further challenge to utilize data efficiently and visualize data effectively. Fiscal concerns, perhaps more than any other factor, are driving the demand for big-data applications. After more than 20 years of steady increases, health-care expenses now represent 17.6 percent of GDP, nearly $600 billion more than the expected benchmark for a nation of the United States' size and wealth.
Although the health care industry has lagged behind sectors such as retail and banking in the use of big data, partly because of concerns about patient confidentiality, it could soon catch up. First movers in the data sphere are already achieving positive results, which is prompting other stakeholders to act, lest they be left behind. These developments are encouraging, but they also raise an important question: is the health-care industry prepared to capture big data's full potential, or are there roadblocks that will hamper its use? McKinsey director Nicolaus Henke explains how analytics is transforming the practice of medicine [3-6].
This article provides an overview of the scope of big data analytics in healthcare as it emerges as a discipline. First, we define and discuss the various definitions, sources and characteristics of big data, and the advantages of big data analytics in healthcare. Second, we describe the architectural framework of big data analytics in healthcare. Third, the big data analytics application development methodology is described. Fourth, we provide examples of big data analytics in healthcare reported in the literature. Lastly, the challenges are identified, followed by conclusions and future directions.
Big data holds tremendous promise for infectious diseases research, surveillance, and prevention, but that begs the question: what exactly is big data? The "four Vs" (volume, velocity, variety, and veracity) are often used to determine whether a dataset is big data or not. Volume refers to the amount of data, whether a terabyte, petabyte, or exabyte, while velocity reflects the speed at which live data comes in for analysis. The two less commonly noted attributes of big data are variety, the different data formats that are not necessarily easy to combine or the different types of data coming in for simultaneous analysis, and veracity, a reflection of the imperfection, incompleteness, or unreliability of the data.
Big data can come from internal sources (e.g., electronic health records, clinical decision support systems, CPOE) and external sources (e.g., government sources, laboratories, pharmacies, insurance companies and HMOs), often in multiple formats (flat files, .csv, relational tables, ASCII/text, etc.) and residing at multiple geographic locations as well as in different healthcare providers' sites in numerous legacy and other applications (transaction processing applications, databases, etc.).
Sources and data types include:
a. Web and social media data: clickstream and interaction data from Facebook, Twitter, LinkedIn, blogs, and the like. It can also include health plan websites, smartphone apps, etc.
b. Machine-to-machine data: readings from remote sensors, meters, and other vital-sign devices.
c. Big transaction data: health care claims and other billing records, increasingly available in semi-structured and unstructured formats.
d. Biometric data: fingerprints, genetics, handwriting, retinal scans, x-ray and other medical images, blood pressure, pulse and pulse-oximetry readings, and other similar types of data.
e. Human-generated data: unstructured and semi-structured data such as EMRs, physicians' notes, email, and paper documents.
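The heterogeneous source types listed above typically need to be normalized into a common record shape before analysis. A minimal sketch, using invented field names and toy data (not any real standard):

```python
import csv, io, json

# Toy inputs: a billing CSV (big transaction data) and a device JSON
# reading (machine-to-machine data). Both are invented for illustration.
CLAIMS_CSV = "patient_id,code,amount\nP001,99213,120.50\n"
DEVICE_JSON = '{"patient_id": "P001", "pulse": 72, "spo2": 98}'

def from_claims(text):
    """Structured transaction data: one record per billing row."""
    return [{"patient_id": r["patient_id"],
             "source": "claims",
             "payload": {"code": r["code"], "amount": float(r["amount"])}}
            for r in csv.DictReader(io.StringIO(text))]

def from_device(text):
    """Machine-to-machine data: a vital-sign reading."""
    d = json.loads(text)
    return [{"patient_id": d.pop("patient_id"),
             "source": "device",
             "payload": d}]

# Both sources now share one record shape and can be analyzed together.
records = from_claims(CLAIMS_CSV) + from_device(DEVICE_JSON)
print(len(records))                     # 2
print(records[1]["payload"]["pulse"])   # 72
```

The same pattern extends to the other source types (notes, images, claims feeds), each with its own parser feeding the common record shape.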
The human genome is made up of DNA, which consists of four different chemical building blocks (called bases and abbreviated A, T, C, and G).
A. It contains 3 billion pairs of bases and the particular order of As, Ts, Cs, and Gs is extremely important.
B. The size of a single human genome is about 3 GB.
C. This knowledge comes largely from the Human Genome Project (1990-2003):
a. The goal was to determine the complete sequence of the 3 billion DNA subunits (bases).
b. The total cost was around $3 billion.
The whole genome sequencing data is currently being annotated and not many analytics have been applied so far since the data is relatively new.
I. Several genome repositories are publicly available, e.g., http://aws.amazon.com/1000genomes/.
II. It costs around $5,000 to obtain a complete genome. Whole-genome sequencing is still in the research phase and is heavily used in cancer biology.
III. We will focus on Genome-Wide Association Studies (GWAS).
i. GWAS are more relevant to healthcare practice, and some clinical trials have already started using them.
ii. Most of the computing literature (in terms of analytics) is available for GWAS; it is still at a rudimentary stage for whole-genome sequences.
Genome-wide association studies (GWAS) are used to identify common genetic factors that influence health and disease.
a. These studies normally compare the DNA of two groups of participants: people with the disease (cases) and similar people without it (controls), often examining on the order of one million loci.
b. Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence differs between individuals.
c. SNPs occur every 100 to 300 bases along the 3-billion-base human genome.
A report compiled by McKinsey & Company's Center for US Health System Reform identified four main sources of big data in the healthcare industry:
i. Activity (claims) and cost data: these are the basic figures showing the amount of care that has been supplied by providers in the system, and the cost of paying for that care.
ii. Clinical data: these include patient medical records and images gathered during examinations or procedures, as well as doctors' notes.
iii. Pharmaceutical R&D data: over the last few years many partnerships have sprung up between pharmaceutical companies, as if they had suddenly become aware of the huge benefits of pooling their knowledge.
iv. Patient behavior and sentiment data: this is data from over-the-counter drug sales combined with the latest "wearables" that monitor activity and heart rates, patient experience and customer satisfaction surveys, as well as the vast amount of unstructured information about our lifestyles broadcast every day over social media.
The data sources are also identified; the data is collected, described, and transformed in preparation for analytics. A very important step at this point is platform/tool evaluation and selection. Several options are available, as indicated previously, including AWS Hadoop, Cloudera, and IBM BigInsights. The next step is to apply the various big data analytics techniques to the data. This process differs from routine analytics only in that the techniques are scaled up to large data sets. Through a series of iterations and what-if analyses, insight is gained from the big data analytics, and from that insight informed decisions can be made.
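The describe/transform/analyze/what-if loop above can be sketched in miniature. All figures below are invented for illustration; a real deployment would run the same steps at scale on a big data platform:

```python
import statistics

# Invented per-admission costs (USD) standing in for a large data set.
costs_usd = [1200, 950, 4300, 800, 2100, 7600, 1100]

# 1. Describe: summarize the raw data.
baseline_mean = statistics.mean(costs_usd)

# 2. Transform: cap extreme outliers at an illustrative threshold.
capped = [min(c, 5000) for c in costs_usd]

# 3. Analyze: recompute the summary on the transformed data.
capped_mean = statistics.mean(capped)

# 4. What-if: suppose a care-pathway change cut every cost by 10%.
what_if_mean = statistics.mean(c * 0.9 for c in capped)

print(round(baseline_mean, 2), round(capped_mean, 2), round(what_if_mean, 1))
```

Iterating step 4 with different scenarios is what turns the analysis into the what-if exploration the methodology describes.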
Big data analytics is a promising direction that is still in its infancy in the healthcare domain:
a. Healthcare is a data-rich domain. As more and more data is being collected, there will be increasing demand for big data analytics.
b. Unravelling the "Big Data" related complexities can provide many insights about making the right decisions at the right time for the patients.
c. Efficiently utilizing the colossal healthcare data repositories can yield some immediate returns in terms of patient outcomes and lowering care costs.
d. Data with more complexities keep evolving in healthcare thus leading to more opportunities for big data analytics.
A typical big data architecture, often called a tech stack, comprises five components: (i) the data sources; (ii) the data system, which houses and processes the data; (iii) the consumption layer, which queries the data at the request of users; (iv) the application program interface (API), which communicates the data between the different components; and (v) the specific client applications, which create reports and visualizations.
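The five-layer stack can be sketched as a chain of components. All class and method names below are invented for the example; the point is only how data flows from source to client application:

```python
class DataSource:                      # (i) where the data originates
    def fetch(self):
        # Invented records: patients with length-of-stay in days.
        return [{"patient": "P1", "los_days": 3},
                {"patient": "P2", "los_days": 5}]

class DataSystem:                      # (ii) houses and processes the data
    def __init__(self, source):
        self.rows = source.fetch()

class ConsumptionLayer:                # (iii) queries the data on request
    def __init__(self, system):
        self.system = system
    def query(self, field):
        return [r[field] for r in self.system.rows]

class API:                             # (iv) moves data between components
    def __init__(self, layer):
        self.layer = layer
    def get(self, field):
        return self.layer.query(field)

class ReportApp:                       # (v) client app: reports/visuals
    def __init__(self, api):
        self.api = api
    def average(self, field):
        vals = self.api.get(field)
        return sum(vals) / len(vals)

app = ReportApp(API(ConsumptionLayer(DataSystem(DataSource()))))
print(app.average("los_days"))  # 4.0
```

Keeping the layers separate like this is what lets each component (storage engine, query layer, client tooling) be swapped independently.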
Numerous vendors, including AWS, Cloudera, Hortonworks, and MapR Technologies, distribute open-source Hadoop platforms. A big data analytics platform in healthcare must support the key functions necessary for processing the data. The criteria for platform evaluation may include availability, continuity, ease of use, scalability, ability to manipulate data at different levels of granularity, privacy and security enablement, and quality assurance. In addition, while most platforms currently available are open source, the typical advantages and limitations of open-source platforms apply. The most significant platform for big data analytics is the open-source distributed data processing platform Hadoop (an Apache project), initially developed for such routine functions as aggregating web search indexes. It belongs to the class of "NoSQL" technologies; others include CouchDB and MongoDB. Hadoop can process extremely large amounts of data mainly by allocating partitioned data sets to numerous servers (nodes), each of which solves different parts of the larger problem before the results are integrated. Hadoop can serve the twin roles of data organizer and analytics tool, and it offers a great deal of potential in enabling enterprises to harness data that has, until now, been difficult to manage and analyze. Specifically, Hadoop makes it possible to process extremely large volumes of data with various structures or no structure at all. But Hadoop can be challenging to install, configure and administer, and individuals with Hadoop skills are not easily found [12,13].
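The partition-then-integrate idea behind Hadoop is the MapReduce pattern, which can be illustrated in miniature. This toy sketch simulates two "nodes" counting diagnosis codes; the records and the single-process simulation are invented for illustration:

```python
from collections import defaultdict

# Toy records: diagnosis codes, standing in for a huge distributed file.
records = ["E11", "I10", "E11", "J45", "I10", "E11"]

def map_partition(part):
    """Map step: each node independently emits (code, 1) pairs."""
    return [(code, 1) for code in part]

def reduce_counts(pairs):
    """Reduce step: merge the pairs into totals per code."""
    totals = defaultdict(int)
    for code, n in pairs:
        totals[code] += n
    return dict(totals)

# Partition the data across two simulated nodes, map each partition,
# then shuffle/reduce the intermediate pairs into the final result.
parts = [records[:3], records[3:]]
mapped = [pair for part in parts for pair in map_partition(part)]
counts = reduce_counts(mapped)
print(counts["E11"])  # 3
```

In real Hadoop the partitions live on different machines and the shuffle moves data over the network, but the map and reduce logic is the same shape.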
The global healthcare industry is under significant pressure to reduce costs and more efficiently manage resources while improving patient care. In addition, rising rates of chronic disease, aging populations, changing consumer expectations about how they want to purchase and receive care, and increasing access to social media and mobile technologies are transforming the way healthcare is obtained and delivered. Payment models and healthcare systems in many countries are also evolving from a fee-for-service approach to an outcomes-based or accountable-care approach that requires access to more accurate data to document and track results. To thrive, or even survive, in this time of massive change, healthcare organizations must become data driven. They must treat data as a strategic asset and put processes and systems in place that allow them to access and analyze the right data to inform decision-making processes and drive actionable results. Historically, healthcare organizations have operated based on imprecise or incomplete cost and care measurements and did not have the comprehensive view of clinical and operational processes they needed to identify areas for improvement. The healthcare industry has recently begun to turn to data and analytics in ways similar to other industries that rely on digital information to improve service and reduce costs. At the core of a data-driven healthcare organization is the ability to analyze a wide range of big data, from within and outside its four walls, to determine what is happening right now with respect to patient, staff and population profiles, as well as financial, clinical and operational processes. Big data comprises much larger volumes, wider varieties and greater velocities of data than most organizations have previously captured, stored and analyzed. It includes data from traditional sources such as electronic medical records (EMRs) and from non-traditional sources such as social media and public health records.
Gaining access and applying clinical and advanced analytics to this valuable new data enables organizations to improve insight into risk, outcomes, resources, referrals, performance and readmissions, and to take prescriptive action. Bringing together structured and unstructured data supports more insightful analysis that enables personalized and evidence-based medicine, more efficient processes, and incentives that can improve patient behavior.
I. Premier, the largest US healthcare alliance, includes a network of more than 2,700 member hospitals and health systems, 90,000 non-acute care facilities and 400,000 physicians. The Premier alliance has compiled the largest clinical, financial, supply chain and operational comparative database, with information on one in four discharged hospital patients in the United States. The database provides members with the detailed comparative clinical outcome measures, resource utilization information and transaction-level cost data they need to make informed strategic planning decisions that improve processes and outcomes. Through one Premier performance improvement collaborative of more than 330 hospitals, more than 29,000 lives have been saved and healthcare spending has been reduced by almost USD 7 billion. Premier is in the process of expanding beyond hospital data to bring together data from physician offices, nursing homes, long-term-care facilities, other non-acute care settings and home monitoring devices to enable a more complete healthcare picture of patients and populations.
II. North York General Hospital, a 450-bed community teaching hospital in Canada, is using real-time analytics to improve patient outcomes and develop a deeper understanding of the operational factors driving its business. North York has implemented a scalable, real-time analytics solution that provides a unified picture of the hospital's operations from a clinical, administrative and financial perspective. The solution processes data from more than 50 diverse collection points dispersed among a dozen internal systems to provide administrators and physicians with analytics-driven insights that improve business performance and patient outcomes. Separately, one of the largest healthcare providers in the United States is reducing costs while improving patient care by analyzing data in its EMR systems. This data includes the unstructured data in physician notes, pathology reports and other records that were not previously accessible, used to better understand outcomes, find the causality of care protocols and determine which care pathways lead to the best outcome. In addition, the analytics system helps enhance care decisions by enabling physicians to easily perform ad hoc queries optimized for individual patient circumstances.
I. Columbia University Medical Center is analyzing complex correlations of streaming physiological data from brain-injured patients to provide medical professionals with critical information they need to treat complications proactively rather than reactively. Advanced analytics help detect severe complications as much as 48 hours earlier than traditional methods in patients who have suffered a bleeding stroke from a ruptured brain aneurysm. These patients often experience serious complications during recovery, including delayed ischemia, which is fatal in 20 percent of individuals. The first phase of the Columbia project used advanced analytics to examine massive volumes of real-time data streams and persistent data from patients with bleeding strokes and detect hidden patterns that indicated the likelihood of complications. In the second phase of the project, researchers worked in the neurological intensive care unit to gather patient data in real time. They tested for the previously identified early warning signs and alerted medical professionals when a patient was experiencing a life-threatening complication that required immediate, preventative treatment.
II. The Rizzoli Orthopaedic Institute in Bologna, Italy, the first Western healthcare institute specializing in orthopedic pathologies, is improving care and reducing treatment costs for hereditary bone diseases. Rizzoli is using advanced analytics to gain a more granular understanding of the dynamics of clinical variability within families where individuals show drastic differences in the severity of their symptoms. Efforts thus far have resulted in more efficient and cost-effective care, including 30 percent reductions in annual hospitalizations and over 60 percent reductions in the number of imaging tests. In the future, Rizzoli hopes to gain additional insights into the complex interplay of genetic factors and identify cures.
III. The Hospital for Sick Children (SickKids) in Toronto, the largest center dedicated to advancing children's health in Canada, is improving outcomes for infants susceptible to life-threatening nosocomial infections. SickKids applies advanced analytics to vital-sign data, captured from bedside monitoring devices up to 1,000 times per second, to detect potential signs of an infection as much as 24 hours earlier than with previous methods. Although it is still too early to report on the project's success, researchers and clinicians are hopeful that it will deliver significant future benefits for many types of medical diagnostics. Planning as well as cultural and technological changes are required to create a data-driven healthcare organization. The transformation does not have to happen all at once; organizations can start small and build capabilities over time.
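A pattern common to the streaming projects above, flagging abnormal vital-sign readings against a rolling baseline, can be sketched as a rolling z-score alert. The readings, window size and threshold below are invented for illustration, not drawn from either project:

```python
import statistics
from collections import deque

def alerts(stream, window=5, z_threshold=3.0):
    """Return (index, value) for readings far from the recent baseline."""
    recent = deque(maxlen=window)   # rolling window of recent readings
    flagged = []
    for i, value in enumerate(stream):
        if len(recent) == window:
            mean = statistics.mean(recent)
            sd = statistics.pstdev(recent)
            # Flag readings more than z_threshold std devs from baseline.
            if sd > 0 and abs(value - mean) / sd > z_threshold:
                flagged.append((i, value))
        recent.append(value)
    return flagged

# Invented heart-rate trace with a sudden spike at index 6.
heart_rate = [72, 74, 71, 73, 72, 73, 140, 72]
print(alerts(heart_rate))  # [(6, 140)]
```

Production systems use far richer models over many signals at once, but the core idea, comparing live readings to a continuously updated baseline, is the same.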
I. Systems to provide static, spatially continuous maps of infectious disease risk and continually updated reports of infectious disease occurrence exist, but to date the two have never been combined.
II. Novel online data sources, such as social media, combined with epidemiologically relevant environmental information are valuable new data sources that can assist the "real-time" updating of spatial maps.
III. Advances in machine learning and the use of crowdsourcing open the possibility of developing a continually updated atlas of infectious diseases.
IV. Freely available dynamic infectious disease risk maps would be valuable to a wide range of health professionals from policy makers prioritizing limited resources to individual clinicians.
The big data revolution is already underway, and harnessing the useful information in these new data sources will involve collaborations with computer scientists at the forefront of machine learning and with those who have had success in engaging communities. The evidence shows that motivating people to devote some of their "cognitive surplus" to crowdsourcing is possible, so long as the products and benefits are immediately available to all for the common good. We have seen the rise of crowdsourced influenza surveillance with participatory systems such as Flu Near You in the United States (www.flunearyou.org) and Influenzanet in the EU (www.influenzanet.eu), which now boast nearly 100,000 volunteers combined. From the outset, all infectious disease data and derived maps should be made freely available to ensure engagement. This will also facilitate the uptake of new resources and their consideration by policy makers. Once the primary investment in the software platform is complete, and the community established, sustainability increases because demands for user inputs decrease as the software learns and the mapped outputs become increasingly stable. The ultimate vision is to democratize the platform by providing the code to all interested authorities [15-17].
Many inputs are needed to use geospatial techniques to generate occurrence maps, Hay explained. These inputs include:
I. Occurrence points: the latitude, longitude, and time for every diagnosis of a disease;
II. Environmental covariates: the readily available data on rainfall, temperature, vegetation, population density, and other associated demographic variables that have been collected continuously across the planet;
III. Control data: information reflecting areas in which the disease is absent; and
IV. A constraint parameter called the definitive extent: the probability that a disease will be found in a location.
Using a technique for species-distribution modeling based on boosted regression trees, it is possible to take these data and make a continuous map of disease occurrence. While this approach works, many challenges make it difficult. The first challenge Hay discussed is compiling occurrence data from the literature by identifying every paper documenting the occurrence of a disease and manually mapping each occurrence. For dengue fever, which did not involve big data, this exercise took Hay and his colleagues one year and generated some 8,000 points from more than 100 countries.
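The boosting idea behind boosted regression trees can be illustrated with a toy model that combines weak threshold "stumps", re-weighting misclassified points each round (an AdaBoost-style sketch, not the exact algorithm used in species-distribution modeling). The single environmental covariate and presence/absence labels are invented:

```python
import math

# One invented covariate (temperature) with toy presence (+1) / absence (-1).
temps  = [12, 14, 15, 18, 22, 24, 26, 29]
labels = [-1, -1, -1, -1, +1, +1, +1, +1]

def best_stump(weights):
    """Pick the threshold/direction stump with the lowest weighted error."""
    best = None
    for thr in temps:
        for sign in (+1, -1):
            preds = [sign if t >= thr else -sign for t in temps]
            err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def boost(rounds=3):
    """Fit an ensemble of stumps, up-weighting mistakes each round."""
    w = [1 / len(temps)] * len(temps)
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = best_stump(w)
        err = max(err, 1e-10)                      # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # stump's vote weight
        ensemble.append((alpha, thr, sign))
        preds = [sign if t >= thr else -sign for t in temps]
        w = [wi * math.exp(-alpha * y * p)
             for wi, y, p in zip(w, labels, preds)]
        total = sum(w)
        w = [wi / total for wi in w]               # renormalize weights
    return ensemble

def predict(ensemble, t):
    """Weighted vote of all stumps for a new covariate value."""
    score = sum(a * (s if t >= thr else -s) for a, thr, s in ensemble)
    return +1 if score >= 0 else -1

model = boost()
print(predict(model, 27), predict(model, 13))  # 1 -1
```

Real occurrence mapping uses many covariates and full regression trees rather than stumps, but the ensemble-of-weak-learners principle is the same.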
Big data in healthcare is a hot issue, as it is in other fields. With the continuous increase of digital data created in the process of medical services and health management, big data management and analysis is becoming important. Researchers in medical services and health management have been strictly confined within classical statistical methods; for example, requirements on randomness and sample size limited association studies. Big data analysis may liberate researchers from these limitations and introduce a new world of analysis and research. Prediction and trend awareness of disease are typical examples among many tentative big data applications in the healthcare area. With the mandated adoption of electronic health records (EHRs), many healthcare professionals got centralized access to patient records for the first time. Now they are figuring out how to use all this information. Although the healthcare industry has been slow to delve into big data, that might be about to change. At stake: not only money saved from more efficient use of information, but also new research and treatments, and that is just the beginning.
It is hard to think of a more worthwhile use for big data than saving lives, and around the world the healthcare industry is finding more ways to do that every day. From predicting epidemics to curing cancer and making a hospital stay a more pleasant experience, big data is proving invaluable to improving outcomes. This is very good news indeed, as the cost of care has skyrocketed in recent years and is expected to continue to do so as the population ages, to the point where we could be headed for serious trouble. Google claimed that it could detect outbreaks of flu more accurately than standard prediction methods by monitoring search activity. But these are just the tip of the iceberg in an industry that generates mountains of data across every area of its operations. IDC Health Insights found that 50 percent of hospitals and healthcare insurers put increasing their analytics capabilities as their top priority for investment over the next year. The body of medical literature from which further research evolves continues to grow every day, with an estimated one million records per year added to Medline, the online repository of scientific studies related to medicine. Efficiency is the great driver here: with the cost of healthcare in the US currently standing at around 18 percent of GDP and forecast to rise, payment models are changing. While traditionally providers have been paid according to the number of patients they treat, a move toward payment based on results and quality of treatment is taking place. These more complex metrics require more data and a different analytical skill set, rather than simply counting the number of patients coming through the door.
This chapter aimed to introduce healthcare analysts and practitioners to advancements in the computing field that help them effectively handle and make inferences from voluminous and heterogeneous healthcare data. Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, challenges remain to be overcome. Big data analytics has the potential to transform the way healthcare providers use sophisticated technologies to gain insight from their clinical and other data repositories and make informed decisions. As big data analytics becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards and governance, and continually improving the tools and technologies will garner attention. Big data analytics and applications in healthcare are at a nascent stage of development, but rapid advances in platforms and tools can accelerate their maturing process. Rising costs, chronic illness, an aging population and a shortage of professionals are forcing massive changes in the healthcare industry. To gain insight into how they can improve service while reducing costs, healthcare payers and providers are turning to data and analytics. Leading organizations are treating data as a strategic asset and putting processes and systems in place that help healthcare professionals improve decision-making and drive actionable results. In the process, data-driven healthcare organizations use big data analytics for big gains.
The potential of big data analytics in healthcare has been compiled here with the aim of educating inquisitive minds. In this emerging discipline, there is little independent research to cite. Thanks are due to all those who developed the literature through scientific research.