Big Data Declares a War on Cancer
By Dmitri Adler, Co-Founder and Chief Data Scientist, Data Society
In 1970, cancer was the second-leading cause of death in the United States. President Nixon made fighting this disease a priority in his 1971 State of the Union address: “I will also ask for an appropriation of an extra $100 million to launch an intensive campaign to find a cure for cancer, and I will ask later for whatever additional funds can effectively be used. The time has come in America when the same kind of concentrated effort that split the atom and took man to the moon should be turned toward conquering this dread disease. Let us make a total national commitment to achieve this goal.”
Great progress has been made over the past 45 years. Many challenges remain, but technological capabilities have vastly improved.
In his last State of the Union address, President Obama reiterated Vice President Joe Biden’s plea for a concerted effort to use the brightest minds in the U.S. to cure cancer and announced the creation of a national cancer moonshot. President Obama asked Vice President Biden to be “in charge of Mission Control.” “For the loved ones we’ve all lost, for the family we can still save, let’s make America the country that cures cancer once and for all,” Obama said.
The good news is that today there is a massive amount of data available for cancer researchers to use in their mission. The challenge is that, because of the lack of reporting standards and the disparate databases, much of that data is left unanalyzed, which can lead to many missed opportunities for breakthroughs.
Since President Obama’s declaration, Vice President Biden has met with leaders of the MD Anderson Cancer Center at the University of Texas which, in 2012, launched the Moon Shots Program aimed at reducing cancer mortality. There are many types of cancers. While they are all driven by gene mutations in various cells, every type of cancer requires a targeted approach. The Moon Shots Program has many mini-projects, or Moon Shots, aimed at treating specific cancers.
The program’s innovation is driven by the multitude of specialists involved in the project, from clinicians to biostatisticians and programmers. The Moon Shots include research into B-cell lymphoma, glioblastoma (brain cancer), cancers caused by the human papillomavirus (HPV), high-risk multiple myeloma, colorectal and pancreatic cancers, breast and ovarian cancers, chronic lymphocytic leukemia, lung cancer, melanoma, myelodysplastic syndrome/acute myeloid leukemia and prostate cancer. Together, they cover an unprecedented number of diseases for a single effort.
Cancer is a very complicated ailment with very complex treatments. A single tumor can have more than 100 billion cells, and each cell can have different genetic mutations. The mutations are not constant over time, so treatment has to evolve with them. To understand each cancer, clinicians need to understand the kinds of mutations that are driving it. There are 3 billion code letters, or base pairs, in each cell’s genome, so understanding the mutations expressed in each tumor is no small task. Multiplying those two figures gives as many as 300 billion billion (3 × 10^20) opportunities for mutation in just one tumor.
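As a back-of-the-envelope check on that scale, the sketch below simply multiplies the two round figures quoted above: cells per tumor and code letters per cell. It assumes every letter in every cell counts as a separate site where a mutation could occur, which is a simplification, and the function name is purely illustrative.

```python
# Round, order-of-magnitude figures quoted in the article.
CELLS_PER_TUMOR = 100_000_000_000   # "more than 100 billion cells"
LETTERS_PER_CELL = 3_000_000_000    # "3 billion code letters" in each cell


def mutation_sites(cells: int, letters_per_cell: int) -> int:
    """Upper bound on (cell, position) pairs where a mutation could sit.

    Assumes every position in every cell is counted separately.
    """
    return cells * letters_per_cell


total_sites = mutation_sites(CELLS_PER_TUMOR, LETTERS_PER_CELL)
print(f"{total_sites:.1e} possible mutation sites in one tumor")
```

The product comes to 3 × 10^20, which is why no clinician can inspect a tumor's mutations by hand.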
With so much complexity, there are many ways to approach cancer research. For example, scientists at the NIH have used network analysis methods to map out protein interactions to discover new biomarkers and significant players in the cell’s architecture. These discoveries help guide clinical studies and other research on gene expression.
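To give a flavor of what network analysis means here, the following is a minimal sketch, not the NIH researchers' actual method. It builds a tiny interaction graph and ranks proteins by degree centrality, one of the simplest ways to flag "significant players." The protein names P1 to P6 and their interactions are placeholders invented for illustration.

```python
from collections import defaultdict

# Hypothetical interaction pairs; P1..P6 are placeholder names, not real data.
INTERACTIONS = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P4", "P5"), ("P5", "P6"),
]


def build_graph(edges):
    """Return an undirected adjacency map: protein -> set of partners."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other proteins each protein interacts with."""
    n = len(graph)
    return {node: len(partners) / (n - 1) for node, partners in graph.items()}


def top_hubs(graph, k=2):
    """The k most connected proteins, ties broken alphabetically."""
    scores = degree_centrality(graph)
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]


graph = build_graph(INTERACTIONS)
hubs = top_hubs(graph)
print(hubs)
```

In this toy graph P1 touches three of the five other proteins, so it ranks first. Real studies work with thousands of proteins and richer measures than degree alone.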
Researchers across Moon Shots programs are using machine-learning models to predict whether a patient has various types of cancer based on the expression levels of specific genes. Implementation for thyroid cancer has been especially fruitful. Thyroid cancer usually causes a lump at the base of the neck, and around 5 to 15 percent of these lumps are malignant. By measuring gene expression at the lump, the machine-learning model is able to predict with greater than 90 percent accuracy whether it is malignant or benign. The work was published in Clinical Cancer Research in 2012.
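The idea of classifying a lump from gene expression can be sketched with a nearest-centroid classifier, one of the simplest machine-learning models. This is an illustration only, not the published model: the three genes are hypothetical and every expression value below is synthetic.

```python
from statistics import mean

# Synthetic training data: each row holds the expression levels of three
# hypothetical genes measured in one nodule. All values are invented.
TRAIN = {
    "benign":    [[1.0, 0.9, 1.1], [1.2, 1.0, 0.8], [0.9, 1.1, 1.0]],
    "malignant": [[3.1, 0.4, 2.2], [2.8, 0.5, 2.5], [3.3, 0.3, 2.0]],
}


def centroid(rows):
    """Average expression profile across a group of samples."""
    return [mean(column) for column in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(train):
    """Learn one centroid per class label."""
    return {label: centroid(rows) for label, rows in train.items()}


def predict(model, sample):
    """Assign the label whose centroid lies closest to the sample."""
    return min(model, key=lambda label: squared_distance(model[label], sample))


model = fit(TRAIN)
print(predict(model, [3.0, 0.5, 2.1]))
print(predict(model, [1.1, 1.0, 0.9]))
```

A production model would be trained on many more genes and patients and validated on held-out cases before any accuracy figure could be claimed.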
Protein and gene-expression data are not the only kinds of information used by researchers. Scientists at Case Western Reserve University have applied machine-learning techniques to magnetic resonance imaging (MRI) scans of breast cancer patients to predict, with 95 percent accuracy, whether a patient has aggressive triple-negative breast cancer, a slower-moving cancer or a non-cancerous lesion. Today’s image analytics can significantly augment the insights gleaned from lab tests. The challenge with cancer is getting a full picture.
Text stored in medical records is another powerful source of relatively untapped data. Modern natural language processing capabilities can analyze massive amounts of unstructured data and combine the results with structured research and clinical information. Combining doctors’ notes with numerical lab tests, for example, can give context to the condition and symptoms of the patient at various stages of different cancers.
Medical records include a treasure trove of data. Factors such as family histories, clinical test results and genomic data are stored in repositories across the world. The challenge is combining all that data in one database.
“Big data is not just big. The term also implies three additional qualities: multiple varieties of data types, the velocity at which the data is generated, and the volume seen within MD Anderson,” says Keith Perry, associate vice president and deputy chief information officer.
One of the ambitious objectives of the MD Anderson Cancer Center is to collect and combine patient information, including a profile of each patient’s genetic makeup, clinical history, test results, treatment courses and treatment responses. This data will be interpreted by large-scale analytics that provide real-time decision support to rapidly improve clinical outcomes. This is a much more challenging task than meets the eye.
When the startup Flatiron Health launched with an ambitious goal to improve cancer treatment, one of the largest obstacles it faced was the inconsistency of records from various electronic health record (EHR) systems.
With over $100 million in backing from Google Ventures, Flatiron is facing this basic problem: for a single protein commonly measured in cancer patients, a single electronic medical record (EMR) system at a single cancer clinic showed results in more than 30 different formats. There are over 100 different kinds of protein and genetic tests, biopsies and other diagnostic methods used in cancer care, and the various EMR systems all report these metrics in different ways. This is an incredibly complex data integration problem, so much so that Flatiron purchased Altos Solutions, which makes an EMR service for oncology practices. The acquisition allows the company to control the data collection process.
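To make the format problem concrete, here is a minimal sketch of normalizing one lab value that arrives in several spellings. It is not Flatiron's pipeline. The example strings, the choice of ng/mL as the target unit and the small unit table are all assumptions made for illustration.

```python
import re

# Conversion factors to ng/mL for a few unit spellings (assumed for this sketch).
UNIT_TO_NG_PER_ML = {
    "ng/ml": 1.0,
    "ug/l": 1.0,      # 1 ug/L is the same concentration as 1 ng/mL
    "ug/ml": 1000.0,
}

# First number in the string (dot or comma decimal), then an optional unit.
PATTERN = re.compile(
    r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>[a-zA-Z]+/[a-zA-Z]+)?"
)


def normalize(raw, default_unit="ng/ml"):
    """Return the result in ng/mL, or None if it cannot be interpreted."""
    match = PATTERN.search(raw)
    if match is None:
        return None  # no number at all, e.g. "pending"
    value = float(match.group("value").replace(",", "."))
    unit = (match.group("unit") or default_unit).lower()
    factor = UNIT_TO_NG_PER_ML.get(unit)
    if factor is None:
        return None  # unit not in the table; flag for human review
    return round(value * factor, 6)


# Hypothetical raw entries, as they might appear in different record fields.
for raw in ["4.2 ng/mL", "Result: 4,2 ug/L", "0.0042 ug/mL", "4.2", "pending"]:
    print(repr(raw), "->", normalize(raw))
```

The sketch takes the first number it finds, so free-text fields with several numbers would need stricter parsing. Multiply that by more than 100 test types and the size of the integration problem becomes clear.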
Finding cures and treatments for various types of cancer is truly a Big Data problem, and the ability to collect, store, share and analyze the data cohesively is still in relative infancy. This isn’t a problem you can solve with just one approach. Whether using network analysis, text mining or other machine-learning techniques, the task is a true interdisciplinary challenge that requires numerous types of expertise and really Big Data.
Big Data and machine learning don’t hold all the keys; human analysis and contextualization remain essential. Yet these technologies are starting to shine a light on how humanity will fight one of the most potent killers on the planet. President Nixon’s initiative gave us the Frederick Cancer Research and Development Center, an internationally recognized center for cancer research that has achieved many breakthroughs. President Obama’s initiative has the potential to revolutionize the state of cancer treatment. We’ll make a comparison in 45 years!
This article is also available on Real World Health Care.