Using Scientific Text to Identify Breast Cancer Risk-Factors

Institution: University of California, Irvine
Investigator(s): Catherine Blake, M.S.
Award Cycle: 2002 (Cycle VIII) Grant #: 8GB-0175 Award: $29,136
Award Type: Dissertation Award
Research Priorities
Etiology and Prevention>Prevention and Risk Reduction: ending the danger of breast cancer



Initial Award Abstract (2002)
Each year scientists publish thousands of articles related to breast cancer. In addition to the primary purpose of an article, authors often publish secondary information such as the number of subjects with breast cancer who smoke or consume alcohol. Secondary information can be used to identify candidate risk factors for breast cancer; for example, if people with breast cancer drink more than people in the general population, then alcohol consumption should be studied further as a possible risk factor for breast cancer. Scientists currently extract secondary information manually; however, this process is tedious, time-consuming, and costly, so risk factors that are implicit in research studies go unnoticed.

The goal of this project is to identify new risk factors associated with breast cancer. By semi-automating the extraction and meta-analysis process, scientists will be able to explore secondary information reported in the medical literature faster and more comprehensively. They will also be able to reduce publication bias by including articles that were not specifically studying the risk factor.

I will construct an interactive computer system that will extract facts from the medical literature related to breast cancer. The system will then automatically compare the rate of exposure to the extracted risk factor with the rate of exposure in the general population. I will estimate the latter using the Behavioral Risk Factor Surveillance System, the world's largest telephone survey, which tracks health risks in the United States. Lastly, the system will combine the facts from each study using an existing set of statistical techniques called meta-analysis. I will demonstrate that the system works by: (i) measuring the accuracy of the information extracted; and (ii) using the interactive system to perform a meta-analysis between breast cancer and the risk factors smoking and alcohol.
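The comparison and pooling steps described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names, the choice of a log odds ratio as the effect measure, and the fixed-effect inverse-variance pooling are assumptions made for illustration, and every number in the usage example is hypothetical.

```python
import math


def log_odds_ratio(exposed_cases, total_cases, population_rate, population_n):
    """Log odds ratio of exposure among cases versus a reference population.

    population_rate and population_n stand in for a survey-based estimate
    of exposure in the general population (hypothetical inputs).
    Returns (log_or, variance). All four cells must be greater than zero.
    """
    a = exposed_cases                    # exposed cases
    b = total_cases - exposed_cases      # unexposed cases
    c = population_rate * population_n   # exposed in reference population
    d = population_n - c                 # unexposed in reference population
    if min(a, b, c, d) <= 0:
        raise ValueError("every cell count must be greater than zero")
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def pool_fixed_effect(estimates):
    """Inverse-variance weighted (fixed-effect) pooling.

    estimates is a list of (effect, variance) pairs, one per study.
    Returns (pooled_effect, standard_error).
    """
    weights = [1 / variance for _, variance in estimates]
    total_weight = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, estimates))
    pooled /= total_weight
    return pooled, math.sqrt(1 / total_weight)


# Hypothetical studies: (exposed cases, total cases) extracted from articles,
# compared against an assumed population exposure rate of 20% (n = 1000).
studies = [(30, 100), (45, 180), (12, 50)]
per_study = [log_odds_ratio(a, n, 0.20, 1000) for a, n in studies]
pooled, se = pool_fixed_effect(per_study)
print(f"pooled odds ratio: {math.exp(pooled):.2f} (SE of log OR: {se:.3f})")
```

A pooled odds ratio above 1 would suggest that the exposure is more common among cases than in the reference population, which is the signal that would flag a candidate risk factor for further study.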

The current manual approach to conducting a meta-analysis is costly and time-consuming. Although the computer science community has developed extraction techniques, they neither extract the kind of information required to perform a meta-analysis nor satisfy the accuracy required for a meta-analysis between breast cancer and a candidate risk factor. This project is innovative because the proposed system is interactive: it enables scientists to verify that the extracted information is accurate and to perform preliminary manipulations of the extracted facts before performing a meta-analysis. Software exists to perform a meta-analysis; however, it is not coupled with the extraction process.


Final Report (2004)
Introduction: The quantity of electronic information resources available to breast cancer researchers continues to increase at an overwhelming rate. Although retrieval systems have eased the task of collecting articles, users struggle to incorporate new findings into their work practices and have few opportunities to explore implicit connections within a corpus of journal articles.

Topic Addressed: Inspired by our study of scientific users in medicine and public health as they synthesized evidence from literature, we have developed both a methodology and supporting technology that enable a user to identify candidate risk factors for breast cancer that are hidden within existing scientific articles. Our goal is to build tools that facilitate the discovery of new risk factors. We chose to explore risk factors because, other than age and gender, currently known risk factors explain only half of the existing breast cancer cases.

Progress towards specific aims: To facilitate the discovery of risk factors from the scientific literature, we pursued the following specific aims:
(1) We assembled a collection of 1800 full-text breast cancer journal articles.
(2) Independent annotators manually extracted information from a subset of our article collection, which provided a gold standard for our system evaluation.
(3) We developed a methodology called Information Synthesis, and a computer system called the Multi-user Extraction for Information Synthesis (METIS) system. The METIS system automates critical tasks within the information synthesis process. Specifically, the METIS system: (i) identifies information from full-text articles, (ii) estimates a comparison group using the extracted information as an index to an external database, and (iii) compares the facts reported in each article with the comparison group using a meta-analytic technique. The METIS system produces a quantitative summary that incorporates typically unused information. The visual summary identifies redundancies and contradictions that are inevitable in a collection of breast cancer articles.
(4) Our evaluation of the METIS system showed that: (i) the shallow natural language processing used within the METIS system achieves precision and recall that are consistent with the current state of the art; (ii) the METIS system estimates a comparison rate that is similar to the rates found in a traditional analysis; and (iii) the quantitative summary produced by METIS matches examples in textbooks and existing meta-analyses.
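To make the extraction evaluation in aim (4) concrete, the following minimal sketch shows how precision and recall could be computed against a manually annotated gold standard such as the one described in aim (2). The tuple layout (article id, risk factor, count), the function name, and all values are hypothetical and are not taken from the METIS evaluation.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted facts against a gold standard.

    Both arguments are iterables of hashable facts, here represented as
    hypothetical (article_id, risk_factor, count) tuples.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical facts: two of the three extracted facts appear in the gold set,
# and two of the four gold facts were found by the extractor.
system_output = [("a1", "smoking", 12), ("a1", "alcohol", 30), ("a2", "smoking", 7)]
gold_standard = [("a1", "smoking", 12), ("a2", "smoking", 7),
                 ("a2", "alcohol", 9), ("a3", "smoking", 4)]
precision, recall = precision_recall(system_output, gold_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact tuple matching is a strict criterion; a real evaluation might credit partial matches, which this sketch does not attempt.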

We explored both alcohol and tobacco consumption within our breast cancer article corpus. Our results showed that the information synthesis approach was: (1) more comprehensive than existing published analyses that explored a similar hypothesis projection, and (2) able to identify phenomena that existing synthesis techniques were unable to detect.

Future Direction: We are currently exploring methods to increase the performance of the extraction algorithms used in the METIS system and experiments to better evaluate the interactive aspects of the system.

Impact: As the quantity of information continues to increase, tools that enable researchers to use hidden facts effectively provide a promising new way to identify risk factors.