Model-building with Complex Environmental Exposures

Institution: Cancer Prevention Institute of California
Investigator(s): David Nelson, Ph.D.
Award Cycle: 2009 (Cycle 15) Grant #: 15UB-8401 Award: $280,753
Award Type: SRI Request for Proposal (RFP)
Research Priorities
Etiology and Prevention>Prevention and Risk Reduction: ending the danger of breast cancer

Initial Award Abstract (2009)

Introduction: Cancer epidemiologists are beginning to generate data sets that are much larger and many times more complex than what has been available to them in the past. Analyzing data sets of ever-increasing size and complexity presents daunting obstacles. Unfortunately, the tools typically used by epidemiologists to analyze their data are not sufficient. At the same time, computer-intensive statistical methods are being developed to solve complex problems in other scientific areas, like the Human Genome Project. The overall goal of this proposal is to extend these computer-intensive methods to these complex cancer data sets. As a concrete example, we will use data from a large, ongoing study to address an important public health topic: understanding the relationship between agricultural pesticide use and the occurrence of breast cancer.

Question: Can statistical data mining methods being developed to explore and discover associations and relationships in large, complex data sets be profitably applied to determining which, if any, of the thousands of pesticide compounds being used in California agriculture pose a risk of breast cancer?

General methodology: Researchers will focus on two unique California resources. The first, the California Teachers Study, is an ongoing research effort begun in 1995 and involving over 130,000 active and retired California teachers. The second is California’s Pesticide Use Reporting System. This system has tracked every commercial application of over 1000 different agricultural pesticides throughout the state of California since 1990. The database contains information about the what, when, where, who, how, and how much of every agricultural pesticide application in California. Integrating these data sets provides a unique opportunity to evaluate the relationship between pesticide exposures and breast cancer. The researchers will then use expert knowledge of pesticides to apply statistical data mining methods, and to determine whether these methods can do better than simpler, more traditional methods.

Innovative elements: To date, computer-intensive methods have mainly been applied to biological problems and used on data with a large number of very simple variables. For instance, the Human Genome Project considered each of the over 30,000 different genes independently; only recently have scientists begun to explore more complex relationships among the genes. This is one of the first studies to explore the use of data mining methods with the more complex types of data that cancer epidemiologists produce. In addition, project researchers will take into account the complex relationships among pesticides and use expert knowledge of the health effects of pesticides to organize and guide this exploration.

Community involvement: The California Teachers Study has actively involved members of the community from its inception. The CTS External Advisory Committee includes members from teachers’ unions, from the State Teachers Retirement System, and from community-based breast cancer advocacy organizations. Feedback from these groups, as well as from members of the cohort itself (as part of an email feedback system in place for the CTS), has been influential in determining ongoing research priorities for the study.

Final Report (2013)

Cancer epidemiologists are beginning to generate data sets that are much larger and many times more complex than what has been available to them in the past. We at CPIC are particularly interested in the effects of known or suspected cancer-causing hazardous air pollutants on the potential for developing breast cancer. These types of exposures consist of a complex mixture of hundreds of measured compounds of vastly varying concentrations. Analyzing data sets of this size and complexity presents daunting obstacles. Traditional methods just don't work very well when applied to problems with hundreds of environmental variables, especially when these variables may be related to each other in some way.

At the same time, computer-intensive statistical methods have been (and are being) developed to solve similar large, complex problems in other scientific areas, like modern genomic biology, which may involve assessing the potential effects of thousands of genes. The goal of this project was to determine the extent to which these computer-intensive methods can be adapted and extended to the complex cancer data sets we are currently generating. To do so, we combined data from a large, ongoing study of more than 100,000 California teachers with an extensive database describing the spatial concentrations of hundreds of hazardous air pollutants throughout California. We used these California-specific resources to address an important public health topic: better understanding the relationship between the level of exposure to a mixture of known or potential mammary carcinogens and the occurrence of breast cancer.

Our first specific aim was to integrate data from two resources, the California Teachers Study and an appropriate year of the US Environmental Protection Agency’s National-Scale Air Toxics Assessment (NATA) database, into a common database that can be used to assess the relationship between exposure to hazardous air pollutants (HAPS) and the risk of breast cancer. Our second specific aim was to evaluate how modern, so-called “data mining” approaches compare with more traditional approaches when it comes to exploring the structure of high-dimensional epidemiologic data sets. Our third specific aim was to determine how results obtained in Aim 2 varied according to the way exposure was defined. Our fourth specific aim was to create a way for any tools developed in the project to be used by others in similar projects.

All four aims were addressed in the project, albeit with changes. The first aim was changed from using data on commercial pesticide exposures, which were largely confined to specific areas of California and were very noisy and incomplete, to using data on hazardous air pollutants, which are ubiquitous throughout the state and more precisely quantified. The EPA’s 2002 NATA data set was eventually chosen for analysis. It contained an estimated annual average concentration per census tract for approximately 180 air toxics, of which 41 contained enough non-zero data to be useful for subsequent analyses. These 41 hazardous air pollutants included 24 known or suspected mammary carcinogens that were of particular importance to us.
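The screening step described above, keeping only air toxics with enough non-zero data, might look roughly like the following sketch. The report does not state the criterion that was used, so the cutoff of 0.5, the function name `usable_pollutants`, and the toy concentrations are all illustrative assumptions.

```python
# Hypothetical sketch: keep only pollutants with enough non-zero tract-level data.
# The 0.5 cutoff and the toy values are assumptions, not the project's criterion.

def usable_pollutants(tract_concentrations, min_nonzero_fraction=0.5):
    """Return the sorted names of pollutants whose share of non-zero tracts
    meets the cutoff.

    tract_concentrations maps a pollutant name to a list of estimated
    annual average concentrations, one value per census tract.
    """
    kept = []
    for name, values in tract_concentrations.items():
        if not values:
            continue  # no data at all for this pollutant
        nonzero_fraction = sum(1 for v in values if v > 0) / len(values)
        if nonzero_fraction >= min_nonzero_fraction:
            kept.append(name)
    return sorted(kept)


# Toy data for three pollutants across four census tracts.
toy = {
    "benzene": [0.8, 1.2, 0.5, 0.9],
    "rare_compound": [0.0, 0.0, 0.0, 0.1],
    "formaldehyde": [1.5, 0.0, 2.1, 1.8],
}
selected = usable_pollutants(toy)
print(selected)
```

In this toy input, the pollutant that is non-zero in only one of four tracts is dropped, while the other two are retained.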

The results from Aim 2 were disappointing. Despite repeated attempts at focusing analyses on different sets of HAPS, the estimated variable importance of any individual HAP in the model was negligible, irrespective of the form of the model. An analysis of the HAPS data made it clear that a change to Aim 3 was required. The structure of the hazardous air pollutant exposure data proved to be much more complex than the Pesticide Use Reporting (PUR) data. First, pollutants, as opposed to pesticides, never appear by themselves. The concentrations of the pollutants that exist at a particular location are highly correlated. This is important because it has recently been shown that highly correlated data can wreak havoc with variable importance measures constructed by machine learning approaches such as Random Forests. Second, the effect of any individual pollutant is likely to be quite modest. Hence, the original goal of “separating the wheat from the chaff” is the wrong question to ask.
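The correlation problem described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the project's analysis: the data are synthetic, the "shared source" is a hypothetical stand-in for something like a common emission source, and plain Pearson correlation is used instead of a Random Forests importance measure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_tracts = 500

# Assumption: two pollutants are driven by one shared source plus small noise.
source = [rng.gauss(0, 1) for _ in range(n_tracts)]
pollutant_a = [s + rng.gauss(0, 0.3) for s in source]
pollutant_b = [s + rng.gauss(0, 0.3) for s in source]
unrelated = [rng.gauss(0, 1) for _ in range(n_tracts)]

# Toy continuous outcome that depends on the shared source only.
outcome = [s + rng.gauss(0, 1) for s in source]

r_ab = pearson(pollutant_a, pollutant_b)
r_a_unrelated = pearson(pollutant_a, unrelated)
r_a_outcome = pearson(pollutant_a, outcome)
r_b_outcome = pearson(pollutant_b, outcome)

print(f"pollutant A vs B:         {r_ab:.2f}")
print(f"pollutant A vs unrelated: {r_a_unrelated:.2f}")
print(f"A vs outcome: {r_a_outcome:.2f}, B vs outcome: {r_b_outcome:.2f}")
```

Because the two simulated pollutants track the same source, each is associated with the outcome to nearly the same degree, so there is little basis for crediting one over the other. That is the intuition behind the difficulty of ranking individual pollutants.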

Aim 3 was revised to address how data mining can be used to combine the exposures from a large number of potential pollutants in a way that effectively captures the total risk to a subject. Novel methods were developed that followed a two-phase approach. First, so-called “data mining” methods were used to summarize a large number of highly correlated exposures. Then machine-learning approaches were used on these summary measures to assess the relative value of each summary produced.
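The two-phase idea can be sketched in miniature as follows. The report does not name the specific summarization or machine-learning methods, so this example swaps in deliberately simple stand-ins: phase one averages z-scored exposures within predefined groups, and phase two ranks the summaries by absolute Pearson correlation with a toy continuous outcome. The group names, data, and helper functions are all hypothetical.

```python
import math
import random


def zscores(values):
    """Standardize a sequence to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def composite(columns):
    """Phase 1 stand-in: average z-scored exposures row by row."""
    standardized = [zscores(c) for c in columns]
    return [sum(row) / len(row) for row in zip(*standardized)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed for reproducibility
n = 400

# Synthetic exposures: two correlated pairs, each driven by a latent source.
traffic = [rng.gauss(0, 1) for _ in range(n)]
industry = [rng.gauss(0, 1) for _ in range(n)]
exposures = {
    "traffic_1": [t + rng.gauss(0, 0.4) for t in traffic],
    "traffic_2": [t + rng.gauss(0, 0.4) for t in traffic],
    "industry_1": [i + rng.gauss(0, 0.4) for i in industry],
    "industry_2": [i + rng.gauss(0, 0.4) for i in industry],
}
groups = {
    "traffic_summary": ["traffic_1", "traffic_2"],
    "industry_summary": ["industry_1", "industry_2"],
}

# Toy outcome that depends on the traffic source only.
outcome = [0.5 * t + rng.gauss(0, 1) for t in traffic]

# Phase 1: one summary score per group of correlated exposures.
summaries = {
    name: composite([exposures[m] for m in members])
    for name, members in groups.items()
}

# Phase 2 stand-in: rank summaries by strength of association with the outcome.
ranking = sorted(
    summaries,
    key=lambda name: abs(pearson(summaries[name], outcome)),
    reverse=True,
)
print(ranking)
```

The design point is that the unit being assessed shifts from individual pollutants to summaries of correlated groups, which sidesteps the need to divide credit among near-duplicate variables.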

For Aim 4, we designed, implemented, and tested two R packages to facilitate immediate integration of simulation data and results into subsequent displays. In addition, we have designed our subsequent summarization R code to be easily assembled into a package and will make it, as well as R vignettes describing its use, available as part of the supplemental information when the results are published.

Hazardous air pollutants and breast cancer risk in California teachers: a cohort study

doi:10.1186/1476-069X-14-14
Periodical: Environmental Health
Authors: Garcia E., Hurley S., Nelson D.O., Hertz A., Reynolds P.
Yr: 2015 Vol: 14 Nbr: 14