Cambridge Healthtech Institute’s Ann Nguyen recently interviewed Sheila Reynolds of the Institute for Systems Biology. Dr. Reynolds previews her presentation, “The ISB Cancer Genomics Cloud,” for the Cloud Computing and Cancer Informatics
conferences taking place April 5-7, 2016, as part of the 15th Annual Bio-IT World Conference & Expo in Boston, MA.
Cancer Genome Analysis at the Institute for Systems Biology
Q1: How did your background in signal processing and machine learning lead you to work in cancer genome analysis at the Institute for Systems Biology in Seattle, Washington?
After working in industry for 15 years (spanning sonar signal processing and cellular telecommunications), I decided to go back to school to earn a Ph.D. While doing research in natural language processing in the Electrical Engineering department at the
University of Washington, I became interested in the field of computational biology where many of the same methods were being applied to understand and "parse" DNA and protein sequences. When I completed my Ph.D., I was offered the opportunity to
come to ISB and be a part of the TCGA GDAC.
Q2: The ISB is one of The Cancer Genome Atlas (TCGA) network’s Genome Data Analysis Centers. What kind of resources there support your work?
I have played an active role in our TCGA GDAC since its inception, and the deep familiarity that I (and others on the ISB-CGC team) gained through that work -- both with the types of data being generated by the TCGA program and the types of analyses performed
by each working group leading up to the publication of each of the tumor-specific "marker papers" -- has really guided our efforts in building the ISB-CGC platform. Our goal with the ISB-CGC is to provide a cloud-based framework and to develop tools
that will make these data and analyses more accessible to a wider range of researchers while continuously exploring how best to make use of the variety of technologies being deployed by Google on their Cloud Platform.
Q3: What are the most persistent challenges – technical and otherwise – your team has faced throughout your tumor-specific and pan-cancer analysis efforts? How are you addressing them?
The size and complexity of the TCGA dataset, which is an unprecedented resource, also pose many challenges. When the number of "features" available is over 1000 times greater than the number of cases in any one study, great care must be taken to
avoid over-fitting the data or over-interpreting the significance of statistical associations. Pan-cancer analysis efforts are further complicated by differences in tissue types and the evolution of the technologies and data-processing pipelines
used over the course of the TCGA project -- all of which are potential sources of systematic variation. Much of the TCGA data analysis to date has been tumor-specific or broadly pan-cancer, but the ISB-CGC platform will allow each user to define their
own custom "cohorts" based on clinical or molecular phenotypes -- allowing researchers to subset the ~12,000 patients in the TCGA dataset based on the hypotheses they are most interested in exploring. We are also hoping that the ISB-CGC will enable
much deeper, interactively driven, integrative analyses -- allowing a user to explore "high-level" data such as gene expression or copy number segments and then drill down to the "low-level" RNA-seq or DNA-seq reads to understand why a particular
sample is an extreme outlier relative to a larger cohort.
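As a rough illustration of the cohort idea described above, the following Python sketch filters a small, made-up table of cases by clinical and molecular attributes and then flags the sample whose expression value lies farthest from the cohort median. The field names (`tumor_type`, `tp53_mutated`, `expr`), the case IDs, and the values are all hypothetical, and the helper functions are not part of the ISB-CGC platform or any real API; the sketch only shows the kind of subset-then-drill-down logic involved.

```python
from statistics import median

# Hypothetical toy records; not real TCGA data or ISB-CGC field names.
CASES = [
    {"case_id": "case-01", "tumor_type": "BRCA", "tp53_mutated": True,  "expr": 5.1},
    {"case_id": "case-02", "tumor_type": "BRCA", "tp53_mutated": True,  "expr": 4.9},
    {"case_id": "case-03", "tumor_type": "BRCA", "tp53_mutated": False, "expr": 5.0},
    {"case_id": "case-04", "tumor_type": "LUAD", "tp53_mutated": True,  "expr": 5.3},
    {"case_id": "case-05", "tumor_type": "BRCA", "tp53_mutated": True,  "expr": 12.4},
    {"case_id": "case-06", "tumor_type": "BRCA", "tp53_mutated": True,  "expr": 5.2},
]


def define_cohort(cases, **criteria):
    """Return the cases whose fields match every key=value criterion."""
    return [c for c in cases if all(c.get(k) == v for k, v in criteria.items())]


def most_extreme(cohort, field):
    """Return the case whose value for `field` is farthest from the cohort median."""
    center = median(c[field] for c in cohort)
    return max(cohort, key=lambda c: abs(c[field] - center))


# Define a custom cohort from one clinical and one molecular criterion,
# then find the sample that would be worth drilling into.
cohort = define_cohort(CASES, tumor_type="BRCA", tp53_mutated=True)
outlier = most_extreme(cohort, "expr")
print(len(cohort), outlier["case_id"])
```

In practice the drill-down step would mean going back to the underlying sequence reads for the flagged sample; here it is reduced to returning the case record.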
Q4: You’ll elaborate on “The ISB Cancer Genomics Cloud” during your presentation on April 6. For now, can you describe its genesis and goals?
I have already described a few of our goals above, but broadly speaking, the ISB Cancer Genomics Cloud (ISB-CGC) is one of three NCI-funded pilot projects that were designed to democratize access to The Cancer Genome Atlas (TCGA) data. The ISB-CGC is
being built by scientists and software engineers at ISB, Google and SRA International (now CSRA). Our platform sits on top of the Google Cloud Platform and leverages a wide range of cloud technologies and services that provide access to a large-scale
data repository, the computational infrastructure, and the interactive exploratory tools to support and drive cancer genomics research.
The ISB-CGC aims to serve the needs of a wide range of cancer researchers, including:
- scientists or clinicians who prefer to use an interactive, web-based application;
- computational scientists who want to write custom scripts using languages such as R or Python; and
- algorithm developers who may need thousands of virtual machines to analyze hundreds of terabytes of sequence data.
Speaker Information:
Sheila Reynolds, Ph.D., Senior Research Scientist, Ilya Shmulevich Laboratory, Institute for Systems Biology
Dr. Reynolds comes from a background in signal processing and machine learning, and for the past six years has been an integral part of a multidisciplinary team of Research Scientists, Bioinformaticians, and Software Engineers from ISB and MD Anderson,
at one of the TCGA network’s Genome Data Analysis Centers. Dr. Reynolds has participated in numerous tumor-specific analysis working groups as well as TCGA’s “pan-cancer” analysis efforts and contributed to TCGA publications,
particularly in the areas of heterogeneous data integration and pathway-based analyses. Most recently, Dr. Reynolds has been helping to define and implement an innovative cloud-based system intended to support a new model for the computational analysis
of biological data by providing the cancer research community with easier access to large-scale datasets and the computational tools necessary to drive the next revolution in cancer genomics. This “cancer genomics cloud” pilot project
is a collaboration between ISB, Google and SRA International.
To learn more about Dr. Reynolds’s presentation during a shared session for the Cloud Computing and Cancer Informatics conferences, visit www.Bio-ITWorldExpo.com/Cloud-Computing or www.Bio-ITWorldExpo.com/Cancer-Informatics.