• 6/14/2005
  • Bethesda, MD
  • Karyn Hede
  • Journal of the National Cancer Institute, Vol. 97, No. 12, 876-878, June 15, 2005

It is easy to take for granted that a few keystrokes from a laptop in Ohio can retrieve in seconds data stored in an anonymous computer thousands of miles away. The day-to-day operations of countless businesses rely on this kind of infrastructure, which makes it possible for some to share information and for others to locate it. If it’s possible to do this for education and commerce, why not for cancer research?

It was that simple question that launched a massive undertaking at the National Cancer Institute to create the Cancer Biomedical Informatics Grid (caBIG), a $60 million project that its organizers like to call the “Internet of cancer.”

Proponents of caBIG started with a simple, but daunting, goal: Create a seamless network of resources that makes available data from the entire spectrum of cancer research, from genomic and microarray data to clinical trial outcomes, in a common language that any investigator can understand and use.

NCI has high hopes for the program. According to NCI’s literature on the project, “nearly every facet of NCI’s strategic plan to eliminate suffering and death due to cancer is predicated on the revolutionizing potential of caBIG.”

The project, from its inception in July 2003, has been designed as an open-source network to enable investigators to readily share data and technology now formatted by many incompatible software programs and tools. In some sense, the initiative reflects the realization that biomedical research in general, and cancer research in particular, has matured into a data-intensive enterprise.

“We believe that cancer research has a whole set of requirements that are not being met today for data and system and tool interoperability,” said Peter Covitz, Ph.D., director of NCI’s Bioinformatics Core Infrastructure for the caBIG project. “We believe these requirements can be satisfied by building an infrastructure with a common language and tools. We are not rebuilding the Internet; we are building on top of it in a way that’s been proven to be successful by the information technology industry.”

From the beginning, NCI made daylong visits to 50 NCI-funded cancer centers and sought investigators’ input on existing resources and needs. Eventually, 44 sites signed on to contractual relationships to help develop caBIG. These initial projects have been organized into eight so-called domain workspaces designed to produce research tools for areas such as clinical trial management, integrative cancer research, and tissue banking. Other groups are meeting to address issues such as software licensing arrangements, de-identification of patient data, and intellectual property issues. Two crucial “cross-cutting” work groups designed and built the caBIG infrastructure, called caCore, defined a common vocabulary, and worked out network grid design, providing the glue that holds the entire project together.

A year after signing on the first development contracts, the caBIG infrastructure is starting to take shape. In January, developers released caCore, the project’s main software development kit.

“It [caCore] was a very important deliverable to get out early in the caBIG program,” said Covitz. “It is the toolkit that people can use to build caBIG systems.” caCore consists of what’s called a modeling tool that helps match the developer’s system requirements to what’s available in caBIG’s toolkit. A built-in code generator then takes the model, writes the required computer code, and according to Covitz, “now you have a caBIG-compatible system.”
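The model-then-generate workflow Covitz describes can be illustrated with a minimal sketch. To be clear, the model format, field names, and generator below are invented for illustration and are not caCore’s actual API; the point is only the pattern of describing a data model declaratively and letting a code generator emit the boilerplate classes.

```python
# Hypothetical sketch of model-driven code generation in the spirit of
# caCore: a declarative data model is turned into working class code.
# Model contents and names are illustrative, not caCore's real schema.

MODEL = {
    "Gene": ["symbol", "chromosome", "taxon"],
    "Microarray": ["platform", "sample_id", "expression_values"],
}

def generate_class(name, fields):
    """Emit Python source for a simple data class with the given fields."""
    args = ", ".join(fields)
    body = "\n".join(f"        self.{f} = {f}" for f in fields)
    return f"class {name}:\n    def __init__(self, {args}):\n{body}\n"

# Generate source for every entity in the model, then compile it.
source = "\n".join(generate_class(n, f) for n, f in MODEL.items())
namespace = {}
exec(source, namespace)

gene = namespace["Gene"]("MYC", "8", "human")
print(gene.symbol)  # -> MYC
```

An existing system could adopt only the generator step, consistent with Covitz’s point that part of the kit can be “glued on” to what a site already has.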

“It’s pretty flexible,” added Covitz. “If you have a system already, you can just use part of the kit and then just glue it on to the system you already have.”

There are currently more than 70 funded caBIG projects, but most are still in the development stage. One early release is called caArray, a combination database and annotation tool designed to make microarray data accessible in a uniform format. Released in March 2005, caArray is available at NCI’s data portal.

Investigators can open an account at NCI and store data privately, explained Mervi Heiskanen, Ph.D., caArray project director at NCI’s Center for Bioinformatics, and “when they are ready to publish, they can make it public.”

Heiskanen also noted that the data do not have to be cancer specific in caArray and that anyone can use it. Developers are creating a data conversion utility due to be released in mid-2005 that will support conversion from virtually any data format. “We hope that we will be able to support any type of data format with a minimum amount of effort,” she said.

Early caArray adopter Jack London, Ph.D., director of the Kimmel Cancer Center Shared Computer Facility at Thomas Jefferson University in Philadelphia, has signed onto the caBIG project. He is in the process of making available data from thousands of microarray experiments generated by the worldwide zebrafish consortium, a group of researchers using zebrafish as a model for cancer, among other diseases.

“Typically the way we used to do things in the past is we’d say, ‘Well, we’ll write some code and throw it out there and see what happens,’” said London. “But this caBIG project is being done in a very methodical, collaborative approach.”

London is also a “developer–adopter” of caArray, making modifications to the program so that it better fits the typical workflow at Kimmel Cancer Center.

“For the government, things are going along at an amazingly fast rate,” said London. “The caBIG experience has been interesting and unique. The pace has been quicker than the usual grants system. … Here we are going on the second year now, and my enthusiasm for the whole initiative is definitely growing.”

The ultimate success of caBIG will depend on development of tools that can query the data and extract meaningful information. Andrea Califano, Ph.D., professor of biomedical informatics at Columbia University’s Institute of Cancer Genetics in New York, has spearheaded an effort to create just such a data analysis tool, called caWorkBench (formerly called BioWorks).

“The idea behind caWorkBench was to be a truly interoperable open-source foundation for writing modules that would be automatically incorporated and that would be naturally talking to each other,” said Califano. “It is similar to Lego blocks in allowing modules to plug into each other. This application spans a wide variety of data types and data sources.”
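The “Lego block” idea Califano describes can be sketched as a small plugin registry: each module declares the data type it consumes and produces, so compatible modules can be snapped together automatically. The module names, data types, and registry mechanics below are assumptions made for illustration, not caWorkBench’s real interfaces.

```python
# Illustrative plugin architecture: modules declare typed inputs/outputs
# and a driver chains whichever modules fit together. All names here are
# hypothetical, not caWorkBench's actual API.

REGISTRY = []

def module(consumes, produces):
    """Decorator that registers a function as a pluggable analysis module."""
    def wrap(fn):
        REGISTRY.append({"fn": fn, "consumes": consumes, "produces": produces})
        return fn
    return wrap

@module(consumes="expression_matrix", produces="gene_list")
def top_expressed_genes(data):
    # Keep the two most highly expressed genes (toy criterion).
    return sorted(data, key=data.get, reverse=True)[:2]

@module(consumes="gene_list", produces="promoter_set")
def fetch_promoters(genes):
    # Stand-in for a promoter-sequence lookup service.
    return {g: f"promoter_of_{g}" for g in genes}

def chain(start_type, data):
    """Run each registered module whose input type matches the current output."""
    current_type, current = start_type, data
    for m in REGISTRY:
        if m["consumes"] == current_type:
            current = m["fn"](current)
            current_type = m["produces"]
    return current

result = chain("expression_matrix", {"MYC": 9.1, "TP53": 7.4, "GAPDH": 2.2})
print(result)
```

Because modules are matched by data type rather than hard-wired to each other, a new module that consumes `gene_list` would slot into the pipeline without changing existing code, which is the interoperability property the quote emphasizes.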

caWorkBench originally started as an NCI project to analyze gene expression data, but it quickly became an integrated genomics environment. The program is available for download, complete with tutorials, at the NCI Web site or at Columbia University.

“One of the big advantages is that you can go across the different data types naturally,” said Califano. “For instance, you can load a set of microarray data, then analyze promoter regions of interest, then you can take those promoter regions and look to see if there [are] any systematic changes in terms of human variability in the promoter region.”

Califano uses caWorkBench in his own research studying networks of coregulated genes using “reverse engineering” from genome-wide expression profiles. Califano’s group published a research article in the April 2005 issue of Nature Genetics that reconstructs the gene regulatory network of human B cells using a combination of data sources and software programs that are fully integrated into the caWorkBench suite of tools.

“We select a transcription factor in the [algorithm for the reconstruction of accurate cellular networks (ARACNE)], we get all of its first-neighbors (i.e., directly connected genes), we can then retrieve their upstream sequences from the GoldenPath database in Santa Cruz and finally analyze them for conserved DNA binding motifs,” said Califano. “Normally this would be a very complex set of bioinformatics scripts that you would have to write and customize. You can do it in caWorkBench without ever leaving the application.”
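The steps Califano lists can be sketched as a short pipeline: pick a transcription factor in an ARACNE-style network, collect its first neighbors, pull their upstream sequences, and scan for a shared binding motif. The toy network and sequences below are invented for illustration; the E-box CACGTG is the canonical MYC binding motif, but nothing else here reflects real data.

```python
# Hedged sketch of the ARACNE-to-motif-scan workflow described in the text.
# NETWORK and UPSTREAM are toy stand-ins for an ARACNE network and a
# GoldenPath-style upstream-sequence lookup.

import re

NETWORK = {  # regulator -> first neighbors (directly connected genes)
    "MYC": ["CCND2", "NPM1", "TERT"],
}

UPSTREAM = {  # gene -> upstream promoter sequence (toy data)
    "CCND2": "TTCACGTGAA",
    "NPM1":  "GGCACGTGCC",
    "TERT":  "AATTGGCCTT",
}

def neighbors(tf):
    """First neighbors of a transcription factor in the inferred network."""
    return NETWORK.get(tf, [])

def genes_with_motif(tf, motif):
    """Return neighbors whose upstream sequence contains the binding motif."""
    return [g for g in neighbors(tf) if re.search(motif, UPSTREAM[g])]

# Scan MYC's neighbors for the canonical E-box motif.
hits = genes_with_motif("MYC", "CACGTG")
print(hits)  # neighbors carrying the motif
```

In caWorkBench, each of these steps would be a module invoked from the same application rather than a custom script, which is the convenience Califano is pointing to.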

Using a similar approach, the group identified the MYC gene as a major regulatory hub that controls a network of known and new target genes during B-cell maturation. Califano says the same tools could be used for systematic analysis of gene regulatory networks across a variety of normal and disease states.

“The name of the game has moved from being able to do ultrasophisticated analysis of one data modality to actually getting at the intersection of several different varieties of data where the gold sits somewhere at the cross-section of all these different data,” said Califano. “The hard lesson that people have learned is that to some extent it is actually better to work with a large number of weak clues than to work with a small number of sophisticated clues.”