The Catalogue of Somatic Mutations in Cancer (COSMIC)

COSMIC is currently the most comprehensive global resource for information on somatic mutations in human cancer, combining curation of the scientific literature with tumor resequencing data from the Cancer Genome Project at the Sanger Institute, U.K. Almost 4800 genes and 250000 tumors have been examined, resulting in over 50000 mutations available for investigation. This information can be accessed in a number of ways, the most convenient being the Web‐based system which allows detailed data mining, presenting the results in easily interpretable formats. This unit describes the graphical system in detail, elaborating an example walkthrough and the many ways that the resulting information can be thoroughly investigated by combining data, respecializing the query, or viewing the results in different ways. Alternate protocols give an overview of the precompiled data files available for download. Curr. Protoc. Hum. Genet. 57:10.11.1‐10.11.26. © 2008 by John Wiley & Sons, Inc.


INTRODUCTION
The Catalogue of Somatic Mutations in Cancer (COSMIC) is currently the most comprehensive global resource accessing the world literature on somatic mutations in human cancer. The system is designed to preserve and improve the usefulness of published somatic mutation data by full curation of the scientific literature for genes known to be involved in cancer, as defined in the Cancer Gene Census (Futreal et al., 2004). It has been further extended to include tumor resequencing results from the Cancer Genome Project (CGP) at the Wellcome Trust Sanger Institute.
Since its initial release in 2004, COSMIC has grown to include data from nearly 4800 genes investigated for somatic mutations in cancer; almost 250000 tumor samples have been investigated, resulting in over 50000 mutations. The system is regularly updated, receiving new data once every two months. COSMIC now contains details of most of the genes known to have tumor-promoting point mutations, and curation of fusion events between multiple genes has commenced, since these are also common occurrences in cancer.
In this unit, the Basic Protocol describes how to examine the data in COSMIC using the graphical Web system. Two alternate protocols briefly describe accessing the data at a more basic level, using exported datasheets or by obtaining a copy of the database itself. The Commentary section discusses the context within which COSMIC exists, some of the issues it resolves, and some that may be encountered. Some examples are provided to illustrate how easily and precisely the data can be investigated.

Simple COSMIC searching
2. To perform a simple search, enter the search term of choice in the Text Search box. In this example, enter KRAS into the search box to perform a search by gene (searching is not case-sensitive). The search results window is displayed, with each option providing a small description.
Most searches in COSMIC are performed simply through the Text Search box. It is possible to search COSMIC for gene names and their HUGO synonyms. Simple searches can also be performed for tissue types (primary locations such as "pancreas," or subspecializations such as "uterus," classified in COSMIC as Soft Tissue: Striated Muscle: Uterus), tumor morphology (e.g., "glioma"), sample name (e.g., "HCC38"), or a mutation description (e.g., the common KRAS mutation "c.35G>A"; "p.G12D"). If a COSMIC ID number (the numeric internal database identifier) is known for any of these, it can also be used in a simple search to retrieve very specific datapoints (the KRAS p.G12D mutation has ID number 521). Search terms can be combined: "521 mutation" will return only mutations, excluding, for instance, all samples with "521" in their name.
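Mutation descriptions of the kind accepted by the Text Search box follow a regular syntax, so they can be taken apart programmatically before or after a search. The following Python sketch is an illustration only and is not part of COSMIC: the function name parse_simple_substitution and both regular expressions are assumptions, and they cover only single-base substitutions such as "c.35G>A" and single-residue changes such as "p.G12D".

```python
import re

# Patterns for the two simplest description forms quoted in this unit.
# They are illustrative assumptions and do not cover insertions,
# deletions, or fusions.
CDNA_SUB = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
PROTEIN_SUB = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_simple_substitution(description):
    """Return a dict describing a simple substitution, or None if unmatched."""
    text = description.strip()
    match = CDNA_SUB.match(text)
    if match:
        position, ref, alt = match.groups()
        return {"level": "cDNA", "position": int(position), "ref": ref, "alt": alt}
    match = PROTEIN_SUB.match(text)
    if match:
        ref, position, alt = match.groups()
        return {"level": "protein", "position": int(position), "ref": ref, "alt": alt}
    return None


print(parse_simple_substitution("c.35G>A"))
print(parse_simple_substitution("p.G12D"))
```

Anything that does not match either pattern (a gene name, a tissue, a sample name) returns None, which a caller could treat as a free-text search term.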
In the Mutation Summary, the spread of mutations across the gene is shown graphically (the scale is the peptide sequence of the gene's product). The mutations combined are drawn in green, with the most frequently mutated position highlighted in red, and the subsequent breakdown by mutation type drawn in black. The mutations graphic is

Tabulating selected data
5. Click on the More Details link on the right side of the Details table. In Figure 10.11.4A, the link has been clicked for "pancreas," resulting in a popup box. The three top links offer methods to refine the query used to produce the histogram page. Under Sample Data, export functions are available to tabulate the data summarized in the

The Histogram page: Zooming
6. Click the browser's "back" button to return to the histogram page in cDNA view. Click the peak mutant shown in Figure 10.11.3B, and a small popup menu will offer zooming options and further details links. Click the "Zoom in +−5bp" option, and the view will zoom in on the 10 bp surrounding the selected nucleotide (Fig. 10.11.5).

Fig. 10.11.4A) allows further specialization of the phenotype being investigated. This detailed phenotype search is also available via the Browse by Tissue option on the main COSMIC home page. Options are offered from the COSMIC database, which can be selected singly or in multiples. In this example, to specify a tumor site (tissue), select "pancreas," click Next, then "ampulla of Vater," and click Next. To further specialize by tumor morphology (histology), select "carcinoma" and click Next, then select "ductal carcinoma" and click Next. The Tissue Summary page will present a list of genes with statistics to show how many tumor samples have been examined in each gene, and the tumor type's mutation frequency in that gene (Fig. 10.11.6). Click on the KRAS gene in the table to go to the Histogram page with the new specialized query, which will now display only mutations for the new phenotype selection (Fig. 10.11.7).
These selection pages behave differently depending on the number of selections made. At the beginning, selecting five tissues or fewer will allow further specialization of the tumor site; selecting more will simply skip to the histology selection. The histology section will only offer further specialization if one choice is made. Once a specific selection is made (and a selection does not have to be this specific), the resulting page will show which genes have been analyzed through the selected phenotype and which were mutated in any of the samples. The five genes with the highest mutation frequency will be described in more detail, both graphically and in a table (Fig. 10.11.6). The ordering of these five genes is a statistical evaluation of their impact in cancer, a combination of the mutation frequency and the number of samples examined. Clicking on a gene name in the table or a gene's bar in the chart produces the histogram page (as described above). However, the graphic and tables now reflect only the mutations found in the phenotype specified (Fig. 10.11.7), so the numbers are much reduced in both.
To view the full details of the gene again (i.e., remove the specialized phenotype), click on the Switch View button above the histogram graphic. While navigating specialized phenotypes, the current tissue/histology selection is shown in the sidebar on the left-hand side; clicking on these links allows you to respecialize.

Examining a sample in more detail
11. Click the browser's "back" button to return to the Mutation Summary page (Fig. 10.11.8), then click on a sample name from the long list on the Mutation Summary page (e.g., "1040576," the first pancreas sample part way down this page). The Sample Summary page will be shown (Fig. 10.11.10), detailing all available information about the sample.

Further navigation in the histogram: Exporting
13. The paper's list of genes is separated into alphabetical brackets (not shown). For the Bergmann paper, click on the J-L tab, then click KRAS, and then click the Histogram button to return to the histogram as seen before. Click on the scale bar just above 100 to zoom in on this position, then click the Export button.

Examining gene fusions
14. Press the browser's back button to return to the KRAS histogram (Fig. 10.11.7).

Examining a mutation in more detail
In the navigation box under the histogram, a pull-down menu offers a similar view of all the genes in COSMIC. Select TMPRSS2 and press the Display button. Click ETV1 in the information box under the TMPRSS2 histogram (Fig. 10.11.11A), and the Fusion Summary page will be displayed (Fig. 10.11.11B), detailing the different fused structures observed. Click mutation "115" in the Inferred Breakpoints table to retrieve details for this mutation.

In this case, TMPRSS2 has no classic small mutations, but an information box indicates that COSMIC has data on fusion events involving this gene, mostly with ERG, but also with ETV1 (Fig. 10.11.11A). These are specified in the Mutations table (press the red Mutations button), which also links to the fusion variant of the Mutation Summary page. However, the first step in this example is to view the summary of a fused gene pair. Figure 10.11.11B shows the Fusion Summary page for the selected gene pair. In order to accurately describe the published data while ensuring navigability, the fusion data are described in two ways, Inferred Breakpoints and Observed mRNAs. This is because many papers use expression technologies such as RT-PCR to determine fusions between genes. A number of these studies identify more than one transcript per sample, some finding more than four different products between the same gene pair in one tumor. This implies significant alternative splicing of the mRNAs expressed from the fused gene pair. In order to simplify these data for display and navigation, the position of the genomic breakpoint has been inferred from the experimental data while maintaining the original results. To do this, it has been assumed that each sample's breakpoint lies between the most 3′ expressed exon of the 5′ gene partner and the most 5′ exon of the 3′ gene partner, from the mRNAs reported in that sample. For instance, in sample "MET26-LN," two TMPRSS2/ETV1 fusion mRNAs were identified, both containing the downstream sequence from exon 4 of ETV1; one fusion (ID 14) contained only the first exon of TMPRSS2, while the other (ID 15) contained the first two exons. Since both were observed in the same sample, the default assumption is that these are splice variants of a fusion between somewhere downstream of TMPRSS2 exon 2 and somewhere upstream of ETV1 exon 4, and the inferred breakpoint for this sample (ID 115) reflects this.
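The breakpoint inference described above can be sketched in a few lines of Python. This is a minimal illustration of the stated assumption, not COSMIC's own implementation: the FusionTranscript type, its field names, and the infer_breakpoint function are hypothetical, and each observed mRNA is reduced to the two exon numbers at its junction.

```python
from typing import List, NamedTuple, Tuple


class FusionTranscript(NamedTuple):
    """One observed fusion mRNA, reduced to the two exons at the junction."""
    last_exon_5p_partner: int   # last exon retained from the 5' gene partner
    first_exon_3p_partner: int  # first exon retained from the 3' gene partner


def infer_breakpoint(transcripts: List[FusionTranscript]) -> Tuple[int, int]:
    """Return the pair of exons that bracket the inferred genomic breakpoint.

    The breakpoint is assumed to lie downstream of the most 3' exon seen from
    the 5' partner and upstream of the most 5' exon seen from the 3' partner.
    """
    if not transcripts:
        raise ValueError("at least one observed mRNA is required")
    most_3p_exon_of_5p_gene = max(t.last_exon_5p_partner for t in transcripts)
    most_5p_exon_of_3p_gene = min(t.first_exon_3p_partner for t in transcripts)
    return most_3p_exon_of_5p_gene, most_5p_exon_of_3p_gene


# Sample "MET26-LN" from the text: two TMPRSS2/ETV1 mRNAs, both joining ETV1
# at exon 4, one carrying TMPRSS2 exon 1 and the other exons 1 and 2.
met26_ln = [FusionTranscript(1, 4), FusionTranscript(2, 4)]
print(infer_breakpoint(met26_ln))  # (2, 4)
```

The result (2, 4) corresponds to a breakpoint downstream of TMPRSS2 exon 2 and upstream of ETV1 exon 4, as described for inferred breakpoint ID 115.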
The Mutation Details page (not shown, but accessed by clicking "115" from the Fusion Summary page in Fig. 10.11.11B) shows the mutation ID, whether it is an inferred or observed fusion mRNA, and the HGVS-compliant syntax describing the exact details of the mutation, followed by two graphical representations of the mutation. The first graphic describes the mutant mRNA in relation to its wild-type parent genes, the second shows the related mutations. Lastly, a listing of samples containing this fusion mutation is presented.

The Sample Summary page also describes fusion mutations if present, replacing the standard mutations table with a tabulated inferred breakpoint and a graphical list of observed mRNAs; this can be seen by clicking on "MET26-LN" in the Mutation Details page.

Core COSMIC workflow
15. Before beginning a new search, it is useful to review the main pathways through the data in COSMIC. The workflow and interrelationships of the core pages are summarized in Figure 10.11.12.

ACCESSING SPREADSHEETS VIA FTP
The data in the COSMIC database can be examined in several ways. The Web site is the easiest method of visualizing the data, but other offline methods are occasionally useful. All the data in COSMIC are exported onto its FTP site automatically every release, thus ensuring its contents are always up to date. The FTP site is described at the bottom of the "Additional information" link on the COSMIC home page, and can be found at ftp://ftp.sanger.ac.uk/pub/CGP/cosmic.

Necessary Resources
Hardware
Any Internet-connected computer.

Software
FTP client software (including Web browsers), spreadsheet software.

Files
Input files are downloaded from the COSMIC FTP site.
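For readers who prefer scripting to a graphical FTP client, the download can be sketched with Python's standard ftplib module. This is a hedged illustration: the FTP address is the one given in this protocol, but anonymous login, the function names, and any file name passed to download_file are assumptions, since the names of the exported files change with each release.

```python
from ftplib import FTP
from urllib.parse import urlparse

COSMIC_FTP_URL = "ftp://ftp.sanger.ac.uk/pub/CGP/cosmic"  # address given in the text


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError("expected an ftp:// URL, got %r" % url)
    return parts.hostname, parts.path or "/"


def list_remote_directory(url, timeout=30):
    """Return the entry names in the remote directory."""
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # anonymous access is an assumption
        ftp.cwd(directory)
        return ftp.nlst()


def download_file(url, remote_name, local_path, timeout=30):
    """Download one file from the remote directory in binary mode."""
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # anonymous access is an assumption
        ftp.cwd(directory)
        with open(local_path, "wb") as handle:
            ftp.retrbinary("RETR " + remote_name, handle.write)
```

Calling list_remote_directory(COSMIC_FTP_URL) requires network access and returns whatever the server offers for the current release; a name from that listing can then be passed to download_file and the result opened in spreadsheet software.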

EXPLORING THE ORACLE DATABASE DIRECTLY
As well as exporting the data from COSMIC in spreadsheets, the whole COSMIC database is exported and made available at ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/ oracle export. This file contains the entire database that is used to drive the Web site and export utility. However, this requires significantly more IT infrastructure, and it is recommended that an IT specialist be employed to do this.

Necessary Resources
Hardware

Files
Input files are downloaded from the COSMIC FTP site ("oracle export" section).

1. Download the Oracle export file from the FTP site, decompress it, and install it into the Oracle server using the Import utility ("imp"). Once successful, the database can be investigated using the SQL language with reference to the structure of the schema, available diagrammatically in PDF format in the oracle exports folder.
Although it usually requires specialized informatics personnel, this is the most powerful method of analysis, as all data are individually available and custom-combinable, both within COSMIC and, potentially, between this and other databases. This method will only be required for complex data mining. Export files are available for each release since v14 (January, 2006), for Oracle versions 9 and 10g only. They are all compressed using "gzip" to ensure small download size (recent releases, e.g., v33, are ∼50 Mb).
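The decompression step can be carried out with the gzip command-line tool or, as sketched below, with Python's standard library. The function name decompress_gzip is hypothetical, and no particular export file name is assumed; the subsequent import with "imp" is outside the scope of this sketch.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(source, destination=None):
    """Decompress a .gz file and return the path of the decompressed copy."""
    source = Path(source)
    if destination is None:
        if source.suffix != ".gz":
            raise ValueError("cannot derive an output name from %s" % source)
        destination = source.with_suffix("")  # drop the trailing ".gz"
    destination = Path(destination)
    # Stream in chunks so a large export never has to fit in memory.
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return destination
```

The compressed file is left in place, so the download does not need to be repeated if the import has to be rerun.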

Background Information
COSMIC grew out of the need to combine cancer somatic data, which was available from many sources, but mostly distributed throughout the scientific literature. The literature is not searchable by any aggregate or automated methods, and online resources usually comprise single-locus databases, which though sometimes extensive (IARC p53 database, Petitjean et al., 2007) do not provide phenotype-specific genetic overviews. Larger online resources (e.g., OMIM, Hamosh et al., 2002; HGMD, Cooper et al., 2005) store minimal information, usually only on high-frequency mutant alleles, thus losing much context detail. COSMIC solves most of these drawbacks by extracting as much detail as possible from targeted literature and combining this with CGP's own data as it is confirmed in the laboratory. Large datasets are therefore made available for deep data mining, while maintaining sample sizes that can still achieve good statistical significance.
The COSMIC project can be subdivided into two distinct portions. Just over half its contents are derived from complete and up-to-date manual curation of the scientific literature for nominated genes from the Cancer Gene Census (Futreal et al., 2004). Most of these genes promote tumorigenesis via simple point mutations, but curation has begun more recently of oncogenic gene fusion events, usually occurring via chromosomal rearrangements. The other half of COSMIC's contents are confirmed somatic mutation data derived exclusively from the tumor resequencing project (Cancer Genome Project or CGP) at the Sanger Institute. The (largely prepublication) CGP data are further subdivided between two studies, the "cancer cell line project," which aims to resequence all point-mutated cancer genes through a large series of almost 800 common cancer cell lines, and the "CGP resequencing project," which has the ultimate aim of sequencing tumors of specific types through a selection of over 4000 candidate genes.
Approximately half of the tumor analysis experiments described in COSMIC are derived from literature curation, the other half from CGP laboratories. Nearly 50000 mutations have been curated, with ∼90% deriving from the literature curation study, which focuses on genes known to be mutated in cancer. Much lower mutation rates derive from the CGP, which is hunting for novel oncogenes. Four color-coded Web sites allow the investigation of the data, independently or in combination (Fig. 10.11.13).

Figure 10.11.13 The COSMIC Web site overviews the data from three distinct subprojects. The gold pages describe the data derived from curation of the scientific literature, the red pages display results from the CGP resequencing project, and the green pages detail results of the cancer cell line project. The gold page is simply descriptive, while the green and red pages front Web sites that allow full independent navigation of the subproject's data. For color version of this figure see http://www.currentprotocols.com.

With an emphasis on data quality, all the information contained in all the papers published for nominated genes has been curated manually. Occasionally, the published data are inconsistent or incomplete, and this may cause a paper to fail to be included in COSMIC. As of October 2007, COSMIC version 33 contained 55 fully curated genes, for which over 5000 individual papers were manually scrutinized. For the CGP data, the highest data quality is maintained by only releasing mutation data once the mutation is confirmed somatic by resequencing several times ("oversequencing") with a matched normal sample when available (not for many cell lines) from the same individual.
Once curated, the information is configured to international standards before publication, most significantly for mutation and tumor classification data. Each gene in COSMIC has a single reference cDNA sequence, to which all mutations are localized; if the literature describes an alternative transcript, the mutations' coordinates are recalculated to COSMIC's equivalent before being additionally localized to the human genome, currently golden path NCBI36. Each mutation is given a numeric ID and two short descriptions. These descriptions use the HGVS recommendations on mutation nomenclature to generate concise yet precise definitions of the sequence change observed at both the nucleotide level ("c." prefix) and amino acid level ("p." prefix). Numerous international standards exist for the classification of tumor phenotypes, many of which focus on specialized phenotypes. To allow COSMIC to encompass as broad a range of cancer phenotypes as possible while maintaining highly detailed phenotype information, a new tumor classification nomenclature has been defined. For each sample curated, the published description is retained, but also translated into COSMIC's classification system, which is used for Web site navigation.
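The link between the nucleotide-level and amino acid-level descriptions is simple arithmetic when, as in COSMIC, the reference cDNA covers the coding sequence only. The short Python sketch below shows the calculation; the function name cds_position_to_codon is hypothetical, and the sketch assumes position 1 is the first base of the start codon.

```python
def cds_position_to_codon(cds_position):
    """Map a 1-based coding-sequence position to (codon number, base within codon)."""
    if cds_position < 1:
        raise ValueError("CDS positions start at 1")
    codon_number = (cds_position - 1) // 3 + 1
    base_in_codon = (cds_position - 1) % 3 + 1
    return codon_number, base_in_codon


# The KRAS change c.35G>A falls in codon 12, matching its amino acid
# level description p.G12D.
print(cds_position_to_codon(35))  # (12, 2)
```

The same arithmetic explains why coordinates must be recalculated when a paper reports a mutation against a different transcript: a shifted start position changes every downstream codon number.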
Literature data are considered releasable only once the entire paper has been fully curated; the paper itself is a coherent and finite unit of curation. The CGP equivalent to a paper is a "study," a group of related samples examined through a group of related genes. This means it is possible to combine papers or studies for subsequent meta-analyses of the data. These studies are usually ongoing, and each is at a different stage of completeness. Mutation data are released into COSMIC once confirmed. The absence of any mutations at a particular point does not indicate its sequence to be wild-type, but could simply mean it has yet to be fully examined.

Critical Parameters
COSMIC contains as much detail as can be extracted from each study curated; however, there are four key, "minimum-information-set" parameters. These key elements, while fairly obvious, will benefit from further clarification.

Sample
Every sample must have a name. Sometimes, however, the name is not published (the sample is merely one of a stated count), or it has an overly simple name such as "1" (32 entries) or a common name such as the cell line "PC-3" (36 entries). If a published sample has no name, it is given an anonymous reference, usually the database ID value (older data have an E or S prefix). For samples with simple or common names, the original name is kept, but given a new database entry; a pre-existing entry with the same sample name is never reused unless there is genotypic evidence that it is identical. In the case of PC-3, 36 entries are maintained: most are cell lines, but some are primary tumors, some are prostate cancers, some are lung, and the spread of analyzed genes varies widely. If a number of samples are stored with the same name, a link at the top of the Sample Summary page can be used to browse the full list.
A sample is an instance of a portion of a tumor being examined for mutations. Potentially, a number of samples can be taken from a single tumor, and a number of tumors can be obtained from one individual, and each of these samples can vary slightly in their mutation spectra. While reports of such an extensive analysis are rare, COSMIC does contain such analyses and does retain the aggregations between sample/tumor/individual, although it has yet to be represented on the Web site. Primary tumors have been identified from many sources other than (the most usual) surgery and autopsy, including blood, stool, and urine. The exact source of the sample is recorded, since different interpretations may be placed on the analysis of a primary tumor than a cell line, the latter of which would be expected to have a higher number of sequence variants. Further features attributable to a sample, tumor, or individual (including cell lines) are often published, such as a tumor's karyotype, an individual's ethnicity, or the exact derivation of multiple samples from a single tumor. These are held and displayed (on the Sample Summary page) as Features, nonstandard extra details that do not fit within the usual data expected by COSMIC.

Tumor classifications
As described above, COSMIC uses a new standard of tumor classification, designed to encompass as much detail in as broad a range of phenotypes as possible. Sometimes the change from the published classification is minor or expected, such as "Bone: Femur; Osteosarcoma: Microcellular" becoming "Bone: Femur; Osteosarcoma: Small cell." On other occasions, a substantial change may lead to difficulties finding the right data; for instance, a sample with the published phenotype "Brain: Cerebellum; Haemangioblastoma" will be translated in COSMIC to "Soft Tissue: Blood Vessel: Brain; Haemangioblastoma," since the tumor originates in blood vessel, not brain-specific tissue. Of course, simply using "Haemangioblastoma" as a search term from the front page is acceptable and negates the need to know how to navigate to it. A spreadsheet of the relationships between published tumor classifications and their COSMIC counterparts is available from the Additional Information link on the COSMIC home page.

Genes
A gene in COSMIC refers to a single representative transcript for a given gene; splice variants are not available. The transcript accession number, usually an "NM_"-prefixed reference sequence, is versioned and does not necessarily refer to the latest version. Similarly, gene names are not necessarily the latest HUGO-approved identifiers, but should at least always exist in HUGO's list of synonyms for that gene name. Gene names, synonyms, and transcript accession numbers are all acceptable terms for searching COSMIC. Additionally, the cDNA sequence in COSMIC refers to the coding domains (CDS) only. Untranslated regions (UTRs) are only used in COSMIC where they have an impact in gene fusions, which may fuse a portion of the donor gene to the upstream region of the acceptor gene, altering its splicing or frame.

Mutations
As discussed, two types of mutation are represented in COSMIC: simple/small sequence changes (point mutations or small insertions, deletions, replacements) and complex genome rearrangements resulting in the fusion of two or more genes. These are treated slightly differently on the Web site. Simple mutations are drawn in the Histogram graphic, probably COSMIC's core page. Point mutations, mostly missense changes, are by far the most common variant type in COSMIC and are easy to locate on the gene sequence, forming the vertical axis of the graph. The other simple mutation types are represented singly underneath the graph, as the exact change more rarely coincides (the notable exception to this is EGFR, which generates a significantly vertically extended graphic). All the mutations in the graphic are also displayed in the Mutations table, divided into their different types, along with any fusion mutations. The latter cannot be drawn in the histogram graphic, which is focused on a single gene sequence. Each mutation (of every type) also has a details page, linked from the Histogram page. Fusions additionally have a summary page for each fused gene pair. Mutation counts and frequencies, as well as spectra, are detailed in the Histogram page for simple mutations, and in the fusion gene pair summary page for fusions. Mutation counts are presented throughout the Web site, as this is possibly the most important information the system provides. In all these cases, the Mutations count refers to all the mutations seen in all the samples for the selection chosen, such that if one sample has two mutations in one gene, it is counted twice. This value does not reflect the number of mutant samples or unique sequence changes (totals of these are usually in the release news item).
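The difference between the Mutations count, the number of mutant samples, and the number of unique sequence changes is easiest to see on a small worked example. The Python sketch below uses made-up observations (sample names S1 and S2 are invented, not COSMIC records), and it assumes that mutation frequency means mutant samples divided by samples examined, which the text does not define explicitly.

```python
from collections import namedtuple

# One row per mutation observed in a sample for a single gene.
Observation = namedtuple("Observation", ["sample", "mutation"])


def summarize(observations, samples_examined):
    """Return the three distinct counts derivable from one selection."""
    mutation_count = len(observations)                       # every observation counts
    mutant_samples = len({o.sample for o in observations})   # each sample counted once
    unique_changes = len({o.mutation for o in observations}) # each sequence change once
    # Assumed definition: fraction of examined samples carrying any mutation.
    frequency = mutant_samples / samples_examined if samples_examined else 0.0
    return {
        "mutation_count": mutation_count,
        "mutant_samples": mutant_samples,
        "unique_changes": unique_changes,
        "mutation_frequency": frequency,
    }


# Invented toy data: sample S1 carries two mutations in the same gene.
toy = [
    Observation("S1", "c.35G>A"),
    Observation("S1", "c.34G>T"),
    Observation("S2", "c.35G>A"),
]
print(summarize(toy, samples_examined=4))
```

Here the Mutations count is 3, although only two samples are mutant and only two distinct sequence changes were seen, which is why the count should not be read as either of the other two values.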

Anticipated Results
The results of a query depend entirely on the selection chosen during Web site navigation. Some examples have been selected to demonstrate the ease with which significant examinations of tumor mutability can be obtained.

Type of oncogene
The literature-curated genes have large amounts of mutation data in their histograms, and these can be interpreted immediately by the spread of mutations across the x axis and the numbers of types involved. Figure 10.11.14 presents examples of a clear gain-of-function (KRAS, Fig. 10.11.14A) and loss-of-function (PTEN, Fig. 10.11.14B) mutation spectrum. KRAS is a transcriptional activator, signaling to the MAP/ERK pathway to promote cellular growth (among other responses) upon binding to a GTP molecule. This signaling is inactivated by hydrolyzing the GTP to GDP, requiring a GAP helper molecule, and this interaction with GAP is where p.G12 is key. If p.G12 is mutated, GAP cannot deactivate KRAS signaling, leading to a huge overactivation of downstream elements promoting growth (Scheffzek et al., 1997) ... of KRAS mutations to be missense changes at p.G12 (Fig. 10.11.14A). Conversely, PTEN is a tumor suppressor gene, negatively regulating cell cycle progression at G1 via the PI3K signaling pathway (Mutter, 2001). Any sequence change that reduces PTEN's effectiveness has a resulting upregulating change on cell cycle activity, potentially promoting tumor formation. Again, this is reflected in COSMIC's histogram for the gene, which shows a wide spread of mutations of all types across its coding domain (Fig. 10.11.14B).
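The visual contrast between a clustered and a dispersed spectrum can also be expressed as a single number computed from exported data: the share of all mutations that fall at the most frequently mutated position. The Python sketch below is an illustrative heuristic only, not a method used by COSMIC; the function name peak_share is hypothetical and the two position lists are invented toy data.

```python
from collections import Counter


def peak_share(positions):
    """Return (most frequent position, fraction of mutations at that position)."""
    if not positions:
        raise ValueError("no mutations supplied")
    counts = Counter(positions)
    position, count = counts.most_common(1)[0]
    return position, count / len(positions)


# Invented toy spectra, not COSMIC data: one clustered, one spread out.
clustered = [12, 12, 12, 12, 12, 12, 13, 61]
dispersed = [17, 130, 130, 173, 233, 267, 319, 335]
print(peak_share(clustered))  # (12, 0.75)
print(peak_share(dispersed))  # (130, 0.25)
```

A high share at one position is what a KRAS-like histogram looks like, while a low share spread across many positions resembles the PTEN-like pattern; the mutation types involved still need to be examined before drawing any conclusion.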