Use of multivariate analysis to suggest a new molecular classification of colorectal cancer

Abstract Molecular classification of colorectal cancer (CRC) is currently based on microsatellite instability (MSI), KRAS or BRAF mutation and, occasionally, chromosomal instability (CIN). Whilst useful, these categories may not fully represent the underlying molecular subgroups. We screened 906 stage II/III CRCs from the VICTOR clinical trial for somatic mutations. Multivariate analyses (logistic regression, clustering, Bayesian networks) identified the primary molecular associations. Positive associations occurred between: CIN and TP53 mutation; MSI and BRAF mutation; and KRAS and PIK3CA mutations. Negative associations occurred between: MSI and CIN; MSI and NRAS mutation; and KRAS mutation, and each of NRAS, TP53 and BRAF mutations. Some complex relationships were elucidated: KRAS and TP53 mutations had both a direct negative association and a weaker, confounding, positive association via TP53–CIN–MSI–BRAF–KRAS. Our results suggested a new molecular classification of CRCs: (1) MSI+ and/or BRAF-mutant; (2) CIN+ and/or TP53– mutant, with wild-type KRAS and PIK3CA; (3) KRAS- and/or PIK3CA-mutant, CIN+, TP53-wild-type; (4) KRAS– and/or PIK3CA-mutant, CIN–, TP53-wild-type; (5) NRAS-mutant; (6) no mutations; (7) others. As expected, group 1 cancers were mostly proximal and poorly differentiated, usually occurring in women. Unexpectedly, two different types of CIN+ CRC were found: group 2 cancers were usually distal and occurred in men, whereas group 3 showed neither of these associations but were of higher stage. CIN+ cancers have conventionally been associated with all three of these variables, because they have been tested en masse. Our classification also showed potentially improved prognostic capabilities, with group 3, and possibly group 1, independently predicting disease-free survival. Copyright © 2012 Pathological Society of Great Britain and Ireland. Published by John Wiley & Sons, Ltd.


Introduction
Two main molecular types of colorectal carcinoma (CRC) have been described, based on the 'molecular phenotypes' of chromosomal instability (CIN) and microsatellite instability (MSI or MIN). CIN is the more common and is generally detected by the presence of an abnormal chromosome complement or number (aneuploidy or polyploidy) [1]. MSI is the result of mismatch repair deficiency [2], resulting in an increased mutation rate that is principally manifest as insertions and deletions in repetitive sequences. In terms of the somatic genetic pathways followed by MSI + and CIN + CRCs, there seems to be considerable functional overlap, but the specific mutations tend to differ: for example, MSI + tumours tend to acquire mutations in AXIN1 , BRAF and BAX , whereas CIN + tumours have mutations in APC , KRAS and TP53 [3]. In addition, CIN + and MSI + tumours are associated with different clinico-pathological features: the former 442 E Domingo et al tend to be well/moderately differentiated and distal, and the latter poorly differentiated, proximal and more frequent in women [4]. MSI is also an established marker of good prognosis. However, there remains considerable heterogeneity within the MSI + or CIN + groups. A third molecular phenotype, known as the CpG island methylator phenotype (CIMP), has also been described [5]. CIMP is characterized by a high degree of ageindependent methylation in gene promoters and tends to overlap with MSI, in part because promoter methylation of the mismatch repair gene MLH1 is the most usual alteration leading to MSI [6].
The very concept of somatic genetic pathways implies that some mutations are co-selected, presumably as a result of variation in the cancer cell's microenvironment, one component of which is the preexisting mutations in that cell. However, whilst some consistent pairwise associations between mutations in CRC have been identified, it is far from clear which mutations are co-selected and which are secondarily associated via other genetic changes. We hypothesized that some of the considerable residual heterogeneity in the behaviour of CRCs could be explained by refining the established genetic pathways of tumorigenesis and by identifying new ones. However, most previous studies, including our own, have used insufficient samples and/or analysed too few genes for this objective to be achieved [7][8][9][10][11][12]. In this study, we analysed over 900 CRCs from the VICTOR clinical trial of stage II/III colorectal cancer. We profiled 11 somatic genetic alterations and performed multivariate analysis using regression, clustering and Bayesian network approaches. This strategy allowed us to better characterize the existing pathways of colorectal tumorigenesis, to find additional, less common pathways and to propose primary molecular determinants of tumour behaviour.

Materials and methods
The VICTOR randomized trial of rofecoxib or placebo post-primary treatment recruited a total of 2434 stage II/III CRC patients between 2002 and 2004 [13]. Formalin-fixed, paraffin-embedded blocks were available for 965 of these patients. Haematoxylin and eosin (H&E)-stained sections were reviewed, and normal tissue and colorectal carcinoma within each section were identified. Samples from 59 patients were discarded because of lack of tumour, leaving 906 cancers (

Baseline analysis of molecular associations
The set of 906 stage CRCs (Table 1) was analysed for CIN, MSI and almost all of the most common somatic mutations in colorectal cancer (KRAS , NRAS , BRAF , PIK3CA, TP53 and FBXW7 /CDC4 ). LOH analysis was performed in 795 tumours from which constitutional DNA was available, targeting chromosomes 5q near APC , 17p near TP53 and 18q near SMAD4 . A summary of the molecular findings is shown in Table 2. Overall, the frequencies of molecular alterations, the mutation spectra (see Supplementary material, Figure  S1) and the pairwise associations (see Supplementary material, Tables S1, S2), were in good agreement with those previously established in the literature   [4,6,12,[14][15][16][17][18][19], and we shall not consider them further here. A new finding was that FBXW7/CDC4 mutations were not associated with any other mutation or clinicopathological variable. We also examined stage-specific associations, the only significant results being that stage III cancers tended to be CIN + . 180 (21% in total) CRCs were 'double-negative' (MSI -CIN -). In general, the double-negative tumours resembled MSI -CIN + cancers more than MSI + CINcancers, and we therefore compared MSI -CINwith MSI -CIN + tumours. The MSI -CINcancers presented at an earlier stage (102/180 versus 244/557 stage II, respectively; p = 0.003, q = 0.01) had lower frequencies of TP53 mutation and 17p LOH, and showed a borderline association with KRAS mutation (see Supplementary material, Tables S2, S3). Logistic regression-based multivariate analysis showed that the MSI -CINtumours remained associated only with lower stage and lack of TP53 mutation (see Supplementary material, Table S3).
Twenty-three cancers (3%) were 'double-positive' (MSI + CIN + ). Compared with all other cancers, the double-positive tumours tended to be right-sided (14/22 versus 278/804; p = 0.007, q = 0.03). No molecular alteration, including TP53 mutation and 17p LOH, was significantly associated with this small group of tumours, although they had a relatively high frequency of BRAF mutation (see Supplementary material, Table  S2). Overall, double-positive cancers appeared to resemble MSI + CINcancers most closely.

Searching for the primary intermolecular associations
In order to define new genetic pathways of colorectal carcinogenesis, we sought evidence for the primary drivers of the associations between somatic mutations (KRAS , NRAS , BRAF , PIK3CA, TP53 , FBXW7 ), CIN, MSI and the three sites of LOH (see Supplementary  material, Tables S2, S3). Approximately half of the associations found by pairwise analysis were no longer significant, suggesting that they were secondary to a primary association.
Since logistic regression analysis to identify the primary determinants of associations can be sensitive to small, chance associations or missing data when multiple highly correlated events occur, we additionally performed a Bayesian network analysis to detect primary associations among selected molecular variables (KRAS , BRAF , NRAS , PIK3CA, TP53 , MSI, CIN) that had been successfully typed in the full dataset. FBXW7 was excluded owing to its lack of any association, and the LOH events were omitted owing to their strong associations with each other, CIN and TP53 mutation. The network analysis took the form of a probabilistic model that represents the conditional dependencies between random variables via a directed acyclic graph. We found that the primary positive associations were between: (a) CIN and TP53 ; (b) MSI and BRAF ; and (c) KRAS and PIK3CA. The primary negative associations were between MSI and both CIN and NRAS , and between KRAS and each of BRAF , NRAS and TP53 . These associations are shown in a simple, graphical form in Figure 1A. In almost all of these cases, the network analysis found the same primary associations as the logistic regression analysis ( Figure 1A; see also Supplementary material, Table  S3). Exceptions were the associations between PIK3CA mutation and MSI and between PIK3CA mutation and TP53 that were only present in the logistic regression analysis ( Figure 1A; see also Supplementary material, Table S3).
The network analysis additionally detected associations that were formally absent from the logistic regression, because those variables were dropped from the latter owing to excessive co-variation. A case in point was the primary negative associations between NRAS mutation and both KRAS mutation and MSI. The network analysis also showed the underlying reasons why some associations that were significant in the pairwise analysis were no longer significant in the multivariate logistic regression analysis. Examples included the negative associations between CIN and BRAF (which was secondary to associations with MSI) and between MSI and TP53 (which was secondary to associations with CIN) ( Figure 1A).
Of particular interest in the network analysis was the detection of association loops ( Figure 1A). The loops between MSI and NRAS suggested two independent negative associations, one direct and the other indirect via CIN, TP53 and KRAS . Most intriguing was the loop between MSI and KRAS . Here, there existed not only a direct negative association between TP53 and KRAS mutations, but also a weaker, indirect positive association via CIN, MSI, KRAS and BRAF . The indirect positive association found by multivariate analysis was consistent with data in the literature from pairwise association testing, but the direct negative association was less well supported. We therefore performed a meta-analysis of other published studies [7,9,17,[20][21][22][23][24][25] and this confirmed an overall pairwise negative KRAS-TP53 mutation association (see Supplementary material, Table S4). The logistic regression and network analyses had provided strong evidence to show which of the reported inter-molecular associations in CRC were primary and which were secondary (indirect). We undertook one further analysis to support our findings by performing unsupervised hierarchical clustering of the same alterations used in the network analysis. The cluster analysis fully supported the other two methods, indicating the four basic groups, as: KRAS and/or PIK3CA; CIN and/or TP53 ; MSI and/or BRAF ; and NRAS ( Figure 1B, vertical axis). Clustering of the tumours ( Figure 1B, horizontal axis) showed them to be separated into two main groups, where the main discriminating molecular variable was KRAS mutation; all the mutants were present in the second cluster and none of the first cluster was mutant. Most NRAS and PIK3CA mutants were in the first and second clusters, respectively. CIN and TP53 mutations were found in both clusters but were over-represented in the first one. MSI and BRAF mutations were present as subclusters within both of the two main clusters.

Identifying groups of colorectal cancers based on shared molecular changes
We sought to identify groups of CRCs that would form the basis of our proposed molecular classification. The data suggested that this classification should be based not only on MSI, but also on NRAS mutation, on the negative association between TP53 and KRAS mutations, and on CIN (Figure 1; see also Supplementary material, Table S3). Based on the primary positive and negative intermolecular associations described in the previous section, we chose CRC groups characterized by: 1. MSI and/or BRAF mutation. 2. CIN and/or TP53 mutation, with wild-type KRAS and PIK3CA. 3. KRAS and/or PIK3CA mutation with CIN, but without TP53 mutation 4. KRAS and/or PIK3CA mutation without CIN or TP53 mutation. 5. NRAS mutation.
These groups encompassed over 80% of all the CRCs studied. In addition, we proposed two further groups: group 6 with no detectable mutations; and a 'miscellaneous' group 7. Rationale for the grouping is provided in Figure 2.

Association between CRC groups and clinico-pathological variables
Since it was hoped that the seven CRC groups would have clinical relevance, we used multiple logistic regression analysis to test the groups for associations with clinico-pathological variables (gender, age, tumour location, stage, grade, trial randomization arm, and treatment with chemotherapy or radiotherapy). Each group was tested in turn against all others, incorporating all variables in a reverse stepwise analysis (Table 3). Group 1 essentially included the MSI + group of tumours and, as expected, these cancers were strongly associated with proximal location, poor differentiation and female gender.
Most CIN + CRCs fell into groups 2 and 3. Although CIN is classically associated with distal location, male gender and higher stage, our two main sets of CIN + cancers (groups 2 and 3) showed distinct differences in their associations (Table 3). Group 2 cancers showed a strong tendency to be distally located and to occur in men, but had no association with stage. Group 3 cancers, on the other hand, were not associated with gender or location, but tended to be stage III tumours. The proposed grouping is based on the following process. Initially, we utilized the near-complete lack of overlap between MSI + and CIN + cancers to identify two groups. Owing to the observed strong, primary associations, we then provisionally added BRAF -mutant tumours to the MSI + group and TP53-mutant cancers to the CIN + group. We retained the MSI + and/or BRAF -mutant tumour group, irrespective of other genetic changes, since TP53, KRAS and NRAS mutations were uncommon in these cancers. We then formed a group of NRAS-mutant tumours, irrespective of their other genetic changes, since NRAS mutations were not positively associated with any other molecular variable. We next provisionally added a KRAS-and/or PIK3CA-mutant (but TP53-wild-type) group, owing to the negative association between KRAS and TP53. However, since we found no negative association between KRAS and CIN, we subdivided the KRAS-and/or PIK3CA-mutant group into CIN + and CINgroups, leaving the great majority of TP53-mutant cancers in a CIN + , KRAS-wild-type and PIK3CA-wild-type group. In total, this classification encompasses 736/906 (81%) cancers. In addition, 78 double-negative cancers had no detected changes in KRAS, PIK3CA, NRAS, TP53 or BRAF . The remaining 92 cancers had a variety of 'atypical' mutation combinations.
Group 6 cancers were predominantly found in men, for reasons that are unclear. The other three groups showed no association with the clinico-pathological variables, although we note that some groups, such as group 5 (NRAS -mutants), were small.

Association between cancer groups and disease-free survival
We tested whether our molecular groups of CRC were independent predictors of 5 year disease-free survival in the VICTOR study. All analyses were conditioned on the full set of clinico-pathological variables. We initially analysed group as a categorical variable in the regression model, and we subsequently tested each group against all other groups. Very similar results were obtained in each case. We found that group 3 (KRAS -and/or PIK3CA-mutant; MSI -; CIN + ; TP53wild-type) patients had poor survival, whereas all the other cancer groups showed no significant differences in survival (hazard ratio = 1.59, 95% CI 1.13-2.24, p = 0.008; Cox proportional hazards, group 3 versus all other groups; Figure 3). The only other independent predictor of survival was stage (hazard ratio = 1.99, p = 9.0 × 10 -6 , stage III versus stage II). Although Table 3. Associations between proposed molecular groups of colorectal cancers and clinico-pathological variables  group 1 was not an independent predictor of survival, we wondered whether this negative result was caused by some BRAF -mutant, MSIcancers having poor survival [26]. We therefore included MSI status in the survival model, since this variable has consistently been associated with better outcome in several studies. Group 3 (hazard ratio = 1.48, 95% CI 1.05-2.09, p = 0.027) and stage (hazard ratio = 1.98, p = 1.2 × 10 -5 ) remained independent predictors of poor prognosis, although MSI was a borderline significant indicator of good prognosis (hazard ratio = 0.59, p = 0.054).

Discussion
Colorectal carcinogenesis follows a multistep model in which sequential molecular alterations occur throughout tumour progression [27]. Here, we have analysed most of the common somatic mutations in a large CRC patient set with high-quality clinical trial data. We have undertaken multivariate regression, cluster analysis and Bayesian network analysis that have consistently identified the primary positive and negative associations between molecular changes, thus showing other associations to be indirect. The identification of loops in the pairwise association relationships ( Figure 1A) was of particular interest. One such loop provided the basis for explaining the previously-postulated negative association between TP53 and KRAS mutation, which we have now confirmed [7][8][9]17,[20][21][22][23][24][25]28]. One reason why the KRAS-TP53 association had not found wide acceptance is that it has seemed contrary to the well-established indirect association linking TP53 -CIN-MSI-BRAF-KRAS . Our data show that there is indeed an indirect positive association between TP53 and KRAS mutations that follows this route, but that this is outweighed by a direct negative association. One potential explanation for the negative association lies in the transcription of genes such as CDKN1A (p21) by mutant K-ras through p53-dependent and -independent mechanisms. Effects on cell cycle arrest, senescence and apoptosis might sometimes be suboptimal for tumour growth if both genes are mutated.
We confirmed that a small group of double-positive (MSI + CIN + ) CRCs exists, as does a larger group of double-negative (MSI -CIN -) cancers. The latter were generally similar to MSI -CIN + lesions, but had a much lower frequency of TP53 mutations and lower stage, although relatively high frequencies of KRAS and PIK3CA mutations. Whilst these findings support TP53 mutations in some way causing or being permissive for CIN, we found no quantitative differences in ploidy between TP53 -mutant and TP53wild-type CIN + tumours (data not shown), suggesting that an alternative (epi)mutation to TP53 inactivation may exist in the latter case.
Uniquely among the molecular changes, FBXW7/CDC4 mutations occurred randomly in CRC, irrespective of clinico-pathological or other molecular features. Very few other studies have addressed the issue of FBXW7 's position in the pathways of colorectal tumourigenesis, apart from functional studies linking its mutation to CIN [29], but we find no evidence of an association with CIN here. One possible explanation for the absence of associations with FBXW7 is that it mutates early in tumorigenesis, consistent with studies that have found FBXW7 mutations in colorectal adenomas [30,31]. Fitting (epi)mutations into genetic pathways is likely to become increasingly difficult as nextgeneration sequencing discovers more low-frequency mutations, such as FBXW7 , that drive tumourigenesis but do not obviously fit into any specific molecular pathway.
We have proposed a molecular classification of CRC into seven groups (Figure 2), based solely on primary positive and negative associations between molecular changes (MSI, CIN, TP53 and the type of Ras pathway mutation). This classification essentially retains the MSI + set of cancers as a separate group, but with the addition of BRAF -mutant MSIcancers. We remain open-minded as to whether the latter cancers are distinct from their MSI + counterparts, as some have suggested. Our classification split the CIN + cancers into two groups and established a further KRASand/or PIK3CA-mutant, CIN -, TP53 -wild-type group. Despite our classification being entirely independent of clinico-pathological variables, it showed interesting associations with clinical variables. Our data suggested that the often-reported associations between CIN and gender, tumour location and stage actually comprised two separate associations: (a) between gender, distal location and group 2 tumours (CIN + and/or TP53mutant with wild-type KRAS and PIK3CA); and (b) between stage III and group 3 tumours (KRAS -and/or PIK3CA-mutant, CIN + , TP53 -wild-type). These findings cannot readily be explained by differences in sample size and hence statistical power (Table 3). Group 3, moreover, was an independent predictor of survival, having worse prognosis than any of the other groups. Interestingly, TP53 mutation is not an established prognostic marker in CRC, despite its strong associations with stage and MSI, and our finding that CIN + TP53 -wild-type cancers have the poorest prognosis is consistent with this. Furthermore, Hutchins et al [30] had previously found poorer disease-free survival in KRAS -mutant stage II/III CRCs from the QUASAR trial, although they only performed a univariate analysis. There was also some evidence in our data that MSI might additionally be an independent survival predictor.
Overall, our results show that sufficiently large and homogeneous sample sets and methods based on multivariate and cluster analysis can allow the genetic pathways of cancer to be teased apart for a relatively well-characterized tumour such as CRC. However, this is a complex task that may become even more difficult as next-generation sequencing discovers more low-frequency mutations, such as FBXW7 , that drive tumorigenesis but do not obviously fit into any specific molecular pathway. However, such studies may provide additional insights through different analyses, one example being the identification of hypermutant, yet MSI -, CRCs in a recent large, landmark exome sequencing study [31]. Our proposed groups of CRC require replication and refinement, but our data already suggest that a finer-scale molecular classification of CRCs is both possible and desirable, and we expect that a more complex, validated classification of CRCs will emerge gradually in the next few years.

SUPPLEMENTARY MATERIAL ON THE INTERNET
The following supplementary material mat be found in the online version of this article: Figure S1. Spectrum of mutations found in KRAS , NRAS , BRAF , PIK3CA and FBXW7 (A) and TP53. Table S1. Summary of associations between molecular and clinico-pathological data. Table S2. Pairwise associations between molecular changes. Table S3. Logistic regression analysis: summary of significant results for molecular changes.