First-Line Maintenance Treatment in Metastatic Colorectal Cancer (mCRC): Quality and Clinical Benefit Overview

Different maintenance strategies (sequential CT, intermittent CT, intermittent CT and MAbs, or de-escalation to MAbs monotherapy) are used after first-line treatment. Many of the randomized clinical trials (RCTs) that evaluated these approaches suffer from inappropriate design, heterogeneous primary endpoints, inadequate sample size, and other methodological flaws. Drawing conclusions is therefore challenging, and recommendations are mostly vague. We evaluated those studies from another perspective, focusing on design quality and on a more objective and accurate measure of clinical benefit. These data allowed a clearer and more precise overview of the current state of maintenance treatment.


Introduction
The introduction of new drugs (Oxaliplatin and Irinotecan), used in doublets with 5-Fluorouracil (FU) with or without monoclonal antibodies (MAbs) in the first line, has markedly improved the survival of metastatic colorectal cancer (mCRC) patients but has increased toxicity. Different strategies have been evaluated as maintenance treatments to improve tolerance without impairing outcome.
Several randomized clinical trials (RCTs) evaluated these strategies, using different primary endpoints and different designs. Most studies were planned to demonstrate the superiority of one arm over the other. However, when the purpose of a study is to show less toxicity or better tolerability without a decrease in efficacy, a non-inferiority hypothesis is more suitable. Among them, the studies with an Overall Survival (OS) endpoint are the most important for ruling out any real impairment. Other endpoints (Time to Progression (TTP), Duration of Disease Control (DDC), or Tumor Control Disease (TCD)) are more debatable because they are not always surrogate markers of survival in mCRC. This flaw is amplified when the authors of a guideline want to establish recommendations. First, they need to evaluate the available evidence. Several systems exist, such as those of the Oxford Centre [1] or GRADE [2]. Both are based on the category of the studies (randomized controlled study, controlled study without randomization, etc.) and use vague and subjective definitions of the different categories. These systems do not take into account sample size, the existence of bias, confounding, or other important issues.
In this context, we aimed to analyze those randomized studies that evaluate the different strategies, focusing first on design quality.
Secondly, we focused on clinical benefit evaluation. For this purpose, we assessed the studies according to the ESMO-Magnitude of Clinical Benefit Scale (ESMO-MCBS). ESMO-MCBS is a validated tool developed by the European Society of Medical Oncology (ESMO, Lugano, Switzerland) to evaluate the magnitude of clinical benefit of new anticancer drugs. Its first version was published in May 2015 [3] and updated in 2017 (version 1.1) [4]. The latter version adds an evaluation form, 2c, for therapies that are not likely to be curative with a primary endpoint other than overall survival (OS) or progression-free survival (PFS), or for equivalence studies, and it was used here for studies with a non-inferiority design. However, this evaluation form is based on overall response (OR) and improvement in some symptoms or in quality of life (QoL). Studies with survival as the primary endpoint frequently do not report OR and commonly lack a quality of life evaluation. Furthermore, they do not establish the cutoff for loss of OS that should be considered clinically admissible. As a result, this score seemed unusable, and we proposed a score modification based on a hazard ratio (HR) limit of 1.15, which means we considered an increase in the hazard of death of 15% or less acceptable [5,6]. In this way, we expected to provide more detail and to reach useful and concise conclusions.

Material and Methods
Published randomized studies evaluating different strategies of maintenance treatment after first-line chemotherapy ± MAbs in advanced colorectal cancer were selected and reviewed. At the time of review, published randomized studies with MAbs had evaluated only Cetuximab, Panitumumab, and Bevacizumab. To simplify the analysis, each study was assigned to one of four categories, each corresponding to one strategy: (1) sequential chemotherapy (CT) vs. upfront CT doublets, (2) continuous vs. intermittent CT doublets, (3) continuous doublets plus MAbs vs. intermittent, and (4) continuous CT plus MAbs vs. continuous MAbs monotherapy. All the studies in each category were evaluated with emphasis on the quality of trial design and on clinical benefit according to the ESMO-MCBS score.

Quality of Trial Design (QTD)
The score used to evaluate the QTD comprises three items: (1) achieved prespecified objective, (2) no change in predefined sample size or primary endpoint, and (3) adequate control arm. The score ranged from 0 to 3, and the level of evidence based on design quality was categorized as low quality (0 to 1 point) or high quality (2 to 3 points) (Table 1).
Table 1. Items for quality trial design (QTD) analysis.

Variables to Analyze                                                     Yes   No
Change in the originally pre-planned sample size or primary end-point    0     1
Achieved pre-specified objective                                         1     0
Adequate control arm ¹                                                   1     0
¹ At the time of trial design.
Level of evidence based on the quality of the design: 0 to 1, low quality; 2 to 3, high quality.
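The three QTD items and the low/high cut-off can be written down as a small scoring routine. The following is a minimal sketch, not the authors' implementation: the class name, field names, and the example trial are hypothetical and serve only to illustrate how Table 1 maps answers to points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialDesign:
    """Answers to the three Table 1 items (field names are hypothetical)."""
    changed_sample_size_or_endpoint: bool  # "Yes" scores 0, "No" scores 1
    achieved_prespecified_objective: bool  # "Yes" scores 1, "No" scores 0
    adequate_control_arm: bool             # judged at the time of trial design


def qtd_score(trial: TrialDesign) -> int:
    """Sum the three items of Table 1; the result ranges from 0 to 3."""
    score = 0
    if not trial.changed_sample_size_or_endpoint:
        score += 1
    if trial.achieved_prespecified_objective:
        score += 1
    if trial.adequate_control_arm:
        score += 1
    return score


def qtd_level(score: int) -> str:
    """Map a QTD score to the two levels of evidence used in Table 1."""
    if not 0 <= score <= 3:
        raise ValueError("QTD score must be between 0 and 3")
    return "high quality" if score >= 2 else "low quality"


# Hypothetical trial: endpoint amended during enrollment, objective reached,
# control arm adequate at the time of design.
example = TrialDesign(
    changed_sample_size_or_endpoint=True,
    achieved_prespecified_objective=True,
    adequate_control_arm=True,
)
print(qtd_score(example), qtd_level(qtd_score(example)))
```

Note that the first item is reverse-scored: a change to the planned sample size or primary endpoint costs the point, so a trial that keeps its protocol but misses its objective can still reach the high-quality level.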

ESMO-MCBS
Every study was evaluated for clinical benefit according to the ESMO-MCBS version 1.1. In studies with a superiority trial design, the evaluation form used was 2a (Table 2) or 2b (Table 3), depending on the primary endpoint (OS or PFS, respectively). In studies with a non-inferiority trial design, the ESMO form 2c was used (Table 4).

Early stopping or crossover
• Did the study have an early stopping rule based on interim analysis of survival?
• Was the randomization terminated early based on the detection of overall survival advantage at interim analysis?
If the answer to both is "yes", then see letter "E" in the adjustment section below.

Toxicity assessment
Is the new treatment associated with a statistically significant incremental rate of:
• «Toxic» death > 2%
• Cardiovascular ischemia > 2%
• Hospitalization for «toxicity» > 10%
• Excess rate of severe congestive heart failure > 4%
• Grade 3 neurotoxicity > 10%
• Severe other irreversible or long lasting toxicity > 2%
(Incremental rate refers to the comparison versus standard therapy in the control arm)

Quality of life/Grade 3-4 toxicities assessment
• Was QoL evaluated as secondary outcome?
• Does secondary endpoint QoL show improvement?
• Are there statistically significantly less grade 3-4 toxicities impacting on daily well-being? (This does not include alopecia, myelosuppression, but rather chronic nausea, diarrhea, fatigue, etc.)

Table 3 grading columns: median PFS with standard treatment ≤ 6 months; median PFS with standard treatment > 6 months.

Adjustments
A: When OS as secondary endpoint shows improvement, it will prevail and the new scoring will be done according to form 2a.
B: Downgrade 1 level if there is one or more of the above incremental toxicities associated with the new medicine.
C: Downgrade 1 level if the medicine ONLY leads to improved PFS (mature data shows no OS advantage) and QoL assessment does not demonstrate improved QoL.
D: Upgrade 1 level if improved QoL or if less grade 3-4 toxicities that bother patients are demonstrated.
E: Upgrade 1 level if study had early crossover because of early stopping or crossover based on detection of survival advantage at interim analysis.
F: Upgrade 1 level if there is a long-term plateau in the PFS curve, and there is > 10% improvement in PFS at 1 year.
Highest magnitude of clinical benefit grade that can be achieved: grade 4.
Non-curative setting grading: grades 5 and 4 indicate a substantial magnitude of clinical benefit.
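The form 2b adjustments B to F listed above amount to a small rule set applied to a preliminary grade. The sketch below is one possible reading, not the official ESMO algorithm: it assumes the preliminary grade has already been assigned from the form (the grading thresholds are not reproduced in this text), that adjustments are cumulative, and that the result cannot fall below grade 1. All parameter names are hypothetical. Adjustment A (rescoring with form 2a when OS improves) is deliberately left out.

```python
def adjust_form_2b_grade(
    preliminary_grade: int,
    *,
    incremental_toxicity: bool = False,       # adjustment B
    pfs_only_without_qol_gain: bool = False,  # adjustment C
    qol_or_toxicity_gain: bool = False,       # adjustment D
    early_crossover: bool = False,            # adjustment E
    pfs_plateau_gain: bool = False,           # adjustment F
) -> int:
    """Apply adjustments B to F to a preliminary form 2b grade.

    Assumptions (not stated in the text): adjustments add up, and the
    final grade is kept within 1..4, where 4 is the stated maximum.
    """
    grade = preliminary_grade
    if incremental_toxicity:
        grade -= 1
    if pfs_only_without_qol_gain:
        grade -= 1
    if qol_or_toxicity_gain:
        grade += 1
    if early_crossover:
        grade += 1
    if pfs_plateau_gain:
        grade += 1
    return max(1, min(4, grade))


# Hypothetical study: preliminary grade 3, PFS gain only, no QoL benefit.
print(adjust_form_2b_grade(3, pfs_only_without_qol_gain=True))
```

Keeping the flags as keyword-only arguments makes each call read like the checklist itself, which helps when several reviewers score the same trial.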
ESMO non-inferiority (NI) modified: There are no established methods for setting non-inferiority limits. We proposed an effect retention method to set the upper limit of the HR at 1.15. This method is supported by a large systematic review of non-inferiority studies, as well as by Food and Drug Administration (FDA) evaluation requirements [6]. The two other chosen criteria, a loss of < 2.5 months in Median Survival (MS) and a 3-year OS loss of < 5%, seemed acceptable margins for clinical outcomes. This score was used for the evaluation of studies with a non-inferiority design (Table 5).
Table 5. ESMO-M non-inferiority (NI) designed studies modified.
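The three margins of the modified ESMO-NI proposal can be checked mechanically. The sketch below only reports which margins a study satisfies; it does not assign a grade, because the grading rules of Table 5 are not reproduced in this text. Function and parameter names are hypothetical, and comparing the upper confidence bound of the HR (rather than the point estimate) against 1.15 is an assumption.

```python
HR_UPPER_LIMIT = 1.15                  # acceptable increase in hazard of death: 15% or less
MAX_MEDIAN_SURVIVAL_LOSS_MONTHS = 2.5  # loss of median survival must be below this
MAX_THREE_YEAR_OS_LOSS_PCT = 5.0       # absolute 3-year OS loss must be below this


def ni_margins_met(
    hr_upper_bound: float,
    median_survival_loss_months: float,
    three_year_os_loss_pct: float,
) -> dict:
    """Return, per criterion, whether the proposed NI margin is respected.

    Assumption: hr_upper_bound is the upper confidence limit of the HR
    for death (experimental vs. control). Losses are positive when the
    experimental arm does worse than the control arm.
    """
    return {
        "hr_upper_limit": hr_upper_bound <= HR_UPPER_LIMIT,
        "median_survival_loss": median_survival_loss_months < MAX_MEDIAN_SURVIVAL_LOSS_MONTHS,
        "three_year_os_loss": three_year_os_loss_pct < MAX_THREE_YEAR_OS_LOSS_PCT,
    }


# Hypothetical study: HR upper bound 1.12, 1.0 month median loss, 6% loss at 3 years.
result = ni_margins_met(1.12, 1.0, 6.0)
print(result)
```

Returning the three checks separately, rather than a single verdict, keeps the sketch neutral about how the criteria combine into a score.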

QTD
It is worth noting that the survival results reported in the first four studies [7][8][9][10] were evaluated in the intention-to-treat population. In the FOCCUS study [7], three strategies were compared. The study was planned to demonstrate an increase in OS in the second or third arm in comparison with the first arm; a new primary endpoint of non-inferiority between the last two arms was added during enrollment and was reached. That generated a global quality score of 2. The CAIRO trial [8] and the Cunningham study [10] had similar two-arm designs, but no differences were observed in the primary endpoint, OS, which gave each a global quality score of 2. The other two studies were of weak quality. The FFCD study [9] was closed prematurely, having randomized only 410 of the 700 planned patients, and had a problematic primary endpoint. The more recent AIO-KKO study [11] adopted a non-inferiority design. Slow recruitment led to a sample size reduction, and the study failed to demonstrate non-inferiority.

ESMO-MCBS
Only one of the three studies with OS as primary endpoint can be evaluated with ESMO-MCBS evaluation form 2a, and its magnitude of clinical benefit was low. The FFCD study [9] was the only one with PFS as a primary endpoint, and was evaluated with ESMO-MCBS evaluation form 2b (PFS), with a low magnitude of clinical benefit.

Non-Inferiority Evaluation
The FOCCUS and AIO KRK0110 studies were evaluable with ESMO-MCBS evaluation form 2c.
In the FOCCUS study, the ESMO score was 0 because neither overall response nor improvement in QoL was reported. In contrast, our modified ESMO-NI score was 2. In the AIO-KKO study, the ESMO score was 0, whereas the evaluation of OS with the modified ESMO-NI gave a score of 2.

Recommendations
In patients with a good performance status (PS) treated with non-curative intent, a sequential treatment strategy can be suggested. Strong evidence supports the lack of a detrimental effect on survival in comparison with starting with doublets.

Continuous vs. Intermittent CT Treatment
Only three randomized studies have evaluated this intriguing question (Table 7).

QTD
The OPTIMOX2 study [12] achieved its prespecified primary endpoint, a longer duration of disease control (DDC) in the maintenance arm. However, the choice of this flawed endpoint, the premature closure of the study with only one third of the planned patients included, and the imbalance between arms in patients who underwent surgery, favoring the maintenance arm (15% vs. 8%), weakened the quality of the study.
The COIN study [13], with a non-superiority design, was planned to show no OS differences, with an HR threshold of 1.62. Despite the large sample reached, non-superiority could not be confirmed, nor could non-inferiority between continuous and intermittent treatment be demonstrated. The Chinese study [14] was undertaken to show an increase in PFS with Capecitabine maintenance treatment vs. control, and this was achieved. The study obtained the highest score of 3.

ESMO-MCBS
The primary endpoint of the OPTIMOX2 study [12], DDC, is not an evaluable endpoint. If the study is nevertheless evaluated with the ESMO-MCBS evaluation form 2b, the score is 3, as in the evaluation of the Chinese study, in favor of the maintenance arm. However, the real benefit is difficult to ascertain when no difference in OS is observed.

Non-Inferiority Evaluation
In the COIN study, the improvement in several QoL factors led to a grade of 3 with the ESMO-MCBS evaluation form 2c. With the modified ESMO-NI, the small numerical differences in OS supported non-inferiority of the intermittent treatment, with a high score of 3.

Recommendations
Considering these results, intermittent treatment is highly recommended.

Continuous Doublets plus MAbs vs. Intermittent
Five studies evaluated planned de-escalation as a treatment strategy for patients without progression, after induction treatment with chemotherapy plus MAbs (Table 8).

QTD
The AIO 0207 [15] and CAIRO3 [16] trials achieved their prespecified objectives and shared the highest design quality score (3). Conversely, SAKK 4106 [17] and PRODIGE 9 [18] did not achieve their prespecified objectives. COIN-B [19] had to change its originally preplanned inclusion criteria when KRAS mutations were identified as predictors of resistance to EGFR MAbs, and it had the lowest score.

ESMO-MCBS
The CAIRO3 trial [16] achieved its goal and, when evaluated with the ESMO-MCBS evaluation form 2b, had a score of 3. However, PRODIGE 9 and COIN-B could not be evaluated with the ESMO-MCBS evaluation form 2b. In the former, the chosen TCD endpoint did not meet the ESMO evaluation criteria. COIN-B, in turn, was designed as an exploratory study to complement the COIN trial, so no conclusions could be drawn about the role of Cetuximab maintenance.

Non-Inferiority Evaluation
AIO 0207 and SAKK 4106 had non-inferiority designs, but neither could be evaluated with the ESMO-MCBS evaluation form 2c. In the first study, no significant differences between arms were found in mean general health status, QoL score, or toxicity. In the second, QoL and toxicity were not evaluated. However, with the modified ESMO-NI the scores were 2 and 3, respectively.

Recommendations
If a maintenance treatment is considered following first-line treatment with FOLFOX-Bevacizumab, a Bevacizumab-Fluoropyrimidine combination is recommended.

QTD
All the studies shared a high-quality design with the exception of MACBETH, which had the lowest score (1). Not only were the inclusion/exclusion criteria modified during the study to exclude RAS- and BRAF-mutated tumors, but the endpoint was also not achieved. The study did not have enough statistical power to detect differences between the two arms. It is worth noting that none of the studies showed a favorable correlation between PFS and OS HRs.

ESMO MCBS
Of the five studies evaluating this approach, only the VALENTINO2 study could be evaluated with the ESMO-MCBS evaluation form 2b. In the SAPPHIRE study [24], only non-definitive screening comparisons were undertaken and, like the MACBETH study, it was not evaluable.

Non-Inferiority Evaluation
The MACRO 2 study [21] showed that PFS with Cetuximab maintenance was non-inferior to continuous treatment. However, the lack of improvement in tolerability, or the differences observed in survival, translated into a score of 0 in the ESMO evaluation form 2c, as well as in the modified ESMO-NI.
Bevacizumab maintenance was analyzed in MACRO [20], a trial specifically designed to test non-inferiority. However, it could not be proven to be non-inferior to continuous treatment. The observed upper limit of the HR was 1.35, exceeding the prespecified threshold of 1.32. This result does not mean that continuous treatment is superior; it was simply not informative and was graded as 0 in the two evaluation scores.
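The MACRO result illustrates the basic non-inferiority decision rule: non-inferiority is claimed only when the upper limit of the HR stays below the prespecified margin. A minimal sketch follows, using the two figures quoted above. The function name is hypothetical, and the strict inequality is an assumption, since the text does not say how a limit exactly equal to the margin would be handled.

```python
def non_inferiority_demonstrated(hr_upper_limit: float, margin: float) -> bool:
    """True when the upper limit of the HR lies below the NI margin.

    A False result means non-inferiority was not shown; it does not
    establish that the comparator arm is superior.
    """
    return hr_upper_limit < margin


# Figures quoted for MACRO: upper limit 1.35 against a threshold of 1.32.
macro_shown = non_inferiority_demonstrated(hr_upper_limit=1.35, margin=1.32)
print(macro_shown)  # False: the trial is non-informative, not negative
```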

Recommendations
Regarding EGFR inhibitors, Cetuximab maintenance was not shown to be superior to the control arm and, even though Panitumumab-5FU seemed to increase PFS over Panitumumab alone, no improvement in OS was observed and no clinical benefit was obtained. Therefore, no recommendation could be made.

Discussion
This detailed overview of the main RCTs evaluating first-line maintenance in mCRC revealed some strengths and, more commonly, weaknesses in this landscape. Despite more than 18 studies appraising this setting, drawing conclusions was not straightforward, and the conclusions reached seemed weak. One of the principal reasons is the choice of an incorrect hypothesis. When maintenance treatment is evaluated, the aim is not to find an improvement in efficacy but to demonstrate less toxicity, better treatment tolerance, or an improvement in quality of life, so a non-inferiority design should be used. Only seven of 18 studies followed this logic. Collecting this information is important in order to understand the real weight of the data. However, other guidelines (ESMO, French Intergroup, Australian Cancer Council) [25][26][27] do not reflect these facts. They grade the evidence by trial category (randomized, quasi-randomized, observational), adjusted by confusing definitions such as "further research is very unlikely to change the confidence or to have an impact in confidence of the effect". Thus, in the ESMO guidelines, three of the four recommendations on maintenance treatment were graded IV-A, equivalent to expert opinion. In the French Intergroup Clinical guideline, only two statements related to maintenance therapy were made, and one of them was grade C, meaning that "further investigation is likely to have an important impact in our confidence of the effect". Therefore, the main guidelines lacked focus on maintenance treatment issues and contributed only weak information.
A further step to establish a recommendation must be to evaluate the clinical benefit. For this purpose, in the maintenance treatment setting, it is not only important to evaluate the decrease in toxicity but, even more so, the improvement in some symptoms or in quality of life. It should be a requirement to establish which limit of impairment in the risk of death is clinically acceptable. ESMO-MCBS only evaluated the first two points but did not rely on any objective measure for this purpose.
In our proposal, we selected a clinically acceptable HR margin of 1.15. This margin was supported by the effect retention method. This method of selecting the non-inferiority margin is the least criticized by experts, is endorsed by large systematic reviews [5], and is used, and indeed required, by the FDA for new drug approval based on non-inferiority studies [6]. This HR threshold remains open to criticism and, from our perspective, should be taken only as a proof of concept.
Furthermore, two of the four studies with an OS primary endpoint did not report OR. In studies where sequential treatment was compared with upfront CT, the choice of which OR (after first line or second line) should be used for comparison was highly questionable and precluded the ESMO evaluation. QoL was rarely evaluated (five of 18 studies) and, when it was assessed, only a small number of patients completed the QoL assessment, yielding poorly representative results. Take as an example the FOCCUS study, where an amendment introduced a non-inferiority comparison between the FU and sequential doublet arm vs. the upfront doublet treatment arm: the primary endpoint was achieved. Nevertheless, no OR or QoL was reported, and the study's ESMO-MCBS-derived score of 0 was non-informative. With our proposal, a score of 2 was obtained, providing more useful information. A similar pattern was noted in the SAKK study evaluation. Non-inferiority in TTP with control vs. Bevacizumab maintenance was observed. However, the ESMO-MCBS score was 0, whereas a high score was obtained with our proposed scale.
The other guidelines (ESMO, French Intergroup, American Society of Clinical Oncology (ASCO) resource-stratified guidance, and Australian Cancer Council) [25][26][27][28] and our review agree that, after a first line of FOLFOX and Bevacizumab, continuing with Bevacizumab and a Fluoropyrimidine is the most strongly recommended option. ESMO and the Australian Cancer Council consider that there are not enough data to define the best maintenance treatment after a first line with CT and EGFR inhibitors. Parameters identifying subgroups of patients who could benefit from more, or less, active maintenance strategies were lacking and would be of great interest. It is regrettable that the inclusion of more than 3300 patients did not provide evidence to address this clinical need.
Altogether, even if no new results from methodologically well-planned studies are expected in the near future, refining the clinical benefit evaluation tools would seem to be the most correct way to move forward.

Conclusions
Several studies were conducted to evaluate different approaches to first-line maintenance treatment in mCRC. Several issues, such as the different endpoints chosen and inadequate study designs, among others, made it difficult to draw clear conclusions.
Sequential treatment does not seem to be detrimental in comparison to starting with combined CT. If a doublet CT is started as up-front therapy, intermittent treatment does not seem to compromise the outcome, improves the quality of life, and is to be recommended. Unfortunately, with the use of MAbs, no conclusion was reached with regard to decreasing toxicity without jeopardizing outcomes. Concerns about the methodology used in the studies' design and the lack of accurate evaluation tools emerged as hurdles in arriving at conclusions. A huge effort to solve them would be a very useful step in making progress.