piNET: An Automated Proliferation Index Calculator Framework for Ki67 Breast Cancer Images

In this work, a novel proliferation index (PI) calculator for Ki67 images called piNET is proposed. It is successfully tested on four datasets, from three scanners comprised of patches, tissue microarrays (TMAs) and wholeslide images (WSI), representing a diverse multicentre dataset for evaluating Ki67 quantification. Compared to state-of-the-art methods, piNET consistently performs the best over all datasets with an average PI difference of 5.603%, PI accuracy rate of 86% and correlation coefficient R = 0.927. The success of the system can be attributed to a number of innovations. Firstly, this tool is built based on deep learning, which can adapt to wide variability of medical images – and it was posed as a detection problem to mimic pathologists’ workflow which improves accuracy and efficiency. Secondly, the system is trained purely on tumour cells, which reduces false positives from non-tumour cells without needing the usual pre-requisite tumour segmentation step for Ki67 quantification. Thirdly, the concept of learning background regions through weak supervision is introduced, by providing the system with ideal and non-ideal (artifact) patches that further reduces false positives. Lastly, a novel hotspot analysis is proposed to allow automated methods to score patches from WSI that contain “significant” activity.


Introduction
Breast cancer is the second most frequently diagnosed cancer worldwide and the leading cause of cancer-related deaths in women [1]. Approximately 2.1 million women every year are impacted, and in 2018, approximately 627,000 women died from the disease. Invasive ductal carcinoma (IDC) accounts for 80% of breast cancer cases and is the most invasive of the breast cancers. Over time, if left untreated, IDC can metastasize to other areas of the body. Breast cancer survival is attributed largely to timely and accurate diagnosis, which is dependent upon many features including tumor size, tumor cell proliferation and morphological features of the tumor [4]. Histopathology plays a vital role in obtaining these measurements, and is used for diagnosing breast cancer disease, patient management and for predicting prognosis [3]. Traditionally, pathologists examine tissue under magnification with hematoxylin and eosin (H&E) stains to highlight cellular morphology and tissue microstructure for diagnosis and tumour grading. Grading systems such as Bloom-Richardson [5] are used to report the mitotic index, as well as the nuclear and tubule grades for breast tumours.
To characterize tumors further, and to increase response to therapy via targeted approaches, there is growing interest in using immunohistochemical (IHC) biomarkers [6]. A promising IHC biomarker for breast cancer is MKi67 (or Ki67), a protein that is indicative of cell proliferation and it is found in all active phases of the mitotic cell cycle [7]. Ki67 tumor proliferation rates have been clinically shown to be correlated to tumour aggressiveness, predicted survival rates, reoccurrence and also has great potential in improving decisions around treatment options [8][9] [10]. Although integrating Ki67 into clinical workflows could improve quality of care, computing Ki67 proliferation indices (PI) is time consuming and subject to inter-and intra-observer variability among pathologists [11]. The American Society of Clinical Oncology and European Society for Medical Oncology (ESMO) have stated that Ki67 would be a useful clinical tool if it were standardized [10] [12]. Moreover, the number of pathologists in the workforce is declining, and the number of cancer cases increasing [13] which creates pressures on pathologists. They are expected to complete more cases in less amount of time. Pathologists working more than 39 hours a week are found to be fatigued, overworked and burnt out [13][15] [16], which can affect quality of care. Therefore, to increase clinical applicability and overall utility of Ki67 biomarkers, PI measurements should be standardized and efficient. Digital pathology, a relatively new technology, can be leveraged to address these challenges.
Technological advances in whole slide scanners are creating high resolution digital images that can be analyzed automatically with image analysis, machine learning and artificial intelligence (AI). In fact, digital pathology is expected to be the "first frontier" of medical AI [17]. Computational pathology methods could standardize PI intervals and be used to study PI in large cohorts efficiently and scoring tools that provide automated PI would improve workflow efficiency, turn-around-times and reduce subjectivity, ultimately improving patient care. With the emergence of deep learning techniques achieving high performance in segmentation and classification tasks for images, there is potential for these methods to revolutionize the way digital pathology images are analyzed.
Automated PI quantification is an on-going research topic due to the challenging nature of the task. There is variability in stain and scanner vendors which contributes to an overall lack of reproducibility [18]. In [52] we proposed an IHC Color Histogram (IHCCH) approach to automatically compute the PI for TMAs in an unsupervised manner and achieving 92.5% PI measurement accuracy on two datasets (80 TMAs). This method was developed to detect any cell that was stained with DAB or hematoxylin and since the TMAs largely contain tumour cells only, there was high PI estimation performance [53]. However, there is challenges with this method when applied to new clinical datasets since there are both tumour and non-tumour cells which when detected reduce PI estimation performance. Additionally, different types of tissues, cells and artifact can all disrupt the overall PI measurement by mainly creating false positives in the nuclei detection phase.
The accurate detection of tumor cells is a key factor in accurate calculation of PI. The importance of distinguishing between cell morphologies was explored in [19] where two commercially available software tools were used to quantify PI in 53 breast cancer WSI biopsies from a single centre. It was found that cases with PI differences of > 10% were caused by errors in tumor cell classification, overlapping cells or staining artifacts.
To overcome the inclusion of non-tumour cells and false positives in PI quantification algorithms, several algorithms have been proposed to attempt to handle these challenges. In [21], PTM-NET was introduced to analyze Ki67 stained breast cancer whole slide images with machine learning to segment tumor regions that can then be passed to a Ki67 calculator. The PTM-NET achieved a dice correlation coefficient of 0.74 between the automated and manually delineated tumour regions from 87 patients acquired from two centres. To try and reduce the impact of non-tumour regions, in [23], the authors use a U-Net model to segment unwanted regions such as folds, smears and color distortions due staining variability in ki67 stained brain tumour specimens. If this work could translate to breast cancer tissue, these regions could potentially be used to remove false positives in the nuclei detection phase to increase PI calculation accuracy. In [20], Xing el. al propose, KiNET, a deep learning method to detect nuclei, perform tumor region segmentation and to classify between tumor and nontumor nuclei in one framework. The model was evaluated on 38 Ki67 pancreatic NET cases of size 500x500 and achieved an F1-score of 0.898 and 0.804 for nuclei detection and classification respectively. Although promising, the method requires several annotation and ground truth types, which can be cumbersome to obtain. In [24] a three stage algorithm was proposed for Ki-67 scoring, which included various algorithms to segment cell boundaries, extract nuclei features and classify each cell as tumour or non-tumour, and also as immunopositive or negative, which was then used to compute the PI. The challenge is that individual nuclei were segmented, which is time consuming to generate annotations for and must be very accurate for the model to compute PI accurately.
Deep learning has become a prominent choice for analyzing and quantifying digital pathology images. In [25] over 130 papers utilizing various datasets and machine learning strategies for digital pathology, including fully supervised, weakly supervised, unsupervised, transfer learning and various variants were critically analyzed. Despite the excitement of utilizing AI in computational pathology, a single unified approach does not exist. There are many challenges that remain that should be addressed for effective and reliable PI calculation in one framework. For example, other works have tried to mitigate the inclusion of non-tumour cells through tumour preprocessing or cell classification. These types of methods can be laborious to develop (lots of annotations required) and any error in this phase is propagated down to the PI computation step. Additionally, image artifacts, cell clustering, folds and non tumour cells all create false positives and although [23] work shows promise by detecting artifact regions, this work has not been translated to breast tissue or been used to specifically aid nuclei detection. Moreover many of these works are tested in single centre data with limited variability. Therefore, it is difficult to determine if the methods would generalize to new, unseen data.
In this work, we present piNET -a novel U-NET [26] based model for PI quantification of Ki67 breast cancer tissue to overcome the challenges of previous approaches. It automatically detects immunopositive and negative tumour nuclei separately within a single framework, and suppresses false positives using novel sampling strategies. Therefore, accurate tumour preprocessing is not required, and false positives reduction schemes are not needed to reduce issues created by folds, artifacts and nontumor cells/tissues. Performance is evaluated in 773 regions of interest (ROI), 86 TMAs and 55 WSIs (three image types), from four institutions, three scanner models and two types of mammary tissue (human and canine). It comprises one of the largest multicentre Ki67 datasets evaluated in the literature.
The framework is trained and tested on individually annotated tumor nuclei, to ensure stromal, epithelial and other cell types of non-tumor cells are not included in the final PI calculation. This offers an accurate PI measurement without dependence on tumor preprocessing. We strictly focus on nuclei detection (instead of segmentation) to mimic a pathologist's workflow in performing PI quantification, and to simplify ground truth generation; it is faster and more accurate to annotate nuclei centers vs. boundaries. Based on the nuclei center annotations, a Gaussian kernel is generated and different channels are used for the Ki67+ and the Ki67-nuclei so that both tumor nuclei types are annotated in a single RGB image. Motivation for using a Gaussian kernel is so the system can learn contextual information from the nuclei (versus a single point in the middle), which could help the classifier learn more robust features. These ground truths are then used to train and test the U-NET architecture, which detects Ki67+/-tumour nuclei. Regression-based loss functions are investigated and used since the ground truth and predictions are non-binary. A novel data sampling strategy is introduced which includes the concept of ideal and non-ideal patches. Ideal patches have mainly tumour nuclei, and no artifacts. Non-ideal patches can include tumour nuclei, but they also contain folds, staining issues and other non-tumour tissue to allow the classifier to learn these regions as part of the background class. Various fine-tuning is completed to find the optimal loss function and data splits. The model is trained on a single dataset and tested on large and diverse datasets for Ki67 which includes patches (ROI), TMAs and WSI data. Evaluation on this type of multi-center data will help to determine the generalizability and overall robustness of the proposed piNET for clinical use.

Methods and Materials
In this section the data and methods will be described. The data used to train and test the proposed piNET architecture is from five datasets from four centers, and the images range from regions of interest (ROIs), tissue micro-arrays (TMAs) and whole slide images (WSIs). Three of the datasets have ground truth annotations for individual nuclei. The other two have ground truth PI information. Of the five datasets, four are of human tissue and one is from canine mammary tissue. The proposed piNET architecture is shown in Figure 1 and includes data preparation, where the images are tiled if they come from a TMA or WSI, followed by automated tumour nuclei detection for Ki67+ and Ki67-using piNET. Using the detected nuclei, the PI is computed.  Table 1 summarizes the Ki67 stained breast cancer datasets used, along with the scanner type, magnification, ground truth available and number of images. In total, there are five multi-institutional datasets resulting in a total of 773 patches (ROI), 55 WSI and 90 TMAs. SMH patches were 256x256 in size and DeepSlides were 512x512 (which were down sampled to x20 yielding patches of 256x256). Data is from St. Michael's Hospital "SMH" in Toronto, Canada, which includes breast biopsies (WSIs) and corresponding patched regions of interest (ROI), ROI patches from an Open source dataset "DeepSlides" [27], TMAs from Protein Atlas (Open source) [28], and TMAs from the Ontario Veterinary College at the University of Guelph "OVC". This diverse dataset represents one of the largest multicenter Ki67 datasets analyzed in the literature. There are WSIs, images and TMAs, as well as images acquired by different scanning devices and with a variety of stain variations. Throughout the paper, the model was trained and tested on SMH patches while the other datasets were used to validate the pipelines accuracy and generalizability to different image sources and types -ROIs, WSIs and TMAs. These datasets will be used to test the proposed piNET system's accuracy, robustness, generalization and PI quantification accuracy. Visual representation of each dataset with the corresponding ground truth can be seen in Figure 2.  SMH breast tumour biopsies were stained with DAB (Ki67) and hematoxylin for 55 patients and from each patient a single representative slide was digitized at 20x using Aperio AT Turbo scanner. The SMH WSIs were further patched (described below) to generate annotations for Ki67+/-nuclei. Deepslides [27] contained Ki67 regions of interest cropped from WSIs from 32 different breast cancer patients scanned using the Aperio ScanScope at 40x magnification with a pixel size of 0.2461 x 0.2461µm 2 . The provided cropped regions were 1024x1024. A total of 113 randomly selected 512x512 tiles were cropped and annotated, followed by down sampling to 256x256 (to bring the effective resolution down to 20x to match the training data). The Protein Atlas [28] dataset consists of 90 Ki67 TMAs from the IHC lab in Uppsala, Sweden, which were acquired by Aperio ScanScope AT and Aperio ScanScope T2 scanners, using a magnification of x20. The last dataset used consists of 30 canine breast cancer TMAs stained for Ki67 with 129,404 individually labeled nuclei from the Ontario Veterinary College (OVC) at the University of Guelph. OVC data was used to test the robustness of the pipeline to different mammary cell types, but of same disease. The dataset was scanned using a Leica SCN400 Slide Scanner, with x20 magnification and were stained using Ki67 and hematoxylin. The total dataset consists of four databases, three image types (ROIs, WSI and TMAs), three scanner types and three different laboratory staining protocols representing a diverse set of test images.

Datasets
SMH, DeepSlide and OVC all have individual labels for both the immuno-positive and immunonegative tumor cells and annotations were completed by marking the center using ImageJ, with different colour markers for positive (Ki67+) and negative (Ki67-) cells [30]. The patches data (SMH and DeepSlide) were annotated by an experienced biomedical student that was trained by a breast pathologist and only tumour cells were annotated. The OVC TMAs were annotated by a DVM from the OVC department of pathobiology and the cores contained mainly tumorous tissue. Ground truth proliferation index (PI) of these three datasets were calculated using the expert nuclei annotations. To generate SMH patches, SMH biopsies (WSIs) were cropped in Pathcore's Sedeen viewer [29]. Only regions that held representative information of the overall slide were considered. SMH biopsies were scored for PI by an experienced breast pathologist. The pathologist initially scanned the entire whole slide image to distinguish between homogenous, heterogenous or hotspot pattern in proliferation. Afterwards, three representative regions across the entire tumor were identified to reflect the average proliferation across the tumor and total of 500 cells were included in the proliferation calculation. The representative regions were not disclosed. There were 55 corresponding PHH3 slides with the manually found mitotic index. Protein atlas supplied PI ranges (i.e. low: PI<10%, medium 25%≤PI≤75%, high: PI>75%) as opposed to individually annotated nuclei or specific PI values.
When cropping ROIs from SMH WSIs, 330 patches of ideal regions along with 330 patches of nonideal regions were cropped over the entire datasets. Ideal patches are defined as regions where 'pristine' data can be visualized, mainly consisting of only tumorous tissue (i.e. immuno-positive and tumor immuno-negative cells), with minimal nontumor cells/tissue or artifacts. Nonideal patches still can contain tumour cells, but they also include folds, artifacts, nontumor cells, including epithelial cells, inflammatory cells, and excessive tissue, such as stromal or adipose tissue. The importance of distinguishing between ideal and nonideal patches of data will be thoroughly discussed later.
In this work, a single dataset (SMH patches) is used for model tuning and training/validation purposes. All four datasets will be used for testing piNET, which includes SMH patches and WSIs, as well as the other datasets not seen by the classifier to examine generalizability: DeepSlides, Protein Atlas and OVC. We use the SMH patch data to train and validate the proposed piNET system since SMH data is acquired natively at 20x (which aligns with most of the datasets we are using), it consists of individual nuclei annotations and is composed of good quality images to build a model from. Individually annotated nuclei ground truths will be used to examine nuclei detection performance of piNET using F1-Scores, and PI values will be used to assess accuracy and reproducibility. To determine PI accuracy all datasets are used: computed PI from the patched data (SMH, DeepSlides, OVC), PI ranges from the Protein Atlas TMAs, and whole-slide (patient-level) PI from SMH WSIs.

Methods
In this study we propose a novel deep learning-based pipeline called piNET which detects tumor cells and differentiates between immunopositive (Ki67+ or diaminobenzidine) and immunonegative (Ki67-or hematoxylin) cells in one framework. piNET is constructed from a multiclass UNET-based system trained on tumour cells from SMH patches, and regression losses are used. In images with individually annotated nuclei, a Gaussian proximity map is used to demarcate the nuclei. Based on detected Ki67+/-tumor cells, a proliferation index (PI) is computed for any type of image (TMA, ROI or WSI). As will be shown, one of the major contributions is the architecture's overall robustness, in that the model can quantify Ki67 PI for a variety of image types (ROI, TMAs, WSIs), scanner/stain vendors, lab staining protocols and the presence of artifacts or non-tumour cells. This is investigated using different dataset sources, image types, magnification levels and scanner types (see Data). The different datasets were evaluated throughout using various validation metrics depending on the ground truth type (see Validation Metrics). The overall pipeline is shown in Figure 1 and consists of three major components, (i) Patch Extraction, (ii) piNet: Ki67+/-Detection and (iii) Proliferation Index Quantification. These are discussed next.

Dataset Preparation
In this proposed framework, we investigate a fully automatic pipeline, which quantifies Ki67 PI of breast cancer tissue for various image types, including TMAs, WSIs and patch-based analysis. The pi-NET architecture has been designed to accept 256x256 images so for TMAs and WSI the images need to be preprocessed and tiled first. TMAs are directly tiled into patches of 256x256 since they contain mostly relevant tissue to be analyzed for PI quantification. On the other hand, WSI tumor resections have large areas of normal regions, artifacts, fat etc. Therefore, to speed up processing (number of tiles to analyze) and to isolate only regions of interest, the WSI undergo thresholding (described next) to find candidate analysis regions, i.e., Figure 3, which are then patched into tiles.  Figure 4 shows the three steps that are performed to segment candidate areas of interest from the WSI and reduce the number of patches to process. First, the lowest resolution level of the Aperio ScanScope Virtual Slide (svs) file is extracted. WSIs are stored in a pyramid-based structures, where each level of the pyramid has a different level of resolution and dimension. Next, Otsu thresholding is applied on low resolution version of the image, on the red channel [32][33] to detect regions of interest. Lastly, the thresholded segmentation mask is used to applying tiling on the WSI. To obtain the tiles in the original resolution, the coordinates in the low-resolution representation are upscaled to the WSIs original resolution and are used to process the images. The tiled regions are then fed into piNET for nuclei detection and PI Quantification.

piNET: Automated Ki67 +/-Detection of Tumor Cells
Automatically detecting and differentiating between positive and negative tumor cells on multicenter datasets is a challenging task and is the central design issue for automated PI computation [35]. Different laboratory practices as well as varying staining vendors, scanner manufactures, staining protocols or biomarker expression levels can reduce accuracy of the approaches. Variation within digital pathology imaging data can cause robustness issues within algorithms or lead to overfitting a certain dataset. Moreover, there are many non-tumor cell types that can interfere with automated PI computation which traditionally required an accurate tumor segmentation preprocessing step [36][37] [38]. In this subsection, we introduce piNET; a U-NET based deep learning architecture that overcomes these challenges. It is a multiclass framework where Ki67+/-tumour nuclei are detected at the same time across multicenter, clinical images. Since tumour nuclei are detected, tumour preprocessing is not required in tiled images. The success of piNET is due to a number of factors. Firstly, due to the nature of deep learning the method is able to adapt to variability in images much better than traditional approaches. Secondly, the proposed work detects Ki67+ and Ki67-tumour cells in a single framework in a way that mimics a pathologist's workflow and differentiates between normal and tumor cells for more accurate PI quantification. As will be shown piNET can be used to automatically compute PI given patches from TMAs or WSI, which allows the tool to be applied on a variety of real, clinical data. Lastly, the model is tuned to be more robust to artifacts, folds or noise within the image due to a novel data sampling strategy.

Figure 5. Ki67+/-Detection uses a U-NET for the prediction of inputted
The piNET pipeline was built based on the original U-NET architecture, a convolutional neural network (CNN) with an encoding arm which compresses the images into a compact representation, and a decoding arm that reconstructs the prediction. See Figure 5 for the architecture used in this work. The original U-NET architecture was designed for segmenting biomedical images with excellent results in variable biomedical data [26], where the model used a per-pixel classification approach to segment cells, followed by watershed to separate overlapping cells in the prediction. Instead, in this method, we use a Gaussian defined proximity map as ground truths for individual nuclei and use a regressionbased loss function on a per pixel basis to identify central regions of the tumor nuclei. The model contains 6 levels of depth, with an initial number of filters set to 32, all the way down to the last layer which ends at 1024. The final activation layer used is the rectified linear unit (ReLU) as ReLU-based models are scale-invariant, are faster than sigmoid, have better gradient propagation and are computationally efficient [40] [41]. To ensure the model can robustly detect Ki67+ (immuno-positive) or Ki67-(immuno-negative), and omit nontumor cells/tissues/artifacts three classes are used: Ki67+ (red channel), Ki67-(green channel) and non-tumor cells and tissues comprise the background. The remainder design aspects of the method are described next.

Ground Truth Generation
For PI quantification, pathologists do not consider the boundaries of the nuclei, but instead are interested in a raw nuclei count of immunopositive (Ki67+) and immunonegative (Ki67-) tumor cells. Therefore, in this work, instead of segmentation, we are focused on automatic detection of immunopositive and immune-negative nuclei to mimic pathologists' workflow. The proposed method also allows for faster, efficient ground truth generation, as tumor cell nuclei centers are annotated with a single seed, rather than segmenting cell boundaries. This specific method of annotating tumor cell detection mimics the pathologist's workflow and has gained traction, as more deep learning algorithms beginning to utilize this highly efficient method [38][39] [20]. Although convenient and in line with pathological analysis, single pixel point annotations for the Ki67+/-nuclei is not enough information to train a deep learning system. Instead, a Gaussian kernel is generated and placed in the center of each nuclei based on the expert annotation. Given an image ( ! , " ) with ( ! , " ) ∈ " , and a nuclei center at spatial location ( ! , " ), a Gaussian kernel is created by: where is the standard deviation (or the width of the function). The Gaussian kernel emphasizes the center pixel (highest weight) and gives increasingly less importance to pixels further away from the center. This representation can be beneficial since it incorporates spatial context from within the cell, but still places highest priority on the center. This extra contextual information may be beneficial to the CNN layers, since additional information from within the cells can be considered. We call this a spatial proximity map. A small kernel was used: = 3, to ensure that the Gaussian does not extend past the boundary of the cell. Since all images are 20x this kernel represents approximately the same size across the datasets. The individual spatial proximity maps of tumor cell centers (for Ki67+/-cells) are shown in Figure 6. Note that for visualization purposes, Ki67+ tumour cells are red, whereas the Ki67-cells are green, but when training the system each class comprises its own channel. After obtaining the piNET's RGB prediction map, the image is processed per marker stain type and the red channel corresponds to predicted Ki67+ tumor cells and the green channel for Ki67-tumor cells. Each map is thresholded using the Otsu [32] method and the count of the number of Ki67+ and Ki67-tumour cells is obtained to quantify the PI.

Loss Functions
A loss function is a critical design element in deep learning since it determines the way weights are updated [31]. Given a prediction from the system and the corresponding ground truth, the goal of the loss function is to measure the similarity between them and to drive the system to learn parameters that maximize this similarity through backpropagation. Loss functions can be categorized as regression-or classification-type. Classification loss functions focus on solving a categorization issue, where the predicted value is binary or categorical [41][42] [43]. Segmentation problems are classification problems on a per-pixel basis. Regression based problems use losses that can predict a quantity (vs. a binary or categorical response).
For the detection of Ki67+/-nuclei, we pose this as a regression problem, since the Gaussian kernels are non-binary and the prediction should represent the proximity to the center of the nuclei. Recall from Figure 6 that each tumor cell annotation has a Gaussian distribution with the highest value close to the center of the nuclei. For piNET, we have separate markers for the Ki67+ and the Ki67-nuclei and regress both separately and within one framework. The gradient of the loss function is used to update the weights at the same time for both of the classes separately.
The selection of a loss function is critical for the performance and accuracy of a model [31]. Hence, several regression loss functions are investigated: Huber, LogCosh, Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Given a predicted output image -( ! , " ), the corresponding ground truth ( ! , " ), both with samples, these loss functions are shown in Table 2. The Huber loss is a robust loss function which shows great potential for regression tasks [44]. A drawback is the manual selection of delta ( ) and for this work, the default = 0.1 was used. Huber loss is known to be less sensitive to outliers, in comparison to other loss functions [44] [45]. LogCosh is another commonly used regression loss function based on the logarithm of the hyperbolic cosine of the difference between prediction and expected. LogCosh may be able to suppress outlier data to result in more robust optimization [46][47]. Mean squared error (MSE) is another intuitive loss function, which is the squared difference between actual and predicted output across the entire training dataset. A potential drawback of MSE is if outliers are present; it can cause heavily penalized values from the squaring operation for outliers. The last loss function to be analyzed is Root Mean Squared Error (RMSE); the MSE square-rooted which is less sensitive to outliers [41] [41][42] [45]. As generalizability and adaptability are considered for the design of piNet, the model must be able to accurately detect tumor cells regardless of the presence of folds, artifacts or other cell and tissue types. All four loss functions are investigated to determine which is best for these tasks.

Data Sampling
In Ki67 breast cancer imaging data, there are artifacts, folds, and normal tissues/cells which traditionally create false positives for nuclei detection [19] [20]. This can inflate estimates of tumour nuclei counts which negatively affects PI accuracy. Therefore, as part of the design, we considered a few novel data sampling strategies to help piNET learn the representations of Ki67+/-tumour nuclei while suppressing irrelevant background regions, in attempt to reduce false positives. We considered two types of patch types -ideal and non-ideal. Ideal patches are those that are "pristine" in some sense; they have mostly tumor cells and no artifacts or folds. Non-ideal patches on the other hand may have tumor cells, but they can also have folds, artifacts and other types of non-tumor cells. These non-ideal patches may help the model generalize by helping the model learn non-tumor cells and artifacts as part of the background class. The hypothesis is that perhaps data partitions with more non-ideal samples will weakly supervise learning in order to help to reduce false positives caused by folds, artifacts and other non-tumor cells since these regions are learned as part of the background class. Annotations follow the same protocol for both the ideal and non-ideal patches, where tumor nuclei are annotated in the center, for two classes: Ki67+ and Ki67-nuclei. Examples of ideal and non-ideal patches are shown in Figure 7.
Three datasplits were investigated. First, the 100-0 split which is 100% ideal patches and 0 nonideal patches. The second split is 50-50 where 50% ideal and 50% non-ideal patches of data are used. The last split is 70-30 to give more importance to the pristine data (70% ideal), with still some importance given to the non-ideal scenarios (while not "over" learning). For each datasplit, we use a 70:10:20 training, testing and validation strategy stratified on PI ranges (low, medium, high).

Ideal Patches
Non-Ideal Patches For a given region of interest or patch, Equation 2 gives the effective PI for that region. However, TMAs and WSI are comprised of many patches and so there needs to be a way to compute the PI from the recomposed images (i.e. all patches). In TMAs we recompose the images and compute the total PI for the TMAs using equation (2). On the other hand, if a WSI is being analyzed, a pathologist will typically select hotspots or representative regions for analysis. To mimic this analysis and ideally use similar hotspot regions we proposal a novel method to perform WSI PI quantification. In particular, we find the average number of tumor nuclei per patch denoted +,-and use this as a threshold to find patches with "significant activity" that should be included in the final PI calculation. Namely, only patches with more tumor nuclei than +,-will be included in the final PI. Regions that adhere to this have the total number of Ki67+ and Ki67-nuclei (separately) accumulated over all patches. Given N(i) number of tumor cells for patch number i, this can be represented in the following way where represents the number of patches within the WSI. The total accumulated Ki67+/-values, for active regions, are then used to calculate PI:

Results
Evaluation of the proposed piNET algorithm for nuclei detection and PI computation over five multicentre datasets, including regions, TMAs and WSI will be presented here. Firstly, the piNET model is tuned, and the effect of the design parameters for nuclei detection are evaluated for the proposed (i) Loss functions and (ii) Data Splits. The optimal design is selected, and the final model is used for further testing of generalizability to unseen data. Secondly, performance is then analyzed as a function of nuclei detection performance and "Accuracy" for both Ki67+ and Ki67-tumor nuclei using SMH as the test set. To examine "Generalizability", we then explore the performance of the architecture on two unseen datasets (DeepSlides and OVC) as a function of nuclei detection and PI accuracy rate using Low (PI<25), Medium (25-75) and High (PI>75). Finally, the piNET is compared to three deep learning methods, Res-UNet [49], Dense UNet [50], FCN8 [51], and the previous unsupervised approach, IHCCH [52], across five datasets.
All piNET models are trained with SMH patches, with 100 epochs, batch size of 16, learning rate of 0.001 and Adam optimizer. The model was trained using stratified data of PI ranges (Low PI<10, Medium PI 10-30 and High PI >30), of 70:10:20 training, testing and validation (SMH patches). The input image is the 256x256x3 RGB pathology image with the corresponding ground truth represented as another 256x256x3 image with three classes, Ki67+ (green channel), Ki67-(red channel), background (black), see Figure 6. The final prediction map is thresholded using Otsu [32] to highlight the most prominent predictions and find maxima corresponding to the nuclei centers. If the cells detected are overlapped, watershed method is used for separation. The binary detection result is used to calculate performance of the system as a function of F1 Scores, Accuracy Rates, Proliferation Index difference and Pearson related coefficients (see next).

Validation Metrics
To validate the accuracy of piNET, a series of validation metrics were considered that measured nuclei detection performance and PI quantification accuracy. In total, four different validation metrics were used. F1 scores, which consider both sensitivity and precision of nuclei detection was used for Ki67+ and Ki67-nuclei detection performance (separately) with where TP, FP and FN are the true positives, false posties and false negatives respectively. The images of binary predicted and ground truth images are overlaid to examine whether each automatic nuclei detected corresponds to a manually labeled nuclei resulting in a true positive (TP) if the labeled seed is coincident with the predicted seed, an FP if the ground truth nuclei was missed and a FN if a nuclei was predicted where there wasn't one in the ground truth image. These values are then utilized to generate an F1 score for each image in the dataset and to compare F1 scores across different experiments, the average F1 score is obtained per cohort.
To measure PI estimation accuracy, Proliferation Index (PI) difference is computed for overall accuracy, the Pearson correlation coefficient to measure the consistency of PI quantification and the accuracy rate of PI classification into Low, Medium and High PI ranges. PI difference is utilized to investigate the overall error between predicted ( +012 ) and actual PI ( 3+'0+4 ) value as shown in equation 7. If a value for PI is given (SMH WSIs), the automatic PI and ground truth PI are compared. If the ground truths of a given dataset is marker based, the PI is calculated using the markers, per stain, (Equation 2) to find 3+'0+4 which is compared to the automated value +012 . PI difference can be computed as follows: Pearson's correlation coefficient (R) was applied as a validation metric to analyze the model's ability output PI values that are consistent and correlated to the ground truth data. This metric is used to measure the linear association between the predicted PI ( +012 ) and actual ground truth PI ( 3+'0+4 ), across a given cohort and is computed by where +012 nnnnnnnn represents the mean value of the predicted PI and 3+'0+4 nnnnnnnnnnn corresponds to mean value of the actual PI values.
A final measure known as the PI Accuracy Rate is also computed that considers whether the method predicts accurately in the respective PI ranges of Low, Medium and High. The ranges used in most experiments is Low (<10), Medium (10-30), High (>30) as these ranges are common in the literature [11]. For experiments that include Protein atlas we use PI ranges of Low (<25), Medium (25-75), High (>75) as these are the ground truth labels that are supplied. To measure PI range accuracy, if the range of the predicted PI value falls into the correct range this is counted as a correct classification. The total number of correctly classified PI ranges are accumulated and divided by the total number of data samples per cohort.
In addition to F1 scores, the difference between the methods can be elucidated in the PI Accuracy Rate and R which informs as to how well the classifier is performing over different PI ranges (and as a as a function of disease severity). F1 scores report raw nuclei detection performance, and if the model is operating on an image with low numbers of tumour nuclei and a single nuclei is missed, this can greatly skew the overall F1 score of the patch allowing for outliers. Therefore, additional metrics, such as the PI Accuracy rate and R are needed. All four validation metrics will be analyzed together throughout the paper.

Model Development
Hyperparameter tuning is a critical component of model development. The two major concepts investigated here are the choice of the regression-based loss function and the optimal dataset split between ideal and nonideal patches. SMH patches are used for testing and F1 validation metrics and PI agreement measures are used to analyze design combinations. PI range accuracy uses the proliferation index ranges of Low (<10), Medium (10-30) and High (> 30) values. Pearson Coefficient (R) values are used to examine correlation between automatically measured PI values versus the manual ground truth PI. Throughout the course of the model development, the following experiments will be conducted to tune the detection model for optimal results: (i) loss function analysis across four functions and the best loss is used for the (ii) datasplit experiment, where variations of ideal to nonideal patches will be tested to demonstrate the adaptability of the model.

Loss Function
Loss functions are a vital component of deep learning and optimal loss functions can enhance the performance of a model significantly [31]. As mentioned previously, piNet detects Ki67 +/-tumor cells using Gaussian proximity maps in cell centers and therefore, a regression-based loss function is needed to optimize the model. Here we measure the performance of the presented loss functions: (i) Log Cosh, (ii) Huber Loss, (iii) Mean Square Error and (iii) Root Mean Square Error.
The model was trained using the 70:10:20 (training, testing and validation) data split on stratified data of equal distributions of Low (<10), Medium  and High (> 30) PI expressions. A total of 330 ideal ROIs from the SMH patches dataset were used, 240 for training, 30 for validation and 60 ROIs, unseen by the model, for testing analysis. Holding all other hyperparameters consistent, the performance over each loss function was tested on same 60 SMH patches. The training dataset, 240 ROIs, remained consistent and without augmentations as well. Example detection results are shown for all loss functions, alongside the corresponding ground truths and original images in Figure 8. As shown, the tumour nuclei have been accurately demarcated over most loss functions with some slight variability in the results depending on the loss function used.

Original
Ground Truth Huber Log Cosh MSE RMSE  To analyze the performance of piNET quantitatively, validation metrics are computed over the testing dataset. The box plots for the average F1 scores for each of the tumor cell classes (Ki67+/-) are shown in Figure 9 for each of the loss functions. As can be seen, over all loss functions the F1 scores of piNET are mostly high with the negative class (Ki67-) having better performance than the positive Ki67+ class (Ki67-class has higher means, lower variance and less outliers). The poorer performance of the Ki67+ class can perhaps be attributed to larger cell and stain variability, or less training data for this class. Ki67+ nuclei detection performs the worst with the MSE loss (high variance of F1 scores and extreme outliers). To examine the performance more closely, we summarized the average F1 scores for each tumor cell (Ki67+/-), overall accuracy rate and R for the four regression loss functions in Table 3. As shown, all the proposed loss functions accurately detect nuclei centers and differentiate between the two stains; all achieved an average F1 score above 0.8, except for MSE. RMSE achieved the highest Accuracy Rate and R, with competitive F1 scores for both classes. For these reasons, the RMSE is the leading option for the Gaussian based tumor nuclei detection model and is selected for piNET. The scatter plot with corresponding R for all loss functions are shown in Figure 10.

Data Splits
In Ki67 breast cancer images, there are various types of tissue (adipose, stromal), epithelial cells and different artifacts such as folds that can create false positives in nuclei detection. Also, for a robust PI quantification tool, non-tumour cells should not be included in the final PI calculation. Therefore, in this experiment, we introduce the concept of an addition of weakly supervised images, with minimal ground truth annotations. Weakly supervised labels consist of patch-based labels of "ideal" and "nonideal". Recall non-ideal patches could have some tumour cells, but they also largely have folds, artifacts, other tissue types and non-tumour cells. Nonideal patches are used to investigate whether the model can use this data to improve nuclei detection accuracy and reduce false positives. The goal is to let the classifier see more or less examples of these other objects so that they are learned as part of the background class. This could potentially improve the nuclei detection performance with the addition of weakly supervised images. An example of ideal (comprised of largely tumour tissue and cells) and nonideal patches are shown in Figure 7.
The 660 patches from SMH are comprised of 330 ideal patches and 330 nonideal patches. Throughout the experiment, the 240 training, 30 validation and 60 testing images of ideal patches do not change. Ideal data are stratified in terms of images of low, medium and high PI (as per the loss function experiment). The main addition here is due to nonideal patches and the ideal ROIs remain consistent throughout the analysis. In other words, the addition of nonideal images, such as artifacts, folds, mixture of tissues -patch-level (weak) labeled images. This addition allows for minimal ground truth generation, while remaining consistent in teaching the models contextual and spatial information. Ideal and non-ideal are split up by percentages based on the amount of non-ideal data used and there are three types of data stratifications investigated. The first datasplit is 100:0 and that is 100% ideal patches and no non-ideal patches. The 330 ideal patches are then split into 70:10:20 sets for training, validation and testing. The second datasplit is 70:30, which contains 30% nonideal patches resulting in 330 and 172 ideal and non-ideal patches, respectively. The last split is 50-50 and all 330 ideal patches are used, and all 330 non-ideal patches are used. The number of images per stratification and datasplit for each experiment are listed in Table 4. For the datasplit experiments, the proposed piNET framework with the RMSE loss is trained and validated on the 70:20:10 datasplit for each partition, and testing is completed on the 60 held out ideal patches (and any non-ideal as well). Example nuclei detection results for three ideal: nonideal ROI data splits (100:0, 70:30 and 50:50) are shown in Figure 11. Figure 11 visually illustrates the importance of different data splits for analyzing images with large amounts of variation. The figure represents the models' response to nonideal patches in the training sets. Using the 100-0 datasplit, it is obvious the nuclei detection phase is working well in the artifact free images (a-d). However, in images with overstaining, tissue folds or non-tumour nuclei (e-g) the classifier is detecting many false positive Ki67+/-nuclei. As nonideal patches of data are introduced in the 70-30 datasplit, the model detects tumour nuclei but is also correctly suppressing overstaining, folds and other non-tumour calls as part of the background class (e-g). Similar results are seen with the 50-50 split. This demonstrates the importance of including more variability in the background class. Figure 11 visually illustrates the importance of different data splits for analyzing images with large amounts of variation. The figure represents the models' response to nonideal patches in the training sets. Using the 100-0 datasplit, it is obvious the nuclei detection phase is working well in the artifact free images (a-d). However, in images with overstaining, tissue folds or non-tumour nuclei (e-g) the classifier is detecting many false positive Ki67+/-nuclei. As nonideal patches of data are introduced in the 70-30 datasplit, the model detects tumour nuclei but is also correctly suppressing overstaining, folds and other non-tumour calls as part of the background class (e-g). Similar results are seen with the 50-50 split. This demonstrates the importance of including more variability in the background class.  Table 5 and the boxplots of the F1 scores are shown in Figure 12. The F1 Score of Ki67+/-of the 70:30 data partition preforms the worst out of all three splits, specifically in the Ki67+ channel (DAB). The 100:0 and 50:50 data partitions have comparable F1 scores for both tumor cells, 0.9 for Ki67-and 0.8 for Ki67+, but it is evident from the visual results that artifacts, nontumour cells/tissue and folds would create challenges in the 100:0 datasplit. Perhaps this is reflected in the PI Accuracy Rate, which shows that the 50-50 ratio classifies the patches correctly, obtaining an accuracy rate of 93% on classifying PI within the Low (<10), Medium  and High (>30) ranges. This may be attributed to improved classification of folds, artifacts, stromal tissue and other cell/tissue types as background (as shown in Figure 11) Therefore, to minimize false positives, and to obtain good detection performance, the 50:50 datasplit of ideal and nonideal patches is used.   Figure 13. PI scatter plot of manual PI vs. automated PI for each data split experiment, with the corresponding Pearson Coefficient (R), and equation of the line.

Nuclei Detection Accuracy & Generalization to other Datasets
Using a 50-50 datasplit of ideal-nonideal training patches and RMSE loss function for piNET, the overall nuclei detection accuracy and performance on unseen datasets of varying scanner vendors, image types (ROI, TMAs, WSIs), laboratory protocols or biomarker expression levels will be examined.
In addition to reporting metrics on the held out SMH patches, DeepSlides (human breast tissue), and the OVC TMAs (canine mammary) will be investigated using F1 (since these datasets have individually annotated nuclei). See Figure 14 for the box plots for the F1 scores for each of the datasets, and for each tumour nuclei type separately. The average metrics per dataset is shown in Figure 15a and the correlation between the computed and ground truth PI are shown in Figure 15b. The average metrics are summarized in Table 6. As shown by the F1 scores, there is the highest performance on the held out SMH patches in the model. Similar performance is noted for DeepSlides although the Ki67+ detection accuracy was higher in DeepSlides than in the SMH dataset which is interesting since data from this cohort was not seen in training. The data was also downsampled (to 20x) and despite this, the model generalized well to this dataset. The OVC data had the lowest F1 scores, and given that it was canine mammary Ki67 TMAs there can be some performance drop expected due to differences in cells, but overall the model is still performing well. The correlation between manual PI and automatic PI of the various datasets were calculated using R and the PI Accuracy Rate were also computed. All three datasets achieved an R value above 96%, demonstrating that the proposed piNET model can generalize to various unseen datasets. Interestingly, the OVC TMAs achieved the lowest F1 score and accuracy rate, but highest R (despite being of canine mammary samples). There were some TMAs in the OVC dataset with low PI that had few cells missed (which brought the overall F1 down). But when looking at the correlation between manual and automated PI calculation, there is clear indication that the model can predict PI accurately (R = 0.982).

Proliferation Index Accuracy on Diverse Data
Previously, piNet was tested on individually annotated nuclei to analyze the overall model's ability to detect nuclei across three datasets, while only trained on a single dataset. In this section, piNET will be applied to all datasets mentioned in Table 1 and the PI accuracy will be measured with Proliferation Index Difference, Pearson correlation coefficients and PI Accuracy Rate. Of the five datasets, four provide proliferation index values, while Protein Atlas uses proliferation index range classifications (Low <25, Medium 25-75, High >75). Hence, four datasets will be analyzed using PIs directly (PI difference and correlation coefficients) and all five datasets will be analyzed using the PI range classifications of Protein Atlas. There are three types of images included in this analysis: TMAs, ROIs and WSIs. TMAs and ROIs are generally easier to process than WSIs since they are cleaner data, meaning artifacts folds are minimal or nonexistent and the majority of the image contains tumor cells. WSIs typically are complex to quantify and difficult to analyze due to the variability within a single slide (various cell types, tissues and artifacts) and is a realistic illustration of clinical data pathologists work with.
To visualize results within the dataset, Figure 16 illustrates some nuclei detection results and the corresponding computed PI across 4 datasets. SMH patches are perceptively different to DeepSlides for example, which has more purple-ish hematoxylin and darker diaminobenzidine stains, and larger tumor cells. The Protein Atlas TMAs vary in staining color, scanning vendor, and the overall biomarker expression levels. Also, these images are largely of tumour regions. OVC TMAs, as previously mentioned, is canine mammary data. However, regardless of the image or tissue type, the model is able to accurately analyze the tumor cells effectively.
The average PI accuracy metrics are summarized in Table 7 over all five datasets using the PI difference, accuracy rate using Low <25, Medium 25-75 and High >75 PI ranges, along with the Pearson's correlation coefficient between manual and automatic PI. The difference in PI over all datasets is relatively low. Both the patched datasets (SMH and DeepSlides) as well as the OVC TMAs obtain a PI difference below 5% indicating that piNET is accurately measuring the PI. SMH WSIs achieves a PI difference of approximately 11%. Although this is higher than the three comparative datasets, WSIs are bigger images to analyze and they carry larger amounts of variation. A similar trend can be seen for the Pearson related coefficient value, where the three datasets consistently obtain a higher value and SMH WSIs achieves the lowest correlation. Analyzing WSIs is a challenging problem, due to the sheer size of the data and there are varying methods for analysis [21][22] [23]. Whole slide images can consist of 100,000x100,000 pixels [54]. Despite the large amount of data that is processed, our model is able to achieve a 11% PI difference between the pathologist's representative regions approach and the automated piNET method. Across the five datasets, three image types, four laboratory sources, the piNet model is able to achieve an PI Accuracy Rate of 85.2%, while being trained on data from a single source. The accuracy rate of the two ROI datasets (SMH and DeepSlides), combined, on average is 88% with PI difference of 4.23% and a R of 97%. The two TMA datasets also can obtain high PI Accuracy Rates: 97% on the OVC dataset, which is non-human breast cancer data, and 80% on the Protein Atlas dataset, both unseen by the model. The proliferation index difference of the WSI dataset is approximately 11%, with an accuracy rate of 76% and a PI Pearson related coefficient of 68%. The WSI dataset scores the lowest of the 5 datasets because of how the proliferation index ground truth values were obtained.  Figure 17. PI difference (left) and manual vs automated PI (right) over all four datasets.

Comparison to Other works
In addition to PI accuracy over all datasets, we also compare the performance of the proposed piNET system to other traditional and deep learning pipelines. The piNET framework was built based on the UNET architecture and ResUNet [49], which is what KiNet [20] was built upon, DenseUNet [50] and FCN8 [51] are other deep learning architectures investigated. All models are a variation of the original UNET convolutional neural network. Each model used the same learning rate (learning rate = 0.001), Adam optimizer, RMSE loss function over 100 epochs. The SMH training dataset that was previously used utilized in the other experiments are used for training and testing. The deep learning methods are also compared to our previous work on the IHC Color Histogram (IHCCH) [52]. It is an unsupervised method that uses a novel colour separation method to analyze hematoxylin (Ki67-) and diaminobenzidine (Ki67+) stains separately, and a gradient based nuclei detection method to detect cells in each channel and quantify PI. Each one of the five datasets for each method were tested for generalizability using the PI accuracy measurements. To examine the PI estimation accuracy for each approach, the manual and automatic PI were plotted in Figure 18 for all datasets combined. Of the four comparative methods, the piNET seems to be performing the best, with an average PI difference of 5.6% across four datasets (and overall lower variance and outliers). This demonstrates that piNET can generalize to new datasets well and better than competing methods. Table 8 represents the average PI difference categorized per dataset, where you are able to see it is able to achieve the lowest PI differences in comparison to the other models analyzed. The unsupervised method, IHCCH, obtained the lowest Average PI difference on SMH WSI, but the highest on DeepSlides and the SMH patches, hence this method did not excel in the generalizability test. The correlation between manual and automatically generated PI values per model, the Pearson correlation coefficient was calculated, per dataset, are shown in Table 9. The proposed method, piNet, is able to obtain the most consistent R value over all most datasets, including the WSIs. However, the SMH WSIs values are lower than other datasets due to the variability within a WSI. Breast cancer disease expresses heterogenous characteristic traits, where there is a high degree of variability within the tumor's pathology, this can be visually illustrated in Figure 3, where a WSI is annotated to illustrate the various levels of biomarker expression within a single image. FNC8, ResUNet and IHCCH method all perform well, but when analysing the WSIs, piNet outperforms the others. Finally, accuracy rate, Table 10, of each image classification of Low (PI<25), Medium (PI = 25-75) and High (PI>75) were analyzed. The accuracy rate, as mentioned, is a proportion of the models' ability to accurately classify the PI ranges accurately across the data sample. This was conducted across all five datasets, and overall, piNet outperforms the rest. The proposed method is ability to achieve the highest value consistently throughout each dataset. Hence, piNet is able to generalize to unseen data and is the optimal configuration for automated tumour nuclei detection for Ki67 stained breast tumour data.

Discussion
In digital pathology, there are a variety of challenges that algorithms must overcome when analyzing multi-institutional data , a large challenge being, variability. The datasets analyzed in this work vary in stain vendors, antibody biomarkers expression levels, laboratory staining protocols, different scanner vendors, scanner magnifications and image types (TMA, WSI, ROI). When using neural networks as a tool for biomedical images, the issue of overfitting to a single dataset can be a concern. The model may learn features from a single institution that do not generalize to other datasets from different laboratories due to varying staining protocols and scanner vendors. The proposed pipeline was implemented with robustness and generalizability in mind, using five datasets for validating these critical challenges. The multi-institutional dataset consists of four sources, three types of images, three scanner models and two types of mammary tissue (human and canine). An overall accuracy rate of 85.2% across five datasets was achieved, an average of 87.9% across ROIs, 85.5% for TMAs and the single WSI dataset obtained 76.4%, using the Low<25, Medium 25-75 and High >75, PI classification. To examine the WSI performance, outlier analysis was performed using the z-score. A single outlier was identified, and if removed from the SMH WSIs analysis, the PI difference changes from 10.997% to 10.460% and R increased from 0.684 to 0.75. Figure 19 illustrates the scatterplot of the automatic versus manual PI with the exclusion of a single outlier.
Analysis of WSIs face different challenges, in comparison to ROIs and TMAs as biopsy slides tend to have variations of tissue types, artifacts, folds and different types of cells (tumor and nontumor). When training a neural network model, if the patches that are being trained are "ideal", i.e. only tumour tissue and cells (no artifacts), the model may fail to generalize in WSI data. To overcome this, a novel concept of data partitions or ideal/nonideal labels (weak supervision) were introduced. As was shown, the data partition of 50:50 outputs F1 scores within the same range as the ideal data partition of 100:0, however the 50:50 data partition has reduced false positives in the presence of folds, artifacts and nontumour cells, indicating the system learned these features as part of the background class. Figure  11 illustrates the models ability to generalize to artifacts, stromal tissue and folds within WSIs, using the addition of weakly annotated images, experimenting three different data partitions. These patches were cropped from SMH WSIs, including the nonideal ROIs, and the first four images (a-d) illustrate a mixture of ideal and nonideal images and the last three patches (e-g) represent nonideal ROIs frequently found in the digital pathology images. Visually it is easy to see that the model trained on solely ideal patches (100:0), preforms poorly with many false positives when folds and artifacts are present in the patches. The model trained on a mixture of ideal and nonideal can adapt to these artifacts, commonly found on whole slides.
Due to variations in biomarker expression, tumour heterogeneity in combination with the WSIs large size, analyzing the entire WSI is not time efficient for pathologists. Automated methods can be used to alleviate some of these burdens and for WSI validation, the automated PI must be compared to the pathologist (expert) PI. The ground truths for the WSI dataset were obtained by a breast pathologist who utilized representative regions to get the PI score. The pathologist observes the overall biomarker expression throughout the tumour and focuses analysis on three regions (of 500 cells) that are representative of the overall tumour. These representative regions are used to compute the final wholeslide PI. In this work, we propose an innovative way of automatically finding only hotspot regions to provide a PI score. While we do not only use three regions, this method finds patches that have significant activity and use the cumulative tumour cell counts to measure whole-slide PI. Using this approach, there was a PI difference of 10.997%, on the WSI dataset and represents a low error rate and an accurate quantification without the known regions the pathologist analyzed. In the future, upon integration to whole-slide imaging viewers, we will consider scoring only regions that are selected by pathologists. Figure 19. Removal of an outlier in SMH WSIs of automated versus manual PI When analyzing the comparative methods, piNET achieves the lowest average PI difference across all datasets, and is the best for the DeepSlides, OVC TMAs, but is slightly outperformed by Res-NET on SMH patches and by the unsupervised IHC Color Histogram based pipeline on the WSI data (0.5% difference). The benefit of the current approach is that it is easy to change architectures (Res-NET, Dense-NET, etc.,) and see which is optimal for different datasets. This is the subject of future works. Considering the Pearson related coefficient and classification accuracy rate on the WSI; piNET surpasses the IHCCH method, where piNET achieves R=0.773 and an accuracy rate of 76.4% while IHCCH obtains R = 0.578 and a rate of accuracy of 72.7%. Therefore, although the PI differences were slightly higher on the IHCCH method, piNET is demonstrating robustness over all PI ranges through the correlation and PI accuracy. Furthermore, piNET has higher accuracy (generalizes) across all five datasets and the comparative models do not generalize as well to multi-center data.
The preprocessing steps used for TMAs and WSIs eliminate the need for users to annotate regions of interest. When analyzing TMAs, specifying regions is not necessary, since TMAs are relatively clean datasets and the entire region can be used for quantification. As shown in Figure 2, similarly both Protein Atlas and the OVC TMA datasets are visually very different looking, the stains and overall nuclei shape vary from the SMH dataset, which the dataset which the model was trained upon. Protein Atlas consists of ground truth values which are PI classification ranges of Low (<25), Medium  and High (>75) and the OVC TMAs ground truths are PI values, which can be classified into ranges. The proposed method will quantify PI of the inputted TMAs, which will then be classified based on the predicted PI value. A total of 60 Human Breast tissue TMAs from the Protein Atlas dataset were used for validation, each of these TMAs varied in image size, staining intensity, biomarker expression and antibody types (HPA000451, HPA001164, CAB000058, CAB068198). From the 60 TMAs the proposed method correctly classified 48 images, achieving an accuracy rate of 80.0%. The OVC dataset consists of 30 canine mammary tumor TMAs, of which 29 were correctly classified, obtaining an accuracy rate of 96.7%. Overall, the TMA dataset, combined with Protein Atlas and OVC, composed of 90 TMAs, a total of 77 were precisely classified, obtaining 85.6% accuracy rate. Figure 16 illustrates six images of the ground truth classifications or PI values with the original TMAs from Protein Atlas (g-i) and OVC (j-l), underneath you can visually see the overlaid predictions on each TMA with the corresponding classification or PI value. You can visually see the model's ability to ignore stromal tissue, epithelial cells and other non-tumor cells, while only examining the tumor cells of Ki67+/-. To demonstrate proof concept and clinical utility, we have also started looking at automated WSI PI scores and comparing them to PHH3. PHH3 is a mitotic activity marker, and there is interest in finding relationships between proliferation and mitotic activity. Ki67 stains nuclei of all active phases within the cell cycle, i.e G1, S and G2 phases. PHH3, phosphohistone H3, actively stains to mitotically active cells, which is also proliferation specific, specifically in the late G2 and metaphase stages [55]. In [56], a strong linear correlation was found between Ki67 and PHH3 in ductal carcinoma in situ (DCIS. In this work, based on invasive breast cancer disease, manual PHH3 counts were correlated to the automated Ki67 PI scores for the SMH WSI data. The PHH3 counts were acquired by an experienced breast pathologist, who evaluated 10 high power fields on adjacent tumour slides to the Ki67 stained tissue sections, with a field diameter of 0.65m, per corresponding PHH3 WSI. Comparing piNET's automated PI score to the PHH3 index a correlation of R = 0.57 obtained as shown in Figure 20. Comparing the three supervised methods and the unsupervised IHC Color Histogram based method, the piNET is able to achieve the highest correlation coefficient. Comparing the manual PI and PHH3 scores, a correlation value of R = 0.7425 was achieved. This may be due to the fact that the same regions were scored in both the Ki67 and the PHH3 slides by the pathologist. In comparison, the automated method (piNET) used all hotspot regions that had significant activity. In the future we will investigate further the impact of the hotspots selected. In [55], the study concluded that digital image analysis tools to analyze PHH3 and Ki67 stains, in combination with Mitotic Activity Index (MIA), has a strong prognostic value. The use of digital image analysis for Ki67 and PHH3 or MIA and Ki67 had stronger prognostic insight in contrast to Ki67 alone. Overall, the use of DIA in digital pathology has opportunities to improve patient quality of care, predict prognostic outcomes and survival rates.
Digital image analysis is a great tool to aid pathologists in their everyday workflow. It can relieve them of stressful tasks, overall workload, and the speed of analysis can improve patient quality of care. As artificial intelligence becomes more prominent on the biomedical space, in future works, we plan to minimize the piNET's error rate by considering novel stitching methods, investigating the impact of Otsu's threshold on prediction accuracy, data augmentation and more. The use of GANs to add diversity to the training dataset to improve prediction, as well as post processing methods to minimize error rates will also be investigated.

Conclusions
The proposed framework is a fully automatic Ki67 Proliferation Index Calculator, piNET, which calculates the proliferation index (PI) estimates over five datasets from four sources, three image types (TMAs, Patches and WSIs) and three scanner types, with varying colour and stain characteristics. The pipeline was trained on a single dataset, and when tested across datasets achieved an average PI difference of 5.603%, correlation coefficient R = 0.927, and an overall accuracy rate of 85.2% across five datasets, an average of 87.9% across ROIs, 85.5% for TMAs and the single WSI dataset obtained 76.4%, using the Low<25, Medium 25-75 and High >75, PI classification. The pipeline's PI difference, Pearson related coefficient and accuracy rate in combination with five datasets outperformed state-of-the-art methods. The success of the system can be attributed to a number of novelties and innovations in the system. Firstly, this tool is built based on deep learning, which can adapt to wide variability of medical images -and it was posed as a detection problem, instead of a segmentation one to mimic pathologists' workflow which improves accuracy and efficiency. Secondly, the system is trained purely on tumour cells, which avoids a usual necessary pre-requisite tumour segmentation step for Ki67 quantification, which helps to reduce inflated PI estimates and false positives from non-tumour cells. Thirdly, we introduce the concept of learning background regions through weak supervision, by providing the system with ideal and non-ideal patches which contain folds, artifacts and non tumour cells and tissue. Lastly, we provide a novel hotspot analysis that allows automated methods to only score patches from WSI that contain "significant" activity.