Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder

Yuen et al. developed a cloud-based database with 5,205 whole genomes from families with autism spectrum disorder (ASD). They identified 18 new candidate ASD-risk genes and approximately 100 risk genes and copy-number loci, which account for 11% of the cases. They also found that individuals bearing mutations in ASD-risk genes had lower adaptive ability. We are performing whole-genome sequencing of families with autism spectrum disorder (ASD) to build a resource (MSSNG) for subcategorizing the phenotypes and underlying genetic factors involved. Here we report sequencing of 5,205 samples from families with ASD, accompanied by clinical information, creating a database accessible on a cloud platform and through a controlled-access internet portal. We found an average of 73.8 de novo single nucleotide variants and 12.6 de novo insertions and deletions or copy number variations per ASD subject. We identified 18 new candidate ASD-risk genes and found that participants bearing mutations in susceptibility genes had significantly lower adaptive ability (P = 6 × 10−4). In 294 of 2,620 (11.2%) of ASD cases, a molecular basis could be determined and 7.2% of these carried copy number variations and/or chromosomal abnormalities, emphasizing the importance of detecting all forms of genetic variation as diagnostic and therapeutic targets in ASD.

r e S O u r C e Autism is a term coined about a century ago, derived from the Greek root referring to 'self ' , and describes a wide range of human interpersonal behaviors 1 . Autistic tendencies may be recognized in many individuals as part of human variation 2 , but these features can be severe and therefore disabling [3][4][5] . The most recent Diagnostic and Statistical Manual of Mental Disorders, the fifth edition (DSM-5), uses the single omnibus classification 'autism spectrum disorder' (ASD) to encompass what once were considered several distinct diagnostic entities (such as autistic disorder, Asperger's disorder and pervasive developmental disorder not otherwise specified). The spectrum concept reflects both the diversity among individuals in severity of symptoms, from mild to severe, and the recognition of overlap among a collection of clinically described disorders [6][7][8] . Features of this neurodevelopmental disorder, which has a worldwide population prevalence of ~1%, typically include impaired communication and social interaction, repetitive behaviors and restricted interests; these may also be associated with psychiatric, neurological or physical comorbidities and/or intellectual disability.
Despite the unitary diagnostic classification, ASD is a heterogeneous spectrum, both in clinical presentation and in terms of the underlying etiology. Individuals with ASD are increasingly seen in clinical genetics services, and ~10% have an identifiable genetic condition 4,9 . In fact, there are over 100 genetic disorders that can exhibit features of ASD (for example, Rett and Fragile X syndromes) 10 . Clearly, ASD is strongly associated with genetic factors. Dozens of susceptibility genes (for example, SHANK-and NRXN-family genes) 11,12 and copy number variation (CNV) loci (for example, 16p11.2 deletion and 15q11-q13 duplication) facilitate a molecular diagnosis in ~5-40% of ASD cases. The variation largely depends on the cohort examined (for example, syndromic or idiopathic) and the technology used (i.e., karyotyping, microarray, whole-exome sequencing) 5,9,13-17 . Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder r e s o u r c e In fact, the genetic predisposition toward ASD may be different for almost every individual 18 , making this a prime candidate for the coming age of precision medicine 6,7,19 . The first beneficiaries of a genetic diagnosis are young children, in whom formal diagnosis based on early developmental signs can be challenging but who benefit most from earlier behavioral intervention 3,8 . Understanding the genetic subtypes of ASD can also potentially inform prognosis, medical management and assessment of familial recurrence risk, and in the future, it may facilitate pharmacologic-intervention trials through stratification based on pathway profiles 14 . The vast heterogeneity also means that meticulous approaches are needed to catalog all the genetic factors that contribute to the phenotype and to consider how these interact with one another and with nongenomic elements.
To move forward toward the goal of understanding all of the genetic factors involved in ASD, we recognize the need to scan the genome in its entirety using whole genome sequencing (WGS) 14,18,20,21 on thousands of samples, if not tens of thousands (or more) 13,[22][23][24] . Risk variants that remain undiscovered to this point are expected to be individually rare 9,18,22 , possibly involved in complex combinations 18,25 and include single nucleotide variants (SNVs), small insertions and deletions (indels), more complex CNVs 14,18,20 and other structural alterations 15,26 . Some will reside in the ~98% noncoding genome largely unexplored by other microarray and exome sequencing studies 21,27,28 . Abundant genome sequences may help to resolve the role of common variants in ASD 2,29 , and integrating these data with those on rare variants will aid understanding of the issues of penetrance, variable expressivity and pleiotropic effects 4,6 .
Such research brings us to the realm of 'big data': massive sequence datasets from multitudes of individuals, requiring fast and intensive searches for meaningful patterns 30 . This is where cloud-based computing excels. Its capacity for bulk data storage, with efficient processing and built-in security, is ushering in a new model for data sharing, enabling access and collaboration across continents 31,32 .
In our MSSNG initiative (where omission of letters from the name represents the information about autism that is missing and yet to be uncovered), we are collecting whole genome sequences and detailed phenotypic information from individuals with ASD and their families and making these data widely available to the research community ( Fig. 1). Here we describe the MSSNG infrastructure, new analyses of the first 5,205 genomes and examples of how to use the resource.

Samples and phenotypes
Our pilot work 14,18,21 established four principles guiding the prioritization of the samples selected for WGS ( Table 1). (i) DNA from whole blood is preferred for detecting de novo mutations (especially for the proband) rather than DNA from cell lines, which may acquire variants in vitro. (ii) For a comprehensive ASD resource, it is important to sample families with different genetic characteristics in order to delineate the full spectrum of relevant variation (for example, heritable variants may differ from those arising de novo, and ascertainment biases can influence the frequency of genetic variants identified). (iii) Families with extensive phenotype data who are accessible to participate in further study are most informative for ASD. (iv) For an individual's genomic data to be used in ASD genomic research on this scale, consent must be in place, or obtainable, for WGS and for the data to be stored in a cloud-based platform.
Here we report on the WGS and analysis of 5,205 samples (5,193 unique individuals; 12 individuals were sequenced on two different platforms for technical replication or were from different DNA sources). From nine collections, these samples included 2,626 samples from 2,620 individuals (2,066 unique families) diagnosed with ASD ( Table 1). Of the total, 3,100 samples (3,090 individuals) are from simplex (one child with ASD) and 2,105 samples (2,103 individuals) are from multiplex families (two or more affected siblings); 1,745 samples (1,740 individuals) are from probands and 879 samples (878 individuals) are from affected siblings (with the exception of two affected individuals within this cohort who are the father and paternal grandfather of a proband). The samples from individuals with ASD include 2,067 from males (2,062 individuals) and 559 from females (558 individuals), a 3.7:1 male-to-female ratio. For 339 samples (46 probands and 293 parents) only cell-line DNA was available. Based on self-reports and confirmed with genotypes from WGS or microarrays, the majority (72.6%) of participants are of European ancestry (Supplementary Table 1 and Supplementary Fig. 1). We obtained informed consent for all individuals, as approved by the respective research ethics boards. We have also developed a prospective consent form for WGS in persons with ASD (Supplementary Note).
An ASD diagnosis was of research quality when it met criteria on one (n = 437) or both (n = 1,361) of the diagnostic measures ( Table 2) Autism Diagnostic Interview-Revised and Autism Diagnostic Observation Schedule; it was considered a clinical diagnosis (n = 819) when given by an expert clinician according to DSM-IV or -5. Additionally, many participants were assessed with standardized measures of intelligence (IQ), language and general adaptive function. Out of the 1,102 individuals with IQ data available, 216 (19.6%) had scores within the range for intellectual disability (full scale IQ < 70). Physical measurements are also available for some individuals (n = 1,022). Most samples of affected individuals (n = 1,658) were genotyped on high-resolution microarrays (see below) and some by karyotyping or gene-specific assays.

WGS
We used different WGS platforms as they became available and were tested to assess data quality characteristics. We present data generated using Complete Genomics (n = 1,233), Illumina HiSeq 2000 (n = 561) and HiSeq X (n = 3,411). The different WGS approaches and tools used for mapping and variant-calling yield data with different characteristics (Fig. 2), but all platforms reliably called SNVs and smaller indels (up to 100 bp; larger CNVs are described below). Relative to the human reference sequence (hg19), the average coverage across all samples on three platforms was 93%, with an average of 40.4× sequence depth ( Table 1 and Supplementary Table 1). On average, we detected 3,654,992 SNVs and 722,816 indels per sample (Supplementary Table 1). Systematic detection of sequence-level de novo mutations and candidate ASD-risk genes Identification of multiple de novo mutations occurring in the same gene from unrelated individuals highlighted candidate ASD-risk genes 13,22 . Modifying our previous approaches 14,18,21 (Online Methods), we studied those 1,239 families (1,627 parents + child trios) for which child and parental WGS data were available (excluding children whose DNA was derived from cell lines). We identified 86.4 spontaneous events per genome (73.8 SNVs and 12.6 indels) (Supplementary Tables 2 and 3), including 1.3 de novo exonic variants per genome 14,18,21 . Experimental validation rates for selected de novo SNVs and indels were 88.2% (494 of 560) and 85.1% (103 of 121), respectively. Most (58.3%) of the nonvalidated calls were caused by false-negative detection in the parents. In total, we detected 230 experimentally validated de novo loss-of-function (LOF) mutations ( Supplementary Tables 4 and 5). r e s o u r c e To increase the power for ASD-risk gene identification, we combined our data with the de novo mutations detected from other largescale systematic whole-exome or WGS studies, which included 2,864 de novo missense mutations and 599 de novo LOF mutations in 4,087 trios 13,23,[33][34][35] . To identify candidate ASD-risk genes, we initially considered genes likely to be mutation-intolerant based on the ExAC database 36 (with a probability of LOF intolerance rate (pLI) > 0.9 for LOF mutations; with z-score of z > 0.2 for missense mutations) and higher-than-expected mutation rate (false discovery rate (FDR) < 15%). This approach yielded 54 putative ASD-risk genes (Fig. 3a).
In addition to the de novo LOF mutations, we also combined the de novo or maternally inherited LOF mutations on the X chromosome in the affected males. We identified seven genes (MECP2, AFF2, FAM47A, KIAA2022, NLGN3, NLGN4X and PCDH11X) with multiple LOF mutations and with pLI > 0.65 (Fig. 3a). Taken together, 112 of the 2,620 subjects (4.3%) bear de novo LOF or missense mutations or inherited LOF  Figure 1 Schematic of sample and data processing in MSSNG. An executive committee oversees the project. The parameters for DNA sample selection and (genetic and phenotypic) data are managed by the committee, including consenting and ethics protocols. Coded identifiers for samples selected for WGS are posted as they are identified at MSSNG portal (http://www.mss.ng/research), so the ASD research community can monitor progress. Phenome data include subject information (identity number, year of birth, sex), family code (proband, parent, sibling), results of diagnostic tests (for example, Autism Diagnostic Observation Schedule (ADOS), Autism Diagnostic Interview-Revised (ADI-R), age at diagnosis, functional assessments, intelligence tests, body measurements and dysmorphic features). The database accommodates as much of this information as is available for each sample, although this varies widely. Future plans include incorporation of fields for comorbidities, related conditions, exposures, extended family history, interventions and other parameters as they become apparent. WGS technologies were Complete Genomics and Illumina HiSeq (2000 and X). WGS data are transferred to Google Genomics for data processing through the Google cloud. Ref-blocked genome Variant Call Format files (gVCFs) were generated and stored in Google cloud storage, and they were also processed for de novo mutation detection in the local cluster (for Complete Genomics data using filtering methods) and Google compute engine (for Illumina data using DenovoGear). Ref-blocked gVCFs and de novo mutations were annotated through the Google compute engine (using Annovar), which can be accessed through the BigQuery tables. Quality controls (QC) for the genomic data were performed in the local cluster and the Google cloud. The processed genetic data and the phenotypic data are accessible through the MSSNG Portal interface. The MSSNG database is designed to support incremental addition of data without changes in architecture, scaling to at least tens of thousands. New WGS and phenotype data are continually added to MSSNG as new batches of 1,000 samples are processed. DACO, data access committee; UPD, uniparental disomy; Ti/Tv, Transition to transversion ratio; IBS, identity by state; ID, identity; FASTQ, a text format for storing sequencing data with quality scores; BAM, binary alignment/map; IGV, Integrative Genomics Viewer. r e s o u r c e mutations in the 61 ASD-risk genes identified ( Fig. 3a and Supplementary Table 5). Among these, 43 were found to be ASD-risk genes in a previous meta-analysis of exome sequencing 24 or in other CNV studies 10,15,17 .

Detection of CNVs and chromosome abnormalities
We examined CNVs detected from WGS using two calling algorithms for samples sequenced in Illumina platforms or provided by Complete Genomics, and for a subset of samples we examined CNVs using additional microarray data (Online Methods). From the WGS derived CNVs, we detected 401.4 CNVs (>2 kb) per genome. We validated these using laboratory-based methods and/or WGS read-depth comparisons (Figs. 3b and 4 and Online Methods). We found that 189 of 2,620 (7.2%) subjects carry one or more pathogenic chromosomal abnormalities (n = 21), megabase CNVs (n = 25),   r e s o u r c e CNVs involving genomic disorder loci (n = 69) or CNVs affecting previously reported ASD-risk genes (n = 58), all determined by standard diagnostic reporting criteria 16,17,37 and many associated with known syndromes of which ASD can be a component feature 5,9,10 . There were also 22 CNVs that overlapped with the ASD-risk genes found in this study (Fig. 3a). Three of the CNVs were around or less than 10 kb, which were only detectable using WGS, and five were noncoding (Fig. 4c).

Medical genetics and functional properties
Among these 61 ASD-risk genes with sequence-level mutations, 18 had not previously been reported in the literature (CIC, CNOT3, DIP2C, MED13, PAX5, PHF3, SMARCC2, SRSF11, UBN2, DYNC1H1, AGAP2, ADCY3, CLASP1, MYO5A, TAF6, PCDH11X, KIAA2022 and FAM47A). For two of these putative novel ASD-risk genes, mutations were found in at least three families from our data ( Supplementary  Fig. 2); MED13, which is related to the intellectual disability gene MED13L 38 , carried putative damaging mutations in three families. PHF3 mutations, related to PHF2, which is known to be involved in ASD 24 , were found in four families; this gene encodes a PHD finger protein that regulates transcription by influencing chromatin structure 39 , a mechanism increasingly being implicated in ASD 17,21,40 .
Other mutation-intolerant genes implicated in three or more families included PER2 and HECTD4 (Supplementary Fig. 2). While they did not meet the statistical significance criteria shown in Figure 3a, they may still represent interesting functional candidates for ASD or associated complications in these individuals. Notably, of these 61 ASD susceptibility genes, 36 (59%) were associated with known syndromes and/or phenotypes in Online Mendelian Inheritance in Man (OMIM), of which CHD8, SHANK2 and NLGN3 are associated only with ASD. Most (78%) of the known syndromes and phenotypes involved were intellectual disability or other related disorders, which may highlight the pleiotropic effects of these genes (Supplementary Table 6). Combining the list of 61 genes with the CNV data identified in the WGS analysis yielded a framework map of ~100 ASD-linked loci or chromosomal abnormalities (all listed in Fig. 3) for molecular diagnostic comparisons, accounting for 11.2% (294 of 2,620) of the subjects included in this study. Consistent with our previous findings 18 , ASD-relevant mutations were often different in affected siblings (Supplementary Fig. 2).
To assess the functional impact of genotypes, we compared the phenotypes ( Table 2) of participants with de novo LOF mutations, participants with mutations in ASD-risk genes and participants with pathogenic CNVs and no identified mutation in ASD-risk genes or CNVs. Only the differences in Vineland Adaptive Behavior Score (FDR = 0.04) and IQ full scale standard score (FDR = 5 × 10 −4 ) were significant after multiple testing corrections using Benjamini Hochberg approach. Consistent with the previous studies 41, 42 , we found that individuals with pathogenic CNVs had significantly lower IQ (P = 2 × 10 −3 , median difference: −8.5, 95% CI: −16 to −3; Fig.  5a). Similarly, individuals with mutations in ASD-risk genes also had a trend of lower IQ (P = 0.02, median difference: −11, 95% CI: −15 to −1.6 × 10 −6 ). More strikingly, however, we found that those individuals carrying mutations in ASD-risk genes had significantly lower Vineland adaptive ability scores (P = 6 × 10 −4 , median difference: −6.5, 95% CI: −10 to −3; Fig. 5b). Given that the Vineland adaptive score captures adaptive functioning better than cognitive ability, it may suggest that the ASD-risk genes identified here are more specific to ASD behavioral traits than to general cognitive deficits 43 .
Many of the ASD-risk genes (80%; 49 of 61) identified were connected to gene networks (Fig. 6). These genes are enriched in synaptic transmission, transcriptional regulation and RNA processing r e s o u r c e functions, consistent with previous findings 17, 21 . We found that genes associated with transcriptional regulation and RNA processing are more often expressed in the brain prenatally, while synaptic-function-related genes are expressed in brain throughout development 44 . Our extended gene network revealed more interactions of genes. The candidate ASD-risk genes identified here, such as SRSF11, may closely interact with known ASD-risk genes, such as UPF3B (Fig. 6).

Data access and processing
All data are available in the MSSNG Google cloud or linked databases, with Autism Speaks as the MSSNG data custodian. A web-based portal was also developed (Figs. 1 and 4 and Supplementary Fig. 3).
Example queries include those for retrieving predicted damaging variants for one (or more) genes of interest or for retrieving all damaging de novo variants in a subject. In addition, variant annotations, sequence-read pile-ups (using the Integrative Genomics Viewer plugin) and psychometric measurements can be accessed. Researchers receive authorization from the MSSNG's data access committee via an online application (https://research.mss.ng/). Autism Speaks uses the Public Population Project in Genomics and Society to independently recommend access according to guidelines established by Autism Speaks and Public Population Project in Genomics and Society, based on consents provided by the data donors or on research ethics boardapproved waivers of consent for retrospective collections.

DISCUSSION
Considering the breadth of data in our pilot WGS studies 14,18,21 and the global impact of ASDs, it became evident that an autism WGS project, encouraging use of data in a manner as unrestricted as possible for wide-ranging research questions, would be beneficial. We could move quickly because investments had already been made in developing biosample repositories from individuals with ASD and their families who had provided consent for their material to be used for genetic research. The resources generated and managed through MSSNG support ASD research in three related areas: (i) new gene discovery and diagnostics, (ii) genetic disease pathways, mechanisms, X-linked LOF in male or de novo LOF in female (7 genes   Other LOF mutations, including inherited LOF mutations and LOF mutations with unknown inheritance (where parents are unavailable for testing), as well as CNVs found in the MSSNG cohort, are also indicated (except for genes found by higher-than-expected de novo missense mutation rates). MSSNG data are in green and published data are in yellow. Putative ASD-risk genes identified in this study carry an asterisk. ∆ indicate genes with druggable protein domains identified (Supplementary Table 6). (b) Pathogenic chromosomal abnormalities and CNVs identified as falling into one of four categories: chromosomal abnormalities; DECIPHER loci and other genomic disorders associated with ASD; large rare CNVs between 3-10 Mb and CNVs disrupting ASD candidate genes not described above in Figure 3a. Red, deletions; blue, duplications; purple, complex variants; #, CNVs shared between affected siblings; ‡, a CNV carried by an individual with a second pathogenic CNV; †, a CNV shared between individuals within an extended pedigree. Details can be found in Supplementary Table 8. Examples of CNVs affecting the NRXN1 and CHD8 genes, and the PTCHD1-AS noncoding gene identified from the WGS, are shown in Figure 4.    Figure 3b plus 17 additional CNVs, we derived a more accurate estimate of the breakpoints by visual inspection of read depth from the BAM file in the MSSNG browser. On average, the size difference between the CNV predicted by microarray data and the estimated size from WGS data was 6.9 kb, and for 31 of 49 CNVs (63%), the size of the CNV was smaller in the microarray data than WGS. For four CNVs, the WGS-resolved breakpoints altered the exons of genes being annotated as deleted or duplicated. In another case, this resulted in a CNV from microarray no longer being classified as pathogenic, as the revised breakpoints no longer included the coding sequence.
r e s o u r c e and pharmacologic development and trials, and (iii) open-science queries of any type, including exploring the significant heterogeneity underlying ASD, as well as the noncoding genome, most of which can only begin to be conceptualized now that the resource exists. First, using the statistical framework defining genes with higherthan-expected mutation rates, we have identified 18 new candidate genes for ASD or associated complications (Fig. 3 highlights ~100 diagnostic loci for ASD). Some of these newly detected mutations could reasonably be considered pathogenic and/or have possible implications for clinical management or genetic counseling for the subject or family members 4,8 . Examples include screening for cardiac defects or maturity-onset diabetes of the young in cases with 1q21.1 or 17q12 deletions, respectively; secondary prevention to avoid obesity in those with 16p11.2 microdeletion 4,45 ; and monitoring the use of growth factors (for example, IGF-1) in PTEN mutation carriers who may react negatively 8 . In numerous other cases, including all instances of CNV and chromosomal abnormalities, detection of the mutation would lead to prioritization of these individuals for comprehensive clinical assessment and referral for earlier intervention 3,4 and could end long-sought questions of causation 8,16 . The roles of other mutations in ASD need to be closely followed in the literature. Having the data accessible in a browser portal will continue to enable diagnosticians worldwide to remotely perform genotype-phenotype explorations of new testing results against the latest WGS research data.
Second, 80% of the 61 ASD-risk genes on our refined list are connected in networks representing potential targets for pharmacologic intervention 19 . Sixteen genes contained subdomains that could be targeted by pharmaceutical intervention and seven contained subdomains for which specific drug-gene interactions are known (Supplementary Table 6). For example, individuals with mutations in SCN2A identify carriers as potential candidates for drug trials involving allosteric modulators of GABA receptors 46 . By extending the search to genes affected by CNVs and/or to proteins that interact with or regulate these genes, the list of potential targets for modulating the pathways impacted in ASD expands. Additionally, the focus here was on gene products that could be pharmacologically modulated with small molecules, but the use of technologies such as oligonucleotide-based therapeutics or gene therapy further increases the list of  Figure 6 Interaction similarity network of ASD-risk genes. Connections represent gene similarity based on physical protein interactions and pathway interactions. Connection thickness is proportional to the fraction of interaction partners shared by the connected genes. The size of the node for each gene is proportional to the total mutation count (Fig. 3). Circles, genes associated with LOF mutations; diamonds, genes associated with missense mutations. Node colors correspond to the BrainSpan brain expression principal component 1 (yellow, prenatal; blue, postnatal; light blue, balanced; gray, undetermined). The labels of ASD-risk genes identified in this study are displayed in red. The network was visualized using Cytoscape. r e s o u r c e potential interventions that could be used in addressing the biological deficits created by loss of function in these genes.
Third, efforts at solving the problem of the significant heterogeneity involved in ASDs will be furthered by expanding this initiative, including by partnering with other WGS projects and coordinating all information in a single open-science platform, for which MSSNG provides a foundation. Regarding genotypic heterogeneity, using established criteria 17,47 , in this study we were able to resolve a molecular basis in 11.2% of ASD cases; this tally should rise as we acquire more rare variant data to compare against 22 . A notable outcome of our study was validation of the findings that CNV and chromosomal alterations contribute significantly to ASD (Fig. 3b). These genetic alterations also often encompass multiple genes (Fig. 3b), isoforms of single genes 48 and their regulatory elements 27,49 , and they can include noncoding genes such as the known ASD-risk gene PTCHD1-AS (Fig. 4c) and combinations of mutations (seven cases have both ASD-relevant SNVs and CNVs), necessitating the use of a comprehensive technology like WGS.
Regarding phenotypic heterogeneity, our previous analysis of a subset of multiplex families in the MSSNG resource showed that siblings with discordant mutations tended to demonstrate more between-sibling variability than those who shared a risk variant 18 . In this study, with a larger sample size and access to richer phenotypic measures, our data reveal that participants bearing mutations in ASD susceptibility genes had lower adaptive ability than participants without such identified risk variants. Adaptive functioning as measured here using the Vineland Adaptive Behavior score was composed of estimates of socialization, communication, daily living skills and motor skills 50 . This finding needs to be further dissected to determine whether the association with risk variants is specific to one of these subdomains or is more linked to the composite. The same is true for the association with IQ.
Large-scale computing for this project can be done from within the MSSNG cloud and/or using the investigator's local resources. Our intent is that researchers will move new code to the data (i.e., to access data for analysis with the cloud platform), in particular for massive WGS and phenotypic queries, including performing meta-analyses incorporating their own data. Ultimately, the new information arising should then be broadly shared. MSSNG researchers, for example, can use open-standard tools supported by the Global Alliance for Genomics and Health application programming interfaces 31 so that tools developed by individual groups can be applied to data published elsewhere. This kind of continued interactive participation in shared open-access research will continue to enable a better understanding of ASD and set a course for other genomic initiatives in neuroscience.

METHODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

ONLINE METHODS
Samples for wgS and data access policy. We collected 5,205 unique samples (5,193 individuals) from 2,066 unique families with children diagnosed with ASD. The cohort consists of 2,618 children with ASD (1,740 probands and 878 affected siblings). Details on the collections the samples were drawn from are described in Supplementary Table 7. Data collection and analysis were not performed blind to the conditions of the experiments. We recruited other siblings and members of the family across generations whenever possible. We obtained informed consents, or waivers of consent, which were approved by the Western Institutional Review Board, Montreal Children's Hospital-McGill University Health Centre Research Ethics Board, McMaster University-Hamilton Integrated Research Ethics Board, Eastern Health Research Ethics Board, Holland Bloorview Research Ethics Board and The Hospital for Sick Children Research Ethics Board. According to the consent or waiver of consent forms, participants agreed to make their coded genetic and phenotypic information available to researchers to help in the discovery of the DNA alterations underlying ASD and ASD-related disorders. Their coded data were uploaded to the MSSNG Google cloud database. Based on the current approved consent form, genomic and phenotypic data can be submitted to this type of online database, provided that all data is coded, that access to data is controlled and that specific data access policies are in place. The data access policy generated by the legal team at Autism Speaks was modeled on accepted practices in international research consortia, such as the International Cancer Genome Consortium (ICGC). A researcher seeking to gain access to the data and perform their analyses in the cloud-based environment or download the data to use their own analysis tools must apply for access following the process outlined in the Data Access Policy (Supplementary Note). Sequencing data is coded and access to the data is controlled and governed via an research ethics boards/IRB-approved data access policy (Western Institution Review Board for use of AGRE data and other review boards for specific sites contributing data). At the time of writing, 7,214 samples from individuals with ASD or their family members were analyzed by WGS and available. The goal of this project was to collect a large cohort of families to facilitate genetic analysis as previously described 22 . No statistical methods were used to predetermine sample sizes.
Researchers can access data at multiple stages and levels of analysis: (i) through the MSSNG portal, which provides an interface for searching, filtering and browsing the final, curated variants, annotations and statistics via a web application; (ii) using BigQuery tables (a petabyte-scale distributed data warehousing (storage) and analytics (query) service) under the user's own account to perform custom queries, which allows flexibility for development of new analyses and applications; and (iii) via the user's own Google cloud storage (GCS) bucket on request for raw sequencing data and results of primary mapping (BAM files) and variant calling (MasterVar, VCF, gVCF) processes. At the time of writing, 75 researchers from 17 institutions in four countries (Canada, South Korea, UK and USA) were approved for access to MSSNG data.
wgS and data storage. We extracted DNA from whole-blood or lymphoblastderived cell lines (LCLs). We assessed the DNA quality using PicoGreen and gel electrophoresis. We sequenced the 5,205 genomes using Complete Genomics (n = 1,233), Illumina HiSeq 2000 (n = 561) and HiSeq X (n = 3,411) technology. WGS by Complete Genomics (Mountain View, CA) and Illumina HiSeq 2000 were performed as previously described 14,18,21 . For WGS by Illumina HiSeq X, we used between 100 ng and 1 µg of genomic DNA for genomic library preparation and WGS. We quantified DNA samples using a Qubit High Sensitivity Assay and checked sample purity using the Nanodrop OD260/280 ratio. We used the Illumina TruSeq Nano DNA Library Prep Kit following the manufacturer's recommended protocol. In brief, we fragmented the DNA into 350-bp average lengths using sonication on a Covaris LE220 instrument. The fragmented DNA was end-repaired, A-tailed and indexed using TruSeq Illumina adapters with overhang-T added to the DNA. We validated the libraries on a Bioanalyzer DNA High Sensitivity chip to check for size and absence of primer dimers and quantified them by qPCR using a Kapa Library Quantification Illumina/ABI Prism Kit protocol (KAPA Biosystems). We pooled the validated libraries in equimolar quantities and sequenced the paired-end reads of 150-bp lengths on an Illumina HiSeq X platform following Illumina's recommended protocol.
For samples sequenced on Illumina platforms, raw reads were uploaded to GCS. For samples sequenced by Complete Genomics, only analyzed results from the Complete Genomics pipeline were uploaded to GCS. Results of variant calling and filtering pipelines were also uploaded to GCS for permanent archiving and sharing with MSSNG researchers, and they were processed into BigQuery tables for access via the portal.
Alignment and variant calling. Alignment and variant-calling for genomes sequenced by Complete Genomics were performed as previously described 18 . We processed genomes sequenced by Illumina platforms on Google cloud using Google Genomics APIs with a pipeline that follows the best practices recommended by the Broad Institute 51 . We inputted primarily the paired FASTQ files (with a few samples processed from binary alignment maps (BAMs)). We aligned the reads to the reference genome (build GRCh37) using the Burrows-Wheeler Aligner (BWA, version 0.7.10). We removed duplicated reads using Picard (version 1.117). We performed local realignment and quality recalibration with the Genome Analysis Toolkit (GATK; version 3.3) on each chromosome. We detected single SNVs and indels using GATK with HaplotypeCaller. We extracted nonvariant segments (reference intervals) that were emitted by HaplotypeCaller using a custom Java program (NonVariantSiteFilter.jar). The output file was generated in the universal variant call format (VCF). Both the VCF output by this process and the calls from Complete Genomics samples (MasterVar) were converted to separate variants and reference blocks in VCF and saved in GCS. The variants and reference blocks were imported into Google Genomics and then exported to a BigQuery table.
Sample quality controls. We performed quality control checks for samples using codes from Google Genomics Codelab following the methodology developed previously 52 . We performed (i) duplicate samples, (ii) samples per platform, (iii) genome call rate, (iv) missingness rate, (v) singleton rate, (vi) heterozygosity rate, (vii) homozygosity rate, (viii) Ti/Tv ratio, (ix) inbreeding coefficient and (x) sex inference checks. To reduce the batch and cross-platform effects for analysis, we applied additional quality filters to remove variants caused by technical issues. We required variants to have genotype quality scores (GQ for Illumina; VAF for Complete Genomics) of at least 99. Since our analyses focused on rare variants, we required variants to be found in the population less than 1% of the time. We also required the variant to be called more than 95% of the time as a reference allele and less than 1% of time as a variant in the parents. Batch and cross-platform biases were substantially reduced after filtering (Fig. 2). Detailed procedures and findings can be found in the Supplementary Note. detection of de novo SNVs and indels. As described previously 14,18,21 , we considered a variant to be a potential de novo mutation when it was inconsistent with Mendelian inheritance (present in the offspring, but not in either parent). For a variant in the autosomal region, we considered it to be a potential de novo mutation when there was a heterozygous alternative genotype in the offspring and homozygous reference genotypes in both parents. For a variant in the X chromosome, we considered male and female offspring with different criteria: in sexspecific regions of male offspring, we considered it to be a de novo variant when there was a hemizygous alternative genotype in the offspring and a homozygous reference genotype in the mother. We considered X-linked variants in female offspring and X-linked variants in pseudo-autosomal regions in male offspring as for autosomal regions. We considered a variant in the Y chromosome to be de novo when a hemizygous alternative variant was present in the male offspring but absent at the same position in the father.
We performed de novo SNV and indel detection from Complete Genomics data as previously described 18 , except that here we considered both parents with each offspring in the same family as separate trios. We used DenovoGear (version 0.5.4) for de novo SNV and indel detection on genomes sequenced by Illumina platforms, running the program on each chromosome. We also extracted high-quality variants (i.e., those that passed the quality filter) that were inconsistent with Mendelian inheritance based on GATK with allelic frequency among parents less than 1%. We defined a putative de novo SNV as an SNV with a pp_DNM from DenovoGear greater than 0.95 and overlap with GATK calls (GQ of at least 99). We defined a putative de novo indel as an indel found by both DenovoGear and GATK methods with the same start site. In addition, we performed a manual inspection on the quality of variants by inspecting reads from the BAM for variants found to be de novo by DenovoGear or GATK for ASD-risk genes.
We used Primer 3 to design primers spanning at least 100 bp upstream and downstream of a putative variant. In designing primers, we avoided regions of repetitive elements, segmental duplication or known SNPs. We randomly selected putative de novo SNVs from the Illumina WGS data of two probands (2-1266-003 and 3-0141-000) in the trio families and from Complete Genomics WGS data of one proband (2-1292-003) in a quartet family for Sanger sequencing (Supplementary Table 3). In addition, we validated all the de novo LOF SNVs and indels and reported pathogenic variants from all families by Sanger sequencing, using DNA from whole blood. Candidate regions were amplified by PCR for all families and assayed by Sanger sequencing (Supplementary Table 3).
No substantial difference in de novo mutation detection rate (average number of de novo mutation for CG: 88.9; Illumina: 85.2) or distribution ( Supplementary  Fig. 4) was found between platforms. There was a difference in the validation rate for de novo LOF mutations between two platforms (CG: 78%; Illumina: 92%), but samples from CG only constituted 23.7% of the total samples ( Table 1). We found that 27.9% of total exonic de novo mutations were contributed by CG, which is proportional to the given number of samples.
Identification ASd-risk genes. We performed a meta-analysis of de novo mutations for identification of ASD-risk genes. We concatenated the de novo mutations detected here with those detected from five previous large-scale, systematic whole-exome or WGS studies (from a total of 4,087 trios) 13,23,[33][34][35] . To avoid sample duplication, we checked through registries to ensure that none of the samples from MSSNG were studied in the previous large-scale exome or WGS studies. Since the raw data for the previous studies were not easily accessible, we could not identify duplicate samples based on genotypes. However, we checked the possibility of duplicated samples based on the de novo mutation profiles given by each study. Focusing on exonic de novo mutations examined, there were only four pairs of samples sharing the same de novo variant in the 4,087 trios examined. Two of these pairs were found within same studies. While these pairs could be derived from the same samples, they only constituted a small portion (<0.1%) of the cohort.
The variants were reannotated using our custom annotation pipeline (see below). There were a total of 2,864 de novo missense mutations and 599 de novo LOF mutations reported. Combining them with the de novo mutations detected in the present study (Supplementary Table 3), we performed a statistical analysis to identify genes with higher than expected mutation rate based on the model framework described previously 47 . Observed rate of de novo mutation for each gene was compared with its expected rate using a binominal test. To address potential bias on mutation rate between observed data and expected simulation, we rescaled the statistics with a constant, k, which was derived from the ratio of the overall de novo mutation rate observed to that expected. For LOF mutations, we required genes to have at least two de novo LOF mutations and a probability of loss-of-function intolerant rate (pLI) > 0.9. For missense mutations, we required genes to have at least four de novo missense mutations and a missense z-score > 0.2 (derived based on scores from known ASD-risk genes and comparable gene number distributions with pLI > 0.9). We corrected P-values with the Benjamini-Hochberg procedure and defined candidate ASD-risk genes as having a false discovery rate (FDR) < 0.15. We also analyzed X-linked LOF mutations. We defined candidate ASD-risk genes as having at least two LOF mutations found in males or de novo LOF in females, and we required genes to have pLI > 0.65 (since X-linked genes and autosomal genes have different constraints, we derived the score for X-linked genes from the score of MECP2: pLI = 0.69).

SNV and indel annotation.
We annotated the variants on the Google cloud engine using a custom pipeline based on Annovar, as previously described 14,18,21,53 . The annotation process infrastructure includes a separate internal portal, which automates distribution of annotation jobs in parallel over a dedicated virtual machine (VM)-based computing infrastructure cluster. The variant annotations were then exported to a BigQuery table. Variant information was downloaded from databases for the allele frequency (using the Exome aggregation Consortium 36 , 1000 Genomes 54 , NHLBI-ESP 55  genetic network construction. For each of the 61 ASD-risk genes, we retrieved the top 200 closely interacting gene neighbors using GeneMANIA 64 . We generated an aggregate interaction network in GeneMANIA, based on physical protein interaction and pathways with the 'gene ontology biological process' weighting option. We then computed a pairwise-weighted Jaccard index to model the similarity of the genes' interacting neighborhoods, resulting in the final gene network (Fig. 6). Finally, we performed hierarchical clustering and manually optimized the weighted Jaccard index cutoff for displaying the gene network in Cytoscape 65 , so the gene clusters suggested by the network layout algorithm were similar to the clusters suggested by hierarchical clustering. cNV analysis. For samples sequenced on Illumina platforms, we detected CNVs from WGS for each sample using two algorithms, CNVnator 66 and ERDS 67 , as previously described 14,21 . Algorithms were run using their default parameters. We used 500 bp as the window size for CNVnator. For CNVnator, we removed calls with >50% of q0 (zero mapping quality) reads within the CNV regions (q0 filter), except for the homozygous autosomal deletions or hemizygous X-linked deletions in males (with normalized average read depth; NRD < 0.03). We defined stringent calls as those that were called by both algorithms (with 50% overlap). For samples sequenced by Complete Genomics, CNV calls were taken as provided as described previously 18 . Sixty-five samples had a total number of CNVs ≥ 2 s.d. of the average number, including 28 affected individuals. We retained CNVs with size > 2 kb. We defined a rare CNV as one found less than 1% of the time in the parents, in less than 0.1% in the population from microarray data and overlapping with a region that is at least 75% copy-number-stable according to the CNV map of the human genome 68 . We also performed a manual inspection on the quality of CNVs by inspecting reads from the BAM for confirmation.
We also analyzed CNV data for 1,658 affected individuals genotyped on one or more of the following microarrays: Illumina 1M single or duo; Affymetrix CytoScan HD; Affymetrix single-nucleotide polymorphism 6.0; Illumina OMNI 2.5M; Agilent 1M CGH array; Affymetrix GeneChip Human Mapping 500K Array (Supplementary Table 8). We defined rare, stringent CNVs as previously described 17 and also required them to overlap with a region that is at least 75% copy-number-stable according to the CNV map of the human genome 68 .
We determined pathogenic CNVs as those resulting in chromosome abnormities; large rare CNVs between 3 and 10 Mb in size; genomic disorders with recurrent breakpoints (including all DECIPHER loci and other loci known to be associated with ASD) 10,17 and CNVs impacting coding exons of known ASD-risk genes or noncoding exons of PTCHD1-AS or MBD5. All pathogenic CNVs found by microarray were found by WGS, except CNVs that were filtered out based on the quality issues or size difference (Supplementary Table 8).
Statistical tests. We compared the scores for phenotype tests ( Table 2) available for four groups of samples: (i) samples with pathogenic CNVs (n = 177), (ii) de novo LOF mutations (n = 170), (iii) mutations in ASD-risk genes (n = 116) and (iv) other samples without any of these mutations (n = 2,153). Samples included in each category were mutually exclusive to each other and there were no replicates (randomization not applicable). Phenotype tests investigated included (i) Vineland Adaptive Behavior standard score, (ii) Repetitive Behavior Scale-Revised overall score, (iii) Repetitive Behavior Scale overall score, (iv) Social Responsiveness total T-score, (v) Social Communication Questionnaire total score, (vi) language OWLS total standard score, (vii) Language PLS total standard score, (viii) IQ full scale standard score and (ix) nonverbal IQ standard score. Data distribution was assumed to be normal, but this was not formally tested. We performed ANOVA for the mean difference of the four groups in each of these tests: (i) degree of freedom (df) = 3, (ii) df = 3 in, (iii) df = 2, (iv) df = 3, (v) df = 3, (vi) df = 3, (vii) df = 3, (viii) df = 3 and (ix) df = 1. The differences between samples with mutations and samples without mutations were further tested using Wilcoxon signed-rank tests (one-sided) since they were not normally distributed. A Supplementary methods checklist is available.
gene-based drug targets. The 61 genes listed in Figure 3a were annotated using D.A.V.I.D. 69 for a number of gene-ontology categories and structural elements including PFAM subdomains. The PFAM labels were compared to lists of protein families generally considered to be druggable 70,71 . To identify previously validated gene-drug interactions, the 61-gene list was used to search the Drug Gene Interaction Database 72 (http://dgidb.genome.wustl.edu/). Only those results with associated peer-reviewed publications were reported. data Availability. Sequence data can be accessed via the European Genomephenome Archive (EGA) using the accession codes EGAS00001001023 and EGAS00001000850, and all sequence data can be accessed through the MSSNG database on Google Genomics (for access, see http://www.mss.ng/researchers). code availability. Codes used in MSSNG database can be found in the Supplementary Note. Others are available upon reasonable request.