Promoter G-quadruplexes and transcription factors cooperate to shape the cell type-specific transcriptome

Cell identity is maintained by activation of cell-specific gene programs, regulated by epigenetic marks, transcription factors and chromatin organization. DNA G-quadruplex (G4)-folded regions in cells were reported to be associated with either increased or decreased transcriptional activity. By G4-ChIP-seq/RNA-seq analysis on liposarcoma cells we confirmed that G4s in promoters are invariably associated with high transcription levels in open chromatin. Comparing G4 presence, location and transcript levels in liposarcoma cells to available data on keratinocytes, we showed that the same promoter sequences of the same genes in the two cell lines had different G4-folding state: high transcript levels consistently associated with G4-folding. Transcription factors AP-1 and SP1, whose binding sites were the most significantly represented in G4-folded sequences, coimmunoprecipitated with their G4-folded promoters. Thus, G4s and their associated transcription factors cooperate to determine cell-specific transcriptional programs, making G4s to strongly emerge as new epigenetic regulators of the transcription machinery.

DNA G4s have been indicated as physical obstacles for RNA polymerase processivity, and thus transcriptional repressors in vitro [6][7][8][9] . However, recent evidence in cells suggests that several factors are involved in the outcome of G4-mediated transcription regulation: the template/coding strand harbouring the G4 sequence 10 , transcription factors (TFs) [11][12][13] , chromatin remodelling proteins 14 , interaction with distal genomic loci 15 ; in this context, G4s where rather linked to high gene expression 4,16 . Since G4s are dynamic structures that can be induced or unfolded upon interaction with G4-binding proteins, we reasoned that the G4 landscape could be unique to each transcriptional program, hence cell type. We thus compared the G4 landscape in two cell types and explored the specific interactions at the G4-folded and transcriptionally active sequences to ascertain the mechanism behind G4-mediated transcription regulation.
We investigated by G4 ChIP-seq 17   Non-random distribution of G4s in the human genome has been previously reported 22 . In line with this, ChIP-G4s in 93T449 cells were strikingly prevalent in promoter regions (79%) (Figure 1A), fact that strongly supports G4 involvement in transcription regulation. RNA-seq data (Supplementary File 2) and subsequent analysis of the percentage of expressed genes that contained at least one ChIP-G4 vs G4-location showed that: i) G4-containing genes were more prone to be actively expressed with respect to G4-depleted genes; ii) genes with G4s in promoter regions (annotated as promoter-TSS and 5'-UTR) were almost 100% expressed ( Figure 1B) and iii) produced significantly higher amounts of transcripts ( Figure 1C). In particular, there was positive correlation between higher gene expression and the presence of G4s, especially when G4s were within 1 kb from the TSS (Supplementary Figure 3A and B); this trend was smoothed as the distance from the TSS increased ( Figure 1D, top). Moreover, G4s mostly clustered within 250 bp from the TSS and the closer they were to the TSS, the higher the expression of the corresponding gene ( Figure 1D, bottom). Representative regions showing the relative position of G4s with respect to gene TSS are reported in Figure 1F. We next compared the amount of ChIP-G4s to that of putative G4s (pG4s) calculated by Quadparser and of "observed G4s" (oG4s), i.e. genomic regions previously observed to stop polymerase progression in vitro 23 . Almost all ChIP-G4s were also found as pG4s (~100%) and oG4s (~95%) when considering all genes; in gene promoters a similar percentage was found for ChIP-G4s matching pG4s (~100%), while a much lower percentage was found for oG4s (~23%) (Figure 1E). This difference can be explained by the fact that the majority of ChIP-G4s correspond to canonical G4 motifs ( Supplementary Figure 2A), while more than 60% of oG4s contained long loops or non-canonical patterns 23 . When analyzing the expression level of promoters in these three G4 categories, we found a significant increase of transcripts in ChIP-G4s. The other categories yielded significantly lower levels (Supplementary Figure 4). These data indicate that only the sequences that are actually folded into G4 are associated to high transcript levels, while pG4s or unfolded G4-forming sequences are not.
Analysis of accessible chromatin regions by Omni-ATAC-seq 24 showed that 93% of promoter ChIP-G4s  Figure 5B), however, only 12% of these overlapped with G4s ( Figure 2B), suggesting that the G4 regulatory role is limited to specific DNA region.   To test the above hypothesis that G4s in promoters stimulate TF binding, we first predicted the presence of putative TF binding sites (TFBS) within the ChIP-G4s in 93T449 cells using HOMER tool (http://homer.ucsd.edu/homer/). The most relevant TFBS corresponded to AP-1 (24.08% in ChIP vs 3.36% in background sequences) and Sp1 (24.98% in ChIP vs 9.45% in background sequences) ( Figure   4A, Supplementary Files 8.1.1, 8.2.1). AP-1 TFBS were mostly centred at the G4 peak, while Sp1 TFBS were more broadly distributed within the peak, suggesting that they can either overlap or be adjacent to the G4 (Figure 4B). When TFBS were calculated on an extended promoter region (-1000 to +750 bp from TSS) of either ChIP-G4-enriched or depleted genes, different TFBS were found and with consistently lower representation (Supplementary Figure 9). When considering all the target genes of AP-1 and Sp1 (data from ENCODE database), those presenting G4s (ChIP-G4s, pG4s and oG4s) were highly enriched with respect to those lacking G4s; among G4-containing categories, ChIP-G4s were the most enriched, suggesting association between TF binding and G4-folding capacity ( Figure 4C). As a further proof, AP-1 and SP1 were co-immunoprecipitated using anti-G4 primary antibody (Figure 4D, lanes 2-3) and, vice versa, primary antibodies against SP1 and AP-1 co-immunoprecipitated the anti-G4 antibody added to the samples as marker of G4 regions (Figure 4D, lanes 4-8), proof that these TFs interact with their binding sites in the presence of folded G4s ( Figure 4D). These data indicate that TFs AP-1 and SP1 are strictly linked to G4s and suggest that G4s are exploited by cells to facilitate TF interaction with the DNA: this effect is reached by G4-mediated exposure of the binding region or altered DNA methylation state, as G4s were reported to prevent CpG island methylation, condition which is required for transcription initiation 3,14,27 . An alternative possibility is that TF binding to G4 regions stimulates G4 folding to maintain the underneath chromatin in a transcription permissive state. Our data also indicate that the use of an anti-G4 antibody is a powerful tool to investigate the G4 landscape in cells. There have been doubts about the possible induction of G4 folding by this approach.
However, our data help dissipate this doubt since G4s have been differently detected in the same sequence in different cell types, fact that indicates the actual recognition of a previously formed G4 rather than G4 induction.
Overall our results point to G4s as key epigenetic regulatory elements that, in association with transcription factors, activate gene expression.

Primers and compounds
Desalted primers were purchased from Eurofins (Munich, Germany). A detailed list of primers name and sequence is available in the Supplementary Table 1.

G-quadruplex ChIP-qPCR
The immunoprecipitated sample (IP and Mock) and the input were used to quantify G4 enrichment via qPCR, using Fast SYBR PCR mix (Applied Biosystems), with a LightCycler 480 (Roche) quantitative PCR machine. Cycling conditions were 95 °C for 20 s followed by 50 cycles of 3 s at 95 °C and 30 s at 60 °C.
We employed primer pairs that target G4 ChIP positive and negative regions (Supplementary Table   1). Relative enrichments were derived with respect to their inputs.

G4 ChIP-seq library preparation
The immunoprecipitated sample and the input were subjected to Nextera library preparation as HiSeq 1500 platform in single-end using 50-bp reads. The experiment was repeated twice.

Putative G4s prediction
The presence of pG4s in the BG4 immunoprecipitated peaks was assessed by two different computational tools: i) Quadparser, based on a regular expression matching algorithm 19 ; ii) G4Hunter, based on a scored sliding window approach that considers also the G-richness and skewness of all bases in each window 20,28 . A fasta file containing the sequences of BG4 immunoprecipitated regions was obtained from HOMER output by mean of bedtools 29 and used as input of both prediction algorithms.

Comparison of pG4s, oG4s and ChIP-G4s
To obtain direct evidence of the association between G4s in the folded state and transcriptional level, three G4 categories were compared, namely: pG4s, oG4s and ChIP-G4s. This comparison was used to distinguish the transcriptional effect of a G4 permissive sequence versus the folded G4 structure. pG4s

Identification of G4-associated transcription factors binding sites
Bed files of 93T449 ChIP-G4s, Quadparser predicted pG4s (G 2-5, loop 0-12) and oG4s from Chamber et al. 23 were used to annotate peaks and extract genes with promoter G4s by mean of HOMER software.
The complete list of human genes annotated on the GRCh38-hg38 reference genome was retrieved from BioMart Ensembl database (http://www.ensembl.org/biomart/martview) and used to divide genes into promoter G4 containing/depleted genes according to the three categories of ChIP-G4s

Co-immunoprecipitation of G4s and transcription factors and Western blotting:
Co-immunoprecipitations to detect G4s interaction with TFs were performed by immunoprecipitating either BG4-G4complex, or AP-1 and Sp1 TFs. In the first case G4s immunoprecipitation was performed as described for the BG4 ChIP-qPCR procedure, but after the last wash, three replicates of BG4 captured material were collected and eluted in 35 µl Elution buffer (10 mM Tris-HCl pH 8.0, 5 mM EDTA, 300 mM NaCl and 0.5 % SDS) by incubating beads for 1 h at 55 °C to revert the cross-linking without degrading proteins. The collected proteins are expected to be G4s interacting proteins, since they were co-immunoprecipitated together with BG4. The presence of BG4 (internal positive control) and the TFs AP-1 and Sp1 was then checked by Western Blot as described below.
For the immunoprecipitation of AP-1 and Sp1, the fixed and sheared chromatin from 1.5 million cells was treated with 0.7 mg/ml RNase A (ThermoFisher) for 30 min at 37 °C, precleared with protein-A magnetic beads (Pierce™ ThermoFisher) for 30 min at 4°C under rotation to reduce the background due to the non specific adhesion of sample to the beads. 20 μl protein-A magnetic beads (Pierce™ ChIPAb+™ Merck #17-601 , anti-FLAG Sigma Aldrich #F3165, anti-NCL Santa Cruz Biotechnology #sc-803), washed in TBS-tween 0.1%, next incubated with secondary goat anti-rabbit and goat anti-mouse HRP antibody (Millipore #). Images were acquired on a Uvitec instrument by reading HRP bioluminescence.

Data analysis
Raw FASTQ reads were trimmed to remove adaptor contamination and aligned to the primary assembly of the human reference genome version GRCh38 using Bowtie1 (http://bowtiebio.sourceforge.net/index.shtml). Reads with more than 2 mismatches and multimapped reads were excluded from further analysis.
Only peaks with at least 2 fold more normalized tags count in the target experiment with respect to the input (used as control) were considered, the analysis was performed with disabled local tag count and poisson p-value threshold of 0.0001. HOMER was next used to associate peaks with the nearby gene, determine the genomic annotation of the region occupied by the peak and merge peaks from replicates.
RNA-seq reads were aligned to the human reference genome with TopHat and filtered by using samtools 33 to remove alignments with quality lower than 20, not primary alignments and PCR duplicates. Gene expression levels were quantified as transcripts per million (TPM). Genes differentially expressed between 93T449 cells and HaCaT cells and (s-value < 0.1; fold change > 1.0) were identified using the Bioconductor package DESeq2 and 'apeglm' for LFC shrinkage 25 .
ATAC-seq peaks were identified and mapped using HOMER and employing the ChIP-seq input sample as genomic reference. A fixed peak size was estimated by the software on the basis of autocorrelation analysis. The other parameters were set todefault. Bedtools were used to merge HOMER peak files into a single bed file 29 . HOMER was next used to associate peaks with the nearby gene and determine the genomic annotation of the region occupied by the peak. For the intersection of ChIP-seq and ATAC-seq peaks the bedtools intersect function was employed.
All the further statistical analysis was performed using R 34 .

DATA AVAILABILITY
All genomic data produced in the present project (93T449 G4-ChIP-seq, ATAC-seq and RNA-seq) have been deposited in the NCBI GEO database under accession number GSE145543 (reviewer token: wzwvciymxvqrlal).
HaCaT cells datasets for G4-ChIP-seq, ATAC-seq and RNA-seq were instead downloaded from GEO at the following accession number GSE76688.