The Vibrio harveyi master quorum-sensing regulator, LuxR, a TetR-type protein is both an activator and a repressor: DNA recognition and binding specificity at target promoters

Quorum sensing is the process of cell-to-cell communication by which bacteria communicate via secreted signal molecules called autoinducers. As cell population density increases, the accumulation of autoinducers leads to co-ordinated changes in gene expression across the bacterial community. The marine bacterium, Vibrio harveyi, uses three autoinducers to achieve intra-species, intra-genera and inter-species cell–cell communication. The detection of these autoinducers ultimately leads to the production of LuxR, the quorum-sensing master regulator that controls expression of the genes in the quorum-sensing regulon. LuxR is a member of the TetR protein superfamily; however, unlike other TetR repressors that typically repress their own gene expression and that of an adjacent operon, LuxR is capable of activating and repressing a large number of genes. Here, we used protein binding microarrays and a two-layered bioinformatics approach to show that LuxR binds a 21 bp consensus operator with dyad symmetry. In vitro and in vivo analyses of two promoters directly regulated by LuxR allowed us to identify those bases that are critical for LuxR binding. Together, the in silico and biochemical results enabled us to scan the genome and identify novel targets of LuxR in V. harveyi and thus expand the understanding of the quorum-sensing regulon.


Analyzing the protein binding microarray (PBM) data
The PBM experiments yielded a fluorescence value for each spot on the array. The fifty sequences with the highest fluorescence from each array design (100 sequences total) were collectively analyzed using MultiFinder (Huber & Bulyk, 2006). This program integrates four previously developed motif discovery algorithms and can identify multiple position weight matrices (PWMs). We employed its user-specified option to output the single PWM with the most significant group specificity score, which here corresponds to the PWM that is most specific to the input sequences as compared to the rest of the sequences on the arrays. Within these 100 sequences, MultiFinder identified a 21 bp over-represented motif. The resulting PWM described the binding specificity by assigning a probability for each base at each of the 21 nucleotide positions.
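As a minimal illustration of what a PWM of this kind contains, the sketch below builds per-position base probabilities from a set of sites that are assumed to be already aligned. It does not reproduce MultiFinder, which discovers the motif in unaligned sequences. The function name `build_pwm`, the pseudocount, and the toy 6 bp sites are hypothetical and are not LuxR data.

```python
from collections import Counter

BASES = "ACGT"

def build_pwm(aligned_sites, pseudocount=1.0):
    """Return one {base: probability} dict per alignment column.

    Assumes the sites are already aligned and of equal length; the
    pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    total = len(aligned_sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

# Toy, made-up sites used only to show the mechanics.
toy_sites = ["TATTGA", "TATTGA", "TACTGA", "AATTGA"]
toy_pwm = build_pwm(toy_sites)
```

For the 21 bp LuxR motif the same structure would hold 21 columns, each a probability distribution over the four bases.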
To identify LuxR binding sites within known directly regulated promoters from V. harveyi, we used this PBM-derived PWM in conjunction with MotifLocator (Thijs et al., 2001), which scans for potential binding sites. MotifLocator takes as inputs the PWM, a background model of the genome, a chosen threshold probability score, and a list of target sequences, and outputs a list of the target sequences that score above the input threshold. Using MotifLocator, we analyzed both the known LuxR-regulated promoters and the PBM sequences using a variety of thresholds. We found that high threshold probabilities recovered too few of the expected hits, while low thresholds resulted in high false positive rates. Modifying the PWM by reducing it to a 20 bp sequence and enforcing symmetry between the visibly important 5 bp half-sites did not significantly improve overall performance. We concluded that a more sophisticated approach was needed to identify the true LuxR binding sites, and therefore applied a machine learning algorithm called a Support Vector Machine (SVM) (Bishop, 2006).
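The threshold trade-off described above can be illustrated with a generic PWM scan. This is a sketch under stated assumptions, not MotifLocator's scoring scheme: it uses a plain log-odds score against a zeroth-order background and scans only the forward strand. The names `log_odds_score` and `scan`, the 3 bp toy PWM, and the toy sequence are hypothetical.

```python
import math

def log_odds_score(window, pwm, background):
    """Sum of log2(P_motif / P_background) over the window positions."""
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(window))

def scan(sequence, pwm, background, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds_score(window, pwm, background)
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical 3 bp PWM favouring "ATG" and a uniform background.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {b: 0.25 for b in "ACGT"}
toy_sequence = "CCATGCATC"

strict = scan(toy_sequence, toy_pwm, uniform, threshold=4.0)     # exact match only
relaxed = scan(toy_sequence, toy_pwm, uniform, threshold=1.0)    # also one mismatch
permissive = scan(toy_sequence, toy_pwm, uniform, threshold=-5.0)  # every window
```

Raising the threshold keeps only the exact match, while lowering it admits progressively weaker windows, which is the behaviour that motivated adding a second scoring layer.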

Refining the binding-site model
SVMs are a family of machine learning algorithms that map data sets into higher dimensions to separate the data points into classes. This form of supervised learning is well studied and software implementations are publicly available. The first step in training a classification SVM model is to define positive and negative examples. In our case, these examples were obtained from the PBMs. Specifically, we used the PWM from MultiFinder with a low enough threshold probability (85% confidence) to return at least one 21 bp subsequence for each LuxR-bound 60-mer sequence on the array. We considered LuxR-bound sequences to be those whose normalized fluorescence was greater than 20,000 fluorescence units, closely corresponding to the top 50 sequences from each array. To be conservative, we considered as unbound sequences those whose fluorescence was below a cutoff of 7,500 fluorescence units. This cutoff was chosen to optimize the performance of the SVM on its own data, using leave-one-out validation. For each of the bound array sequences, the highest scoring 21 bp subsequence was taken as a positive example for the SVM.
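The labeling step can be sketched as follows, using the two fluorescence cutoffs given above. Probes between the cutoffs are left unlabeled. How subsequences were drawn from the unbound probes is not stated in this passage, so the sketch simply returns those probes whole. The function names, the toy probes, and the stand-in scoring function (a count of T bases in place of the PWM score) are hypothetical.

```python
BOUND_CUTOFF = 20000.0    # fluorescence units, from the text
UNBOUND_CUTOFF = 7500.0   # fluorescence units, from the text

def label_probe(fluorescence):
    """Return 1 for bound, -1 for unbound, None for ambiguous probes."""
    if fluorescence > BOUND_CUTOFF:
        return 1
    if fluorescence < UNBOUND_CUTOFF:
        return -1
    return None

def best_window(probe, width, score_fn):
    """Return the highest-scoring subsequence of the given width."""
    windows = [probe[i:i + width] for i in range(len(probe) - width + 1)]
    return max(windows, key=score_fn)

def build_training_sets(probes, width, score_fn):
    """Split (sequence, fluorescence) pairs into positives and unbound probes."""
    positives, unbound = [], []
    for sequence, fluorescence in probes:
        label = label_probe(fluorescence)
        if label == 1:
            positives.append(best_window(sequence, width, score_fn))
        elif label == -1:
            unbound.append(sequence)
    return positives, unbound

# Toy data: short made-up probes and a stand-in score, not PBM measurements.
toy_probes = [("ACGTTTTACG", 25000.0), ("ACGACGACGA", 5000.0), ("TTACGACGAC", 12000.0)]
positives, unbound = build_training_sets(toy_probes, 4, lambda w: w.count("T"))
```

In the procedure described above, `width` would be 21 and the scoring function would be the PWM score applied to each window of a 60-mer probe.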

Scanning and scoring promoter sequences
The first step in scanning promoter sequences for putative LuxR-binding sites was the same as the first step of SVM training. We scanned the promoter sequences using the PWM with the same threshold (>85%) and identified the above-threshold 21 bp subsequences, which were then converted into binary form. The binary sequences were scored by the trained SVMs and the average score for each 21 bp subsequence was compiled. Sequences with scores greater than 0 were considered true binding sites, while scores less than 0 indicated false positives. To perform the in silico mutagenesis of the qrr4 and qrgB LuxR-binding sites, we calculated the average SVM scores for the appropriately modified 21 bp sequences.
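The scoring and in silico mutagenesis steps can be sketched as below. Two assumptions are made that the text does not confirm: the binary form is taken to be a one-hot encoding (four bits per base), and each trained SVM is represented by a linear decision function with made-up weights. The names `encode`, `average_score`, and `point_mutants`, and the 2 bp toy models, are hypothetical.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(site):
    """One-hot encode a DNA string (assumed encoding, four bits per base)."""
    bits = []
    for base in site:
        bits.extend(ONE_HOT[base])
    return bits

def average_score(site, models):
    """Average the decision values of several (weights, bias) linear models."""
    bits = encode(site)
    scores = [sum(w * x for w, x in zip(weights, bits)) + bias
              for weights, bias in models]
    return sum(scores) / len(scores)

def is_binding_site(site, models):
    """Scores above 0 are called binding sites, as in the text."""
    return average_score(site, models) > 0

def point_mutants(site):
    """Yield (position, new_base, mutant) for every single-base substitution."""
    for pos, original in enumerate(site):
        for base in "ACGT":
            if base != original:
                yield pos, base, site[:pos] + base + site[pos + 1:]

# Hypothetical stand-ins for trained SVMs: both favour the 2 bp site "TA".
toy_models = [
    ([0, 0, 0, 1, 1, 0, 0, 0], -1.5),
    ([0, 0, 0, 2, 2, 0, 0, 0], -3.0),
]
wild_type = average_score("TA", toy_models)
mutant_scores = {m: average_score(m, toy_models) for _, _, m in point_mutants("TA")}
```

Comparing each mutant's average score with the wild-type score indicates which positions the model treats as critical, which is the logic of the in silico mutagenesis of the qrr4 and qrgB sites.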

Identification of novel genomic targets
We scanned the V. harveyi genome for putative binding sites using our dual-layered (PWM/SVM) scoring system. Although we expect a large fraction of the sites with positive SVM scores to bind LuxR, we focused on sites located within 300 bp of a putative gene. The V. harveyi genome has been computationally annotated for open reading frames (ORFs) by the sequencing center at Washington University using the program GeneMarkHMM (Lukashin & Borodovsky, 1998). We used this information to compile a list of approximately 200 candidate LuxR binding sites, on both V. harveyi chromosomes. To enrich for functional sites, we evaluated conservation of the promoter regions, as well as the downstream ORFs themselves, with respect to V. harveyi's closest sequenced relative, V. parahaemolyticus. By eliminating candidate binding sites for which the respective promoter regions were less than 80% conserved, we shortened the candidate gene list to approximately 40 genes. With input from an energetic model (Kinney et al., 2007), we picked five of the highest-scoring of these 40 sequences to test for LuxR-dependent regulation.

True positives = % of true positives predicted to be positive
False positives = % of true negatives predicted to be positive
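The candidate filtering described in this section (positive SVM score, within 300 bp of a putative gene, promoter region at least 80% conserved) can be sketched as a simple filter. The `Candidate` fields, the site identifiers, and the toy values are hypothetical. Ranking the survivors by SVM score alone is an assumption of this sketch, since the final selection also drew on an energetic model.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    site_id: str
    svm_score: float
    distance_to_gene_bp: int
    promoter_conservation: float  # fraction conserved vs. V. parahaemolyticus

def filter_candidates(candidates, max_distance=300, min_conservation=0.80):
    """Keep positively scored, gene-proximal, conserved sites, best score first."""
    kept = [
        c for c in candidates
        if c.svm_score > 0
        and c.distance_to_gene_bp <= max_distance
        and c.promoter_conservation >= min_conservation
    ]
    return sorted(kept, key=lambda c: c.svm_score, reverse=True)

# Toy candidates with made-up values, not results from the genome scan.
toy_candidates = [
    Candidate("site_a", 1.2, 150, 0.91),
    Candidate("site_b", 0.4, 450, 0.95),   # too far from a gene
    Candidate("site_c", 2.0, 80, 0.60),    # promoter poorly conserved
    Candidate("site_d", -0.3, 40, 0.99),   # negative SVM score
    Candidate("site_e", 1.8, 299, 0.80),
]
shortlist = filter_candidates(toy_candidates)
```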