Using DNA microarrays to study host-microbe interactions.

Complete genomic sequences of microbial pathogens and hosts offer sophisticated new strategies for studying host-pathogen interactions. DNA microarrays exploit primary sequence data to measure transcript levels and detect sequence polymorphisms, for every gene, simultaneously. The design and construction of a DNA microarray for any given microbial genome are straightforward. By monitoring microbial gene expression, one can predict the functions of uncharacterized genes, probe the physiologic adaptations made under various environmental conditions, identify virulence-associated genes, and test the effects of drugs. Similarly, by using host gene microarrays, one can explore host response at the level of gene expression and provide a molecular description of the events that follow infection. Host profiling might also identify gene expression signatures unique for each pathogen, thus providing a novel tool for diagnosis, prognosis, and clinical management of infectious disease.


Databases
Although many laboratories are now capable of collecting microarray data, few have access to a database that can effectively meet their data requirements. With considerable investment of resources, a few full-featured, relational gene expression databases have been developed, but these are not available for public deposition of data (e.g., http://genome-www4.stanford.edu/MicroArray/MDEV/index.html; http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/dbase.html). The recently released, freely available AMAD software package (http://www.microarrays.org/software.html) provides basic microarray data storage and retrieval capabilities to the average laboratory.
A grander goal for the community is establishing a consolidated resource for public distribution of microarray data (39)(40)(41). Again, the lack of a standard format for microarray data interferes with creating such a resource (38,39). The European Bioinformatics Institute, recognizing this obstacle, has proposed defining a standard based upon XML, a computer markup language that combines data and formatting in a single file for distribution over the World-Wide Web (40; http://www.ebi.ac.uk/arrayexpress/).
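As an illustration of how a single hybridization might be packaged in XML for exchange, the following minimal sketch uses Python's standard library. The element and attribute names (`experiment`, `spot`, `gene`, `log_ratio`) are hypothetical and are not taken from the European Bioinformatics Institute proposal; the measurements are invented.

```python
import xml.etree.ElementTree as ET

def build_record(experiment_id, organism, measurements):
    """Serialize one hybridization as XML text.

    The schema here is invented for illustration only.
    measurements: list of (gene name, log ratio) pairs.
    """
    root = ET.Element("experiment", id=experiment_id, organism=organism)
    for gene, log_ratio in measurements:
        ET.SubElement(root, "spot", gene=gene, log_ratio=f"{log_ratio:.2f}")
    return ET.tostring(root, encoding="unicode")

def read_record(xml_text):
    """Recover the (gene, log_ratio) pairs from the XML text."""
    root = ET.fromstring(xml_text)
    return [(s.get("gene"), float(s.get("log_ratio"))) for s in root.iter("spot")]

# Toy values, not real data.
record = build_record("exp001", "Escherichia coli", [("lacZ", 3.2), ("melA", 1.1)])
print(record)
print(read_record(record))
```

Because the data and their description travel in one self-describing text file, any laboratory with an XML parser could read the record without knowing the database that produced it.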

Algorithms
Inferring biologically meaningful information from microarray data requires sophisticated data exploration. Most global gene expression analyses have used some form of unsupervised clustering algorithm (16,42-44) to find genes coregulated across the dataset (Figure 1). A primary justification for this approach is that shared expression often implies shared function (38,43). In datasets containing many experiments, clustering can also group experiments on the basis of gene expression profiles, an approach that has been successful in classifying tumor-derived cell lines (19,45) and tumor subtypes (12)(13)(14)(15)(16)(17).
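The idea of unsupervised clustering can be sketched in a few lines. The example below is a simplified illustration, assuming Pearson correlation as the similarity measure and average linkage as the merging rule; the stopping threshold `min_corr`, the gene names, and the log ratios are all invented and do not reproduce any cited analysis.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (nonzero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster(profiles, min_corr=0.8):
    """Average-linkage agglomerative clustering of genes.

    profiles: dict of gene name -> list of log ratios across experiments.
    Merging stops when no pair of clusters has mean correlation >= min_corr.
    """
    clusters = [[g] for g in profiles]

    def linkage(a, b):
        pairs = [(g, h) for g in a for h in b]
        return sum(pearson(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < min_corr:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy profiles: two genes rise across the series, two fall.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
print(cluster(profiles))
```

Transposing the input (experiments as keys, genes as columns) would group experiments rather than genes, which is the use described for tumor classification.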
When a coregulated class of genes is known, supervised clustering algorithms, which are trained to recognize known members of the class, can assign uncharacterized genes to that class. For example, a machine-learning method known as a support vector machine has been used to classify yeast genes by function on the basis of shared regulation (46). Robust determination of coregulated gene clusters may be achieved by using a tiered approach: unsupervised clustering to identify coregulated genes followed by testing and refinement with supervised algorithms (47).
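The supervised step can be illustrated with a much simpler classifier than a support vector machine. The sketch below swaps in a nearest-centroid rule: each known class is summarized by its mean expression profile, and an uncharacterized gene is assigned to the class whose centroid it correlates with best. This is a stand-in chosen for brevity, not the method of reference 46; class names, profiles, and the `min_corr` cutoff are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles (nonzero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def train(labeled):
    """labeled: dict of class name -> list of profiles of known members.

    Returns dict of class name -> centroid (mean profile).
    """
    return {
        cls: [sum(col) / len(col) for col in zip(*members)]
        for cls, members in labeled.items()
    }

def classify(model, profile, min_corr=0.7):
    """Return the best-correlated class, or None if no centroid reaches min_corr."""
    cls, score = max(
        ((c, pearson(profile, m)) for c, m in model.items()),
        key=lambda t: t[1],
    )
    return cls if score >= min_corr else None

# Toy training set with two hypothetical coregulated classes.
model = train({
    "ribosomal": [[2.0, 1.0, 0.0, -1.0], [2.2, 0.9, 0.1, -1.2]],
    "respiration": [[-1.0, 0.0, 1.0, 2.0], [-1.1, 0.2, 0.8, 2.1]],
})
print(classify(model, [1.9, 1.1, -0.1, -0.8]))  # resembles the ribosomal class
print(classify(model, [0.0, 1.0, 0.0, 1.0]))    # resembles neither class well
```

Returning `None` for weakly matching genes reflects the tiered approach described above: genes that fit no known class are left for further unsupervised exploration rather than forced into a category.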

Genomics
Although clustering algorithms will continue to be a mainstay in the analysis of gene expression datasets, a wealth of other datamining techniques have yet to be applied (38,48). Preliminary reports indicate that many algorithms and visualization methods are being developed, but their ability to extract biologic insight has yet to be established (49)(50)(51).
The study of microbial pathogens, and prokaryotes in general, will require the development of some specialized analysis tools. First, the compact and modular structure of prokaryotic genomes, and in particular the presence of operons and pathogenicity islands, suggests that important insights may be gained by mapping gene expression information onto genomic structure. In addition, because gene expression will be measured in many different pathogens, often under the same environmental conditions, tools for cross-species comparison of gene expression data will permit the detection of conserved transcription responses.
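One simple way to map expression onto genomic structure is to walk the genes in chromosomal order and look for runs of adjacent, same-strand genes that respond together, as members of an operon would. The following sketch assumes a hypothetical input format of (name, strand, log2 ratio) tuples; the thresholds and data are invented.

```python
def induced_runs(genes, min_log2=1.0, min_len=2):
    """Find runs of adjacent, same-strand genes that are all induced.

    genes: list of (name, strand, log2_ratio) tuples in chromosomal order.
    Returns a list of runs, each a list of gene names.
    """
    runs, current, strand = [], [], None
    for name, s, ratio in genes:
        if ratio >= min_log2 and (not current or s == strand):
            current.append(name)
            strand = s
        else:
            if len(current) >= min_len:
                runs.append(current)
            # An induced gene on the other strand starts a new run.
            current = [name] if ratio >= min_log2 else []
            strand = s
    if len(current) >= min_len:
        runs.append(current)
    return runs

# Toy genome segment with invented values.
segment = [
    ("orf1", "+", 0.1),
    ("orf2", "+", 2.9),
    ("orf3", "+", 2.7),
    ("orf4", "+", 2.4),
    ("orf5", "-", 1.8),
    ("orf6", "-", 0.0),
]
print(induced_runs(segment))
```

Runs found this way are only candidates: adjacent genes can be induced independently, so confirmation would require profiles across many conditions or direct transcript mapping.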

Examining a Microorganism: Application of DNA Microarrays
Microarray technology promises to speed the study of uncharacterized or poorly characterized microbes by contributing to annotation of the microbial genome, enabling exploration of microbial physiology, and identifying candidate virulence factors.

Designing a Microbial Genome Microarray
Designing a whole-genome DNA microarray for a fully sequenced microbe is conceptually straightforward. Several sensitive microbial gene-finding programs can quickly and accurately predict most ORFs (52)(53)(54)(55)(56)(57). DNA fragments representing each of the ORFs can be obtained by PCR amplification that uses ORF-specific oligonucleotides, the design of which can be automated with primer design software such as Primer3 (58). Homology-searching algorithms should be used to choose regions of genes that will not cross-hybridize with other regions of the genome. After a simple purification step, PCR fragments can be arrayed by a robotic arrayer (5). This basic approach has been used to construct a 4,290-ORF E. coli microarray (10, 11) and a 3,834-ORF Mycobacterium tuberculosis microarray (30) as well as full-genome arrays for Helicobacter pylori (S. Falkow, pers. comm.) and Caulobacter crescentus (L. Shapiro, pers. comm.).
Microarray fabrication based on photolithographic synthesis of oligonucleotides in situ is also a viable approach and has been successfully used for the production of an E. coli complete ORF chip (E. coli Genome Array, Affymetrix, Santa Clara, CA).
The utility of microarrays is not restricted to fully sequenced organisms. A powerful screening tool can be obtained by arraying DNA libraries, as has been done for the eukaryotic pathogen, Plasmodium falciparum (59). A DNA microarray of 3,648 random genomic clones was used to identify >50 genes for which expression differed significantly between the trophozoite and gametocyte stages. The major limitation of this approach is that the identity of any element of interest must be determined after the experiment.

Annotating the Function of a Microbial Genome
For many pathogens, the number of genes for which function information is available is usually low. Moreover, the relative insufficiency of genetic tools can make obtaining such information difficult. However, because >70% of bacterial proteins have orthologs in other organisms (60,61), one can leverage extensive knowledge of function from the model organisms to infer function for a pathogen's genome. Similarity searches alone will predict functions of many genes.
We expect the study of genomewide expression patterns to contribute even further to annotation of function. The rationale for this belief follows from the observation that shared expression often implies shared function (38). As suggested by Brown and Botstein (21), the inclusion of a gene with a characterized ortholog in a coregulated gene cluster can predict the function of the remaining genes in that cluster, thus bootstrapping the function annotation of the pathogen's genome. This assertion is borne out in a study of global gene expression in Saccharomyces cerevisiae. Clustering of 2,467 gene expression profiles across a series of 78 experiments representing eight cellular processes demonstrated coregulation of genes that participated in shared cellular function (43). Therefore, the acquisition of a pathogen's gene expression data from even a modest number of experimental conditions may lead to testable hypotheses about function for a substantial number of genes, even those lacking sequence similarity to genes whose function has been characterized.
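The bootstrapping argument can be made concrete with a small sketch. Given coregulated clusters and the subset of genes whose function is already known (for example, through a characterized ortholog), each unannotated gene inherits the most common function among its annotated cluster mates. The `min_support` rule and the cluster contents are assumptions made for illustration; the proposed labels are hypotheses to test, not annotations.

```python
from collections import Counter

def propagate_annotations(clusters, known, min_support=2):
    """Suggest functions for unannotated genes from annotated cluster mates.

    clusters: list of lists of gene names (coregulated groups).
    known: dict of gene name -> function label for characterized genes.
    A label is proposed only when it is the most common label in the
    cluster and at least min_support annotated members carry it.
    Returns dict of gene -> hypothesized function.
    """
    hypotheses = {}
    for members in clusters:
        labels = Counter(known[g] for g in members if g in known)
        if not labels:
            continue
        label, count = labels.most_common(1)[0]
        if count < min_support:
            continue
        for g in members:
            if g not in known:
                hypotheses[g] = label
    return hypotheses

# Toy clusters; the orf names are placeholders for uncharacterized genes.
clusters = [
    ["rpsA", "rplB", "orf101"],
    ["orf202", "orf303"],
    ["katG", "orf404"],
]
known = {"rpsA": "translation", "rplB": "translation", "katG": "oxidative stress"}
print(propagate_annotations(clusters, known))
```

Requiring support from more than one annotated member is a conservative choice: a cluster with a single characterized gene, such as the third one here, yields no prediction.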

Probing a Microbe's Physiologic State
The assumption that genes are preferentially expressed when their function is required allows inference of gene function directly from physiologic gene response. For example, genes preferentially transcribed during the diauxic shift in yeast are predicted to contribute in the metabolic transition to respiration (9). Thus, gene expression studies will contribute to function annotation by identifying the specific environmental and physiologic conditions in which each gene is expressed. Furthermore, as annotation improves, the direction of this inference may be reversed, i.e., if information on function is known for many genes, genomic expression profiling may reveal the physiologic state of the organism.
Two studies have used whole-genome DNA arrays to explore gene expression response to environmental stimuli in E. coli. First, treatment with isopropyl-β-D-thiogalactopyranoside (IPTG) was shown to induce only the lac operon, and to a lesser extent, the melibiose operon (11). In a second study, comparison of strains grown in minimal versus rich media revealed 344 genes that were differentially expressed between the two conditions: preferential expression of the translation apparatus in rich media and the amino acid biosynthetic pathways in minimal media were entirely consistent with prior data (10). In addition, examination of gene expression during heat shock in the first study revealed 119 genes with altered expression levels, all but 35 of which were previously recognized as heat shock genes (11). These studies confirm that the physiologic state of bacteria can be inferred from gene expression data.
In the first report of global gene expression monitoring in a bacterial pathogen, oligonucleotide microarrays were used to measure the relative transcript levels of 100 Streptococcus pneumoniae genes during the development of natural competence and during stationary phase (29). The results confirmed induction of the cin operon and identified 11 genes differentially regulated in stationary versus exponential phase. Of course, gene expression monitoring is not restricted to the study of bacterial pathogens. Transcription of the cytomegalovirus (CMV) genome was measured during infection by using an array of 75-mer oligonucleotides representing each of the 226 predicted CMV ORFs (62). By blocking translation or DNA replication, the researchers revealed a detailed classification of CMV genes into four kinetic classes, in agreement with previous reports, and assigned many ORFs, for which expression data were not previously available, into these groups.

Identifying Candidate Virulence Factors
Because expression of virulence-associated genes is tightly regulated (4), measuring a pathogen's gene expression in microenvironments specific to the pathogen and germane to the disease process is critical. Exploration of pathogen gene expression in the host environment may be technically challenging because of the relatively small number of pathogens present in an infected animal (29). Until more sensitive detection protocols are developed, examining global gene expression will be more practical in environmental conditions that mimic aspects of the host environment, such as elevated temperature, iron limitation, and changes in pH (4,63) and in cell culture models. In fact, a microarray has been used to monitor gene expression in M. tuberculosis while it infects cultured monocytes (64). Even after measurement of bacterial gene expression from infected hosts becomes feasible, the ex vivo datasets will facilitate deconstruction of the in vivo gene expression response into component responses, leading to detailed understanding of the pathways of virulence factor regulation.
Identifying candidate virulence factors through a global gene expression method relies on two assumptions. First, because virulence-associated genes are often coordinately regulated (4), new virulence factors are likely to be coregulated with known ones. By clustering gene expression profiles across a large number of conditions, we can precisely monitor coregulation, thus revealing subtleties of regulation and leading to the identification of bona fide regulons. Second, because virulence-associated genes are tightly regulated (4), genes that are specifically expressed during infection or under conditions mimicking infection are candidate virulence factors. This assumption has been justified by numerous studies using in vivo expression technology (IVET) and differential fluorescence induction (DFI), in which genes induced during infection are often required for virulence (4,65). When RNA from in vivo microbial samples can be efficiently isolated and labeled, microarrays will provide substantial advantages over IVET and DFI technologies for identifying putative virulence factors, including immediate identification of differentially expressed genes and detection of temporal profiles of transcription induction and repression. As is demanded for candidate genes identified by any expression screening approach, a role in pathogenesis must be confirmed by mutation and subsequent assays of virulence.
By identifying factors expressed in the host, microarray methods may also identify potential vaccine targets. Furthermore, one could identify candidate epitopes for vaccine development for intracellular pathogens by predicting whether genes that are preferentially expressed inside host leukocytes will encode promiscuous human leukocyte antigen class II ligands (66).
Gene expression studies may also reveal key regulatory differences that lead to differing virulence between closely related pathogen strains. For example, variations in virulence of Listeria monocytogenes serotypes have been correlated with differential transcription of PrfA-regulated virulence genes (67,68). However, because microarrays cannot measure expression of genes that are absent from the reference strain, genotypic differences such as horizontal transfer of virulence factors will not be detectable by this method.

Pharmacogenomics
Yet another application for microarrays is the study of drug effects on microbial cellular physiology, as revealed by global gene expression patterns (69). This approach has been used to identify drug-specific gene expression signatures in yeast and human cells (18,19,70). Correlation of gene expression with drug activity may suggest molecular details of drug action, and correlation of transcription profiles in untreated cells with drug response may reveal mechanisms for sensitivity and resistance (19).
This approach has recently been used to characterize gene expression response in M. tuberculosis exposed to known inhibitors of the mycolic acid biosynthesis pathway, isoniazid and ethionamide (30). Both of these compounds elicited a similar gene expression response profile, characterized by pronounced transcription induction of five adjacent genes encoding fatty acid biosynthesis enzymes. Because a proven isoniazid target, KasA, was among these genes, the authors proposed that the adjacent, coregulated loci might be targets for new anti-tuberculosis drugs. Finally, these results suggested that the mode of action of a novel compound may be inferred from gene expression response to that compound.
Using microarrays to detect microbial polymorphisms linked to known drug-resistance phenotypes will also influence diagnosis and subsequent drug treatment. For example, an oligonucleotide array was used to detect mutant alleles of the M. tuberculosis rpoB gene, which are known to confer resistance to rifampicin (71).

Microbial Genotyping
One microarray application that interrogates DNA rather than RNA is the identification of genomic deletions in mutant strains and environmental isolates by measuring the number of DNA copies at each locus, a technique termed array-based comparative genome hybridization (72). This technique was used to identify several large deletions in a number of BCG vaccine strains and reconstruct their phylogeny (73).
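The logic of array-based comparative genome hybridization can be sketched as follows: for each locus, the signal from the test strain is compared with the signal from the reference strain, and runs of adjacent loci with a strongly reduced ratio are reported as candidate deletions. The input format, the log2 cutoff, and the minimum run length are assumptions made for illustration, and the signals are invented.

```python
from math import log2

def find_deletions(loci, max_log2=-1.0, min_run=2):
    """Flag runs of adjacent loci with low test/reference signal.

    loci: list of (name, test_signal, reference_signal) in genome order.
    Signals must be positive (for example, floored after background correction).
    Returns a list of (first_locus, last_locus, n_loci) for each run.
    """
    deletions, run = [], []
    for name, test, ref in loci:
        if log2(test / ref) <= max_log2:
            run.append(name)
            continue
        if len(run) >= min_run:
            deletions.append((run[0], run[-1], len(run)))
        run = []
    if len(run) >= min_run:
        deletions.append((run[0], run[-1], len(run)))
    return deletions

# Toy signals: loci 2 through 4 hybridize poorly in the test strain.
loci = [
    ("locus1", 1000, 980),
    ("locus2", 120, 1010),
    ("locus3", 90, 950),
    ("locus4", 100, 1000),
    ("locus5", 1020, 990),
]
print(find_deletions(loci))
```

Requiring several adjacent low-ratio loci guards against single spots that fail for technical reasons, at the cost of missing deletions smaller than the run length.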
Oligonucleotide arrays have also been used for fine-scale genotyping of polymorphisms in related pathogens. Accurate identification of Mycobacterium species using a GeneChip containing a set of 82 polymorphic oligonucleotides from the 16S ribosomal RNA gene demonstrated the potential power of this approach for molecular diagnostics (71). As additional microbial genome ORF microarrays become available, molecular surveys of the genomic structure of multiple strains will become far more precise and feasible. Two caveats should be mentioned: the ability to characterize genome insertions relative to the reference sequence is lacking, and the degree to which sequence variability can be characterized on the basis of microarray hybridization is unknown.

Examining a Host: Application of DNA Microarrays

Designing Microarrays for Host Organisms
The currently described human DNA microarrays are largely composed of expressed sequence tags (ESTs). Culling ESTs from many different tissue sources and limiting representation of any single UniGene cluster (see http://www.ncbi.nlm.nih.gov/UniGene/Hs.stats.shtml) have resulted in better than 50% representation of the predicted 80,000-100,000 human coding regions (28). A variety of human DNA and oligonucleotide microarrays are available commercially (e.g., Incyte, Palo Alto, CA; Affymetrix; NEN Life Science Products, Boston, MA).
For in vivo studies of host response, infection of animal models will often be necessary. If the animal is a primate, human DNA microarrays might be used to monitor host gene expression because of the high level of primary sequence similarity between species. Sequence similarity is too low to permit reliable cross-hybridization with nonprimate vertebrates, but microarrays composed of mouse and rat sequences have been described (74) and are available (e.g., Incyte, Affymetrix).

Understanding Pathogenesis
Microarrays promise to accelerate our understanding of the host side of the host-pathogen interaction. A large fraction of the genome can be simultaneously interrogated, and clustering of the data may identify groups of genes that implicate activation or repression of key regulatory pathways. Microarrays also allow the temporal sequence of transcription induction and repression to be followed, a prerequisite for determining the order of events following an encounter. Finally, ascertainment of the host cell's physiologic state, particularly apoptosis and necrosis, by genomewide profiling will facilitate separation of primary and secondary effects.
One important caveat of studying transcription in any system is that post-transcription regulatory events cannot be detected. This is particularly important in the case of host response because many important host cell events, such as cytoskeletal rearrangements, occur after transcription (75). Therefore, some key aspects of the molecular program may not be easily characterized by gene expression profiling. Eventually, it may be possible to monitor simultaneously the levels, activities, and interactions of all proteins in the cell (76).
Although analyzing gene expression of infected tissues is feasible, cellular heterogeneity may make analysis of host response complicated. Examining the response in infected cultured cells by using cell types most likely to encounter the pathogen may reduce the complexity of the system being examined. Results obtained in cell culture systems will be instrumental in interpreting gene expression profiles of specific cell types from whole tissue datasets.
The first application of global gene expression methods to pathogenesis used oligonucleotide arrays to monitor gene expression in primary human fibroblasts infected by human CMV (37). The transcript abundance of 258 out of 6,600 human genes changed by more than fourfold compared to uninfected cells at either 8 or 24 hours after infection. Some of these changes, such as induction of cytokines, stress-inducible proteins, and many interferon-inducible genes, were consistent with induction of cellular immune responses.
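A selection rule of this kind (a change of more than fourfold, in either direction, at any sampled time point) can be sketched as below. The data layout and abundance values are invented, and the sketch ignores replicate variation and detection limits, which a real analysis would need to address.

```python
def changed_genes(uninfected, infected_by_time, min_fold=4.0):
    """Genes whose abundance changes more than min_fold at any time point.

    uninfected: dict of gene -> abundance in uninfected cells (positive).
    infected_by_time: dict of hours after infection -> dict of gene -> abundance.
    Returns dict of gene -> list of (hours, fold change) for qualifying time points.
    """
    hits = {}
    for gene, base in uninfected.items():
        for hours, levels in infected_by_time.items():
            fold = levels[gene] / base
            # Count both induction (> min_fold) and repression (< 1/min_fold).
            if fold > min_fold or fold < 1.0 / min_fold:
                hits.setdefault(gene, []).append((hours, round(fold, 2)))
    return hits

# Toy abundances for three hypothetical host genes.
uninfected = {"geneA": 100, "geneB": 200, "geneC": 50}
infected = {
    8: {"geneA": 450, "geneB": 210, "geneC": 60},
    24: {"geneA": 300, "geneB": 40, "geneC": 55},
}
print(changed_genes(uninfected, infected))
```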
A similar experimental design has been used to examine the global effects of HIV-1 infection on cultured CD4-positive T cells. One study concluded that HIV-1 infection resulted in differential expression of 20 of the 1,506 human genes monitored and that most of these changes occurred only after 3 days in culture (36). In contrast, the preliminary results of an independent study using a similar design indicated that substantial HIV-induced transcription changes began very early after inoculation (77). The latter study confirmed activation of nuclear factor-κB (NF-κB), p68 kinase, and RNase L.
DNA expression arrays have recently been used to examine the response of host cells to infection by bacterial pathogens. Transcription profiling of macrophages and epithelial cells infected by Salmonella confirmed increased expression of many proinflammatory cytokines and chemokines, signaling molecules, and transcription activators and identified several genes previously unrecognized to be regulated by infection (33,34). The macrophage study demonstrated that exposure to purified Salmonella lipopolysaccharide resulted in a very similar response profile to whole cells and that activation of macrophages with gamma interferon before infection modified the response (34). In epithelial cells, overexpression of IκB (an inhibitor of NF-κB) blocked induction of gene expression for a number of regulated genes, underscoring the importance of NF-κB in the proinflammatory response (33).
Similarly, the transcription response of human promyelocytic cells to L. monocytogenes infection has been determined by both oligonucleotide arrays and filter-based arrays (32). Comparison of these data with the Salmonella infection data suggests that the proinflammatory response is grossly conserved: in both cases many key components including interleukin-1, intercellular adhesion molecule-1, and macrophage inflammatory protein 1-β are induced. Although differences were observed between the two experiments, including induction of apoptosis-promoting genes by Salmonella versus induction of anti-apoptotic genes by L. monocytogenes, the disparities between cell lines, methods, and genes assayed in these reports make direct comparison difficult. However, we speculate that differences in pathogen virulence strategies may account for some of these differences in host response at the molecular level.
The initial reports demonstrate the potential power of using microarrays to characterize host response but also suggest that interpretation of host gene expression profiles will be challenging. For example, modulation of mRNAs encoding components of the prostaglandin E2 biosynthetic pathway suggested that CMV induced synthesis of this proinflammatory second messenger (37). The authors of this study proposed three potential explanations for this observation: this pathway could be induced by a cellular response intended to limit spread of the infection by promoting the killing of infected cells; viral regulators could induce prostaglandin E2 production to lure monocytes, which could subsequently be infected, leading to viral dissemination within the host; and these genes could be induced secondarily through induction of interleukin-1β since a similar pattern of regulation was observed in cells treated with that cytokine. Microarrays can identify interesting cellular events, but because expression patterns cannot distinguish between these mechanisms, the need for further investigation is obvious.
The experiments described above are strictly exploratory and attempt to catalog the transcription events that occur after an infection. However, expression profiling also lends itself to a more hypothesis-driven experimental design. For example, comparison of host responses to related strains of the same pathogen could explain differences in pathogenesis. In fact, comparison of gene expression in human monocytes infected by two distinct strains of Ebola virus, one infectious for humans and one not, revealed divergent transcription responses (78). Similarly, by examining responses to isogenic mutant pathogen strains lacking single virulence genes, or virulence factor-associated biologic activities, one might attribute components of the response to specific virulence attributes, which in turn might yield mechanistic insight into those virulence factors. Finally, comparing transcription responses to families of structurally related virulence factors, e.g., bacterial pore-forming toxins, may explain how pathogens expressing similar virulence factors can cause different pathologic responses.

Diagnostic Gene Expression Profiles
Most microarray-based gene expression studies in humans have searched for genes that are differentially expressed in various pathologic states. For example, clustering gene expression profiles can classify tumors into separate molecular subtypes (12)(13)(14)(15)(16)(17). In the case of diffuse large B-cell lymphoma, two distinct molecular classes exhibit substantially different survival rates, suggesting that future clinical intervention, at least in the case of cancer, could be guided by diagnostic gene expression profiling (14). Microarrays have also been used to measure the response of cultured cells to distinct external stimuli, including drugs (19) and environmental toxins (79).
How can this paradigm be applied to the diagnosis of infectious disease? In collaboration with Pat Brown (Stanford) and Lou Staudt (National Cancer Institute), we hypothesize that the unique constellation of virulence factors expressed by a specific pathogen will elicit a unique transcription response in the host (80). By extension, the cascade of events leading to inflammation and acquired immunity, including secretion of mediators and subsequent cell-cell interactions, might leave a unique trail of transcription signatures in the leukocytes participating in that response. Despite conserved overall virulence strategies, microbial pathogens exhibit specialization and unique attributes for any given strategy at the molecular level (81). Thus, by measuring the aggregate gene expression pattern in peripheral blood mononuclear leukocytes, for example, we may find signatures diagnostic of infection by specific pathogens or categories of pathogens.
The potential advantages of using host gene expression signatures as diagnostic markers of infection are profound. First, this technique might permit early detection of exposure to pathogens, even uncultivatable or uncharacterized pathogens. Second, variations in host signatures could be used to infer time since exposure. Third, because host response may continue in the absence of the pathogen, this method might detect exposure to pathogens that only transiently colonize the host, are sequestered in poorly accessed anatomic sites, or do not colonize the host at all (e.g., Clostridium botulinum and C. perfringens, in some cases). Finally, a single, easily collected sample could be used for diagnosing exposure to a wide array of agents.
Before the proposed method becomes an accepted diagnostic tool, one must determine whether exposure to a pathogen leads to a robust, persistent, and specific gene expression signature in peripheral blood mononuclear leukocytes and whether this signature is universal in patients of different genetic backgrounds. Experiments are under way in our laboratory to assess the feasibility of this approach. Thus far, identification of gene expression profiles common to many different pathogens is leading to a more detailed understanding of early events in the development of immune response, and inflammation in particular, but the goal of these experiments (to define unique signatures for each pathogen) has not yet been realized.

Conclusion: The Two-Way Conversation
The few published studies reviewed here represent what is certain to be the beginning of a deluge of genome-scale pathogen data. At Stanford University alone, microarray-based studies of Bordetella pertussis, Salmonella, H. pylori, Campylobacter jejuni, V. cholerae, M. tuberculosis, and E. coli, as well as the nonpathogenic microbes Streptomyces coelicolor and C. crescentus, are under way (S. Falkow, G. Schoolnik, S. Cohen, and L. Shapiro, pers. comm.).
The longer term goals of functional genomics and microarray technology in infectious diseases include describing the host-pathogen interaction in molecular detail and identifying critical target molecules and pathways for diagnosis and intervention. Realizing these goals will require additional technology, extensive data collection, sophisticated computational tools, and efforts to discern cause and effect. We are on the verge of being able to listen to the two-way conversation between pathogen and host through devices of immense power.