Computational Evolutionary Methodology for Knowledge Discovery and Forecasting in Epidemiology and Medicine

Humanity is facing an increasing number of highly virulent and communicable diseases such as avian influenza. Researchers believe that avian influenza has potential to evolve into one of the deadliest pandemics. Combating these diseases requires in‐depth knowledge of their epidemiology. An effective methodology for discovering epidemiological knowledge is to utilize a descriptive, evolutionary, ecological model and use bio‐simulations to study and analyze it. These types of bio‐simulations fall under the category of computational evolutionary methods because the individual entities participating in the simulation are permitted to evolve in a natural manner by reacting to changes in the simulated ecosystem. This work describes the application of the aforementioned methodology to discover epidemiological knowledge about avian influenza using a novel eco‐modeling and bio‐simulation environment called SEARUMS. The mathematical principles underlying SEARUMS, its design, and the procedure for using SEARUMS are discussed. The bio‐simulations and multi‐faceted case studies conducted using SEARUMS elucidate its ability to pinpoint timelines, epicenters, and socio‐economic impacts of avian influenza. This knowledge is invaluable for proactive deployment of countermeasures in order to minimize negative socioeconomic impacts, combat the disease, and avert a pandemic.

The aforementioned issues are salient characteristics of avian influenza, which is caused by H5N1, a highly virulent strain of the influenza-A virus [2,3,4]. In Spring 2006, it was established that infected migrating waterfowl cause transcontinental dispersion of H5N1. The virus has a devastating impact on poultry, causing 100% mortality within 48 hours of infection [5]. Currently, mass culling of poultry is the recommended approach to contain the disease, and this causes significant economic hardship [6]. Moreover, the pathogen also spreads to humans through direct contact with infected poultry and contaminated surfaces [5]. Many researchers have stated that H5N1 has the potential to cause the next pandemic [5,7].
Proactive administration of a suitable vaccine is the prevention mechanism for influenza [2,5]. Unfortunately, a myriad of technological and socio-political issues have rendered manufacturing and distribution of an H5N1 vaccine a significant challenge [2,5]. Some of these issues include: (i) rapid mutations in the virus make it practically infeasible to predict antigenic characteristics for manufacturing vaccines; (ii) current vaccine manufacturers do not have sufficient capacity; and (iii) targeted distribution of limited quantities of the vaccine is a challenge [5,8]. Given these issues, preventing and controlling avian influenza requires proactive international coordination and strategic deployment of countermeasures. The effectiveness of proactive strategies, including those proposed by the World Health Organization (WHO), will greatly benefit from prediction and forecasting of epicenters and timelines of H5N1 outbreaks. However, forecasting and proactive deployment of countermeasures require global knowledge about the current and future epidemiology of the disease.
Analogous scenarios apply to current and future communicable and vector-borne diseases as well. Humanity needs to be prepared to proactively and effectively combat emergent diseases to minimize loss of human lives and impacts on the global economy. Combating disease requires extensive knowledge about epidemiology, vaccinal characteristics, epicenters of disease outbreaks, timelines, economic impacts, and human mortality. Accordingly, we propose a computational evolutionary methodology to address the aforementioned need.

PROPOSED APPROACH
One of the most promising methodologies to rapidly analyze a complex, stochastic system consisting of millions of organisms is bio-simulation [9,10]. Bio-simulations have been shown to be intuitive and realistic, and to closely reflect epidemiological characteristics [10,11]. The proposed approach aims to leverage bio-simulations to perform epidemiological analysis through the use of evolutionary eco-models of selected organisms in a global ecosystem. In the eco-models, individual organisms are modeled (typically as smart agents) to evolve in a virtual ecosystem [12]. As the ecosystem evolves, necessary attributes at different levels of resolution are recorded to infer various characteristics. Note that in this evolutionary approach the behaviors of organisms (modeled as smart agents) are not coerced. Instead, they are permitted to proceed "naturally", reacting to changes in the ecosystem and reflecting their real-world counterparts. Once verified, the evolutionary bio-simulations can be used to further analyze a wide variety of scenarios to discover more comprehensive knowledge.

Background on Computational Evolutionary Approaches
Computational evolutionary methodology employs computers and digital models to study and analyze evolution, phylogeny, and associated biological characteristics of organisms [12,13,14]. This methodology includes a wide variety of other modeling, simulation, and analysis techniques as well [12,13,14]. Since our proposed methodology uses bio-simulations to analyze the natural progression and evolution of avian influenza, it falls under the category of computational evolutionary methods.
The fundamental motivation for computational approaches, which enable the study of organisms in silico rather than in vitro, stems from their many advantages [12]. The primary advantage is that existing knowledge can be used to create virtual organisms, rapidly and economically analyze numerous future generations, and draw inferences [12,13,14]. Computational evolutionary approaches have been shown to be excellent candidates for analyzing the phylogeny of microorganisms such as bacteria and viruses [12]. These approaches are gaining significant importance in biology and medicine.

Epidemiology of Avian Influenza
The proposed computational evolutionary methodology can be applied to a broad range of biological systems [12,13,14]. However, here we will focus on the issues involved in applying the proposed methodology to study and analyze the epidemiology of avian influenza along with its socio-economic impacts. The objective is to use existing information on various organisms in the ecosystem, namely migrating waterfowl, poultry, and humans, to discover additional knowledge for forecasting timelines and epicenters of disease outbreaks. The motivation is to empower disease control centers and emergency management agencies with crucial information required to contain and control outbreaks to save human lives and minimize socio-economic impacts.

SEARUMS: A COMPUTATIONAL EVOLUTIONARY ENVIRONMENT FOR AVIAN INFLUENZA
Realizing the advantages of the proposed computational evolutionary methodology requires the use of an effective software environment for eco-modeling, bio-simulation, and epidemiological analysis [10]. Specifically, in this case, the software environment must be conducive to analyzing the epidemiology of avian influenza. The software environment must facilitate rapid modeling and simulation (M&S) to minimize analysis time frames, thereby accelerating deployment of countermeasures. Moreover, the software must be portable and accessible to enable its widespread use [10]. Accordingly, we have endeavored to design and develop such a software environment called SEARUMS. SEARUMS is an acronym for Studying Epidemiology of Avian Influenza Rapidly Using Modeling and Simulation. It is a Java-based, integrated, graphical modeling, simulation, and analysis environment that is specialized for epidemiological study of avian influenza. SEARUMS provides an extensible, agent-based, spatially explicit modeling front-end coupled with a discrete event simulation kernel. The conceptual model of the software agents has been developed from corresponding Markov processes that provide a strong mathematical grounding. SEARUMS has been integrated with libraries for plotting graphs and charts that ease visualization of simulation results and aid in analysis.
In addition, SEARUMS includes comprehensive models that incorporate real-world statistical data on: waterfowl migration, published by the Global Register of Migratory Species (GROMS) [15]; waterfowl species that are at higher risk of carrying the virus [6]; global poultry population and distribution, published by the Food and Agriculture Organization (FAO) of the United Nations [16]; and human population in metropolitan areas of the United States, obtained from the U.S. Census Bureau [17]. The model is stored as a portable XML document that can be readily reused and further extended. This aspect alleviates many procedural hurdles and promotes sharing and collaboration. SEARUMS is envisioned to serve as a global, multi-disciplinary environment that seamlessly integrates knowledge from various fields so that epidemiologists, economists, and disease control centers can collaboratively use it to combat avian influenza.

Conceptual Epidemiological Model
The epidemiology of avian influenza currently involves three primary types of entities, namely: migrating waterfowl flocks, humans, and poultry [6,8]. Migrating waterfowl are the primary vectors of the H5N1 virus [18]. They indirectly transmit the virus to poultry in their vicinity through infected food, water, and feces [19,20]. The disease is transmitted to humans who come in direct contact with contaminated surfaces or infected poultry [21]. In other words, the transmission of infections is determined by both space and time. Consequently, an epidemiological model needs to incorporate temporo-spatial interactions between the aforementioned three entities. Accordingly, we have developed our conceptual model using spatially explicit, discrete-time Markov processes as described further below [22]. Note that, since the goal of this section is to present factors steering the design and implementation of SEARUMS, it does not delve into mathematical details that are available in the literature [10].
The proposed conceptual model consists of a set of interacting entities. An entity may represent a flock of waterfowl, poultry, or a confederation of humans. Each entity is mathematically modeled by a corresponding Markov process. A Markov process undergoes instantaneous, probabilistic state transitions to reflect the behavior of its real-world counterpart [10]. Temporal state transitions induce changes to state variables, which in this case include population counts, location, infection percentage, and migration attributes. These changes correspond to various conditions such as: migration, healing or regeneration, and death or reduction in population. In conjunction with temporal changes, spatial interactions between entities are also captured as state transition probabilities. The spatial interactions, which occur in spatially explicit models, have been mathematically modeled using principles of Euclidean geometry [23]. The surface of the earth is represented as a Euclidean plane and each entity has a circular area of influence. Interactions between entities occur when their circles of influence overlap. Interactions between entities and the degree of overlap are used to model the intensity and spread of infection. Such a modeling approach is widely used in spatially explicit ecological models to reduce computational complexity [23,24,25].
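As a minimal illustration of the circular areas of influence described above, the following Java sketch tests whether two circles on a plane overlap and computes the area of their intersection. The class and method names are hypothetical, and using the overlap fraction as a proxy for interaction intensity is an assumption made for illustration; it is not taken from the SEARUMS source.

```java
// Hypothetical sketch: names and the "overlap fraction" measure are
// illustrative assumptions, not the SEARUMS implementation.
final class InfluenceCircle {
    final double x, y, radius;

    InfluenceCircle(double x, double y, double radius) {
        this.x = x;
        this.y = y;
        this.radius = radius;
    }

    /** Euclidean distance between the two circle centers. */
    double distanceTo(InfluenceCircle o) {
        return Math.hypot(x - o.x, y - o.y);
    }

    /** Two entities interact when their circles of influence overlap. */
    boolean overlaps(InfluenceCircle o) {
        return distanceTo(o) < radius + o.radius;
    }

    /** Area of the intersection of the two circles (0 if they are disjoint). */
    double overlapArea(InfluenceCircle o) {
        double d = distanceTo(o);
        if (d >= radius + o.radius) {
            return 0.0; // disjoint or merely touching
        }
        double rMin = Math.min(radius, o.radius);
        if (d <= Math.abs(radius - o.radius)) {
            return Math.PI * rMin * rMin; // smaller circle lies fully inside
        }
        double r1 = radius, r2 = o.radius;
        // Half-angles subtended by the common chord at each center.
        double a1 = Math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1));
        double a2 = Math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2));
        // Sum of the two circular segments cut off by the common chord.
        return r1 * r1 * (a1 - Math.sin(2 * a1) / 2)
             + r2 * r2 * (a2 - Math.sin(2 * a2) / 2);
    }

    /** Overlap as a fraction of the smaller circle's area, in [0, 1]. */
    double overlapFraction(InfluenceCircle o) {
        double rMin = Math.min(radius, o.radius);
        return overlapArea(o) / (Math.PI * rMin * rMin);
    }

    public static void main(String[] args) {
        InfluenceCircle flock = new InfluenceCircle(0, 0, 1);
        InfluenceCircle farm = new InfluenceCircle(1, 0, 1);
        System.out.println("overlaps: " + flock.overlaps(farm));
        System.out.println("overlap fraction: " + flock.overlapFraction(farm));
    }
}
```

A planar distance is used here because the text represents the surface of the earth as a Euclidean plane; a model working directly with latitude and longitude would need a projection step first.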
In SEARUMS, each Markov process in the aforementioned mathematical formalism is modeled as a smart agent [24]. The simulation consists of a collection of interacting smart agents whose temporal behavior is causally coordinated using a discrete event kernel. The state has been implemented using a set of attributes that incorporate both temporal and spatial variables. Note that our methodology does not coerce behaviors or interactions between entities. Instead, each entity mimics the behavior of its real-world counterpart, which significantly simplifies implementation, reduces computational complexity, enables lucid use of real-world data, and eases addition of new instances, including new types of software agents.
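To make the notion of a probabilistic state transition concrete, the following Java sketch steps a small discrete-time Markov process. The three states and the transition probabilities are illustrative assumptions chosen for the example; the actual SEARUMS agents track richer state (population counts, location, infection percentage, and migration attributes), and their transition probabilities are not given in this text.

```java
import java.util.Random;

// Hypothetical sketch of a discrete-time Markov process. The states and
// probabilities are illustrative assumptions, not SEARUMS values.
final class MarkovSketch {
    enum State { SUSCEPTIBLE, INFECTED, RECOVERED }

    // TRANSITIONS[i][j] = probability of moving from state i to state j
    // in one time step; every row sums to 1.
    static final double[][] TRANSITIONS = {
        {0.90, 0.10, 0.00}, // SUSCEPTIBLE
        {0.00, 0.70, 0.30}, // INFECTED
        {0.05, 0.00, 0.95}, // RECOVERED
    };

    /** Picks the next state from a uniform random draw u in [0, 1). */
    static State next(State current, double u) {
        double[] row = TRANSITIONS[current.ordinal()];
        double cumulative = 0.0;
        for (int j = 0; j < row.length; j++) {
            cumulative += row[j];
            if (u < cumulative) {
                return State.values()[j];
            }
        }
        return State.values()[row.length - 1]; // guard against rounding error
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        State s = State.SUSCEPTIBLE;
        for (int step = 0; step < 10; step++) {
            s = next(s, rng.nextDouble());
            System.out.println("step " + step + ": " + s);
        }
    }
}
```

Passing the random draw in as a parameter keeps the transition function deterministic and easy to check; in a spatially explicit model, the infection probability in the first row would additionally depend on interactions with neighboring entities.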

ARCHITECTURE OF SEARUMS
SEARUMS is an eco-modeling and bio-simulation virtual environment that uses a computational evolutionary methodology for study and analysis of the epidemiology of avian influenza. It has been developed in Java by capitalizing on many of its object-oriented programming features [26]. SEARUMS is designed to be a user-friendly, integrated, graphical modeling, simulation, visualization, and analysis environment for conducting epidemiological analysis of avian influenza using an Agent-based, Spatially Explicit (ASE) model. These design goals have been achieved by composing the system using a collection of interdependent but loosely coupled modules. Each module has a well-defined functionality that can be accessed and utilized via a set of Application Program Interface (API) method calls. The APIs of the modules are Java interfaces that are implemented by each module. Interactions between modules are achieved through these interfaces to ensure loose coupling. This approach permits seamless "plug and play" of modules, and the environment is composed by loading suitable modules dynamically on demand via Java's reflection API [26]. Such an implementation approach has been adopted to ease customization and extension of SEARUMS without requiring changes to its design or impacting existing modules.
An architectural overview of the modules constituting SEARUMS is shown in Figure 1. The modules can be broadly classified as core modules and Graphical User Interface (GUI) modules. The core modules of SEARUMS are the Agent Repository, Agent Customizer, Persistence Module, Dynamic Control & Steering Module, Simulation Module, and Logging Module. These modules provide the core M&S functionality of SEARUMS. The GUI facilitates interactions with the core modules via convenient and intuitive user interfaces. The GUI modules can be further categorized into the Editor subsystem, the Simulation Controller, and the Visualization & Analysis subsystem. SEARUMS uses the Model-View-Controller pattern to couple the core modules, the GUI modules, and the Eco-description. The design permits the GUI modules to be easily replaced with a minimal command-line text interface for running SEARUMS in offline batch mode. The batch mode is useful for performing repeated runs or analyzing different scenarios on computational clusters.
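The following Java sketch illustrates the general idea of loading a module by class name through the reflection API and using it only through an interface. The names SimModule, LoggingModule, and ModuleLoader are hypothetical and chosen for this example; they are not the actual SEARUMS interfaces.

```java
// Hypothetical sketch: SimModule, LoggingModule and ModuleLoader are
// illustrative names, not the actual SEARUMS API.
interface SimModule {
    String name();
}

final class LoggingModule implements SimModule {
    public LoggingModule() { }

    @Override
    public String name() {
        return "logging";
    }
}

final class ModuleLoader {
    /** Instantiates a module from its class name via Java reflection. */
    static SimModule load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            // asSubclass rejects classes that do not implement SimModule.
            return cls.asSubclass(SimModule.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load module " + className, e);
        }
    }

    public static void main(String[] args) {
        SimModule m = load(LoggingModule.class.getName());
        System.out.println("Loaded module: " + m.name());
    }
}
```

Because callers depend only on the interface, a different implementation can be substituted by changing the class name string, which is one way to obtain the "plug and play" behavior described above.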
The modules and subsystems constituting SEARUMS cooperatively operate on a shared, in-memory representation of the model called the Eco-description. The Eco-description is a centralized data structure that includes all the information necessary for modeling, simulation, and analysis. It is composed using a collection of Java classes that provide efficient access to data and information required by the various modules. The primary information encapsulated by the Eco-description relates to the smart agents [24] that constitute the model. As shown in Figure 1, the agents are organized into an Agent Repository to facilitate instantiation and use via the Java reflection API. Currently, SEARUMS includes the following three smart agents: Waterfowl Agent, which represents a migrating waterfowl flock; Poultry Agent, which models the behavior of poultry flocks; and Human Agent, which models humans. Each agent has its own behavior that reflects the characteristics of its real-world counterpart. The behaviors are customized to represent specific instances of an agent by specifying suitable values for the exposed attributes via the Attribute Editor GUI module. The attributes of an agent include:
1. Geographic attributes that indicate the location (latitude and longitude) and logical association with countries and continents. In addition, each agent has a circle of influence that circumscribes its neighborhood.
2. Migratory attributes, which are specified only for agents whose location changes over the lifetime of the simulation. The migratory attributes are described as a sequence of migration points. Each migration point has geographical and chronological (arrival and departure dates) attributes associated with it. In SEARUMS, only one complete migration cycle needs to be specified. The software automatically reuses the information to simulate annual migratory cycles.
3. Statistical attributes for agent instances, which include their initial population, density and distribution, initial infection percentage, infection spread parameters, incubation periods, mortality rates, and population regrowth parameters.
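The reuse of a single migration cycle for every simulated year can be sketched as follows. This is a minimal illustration under stated assumptions: the MigrationPoint class, its fields, and the locate method are hypothetical, and the cycle is assumed to be listed in chronological order and to span at most one year starting at the first arrival date.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative
// assumptions, not the SEARUMS data model.
final class MigrationSketch {
    /** One stop on a flyway; the dates describe a single reference cycle. */
    static final class MigrationPoint {
        final double latitude, longitude;
        final LocalDate arrival, departure;

        MigrationPoint(double latitude, double longitude,
                       LocalDate arrival, LocalDate departure) {
            this.latitude = latitude;
            this.longitude = longitude;
            this.arrival = arrival;
            this.departure = departure;
        }
    }

    /**
     * Returns the stop occupied on the given date, or null while the flock
     * is in transit. Any date is first mapped into the reference cycle.
     */
    static MigrationPoint locate(List<MigrationPoint> cycle, LocalDate date) {
        LocalDate start = cycle.get(0).arrival;
        long years = ChronoUnit.YEARS.between(start, date);
        LocalDate mapped = date.minusYears(years);
        if (mapped.isBefore(start)) {
            mapped = mapped.plusYears(1); // dates earlier than the cycle start
        }
        for (MigrationPoint p : cycle) {
            if (!mapped.isBefore(p.arrival) && !mapped.isAfter(p.departure)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<MigrationPoint> cycle = Arrays.asList(
            new MigrationPoint(10, 20, LocalDate.of(2006, 1, 15), LocalDate.of(2006, 3, 15)),
            new MigrationPoint(50, 30, LocalDate.of(2006, 5, 15), LocalDate.of(2006, 9, 15)));
        MigrationPoint p = locate(cycle, LocalDate.of(2008, 6, 1));
        System.out.println(p == null ? "in transit" : "at latitude " + p.latitude);
    }
}
```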
The agents implement the conceptual, mathematical model of the system developed using Markov processes. They are added to a model via suitable toolbar buttons or menu options provided by SEARUMS. Agent instances are created with default attributes from the Agent Repository by the Agent Customizer module using Java's reflection API. Once instantiated, the attributes for agents can be modified via the Attribute Editor module. The agents are implemented as a family of Java classes by extending a common base class called Agent. The Agent class provides methods for interacting with the simulation kernel, inspecting the neighborhood, scheduling events, and interfacing with the GUI modules.
The agents in a model are logically organized into hierarchical sets called groups. SEARUMS permits multiple top-level groups with an arbitrary number of hierarchies, with one or more sub-groups at each hierarchical level. An agent can be a member of multiple groups. The groups serve several different purposes in SEARUMS. A group can be used as a parameter for statistical analysis and for plotting charts. For example, a group called "United States" can be created with 50 different sub-groups, one for each state, encompassing various agents. The main "United States" group can be selected for plotting charts and SEARUMS automatically collates and plots data for each state. Note that even though graph plotting is restricted to one hierarchical level, statistics for plotting are collated in a recursive, depth-first manner and include data from all agents in underlying hierarchies. A modeler can use a combination of groups to perform multifaceted analysis at different scales. In addition, groups can be included or excluded from simulations for analyzing different scenarios. The GUI modules utilize groups to provide control over the visibility of agents to manage details displayed on the screen. The Group Editor module provides the user interface for managing group entries and hierarchies.
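The recursive, depth-first collation of statistics over a group hierarchy can be sketched in Java as shown below. The Group class and its fields are hypothetical, and each agent is reduced to a single population count purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the Group class is an illustrative assumption and
// each agent is represented only by its population count.
final class GroupSketch {
    static final class Group {
        final String name;
        final List<Group> subGroups = new ArrayList<>();
        final List<Integer> agentPopulations = new ArrayList<>();

        Group(String name) {
            this.name = name;
        }

        /** Depth-first sum over this group's agents and all sub-groups. */
        long totalPopulation() {
            long total = 0;
            for (int population : agentPopulations) {
                total += population;
            }
            for (Group sub : subGroups) {
                total += sub.totalPopulation(); // recurse into the hierarchy
            }
            return total;
        }
    }

    public static void main(String[] args) {
        Group country = new Group("United States");
        Group ohio = new Group("Ohio");
        ohio.agentPopulations.add(100);
        ohio.agentPopulations.add(200);
        country.subGroups.add(ohio);
        // One value per sub-group corresponds to one plotted series per state.
        for (Group state : country.subGroups) {
            System.out.println(state.name + ": " + state.totalPopulation());
        }
    }
}
```

Since an agent may belong to multiple groups, a fuller implementation would need to avoid counting the same agent twice when it appears in more than one sub-group of the same hierarchy; this sketch does not address that case.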
Once all the agent instances and groups have been established in a model, the parameters for observation are added to the Eco-description. These parameters are selected by the user via the Statistics & Charts Editor from a list of options. The list includes the attributes of the agents and the groups in the Eco-description. Each parameter is configured to be sampled hourly, daily, or weekly in terms of simulation time. Moreover, each parameter can be subjected to statistical operations, such as sum, mean, and median. SEARUMS can dynamically (i.e., during simulation) plot and save a variety of charts, including line graphs and pie charts. Multiple charts can be used simultaneously for analyzing a variety of data.
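The statistical operations named above (sum, mean, and median) can be illustrated with a small Java helper applied to a series of sampled values. The class name and the use of a plain array of samples are assumptions made for this sketch.

```java
import java.util.Arrays;

// Hypothetical sketch: SampleStats is an illustrative helper, not part of
// SEARUMS. Each array holds values sampled at a fixed simulation interval.
final class SampleStats {
    static double sum(double[] samples) {
        double total = 0.0;
        for (double v : samples) {
            total += v;
        }
        return total;
    }

    static double mean(double[] samples) {
        requireNonEmpty(samples);
        return sum(samples) / samples.length;
    }

    static double median(double[] samples) {
        requireNonEmpty(samples);
        double[] sorted = samples.clone(); // leave the caller's data untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    private static void requireNonEmpty(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
    }

    public static void main(String[] args) {
        double[] dailyInfected = {3, 1, 2};
        System.out.println("sum=" + sum(dailyInfected)
            + " mean=" + mean(dailyInfected)
            + " median=" + median(dailyInfected));
    }
}
```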
All of the aforementioned information is stored as an integral part of the Eco-description. The Eco-description can be saved for future reuse via the Persistence Module. The Eco-description is marshalled into an XML document that is compliant with a predefined XML schema. Serializing to an XML document has its advantages. First, it enables simple scripts to be developed that can modify specific values and perform multiple simulation runs in batch mode. Second, XML documents can be readily version controlled and archived using commonly available revision control systems like CVS and Subversion. Third, it eases documentation, validation, sharing, and reuse of valuable domain-specific statistical data collated by different researchers from diverse sources. Such features play an important role in facilitating large-scale, collaborative epidemiological studies.
The Simulation Module performs the task of conducting a Discrete Event Simulation (DES) using the Eco-description. This module utilizes a multi-threaded DES kernel that manages and schedules the discrete events generated by the agents. Multi-threading enables the DES kernel to exploit the compute power of multi-processor or multi-core machines, thereby reducing the wall-clock time for simulation. The number of threads spawned by the DES kernel is configurable. Each thread processes concurrent events (events with the same timestamp) in parallel without violating the causal constraints between events. The Dynamic Control & Steering Module provides the infrastructure to control the DES kernel. In addition, it permits selected agent attributes to be modified during the course of simulation. The Simulation Controller module provides the graphical interface to the Dynamic Control & Steering Module.
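One way to process concurrent events in parallel while preserving causal order is sketched below in Java. This is a minimal illustration and not the SEARUMS kernel: the class and method names are hypothetical, and the batching strategy (run all events that share the lowest timestamp on a thread pool, then wait before advancing) is an assumption about how such a kernel could be organized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a multi-threaded discrete event kernel; names and
// the batching strategy are illustrative assumptions.
final class DesKernelSketch {
    static final class Event {
        final long timestamp;
        final Runnable action;

        Event(long timestamp, Runnable action) {
            this.timestamp = timestamp;
            this.action = action;
        }
    }

    private final PriorityQueue<Event> pending =
        new PriorityQueue<>((a, b) -> Long.compare(a.timestamp, b.timestamp));
    private final int threads;

    DesKernelSketch(int threads) {
        this.threads = threads;
    }

    /** Thread-safe so that running events may schedule follow-up events. */
    synchronized void schedule(long timestamp, Runnable action) {
        pending.add(new Event(timestamp, action));
    }

    /** Runs until no events remain; returns the last timestamp processed, or -1. */
    long run() {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long now = -1;
        try {
            while (true) {
                List<Callable<Void>> batch = new ArrayList<>();
                synchronized (this) {
                    if (pending.isEmpty()) {
                        break;
                    }
                    now = pending.peek().timestamp;
                    // Collect every event that shares the lowest timestamp.
                    while (!pending.isEmpty() && pending.peek().timestamp == now) {
                        Runnable action = pending.poll().action;
                        batch.add(() -> { action.run(); return null; });
                    }
                }
                // invokeAll blocks until the whole batch has finished, so no
                // later event starts before earlier ones complete.
                pool.invokeAll(batch);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Simulation interrupted", e);
        } finally {
            pool.shutdown();
        }
        return now;
    }

    public static void main(String[] args) {
        DesKernelSketch kernel = new DesKernelSketch(2);
        kernel.schedule(1, () -> System.out.println("event at t=1"));
        kernel.schedule(2, () -> System.out.println("event at t=2"));
        System.out.println("last timestamp: " + kernel.run());
    }
}
```

With this arrangement the achievable speedup is bounded by the number of events that share a timestamp, since each batch must finish before simulation time advances. The sketch does not reject events scheduled in the past and ignores exceptions thrown by event actions.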

ECO-MODEL FOR VERIFICATION AND USE OF SEARUMS
The initial experiments conducted using SEARUMS centered around its verification and validation. For this purpose, we have developed an Eco-description using selected, high-risk species of waterfowl as reported by Hagemeijer et al. [27]. Table 1 lists the waterfowl species, including high-risk species [27], used in the model. The migratory flyways of the waterfowl and their populations have been collated from data published by various organizations [2,4,15,16,17,27]. For modeling and simulation purposes, the dates for migration were approximated to the middle of the months reported in the statistics. The initial positions of the flocks were set to correspond to 01/01/2006, which is the real-world time when this specific simulation is logically set to commence. The dispersion of poultry population in different continents has been approximated to circular regions with even density in the Eco-description [16,25]. Poultry data has been collated from statistics published by national organizations and government databases [16]. Currently, we have represented human population, as reported by the U.S. Census Bureau [17], in all 26 major metropolitan areas of the United States, approximated to circular regions. However, the Eco-description can be readily extended to include other parts of the world. Note that, to the best of our knowledge, it is the most comprehensive model of its kind reported to date.

VERIFICATION OF SEARUMS
Having developed the Eco-description, we verified its validity by performing extensive simulations with the initial source of infection set to the outbreak in Indonesia [5]. We established the validity of the Eco-description and SEARUMS by confirming that the timing and chronology of several outbreaks observed in the simulations correlate with significant real-world incidents as reported by WHO [5]. Table 2 presents a comparison of real-world and simulated outbreaks. The data is presented for some of the significant, initial avian influenza outbreaks and not for repeated outbreaks that occur in these regions. Note that deviations of ±2 weeks are expected due to the approximation of migration dates. In addition, deviations in dates also occur because the Eco-description does not include all types of migratory waterfowl flocks but only the high-risk species. However, the sufficiently close concordance between simulated and real-world outbreaks establishes the validity and effectiveness of SEARUMS. Moreover, it significantly increases confidence in inferences drawn from the simulation. Currently, a variety of case studies using SEARUMS are already underway to predict and avert a pandemic [27].

Scalability and Performance Evaluation
One of the important objectives underlying the design of SEARUMS is to facilitate rapid analysis of the epidemiology of avian influenza. Accordingly, the validated Eco-description was also used to conduct scalability and performance evaluation of SEARUMS using three different platforms. The first platform was a Sun Netra-T12 SMP workstation with eight 1.2 GHz SPARCv9 processors and 16 gigabytes of RAM running Solaris 9. The second platform was a conventional personal computer (PC) running Windows XP (32-bit). The PC had a dual-core, 64-bit Turion processor running at 2 GHz with 2 GB of memory. The third platform was the same PC but running Linux (Fedora Core). Specifically, the PC was set up in a dual-boot configuration enabling it to run either Windows XP or Linux. These simulations were conducted by running SEARUMS in batch mode, without any GUI overhead, to ensure emphasis on core performance.
The graph in Figure 2 plots the wall-clock time taken to complete the simulation using a varying number of threads on the aforementioned platforms. As shown by the curves, initially the time taken for simulation reduces as the number of threads is increased. The wall-clock time decreases because the threads run in parallel on the multi-processor/multi-core machines and rapidly process concurrent events. The initial performance improvement highlights the scalability of SEARUMS. However, the performance decreases as the number of threads is further increased. The deterioration occurs because the Eco-description used in this experiment does not have sufficient inherent concurrency to leverage the available compute power. On the other hand, larger models with more concurrency will benefit from the scalable design.
These experiments highlight the portability, scalability, and performance aspects of the SEARUMS design. Simulations involving more than 20 million events complete within 4 minutes in optimal cases, demonstrating that SEARUMS enables rapid epidemiological analysis of avian influenza. Furthermore, the experiments provide empirical evidence that the design goals of SEARUMS have been successfully achieved.

Case Study: Spread and Impact to United States
In this case study, we analyzed the potential impact of avian influenza on poultry farming in the United States using the validated Eco-description presented earlier. The study was conducted using a number of bio-simulations with three different experimental groups of migrating waterfowl. The three flocks were chosen based on their close proximity to known primary sites of disease outbreaks. The initial infection in each experimental group was varied for analysis. In all these experiments, the initial conditions were seeded corresponding to January 1, 2006 and the simulations were run for a period of 5 years. Figure 3 illustrates one of the trans-Atlantic transmission pathways to the continental United States. We observed that the spread was determined by migratory pathways and timelines of different species of waterfowl rather than initial infection percentages. One of the notable observations is that our experiments correctly predicted an outbreak in the United Kingdom [4]. The graph in Figure 4 presents the impact of avian influenza outbreaks on poultry population in the continental United States. A decrease in poultry population corresponds to H5N1-induced death and culling of birds to control the disease. An increase in poultry population reflects regeneration of poultry flocks after an outbreak. As illustrated by the graph, infections in poultry also follow a cyclic pattern that correlates with the annual migration of waterfowl. The mortality figures can be translated to corresponding dollar figures for financial analysis.

CONCLUSIONS
The application of computational evolutionary methodology to analyze the epidemiology of avian influenza was presented. Specifically, the practical issues involved in applying this methodology using an eco-modeling and bio-simulation environment called SEARUMS were discussed. The evolutionary epidemiological model of avian influenza was developed and empirically verified. The verification experiments highlight that the proposed methodology closely reflects real-world scenarios. The bio-simulations conducted using SEARUMS provide good forecasts of timelines and epicenters of outbreaks. Furthermore, it was observed that outbreaks follow a cyclical, annual pattern. This new, comprehensive knowledge gained about the epidemiology of avian influenza is invaluable for combating outbreaks and preventing a pandemic. We are optimistic that the proposed software environment will enable mankind to strategically invest precious time and resources to combat diseases like avian influenza and minimize their impacts on human life and the global economy.