Research Projects

Dissertation Research

Stochastic Evolutionary Modeling of cis-Regulatory Modules
Advisors: Saurabh Sinha, Chengxiang Zhai

In a cell, when and where genes are expressed is regulated by other genes, called transcription factors, and the complex networks of such regulatory interactions direct cells to perform their proper functions. Transcription factors act on a target gene through its regulatory DNA sequences, whose basic units are transcription factor binding sites, pieces of DNA that serve as molecular switches to turn genes on or off. Unlike coding sequences, regulatory elements do not follow simple structural rules, so identifying them in genomes is a great challenge. My dissertation research addresses this challenge through rigorous computational models. The central component of my work is a computational framework, STEMMA, that models the tendency of binding sites to form clusters using a Hidden Markov Model, and the evolution of binding sites, including both conservation and turnover among multiple species, using a stochastic evolutionary model. This framework serves as a general tool for deciphering regulatory mechanisms through cross-species comparisons. Using this framework, we are analyzing a large fruit fly gene expression dataset to construct a global map of gene regulation in fruit fly development. Besides decoding the regulatory network more accurately, this work also provides a powerful way to reveal why certain binding sites are created or destroyed by mutations, yielding new insights into the evolutionary forces that differentiate one species from another.
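To give a rough feel for how a Hidden Markov Model can capture clustering of binding sites, here is a toy sketch in Python. It is not the STEMMA model: the two states, the 'S' / '-' encoding of positions (binding-site match or not), and every probability below are made-up assumptions chosen only for illustration, and the evolutionary component is left out entirely.

```python
import math

STATES = ("background", "module")

# Hypothetical parameters, chosen only for illustration.
START = {"background": 0.9, "module": 0.1}
TRANS = {
    "background": {"background": 0.95, "module": 0.05},
    "module": {"background": 0.10, "module": 0.90},
}
# Probability that a position carries a binding-site match ("S") or not ("-").
EMIT = {
    "background": {"S": 0.05, "-": 0.95},
    "module": {"S": 0.50, "-": 0.50},
}


def viterbi(observations):
    """Return the most likely state path for a string of 'S' / '-' symbols."""
    if not observations:
        return []
    # Log-probabilities of the best path ending in each state at position 0.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
               for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for symbol in observations[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = (prev[best_prev] + math.log(TRANS[best_prev][s])
                      + math.log(EMIT[s][symbol]))
            ptr[s] = best_prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# A dense run of site matches flanked by site-free sequence.
path = viterbi("-----SSSSSSSS-----")
```

With these assumed parameters, isolated matches are explained by the background state, while a dense run of matches is labeled as a module, which is the clustering intuition the paragraph above describes.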

Other Research Projects

Identification of evolutionarily constrained gene clusters
Jan. 2006 - Present, University of Illinois at Urbana-Champaign, Urbana, Illinois
Advisors: Chengxiang Zhai, Jiawei Han

I conducted original research on efficient data mining algorithms for studying gene arrangement patterns across multiple species. During evolution, the order and relative proximity of genes in genomes are generally not well conserved, because rapid rearrangement events reshuffle genomes. On the other hand, functionally related genes may be constrained by natural selection to remain close to each other, forming so-called conserved gene clusters. Identifying conserved gene clusters is therefore an appealing way to reveal functional relationships among genes and the forces underlying the evolution of genome organization. However, substantial genome rearrangements and the sheer amount of data pose computationally demanding problems. I developed a very efficient algorithm, MCMuSeC, that borrows ideas from the field of data mining to detect conserved gene clusters. I further developed a statistical method for evaluating predicted clusters, which is critical for distinguishing truly important clusters from a large number of false positive predictions. Using this combined algorithmic and statistical framework, I performed extensive studies on more than one hundred bacterial genomes, made interesting findings about genome evolution, and predicted functions of many poorly characterized genes.
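The basic idea of a conserved gene cluster can be illustrated with a deliberately naive Python sketch. This is not MCMuSeC: it only counts gene pairs, uses a fixed window, and does no statistical evaluation. The function name, the `window` and `min_support` parameters, and the example genomes are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def conserved_pairs(genomes, window=3, min_support=2):
    """Return gene pairs that fall inside a window of `window` consecutive
    genes in at least `min_support` genomes, mapped to their support."""
    support = Counter()
    for genes in genomes.values():
        seen = set()  # count each pair at most once per genome
        for start in range(len(genes)):
            chunk = set(genes[start:start + window])
            for pair in combinations(sorted(chunk), 2):
                seen.add(pair)
        support.update(seen)
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical gene orders for three genomes (gene family identifiers).
genomes = {
    "A": ["g1", "g2", "g3", "g7", "g8"],
    "B": ["g8", "g3", "g2", "g9", "g1"],
    "C": ["g2", "g5", "g3", "g1", "g7"],
}
clusters = conserved_pairs(genomes, window=3, min_support=3)
```

Here only the pair (g2, g3) stays within the window in all three genomes despite the reshuffled gene order. Counting how many genomes support a candidate is the same support notion used in frequent itemset mining, which is one sense in which this kind of problem connects to data mining.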

Text Mining in biomedical literature
Jan. 2005 - Present, University of Illinois at Urbana-Champaign, Urbana, Illinois
Advisors: Chengxiang Zhai, Bruce Schatz

I conducted original research on text mining and information retrieval for biomedical literature. I developed a software tool, Gene Summarizer, that automatically generates a structured summary of any gene from the biomedical literature. Using techniques from textual information retrieval, the program extracts the most relevant and representative sentences describing a gene from a potentially large document collection, covering aspects of the gene such as the phenotype of its mutant and its spatial expression pattern. The generated summaries are not only directly useful to biologists but also serve as entry points that help them quickly digest the retrieved articles.

Identification of cis-regulatory DNA motifs associated with social behavior in honey bees
Jun. 2004 - Oct. 2006, University of Illinois at Urbana-Champaign, Urbana, Illinois
Advisors: Saurabh Sinha, Gene Robinson, Chengxiang Zhai

I conducted research on gene regulation associated with social behavior in honey bees. Honey bees' socially regulated transition from working in the hive to foraging is associated with changes in the expression of thousands of genes in the brain. To elucidate the molecular basis of this socially regulated behavior, we applied an approach based on a probabilistic model (a Hidden Markov Model) to scan the honey bee genome for putative target genes of particular motifs. We then tested each motif's target gene set for statistical enrichment with respect to certain aspects of honey bees' social regulation. Interestingly, we found that transcription factors that perform nervous system-related functions in the fruit fly are also likely to regulate behavior-related genes. This study provides important clues to the social behavior of honey bees, and evidence for a general principle that the same set of genes may be reused to direct different developmental patterns as well as behavior in different species.
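One common way to test a gene set for enrichment is a one-sided hypergeometric test. The Python sketch below assumes that test for illustration; it is not necessarily the statistic used in this study, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(total_genes, annotated_genes, target_genes, overlap):
    """P(X >= overlap), where X is the number of annotated genes among
    `target_genes` genes drawn without replacement from `total_genes` genes,
    `annotated_genes` of which carry the annotation of interest."""
    max_overlap = min(annotated_genes, target_genes)
    denom = comb(total_genes, target_genes)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(annotated_genes, k)
        * comb(total_genes - annotated_genes, target_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denom


# Toy numbers: 10 genes in total, 3 annotated as behavior-related,
# 3 motif targets, all 3 of which are annotated.
p = hypergeom_enrichment_pvalue(10, 3, 3, 3)  # 1 / 120
```

A small p-value means the motif's targets overlap the annotated set more often than expected by chance. When many motifs are tested, a multiple-testing correction would normally be applied as well.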

Multi-faceted text mining and summarization
Jan. 2006 - Oct. 2008, University of Illinois at Urbana-Champaign, Urbana, Illinois
Advisor: Chengxiang Zhai

I conducted research on developing and applying topic modeling methods to specific text mining and retrieval tasks, such as extracting topics from blogs and summarizing and organizing arbitrary topics in a text collection. In the blog mining project, we proposed a novel probabilistic model that simultaneously captures the mixture of topics and sentiments in blog articles. This model applies broadly to any text collection with latent topical facets and associated sentiments. In the project on multi-faceted summarization of arbitrary topics, I proposed a new, more realistic setup of the problem, which allows a user to flexibly describe each facet of any topic with keywords and generates a multi-faceted overview in an unsupervised manner. Together, these studies demonstrate that probabilistic models are powerful tools for automatically representing and summarizing knowledge from textual data.

Mining search logs for query alteration
May 2008 - Aug. 2008, Microsoft Research, Redmond
Research Intern

I conducted original research on mining search logs for query alteration, with the ultimate goal of improving Live Search.