Computer Sciences Dept.

Learning Expressive Computational Models of Gene Regulatory Sequences and Responses

Keith Noto

The regulation and responses of genes involve complex systems of relationships between genes, proteins, DNA, and a host of other molecules that are involved in every aspect of cellular activity.

I present algorithms that learn expressive computational models of cis-regulatory modules (CRMs) and gene-regulatory networks. These models are expressive because they are able to represent key aspects of interest to biologists, often involving unobserved underlying phenomena. The algorithms presented in this thesis are designed specifically to learn in these expressive model spaces.

I have developed a learning approach based on models of CRMs that represent not only the standard set of transcription factor binding sites, but also logical and spatial relationships between them. I show that my expressive models learn more accurate representations of CRMs in genomic data sets than current state-of-the-art learners and several less expressive baseline models.
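
As a rough illustration of what "logical and spatial relationships" between binding sites can mean, the sketch below encodes a toy CRM as two motifs joined by a logical AND, with an ordering and a maximum distance between them. The class name `ToyCRM`, the exact-string motifs, and the `max_gap` parameter are assumptions made for this example and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class ToyCRM:
    """Hypothetical CRM: motif A AND motif B, with B downstream of A within max_gap bases."""
    motif_a: str
    motif_b: str
    max_gap: int  # largest allowed distance (bp) from the end of A to the start of B

    def matches(self, seq: str) -> bool:
        # Collect every start position of each motif (exact string match).
        starts_a = [i for i in range(len(seq)) if seq.startswith(self.motif_a, i)]
        starts_b = [i for i in range(len(seq)) if seq.startswith(self.motif_b, i)]
        for a in starts_a:
            end_a = a + len(self.motif_a)
            for b in starts_b:
                # Spatial constraint: B starts after A ends, at most max_gap bases away.
                if 0 <= b - end_a <= self.max_gap:
                    return True
        return False


crm = ToyCRM(motif_a="TATA", motif_b="GGCC", max_gap=5)
print(crm.matches("ACTATAACGGCCTT"))         # both motifs present and close together
print(crm.matches("ACTATAACGTACGTAGGCCTT"))  # both present, but too far apart
```

A model restricted to the presence or absence of each motif would treat the two sequences above identically; the distance constraint is what separates them.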

I have developed a probabilistic version of these CRM models that is closely related to hidden Markov models. I show how inference and parameter learning can be performed efficiently in these models when processing long promoter sequences, and that these expressive probabilistic models are also more accurate than several baselines.
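
For readers unfamiliar with hidden Markov models, the sketch below shows the standard forward algorithm on a two-state toy HMM over DNA. It is the textbook procedure, not the thesis's model: the state names and every probability value are made up for illustration. It does show why inference can scale to long sequences, since the running time is linear in sequence length.

```python
import math

STATES = ("background", "site")
# All probabilities below are illustrative assumptions, not learned values.
START = {"background": 0.9, "site": 0.1}
TRANS = {
    "background": {"background": 0.95, "site": 0.05},
    "site": {"background": 0.20, "site": 0.80},
}
EMIT = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}


def log_likelihood(seq: str) -> float:
    """Forward algorithm with per-position rescaling.

    Runs in O(len(seq) * len(STATES)**2) time; rescaling keeps the
    forward values from underflowing on long sequences.
    """
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    total = sum(alpha.values())
    log_prob = math.log(total)
    alpha = {s: v / total for s, v in alpha.items()}
    for base in seq[1:]:
        # Sum over the previous state, then emit the current base.
        alpha = {
            s: EMIT[s][base] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
        total = sum(alpha.values())
        log_prob += math.log(total)
        alpha = {s: v / total for s, v in alpha.items()}
    return log_prob


print(log_likelihood("ACGT" * 5000))  # a 20,000-base sequence, handled without underflow
```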

I also present a general-purpose regression learner for sequential data. This approach discovers mappings from sequence features in DNA (e.g., transcription factor or sigma factor binding sites) to real-valued responses (e.g., transcription rates). Its key contribution is that it uses the real values directly to discover the relevant sequence features, rather than choosing the features beforehand or learning them from sequence alone, and it does so without losing information in a discretization process.
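
A minimal sketch of the idea of letting real-valued responses drive feature discovery: each candidate k-mer is scored by how well its presence or absence predicts the responses under a least-squares fit, and the best-scoring k-mer is kept. The function `best_kmer`, the use of exact k-mers as features, and the toy data are assumptions for illustration; the learner in the thesis is more general.

```python
def best_kmer(seqs, responses, k=3):
    """Return (kmer, sse) for the k-mer whose presence/absence best fits the responses."""
    candidates = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

    def sse(values):
        # Squared error around the group mean, which is the least-squares
        # prediction for a single binary feature.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for kmer in sorted(candidates):  # sorted for deterministic tie-breaking
        present = [y for s, y in zip(seqs, responses) if kmer in s]
        absent = [y for s, y in zip(seqs, responses) if kmer not in s]
        error = sse(present) + sse(absent)
        if best is None or error < best[1]:
            best = (kmer, error)
    return best


# Toy data: sequences containing "TTG" have high responses.
seqs = ["ACTTGA", "GGTTGC", "ACCGTA", "GGCATC"]
responses = [5.1, 4.9, 1.0, 1.2]
print(best_kmer(seqs, responses))
```

On this toy data the selected feature is `TTG`, the only k-mer shared by the two high-response sequences. The responses are never thresholded into "high" and "low" classes, so no information is lost to discretization.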

Finally, I present and evaluate a gene-regulatory network model that learns the hidden underlying states of regulators from expression data and the set of cellular conditions under which expression is measured. I show that using sequence data to estimate the role of each regulator (activator or repressor) increases the accuracy of the learned models.
