Combinatorial problems in genetics

Leo van Iersel

The DNA of so called diploid organisms, for example humans, consists of pairs of chromosomes. These chromosomes can be seen as strings over the alphabet {A,C,G,T}. Although it is possible to obtain the data from both chromosomes of a pair in separate strings, this is still very expensive and time-consuming. It is much easier and faster to obtain a string with the conflated data of both chromosomes. However, this string (the genotype) does not capture all the information from the strings of the two chromosomes (the haplotypes). It merely describes, for each site, the two letters the chromosomes have at that site. But it does not say which letter is on which chromosome. This data is thus ambiguous and an important problem in computational biology is to resolve this ambiguity without the expensive process of sequencing the two chromosomes of a pair separately. The well studied “parsimony” model considers genotypes of a population and its objective is to find the smallest set of haplotypes resolving all the genotypes. This problem is known to be NP-hard in general, but we consider different restrictions and show polynomial time algorithms for some of them and give approximation results for others. We extend our results to the Minimum Perfect Phylogeny Problem that asks for the minimum number of haplotypes that resolve all the genotypes but can also be embedded as the leaves of an evolutionary tree.

back to EIDMA Seminar Combinatorial Theory announcements