Code and Biology

As science and software advance, we can fuse the two to discover and treat diseases in the hope of prolonging life. Tasks like sequencing the human genome, isolating genetic markers, and handling large amounts of data are now all possible through a scientific field called bioinformatics: the study and processing of biological data through software, engineering, and mathematics.

As a developer with a biochemistry degree, I am naturally fascinated by bioinformatics. I wondered how algorithms like finding complete sets of “()” in strings might have a purpose in biology. It turns out that searching for “()” sets is similar to searching for certain patterns in DNA strands.

DNA is the human blueprint and contains a vast array of information. In text, strands are represented by four nucleotides, “A”, “T”, “C”, and “G”, as in “ATTGGCATAC…”. Targeting certain patterns in our DNA can lead to the discovery of new proteins and mutations, ultimately providing valuable insights like risks for cancer, diabetes, or heart disease.

As an exercise, I wrote a short script that takes a DNA strand and a target pattern and returns the start and end of all matches on both strands of a mock DNA (DNA is a double helix composed of a direct strand and a reverse strand).
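To make the relationship between the two strands concrete, here is a minimal sketch on a five-nucleotide toy sequence. The helper name `reverse_complement` and the sample sequence are my own assumptions for illustration, not part of the script: the reverse strand is the direct strand read backwards, with each nucleotide swapped for its pair (a with t, c with g).

```ruby
# Hypothetical helper for illustration only: reverse the sequence, then
# swap each nucleotide for its pair (a <-> t, c <-> g) using String#tr.
def reverse_complement(strand)
  strand.reverse.tr('atcg', 'tagc')
end

direct  = 'attgc'
reverse = reverse_complement(direct)

puts "Direct:  #{direct}"   # attgc
puts "Reverse: #{reverse}"  # gcaat
```

Applying the helper twice gives back the original sequence, which is a quick sanity check that the pairing is symmetric.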

Suppose we have the DNA strand:

STRAND = "aggcgtatgcgatcctgaccatgcaaaactccagcgtaaatacctagccatggcgacacaaggcgcaaga" +
"caggagatgacggcgtttagatcggcgaaatattaaagcaaacgacgatgacttcttcgggaaattagttccctactcgt" +
"gtactccaattagccataacactgttcgtcaagatatagggggtcacccatgaatgtcctctaaccagaccatttcgtta" +
"cacgaacgtatctgatgacttcttcgggaaattagttccctactcgtgtactccaattagccataacactgttcgtcaag" +
"atatagggggtcacccatgaatgtcctctaaccagaccatttcgttacacgaacgtatcttcggggcgtatgcgatcctg" +
"accatgcaaaactccagcgtaaatacctagccatggcgacacaaggcgcaagacaggagatgacggcgtttagatcggcg" +
"aaatattaaagcaaacgacgatgacttcttcgggaaattagttccctactcgtgtactccaattagccataacactgttc" +
"gtcaagatatagggggtcacccatgaatgtcctctaaccagaccatttcgttacacgaacgtatctgatgacttcttcgg" +
"gaaattagttccctactcgtgtactccaattagccataacactgttcgtcaagatatagggggtcacccatgaatgtcct" +
"ctaaccagaccatttcgttacacgaacgtatcttcggggcgtatgcgatgaaattagttccctactcgtgtactccaatt" +
"agccataacactgttcgtcaagatatagggggtcacccatgaatgtcctctaaccagaccatttcgttacacgaacgtat" +
"ctgatgacttcttcgggaaattagttccctactcgtgtactccaattagccataacactgttcgtcaagatatagggggt" +
"cacccatgaatgtcctctaaccagaccatttcgttacacgaacgtatcttcggggcgtat"

And a target pattern:

PATTERN = "ctccag"

The script uses two functions: one builds the reverse strand from the direct one (reversing it and swapping each nucleotide for its complement), and the other does the search:


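def reverse_strand(strand)
	# Nucleotide key: each base maps to its complement
	key = {
		'a' => 't',
		't' => 'a',
		'c' => 'g',
		'g' => 'c'
	}
	# Read the strand backwards, swapping each base for its complement.
	# fetch raises a KeyError on any character that is not a, t, c or g.
	strand.reverse.each_char.map { |base| key.fetch(base) }.join
end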

def search_strand(strand, pattern)
	out = {
		pattern: pattern,
		matches: []
	}

	# Slide a window the size of the pattern along the strand
	strand.length.times do |i|
		candidate = strand[i...i+pattern.length]
		if candidate == pattern
			# :finish is exclusive: the index just past the last matched base
			out[:matches] << {start: i, finish: i+pattern.length}
		end
	end

	return out
end

Running

puts "Direct: #{search_strand(STRAND, PATTERN)}"
puts "Reverse: #{search_strand(reverse_strand(STRAND), PATTERN)}"

Yields:

Direct: {
	:pattern=>"ctccag", 
	:matches=>[
		{:start=>28, :finish=>34}, 
		{:start=>401, :finish=>407}
	]
}
Reverse: {
	:pattern=>"ctccag", 
	:matches=>[]
}

Where :start is the index at which the targeted pattern begins and :finish is the index just past its last nucleotide (:start plus the pattern length). With this information, scientists can go in and research specific areas around these markers.
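As a rough sketch of what looking "around these markers" could mean in code, the snippet below pulls out a match together with a few nucleotides on either side. The helper `flanking_region`, its `padding` parameter, and the shortened `sample` string are assumptions of mine, not part of the original script.

```ruby
# Hypothetical helper: returns the matched pattern plus up to `padding`
# nucleotides on each side, clamped to the ends of the strand.
def flanking_region(strand, match, padding = 5)
  from = [match[:start] - padding, 0].max
  to   = [match[:finish] + padding, strand.length].min
  strand[from...to]
end

# The opening stretch of STRAND, shortened to keep the example small.
sample = 'aggcgtatgcgatcctgaccatgcaaaactccagcgtaaatacc'
match  = { start: 28, finish: 34 } # first match reported above

puts flanking_region(sample, match) # caaaactccagcgtaa
```

Clamping the indexes keeps the helper safe for matches that sit right at the beginning or end of a strand.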

Although this algorithm might seem simple, imagine searching a DNA strand hundreds of thousands of nucleotides long by hand, or even the whole human genome, which contains about 3 billion base pairs! One worthwhile note: while this algorithm works for the strand above, it compares the pattern against a fresh slice at every index, so there might be performance challenges with longer strands and target patterns, which would then call for more advanced search techniques. However, I see the above as a good demonstration of how code and biology work together.
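As a small first step in that direction, here is a hedged sketch that hands the scanning to Ruby's built-in `String#index` rather than slicing the strand at every position. The name `search_strand_indexed` is hypothetical, and this is not one of the advanced techniques itself, just a variant that keeps the same return shape as search_strand.

```ruby
# Hypothetical variant of search_strand: same return shape, but the scan is
# delegated to String#index, which accepts a starting offset.
def search_strand_indexed(strand, pattern)
  matches  = []
  position = strand.index(pattern)

  while position
    matches << { start: position, finish: position + pattern.length }
    # Resume one character later so overlapping matches are still reported.
    position = strand.index(pattern, position + 1)
  end

  { pattern: pattern, matches: matches }
end

result = search_strand_indexed('aactccagttctccag', 'ctccag')
result[:matches].each do |m|
  puts "start=#{m[:start]} finish=#{m[:finish]}"
end
# Matches start at indexes 2 and 10.
```

I have not benchmarked this against the original, so treat any speed difference as something to measure rather than assume.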
