de Bruijn sequence: Part 1

I was thinking about genetic sequences and the problem of assembly. I have followed Titus Brown‘s use of Bloom Filters. To learn more about Bloom Filters and their application to genetic sequence assembly check out Handling ridiculous amounts of data with probabilistic data structures by Titus Brown. The problem is both the amount of data and the number of combinations or sequence samples. I am interested in the most compact for to store all possible combinations of sequences of a given length. The solution is a de Bruijn sequence

Below are a few different functions I experimented with to produce a de Bruijn sequence.

In [2]:
import memory_profiler
import timeit
import string

def de_bruijn_1(k, n):
    """
    De Bruijn sequence for alphabet size k (0,1,2...k-1)
    and subsequences of length n.
    From wikipedia Sep 22 2013:
        Modified last line to return a string rather than a list
    """
    a = [0] * k * n
    sequence = []
    def db(t, p,):
        if t > n:
            if n % p == 0:
                for j in range(1, p + 1):
                    sequence.append(a[j])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(int(a[t - p]) + 1, k):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    #return sequence  #original
    return ''.join([str(i) for i in sequence])

################
def de_bruijn_strings(k, n):
    """
    De Bruijn sequence for alphabet size k (0,1,2...k-1)
    and subsequences of length n.
    Modifed wikipedia Sep 22 2013 to use strips
    """
    global sequence
    global a
    a = '0' * k * n
    sequence = ''
    def db(t, p):
        global sequence
        global a
        if t > n:
            if n % p == 0:
                for j in range(1, p + 1):
                    sequence = sequence + a[j]
        else:
            a = a[:t] + a[t - p]  + a[t+1:]
            db(t + 1, p)
            for j in range(int(a[t - p]) + 1, k):
                a = a[:t] + str(j)  + a[t+1:]
                db(t + 1, t)
        return sequence
    db(1, 1)
    return sequence

################
_mapping = bytearray(b"?")*256
_mapping[:10] = b"0123456789"

def de_bruijn_bytes(k, n):
    """
    By Peter Otten on python-list
    """
    a = k * n * bytearray([0])
    sequence = bytearray()
    extend = sequence.extend
    def db(t, p):
        if t > n:
            if n % p == 0:
                extend(a[1: p+1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    return sequence.translate(_mapping).decode("ascii")

d1 = de_bruijn_1(4, 9)
d2 = de_bruijn_strings(4, 9)
d3 = de_bruijn_bytes(4, 9)

print('They are the same?', d1==d2==d3)

# from de_bruijn import *
%memit de_bruijn_1(4, 10)
%memit de_bruijn_strings(4, 10)
%memit de_bruijn_bytes(4, 10)
('They are the same?', True)
peak memory: 43.86 MiB, increment: 17.18 MiB
peak memory: 28.51 MiB, increment: 0.53 MiB
peak memory: 33.89 MiB, increment: 5.03 MiB

As you can see above the memory requirments for the de_bruijn_strings and de_bruijn_bytes and significantly less. This is really the limiting factor in increasing the word size. On the machine I am testing on I think I can go up to de_bruijn_bytes(4, 17) before I fill the memory.

In [3]:
!python -m timeit -s 'from de_bruijn import de_bruijn_1' 'de_bruijn_1(4, 10)'
!python -m timeit -s 'from de_bruijn import de_bruijn_strings' 'de_bruijn_strings(4, 10)'
!python -m timeit -s 'from de_bruijn import de_bruijn_bytes' 'de_bruijn_bytes(4, 10)'
10 loops, best of 3: 585 msec per loop
10 loops, best of 3: 59.5 sec per loop
10 loops, best of 3: 168 msec per loop

Notice That although the de_bruijn_strings uses low memory it is really slow. de_bruijn_bytes is my far a better solution.

 

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.