Recently, I have been diving into deep learning for genomics. My go-to deep-learning framework, Keras, has a nice feature, fit_generator, that fetches mini-batches of data from an indefinite Python generator and trains the neural network incrementally on each mini-batch. The Python generator should yield a batch of features and labels on every next(generator) call. An example is as follows:

import numpy as np

class data_generator():
    '''
    Assumes the input data file has two tab-separated columns,
    where the first column is the feature and the second column is the label.
    The file is assumed to be non-empty.
    '''

    def __init__(self, tsv_file, batch_size = 500):
        self.X = []
        self.y = []
        self.batch_size = batch_size
        self.data_file = tsv_file
        self.sample_generator = open(self.data_file)

    def __iter__(self):
        # lets the object be used wherever an iterator is expected
        return self

    def __next__(self):
        # reset the batch
        self.X = []
        self.y = []

        # populate this batch with features and labels
        self.data_gen()

        # return the batch for keras model fit_generator
        return np.array(self.X), np.array(self.y)

    def data_gen(self):
        sample_count = 0
        while self.batch_size > sample_count:  # break loop when batch is filled
            try:
                line = next(self.sample_generator)
            except StopIteration:
                # reached the end of the file (finished one epoch): re-open it and keep reading
                self.sample_generator.close()
                self.sample_generator = open(self.data_file)
                continue
            # strip the newline so it does not end up in the label, then split the two columns
            feature, label = line.rstrip('\n').split('\t')
            self.X.append(feature)
            self.y.append(label)
            sample_count += 1
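
A quick way to sanity-check the batching and wrap-around behaviour without Keras is to run the same logic on a tiny file. The sketch below is a condensed stand-in for the class above, written as a plain generator function; the name tsv_batches and the five-line demo file are made up for illustration, and the file is assumed to be non-empty.

```python
import os
import tempfile


def tsv_batches(path, batch_size):
    """Yield (features, labels) lists forever, re-reading the file at each epoch end."""
    features, labels = [], []
    while True:
        with open(path) as handle:
            for line in handle:
                # drop the trailing newline so it does not end up in the label
                feature, label = line.rstrip("\n").split("\t")
                features.append(feature)
                labels.append(label)
                if len(features) == batch_size:
                    yield features, labels
                    features, labels = [], []


# Write a tiny, made-up five-line TSV file to try the generator on.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    for i in range(5):
        tmp.write(f"seq{i}\t{i % 2}\n")
    demo_path = tmp.name

batches = tsv_batches(demo_path, batch_size=3)
first_X, first_y = next(batches)
second_X, second_y = next(batches)  # this batch wraps around the end of the file
batches.close()  # closes the open file handle held by the generator
os.remove(demo_path)

print(first_X, first_y)
print(second_X, second_y)
```

With five lines and a batch size of three, the second batch has to borrow its last sample from the start of the file, which is the epoch wrap-around that the StopIteration branch handles in the class.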

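However, one drawback of this generator is that batches are created sequentially from the data file, so the training samples are not shuffled. To introduce some randomness into the mini-batches, we can add an if random.random() > 0.5: check before putting a sample into the batch. Each line is then kept with a probability of about 0.5, so the composition of the batches changes from one pass over the file to the next, although the samples still appear in file order: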

import random  # needed at module level

def data_gen(self):
    sample_count = 0
    while self.batch_size > sample_count:  # break loop when batch is filled
        try:
            line = next(self.sample_generator)
        except StopIteration:
            # reached the end of the file: re-open it and keep reading
            self.sample_generator.close()
            self.sample_generator = open(self.data_file)
            continue
        if random.random() > 0.5:  ### added randomness: keep each line with probability ~0.5 ###
            # strip the newline, then extract feature and label from the two columns
            feature, label = line.rstrip('\n').split('\t')
            self.X.append(feature)
            self.y.append(label)
            sample_count += 1

The builtin random module in Python is nice enough for generating a number between 0 and 1, but it can be a bit slow. So in this post, I will show a Cython implementation of a random float generator between 0 and 1 and see how much speed-up we can get.

Below is the Cython random function:

%%cython

from libc.stdlib cimport rand, RAND_MAX

cpdef double cy_random():
    # cast to double so the division is floating-point regardless of language_level
    return rand() / <double>RAND_MAX

Let's check whether the results are similar by looking at the distribution of 10000 random numbers:

import random
import matplotlib.pyplot as plt
import seaborn as sns

ax = plt.subplot(111)
sns.distplot([random.random() for i in range(10000)], ax = ax, label='builtin')
sns.distplot([cy_random() for i in range(10000)], ax = ax, label = 'Cython')
ax.legend(fontsize=15, bbox_to_anchor = (1,0.5))
ax.set_xlabel('Random number', fontsize = 15)
ax.set_ylabel('Density', fontsize=15)
sns.despine()
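
The plot is a visual check; a rough numeric summary can complement it. The sketch below is only an illustration: the helper name uniformity_summary and the choice of 10 bins are assumptions, not part of the original analysis. It reports the sample mean (about 0.5 for a uniform sampler) and the fraction of draws in each equal-width bin (about 0.1 each). It is shown with random.random, but any zero-argument sampler such as cy_random could be passed in instead.

```python
import random


def uniformity_summary(sampler, n=10000, bins=10):
    """Return the sample mean and the fraction of draws falling in each equal-width bin."""
    counts = [0] * bins
    total = 0.0
    for _ in range(n):
        value = sampler()
        total += value
        # min() keeps a draw of exactly 1.0 inside the last bin
        counts[min(int(value * bins), bins - 1)] += 1
    return total / n, [c / n for c in counts]


random.seed(0)  # fixed seed only so the check is repeatable
mean, fractions = uniformity_summary(random.random)
print(f"mean = {mean:.3f}")
print("bin fractions:", [round(f, 3) for f in fractions])
```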

That looks similar enough: both of them are more or less uniform over [0,1]. Now, let's see how long each of them takes to run:

%timeit random.random()

108 ns ± 7.83 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit cy_random()

54.2 ns ± 2.73 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

The Cython version yielded a roughly 2X speed-up (54.2 ns versus 108 ns per call).
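
Outside of IPython, where %timeit is not available, a similar measurement can be sketched with the standard-library timeit module. The helper name ns_per_call and the loop counts below are arbitrary choices for illustration. Passing a callable to timeit adds one extra layer of call overhead, so the absolute numbers will probably not match the %timeit figures above; the comparison between two samplers measured the same way is what matters.

```python
import random
import timeit


def ns_per_call(func, number=200_000, repeat=5):
    """Best-of-`repeat` wall time per call of `func`, in nanoseconds."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number * 1e9


builtin_ns = ns_per_call(random.random)
print(f"random.random(): {builtin_ns:.1f} ns per call")
```

A compiled cy_random could be timed the same way with ns_per_call(cy_random), assuming it has been built and imported.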