Genomics ML Library Speedrun - built from scratch for fun, inspired by @magicalbat

I was talking to a friend about this guy (Ian) who recently coded a machine learning library from scratch in C. It's pretty awesome, and I had to try it for myself.

The goal: take a C-based ML library someone built from scratch and extend it with Python tools specifically for genomics.

As always, done in one sitting, ~3.5 hours. Code can be found here. Only used NumPy.

To begin with, Genomics ML has a few quirks that general-purpose libraries don't handle well:

  1. Sequences aren't images: DNA is a 1D string over a 4-letter alphabet (A, C, G, T), so it needs a different encoding than pixels do.
  2. Motifs matter: Transcription factors bind specific patterns. CNNs for this maybe?
  3. Data is imbalanced: Most of the genome isn't anything interesting. I saw this with my Perturbation Speedrun where most cells had a delta of 0.
  4. Count data is weird: RNA-seq gives you read counts, and they're messy (noisy, overdispersed).

So, I'd need encoding utilities, 1D convolutions, attention for long-range dependencies across the genome, and biology-aware loss functions.

Sequence Encoding§

Feeding "ACGT" into a neural network; I'd take the standard approach where each nucleotide becomes a 4-dimensional vector:

$$
\text{A} \rightarrow [1, 0, 0, 0]
$$
$$
\text{C} \rightarrow [0, 1, 0, 0]
$$
$$
\text{G} \rightarrow [0, 0, 1, 0]
$$
$$
\text{T} \rightarrow [0, 0, 0, 1]
$$

A sequence of length $L$ becomes a matrix $X \in \mathbb{R}^{L \times 4}$.

import numpy as np

def one_hot_encode(sequence, base={"A": 0, "C": 1, "G": 2, "T": 3}):
    # One row per position, one column per nucleotide; unknown bases stay all-zero
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, nuc in enumerate(sequence.upper()):
        if nuc in base:
            encoding[i, base[nuc]] = 1.0
    return encoding

For ambiguous or non-ACGT bases (N, U, etc.), I either leave the row all zeros (the default above), use the uniform $[0.25, 0.25, 0.25, 0.25]$ to average it out, or raise an error. Not sure yet.

K-mer Frequencies§

For variable-length sequences, one-hot doesn't work directly. An alternative is to count k-mer frequencies, though position information is lost.

For $k=3$ (codons), there are $4^3 = 64$ possible 3-mers. Count occurrences, normalize, and I get a fixed-length feature vector regardless of sequence length.

$$
f_i = \frac{\text{count of k-mer } i}{\sum_j \text{count of k-mer } j}
$$
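
A quick sketch of how those counts could be computed (the helper name is mine, not necessarily the library's API):

import numpy as np
from itertools import product

def kmer_frequencies(sequence, k=3):
    # Enumerate all 4^k possible k-mers in a fixed order (AAA, AAC, ..., TTT)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers), dtype=np.float32)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows containing ambiguous bases
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts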

The Layers§

I need differentiable layers that work on sequences. Everything needs a forward() and backward() method.

Conv1D For Learning Motifs§

A 1D convolution slides a filter across the sequence, detecting patterns. If I use a kernel size of 8, the network can learn 8-bp motifs, which from what I've read is about the right size for transcription factor binding motifs.

The forward pass for a single filter:

$$
y_i = \sigma\left(\sum_{c=1}^{C_{in}} \sum_{k=1}^{K} W_{c,k} \cdot x_{c, i+k-1} + b\right)
$$

Where:

  • $x \in \mathbb{R}^{C_{in} \times L}$ is the input (4 channels for one-hot DNA)
  • $W \in \mathbb{R}^{C_{in} \times K}$ is the filter
  • $K$ is kernel size
  • $\sigma$ is activation (ReLU)

# Forward pass (simplified)
for i in range(out_len):
    start = i * stride
    patch = x[:, :, start:start + kernel_size]      # (batch, in_channels, K)
    for f in range(out_channels):
        # sum over channels and kernel positions, keeping the batch dimension
        output[:, f, i] = np.sum(patch * weights[f], axis=(1, 2)) + bias[f]
output = np.maximum(0, output)  # ReLU

Backward pass:

$$
\frac{\partial L}{\partial W_{c,k}} = \sum_i \frac{\partial L}{\partial y_i} \cdot x_{c, i+k-1}
$$

This still needs work :/
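
Roughly, in the same simplified style as the forward loop (reusing those variable names, with grad_output standing in for $\partial L / \partial y$; the naming is mine, not necessarily the library's):

# Backward pass for the weights (simplified; reuses the names from the forward loop)
grad_output = grad_output * (output > 0)           # gradient through the ReLU
weights_grad = np.zeros_like(weights)              # (out_channels, in_channels, K)
bias_grad = grad_output.sum(axis=(0, 2))           # (out_channels,)
for i in range(out_len):
    start = i * stride
    patch = x[:, :, start:start + kernel_size]     # (batch, in_channels, K)
    for f in range(out_channels):
        # dL/dW_f: upstream gradient at position i times the input patch, summed over the batch
        weights_grad[f] += np.sum(grad_output[:, f, i][:, None, None] * patch, axis=0)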

Position Weight Matrices (PWMs)§

A PWM is basically a learned motif: it tells you how likely each nucleotide is at each position of a binding site.

Given aligned binding sequences, compute:

$$
\text{PWM}_{i,j} = \log_2 \frac{f_{i,j}}{0.25}
$$

Where $f_{i,j}$ is the frequency of nucleotide $j$ at position $i$, and 0.25 is the background frequency (as mentioned when padding for ambiguous nucleotides).

To score a sequence against a PWM:

$$
S = \sum_{i=1}^{L} \text{PWM}_{i, x_i}
$$

Higher score = better match to the motif.
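
As a sketch (the helper names are mine), building a PWM from aligned binding sites and scoring a window against it could look like:

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def build_pwm(aligned_sites, pseudocount=1e-3):
    # aligned_sites: list of equal-length binding-site strings
    L = len(aligned_sites[0])
    counts = np.zeros((L, 4)) + pseudocount
    for site in aligned_sites:
        for i, nuc in enumerate(site.upper()):
            counts[i, BASES[nuc]] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs / 0.25)      # log-odds against the 0.25 uniform background

def pwm_score(pwm, window):
    # Sum the log-odds of the observed nucleotide at each position
    return sum(pwm[i, BASES[nuc]] for i, nuc in enumerate(window.upper()))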

I made this differentiable so it can learn PWMs from data. The MotifLayer starts with random filters and backprop turns them into meaningful motifs. From test runs, it recovers known transcription factor binding sites.

Attention For Long-Range§

Enhancers can sit thousands of base pairs away from the genes they regulate, so self-attention is a natural fit for those long-range dependencies.

Standard scaled dot-product attention:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

Where $Q$, $K$, $V$ are linear projections of the input:

$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$

# Forward
Q = np.dot(x, W_q)
K = np.dot(x, W_k)
V = np.dot(x, W_v)

scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
attention = softmax(scores)
context = np.matmul(attention, V)
output = np.dot(context, W_o)

The backward pass through softmax is weird:

$$
\frac{\partial \text{softmax}_{i}}{\partial z_{j}} = \text{softmax}_{i} \cdot (\delta_{ij} - \text{softmax}_{j})
$$
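
For reference, a numerically stable softmax plus the backward step that Jacobian implies (applied row-wise as a vector-Jacobian product; these helpers are my own sketch):

import numpy as np

def softmax(z, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_backward(grad_out, softmax_out, axis=-1):
    # Vector-Jacobian product: s * (grad - sum(grad * s)) along the softmax axis
    dot = np.sum(grad_out * softmax_out, axis=axis, keepdims=True)
    return softmax_out * (grad_out - dot)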

Need to get this right as well, but the goal is to visualize attention weights and see what the model is looking at. Just helps with interpretability tbh.

Loss Functions§

Binary Cross-Entropy§

For binary classification (promoter vs. not-promoter):

$$
L = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

Gradient:

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i(1 - \hat{y}_i)}
$$

Or when working with logits (pre-sigmoid), the gradient with respect to the logit simplifies to $\hat{y} - y$.
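
A minimal sketch of this loss/gradient pair in NumPy (clipping predictions so $\log(0)$ never happens; treat the signature as illustrative):

import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    # Clip so log() never sees exactly 0 or 1
    pred = np.clip(pred, eps, 1 - eps)
    loss = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    grad = (pred - target) / (pred * (1 - pred)) / len(pred)   # d(mean loss)/d(pred)
    return loss, grad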

Focal Loss§

99% of sequences aren't promoters. The model can achieve 99% accuracy by predicting "not promoter" every time.

Focal loss down-weights the easy examples:

$$
L = -\alpha_t (1 - p_t)^\gamma \log(p_t)
$$

Where $p_t$ is the predicted probability for the true class. When $\gamma = 2$, easy examples with $p_t \approx 1$ get very little weight, so hard examples dominate the gradient and the model can't coast on majority-class accuracy.
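
A sketch of the binary-case focal loss (my helper; $\alpha$ and $\gamma$ are hyperparameters):

import numpy as np

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    pred = np.clip(pred, eps, 1 - eps)
    # p_t is the predicted probability of the true class
    p_t = np.where(target == 1, pred, 1 - pred)
    alpha_t = np.where(target == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))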

Poisson Loss for RNA-seq Counts§

Gene expression is measured as read counts. These follow a Poisson-ish distribution:

$$
P(y | \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}
$$

Negative log-likelihood (dropping the constant $\log y!$ term):

$$
L = \lambda - y \log(\lambda)
$$

If the model outputs log-counts:

$$
L = e^{\hat{y}} - y \cdot \hat{y}
$$

Gradient:

$$
\frac{\partial L}{\partial \hat{y}} = e^{\hat{y}} - y
$$
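
In code, with the model outputting log-counts (a sketch; the helper name is mine):

import numpy as np

def poisson_loss(log_pred, target):
    # NLL of a Poisson with rate exp(log_pred), dropping the constant log(y!) term
    loss = np.mean(np.exp(log_pred) - target * log_pred)
    grad = (np.exp(log_pred) - target) / len(target)
    return loss, grad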

Still figuring out the overdispersion issue.


Other Utilities§

Reverse Complement§

DNA is double-stranded. If one strand is 5'-ACGT-3', the other is 3'-TGCA-5', which we read as 5'-ACGT-3'.

5'-AACG-3' → complement is TTGC → reverse is CGTT

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in seq[::-1])

Translation§

DNA → Protein via the genetic code. Every 3 nucleotides (codon) encodes one amino acid.

ATG → M (Methionine, start codon)
TGA → * (Stop)
AAA → K (Lysine)
...

64 codons, 20 amino acids + stop. I manually input the standard genetic code table.
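
A sketch with a heavily truncated codon table (the real table has all 64 entries; this is just for illustration, with unknown codons mapped to 'X'):

# Truncated for the post; the full CODON_TABLE has all 64 codons
CODON_TABLE = {
    "ATG": "M", "TGG": "W", "AAA": "K", "GAA": "E",
    "TTT": "F", "GGC": "G", "TAA": "*", "TAG": "*", "TGA": "*",
}

def translate(seq):
    protein = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3].upper()
        protein.append(CODON_TABLE.get(codon, "X"))  # 'X' = not in the truncated table
    return "".join(protein)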

ORF Finding§

Open Reading Frames: regions from start codon (ATG) to stop codon (TAA/TAG/TGA). Just gene finding.

def find_orfs(seq, min_length=100):
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in (0, 1, 2):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3].upper()
            if start is None and codon == "ATG":
                start = i                        # open an ORF at the first in-frame ATG
            elif start is not None and codon in stops:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3))  # start codon through stop codon
                start = None
    return orfs

Also search the reverse complement for genes on the other strand.

A minimal promoter classifier:

# Encode
X = batch_encode(sequences)  # (batch, length, 4)
X = X.transpose(0, 2, 1)     # (batch, 4, length) for conv

# Model
conv = Conv1D(in_channels=4, out_channels=32, kernel_size=8)
pool = GlobalMaxPool1D()
dense = Dense(32, 1, activation='sigmoid')

# Forward
x = conv(X)
x = pool(x)
pred = dense(x)

# Loss + backward
loss, grad = binary_cross_entropy(pred, labels)
grad = dense.backward(grad)
grad = pool.backward(grad)
grad = conv.backward(grad)

# Update
conv.weights -= lr * conv.weights_grad
dense.weights -= lr * dense.weights_grad

From what I've run, this is surprisingly fast for small models. I mean, this is NumPy only, so no GPU/CUDA, and it ran well on my fried ThinkPad.

Metrics§

  • AUROC: discrimination ability
  • AUPR: area under the precision-recall curve, more informative under class imbalance
  • F1: harmonic mean of precision and recall

For regression:

  • Pearson correlation: linear relationship
  • Spearman correlation: monotonic relationship (robust to outliers)

def auroc(predictions, targets):
    # Sort by descending score, sweep thresholds, integrate TPR vs FPR (ignores ties)
    order = np.argsort(-predictions)
    t = targets[order].astype(float)
    tpr = np.concatenate([[0.0], np.cumsum(t) / t.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - t) / (1 - t).sum()])
    return np.trapz(tpr, fpr)

What I'd Add Next§

  1. Batch normalization; stabilizes training, already implemented but not tested much
  2. GPU support; NumPy is fine for prototyping but doesn't scale

Final Thoughts§

This was pretty hard. For genomics specifically, the library handles:

  • One-hot and k-mer encoding
  • FASTA/FASTQ I/O with quality scores
  • Sequence operations (reverse complement, translation, alignment)
  • Differentiable layers designed for 1D sequences
  • Biology-aware losses (Poisson, negative binomial, focal)