A Better UMAP

Just an idea, working on it in my spare time. Code can be found here. ~3 hours' worth of work and still in progress, but I consider it a success so far.

I did this because of how unreliable UMAPs are. When you take very high-dimensional data like genomes and compress it into a low-dimensional embedding, you're bound to lose some nuance. But with UMAP it's worse than it needs to be, tbh.

Chari & Pachter (2023) demonstrated that you can embed single-cell data into the shape of an elephant and preserve distances just as well as UMAP does.

As Jeff Leek says: "plots should communicate findings, not decorate." Also check out Nikolay Oskolkov's article on Medium.

The Core Failures§

| Problem | Evidence |
|---|---|
| Distance distortion | 4-200x inflation of max/min distance ratios |
| Local structure loss | 70% of nearest neighbors lost (Jaccard distance > 0.7) |
| Global structure loss | Cell-type ranking correlations ≤ 0.4 |
| Theoretical limit | Johnson-Lindenstrauss: need ~1,842 dims for 20% error on 10k points |
| Artificial clustering | A single Gaussian → multiple fake clusters |
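
A quick check on the Johnson-Lindenstrauss row: assuming the simplified bound $k \geq 8\ln(n)/\varepsilon^2$ is the one being used, the figure works out:

$$ k \geq \frac{8\ln(10{,}000)}{0.2^2} \approx \frac{8 \times 9.21}{0.04} \approx 1842 $$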

I guess this wouldn't be that big of a deal if UMAPs told you all that they destroyed, but they don't.

HiDRA§

Hierarchical Distance-Regularized Approximation

Four key innovations:

1. Local Structure Preservation (t-SNE-style KL Divergence)§

I used a perplexity-based affinity matrix with KL divergence + exaggeration in early iterations, similar to t-SNE.

High-dimensional affinities via perplexity-calibrated Gaussians:

$$ p_{j|i} = \frac{\exp(-|x_i - x_j|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-|x_i - x_k|^2 / 2\sigma_i^2)} $$

Where $\sigma_i$ is found via binary search to match target entropy $H(P_i) = \log(\text{perplexity})$.
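
A minimal sketch of that binary search (the $\beta = 1/(2\sigma_i^2)$ parameterization, tolerance, and iteration cap are my assumptions; the function name matches the pseudocode below):

import numpy as np

def compute_affinities_with_perplexity(D, perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i per row so that H(P_i) = log(perplexity).

    D is the (n, n) matrix of squared pairwise distances; returns the
    conditional affinities p_{j|i} (zero diagonal, each row sums to 1).
    """
    n = D.shape[0]
    target = np.log(perplexity)
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi = 0.0, np.inf          # bounds on beta = 1 / (2 * sigma_i^2)
        beta = 1.0
        d_i = np.delete(D[i], i)      # drop the self-distance
        for _ in range(max_iter):
            w = np.exp(-d_i * beta)
            p = w / max(w.sum(), 1e-12)
            H = -np.sum(p * np.log(np.maximum(p, 1e-12)))  # Shannon entropy
            if abs(H - target) < tol:
                break
            if H > target:            # entropy too high -> sharpen (raise beta)
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (beta + hi) / 2
            else:                     # entropy too low -> flatten (lower beta)
                hi = beta
                beta = (beta + lo) / 2
        P[i, np.arange(n) != i] = p
    return P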

Symmetrized joint probabilities:

$$ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} $$

Low-dimensional affinities:

$$ q_{ij} = \frac{(1 + |y_i - y_j|^2)^{-1}}{\sum_{k \neq l} (1 + |y_k - y_l|^2)^{-1}} $$

KL divergence loss:

$$ \mathcal{L}_{\text{KL}} = \text{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

Gradient:

$$ \frac{\partial \mathcal{L}_{\text{KL}}}{\partial y_i} = 4 \sum_{j} (p_{ij} \cdot \alpha - q_{ij}) \cdot (1 + |y_i - y_j|^2)^{-1} \cdot (y_i - y_j) $$

Where $\alpha$ is the early exaggeration factor (12.0 during early phase, 1.0 otherwise).
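
Both $q_{ij}$ and this gradient vectorize cleanly in NumPy; a sketch (`kl_gradient` is my name for the helper the pseudocode below calls):

import numpy as np

def kl_gradient(P, Y, alpha=1.0):
    """Gradient of KL(P || Q) w.r.t. the embedding Y, with exaggeration alpha."""
    # Student-t kernel weights: w_ij = (1 + ||y_i - y_j||^2)^-1, with w_ii = 0
    diff = Y[:, None, :] - Y[None, :, :]          # (n, n, d): y_i - y_j
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                               # low-dimensional affinities q_ij

    # grad_i = 4 * sum_j (alpha * p_ij - q_ij) * w_ij * (y_i - y_j)
    coef = 4.0 * (alpha * P - Q) * W
    return np.einsum('ij,ijk->ik', coef, diff)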

2. Global Distance Preservation§

Instead of categorical near/mid/far pairs, I used stratified sampling across the full distance spectrum with a scale-invariant normalized stress loss.

Pair sampling: Select $|P| = \min(3000, 10n)$ pairs uniformly stratified across the high-dimensional distance distribution.
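
A sketch of what that stratified sampling could look like (equal-width distance bins over oversampled candidate pairs is my interpretation of "uniformly stratified"):

import numpy as np

def sample_distance_pairs(X, n_pairs, n_strata=10, rng=None):
    """Sample point pairs stratified across the distance distribution."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # Oversample random candidate pairs, then stratify by their distances
    cand = rng.integers(0, n, size=(n_pairs * 5, 2))
    cand = cand[cand[:, 0] != cand[:, 1]]
    d = np.linalg.norm(X[cand[:, 0]] - X[cand[:, 1]], axis=1)

    # Equal-width bins so near, mid, and far pairs are all represented
    edges = np.linspace(d.min(), d.max(), n_strata + 1)
    keep = []
    for k in range(n_strata):
        in_bin = np.where((d >= edges[k]) & (d <= edges[k + 1]))[0]
        if len(in_bin):
            keep.append(rng.choice(in_bin, size=min(n_pairs // n_strata, len(in_bin)),
                                   replace=False))
    keep = np.concatenate(keep)
    return cand[keep], d[keep]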

Normalized stress loss:

$$ \mathcal{L}_{\text{dist}} = \frac{1}{|P|} \sum_{(i,j) \in P} \left( \frac{\hat{d}_{ij}}{\bar{\hat{d}}} - \frac{d_{ij}}{\bar{d}} \right)^2 $$

Where:

  • $d_{ij} = |x_i - x_j|$ is the original distance
  • $\hat{d}_{ij} = |y_i - y_j|$ is the embedded distance
  • $\bar{d}, \bar{\hat{d}}$ are the mean distances (making the loss scale-invariant)

Gradient (with stability):

$$ \frac{\partial \mathcal{L}_{\text{dist}}}{\partial y_i} = \sum_{j:(i,j) \in P} \text{clip}\left( \frac{2(\hat{d}_{ij}/\bar{\hat{d}} - d_{ij}/\bar{d})}{\hat{d}_{ij} \cdot \bar{\hat{d}}}, -0.1, 0.1 \right) \cdot (y_i - y_j) $$
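
In code, the clipped per-pair coefficient might look like this (a sketch; `pairs` and `hd_dists` come from the sampler above):

import numpy as np

def stress_gradient(Y, pairs, hd_dists, clip_val=0.1):
    """Gradient of the normalized stress over the sampled pairs."""
    i, j = pairs[:, 0], pairs[:, 1]
    diff = Y[i] - Y[j]
    ld = np.maximum(np.linalg.norm(diff, axis=1), 1e-12)  # embedded distances d_hat
    ld_bar, hd_bar = ld.mean(), hd_dists.mean()

    # Clipped per-pair coefficient, per the formula above
    coef = np.clip(2.0 * (ld / ld_bar - hd_dists / hd_bar) / (ld * ld_bar),
                   -clip_val, clip_val)

    grad = np.zeros_like(Y)
    np.add.at(grad, i, coef[:, None] * diff)   # push/pull on y_i
    np.add.at(grad, j, -coef[:, None] * diff)  # equal and opposite on y_j
    return grad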

3. Two-Phase Optimization§

HiDRA uses a two-phase approach with momentum-based updates and adaptive per-parameter learning rates.

Total loss:

$$ \mathcal{L} = \mathcal{L}_{\text{KL}}(\alpha) + \lambda(t) \cdot \mathcal{L}_{\text{dist}} $$

Phase schedule:

| Phase | Iterations | Exaggeration $\alpha$ | Distance weight $\lambda$ | Momentum |
|---|---|---|---|---|
| Early Exaggeration | $0$ to $\min(250, T/4)$ | 12.0 | 0.0 | 0.5 |
| Main | remaining | 1.0 | $\lambda_0 \cdot \min(1, 2 \cdot \text{progress})$ | 0.8 |

Where $\text{progress} = (t - t_{\text{exag}}) / (T - t_{\text{exag}})$ and $\lambda_0 = 0.1$ by default.

Adaptive gains (delta-bar-delta rule):

$$ g_i^{(t+1)} = \begin{cases} g_i^{(t)} + 0.2 & \text{if } \text{sign}(\nabla_i) \neq \text{sign}(v_i) \\ 0.8 \cdot g_i^{(t)} & \text{otherwise} \end{cases}, \quad g_i \in [0.01, 10] $$

Momentum update:

$$ v^{(t+1)} = \mu \cdot v^{(t)} - \eta \cdot g^{(t+1)} \odot \nabla \mathcal{L} $$

$$ Y^{(t+1)} = Y^{(t)} + v^{(t+1)} $$

Where $\eta = \max(n / \alpha / 4, 50)$ when the learning rate is set to 'auto' ($\alpha$ here being the early exaggeration factor).

4. Uncertainty Quantification§

Run $B$ bootstrap embeddings, compute positional variance:

$$ \sigma_i^2 = \text{Var}_{b=1}^{B}\left[ y_i^{(b)} \right] = \sum_{d=1}^{D} \text{Var}_{b}\left[ y_{i,d}^{(b)} \right] $$

Where $y_i^{(b)}$ is the position of point $i$ in bootstrap sample $b$ (using NaN for points not sampled).

Points with high $\sigma_i^2$ have unreliable positions: their embedding location varies substantially across bootstrap resamples, which helps the user identify regions not to over-interpret.
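
A sketch of that bootstrap loop (the resampling scheme and NaN handling follow the description above; `n_boot` and the alignment caveat are my additions):

import numpy as np

def bootstrap_uncertainty(X, n_boot=20, **hidra_kwargs):
    """Positional variance of each point across bootstrap embeddings."""
    n = X.shape[0]
    d = hidra_kwargs.get('n_components', 2)
    positions = np.full((n_boot, n, d), np.nan)
    for b in range(n_boot):
        idx = np.random.choice(n, size=n, replace=True)    # bootstrap resample
        positions[b, idx] = HiDRA(X[idx], **hidra_kwargs)  # NaN where not sampled
    # sigma_i^2 = sum over dims of the variance across bootstraps
    # (in practice embeddings may need alignment, e.g. Procrustes, before comparing)
    return np.nansum(np.nanvar(positions, axis=0), axis=1)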

Pseudocode§

(The helpers here are the sketches above - `compute_affinities_with_perplexity`, `sample_distance_pairs`, `kl_gradient`, `stress_gradient` - plus an assumed `pairwise_squared_distances`, NumPy as `np`, and scikit-learn's `PCA`.)

def HiDRA(X, n_components=2, n_iter=1000, perplexity=30.0,
          early_exaggeration=12.0, distance_weight=0.1, init='random'):

    n = X.shape[0]

    # 1. Compute joint probability matrix P (perplexity-based affinity)
    D = pairwise_squared_distances(X)
    P = compute_affinities_with_perplexity(D, perplexity)  # binary search for sigma_i
    P = (P + P.T) / (2 * n)  # symmetrize to joint probabilities p_ij

    # 2. Sample stratified distance pairs for global structure
    pairs, hd_dists = sample_distance_pairs(X, n_pairs=min(3000, n * 10))

    # 3. Initialize embedding
    if init == 'pca':
        Y = PCA(n_components).fit_transform(X) * 1e-4
    else:
        Y = np.random.randn(n, n_components) * 1e-4

    # 4. Set up optimization state
    lr = max(n / early_exaggeration / 4, 50)  # eta = max(n / alpha / 4, 50)
    velocity = np.zeros_like(Y)
    gains = np.ones_like(Y)
    exag_iters = min(250, n_iter // 4)

    # 5. Two-phase optimization
    for t in range(n_iter):
        if t < exag_iters:
            # Early exaggeration phase: local structure only
            exag = early_exaggeration
            dist_w = 0.0
            momentum = 0.5
        else:
            # Main phase: balance local + global, ramping in the distance term
            exag = 1.0
            progress = (t - exag_iters) / (n_iter - exag_iters)
            dist_w = distance_weight * min(1.0, progress * 2)
            momentum = 0.8

        # t-SNE gradient: KL divergence on the Student-t kernel
        grad_tsne = kl_gradient(P, Y, alpha=exag)

        # Distance-preservation gradient (normalized stress), clipped for
        # stability and weighted by dist_w per the phase schedule
        grad_dist = dist_w * stress_gradient(Y, pairs, hd_dists, clip_val=0.1)

        grad = grad_tsne + grad_dist

        # Adaptive gains (delta-bar-delta rule)
        gains = np.where(np.sign(grad) != np.sign(velocity), gains + 0.2, gains * 0.8)
        gains = np.clip(gains, 0.01, 10)

        # Momentum update
        velocity = momentum * velocity - lr * gains * grad
        Y = Y + velocity
        Y = Y - Y.mean(axis=0)  # re-center the embedding

    return Y

Evaluation§

Compare HiDRA against PCA, t-SNE, and UMAP on:

| Metric | Measures |
|---|---|
| k-NN Recall | Local structure: $\lvert N_k(y) \cap N_k(x) \rvert / k$ |
| Spearman $\rho$ | Global structure: rank correlation of distances |
| Distortion Ratio | $(\max \hat{d}/d) / (\min \hat{d}/d)$ |
| Trustworthiness | False neighbors in the embedding |
| Continuity | True neighbors missing from the embedding |

Test datasets: PBMC 3k, Mammoth (3D structure), Swiss Roll, synthetic Gaussians.
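
For the first two metrics, a sketch with scikit-learn and SciPy (the pair-subsampling shortcut for $\rho$ is mine):

import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import NearestNeighbors

def knn_recall(X, Y, k=10):
    """Fraction of each point's k nearest neighbors preserved in the embedding."""
    nn_x = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)[:, 1:]
    nn_y = NearestNeighbors(n_neighbors=k + 1).fit(Y).kneighbors(Y, return_distance=False)[:, 1:]
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_x, nn_y)])

def distance_spearman(X, Y, n_pairs=5000, seed=0):
    """Rank correlation between original and embedded pairwise distances."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    mask = i != j
    d_x = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
    d_y = np.linalg.norm(Y[i[mask]] - Y[j[mask]], axis=1)
    return spearmanr(d_x, d_y).correlation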

Results§

Figures: swiss_roll_comparison.png, blobs_comparison.png, gaussian_comparison.png

| Dataset | Method | kNN Recall ↑ | Spearman ρ ↑ | Distortion ↓ |
|---|---|---|---|---|
| Swiss Roll | PCA | 0.468 | 0.834 | 229.9 |
| Swiss Roll | t-SNE | 0.855 | 0.335 | 30.4 |
| Swiss Roll | UMAP | 0.760 | 0.162 | 162.1 |
| Swiss Roll | HiDRA | 0.811 | 0.692 | 25.2 |
| S-Curve | PCA | 0.318 | 0.964 | 46.3 |
| S-Curve | t-SNE | 0.833 | 0.898 | 17.3 |
| S-Curve | UMAP | 0.762 | 0.880 | 91.1 |
| S-Curve | HiDRA | 0.802 | 0.903 | 29.5 |
| Blobs 50D | PCA | 0.110 | 0.888 | 89.8 |
| Blobs 50D | t-SNE | 0.278 | 0.653 | 206.8 |
| Blobs 50D | UMAP | 0.189 | 0.578 | 479.5 |
| Blobs 50D | HiDRA | 0.236 | 0.527 | 112.8 |
| Gaussian 50D | PCA | 0.041 | 0.239 | 141.0 |
| Gaussian 50D | t-SNE | 0.177 | 0.263 | 428.2 |
| Gaussian 50D | UMAP | 0.127 | 0.080 | 83.3 |
| Gaussian 50D | HiDRA | 0.141 | 0.254 | 100.3 |
  • t-SNE: Best kNN recall (local structure) on all datasets
  • PCA: Best Spearman ρ (global structure) on every dataset except Gaussian 50D
  • HiDRA: Best balance - beats UMAP on nearly all metrics, and stays competitive with t-SNE on kNN recall while preserving much better global structure (ρ) on the high-dimensional datasets
  • UMAP: Generally underperforms on these benchmarks

Note: these results are on synthetic scikit-learn data - I have yet to test on real datasets:

from sklearn.datasets import make_swiss_roll
X_swiss, y_swiss = make_swiss_roll(n_samples=1500, noise=0.5, random_state=42)
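
A run on that dataset might then look like this (using the `HiDRA` pseudocode and the metric sketches above):

Y_emb = HiDRA(X_swiss, n_components=2, n_iter=1000)
print(f"kNN recall:   {knn_recall(X_swiss, Y_emb):.3f}")
print(f"Spearman rho: {distance_spearman(X_swiss, Y_emb):.3f}")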

The first iteration did not yield good results: kNN recall was very low, meaning global context was being prioritized. I switched from kNN graphs to a perplexity-based affinity matrix, used PCA instead of geodesic MDS, and simplified the loss weighting into just two phases (exaggeration + dist_w). Instead of the original categorical losses (near/mid/far), I adapted the KL divergence from t-SNE and optimized with momentum instead of plain gradient descent. Computation was also very slow because of the gradient loops (Numba-related, I believe). I went back and changed the distance regularization (the novel aspect of this), which was originally breaking things. Regarding the algorithm, I:

  1. Fixed gradient computation - the original had a broken kernel normalization that prevented optimization from working
  2. Cleaned up the t-SNE-style optimization
  3. Implemented proportional distance regularization - added global distance preservation whose gradient scales proportionally with distance and weight, and is scale-invariant

I mean, PCA has been the gold standard for a while, especially for population genomics, and it does perform the best here Spearman-wise. I think context is important when deciding which to use (PCA vs UMAP), but inter-cluster distances aren't really accurate in UMAPs, which is what HiDRA is trying to solve with multiphase optimization and uncertainty quantification.

The benchmarks show that HiDRA beats standard UMAP on pretty much every metric, though, which was the goal.