Handling high-dimensional data is a non-trivial task in Exploratory Data Analysis. Numerous tools ranging from the famous Principal Component Analysis (PCA) to the Stochastic Neighbour Embedding (SNE) variants focus on embedding the high-dimensional data on a low-dimensional visualization, especially in a two-or three-dimensional representation. The vanilla Stochastic Neighbor Embedding converts the high-dimensional Euclidean distance from a point to its K-nearest neighbors into conditional probabilities that represent similarities. Stochastic Neighbour Embedding variants are developed for task-specific applications, including t-SNE, Hierarchical-SNE, RNA-velocity techniques, and UMAP.
Biological science often deals with high-dimensional velocity data such as embryo development, cell fluid transfer, cell-type transition, cell differentiation, and mRNA velocity. Recent work by Volker Bergen, et al., the scVelo has introduced a powerful velocity visualization in a transient cell system. However, the method relies on an intuitive approach rather than a solid mathematical approach. This method yields unreliable results for highly dynamic and unpredictable cell systems such as messenger-RNA (mRNA) velocity visualization.
Songting Shi of the Peking University, China, has developed a mathematically sound approach to visualize cell components’ velocity. This method is named as DSNE, the abbreviation for Directional Stochastic Neighbor Embedding. It can be viewed as a variant of the vanilla SNE proposed to handle velocity visualization problems such as cell differentiation and embryo development.
DSNE converts the high-dimensional Euclidean distance between the unit-length velocity and the unit-length direction from the point of interest to the nearest neighbor points into a conditional probability distribution. This conditional probability distribution represents the similarity between the velocity and the direction of cell particles.
Mathematically, the conditional probability can be expressed as:
Similarly, the conditional probability distribution for the dimensionally-reduced low-dimensional counterparts can be mathematically expressed as:
An entropy loss function (the Kullback-Leibler Divergence loss) is incorporated to measure the impurity between the high-dimensional representations and the low-dimensional representations. DSNE is optimized during training by reducing this loss function. The loss function can be expressed mathematically as:
Pancreas Visualization using DSNE
Pancreas is one of the sensitive and vital organs in the human body. Pancreas looks after food digestion and blood sugar regulation. Pancreas cells possess highly dynamic functionality inter- and intra-cells. Timely study of Pancreas leads to better medication in case of impairment. Pancreas cell velocity can be analyzed and visualized using DSNE’s python package.
- DSNE is now available as a PyPi package. We can simply pip install it.
!pip install dsne
- Create the development environment by importing the necessary modules
import numpy as np import scvelo as scv from scipy.sparse import issparse from dsne import DSNE, DSNE_approximate
- Configure the visualization settings to suit the application and download the built-in pancreas dataset.
scv.settings.verbosity = 3 # show errors(0), warnings(1), info(2), hints(3) scv.settings.presenter_view = True # set max width size for presenter view scv.settings.set_figure_params('scvelo') # for beautified visualization adata = scv.datasets.pancreas() scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000) scv.pp.moments(adata, n_pcs=30, n_neighbors=30) scv.tl.velocity(adata)
Output:
- Define a helper function to obtain the velocity components that are required to plot pancreas velocity.
def get_X_V_Y(adata,vkey="velocity", xkey="Ms", basis=None, gene_subset=None, ): subset = np.ones(adata.n_vars, bool) if gene_subset is not None: var_names_subset = adata.var_names.isin(gene_subset) subset &= var_names_subset if len(var_names_subset) > 0 else gene_subset elif f"{vkey}_genes" in adata.var.keys(): subset &= np.array(adata.var[f"{vkey}_genes"].values, dtype=bool) xkey = xkey if xkey in adata.layers.keys() else "spliced" basis = 'umap' if basis is None else basis X = np.array( adata.layers[xkey].A[:, subset] if issparse(adata.layers[xkey]) else adata.layers[xkey][:, subset] ) V = np.array( adata.layers[vkey].A[:, subset] if issparse(adata.layers[vkey]) else adata.layers[vkey][:, subset] ) # V -= np.nanmean(V, axis=1)[:, None] Y =np.array( adata.obsm[f"X_{basis}"] ) nans = np.isnan(np.sum(V, axis=0)) if np.any(nans): X = X[:, ~nans] V = V[:, ~nans] return X,V,Y
- Get the two-dimensional velocity embedding of high-dimensional pancreas velocity data with standard DSNE.
X,V,X_2d = get_X_V_Y(adata,vkey="velocity",xkey="Ms",basis="umap") V_2d = DSNE(X, V, Y=X_2d, perplexity=3.0, K=16, threshold_V=1e-8, separate_threshold=1e-8, max_iter=600, mom_switch_iter=250, momentum=0.5, final_momentum=0.8, eta=0.1, epsilon_kl=1e-16, epsilon_dsne=1e-16, seed=6, random_state=None, copy_data=False, with_norm=True, verbose=True)
- Plot the standard DSNE velocity data in a two-dimensional plot.
adata.obsm["X_DSNE"] = X_2d adata.obsm["V_DSNE"] = V_2d title ="DSNE" scv.pl.velocity_embedding_stream(adata, title=title+' stream', basis='umap',V=adata.obsm["V_DSNE"], smooth=0.5,density=2,) scv.pl.velocity_embedding_grid(adata, title=title+' grid' , basis='umap',V=adata.obsm["V_DSNE"], smooth=0.5,density=2,) scv.pl.velocity_embedding(adata, title=title+' embedding',basis='umap',V = adata.obsm["V_DSNE"])
Output:
- Plot the compute-efficient DSNE approximate version of the above plots.
title ="DSNE-approximate" V_2d = DSNE_approximate(X, V, Y=X_2d, perplexity=3.0, K=16, threshold_V=1e-8, separate_threshold=1e-8, seed=6, random_state=None, copy_data=False, with_norm=True, verbose=True) adata.obsm["X_DSNE_approximate"] = X_2d adata.obsm["V_DSNE_approximate"] = V_2d scv.pl.velocity_embedding_stream(adata, basis='umap',V=adata.obsm["V_DSNE_approximate"], title=title+' stream', smooth=0.5,density=2,) scv.pl.velocity_embedding_grid(adata, basis='umap',V=adata.obsm["V_DSNE_approximate"], title=title+' grid', smooth=0.5,density=2,) scv.pl.velocity_embedding(adata, basis='umap',V = adata.obsm["V_DSNE_approximate"], title=title+' embedding')
Output:
Comparison of DSNE with scVelo on Exact Simulation
- Import the necessary packages and data; and configure hyper-parameters.
import os import numpy as np import scvelo as scv from anndata import AnnData, read_h5ad from dsne import DSNE, DSNE_approximate N=500 D=300 d=2 K=16 perplexity =6 n_rep=1 exact = False with_norm = True basis = 'exact_embeddings' verbose = False
- Define helper functions for determining unit length and velocity accuracy.
def unitLength(V): V_ = V/np.sqrt(np.sum(V*V,axis=1,keepdims=True)) return V_ def velocity_accuracy(V, V_exact): V_unit = unitLength(V) V_exact_unit = unitLength(V_exact) accu = np.sum( V_unit* V_exact_unit )/(V.shape[0]*1.) return accu
- Define a function to preprocess and simulate the built-in data.
def simulate_data(N=50, D=3, d=2, save =True, file_name_prefix ="./data" ): if not os.path.exists(file_name_prefix): print("Directory: {} do not exist, create it! \n".format(os.path.abspath(file_name_prefix))) os.makedirs(os.path.abspath(file_name_prefix)) V_2d = np.random.randn(*(N * 3, d)) * 6 err_2d = np.random.randn(*(N * 3, d))*2 x_1 = np.asarray([0, ] * d) x_2 = np.asarray([50, ] * d) x_3 = np.asarray([160, ] * d) X_2d = np.zeros_like(V_2d) X_2d[0, :] = x_1 X_2d[N, :] = x_2 X_2d[N * 2, :] = x_3 for i in np.arange(N - 1): X_2d[i + 1, :] = X_2d[i, :] + V_2d[i, :] + err_2d[i,:] X_2d[i + N + 1, :] = X_2d[i + N, :] + V_2d[i + N, :] + err_2d[i + N, :] X_2d[i + N * 2 + 1, :] = X_2d[i + N * 2, :] + V_2d[i + N * 2, :] + err_2d[i + N * 2, :] y = np.asarray([0, ] * N + [1, ] * N + [2, ] * N) U = np.array(np.random.randn(*(d, D))) X = X_2d.__matmul__(U) V = V_2d.__matmul__(U) adata = AnnData(X=X, layers={"velocity": V},obs={"clusters": y}, obsm={"X_exact_embeddings":X_2d, "V_exact_embeddings":V_2d}) if save: file_name = file_name_prefix+"simulated_data_N_{}_D_{}.h5hd".format(N,D) adata.write_h5ad(file_name) return adata
- Process the data for simulation and prepare it for two-dimensional plotting.
adata = simulate_data(N=N,D=D,d=d,save=False) X = adata.X V = adata.layers["velocity"] X_basis = f"X_{basis}" X = np.asarray(X, dtype=np.float64) V = np.asarray(V, dtype=np.float64) Y = None if (X_basis in adata.obsm.keys()) and adata.obsm[X_basis] is not None: Y = adata.obsm[f"X_{basis}"] if Y is None: print("Do not get the low dimesnional embedding Y! \n") # raise Y = np.asarray(Y, dtype=np.float64)
- Plot the simulation with the recent model, scVelo and determine the accuracy.
adata_tmp = AnnData(X=X, obsm={"X_umap": Y}, layers={"velocity": V, "spliced": X}) scv.tl.velocity_graph(adata_tmp, xkey='spliced') scv.tools.velocity_embedding(adata_tmp, basis="umap") W = adata_tmp.obsm["velocity_umap"] vkey = "velocity_scvelo_original" str_exact = "exact" if exact else "approx" method = 'scvelo_velocity_original' adata.obsm[f"{vkey}_{str_exact}_{basis}"] = W W_exact = adata.obsm["V_exact_embeddings"] accu = velocity_accuracy(W, W_exact) print(f" {method}, {str_exact}, accu: {accu}\n") method_str = "scVelo" title = "{} on exact embeddings with accuracy {:5.3f}".format(method_str, accu) scv.pl.velocity_embedding(adata, basis=basis, V=W, title=title,density=2,) scv.pl.velocity_embedding_stream(adata, basis=basis, V=W, title=title,density=2,) scv.pl.velocity_embedding_grid(adata, basis=basis, V=W, title=title,)
Output:
- Plot the exact simulation visualization with DSNE-Approximate version and determine the accuracy.
W = DSNE_approximate(X, V, Y=Y, perplexity=perplexity, pca_d=None, threshold_V=1e-8, separate_threshold=1e-8, seed=16, random_state=None, copy_data=False, verbose=verbose) vkey = "velocity_scvelo" str_exact = "exact" if exact else "approx" method = "DSNE_approximate" adata.obsm[f"{vkey}_{str_exact}_{basis}"] = W W_exact = adata.obsm["V_exact_embeddings"] accu = velocity_accuracy(W, W_exact) print(f" {method}, {str_exact}, accu: {accu}\n") method_str = "DSNE-approximate" title = "{} on exact embeddings with accuracy {:5.3f}".format(method_str, accu) scv.pl.velocity_embedding(adata, basis=basis, V=W, title=title,density=2,) scv.pl.velocity_embedding_stream(adata, basis=basis, V=W, title=title,density=2,) scv.pl.velocity_embedding_grid(adata, basis=basis, V=W, title=title,)
Output:
- Plot the exact simulation visualization with standard DSNE and determine the accuracy.
W = DSNE(X, V, Y=Y, K= K, perplexity=perplexity, pca_d=None, threshold_V=1e-8, separate_threshold=1e-8, max_iter=1000, mom_switch_iter=250, momentum=0.5, final_momentum=0.8, eta=0.1, epsilon_kl=1e-16, epsilon_dsne=1e-16, with_norm=with_norm, seed=16, random_state=None, copy_data=True, verbose=verbose) vkey = "velocity_dsne" method = 'DSNE' str_exact = "exact" if exact else "approx" adata.obsm[f"{vkey}_{str_exact}_{basis}"] = W W_exact = adata.obsm["V_exact_embeddings"] accu = velocity_accuracy(W, W_exact) print(f" {method}, {str_exact}, accu: {accu}\n") method_str = "DSNE" title = "{} on exact embeddings with accuracy {:5.3f}".format(method_str, accu) scv.pl.velocity_embedding(adata, basis=basis, V=W, title=title,density=2,) scv.pl.velocity_embedding_stream(adata, basis=basis, V=W, title=title,density=2,) scv.pl.velocity_embedding_grid(adata, basis=basis, V=W, title=title,)
Output:
Find the Colab Notebook with these code implementations here.
Wrapping up
Directional Stochastic Neighbor Embedding (DSNE) outperforms the recent successful methods including scVelo in velocity visualizations. DSNE achieves extraordinary results in cell biology applications such as messenger RNA velocity analysis, embryo development, cell transition and cell differentiation with accuracies close to unity!
Further Reading: