Spatial clustering of CODEX spatial proteomics data#

This tutorial shows an example of spatial clustering of a CODEX dataset of mouse spleens from Goltsev et al., 2018 using CellCharter. The dataset is composed of 3 healthy samples (BALBc-1 to BALBc-3) and 6 samples from mice with the systmeic lupus erythematosus disease (MRL-4 to MRL-9) for a total of more than 700k cells assyed for around 30 protein markers.
We are going to use trVAE for dimensionality reduction and CellCharter to compute the spatial clusters of all samples together.

IMPORTANT: to decrease the amount of time required to run the notebook, we are running the analysis on the same dataset but with parameters different from the original publication. Hence, the results may slightly differ from the ones reported in the paper.

Let’s first load the packages that will be necessary for the analysis.

import squidpy as sq
import cellcharter as cc
import scanpy as sc
from lightning.pytorch import seed_everything
seed_everything(0)

Data preprocessing#

We can download the full dataset directly through CellCharter.

adata = cc.datasets.codex_mouse_spleen('./data/codex_mouse_spleen.h5ad')
adata

AnnData object with n_obs × n_vars = 707474 × 29
    obs: 'cell_type', 'i-niche', 'tile', 'area', 'dataset', 'stage', 'sample'
    uns: 'spatial', 'spatial_cluster_colors'
    obsm: 'blanks', 'spatial'

We are now going to scale the intensity values of each marker individually using scanpy’s pp.scale function.

First, we save the unscaled data in the raw version of our anndata object, so that we can retrieve it any time.
Then, each sample will be scaled indepenendently. However, scanpy does not implement this functionality, so we will do it manually.

adata.raw = adata.copy()
for sample in adata.obs['sample'].cat.categories:
    adata.X[adata.obs['sample'] == sample, :] = sc.pp.scale(adata[adata.obs['sample'] == sample], copy=True).X

Dimensionality reduction#

Given the limited number of markers in current spatial proteomics technologies, dimensionality reduction is not stricly necessary. However, we noticed that it can help in reducing noise and speed up the clustering step. This is crucial especially in the cluster stability analysis, that, to suggest the best candidates for the number of clusters, repeats the clustering multiple times.

We are going to use a trVAE model already pretrained on this dataset.
First we load the model. In map_location you can specify the device in which you want to load the model. This allows to load, for example, to load on CPU a model trained on a GPU. In fact, this model was trained on a GPU, but we are going to load it on the CPU.

model = cc.tl.TRVAE.load('./tutorial_models/codex_mouse_spleen_trvae', adata, map_location='cpu')

AnnData object with n_obs × n_vars = 707474 × 29
    obs: 'cell_type', 'i-niche', 'tile', 'area', 'dataset', 'stage', 'sample'
    uns: 'spatial', 'spatial_cluster_colors'
    obsm: 'blanks', 'spatial'

INITIALIZING NEW NETWORK..............
Encoder Architecture:
	Input Layer in, out and cond: 29 128 1
	Hidden Layer 1 in/out: 128 128
	Mean/Var Layer in/out: 128 10
Decoder Architecture:
	First Layer in, out and cond:  10 128 1
	Hidden Layer 1 in/out: 128 128
	Output Layer in/out:  128 29

Then we are going to extract the dimensionality-reduced embeddings for all cells of the dataset. We use the dataset labels, which are the same for all cells of this dataset, as condition, because there was no need to correct for batch effects.

adata.obsm['X_trVAE'] = model.get_latent(adata.X, adata.obs['dataset'])

If you are using a different dataset, you will have to train your own trVAE model. You can do it using cc.tl.TRVAE, a slightly modified version of the original trVAE model from scArches.

Apart from internal changes, the model has the same interface as the original one, so you can rely on the documentation of the original model for any further information.

CellCharter’s spatial clustering#

It’s now time to compute the spatial clusters.
CellCharter encodes all cells of each sample as a network. Every cell is a node and two cells are connected by an edge if they are physically close to each other in the tissue. We can obtain this network using squidpy’s gr.spatial_neighbors function.

sq.gr.spatial_neighbors(adata, library_key='sample', coord_type='generic', delaunay=True)

However, the Delaunay triangulation used by squidpy has the drawback of generating connections between distant cells.
We can see it by zooming in one area of the BALBc-3 sample.

sq.pl.spatial_scatter(
    adata, 
    color='sample', 
    library_key='sample', 
    img=None, 
    title=['BALBc-3'],
    size=2500,
    connectivity_key='spatial_connectivities',
    edges_width=0.3,
    legend_loc=None,
    crop_coord=(900000, 900000, 1700000, 1700000 ),
    library_id=['BALBc-3']
    )

../_images/6f6bbaff9257aa5bc365317873dc8719c182c56cd95da99800daec6bd851122d.png

One could decide a maximum radius and pass it as a parameter to the function, but it depends from dataset to dataset based on the spatial resolution of the data. We developed a method called gr.remove_longLinks that removes links that have a distance greater than a certain percentile (the default is the 99th percentile). We can see that it greatly reduces the long links without affecting the rest.

cc.gr.remove_long_links(adata)

sq.pl.spatial_scatter(
    adata, 
    color='sample', 
    library_key='sample', 
    img=None, 
    title=['BALBc-3'],
    size=2500,
    connectivity_key='spatial_connectivities',
    edges_width=0.3,
    legend_loc=None,
    crop_coord=(900000, 900000, 1700000, 1700000 ),
    library_id=['BALBc-3']
    )

../_images/95113322ca51e091df58a32d459725198860fc4f1c7816d3a3136467b166c4d1.png

The next step is the neighborhood aggregation. It consists of concatenating the features of every cell with the features aggregated from neighbors ad increasing layers from the considered cell, up to a certain layer n_layers. Aggregation functions are used to obtain a single feature vector from the vectors of multiple neighbors, with the default being the mean function.

In this case we use 3 layers of neighbors, so we obtain, for each cell, a feature vector of length 40. That is the cell’s reduced vector size from trVAE (10) plus the 3 aggregated vectors of length 10 each, from the 3 layers of neighbors.

cc.gr.aggregate_neighbors(adata, n_layers=3, use_rep='X_trVAE')

We create the Gaussian Mixture model, specifying 11 as desired number of clusters. This value was obtained from the cluster stability analysis. For more details, you can check our paper.

gmm = cc.tl.Cluster(
    n_clusters=11, 
    random_state=12345,
    # If running on GPU
    #trainer_params=dict(accelerator='gpu', devices=1)
)

Then we fit the model to the data and predict the cluster label of all cells.

gmm.fit(adata, use_rep='X_cellcharter')
adata.obs['spatial_cluster'] = gmm.predict(adata, use_rep='X_cellcharter')

Epoch 45: 100%|██████████| 1/1 [00:00<00:00,  1.12it/s, nll=21.30]
Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  2.57it/s]

We can visualize the spatial clusters for some samples. In this case, we chose a healthy sample, and samples from the early, intermediate and late stage.

sq.pl.spatial_scatter(
    adata, 
    color='spatial_cluster', 
    library_key='sample', 
    img=None, 
    title=['BALBc-1', 'MRL-4 (early)', 'MRL-8 (intermediate)', 'MRL-9 (late)'],
    size=2500,
    ncols=1,
    library_id=['BALBc-1', 'MRL-4', 'MRL-8', 'MRL-9'],
)

../_images/cceb49546f2d1110c5047aea4208b9b130bc14089859e854a7eafd406d8c369b.png

cc.gr.enrichment(adata, group_key='spatial_cluster', label_key='cell_type')
cc.pl.enrichment(adata, group_key='spatial_cluster', label_key='cell_type', figsize=(4,4), fontsize=6, dot_scale=1)

../_images/3157ba2632601f8eae15caa460665e68bb038ff86453fa4ef9744d8117b1d695.png

Proximity analysis#

The next step is to describe the relative arrangement of spatial clusters in the samples and observe whether the relative proximity between spatial clusters changes between the healthy and disease condition. We extract the condition labels from the sample labels.

adata.obs['condition'] = adata.obs['sample'].str.split('-').str[0].astype('category')

We compute select the samples associated with the healthy condition and compute the neighborhood enrichment between the spatial clusters.

adata_balbc = adata[adata.obs['condition'] == 'BALBc']
cc.gr.nhood_enrichment(
    adata_balbc,
    cluster_key='spatial_cluster',
)

cc.pl.nhood_enrichment(
    adata_balbc,
    cluster_key='spatial_cluster',
    annotate=True,
    vmin=-1,
    vmax=1,
    figsize=(3,3),
    fontsize=5,
)

../_images/1d8245e880f864369a4397172d23957c1c1bed42a5e8913445dfc65b7d256fcc.png

We repeat the analysis for the lupus condition.

adata_mrl = adata[adata.obs['condition'] == 'MRL']
cc.gr.nhood_enrichment(
    adata_mrl,
    cluster_key='spatial_cluster',
)

cc.pl.nhood_enrichment(
    adata_mrl,
    cluster_key='spatial_cluster',
    annotate=True,
    vmin=-1,
    vmax=1,
    figsize=(3,3),
    fontsize=5,
)

../_images/ebe953955a02f1ed553f61a3a50ecb16f5f907be0f2c650ebe8587ad840b2137.png

While it would be possible to compare the differences in neighborhood enrichment by placing one plot next to the other, it is more convenient to use a single plot that shows the difference between the two conditions (i.e. differential neighborhood enrichment).

This is what diff_nhood_enrichment does. Optionally, we also decided to compute pvalues for each pair of clusters and set a threshold for significance. This increases the computational time of the function because it requires performing permutations to compute the pvalues, in this case n_perms=100 permutations.

Luckily our implementation of an analytical version of the non-differential neighborhood enrichment that we just ran is very fast, otherwise we would have to run permutations of permutations. diff_nhood_enrichment also leverages multiprocessing, with the number of cores used being set by the n_jobs parameter.

cc.gr.diff_nhood_enrichment(
    adata,
    cluster_key='spatial_cluster',
    condition_key='condition',
    library_key='sample',
    pvalues=True,
    n_jobs=15,
    n_perms=100
)

cc.pl.diff_nhood_enrichment(
    adata,
    cluster_key='spatial_cluster',
    condition_key='condition',
    condition_groups=['MRL', 'BALBc'],
    annotate=True,
    figsize=(3,3),
    significance=0.05,
    fontsize=5
)

../_images/137bfd5a3832610b2d1a2217aea810548c8b126af056e87deb608c2bb259aacb.png

Similarly as in the paper, we can see that the B-follicles (C8 in pink) lose de adjacency with marginal zones (C7 in dark blue) in the lupus spleen.

The order in the parameter condition_groups controls which condition is the observed one and which is the expected. So, if we invert the order, we see exactly the opposite trend.

cc.pl.diff_nhood_enrichment(
    adata,
    cluster_key='spatial_cluster',
    condition_key='condition',
    condition_groups=['BALBc', 'MRL'],
    annotate=True,
    figsize=(3,3),
    significance=0.05,
    fontsize=5
)

../_images/f67c552b7a4a2cac6cf5efbbc1d520daf989ad246cfc26e25b7869e21f5464ab.png

Shape characterization#

Another way to characterize changes in the spatial architecture of tissue samples is by measuring changes in shape of spatial clusters that are in common between the two conditions.

Since a spatial cluster may be repeated in multiple areas of the same sample, we need to find individual local clusters by computing the connected components of the spatial network.

cc.gr.connected_components(adata, cluster_key='spatial_cluster')

Then, we compute the boundaries around each local cluster.

cc.tl.boundaries(adata, min_hole_area_ratio=0.1)

In CellCharter we implemented four metrics described in the paper: linearity, curl, elongation and purity. In this example we are going to compare only the first two.

cc.tl.linearity(adata)
cc.tl.curl(adata)

cc.pl.shape_metrics(
    adata,
    condition_key='condition', 
    condition_groups=['BALBc', 'MRL'],
    cluster_key='spatial_cluster', 
    cluster_id=[10], 
    title='C4 - B-PALS shape',
    figsize=(4,3),
    fontsize=6
)

../_images/e647aaca23618d0e1562339009f211a1bc6a85b94890dd54e51bb1db2a400ed0.png

As we can see, the spatial cluster that is located between the B-follicles and the PALS tends to become less organized. This is reflected in a decrease of the linear shapes that characterized it, but more importantly in an increase in curl.

Spatial clustering of CODEX spatial proteomics data

Contents