Single-cell paper notes (part 6): SpaGE: Spatial Gene Enhancement using scRNA-seq

Study notes, for reference only; corrections are welcome.
Reading status: skimmed

Contents

SpaGE: Spatial Gene Enhancement using scRNA-seq

ABSTRACT
INTRODUCTION
MATERIALS AND METHODS

SpaGE algorithm
Datasets

SpaGE: Spatial Gene Enhancement using scRNA-seq

ABSTRACT

Single-cell technologies are emerging fast due to their ability to unravel the heterogeneity of biological systems. While scRNA-seq is a powerful tool that measures whole-transcriptome expression of single cells, it lacks their spatial localization. Novel spatial transcriptomics methods do retain cells' spatial information, but some methods can only measure tens to hundreds of transcripts. To resolve this discrepancy, we developed SpaGE, a method that integrates spatial and scRNA-seq datasets to predict whole-transcriptome expression in its spatial configuration. Using five dataset-pairs, SpaGE outperformed previously published methods and showed scalability to large datasets. Moreover, SpaGE predicted new spatial gene patterns that are confirmed independently using in situ hybridization data from the Allen Mouse Brain Atlas.

INTRODUCTION

Single-cell technologies have developed rapidly over the last decade and have become valuable tools for enhancing our understanding of biological systems. Single-cell RNA sequencing (scRNA-seq) allows unbiased measurement of the entire gene expression profile of each individual cell and has become the de facto technology used to characterize the cellular composition of complex tissues (1,2). However, single cells often have to be dissociated before performing scRNA-seq, which results in the loss of spatial context and hence limits our understanding of cell identities and relationships. Recently, spatial transcriptomics technologies have advanced and now provide the localization of gene expression and cellular structure at the cellular level (3,4). Many current protocols can be divided into two categories:

imaging-based methods (e.g. osmFISH, MERFISH and seqFISH+) (5–7),
sequencing-based methods (e.g. STARmap and Slide-seq) (8,9).

Imaging-based protocols have a high gene detection sensitivity, capturing a high proportion of the mRNA molecules with a relatively small dropout rate. While seqFISH+ and the latest generation of MERFISH can measure up to ∼10,000 genes (7,10), many other imaging-based protocols are often limited in the number of genes that can be measured simultaneously.

On the other hand, although sequencing-based protocols like STARmap can scale up to thousands of genes, STARmap has a relatively lower gene detection sensitivity. Slide-seq is not limited in the number of measured genes and can be used to measure the whole transcriptome. However, similar to STARmap, Slide-seq suffers from a low gene detection sensitivity. In addition, osmFISH, MERFISH and STARmap can capture genes at single-molecule resolution, which can be averaged or aggregated to the single-cell level, whereas Slide-seq has a resolution of 10 μm, which is comparable to the average cell size but does not always represent a single cell.

Given the complementary information provided by both scRNA-seq and spatial transcriptomics data, integrating both types would provide a more complete overview of cell identities and interactions within complex tissues. This integration can be performed in two different ways (11):

dissociated single-cells measured with scRNA-seq can be mapped to their physical locations in the tissue (12–14),
missing gene expression measurements in the spatial data can be predicted from scRNA-seq.

In this study, we focus on the second challenge, in which the measured gene expression of spatial cells can be enhanced by predicting the expression of unmeasured genes based on scRNA-seq data of a matching tissue. Several methods have addressed this problem using various data integration approaches to account for the differences between the two data types (15–18). All these methods rely on joint dimensionality reduction to embed both spatial and scRNA-seq data into a common latent space. For example, Seurat uses canonical correlation analysis (CCA), Liger uses non-negative matrix factorization (NMF), and Harmony uses principal component analysis (PCA). While Seurat, Liger and Harmony rely on linear methods to embed the data, gimVI uses a non-linear deep generative model. Despite recent benchmarking efforts (19), a comprehensive evaluation of these methods for the task of spatial gene prediction from dissociated cells is currently lacking. For example, Seurat, Liger and gimVI have only been tested using relatively small datasets (<2,000 cells) (15,16,18). It is thus not clear whether a complex model, like gimVI, is really necessary. Moreover, Seurat, Harmony and gimVI lack interpretability of the integration procedure, so it is not clear which genes contribute to the prediction task.

Here, we present SpaGE (Spatial Gene Enhancement), a robust, scalable and interpretable machine-learning method to predict unmeasured genes of each cell in spatial transcriptomic data through integration with scRNA-seq data from the same tissue. SpaGE relies on domain adaptation using PRECISE (20) to correct for differences in sensitivity of transcript detection between both single-cell technologies, followed by a k-nearest-neighbor (kNN) prediction of new spatial gene expression. We demonstrate that SpaGE outperforms state-of-the-art methods by accurately predicting unmeasured gene expression profiles across a variety of spatial and scRNA-seq dataset pairs of different regions in the mouse brain. These datasets include a large spatial dataset with >60,000 cells, used to illustrate the scalability and computational efficiency of SpaGE compared to other methods.

MATERIALS AND METHODS

SpaGE algorithm

The SpaGE algorithm takes as input two gene expression matrices corresponding to the scRNA-seq data (reference) and the spatial transcriptomics data (query). Based on the set of shared genes between the two datasets, SpaGE enriches the spatial transcriptomics data using the scRNA-seq data, by predicting the expression of spatially unmeasured genes.
The SpaGE algorithm can be divided into two major steps:

alignment of the two datasets using the domain adaptation algorithm PRECISE (20),
gene expression prediction using k-nearest-neighbor regression.

(Details omitted.)
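As a rough illustration of the second step only, the following is a minimal sketch of k-nearest-neighbor regression in plain Python. It is not the SpaGE implementation: it assumes the two datasets have already been aligned (the PRECISE step is not shown), and the function names `cosine_distance` and `knn_predict`, the choice of cosine distance, the inverse-distance weighting and the default `k=3` are all assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an all-zero vector as uninformative
    return 1.0 - dot / (norm_u * norm_v)


def knn_predict(query_shared, ref_shared, ref_target, k=3):
    """Predict unmeasured genes for each spatial (query) cell.

    query_shared: query cells, each a list of values over the shared genes
    ref_shared:   reference cells over the same shared genes (same order)
    ref_target:   reference cells over the genes to be predicted
    """
    n_genes = len(ref_target[0])
    predictions = []
    for q in query_shared:
        # Rank all reference cells by their distance to this query cell.
        ranked = sorted((cosine_distance(q, r), i) for i, r in enumerate(ref_shared))
        neighbours = ranked[:k]
        # Assumed weighting: inverse distance; the epsilon avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in neighbours]
        total = sum(weights)
        # Weighted average of the neighbours' expression for each target gene.
        pred = [
            sum(w * ref_target[i][g] for w, (_, i) in zip(weights, neighbours)) / total
            for g in range(n_genes)
        ]
        predictions.append(pred)
    return predictions


if __name__ == "__main__":
    # Toy data: three reference cells, two shared genes, one gene to predict.
    ref_shared = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ref_target = [[10.0], [0.0], [5.0]]
    query_shared = [[1.0, 0.0]]
    print(knn_predict(query_shared, ref_shared, ref_target, k=2))
```

In this toy example the query cell matches the first reference cell exactly, so the prediction is dominated by that cell's value. In practice the neighbor search would run in the aligned latent space, not on raw shared-gene values.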

Datasets

We used six dataset pairs (Table 1) composed of four scRNA-seq datasets (AllenVISp (22), AllenSSp (23), Zeisel (24) and Moffit (4)) and four spatial transcriptomics datasets (STARmap (8), osmFISH (5), MERFISH (4) and seqFISH+ (7)). The AllenVISp (GSE115746) and the AllenSSp datasets were downloaded from https://portal.brain-map.org/atlases-and-data/rnaseq. The AllenVISp dataset was obtained from the ‘Cell Diversity in the Mouse Cortex 2018’ release. The AllenSSp dataset was obtained from the ‘Cell Diversity in the Mouse Cortex and Hippocampus’ release of October 2019. We downloaded the whole dataset and used the metadata to select only cells from the SSp region. The Zeisel dataset (GSE60361) was downloaded from http://linnarssonlab.org/cortex/, while the Moffit 10X dataset (GSE113576) was downloaded from GEO.

The STARmap dataset was downloaded from the STARmap resources website (https://www.starmapresources.com/data). We obtained the gene count matrix and the cell position information for the largest 1020-gene replicate. Cell locations and morphologies were identified using Python code provided by the original study (https://github.com/weallen/STARmap). The osmFISH dataset was downloaded as a loom file from http://linnarssonlab.org/osmFISH/; we obtained the gene count matrix and the metadata using the loompy Python package. The MERFISH dataset was downloaded from the Dryad repository (https://doi.org/10.5061/dryad.8t8s248); we used the first naïve female mouse (Animal ID = 1). The seqFISH+ dataset was obtained from the seqFISH-PLUS GitHub repository (https://github.com/CaiGroup/seqFISH-PLUS); we used the gene count matrix of the mouse cortex dataset.