Single-cell paper notes (part 5): A joint model of unpaired data from scRNA-seq and ST for imputing missing gene ...

Study notes, for reference only; corrections are welcome.
Reading status: skimmed

Contents

A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements

Abstract
The gimVI probabilistic model
Posterior inference
Performance benchmarks

Integrating cells into a joint latent space
Imputing missing genes

A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements

Abstract

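Spatial studies of the transcriptome give biologists gene expression maps of heterogeneous and complex tissues, such as the mouse nervous system (Zeisel et al., 2018). A range of experimental protocols now exists, based on single-molecule fluorescence in situ hybridization (smFISH) (Shah et al., 2016; Codeluppi et al., 2018), combined imaging and sequencing (starMAP) (Wang et al., 2018), or high-resolution in situ RNA sequencing such as Slide-seq (Rodriques et al., 2019). The first two techniques are highly sensitive (Codeluppi et al., 2018) but are limited by the need to choose in advance a small subset of genes to quantify from the whole transcriptome (from about 50 for smFISH to a few hundred for starMAP). The third technique is not restricted to a predefined gene subset and can in principle capture any gene (among the thousands in the transcriptome). However, the number of genes effectively captured in Slide-seq measurements is much lower than with standard (i.e., non-spatial) single-cell RNA sequencing (scRNA-seq), which is also more widespread and easier to implement (Klein et al., 2015; Saunders et al., 2018).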

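An important research problem is therefore to combine spatial transcriptomics data with scRNA-seq measurements in order to transfer cell type annotations from scRNA-seq to spatial spots (Welch et al., 2018; Zhu et al., 2018), impute genes missing from the spatial assay (Stuart et al., 2018), or approximate the physical location within a tissue of cells measured with standard scRNA-seq (Satija et al., 2015). State-of-the-art methods mainly embed the spatial and standard datasets into a latent space using matrix factorization techniques (Liger and Seurat Anchors) (Welch et al., 2018; Stuart et al., 2018), then correct the embedding with mutual nearest neighbors (Haghverdi et al., 2018) or quantile normalization. However, such approaches are more likely to overlap samples that show little biological resemblance, because:

the embedding step relies on a linear model of the data, although nothing justifies the linearity assumption;
the alignment is performed in an ad hoc manner.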

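In this work, we focus on imputing genes missing from spatial transcriptomics data using (unpaired) standard scRNA-seq data from the same biological tissue. Notably, our problem is related to domain adaptation methods such as CORAL (Sun et al., 2016) and to unpaired image-to-image translation (Zhu et al., 2017), which has been applied to genomic data (MAGAN) (Amodio & Krishnaswamy, 2018).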

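Since a fraction of the features (i.e., genes) is present in both datasets, we build on recent advances in domain adaptation (Ganin et al., 2017) and introduce gene imputation with Variational Inference (gimVI), a deep generative model for integrating spatial transcriptomics data and scRNA-seq data that can be used to impute missing genes. gimVI builds on scVI (Lopez et al., 2018), with a different architecture and alternative choices of conditional distributions to better account for technology-specific covariate shift.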

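After describing our generative model (Section 1) and inference procedure (Section 2), we compare gimVI with other methods on real datasets (Section 3). Our source code is based on PyTorch and is available at https://github.com/YosefLab/scVI.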

The gimVI probabilistic model

Omitted.

Figure 1 represents the probabilistic model graphically.


Posterior inference

Omitted.

Performance benchmarks

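To evaluate gimVI, we first present the statistical trade-offs of integrating cells into a joint latent space (Section 3.1), then benchmark our approach to imputing missing genes (Section 3.2). Throughout, we compare gimVI (for different values of κ) with vanilla scVI, Liger, and Seurat. When learning the latent space, only gimVI can exploit the held-out genes in G. This is a key advantage over the other methods, which focus on the gene set G′ and use the complementary genes only for downstream analyses such as imputation. For benchmarking, we run all methods with the same latent-space dimension.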

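We apply gimVI to two pairs of real datasets. First, we use a scRNA-seq dataset of 3,005 mouse somatosensory cortex cells (Zeisel et al., 2015) and an osmFISH dataset of 4,462 cells and 33 genes from the same tissue (Codeluppi et al., 2018) (referred to as mSMS). Second, we use a scRNA-seq dataset of 71,639 mouse prefrontal cortex cells (Saunders et al., 2018) and a starMAP dataset of 3,704 cells and 166 genes from the same tissue (Wang et al., 2018) (referred to as mPFC). For each pair of datasets, we hold out 20% of the genes in G′ and define G as G′ augmented with the held-out genes. In future work, we will investigate the robustness of the model to a wider range of the ratio |G′|/|G|. Because Liger threw a runtime error (probably a memory issue) on an Intel i7-4500U with 8 GB of RAM when running on the mPFC pair, we randomly subsampled the corresponding scRNA-seq dataset to 15,000 cells for benchmarking. We ran scVI and gimVI on an NVIDIA Tesla K80 GPU. For all algorithms, fitting the data took less than a few minutes.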
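The two data-preparation steps described above (holding out 20% of the spatial gene panel and randomly subsampling the scRNA-seq cells) can be sketched as follows. This is a minimal illustration using NumPy only; the function names `hold_out_genes` and `subsample_cells`, the fixed seed, and the placeholder gene names are assumptions made for this sketch and are not taken from the paper or its code.

```python
import numpy as np


def hold_out_genes(spatial_genes, fraction=0.2, seed=0):
    """Randomly split a spatial gene panel into kept genes and held-out genes."""
    rng = np.random.default_rng(seed)
    genes = np.asarray(spatial_genes)
    n_held_out = int(round(fraction * len(genes)))
    held_out_idx = rng.choice(len(genes), size=n_held_out, replace=False)
    mask = np.zeros(len(genes), dtype=bool)
    mask[held_out_idx] = True
    # Kept genes are used for fitting; held-out genes serve as imputation targets.
    return genes[~mask], genes[mask]


def subsample_cells(n_cells, n_keep, seed=0):
    """Return sorted row indices of a random subsample of cells (no replacement)."""
    if n_keep >= n_cells:
        return np.arange(n_cells)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=n_keep, replace=False))


# Hypothetical panel with the same size as the osmFISH panel (33 genes).
panel = [f"gene_{i}" for i in range(33)]
kept_genes, held_out_genes = hold_out_genes(panel, fraction=0.2)

# Subsample 71,639 scRNA-seq cells down to 15,000, as done for the mPFC pair.
kept_cells = subsample_cells(71_639, 15_000)
```

Fixing the seed keeps the held-out set identical across the compared methods, which matters when their imputation scores are later placed side by side.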

Integrating cells into a joint latent space

We assess our model's ability to integrate information from two drastically different datasets. First, we compute the entropy of mixing (Haghverdi et al., 2018) to quantify how well the algorithms integrate the datasets in the latent space. As the pairs of datasets are unbalanced, we use the negative KL divergence between the local distribution of the dataset label $s_n$ among the k-nearest neighbors (k-NN) and its global distribution, which generalizes the entropy of mixing to that setting. Because this metric can be easily maximized by an algorithm that ignores its input, we also propose to use the k-NN purity (Xu et al., 2019) to evaluate whether an algorithm returns a similar latent space when integrating the data as when fitted on each dataset individually. For Seurat (resp. Liger, gimVI), we use PCA (resp. NMF, scVI) as the corresponding method for individual datasets and compute a Jaccard index to measure the overlap of the k-NN graphs. As such metrics could depend on the size of the neighborhood, we report them over a wide range of values for k in Figure 2.
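The two metrics described above can be sketched with NumPy as follows. This is an illustrative reading of the text, not the authors' implementation: the brute-force Euclidean k-NN search, the exclusion of each cell from its own neighborhood, the averaging over cells, and the names `knn_indices`, `neg_kl_mixing`, and `knn_purity` are all assumptions made for this sketch.

```python
import numpy as np


def knn_indices(z, k):
    """Indices of the k nearest neighbors of each row (Euclidean, self excluded)."""
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def neg_kl_mixing(z, dataset_id, k, eps=1e-12):
    """Mean over cells of -KL(local k-NN dataset distribution || global one)."""
    dataset_id = np.asarray(dataset_id)
    labels = np.unique(dataset_id)
    global_p = np.array([(dataset_id == lab).mean() for lab in labels])
    neighbor_ids = dataset_id[knn_indices(z, k)]  # shape (n_cells, k)
    local_p = np.stack([(neighbor_ids == lab).mean(axis=1) for lab in labels], axis=1)
    # Terms with local_p == 0 contribute exactly 0 to the sum.
    kl = np.sum(local_p * (np.log(local_p + eps) - np.log(global_p)), axis=1)
    return float(-kl.mean())


def knn_purity(z_joint, z_single, k):
    """Mean Jaccard index of each cell's k-NN set in two embeddings of the same cells."""
    nn_a, nn_b = knn_indices(z_joint, k), knn_indices(z_single, k)
    scores = []
    for a, b in zip(nn_a, nn_b):
        inter = len(set(a.tolist()) & set(b.tolist()))
        scores.append(inter / (2 * k - inter))  # |A ∪ B| = 2k - |A ∩ B|
    return float(np.mean(scores))


# Synthetic, unbalanced example: 150 cells from dataset 0 and 50 from dataset 1.
rng = np.random.default_rng(0)
ids = np.array([0] * 150 + [1] * 50)
z_mixed = rng.normal(size=(200, 8))      # both datasets share one distribution
z_split = z_mixed.copy()
z_split[ids == 1] += 50.0                # dataset 1 pushed far away
mixing_mixed = neg_kl_mixing(z_mixed, ids, k=15)
mixing_split = neg_kl_mixing(z_split, ids, k=15)
purity_same = knn_purity(z_mixed, z_mixed, k=15)
purity_random = knn_purity(z_mixed, rng.normal(size=(200, 8)), k=15)
```

In this toy setup the well-mixed embedding scores close to 0 (the maximum), while the fully separated one scores much lower. For the purity, the two embeddings passed in should contain the same cells in the same order, for instance the rows of one dataset taken from the joint model and from the single-dataset model.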


Imputing missing genes

For this task, we also ran CORAL and MAGAN. However, MAGAN returned uniformly random values despite our efforts to train the model (results not shown). We report our results in Table 1.
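These notes do not record which score Table 1 uses. As an assumption for illustration only, one common way to score imputation of held-out genes is the per-gene Spearman correlation between imputed and measured values across spatial cells. The sketch below implements it with NumPy; the function names and the tiny example matrices are hypothetical.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1, with tied values receiving their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    rx = rx - rx.mean()
    ry = ry - ry.mean()
    denom = np.sqrt(np.sum(rx ** 2) * np.sum(ry ** 2))
    return float(np.sum(rx * ry) / denom) if denom > 0 else float("nan")


def per_gene_spearman(imputed, observed):
    """Score each held-out gene; both inputs have shape (cells, genes)."""
    return np.array(
        [spearman(imputed[:, g], observed[:, g]) for g in range(observed.shape[1])]
    )


# Hypothetical example: 4 spatial cells, 2 held-out genes.
observed = np.array([[0.0, 5.0], [1.0, 3.0], [2.0, 3.0], [3.0, 1.0]])
imputed = np.array([[0.1, 1.0], [0.4, 2.0], [0.9, 2.0], [2.5, 4.0]])
scores = per_gene_spearman(imputed, observed)
```

A rank-based score is convenient here because imputed values and spatial measurements need not share a scale; only the ordering of cells for each gene is compared.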


In particular, we investigate the held-out gene Lamp5, a known marker gene for excitatory neurons, and visualize its imputed expression according to the cells' locations in Figure 4 for the mSMS spatial dataset. Our result suggests that gimVI provides a spatially more coherent imputation than its competitors, which predict high expression of this gene in the wrong regions of the brain (e.g., layer 6).
