复现原文(一):Single-cell RNA sequencing of human kidney(step by step)

https://www.nature.com/articles/s41597-019-0351-8
背景介绍
作为拥有多种功能的复杂器官,肾脏由多个解剖上不连续的部分构成。其中肾小球与肾小管共同构成了肾单位的核心结构。足细胞与肾小球内皮细胞协同作用形成了最终的过滤屏障——防止蛋白质损失到尿液中。此外,在肾脏结构中发现了一种特殊的顶叶上皮细胞类型(Parietal epithelial cells),它们在调节肾小球通透性方面发挥重要作用。近端小管(proximal tubule)通过控制特定离子转运实现对全身酸碱平衡的调节作用;而远曲小管则主要负责电解质物质的重吸收。在之前的研究中 bulk RNA测序技术被用来分析不同组织样本中的转录组信息;然而该技术仅能反映整体平均RNA表达水平而无法揭示单个细胞层面的情况
完整的人体肾脏细胞解剖结构对于阐明肾脏疾病及肾癌的细胞起源至关重要。某些特定类型的肾脏疾病可能与特定的肾小管细胞类型相关。为了系统地研究人类肾脏的分类及其转录组信息,作者通过快速获取了人 kidney 的单个细胞悬液并成功进行了单个核 RNA 测序 (scRNA-seq) 技术处理(重磅综述:三万字长文读懂单个核 RNA 测序分析的最佳实践教程 (原理、代码和评述))。其中,近端肾小管 (PT) 组织被划分为三个亚型区域,而集合导管组织则被划分为两个亚型区域。这些数据为 kidney 细胞学及其疾病的机制研究提供了重要的参考依据
下面我们按照作者的分析思路复现该文章的部分内容:
首先,从GSE131685中下载数据:

为了避免混淆文件名,在Read10X操作中将实验数据中的文件名分别更改为``barcodes.tsv、$genes.tsv``以及matrix.mtx$;同时,在Hemberg-lab单细胞转录组数据分析(七)- 导入10X和SmartSeq2数据Tabula Muris的研究中得到验证的操作流程下才能确保不会出现错误结果。
Kidney data loading
library(devtools)
install_github("immunogenomics/harmony")
library(Seurat)
library(magrittr)
library(harmony)
library(dplyr)
#Kidney data loading 并构建seurat object
K1.data <- Read10X(data.dir = "/Users/zhanghu1992/Documents/GSE131685_RAW/kidney1/")
K1 <- CreateSeuratObject(counts = K1.data, project = "kidney1", min.cells = 8, min.features = 200)
K2.data <- Read10X(data.dir = "/Users/zhanghu1992/Documents/GSE131685_RAW/kidney2/")
K2 <- CreateSeuratObject(counts = K2.data, project = "kidney2", min.cells = 6, min.features = 200)
K3.data <- Read10X(data.dir = "/Users/zhanghu1992/Documents/GSE131685_RAW/kidney3/")
K3 <- CreateSeuratObject(counts = K3.data, project = "kidney3", min.cells = 10, min.features = 200)
kid <- merge(x = K1, y = list(K2, K3)) #读取文件并用merge函数进行合并

提一句嘴, 来探讨一下华盛顿大学PhD Jared.andrews对merge函数的解读:

Seurat的整合方法在我眼中显得过于直接。如果你决定采用集成路线的话, 我建议使用SeuratWrapper搭配fastMNN
QC
# quality control
kid[["percent.mt"]] <- PercentageFeatureSet(kid, pattern = "^MT-") #提取有关线粒体的基因
VlnPlot(kid, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) #由图可以看出分布还可以

plot1 <- FeatureScatter(kid, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(kid, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))

kid <- subset(kid, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 30) #筛选条件
kid <- NormalizeData(kid, normalization.method = "LogNormalize", scale.factor = 10000)
kid <- NormalizeData(kid) #标准化
kid <- FindVariableFeatures(kid, selection.method = "vst", nfeatures = 2000) #查找高变基因
top10 <- head(VariableFeatures(kid), 10)
plot1 <- VariableFeaturePlot(kid)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))

# 计算细胞周期
s.genes <-cc.genes$s.genes
g2m.genes<-cc.genes$g2m.genes
kid <- CellCycleScoring(kid, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
all.genes <- rownames(kid)
kid <- ScaleData(kid, vars.to.regress = c("S.Score", "G2M.Score"), features = all.genes)
我想谈谈,在我看过的文献资料中发现大多数研究均会在对细胞周期影响进行降维处理后,在分析细胞周期方面对分群的影响时会将其作为一个单独模块来描述。具体而言,在前期阶段并未考虑细胞周期基因是否会对聚类结果产生影响的情况下选择了去除相关基因。为了防止该因素干扰分群分析的结果
#当然我们还是要看是否细胞周期真的有影响,感兴趣的小伙伴可以看一下,确实是有一定影响的!#kid <- ScaleData(kid, features = rownames(kid))
#kid <- RunPCA(kid , features = c(s.genes, g2m.genes))
#DimPlot(kid)
利用harmony算法去除批次效应并细胞分类
#Eliminate batch effects with harmony and cell classification
kid <- RunPCA(kid, pc.genes = kid@var.genes, npcs = 20, verbose = FALSE)
options(repr.plot.height = 2.5, repr.plot.width = 6)
kid <- kid %>%
RunHarmony("orig.ident", plot_convergence = TRUE) #等候时间较长,请溜达溜达吧
harmony_embeddings <- Embeddings(kid, 'harmony')
harmony_embeddings[1:5, 1:5]
kid <- kid %>%
RunUMAP(reduction = "harmony", dims = 1:20) %>%
FindNeighbors(reduction = "harmony", dims = 1:20) %>%
FindClusters(resolution = 0.25) %>%
identity()
new.cluster.ids <- c(0,1, 2, 3, 4, 5, 6, 7,8,9,10)
names(new.cluster.ids) <- levels(kid)
kid <- RenameIdents(kid, new.cluster.ids)


通过"harmony"平台融合多源单组别数据进行深入探索
鉴定Marker基因
#Calculating differentially expressed genes (DEGs) and Save rds file
kid.markers <- FindAllMarkers(kid, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)#寻找高变基因
write.table(kid.markers,sep="\t",file="/home/yuzhenyuan/Seurat/0.2_20.xls")
saveRDS(kid,file="/home/yuzhenyuan/kid/har/0.25_20.rds")
在单细胞分群完成后, 如何确定每一类群对应的Marker基因?
可视化
#Some visual figure generation
DimPlot(kid, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident')

DimPlot(kid, reduction = "umap", group.by = "Phase", pt.size = .1) #按照细胞周期进行划分

DimPlot(kid, reduction = "umap", label = TRUE, pt.size = .1) #注意作者在用同样参数设置后分为10个clusters,其实无关紧要,都需要通过marker重新贴现。

根据作者提供的marker对细胞亚群进行贴现,如下图所示:

实际上, 部分 marker 并非特异性 marker, 在进行辨别的时候一定要谨慎对待.

基本上与原图内容相仿,在tSNE中是否也存在一些随机种子相关因素呢?总能呈现出些许差异:

DoHeatmap(kid, features = c("SLC13A3","SLC34A1","GPX3","DCXR","SLC17A3","SLC22A8","SLC22A7","GNLY","NKG7","CD3D","CD3E","LYZ","CD14","KRT8","KRT18","CD24","VCAM1","UMOD","DEFB1","CLDN8","AQP2","CD79A","CD79B","ATP6V1G3","ATP6V0D2","TMEM213")) # 绘制部分基因热图

学习R语言制作美观的热力图
VlnPlot(kid, pt.size =0, idents= c(1,2,3), features = c("GPX3", "DCXR","SLC13A3","SLC34A1","SLC22A8","SLC22A7"))
VlnPlot(kid, idents= c(8,10), features = c("AQP2", "ATP6V1B1","ATP6V0D2","ATP6V1G3"))


R语言中的box plot是其数据可视化的重要工具;它结合了box-and-whisker plot的特点并附加了细节信息;此外还提供了Violin plot用于展示数据密度分布;Dot plot则用于显示具体数据点的位置;最后还有Scatterplot region plot以更直观的方式呈现数据分布情况
在数据可视化过程中, 选择箱线图的主要原因是什么?
tSNE plot
##tSNE Plot
kid <-RunTSNE(kid, reduction = "harmony", dims = 1:20)
TSNEPlot(kid, do.label = T, label = TRUE, do.return = T, pt.size = 1)
TSNEPlot(kid, do.return = T, pt.size = 1, group.by = "orig.ident", split.by = 'orig.ident')
TSNEPlot(kid, do.return = T, pt.size = 1, group.by = "Phase")
深入解析Hemberg-lab单细胞转录组数据分析系列(十二集),重点探讨Scater单细胞表达谱的tSNE可视化技术
与前面的图是相同的。
提取PT cells进行后续分析
#Select a subset of PT cells(近端小管)
PT <- SubsetData(kid, ident.use = c(0,1,2), subset.raw = T)
