Comparison of clusterProfiler and GSEA-P

原创 2017-06-08 Y叔 biobabble

上一篇why clusterProfiler fails比较了clusterProfiler和geneAnswers，比较的是超几何分布富集分析差异基因，我还顺道吐槽了一些陷阱，新手必看，免得掉坑。
这一次比较的是GSEA，也就是全表达谱分析的算法。两年前有人说通过比较broad institute开发的GSEA-P，发现clusterProfiler有问题，经过我测试，确实是这样，因为背景分布是双峰，需要区分对待。大家不用担心，早在两年以前，已经修复，现在久经考验！为了实现这个比较，我还让clusterProfiler支持Broad Institute的MSigDb数据，所以也就是说在这次比较之后，clusterProfiler就可以支持大家使用MSigDb数据进行富集性分析了，不管是超几何还是GSEA。
很高兴clusterProfiler得到大家的认可，上次文章的评论对我是很大的鼓励，也非常感谢大家的安利：

Thanks for raising to me and his efforts in benchmarking .

He pointed out two issues:

outputs from gseGO and GSEA-P are poorly overlap.
pvalues from gseGO are generally smaller and don’t show a lot of variation

For GSEA analysis, we have two inputs, a ranked gene list and gene set collections.

First of all, the gene set collections are very different. The GMT file used in his test is c5.cc.v5.0.symbols.gmt, which is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.

For instance, with his gene list as input, clusterProfiler annotates 195 genes as ribosome, while GSEA-P (using c5.cc.v5.0.symbols.gmt) only annotates 38 genes.

As the gene set collections is so different, I don’t believe the comparison can produce any valuable results.

The first step should be extending clusterProfiler to support using GMT file as gene set annotation, thereafter we can use identical input (both gene list and gene sets) and then benchmarking will be valuable for detecting issues that exclusively attributed to the implementation of GSEA algorithm.

clusterProfiler supports GMT file

Currently clusterProfiler supports via enricher and GSEA functions which require users provide their own annotation in a data.frame. This is a general interface for using user’s own annotation.

To support GMT file, we only need a function, read.gmt, to parse GMT file and output a data.frame that is suitable for enricher and GSEA. Now this function is available in devel branch (in BioC 3.3 or ) of clusterProfiler.

As used c5.cc, I also use it for further comparison. I have packed the file into clusterProfiler, so that users can use it for testing/practice.

> gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", 
                         package="clusterProfiler")## only 207K.## It's indeed a tiny subset of CC.> file.size(gmtfile)/1000[1] 207.608> c5 <- read.gmt(gmtfile)

hypergeometric test with GMT annotation

> require(clusterProfiler)
> data(geneList, package="DOSE")
> de <- names(geneList)[abs(geneList) > 2]
> head(de)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  > x <- enricher(de, TERM2GENE=c5)## omit some columns to make it more readable> head(summary(x)[, c(1, 3:7)], 3)
                                               ID GeneRatio  BgRatio
SPINDLE                                   SPINDLE     11/82  39/5270MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON     16/82 152/5270CYTOSKELETAL_PART               CYTOSKELETAL_PART     15/82 235/5270
                               pvalue     p.adjust       qvalue
SPINDLE                  7.667674e-12 6.594200e-10 5.327016e-10MICROTUBULE_CYTOSKELETON 8.449298e-10 3.633198e-08 2.935019e-08CYTOSKELETAL_PART        2.414879e-06 6.623386e-05 5.350593e-05

gene set enrichment analysis with GMT annotation

> y <- GSEA(geneList, TERM2GENE=c5)
> head(summary(y)[, -c(1,2)], 2)
                          setSize enrichmentScore       NES      pvalue
EXTRACELLULAR_REGION          401      -0.3860230 -1.694322 0.001237624EXTRACELLULAR_REGION_PART     310      -0.4101043 -1.765775 0.001269036
                            p.adjust    qvalues
EXTRACELLULAR_REGION      0.03047874 0.02316228EXTRACELLULAR_REGION_PART 0.03047874 0.02316228

Comparison of clusterProfiler and GSEA-P

Now with read.gmt, we can compare clusterProfiler and GSEA-P with the same input.

First of all, I export geneList to a rnk file.

data(geneList, package="DOSE")
d=data.frame(gene=names(geneList), FC=geneList)
write.table(d, row.names=F,col.names=F, quote=F, file="geneList.rnk", sep="\t")

And run GSEA-P with the following parameters:

producer_class  xtools.gsea.GseaPreranked
producer_timestamp  1445941169480
param   collapse    false
param   plot_top_x  20
param   rnk /Users/guangchuangyu/Downloads/geneList.rnk
param   norm    meandiv
param   scoring_scheme  weighted
param   make_sets   true
param   mode    Max_probe
param   gmx /Users/guangchuangyu/Downloads/c5.cc.v5.0.entrez.gmt
param   gui false
param   rpt_label   my_analysis
param   help    false
param   out /Users/guangchuangyu/gsea_home/output/oct27
param   include_only_symbols    true
param   set_min 15
param   nperm   1000
param   rnd_seed    timestamp
param   zip_report  false
param   set_max 500

Re-run clusterProfiler::GSEA with pvalueCutoff=1.

g <- read.delim("gsea_report_for_na_neg_1445941169480.xls")
xx <- GSEA(geneList, TERM2GENE=c5, nPerm=1000, pvalueCutoff=1)
gy = merge(g, summary(xx), by.x="NAME", by.y="ID")
ggplot(gy, aes(NOM.p.val, pvalue)) + geom_point() + xlim(0, 1) + ylim(0, 1)

Now the comparison tells! clusterProfiler indeed produce smaller pvalues. As I said in , in general software produce more conservative result is more trustable. This is indeed an issue I couldn’t omit.

This issue is attributed to DOSE package, which serves as a backend of both clusterProfiler and ReactomePA.

In DOSE, we calculate the pvalue in the following way:

For each geneList, we calculate the observed ES, and then perform permutation to generate a null ES distribution. pvalue = (sum(ES >= permES)+1)/(nPerm+1), for greater side as an example, is calculated.

fixed bug of DOSE

Eventually I figured out that the way we calculate pvalue is not correct. As presented in et al, the distribution of ES is bimodal. Positive and negative ES values should be separated when calculating pvalues.

After I changed the , the pvalues generated by clusterProfiler::GSEA and GSEA-P are almost identical.

This bug was fixed in both release (>=2.8.1) and devel (>=2.9.1) branches. The fixed will affect DOSE, clusterProfiler and ReactomePA.

other concerns

There are other differences between clusterProfiler and GSEA-P.

clusterProfiler never produce pvalue=0

For calculating pvalues, GSEA-P may produce pvalue=0, while clusterProfiler never produce pvalue=0 since we add pseudocount, 1, in both numerator and demoninator. Some software may not accept pvalue=0, for example if you want to use topGO to visualize enrichment with GO topology, the result can’t contain any pvalue=0.

clusterProfiler test whole gene set collections

> g=read.delim("gsea_report_for_na_neg_1445941169480.xls")
> dim(g)
[1] 56 12> yy = summary(xx)
> yy = yy[with(yy, setSize >= 15 & setSize <=500),]
> dim(yy)
[1] 152   8> all(g$NAME %in% yy$ID)
[1] TRUE

GSEA-P didn’t filter result by pvalues, as it reported pvalues ranging from 0 to 1.

We cut our result by setSize in [15, 500] as default parameter of GSEA-P.

clusterProfiler tests all the gene sets while GSEA-P only tests a subset.

If I have some free time, I will figure out how they select gene sets to test. If it’s reasonable, we can add a parameter for clusterProfiler users to switch between two modes (full sets or subset). This is still an open issue/question. If you have any idea, don’t hesitate to let me know. :)

Citation

Yu G, Wang L, Han Y and He Q^*. . OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

赞赏

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

“家属和记者取得联系”：记者的退场意味深长

圈内疯传某谣言

不要放过这些人渣

“被指居者”之死：嫌犯身体遭长时间束缚，警方称指居使用械具是惯例

Comparison of clusterProfiler and GSEA-P

clusterProfiler supports GMT file

hypergeometric test with GMT annotation

gene set enrichment analysis with GMT annotation

Comparison of clusterProfiler and GSEA-P

fixed bug of DOSE

other concerns

clusterProfiler never produce pvalue=0

clusterProfiler test whole gene set collections

Citation

您可能也对以下帖子感兴趣

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

“家属和记者取得联系”：记者的退场意味深长

圈内疯传某谣言

不要放过这些人渣

“被指居者”之死：嫌犯身体遭长时间束缚，警方称指居使用械具是惯例

生成图片，分享到微信朋友圈

Comparison of clusterProfiler and GSEA-P

clusterProfiler supports GMT file

hypergeometric test with GMT annotation

gene set enrichment analysis with GMT annotation

Comparison of clusterProfiler and GSEA-P

fixed bug of DOSE

other concerns

clusterProfiler never produce pvalue=0

clusterProfiler test whole gene set collections

Citation

您可能也对以下帖子感兴趣