ANNOVAR region-based annotation (Part 1)

Follow "生信修炼手册" for more!

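Gene-based annotation tells us how a variant relates to genes. Beyond that, how variants are distributed across specific genomic features (transcription factor binding sites, promoters, enhancers, and so on) is often of even greater interest. ANNOVAR provides this through region-based annotation.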

Region-based annotation relies on databases, and each kind of feature region has its own database. ANNOVAR supports the following, among others.

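1. Conserved regions across species

Genome sequences of 46 vertebrate species, including human, mouse, and rat, are multiply aligned, and phastCons is then used to identify genomic regions that are conserved across species. Each conserved element it identifies is assigned a score.

Step 1: download the phastConsElements46way database with the following command: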

annotate_variation.pl -build hg19 -downdb phastConsElements46way humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

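The database file looks like this. Column 1 is the UCSC bin index, columns 2 to 4 give the position of the conserved element on the genome, column 5 is its name, and column 6 is its score.

585  chr1  12002  12085  lod=33   343
585  chr1  12170  12232  lod=123  483
585  chr1  12594  12702  lod=219  545
585  chr1  12994  13054  lod=101  462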

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way ex1.avinput humandb/

NOTICE: Output file is written to ex1.hg19_phastConsElements46way
NOTICE: Reading annotation database humandb/hg19_phastConsElements46way.txt ... Done with 5163775 regions
NOTICE: Finished region-based annotation on 21 genetic variants

The output file has the suffix hg19_phastConsElements46way. Two columns are added in front of the columns of the input file:

phastConsElements46way  Score=300;Name=lod=22
phastConsElements46way  Score=387;Name=lod=50
phastConsElements46way  Score=420;Name=lod=68
phastConsElements46way  Score=385;Name=lod=49
phastConsElements46way  Score=395;Name=lod=54
phastConsElements46way  Score=545;Name=lod=218

The first column is the database name. The second column holds the score and the name of the conserved element that the variant falls in.
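
The second column is easy to pick apart in a script. The following is a minimal sketch, assuming the field is always a list of `key=value` pairs joined by `;`, as in the lines above. The function name `parse_annotation` is made up for illustration and is not part of ANNOVAR.

```python
def parse_annotation(field):
    """Split a value such as 'Score=300;Name=lod=22' into a dict.

    Only the first '=' in each part separates key from value, because
    the value itself may contain '=' (as in 'Name=lod=22').
    """
    result = {}
    for part in field.split(";"):
        key, _, value = part.partition("=")
        result[key] = value
    return result


# Annotation values copied from the output shown above.
for field in ["Score=300;Name=lod=22", "Score=545;Name=lod=218"]:
    parsed = parse_annotation(field)
    print(int(parsed["Score"]), parsed["Name"])
```

Using `partition` instead of a plain `split("=")` matters here, since a naive split would break `Name=lod=22` into three pieces.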

2. TFBS

TFBS stands for transcription factor binding site. The UCSC website provides a database of transcription factor binding sites.

Step 1: download the tfbsConsSites database with the following command:

annotate_variation.pl -build hg19 -downdb tfbsConsSites humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/tfbsConsSites.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

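The database file looks like this. Columns 2 to 4 give the position of the binding site on the genome, and column 5 is the name of the transcription factor.

591  chr1  894640  894654  V$P300_01      842  -  1.68
591  chr1  894641  894657  V$ELK1_01      898  -  2.7
591  chr1  894644  894654  V$CETS1P54_01  971  -  2.22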

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype tfbsConsSites ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_tfbsConsSites
NOTICE: Reading annotation database humandb/hg19_tfbsConsSites.txt ... Done with 5797266 regions
NOTICE: Finished region-based annotation on 21 genetic variants

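The output file has the suffix hg19_tfbsConsSites. Two columns are added in front of the columns of the input file:

tfbsConsSites  Score=767;Name=V$PAX5_02
tfbsConsSites  Score=880;Name=V$CEBPA_01
tfbsConsSites  Score=878;Name=V$FREAC3_01

The first column is the database name. The second column holds the score of the binding region and the name of the corresponding transcription factor.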
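
Conceptually, region-based annotation comes down to asking which database intervals a variant overlaps. The sketch below illustrates that test on the three tfbsConsSites rows shown earlier. It is not ANNOVAR's actual implementation, and it assumes that the variant is given in 1-based inclusive coordinates with a `chr` prefix, while the UCSC rows use 0-based, half-open coordinates. The function name and the query positions are hypothetical.

```python
# Rows copied from the tfbsConsSites sample above: (chrom, start, end, name).
# UCSC tables are 0-based and half-open, so a row covers the 1-based
# positions start+1 .. end.
REGIONS = [
    ("chr1", 894640, 894654, "V$P300_01"),
    ("chr1", 894641, 894657, "V$ELK1_01"),
    ("chr1", 894644, 894654, "V$CETS1P54_01"),
]


def overlapping_regions(chrom, start, end, regions=REGIONS):
    """Return the names of regions overlapping a 1-based inclusive variant."""
    hits = []
    for r_chrom, r_start, r_end, name in regions:
        # Overlap: the variant starts no later than the region ends and
        # ends after the region's 0-based start.
        if r_chrom == chrom and start <= r_end and end > r_start:
            hits.append(name)
    return hits


# Hypothetical positions, chosen only to exercise the function.
print(overlapping_regions("chr1", 894641, 894641))
print(overlapping_regions("chr1", 894650, 894650))
```

A linear scan like this is fine for a handful of rows. For a file with millions of regions, an index over the intervals would be needed.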

3. cytoband

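UCSC provides a cytoBand database, which maps genomic coordinates to cytogenetic bands.

Step 1: download the cytoBand database with the following command: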

annotate_variation.pl -build hg19 -downdb cytoBand  humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

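The database file looks like this. The columns are the chromosome, the start and end of the band, the band name, and the Giemsa stain result.

chr1  0        2300000  p36.33  gneg
chr1  2300000  5400000  p36.32  gpos25
chr1  5400000  7200000  p36.31  gneg
chr1  7200000  9200000  p36.23  gpos25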

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype cytoBand ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_cytoBand
NOTICE: Reading annotation database humandb/hg19_cytoBand.txt ... Done with 862 regions
NOTICE: Finished region-based annotation on 21 genetic variants

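The output file has the suffix hg19_cytoBand. Two columns are added in front of the columns of the input file:

cytoBand  1p36.33
cytoBand  1p36.33
cytoBand  1p36.31
cytoBand  1q23.3
cytoBand  1p31.1

The first column is the database name, and the second column is the name of the cytoband that contains the variant.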
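
The band labels in the output (for example 1p36.33) are the chromosome number followed by the band name from the database. A minimal sketch of that lookup, using only the four database rows shown above, might look like this. The helper `cytoband_label` is hypothetical, and the query positions are made up for illustration.

```python
# Rows copied from the cytoBand sample above:
# (chrom, start, end, band, stain), 0-based half-open coordinates.
BANDS = [
    ("chr1", 0, 2300000, "p36.33", "gneg"),
    ("chr1", 2300000, 5400000, "p36.32", "gpos25"),
    ("chr1", 5400000, 7200000, "p36.31", "gneg"),
    ("chr1", 7200000, 9200000, "p36.23", "gpos25"),
]


def cytoband_label(chrom, pos, bands=BANDS):
    """Return a label such as '1p36.33' for a 1-based position, or None."""
    for b_chrom, start, end, band, _stain in bands:
        if b_chrom == chrom and start < pos <= end:
            short = chrom[3:] if chrom.startswith("chr") else chrom
            return short + band
    return None


# Hypothetical positions on chr1.
print(cytoband_label("chr1", 1000000))
print(cytoband_label("chr1", 6000000))
```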

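4. microRNA and snoRNA

UCSC provides the genomic positions of microRNAs and snoRNAs in a table named wgRna. With this database you can check whether a variant lies in a genomic region that encodes a microRNA or snoRNA.

Step 1: download the database with the following command: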

annotate_variation.pl -build hg19 -downdb wgRna  humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/wgRna.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file looks like this:

585  chr1  30365     30503     hsa-mir-1302-2  0  +  0  0  miRNA
593  chr1  1102483   1102578   hsa-mir-200b    0  +  0  0  miRNA
799  chr1  28160911  28161077  ACA35           0  +  0  0  scaRna
804  chr1  28833876  28834083  U17a            0  +  0  0  HAcaBox
804  chr1  28835069  28835274  U17b            0  +  0  0  HAcaBox

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype wgRna  ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_wgRna
NOTICE: Reading annotation database humandb/hg19_wgRna.txt ... Done with 1341 regions
NOTICE: Finished region-based annotation on 21 genetic variants

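The output file has the suffix hg19_wgRna. Two columns are added in front of the columns of the input file:

wgRna  Name=hsa-mir-1302-2
wgRna  Name=hsa-mir-1290
wgRna  Name=HBII-420

The first column is the database name, and the second column is the name of the microRNA or snoRNA.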
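
The annotation output only carries the RNA name, but the database file also records an RNA type in its last column. The sketch below groups the five sample rows by that type, assuming the real file is tab-separated with the name in column 5 and the type in the last column. The function `names_by_type` is a hypothetical helper.

```python
from collections import defaultdict

# Rows copied from the wgRna sample above, written with explicit tabs.
SAMPLE = """\
585\tchr1\t30365\t30503\thsa-mir-1302-2\t0\t+\t0\t0\tmiRNA
593\tchr1\t1102483\t1102578\thsa-mir-200b\t0\t+\t0\t0\tmiRNA
799\tchr1\t28160911\t28161077\tACA35\t0\t+\t0\t0\tscaRna
804\tchr1\t28833876\t28834083\tU17a\t0\t+\t0\t0\tHAcaBox
804\tchr1\t28835069\t28835274\tU17b\t0\t+\t0\t0\tHAcaBox
"""


def names_by_type(text):
    """Map each RNA type (last column) to the list of RNA names (column 5)."""
    by_type = defaultdict(list)
    for line in text.splitlines():
        cols = line.split("\t")
        by_type[cols[-1]].append(cols[4])
    return dict(by_type)


for rna_type, names in names_by_type(SAMPLE).items():
    print(rna_type, names)
```

A lookup table built this way could be joined with the annotation output to tell whether a hit is a miRNA or one of the snoRNA classes.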

5. microRNA binding sites

UCSC hosts the microRNA binding sites predicted by the TargetScanHuman website.

Step 1: download the targetScanS database with the following command:

annotate_variation.pl -build hg19 -downdb targetScanS  humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/targetScanS.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file looks like this:

591  chr1  879822  879830  SAMD11:miR-504         90  +
591  chr1  900599  900606  KLHL17:miR-299/299-3p  26  +
591  chr1  900605  900612  KLHL17:miR-124/506     7   +
591  chr1  900933  900941  KLHL17:miR-19          82  +
591  chr1  901054  901061  KLHL17:miR-137         14  +

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype targetScanS  ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_targetScanS
NOTICE: Reading annotation database humandb/hg19_targetScanS.txt ... Done with 54199 regions
NOTICE: Finished region-based annotation on 21 genetic variants

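The output file has the suffix hg19_targetScanS. Two columns are added in front of the columns of the input file:

targetScanS  Score=90;Name=SAMD11:miR-504
targetScanS  Score=82;Name=KLHL17:miR-19

The first column is the database name. The second column holds the score of the binding region, followed by the target gene and the microRNA.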
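
Each name joins the target gene and the microRNA with a colon, so it is straightforward to regroup the sites by gene. Here is a small sketch over the five names from the database sample above, assuming the gene symbol never contains a colon. The name `families_by_gene` is hypothetical.

```python
# Name column copied from the targetScanS sample rows above.
NAMES = [
    "SAMD11:miR-504",
    "KLHL17:miR-299/299-3p",
    "KLHL17:miR-124/506",
    "KLHL17:miR-19",
    "KLHL17:miR-137",
]


def families_by_gene(names):
    """Map each target gene to the microRNAs predicted to bind it."""
    result = {}
    for name in names:
        # Split at the first colon only: left is the gene, right the miRNA.
        gene, _, mirna = name.partition(":")
        result.setdefault(gene, []).append(mirna)
    return result


for gene, mirnas in families_by_gene(NAMES).items():
    print(gene, mirnas)
```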

6. segmental duplications

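Segmental duplications are duplicated sequence regions of the genome. Because the copies are highly homologous, reads from these regions are prone to misalignment, so variants called there should be treated with caution.

Step 1: download the genomicSuperDups database with the following command: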

annotate_variation.pl -build hg19 -downdb genomicSuperDups  humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/genomicSuperDups.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file has many columns. Only the first five are shown here:

585  chr1  10000  87112  chr15:102446355
585  chr1  10000  20818  chr12:84886
585  chr1  10000  19844  chrY:59352887
585  chr1  10000  19844  chrX:155249881
585  chr1  10464  40733  chr2:114330297

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype genomicSuperDups  ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_genomicSuperDups
NOTICE: Reading annotation database humandb/hg19_genomicSuperDups.txt ... Done with 51599 regions
NOTICE: Finished region-based annotation on 21 genetic variants

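The output file has the suffix hg19_genomicSuperDups. Two columns are added in front of the columns of the input file:

genomicSuperDups  Score=0.905283;Name=chr1:1439902
genomicSuperDups  Score=0.99612;Name=chr1:13142561
genomicSuperDups  Score=0.991956;Name=chr15:102446355

The first column is the database name. The second column holds the score and the name of the duplicated region.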
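
One possible use of this annotation is to flag variants that sit in very similar duplications. The sketch below assumes that the Score reflects the sequence identity between the two duplicated segments (worth confirming in the ANNOVAR documentation) and uses an arbitrary cutoff of 0.99. Both the function name and the threshold are illustrative assumptions.

```python
def in_high_identity_dup(field, threshold=0.99):
    """Return True if the Score in a genomicSuperDups value meets the cutoff.

    `threshold` is an arbitrary illustrative value, not a recommended one.
    """
    for part in field.split(";"):
        key, _, value = part.partition("=")
        if key == "Score":
            return float(value) >= threshold
    return False


# Annotation values copied from the output shown above.
FIELDS = [
    "Score=0.905283;Name=chr1:1439902",
    "Score=0.99612;Name=chr1:13142561",
    "Score=0.991956;Name=chr15:102446355",
]
for field in FIELDS:
    print(field, in_high_identity_dup(field))
```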

7. structural variants

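The DGV (Database of Genomic Variants) stores information on genomic structural variants. ANNOVAR uses it to check whether a variant falls within a published structural variant interval.

Step 1: download the dgvMerged database with the following command: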

annotate_variation.pl -build hg19 -downdb dgvMerged  humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/dgvMerged.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file has many columns. Only the first five are shown here:

9    chr1  0      2300000  nsv482937
585  chr1  10000  127330   nsv7879
585  chr1  10000  22118    dgv1n82
585  chr1  10190  10281    nsv958854
73   chr1  10376  1018704  esv2758911

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype dgvMerged  ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_dgvMerged
NOTICE: Reading annotation database humandb/hg19_dgvMerged.txt ... Done with 392583 regions
NOTICE: Finished region-based annotation on 21 genetic variants

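The output file has the suffix hg19_dgvMerged. Two columns are added in front of the columns of the input file:

dgvMerged  Name=nsv832536,nsv545407
dgvMerged  Name=nsv830937,dgv235n100
dgvMerged  Name=nsv1243
dgvMerged  Name=nsv584699
dgvMerged  Name=esv3638608

The first column is the database name, and the second column lists the IDs of the overlapping structural variants in DGV.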
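
A variant can overlap several structural variants, in which case the IDs are joined by commas. The following sketch splits the IDs and tallies them by their letter prefix (nsv, esv, dgv), using the five output values above. The helper names are hypothetical, and the sketch assumes every ID starts with lowercase letters.

```python
import re
from collections import Counter

# Annotation values copied from the output shown above.
FIELDS = [
    "Name=nsv832536,nsv545407",
    "Name=nsv830937,dgv235n100",
    "Name=nsv1243",
    "Name=nsv584699",
    "Name=esv3638608",
]


def dgv_ids(field):
    """Return the list of structural variant IDs in a 'Name=...' value."""
    return field.split("=", 1)[1].split(",")


def prefix_counts(fields):
    """Count IDs by their leading lowercase letters, e.g. 'nsv'."""
    counts = Counter()
    for field in fields:
        for sv_id in dgv_ids(field):
            counts[re.match(r"[a-z]+", sv_id).group()] += 1
    return counts


print(prefix_counts(FIELDS))
```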

8. GWAS

This annotation checks whether a variant has been reported in a previous GWAS.

Step 1: download the gwasCatalog database with the following command:

annotate_variation.pl -build hg19 -downdb gwasCatalog  humandb/

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

The database file has many columns. Only the first five are shown here:

590  chr1  780396   780397   rs141175086
591  chr1  894572   894573   rs13303010
592  chr1  1005805  1005806  rs3934834
593  chr1  1079197  1079198  rs11260603
593  chr1  1173610  1173611  rs6697886

Step 2: run the annotation with the following command:

annotate_variation.pl -regionanno -build hg19 -dbtype gwasCatalog  ex1.avinput humandb/

NOTICE: Output file is written to ex1.avinput.hg19_gwasCatalog
NOTICE: Reading annotation database humandb/hg19_gwasCatalog.txt ... Done with 75593 regions
NOTICE: Finished region-based annotation on 21 genetic variants

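The output file has the suffix hg19_gwasCatalog. Two columns are added in front of the columns of the input file:

gwasCatalog  Name=Crohn's disease
gwasCatalog  Name=Chronic inflammatory diseases

The first column is the database name, and the second column is the name of the disease or trait associated with the variant.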

Region-based annotation supports many more databases. This article covers only the ones above; the rest will be introduced in follow-up articles.
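
Since every database above follows the same download-then-annotate pattern, the commands can be generated in a loop. The sketch below only builds and prints the command lines. It uses the same flags as the commands in this article and the default file names `ex1.avinput` and `humandb/`. The function `build_commands` is a hypothetical helper, and nothing is executed.

```python
# The eight -dbtype values used in this article.
DBTYPES = [
    "phastConsElements46way", "tfbsConsSites", "cytoBand", "wgRna",
    "targetScanS", "genomicSuperDups", "dgvMerged", "gwasCatalog",
]


def build_commands(avinput="ex1.avinput", dbdir="humandb/", build="hg19"):
    """Return (download, annotate) argument lists for each database."""
    commands = []
    for dbtype in DBTYPES:
        download = ["annotate_variation.pl", "-build", build,
                    "-downdb", dbtype, dbdir]
        annotate = ["annotate_variation.pl", "-regionanno", "-build", build,
                    "-dbtype", dbtype, avinput, dbdir]
        commands.append((download, annotate))
    return commands


for download, annotate in build_commands():
    print(" ".join(download))
    print(" ".join(annotate))
```

Each argument list could be handed to `subprocess.run` to actually run the step, provided `annotate_variation.pl` is on the PATH.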
