Hello, this is 한주현.
Today we'll look at a simple command for annotating a VCF with SnpEff, and at two sub-fields of the ANN field in the annotated VCF: Annotation and Annotation_Impact.
If you are looking for how to install SnpEff, see the following link:
http://korbillgates.tistory.com/61
SnpEff usage 1 - download and installation, VCF annotation
As covered in the post linked above, let's briefly review the command for annotating a VCF.
How to annotate a VCF with SnpEff
java -jar snpEff.jar hg19 in.vcf > out.vcf
SnpEff runs on Java 1.7 or later, and a VCF with many lines may require a lot of memory.
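For larger VCFs, the JVM heap size can be raised with the standard `-Xmx` option. The sketch below only assembles the command line in Python. The helper name `build_snpeff_command` and the `8g` default are illustrative assumptions, not part of SnpEff.

```python
import shlex


def build_snpeff_command(genome, in_vcf, jar="snpEff.jar", max_heap="8g"):
    """Return the argument list for a SnpEff run (hypothetical helper).

    max_heap is passed to the JVM as -Xmx; the 8g default is an assumption.
    """
    return ["java", f"-Xmx{max_heap}", "-jar", jar, genome, in_vcf]


cmd = build_snpeff_command("hg19", "in.vcf")
# Show the command as it would be typed in a shell. SnpEff writes the
# annotated VCF to standard output, so it is redirected to a file.
print(shlex.join(cmd) + " > out.vcf")
```

The list form can be passed to `subprocess.run` together with an open output file as `stdout`, if you would rather launch the run from a script.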
Let's take a look at a VCF annotated by SnpEff.
VCF before annotation
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 . G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
VCF after annotation
##SnpEffVersion="4.1g (build 2015-05-17), by Pablo Cingolani"
##SnpEffCmd="SnpEff hg19 test.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 . G A 29.0 PASS NS=3;DP=14;AF=0.5;DB;H2;ANN=A|intergenic_region|MODIFIER|DEFB125|DEFB125|intergenic_region|DEFB125||||||||| GT:GQ:DP:HQ 0|0:48:1:51,51
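In the annotated VCF, you can see that an ANN field has been added to the INFO column.
For a detailed description of the ANN field, see the following link:
The ANN field is delimited by the pipe character ( | ).
In the example, the value MODIFIER is the third item when the field is split on pipes, and according to the ##INFO header line that position is Annotation_Impact.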
Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | ... (rest omitted) |
---|---|---|---|---|---|
A | intergenic_region | MODIFIER | DEFB125 | DEFB125 | ... (rest omitted) |
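The mapping between the header and the values can be reproduced in a few lines of Python. This is a minimal sketch: the helper names `ann_field_names` and `parse_ann` are made up for illustration, and the header and INFO strings are copied from the example above.

```python
import re

# ANN header line and INFO value taken from the example VCF in this post.
ANN_HEADER = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: '
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | "
    "Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | "
    "cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | "
    "Distance | ERRORS / WARNINGS / INFO' \">"
)
INFO = ("NS=3;DP=14;AF=0.5;DB;H2;ANN=A|intergenic_region|MODIFIER|DEFB125|"
        "DEFB125|intergenic_region|DEFB125" + "|" * 9)


def ann_field_names(header_line):
    """Extract the sub-field names from the single-quoted list in the header."""
    quoted = re.search(r"'(.*)'", header_line).group(1)
    return [name.strip() for name in quoted.split("|")]


def parse_ann(info, names):
    """Return one dict per comma-separated annotation in the ANN entry."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [dict(zip(names, ann.split("|")))
                    for ann in entry[len("ANN="):].split(",")]
    return []  # no ANN entry in this INFO column


names = ann_field_names(ANN_HEADER)
first = parse_ann(INFO, names)[0]
print(first["Annotation"], first["Annotation_Impact"])
```

Reading the names from the header rather than hard-coding positions keeps the script aligned with whatever ANN layout the file itself declares.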
Annotation_Impact takes one of four values: "HIGH", "MODERATE", "LOW", or "MODIFIER".
SnpEff assigns these four categories based on the Sequence Ontology (SO) term.
The table below summarizes the Annotation_Impact values and their meanings, based on the following link:
Impact | Meaning | Example |
---|---|---|
HIGH | The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay. | stop_gained, frameshift_variant |
MODERATE | A non-disruptive variant that might change protein effectiveness. | missense_variant, inframe_deletion |
LOW | Assumed to be mostly harmless or unlikely to change protein behavior. | synonymous_variant |
MODIFIER | Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. | exon_variant, downstream_gene_variant |
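A single variant can carry several comma-separated annotations in ANN, so a common first step is to pick the most severe impact among them. The sketch below assumes the ordering HIGH > MODERATE > LOW > MODIFIER. The function name `most_severe_impact` and the two-annotation example value (gene names `GENE1`, `GENE2`) are hypothetical.

```python
# Lower rank means more severe; the ordering is the assumption stated above.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def most_severe_impact(ann_value):
    """Return the most severe Annotation_Impact found in an ANN value.

    ann_value is the text after 'ANN=': annotations are separated by commas,
    sub-fields by pipes, and the impact is the third sub-field.
    """
    impacts = [ann.split("|")[2] for ann in ann_value.split(",")]
    return min(impacts, key=IMPACT_RANK.__getitem__)


# Hypothetical ANN value with two annotations, truncated after Gene_ID.
example = ("A|missense_variant|MODERATE|GENE1|GENE1,"
           "A|downstream_gene_variant|MODIFIER|GENE2|GENE2")
print(most_severe_impact(example))
```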
SnpEff puts an SO term into the Annotation sub-field of the ANN field based on the variant's position in the VCF, and then assigns Annotation_Impact according to that term.
Pages 6-8 of the following document list the Putative Impact for each SO term.
http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf
The SnpEff manual page also has the table below.
Effect Seq. Ontology | Effect Classic | Note & Example | Impact |
---|---|---|---|
coding_sequence_variant | CDS | The variant hits a CDS. | MODIFIER |
chromosome | CHROMOSOME_LARGE_DELETION | A large part (over 1% or 1,000,000 bases) of the chromosome was deleted. | HIGH |
duplication | CHROMOSOME_LARGE_DUPLICATION | Duplication of a large chromosome segment (over 1% or 1,000,000 bases). | LOW |
inversion | CHROMOSOME_LARGE_INVERSION | Inversion of a large chromosome segment (over 1% or 1,000,000 bases). | MODERATE |
coding_sequence_variant | CODON_CHANGE | One or many codons are changed e.g.: An MNP of size multiple of 3 | MODERATE |
inframe_insertion | CODON_INSERTION | One or many codons are inserted e.g.: An insert multiple of three in a codon boundary | MODERATE |
disruptive_inframe_insertion | CODON_CHANGE_PLUS CODON_INSERTION | One codon is changed and one or many codons are inserted e.g.: An insert of size multiple of three, not at codon boundary | MODERATE |
inframe_deletion | CODON_DELETION | One or many codons are deleted e.g.: A deletion multiple of three at codon boundary | MODERATE |
disruptive_inframe_deletion | CODON_CHANGE_PLUS CODON_DELETION | One codon is changed and one or more codons are deleted e.g.: A deletion of size multiple of three, not at codon boundary | MODERATE |
downstream_gene_variant | DOWNSTREAM | Downstream of a gene (default length: 5K bases) | MODIFIER |
exon_variant | EXON | The variant hits an exon (from a non-coding transcript) or a retained intron. | MODIFIER |
exon_loss_variant | EXON_DELETED | A deletion removes the whole exon. | HIGH |
exon_loss_variant | EXON_DELETED_PARTIAL | Deletion affecting part of an exon. | HIGH |
duplication | EXON_DUPLICATION | Duplication of an exon. | HIGH |
duplication | EXON_DUPLICATION_PARTIAL | Duplication affecting part of an exon. | HIGH |
inversion | EXON_INVERSION | Inversion of an exon. | HIGH |
inversion | EXON_INVERSION_PARTIAL | Inversion affecting part of an exon. | HIGH |
frameshift_variant | FRAME_SHIFT | Insertion or deletion causes a frame shift e.g.: An indel size is not a multiple of 3 | HIGH |
gene_variant | GENE | The variant hits a gene. | MODIFIER |
feature_ablation | GENE_DELETED | Deletion of a gene. | HIGH |
duplication | GENE_DUPLICATION | Duplication of a gene. | HIGH |
gene_fusion | GENE_FUSION | Fusion of two genes. | HIGH |
gene_fusion | GENE_FUSION_HALF | Fusion of one gene and an intergenic region. | HIGH |
bidirectional_gene_fusion | GENE_FUSION_REVERESE | Fusion of two genes in opposite directions. | HIGH |
rearranged_at_DNA_level | GENE_REARRANGEMENT | Rearrangement affecting one or more genes. | HIGH |
intergenic_region | INTERGENIC | The variant is in an intergenic region | MODIFIER |
conserved_intergenic_variant | INTERGENIC_CONSERVED | The variant is in a highly conserved intergenic region | MODIFIER |
intragenic_variant | INTRAGENIC | The variant hits a gene, but no transcripts within the gene | MODIFIER |
intron_variant | INTRON | Variant hits an intron. Technically, hits no exon in the transcript. | MODIFIER |
conserved_intron_variant | INTRON_CONSERVED | The variant is in a highly conserved intronic region | MODIFIER |
miRNA | MICRO_RNA | Variant affects an miRNA | MODIFIER |
missense_variant | NON_SYNONYMOUS_CODING | Variant causes a codon that produces a different amino acid e.g.: Tgg/Cgg, W/R | MODERATE |
initiator_codon_variant | NON_SYNONYMOUS_START | Variant causes start codon to be mutated into another start codon (the new codon produces a different AA). e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons) | LOW |
stop_retained_variant | NON_SYNONYMOUS_STOP | Variant causes stop codon to be mutated into another stop codon (the new codon produces a different AA). e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons) | LOW |
protein_protein_contact | PROTEIN_PROTEIN_INTERACTION_LOCUS | Protein-protein interaction loci. | HIGH |
structural_interaction_variant | PROTEIN_STRUCTURAL_INTERACTION_LOCUS | Within protein interaction loci (e.g. two AA that are in contact within the same protein, possibly helping structural conformation). | HIGH |
rare_amino_acid_variant | RARE_AMINO_ACID | The variant hits a rare amino acid thus is likely to produce protein loss of function | HIGH |
splice_acceptor_variant | SPLICE_SITE_ACCEPTOR | The variant hits a splice acceptor site (defined as two bases before exon start, except for the first exon). | HIGH |
splice_donor_variant | SPLICE_SITE_DONOR | The variant hits a Splice donor site (defined as two bases after coding exon end, except for the last exon). | HIGH |
splice_region_variant | SPLICE_SITE_REGION | A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron. | LOW |
splice_region_variant | SPLICE_SITE_BRANCH | A variant affecting a putative (lariat) branch point, located in the intron. | LOW |
splice_region_variant | SPLICE_SITE_BRANCH_U12 | A variant affecting a putative (lariat) branch point from the U12 splicing machinery, located in the intron. | MODERATE |
stop_lost | STOP_LOST | Variant causes stop codon to be mutated into a non-stop codon e.g.: Tga/Cga, */R | HIGH |
5_prime_UTR_premature_start_codon_gain_variant | START_GAINED | A variant in 5'UTR region produces a three base sequence that can be a START codon. | LOW |
start_lost | START_LOST | Variant causes start codon to be mutated into a non-start codon. e.g.: aTg/aGg, M/R | HIGH |
stop_gained | STOP_GAINED | Variant causes a STOP codon e.g.: Cag/Tag, Q/* | HIGH |
synonymous_variant | SYNONYMOUS_CODING | Variant causes a codon that produces the same amino acid e.g.: Ttg/Ctg, L/L | LOW |
start_retained | SYNONYMOUS_START | Variant causes start codon to be mutated into another start codon. e.g.: Ttg/Ctg, L/L (TTG and CTG can be START codons) | LOW |
stop_retained_variant | SYNONYMOUS_STOP | Variant causes stop codon to be mutated into another stop codon. e.g.: taA/taG, */* | LOW |
transcript_variant | TRANSCRIPT | The variant hits a transcript. | MODIFIER |
feature_ablation | TRANSCRIPT_DELETED | Deletion of a transcript. | HIGH |
regulatory_region_variant | REGULATION | The variant hits a known regulatory feature (non-coding). | MODIFIER |
upstream_gene_variant | UPSTREAM | Upstream of a gene (default length: 5K bases) | MODIFIER |
3_prime_UTR_variant | UTR_3_PRIME | Variant hits 3'UTR region | MODIFIER |
3_prime_UTR_truncation + exon_loss | UTR_3_DELETED | The variant deletes an exon which is in the 3'UTR of the transcript | MODERATE |
5_prime_UTR_variant | UTR_5_PRIME | Variant hits 5'UTR region | MODIFIER |
5_prime_UTR_truncation + exon_loss_variant | UTR_5_DELETED | The variant deletes an exon which is in the 5'UTR of the transcript | MODERATE |
sequence_feature + exon_loss_variant | NEXT_PROT | A 'NextProt' based annotation. Details are provided in the 'feature type' sub-field (ANN), or in the effect details (EFF). | MODERATE |
For reference, this is the Sequence Ontology page:
At the example position, the SnpEff Annotation in the ANN field is intergenic_region,
which the SnpEff table above classifies as MODIFIER.
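That check can also be written as a small lookup. The sketch below is only an illustration: `SO_TERM_IMPACT` is a hand-picked subset of the table above, limited to SO terms that appear there with a single impact, and `impact_matches_table` is a made-up helper name.

```python
# Partial lookup copied from the table above. Terms such as duplication or
# splice_region_variant are left out because they map to more than one impact.
SO_TERM_IMPACT = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "intergenic_region": "MODIFIER",
    "downstream_gene_variant": "MODIFIER",
    "upstream_gene_variant": "MODIFIER",
}


def impact_matches_table(ann):
    """Compare one pipe-delimited annotation against SO_TERM_IMPACT.

    Returns True or False when the SO term is in the lookup, and None
    when the term is not covered by this partial table.
    """
    fields = ann.split("|")
    expected = SO_TERM_IMPACT.get(fields[1])
    return None if expected is None else expected == fields[2]


# The ANN value from the example VCF in this post.
ann = ("A|intergenic_region|MODIFIER|DEFB125|DEFB125|intergenic_region|DEFB125"
       + "|" * 9)
print(impact_matches_table(ann))
```

Annotations whose Annotation sub-field is not a single term from the lookup simply return None here, so the helper never guesses an impact it does not know.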
That wraps up how to annotate a VCF with SnpEff and how the Sequence Ontology terms in the ANN field relate to Annotation_Impact.
See you next time!