[SnpEff] How to Use SnpEff, Part 2: VCF annotation, the SnpEff ANN field, and Sequence Ontology terms

by HanJoohyun 2017. 10. 31.



Hello, this is Joohyun Han.


Today we'll go over the basic command for annotating a VCF with SnpEff, and then look at two sub-fields of the ANN field in the annotated VCF: Annotation and Annotation_Impact.


If you are looking for how to install SnpEff, see the following link:

http://korbillgates.tistory.com/61

How to Use SnpEff, Part 1: download, installation, and VCF annotation


As covered in the post linked above, let's briefly review the command for VCF annotation.



How to annotate a VCF with SnpEff

java -jar snpEff.jar hg19 in.vcf > out.vcf


At the time of writing, SnpEff runs on Java 1.7 or later. If the input VCF has many lines, it may need a lot of memory.
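If memory becomes a problem, the JVM heap limit can be raised with Java's standard `-Xmx` option. Below is a minimal sketch of building and running the command from Python. The helper names `build_snpeff_command` and `annotate`, and the `4g` default, are assumptions made for illustration, not values recommended by SnpEff.

```python
import subprocess


def build_snpeff_command(jar_path, genome, vcf_path, max_heap="4g"):
    """Return the argument list for a SnpEff annotation run.

    max_heap is passed to the JVM as -Xmx (an illustrative default).
    """
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path, genome, vcf_path]


def annotate(jar_path, genome, in_vcf, out_vcf, max_heap="4g"):
    """Run SnpEff and write its standard output to out_vcf."""
    cmd = build_snpeff_command(jar_path, genome, in_vcf, max_heap)
    with open(out_vcf, "w") as out:
        # check=True raises CalledProcessError if java exits with an error
        subprocess.run(cmd, stdout=out, check=True)


print(build_snpeff_command("snpEff.jar", "hg19", "in.vcf"))
```

Calling `annotate("snpEff.jar", "hg19", "in.vcf", "out.vcf")` would then be equivalent to the shell command above with a larger heap.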



Let's take a look at a VCF annotated with SnpEff.



VCF before annotation
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
20      14370   .       G       A       29      PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ     0|0:48:1:51,51

VCF after annotation
##SnpEffVersion="4.1g (build 2015-05-17), by Pablo Cingolani"
##SnpEffCmd="SnpEff  hg19 test.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
20      14370   .       G       A       29.0    PASS    NS=3;DP=14;AF=0.5;DB;H2;ANN=A|intergenic_region|MODIFIER|DEFB125|DEFB125|intergenic_region|DEFB125|||||||||     GT:GQ:DP:HQ     0|0:48:1:51,51


In the annotated VCF, you can see that an ANN field has been added to the INFO column.
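To pull the ANN value out of a record, the INFO column can be split on `;` and each item on `=`. This is a minimal sketch using the example record from this post; `parse_info` is a hypothetical helper, not part of SnpEff, and it assumes a tab-separated VCF line.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# The annotated example record, written with explicit tab separators.
line = "\t".join([
    "20", "14370", ".", "G", "A", "29.0", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2;ANN=A|intergenic_region|MODIFIER|DEFB125|DEFB125|intergenic_region|DEFB125|||||||||",
    "GT:GQ:DP:HQ", "0|0:48:1:51,51",
])

info = parse_info(line.split("\t")[7])  # INFO is the 8th column
print(info["ANN"])
```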

For a detailed description of the ANN field, see the SnpEff documentation.


The ANN field is delimited by the pipe character ( | ).
In the example, MODIFIER is the third item when the string is split on pipes; according to the ##INFO header, this is Annotation_Impact.

| Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | ... (rest omitted) |
|---|---|---|---|---|---|
| A | intergenic_region | MODIFIER | DEFB125 | DEFB125 | ... (rest omitted) |
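The pairing of header names and values shown above can be done programmatically by reading the field names from the `##INFO=<ID=ANN...>` header line and zipping them with the pipe-split values. This is a rough sketch; `ann_field_names` and `parse_ann_entry` are hypothetical helpers, and the regular expression assumes the field list is the only single-quoted part of the header, as in the example.

```python
import re

ANN_HEADER = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: '
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | "
    "Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | "
    "CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' \">"
)


def ann_field_names(header_line):
    """Extract the sub-field names listed between single quotes in the header."""
    inner = re.search(r"'(.*)'", header_line).group(1)
    return [name.strip() for name in inner.split("|")]


def parse_ann_entry(entry, names):
    """Map one pipe-delimited ANN entry onto the header's sub-field names."""
    return dict(zip(names, entry.split("|")))


names = ann_field_names(ANN_HEADER)
entry = "A|intergenic_region|MODIFIER|DEFB125|DEFB125|intergenic_region|DEFB125|||||||||"
record = parse_ann_entry(entry, names)
print(record["Annotation"], record["Annotation_Impact"])
```

Reading the names from the header rather than hard-coding positions keeps the parser working if a different SnpEff version changes the field list.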


Annotation_Impact takes one of four values: "HIGH", "MODERATE", "LOW", or "MODIFIER".
SnpEff assigns one of these four categories based on the Sequence Ontology (SO) term.

The table below summarizes each Annotation_Impact value and its meaning, based on the SnpEff documentation.

| Impact | Meaning | Example |
|---|---|---|
| HIGH | The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay. | stop_gained, frameshift_variant |
| MODERATE | A non-disruptive variant that might change protein effectiveness. | missense_variant, inframe_deletion |
| LOW | Assumed to be mostly harmless or unlikely to change protein behavior. | synonymous_variant |
| MODIFIER | Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. | exon_variant, downstream_gene_variant |
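A common use of these four categories is filtering variants by severity. The sketch below assumes the ordering HIGH > MODERATE > LOW > MODIFIER (the numeric ranks are arbitrary values chosen for illustration) and that a variant may carry several comma-separated ANN entries. The second entry in the example is made up for illustration and is not real SnpEff output.

```python
# Assumed severity ordering; the numbers are only used for comparison.
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


def most_severe_impact(ann_value):
    """Return the highest-ranked Annotation_Impact among all ANN entries."""
    impacts = [entry.split("|")[2] for entry in ann_value.split(",") if entry]
    return max(impacts, key=IMPACT_RANK.__getitem__)


def passes(ann_value, minimum="MODERATE"):
    """True if the most severe impact is at least `minimum`."""
    return IMPACT_RANK[most_severe_impact(ann_value)] >= IMPACT_RANK[minimum]


real = "A|intergenic_region|MODIFIER|DEFB125|DEFB125|intergenic_region|DEFB125|||||||||"
# Hypothetical entry, invented for this example only.
made_up = "A|missense_variant|MODERATE|GENE_X|GENE_X|transcript|TX_1|protein_coding||||||||"

print(most_severe_impact(real))
print(passes(real + "," + made_up))
```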


For each position in the VCF, SnpEff writes an SO term into the Annotation sub-field of the ANN field, and then assigns Annotation_Impact based on that term.

Pages 6-8 of the following document list the Putative Impact category for each SO term.
http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

The SnpEff manual page also summarizes this in the table below.


| Effect (Seq. Ontology) | Effect (Classic) | Note & Example | Impact |
|---|---|---|---|
| coding_sequence_variant | CDS | The variant hits a CDS. | MODIFIER |
| chromosome | CHROMOSOME_LARGE_DELETION | A large part (over 1% or 1,000,000 bases) of the chromosome was deleted. | HIGH |
| duplication | CHROMOSOME_LARGE_DUPLICATION | Duplication of a large chromosome segment (over 1% or 1,000,000 bases). | HIGH |
| inversion | CHROMOSOME_LARGE_INVERSION | Inversion of a large chromosome segment (over 1% or 1,000,000 bases). | HIGH |
| coding_sequence_variant | CODON_CHANGE | One or many codons are changed. e.g.: An MNP of size multiple of 3 | MODERATE |
| inframe_insertion | CODON_INSERTION | One or many codons are inserted. e.g.: An insert multiple of three in a codon boundary | MODERATE |
| disruptive_inframe_insertion | CODON_CHANGE_PLUS CODON_INSERTION | One codon is changed and one or many codons are inserted. e.g.: An insert of size multiple of three, not at codon boundary | MODERATE |
| inframe_deletion | CODON_DELETION | One or many codons are deleted. e.g.: A deletion multiple of three at codon boundary | MODERATE |
| disruptive_inframe_deletion | CODON_CHANGE_PLUS CODON_DELETION | One codon is changed and one or more codons are deleted. e.g.: A deletion of size multiple of three, not at codon boundary | MODERATE |
| downstream_gene_variant | DOWNSTREAM | Downstream of a gene (default length: 5K bases) | MODIFIER |
| exon_variant | EXON | The variant hits an exon (from a non-coding transcript) or a retained intron. | MODIFIER |
| exon_loss_variant | EXON_DELETED | A deletion removes the whole exon. | HIGH |
| exon_loss_variant | EXON_DELETED_PARTIAL | Deletion affecting part of an exon. | HIGH |
| duplication | EXON_DUPLICATION | Duplication of an exon. | HIGH |
| duplication | EXON_DUPLICATION_PARTIAL | Duplication affecting part of an exon. | HIGH |
| inversion | EXON_INVERSION | Inversion of an exon. | HIGH |
| inversion | EXON_INVERSION_PARTIAL | Inversion affecting part of an exon. | HIGH |
| frameshift_variant | FRAME_SHIFT | Insertion or deletion causes a frame shift. e.g.: An indel size is not a multiple of 3 | HIGH |
| gene_variant | GENE | The variant hits a gene. | MODIFIER |
| feature_ablation | GENE_DELETED | Deletion of a gene. | HIGH |
| duplication | GENE_DUPLICATION | Duplication of a gene. | HIGH |
| gene_fusion | GENE_FUSION | Fusion of two genes. | HIGH |
| gene_fusion | GENE_FUSION_HALF | Fusion of one gene and an intergenic region. | HIGH |
| bidirectional_gene_fusion | GENE_FUSION_REVERESE | Fusion of two genes in opposite directions. | HIGH |
| rearranged_at_DNA_level | GENE_REARRANGEMENT | Rearrangement affecting one or more genes. | HIGH |
| intergenic_region | INTERGENIC | The variant is in an intergenic region | MODIFIER |
| conserved_intergenic_variant | INTERGENIC_CONSERVED | The variant is in a highly conserved intergenic region | MODIFIER |
| intragenic_variant | INTRAGENIC | The variant hits a gene, but no transcripts within the gene | MODIFIER |
| intron_variant | INTRON | Variant hits an intron. Technically, hits no exon in the transcript. | MODIFIER |
| conserved_intron_variant | INTRON_CONSERVED | The variant is in a highly conserved intronic region | MODIFIER |
| miRNA | MICRO_RNA | Variant affects an miRNA | MODIFIER |
| missense_variant | NON_SYNONYMOUS_CODING | Variant causes a codon that produces a different amino acid. e.g.: Tgg/Cgg, W/R | MODERATE |
| initiator_codon_variant | NON_SYNONYMOUS_START | Variant causes start codon to be mutated into another start codon (the new codon produces a different AA). e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons) | LOW |
| stop_retained_variant | NON_SYNONYMOUS_STOP | Variant causes stop codon to be mutated into another stop codon (the new codon produces a different AA). e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons) | LOW |
| protein_protein_contact | PROTEIN_PROTEIN_INTERACTION_LOCUS | Protein-protein interaction loci. | HIGH |
| structural_interaction_variant | PROTEIN_STRUCTURAL_INTERACTION_LOCUS | Within protein interaction loci (e.g. two AA that are in contact within the same protein, possibly helping structural conformation). | HIGH |
| rare_amino_acid_variant | RARE_AMINO_ACID | The variant hits a rare amino acid thus is likely to produce protein loss of function | HIGH |
| splice_acceptor_variant | SPLICE_SITE_ACCEPTOR | The variant hits a splice acceptor site (defined as two bases before exon start, except for the first exon). | HIGH |
| splice_donor_variant | SPLICE_SITE_DONOR | The variant hits a splice donor site (defined as two bases after coding exon end, except for the last exon). | HIGH |
| splice_region_variant | SPLICE_SITE_REGION | A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron. | LOW |
| splice_region_variant | SPLICE_SITE_BRANCH | A variant affecting a putative (lariat) branch point, located in the intron. | LOW |
| splice_region_variant | SPLICE_SITE_BRANCH_U12 | A variant affecting a putative (lariat) branch point from U12 splicing machinery, located in the intron. | MODERATE |
| stop_lost | STOP_LOST | Variant causes stop codon to be mutated into a non-stop codon. e.g.: Tga/Cga, */R | HIGH |
| 5_prime_UTR_premature_start_codon_gain_variant | START_GAINED | A variant in 5'UTR region produces a three base sequence that can be a START codon. | LOW |
| start_lost | START_LOST | Variant causes start codon to be mutated into a non-start codon. e.g.: aTg/aGg, M/R | HIGH |
| stop_gained | STOP_GAINED | Variant causes a STOP codon. e.g.: Cag/Tag, Q/* | HIGH |
| synonymous_variant | SYNONYMOUS_CODING | Variant causes a codon that produces the same amino acid. e.g.: Ttg/Ctg, L/L | LOW |
| start_retained | SYNONYMOUS_START | Variant causes start codon to be mutated into another start codon. e.g.: Ttg/Ctg, L/L (TTG and CTG can be START codons) | LOW |
| stop_retained_variant | SYNONYMOUS_STOP | Variant causes stop codon to be mutated into another stop codon. e.g.: taA/taG, */* | LOW |
| transcript_variant | TRANSCRIPT | The variant hits a transcript. | MODIFIER |
| feature_ablation | TRANSCRIPT_DELETED | Deletion of a transcript. | HIGH |
| regulatory_region_variant | REGULATION | The variant hits a known regulatory feature (non-coding). | MODIFIER |
| upstream_gene_variant | UPSTREAM | Upstream of a gene (default length: 5K bases) | MODIFIER |
| 3_prime_UTR_variant | UTR_3_PRIME | Variant hits 3'UTR region | MODIFIER |
| 3_prime_UTR_truncation + exon_loss | UTR_3_DELETED | The variant deletes an exon which is in the 3'UTR of the transcript | MODERATE |
| 5_prime_UTR_variant | UTR_5_PRIME | Variant hits 5'UTR region | MODIFIER |
| 5_prime_UTR_truncation + exon_loss_variant | UTR_5_DELETED | The variant deletes an exon which is in the 5'UTR of the transcript | MODERATE |
| sequence_feature + exon_loss_variant | NEXT_PROT | A 'NextProt' based annotation. Details are provided in the 'feature type' sub-field (ANN), or in the effect details (EFF). | MODERATE |





For reference, see also the Sequence Ontology website.


At the example position, the SnpEff Annotation in the ANN field is intergenic_region,
and according to the SnpEff table this corresponds to MODIFIER.
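The lookup from SO term to impact can be sketched as a small dictionary. Only a handful of rows from the table are included here, and the names `IMPACT_BY_SO_TERM` and `lookup_impacts` are made up for this example. In the ANN format, multiple effects on one annotation are joined with `&`, so the helper splits on that character first. Note that some SO terms (for example splice_region_variant and coding_sequence_variant) appear in the table with more than one impact, so a lookup by SO term alone is ambiguous. In practice it is safer to read the Annotation_Impact sub-field that SnpEff already wrote.

```python
# A few unambiguous rows from the SnpEff table; not the complete mapping.
IMPACT_BY_SO_TERM = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "missense_variant": "MODERATE",
    "synonymous_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_region": "MODIFIER",
}


def lookup_impacts(annotation):
    """Map each '&'-joined SO term to its impact, or None if not in the dict."""
    return {term: IMPACT_BY_SO_TERM.get(term) for term in annotation.split("&")}


print(lookup_impacts("intergenic_region"))
print(lookup_impacts("missense_variant&splice_region_variant"))
```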


That covers how to annotate with SnpEff and how the Sequence Ontology terms in the ANN field work.

See you next time!



