The Reference Variant Store (RVS) holds information on more than 520 million genetic variants, annotated using a variety of methods. Variants in RVS come from re-sequencing projects such as the 1000 Genomes Project, ESP6500, UK10K, TCGA, and Scripps Wellderly; clinical annotation databases such as ClinVar and OMIM; and hypothetical, pre-computed data such as dbNSFP. We provide annotations on low-level effects, phenotypes and diseases, (super-)population frequencies, and predictive scores (SIFT, MutationAssessor, etc.).
Studies in RVS
RVS currently contains more than 520 million variants, obtained from these studies:
- 1000 Genomes Phases 1 and 3 (1092 and 2535 samples, respectively; WGS)
- ESP6500 (6503 samples, WES)
- Scripps Wellderly (534, WGS)
- UK10K (7320: 4888 WES and 2432 WGS)
- GERA (>78,000, genotyping)
- TCGA (4415, mixed WES and WGS)
- Mt Sinai BioBank (11,210, genotyping; visible to Mount Sinai users only)
and these sample-independent resources and annotation databases:
- dbNSFP (hypothetical, amino-acid changing single nucleotide variants)
- ClinVar, OMIM, and COSMIC; HGMD, PharmGKB (the latter two are visible to Mount Sinai users only)
- literature mining (case studies, experiments, reviews)
Information provided by RVS
RVS stores information on genetic variants covering the following categories:
- Basic information:
- Genomic coordinates (GRCh37), reference and alternate allele, dbSNP ID
- Population frequencies:
- 1000 Genomes Project, Phase 3: total allele frequency; frequencies in African, American, Asian, European, and South Asian populations
- ESP6500: total allele frequency; frequencies in African American and European American populations
- Wellderly: total allele frequency in the Wellderly population
- UK10K: total allele frequency in the control population (UK10K ALSPAC and TWINS)
- ExAC: total allele frequency; frequencies in African, Native American, East Asian, South Asian, European, Finnish, and Latino populations
- DNA- and protein-level effect:
- gene, Ensembl transcript, whether or not the transcript matches the canonical Ensembl transcript or canonical UniProt isoform (by protein sequence)
- CDS and protein change in HGVS notation, CDS length
- effect (missense, synonymous, stop-gained, UTR region, etc.), impact, and functional class; provided by snpEff
- Predictions of functional impact:
- Scores and human-readable predictions from SIFT, PolyPhen2, MutationAssessor, MutationTaster, FATHMM, CADD, PROVEAN, GERP, phastCons46 primate and phastCons100 vertebrate, VEST3, LR, LRT, and an ensemble score, all provided by dbNSFP (see there for more details). In addition, GWAVA scores for non-coding variants. Note that scoring systems and scales differ; the human-readable predictions are (T)olerated, (B)enign, (N)eutral, (D)amaging or (D)eleterious, and (P)ossibly damaging.
- Databases and literature: cross-links to other databases, for individual variants or proteins
- ClinVar: phenotypes and clinical significance levels
- OMIM: phenotypes
- COSMIC: phenotypes
- HGMD: phenotypes and tags for disease-associated and/or functional polymorphisms (Mount Sinai users only)
- PharmGKB: drug toxicity and efficacy (Mount Sinai users only)
- PubMed: extracted by literature mining using SETH and GNAT
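Because the human-readable predictions listed above are single-letter codes, a small lookup can make them easier to read in downstream scripts. The following is a minimal sketch, not part of RVS: the names `PREDICTION_LABELS` and `describe_prediction` are hypothetical, and the mapping only restates the letters given above. Which tool emits which letter is not specified here, so "D" stays ambiguous between "damaging" and "deleterious".

```python
# Letter codes as listed in the text above. The mapping is illustrative only;
# consult dbNSFP for the exact meaning of a code for a given tool.
PREDICTION_LABELS = {
    "T": ("tolerated",),
    "B": ("benign",),
    "N": ("neutral",),
    "D": ("damaging", "deleterious"),  # ambiguous without knowing the tool
    "P": ("possibly damaging",),
}


def describe_prediction(code):
    """Return a readable label for a one-letter prediction code."""
    labels = PREDICTION_LABELS.get(code.strip().upper())
    if labels is None:
        return "unknown"
    return " or ".join(labels)


if __name__ == "__main__":
    for code in ("T", "D", "P"):
        print(code, "->", describe_prediction(code))
```

Since scoring systems and scales differ between tools, a lookup like this is only useful for the categorical predictions, not for comparing numeric scores.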
We do not provide any sample-level information such as genotypes or haplotypes; all our data are summarized on a per-study and per-ethnicity level for each allele.
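To illustrate what a per-study, per-population summary of an allele might look like, here is a minimal sketch. It is not the RVS schema: the class `AlleleSummary`, its field names, the function `summarize`, and the example genotypes are all hypothetical. The point is only that genotypes collapse into an allele count and a total, after which the individual samples are no longer recoverable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleSummary:
    """Summary of one alternate allele in one study and population (hypothetical)."""
    study: str
    population: str
    allele_count: int   # observed copies of the alternate allele
    allele_number: int  # total number of called alleles

    @property
    def frequency(self):
        # No frequency can be given when nothing was called.
        if self.allele_number == 0:
            return None
        return self.allele_count / self.allele_number


def summarize(study, population, genotypes):
    """Collapse per-sample genotypes (tuples of 0/1 allele indices) into a summary."""
    allele_count = sum(genotype.count(1) for genotype in genotypes)
    allele_number = sum(len(genotype) for genotype in genotypes)
    return AlleleSummary(study, population, allele_count, allele_number)


if __name__ == "__main__":
    # Made-up genotypes for four diploid samples; not real data.
    example = summarize("ExampleStudy", "EUR", [(0, 1), (1, 1), (0, 0), (0, 0)])
    print(example, example.frequency)
```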
Biodalliance Genome Browser
The genome browser plugin is kindly provided by Biodalliance. See the Biodalliance site for features such as adding specific tracks on the fly.
In general, double-clicking on any row in an RVS results table moves the genome browser, displayed at the bottom of the page, to that variant's position. Large indels are also highlighted.
Beacon service
We also offer a beacon service via HTTP GET requests. Beacons are meant to help researchers find datasets in which a particular genetic variant was observed. A beacon request therefore answers the question "Does your dataset contain any genome carrying an 'A' at position 1234567 of chromosome 1?" We extend this question to exomes, genotyping arrays, and clinical and functional annotations. See http://ga4gh.org/#/beacon for details. Our beacon responds to each request with YES, NO, or NULL; for example, https://rvs.u.hpc.mssm.edu/beacon.php?chrom=17&pos=7577530&allele=G. A YES indicates that RVS contains the specified variant at least once in any of the datasets or annotation databases listed above.
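A beacon request of this form can be assembled and interpreted in a few lines of Python. This is a minimal sketch under stated assumptions: the endpoint and the parameter names `chrom`, `pos`, and `allele` are taken from the example URL above, but the exact format of the response body is not documented here, so the code assumes it is plain text containing only YES, NO, or NULL. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in the text above.
BEACON_URL = "https://rvs.u.hpc.mssm.edu/beacon.php"


def build_beacon_url(chrom, pos, allele, base_url=BEACON_URL):
    """Build the GET URL for one beacon query."""
    query = urlencode({"chrom": chrom, "pos": pos, "allele": allele})
    return f"{base_url}?{query}"


def interpret_answer(body):
    """Map the beacon answer to True, False, or None.

    Assumption: the response body is plain text holding YES, NO, or NULL.
    """
    answer = body.strip().upper()
    if answer == "YES":
        return True
    if answer == "NO":
        return False
    if answer == "NULL":
        return None
    raise ValueError(f"unexpected beacon answer: {body!r}")


def query_beacon(chrom, pos, allele, timeout=10):
    """Send the request over the network and interpret the reply."""
    url = build_beacon_url(chrom, pos, allele)
    with urlopen(url, timeout=timeout) as response:
        return interpret_answer(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the URL; call query_beacon(...) to contact the service.
    print(build_beacon_url("17", 7577530, "G"))
```

Keeping URL construction and answer parsing separate from the network call makes it possible to check both without contacting the service.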
- EVA — European Variation Archive
- EVA is an "open-access database of all types of genetic variation data from all species. The EVA provides access to highly detailed, granular, raw variant data from human [..] [Users] can download data from any study, or submit their own data to the archive. You can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier."
- ExAC Browser — Exome Aggregation Consortium
- ExAC seeks to "aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community." They have accumulated data from more than 60,000 individuals thus far (Oct 2014), including various disease-specific studies as well as healthy populations.
- CanvasDB
- CanvasDB provides "an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects." It holds individual-level data from whole genome sequencing studies, such as the 1000 Genomes Project, and "makes it possible to perform advanced analyses of large-scale WGS projects on a local server."
- GEMINI — GEnome MINIng
- "GEMINI: a flexible framework for exploring genome variation."
- Beacon Project
- The Beacon project aims to «test the willingness of international sites to share genetic data in the simplest of all technical contexts. It is defined as a simple public web service that any institution can implement as a service. The service is designed merely to accept a query of the form "Do you have any genomes with an 'A' at position 100,735 on chromosome 3" and respond with one of "Yes" or "No." A site offering this service is called a "beacon".»
RVS is hosted by the Department of Genetics and Genomic Sciences at the Icahn School of Medicine at Mount Sinai and developed at the Chen lab.
- Jörg Hakenberg: concept & design, implementation
- Wei-Yi Cheng: database design and data acquisition
- Philippe Thomas: literature mining
- Ying-Chih Wang: coordinate system liftover
- Rong Chen: group lead
If you use data obtained from RVS in your publications, please cite RVS as follows:
Hakenberg J, Cheng WY, Thomas P, Wang YC, Uzilov AV, Chen R. Integrating 400 million variants from 80,000 human samples with extensive annotations: towards a knowledge base to analyze disease cohorts. BMC Bioinformatics 2016 Jan 8;17:24. PMID: 26746786. DOI: 10.1186/s12859-015-0865-9