Ensembl genome database project

Ensembl genome database project.
Content
Description	Ensembl
Contact
Research center	European Bioinformatics Institute;
Primary citation	Yates, et al. (2020)[1]
Access
Website	www.ensembl.org

Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which was launched in 1999 in response to the imminent completion of the Human Genome Project.[2] Ensembl aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms.[3] Ensembl is one of several well known genome browsers for the retrieval of genomic information.

Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC).

Background

The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes. However the genome alone is of little use, unless the locations and relationships of individual genes can be identified. One option is manual annotation, whereby a team of scientists tries to locate genes using experimental data from scientific journals and public databases. However this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex pattern-matching of protein to DNA.

In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software "pipelines" written in Perl) which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project is available to download,[4] and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data.

Over time the project has expanded to include additional species (including key model organisms such as mouse, fruitfly and zebrafish) as well as a wider range of genomic data, including genetic variations and regulatory features. Since April 2009, a sister project, Ensembl Genomes, has extended the scope of Ensembl into invertebrate metazoa, plants, fungi, bacteria, and protists, whilst the original project continues to focus on vertebrates.

Displaying genomic data

Gene SGCB aligned to the human genome

Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a reference genome. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction.

Other displays show data at varying levels of resolution, from whole karyotypes down to text-based representations of DNA and amino acid sequences, or present other types of display such as trees of similar genes (homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA.

Externally produced data can also be added to the display by uploading a suitable file in one of the supported formats, such as BAM, BED, or PSL.

Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.

Alternative access methods

In addition to its website, Ensembl provides a REST API and a Perl API[5] (Application Programming Interface) that models biological objects such as genes and proteins, allowing simple scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided in sections like the core API, the compara API (for comparative genomics data), the variation API (for accessing SNPs, SNVs, CNVs..), and the functional genomics API (to access regulatory data). The Ensembl website provides extensive information on how to install and use the API.

This software can be used to access the public MySQL database, avoiding the need to download enormous datasets. The users could even choose to retrieve data from the MySQL with direct SQL queries, but this requires an extensive knowledge of the current database schema.

Large datasets can be retrieved using the BioMart data-mining tool. It provides a web interface for downloading datasets using complex queries.

Last, there is an FTP server which can be used to download entire MySQL databases as well some selected data sets in other formats.

Current species

The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes, there are no prokaryotes. As of 2008, this includes:

Chordata
- Mammalia
  - Euarchontoglires
    - Primates: bushbaby, chimp, human, macaque, mouse lemur, orangutan, tarsier;
    - Scandentia: tree shrew ;
    - Glires (= Rodents + Lagomorphs): guineapig, kangaroo rat, mouse, rat, ground squirrel, pika, rabbit ;
  - Laurasiatheria: cow, dolphin, alpaca, pig, cat, dog, horse, megabat, microbat, hedgehog, shrew ;
  - Afrotheria: elephant, hyrax, tenrec
  - Xenarthra: armadillo, sloth ;
  - Marsupialia: opossum, wallaby ;
  - Monotremes: platypus;
- Birds: chicken, zebra finch;
- Lepidosauria: anole lizard (pre);
- Lissamphibia: Xenopus tropicalis;
- Teleost fishes: Takifugu rubripes (fugu), Tetraodon nigroviridis (green spotted pufferfish), Danio rerio (zebrafish), Oryzias latipes (medaka), Gasterosteus aculeatus (stickleback);
- Cyclostomata: Petromyzon marinus (sea lamprey) (pre);
- Tunicates: Ciona intestinalis, Ciona savignyi;
Non-vertebrates
- Insects: Drosophila melanogaster (fruitfly), Anopheles gambiae (mosquito), Aedes aegypti (mosquito)
- Worm: Caenorhabditis elegans
Yeast: Saccharomyces cerevisiae (baker's yeast)

gollark: The pocket computer has a bunch of gold in it.

gollark: Neural interfaces aren't cheap!

gollark: Protocol Delta has been activated.

gollark: I'll remember that in case I launch my own NI viruses.

gollark: Maybe you were using an insecure remote shell and someone somehow noticed and hackerized™ it.

References

Yates A. D.; et al. (January 2020). "Ensembl 2020". Nucleic Acids Res. 48 (D1): D682–D688. doi:10.1093/nar/gkz966. PMC 7145704. PMID 31691826. Retrieved 31 July 2020.
Flicek P, Amode MR, Barrell D, et al. (November 2010). "Ensembl 2011". Nucleic Acids Res. 39 (Database issue): D800–D806. doi:10.1093/nar/gkq1064. PMC 3013672. PMID 21045057.
Flicek P, Aken BL, Ballester B, et al. (January 2010). "Ensembl's 10th year". Nucleic Acids Res. 38 (Database issue): D557–62. doi:10.1093/nar/gkp972. PMC 2808936. PMID 19906699.
Ruffier, Magali; Kähäri, Andreas; Komorowska, Monika; Keenan, Stephen; Laird, Matthew; Longden, Ian; Proctor, Glenn; Searle, Steve; Staines, Daniel; Taylor, Kieron; Vullo, Alessandro; Yates, Andrew; Zerbino, Daniel; Flicek, Paul (January 2017). "Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation". Database. 2017 (1). doi:10.1093/database/bax020. PMC 5467575.
Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E (February 2004). "The Ensembl Core Software Libraries". Genome Research. 14 (5): 929–933. doi:10.1101/gr.1857204. PMC 479122. PMID 15123588.

External links

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[Yates2020-1] Yates A. D.; et al. (January 2020). "Ensembl 2020". Nucleic Acids Res. 48 (D1): D682–D688. doi:10.1093/nar/gkz966. PMC 7145704. PMID 31691826. Retrieved 31 July 2020.

[Flicek1-2] Flicek P, Amode MR, Barrell D, et al. (November 2010). "Ensembl 2011". Nucleic Acids Res. 39 (Database issue): D800–D806. doi:10.1093/nar/gkq1064. PMC 3013672. PMID 21045057.

[Flicek2-3] Flicek P, Aken BL, Ballester B, et al. (January 2010). "Ensembl's 10th year". Nucleic Acids Res. 38 (Database issue): D557–62. doi:10.1093/nar/gkp972. PMC 2808936. PMID 19906699.

[4] Ruffier, Magali; Kähäri, Andreas; Komorowska, Monika; Keenan, Stephen; Laird, Matthew; Longden, Ian; Proctor, Glenn; Searle, Steve; Staines, Daniel; Taylor, Kieron; Vullo, Alessandro; Yates, Andrew; Zerbino, Daniel; Flicek, Paul (January 2017). "Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation". Database. 2017 (1). doi:10.1093/database/bax020. PMC 5467575.

[Perl-5] Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E (February 2004). "The Ensembl Core Software Libraries". Genome Research. 14 (5): 929–933. doi:10.1101/gr.1857204. PMC 479122. PMID 15123588.

Bioinformatics
Databases	Sequence databases: GenBank, European Nucleotide Archive and DNA Data Bank of Japan Secondary databases: UniProt, database of protein sequences grouping together Swiss-Prot, TrEMBL and Protein Information Resource Other databases: Protein Data Bank, Ensembl and InterPro Specialised genomic databases: BOLD, Saccharomyces Genome Database, FlyBase, VectorBase, WormBase, PHI-base, Arabidopsis Information Resource and Zebrafish Information Network
Software	BLAST Bowtie Clustal EMBOSS HMMER MUSCLE SAMtools TopHat
Other	Server: ExPASy Ontology: Gene Ontology Rosalind (education platform)
Institutions	Broad Institute Computational Biology Department (CBD) Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI) Database Center for Life Science (DBCLS) DNA Data Bank of Japan (DDBJ) European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory (EMBL) Flatiron Institute J. Craig Venter Institute (JCVI) Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG) US National Center for Biotechnology Information (NCBI) Japanese Institute of Genetics Netherlands Bioinformatics Centre (NBIC) Philippine Genome Center (PGC) Scripps Research Swiss Institute of Bioinformatics (SIB) Wellcome Sanger Institute Whitehead Institute
Organizations	African Society for Bioinformatics and Computational Biology (ASBCB) Australia Bioinformatics Resource (EMBL-AR) European Molecular Biology network (EMBnet) International Nucleotide Sequence Database Collaboration (INSDC) International Society for Biocuration (ISB) International Society for Computational Biology (ISCB) Student Council (ISCB-SC) Institute of Genomics and Integrative Biology (CSIR-IGIB) Japanese Society for Bioinformatics (JSBi)
Meetings	Basel Computational Biology Conference‎ ([BC²]) European Conference on Computational Biology (ECCB) Intelligent Systems for Molecular Biology (ISMB) International Conference on Bioinformatics (InCoB) ISCB Africa ASBCB Conference on Bioinformatics Pacific Symposium on Biocomputing (PSB) Research in Computational Molecular Biology (RECOMB)
File formats	CRAM format FASTA format FASTQ format NeXML format Nexus format Pileup format SAM format Stockholm format
Related topics	Computational biology List of biological databases Molecular phylogenetics Sequencing Sequence database Sequence alignment
Category Commons