New release of WormBase WS259
WS259 was built by gw3
-==============================================================================-
-========= FTP site structure =================================================-
-==============================================================================-
The WS259 build directory includes:
species/ DIR - contains a sub dir for each WormBase species (G_SPECIES)
species/G_SPECIES DIR - contains a sub dir for each NCBI genome sequencing BioProject (BIOPROJECT) for the species, with the following files:
- G_SPECIES.BIOPROJECT.WS259.genomic.fa.gz - Unmasked genomic DNA
- G_SPECIES.BIOPROJECT.WS259.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA
- G_SPECIES.BIOPROJECT.WS259.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA
- G_SPECIES.BIOPROJECT.WS259.protein.fa.gz - Current live protein set
- G_SPECIES.BIOPROJECT.WS259.CDS_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
- G_SPECIES.BIOPROJECT.WS259.mRNA_transcripts.fa.gz - Spliced cDNA sequence for full-length mRNA transcripts (including UTRs)
- G_SPECIES.BIOPROJECT.WS259.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts
- G_SPECIES.BIOPROJECT.WS259.pseudogenic_transcripts.fa.gz - Spliced cDNA sequence for pseudogenic transcripts
- G_SPECIES.BIOPROJECT.WS259.transposon_transcripts.fa.gz - Spliced cDNA sequence for mRNAs and pseudogenes located in Transposons
- G_SPECIES.BIOPROJECT.WS259.transposons.fa.gz - DNA sequence of curated and predicted Transposons
- G_SPECIES.BIOPROJECT.WS259.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
- G_SPECIES.BIOPROJECT.WS259.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format
- G_SPECIES.BIOPROJECT.WS259.canonical_geneset.gtf.gz - Genes, transcripts and CDSs in GTF (GFF2) format
- G_SPECIES.BIOPROJECT.WS259.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases
- G_SPECIES.BIOPROJECT.WS259.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
- G_SPECIES.BIOPROJECT.WS259.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa|brug]pep package (if updated since last release)
- annotation/ - contains additional annotations:
- G_SPECIES.BIOPROJECT.WS259.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST and/or cDNA
- G_SPECIES.BIOPROJECT.WS259.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
- G_SPECIES.BIOPROJECT.WS259.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known)
- G_SPECIES.BIOPROJECT.WS259.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes
- G_SPECIES.BIOPROJECT.WS259.*oligo_mapping.txt.gz - Oligo array mapping files
- G_SPECIES.BIOPROJECT.WS259.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
- G_SPECIES.BIOPROJECT.WS259.SRA_gene_expression.tar.gz - Tables of gene expression values computed from SRA RNASeq data
- G_SPECIES.BIOPROJECT.WS259.TSS.wig.tar.gz - Wiggle plot files of Transcription Start Sites from the papers WBPaper00042246, WBPaper00042529, WBPaper00042354
- G_SPECIES.BIOPROJECT.WS259.repeats.fa.gz - Latest version of the repeat library for the genome, suitable for use with RepeatMasker
acedb DIR - Everything needed to generate a local copy of the primary database
- database.WS259.*.tar.gz - compressed acedb database for new release
- models.wrm.WS259 - the latest database schema (also in above database files)
- WS259-WS258.dbcomp - log file reporting differences from the last release
- *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species
(reduces the size of the main database)
MULTI_SPECIES DIR - miscellaneous files with data for multiple species
- wormpep_clw.WS259.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR - gene_association files and OBO files for the phenotype, GO and anatomy ontologies, plus their related association files
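The file naming pattern described above can be sketched in Python. This is
only an illustration, not an official WormBase tool: the helper names
build_filename and build_species_path are hypothetical, the example values
c_elegans and PRJNA13758 are assumptions standing in for G_SPECIES and
BIOPROJECT, and the species/G_SPECIES/BIOPROJECT/ layout is one reading of
the directory description above.

```python
def build_filename(species: str, bioproject: str, release: str, suffix: str) -> str:
    """Join the parts as G_SPECIES.BIOPROJECT.RELEASE.SUFFIX (hypothetical helper)."""
    return ".".join([species, bioproject, release, suffix])


def build_species_path(species: str, bioproject: str, release: str, suffix: str) -> str:
    """Return a relative path under the build directory.

    Assumption: each BioProject has its own sub directory below
    species/G_SPECIES, as the structure listing suggests.
    """
    name = build_filename(species, bioproject, release, suffix)
    return f"species/{species}/{bioproject}/{name}"


if __name__ == "__main__":
    # Example values are placeholders chosen for illustration only.
    print(build_species_path("c_elegans", "PRJNA13758", "WS259", "genomic.fa.gz"))
```

Files listed under annotation/ would need that extra directory level added
to the path.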
Release notes on the web:
-------------------------
http://www.wormbase.org/about/release_schedule
-=====================================================================================-
-=========== C. elegans data summary =================================================-
-=====================================================================================-
Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.
C. elegans gene data (48304 genes in total)
----------------------------------------------
Protein-coding (20225 genes):
Curated description 5117 (25.3%)
Automated description 14835 (73.3%)
Human disease association 1504 (7.4%)
Approved Gene name 9950 (49.2%)
Reference 10936 (54.1%)
RNAi results 18768 (92.8%)
Microarray results 20136 (99.6%)
Expression patterns 19918 (98.5%)
Variations 20220 (100.0%)
Interaction data 15826 (78.2%)
Non-coding RNA and pseudogene (26532 genes):
Curated description 199 (0.8%)
Automated description 4101 (15.5%)
Human disease association 15 (0.1%)
Approved Gene name 16443 (62.0%)
Reference 5779 (21.8%)
RNAi results 1195 (4.5%)
Microarray results 1684 (6.3%)
Expression patterns 547 (2.1%)
Variations 26491 (99.8%)
Interaction data 436 (1.6%)
Uncloned (1547 genes):
Curated description 787 (50.9%)
Automated description 171 (11.1%)
Human disease association 10 (0.6%)
Approved Gene name 1501 (97.0%)
Reference 1118 (72.3%)
RNAi results 0 (0.0%)
Microarray results 0 (0.0%)
Expression patterns 17 (1.1%)
Variations 1184 (76.5%)
Interaction data 120 (7.8%)
Wormpep data set:
----------------------------
There are 28197 CDSs, from 20222 protein-coding loci
The 28197 sequences contain 38992053 base pairs in total.
Modified entries 9
Deleted entries 34
New entries 53
Reappeared entries 3
Net change +22
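The net change figure can be reproduced from the other counts. The short
Python check below assumes that net change means new plus reappeared
entries minus deleted entries; that definition is not stated in these
notes, but it matches the reported value.

```python
# Counts copied from the Wormpep summary above.
new_entries = 53
reappeared_entries = 3
deleted_entries = 34

# Assumption: modified entries change content but not the number of
# entries, so they do not contribute to the net change.
net_change = new_entries + reappeared_entries - deleted_entries

print(f"Net change {net_change:+d}")  # prints "Net change +22"
```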
C. elegans Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 18081 (64.1%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 8046 (28.5%) Some, but not all exon bases are covered by transcript evidence
Predicted 2070 (7.3%) No coverage by mRNA/EST/RNASeq evidence
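The percentages above are each status count divided by the total number of
CDSs. A minimal Python sketch of that calculation, using only the counts
reported here (variable names are illustrative):

```python
# Counts copied from the confirmation status table above.
counts = {"Confirmed": 18081, "Partially_confirmed": 8046, "Predicted": 2070}

# The three classes together cover every CDS in the Wormpep set.
total = sum(counts.values())

# Percentage of the total for each class, rounded to one decimal place.
percentages = {status: round(100 * n / total, 1) for status, n in counts.items()}

for status, n in counts.items():
    print(f"{status} {n} ({percentages[status]}%)")
```

Because each value is rounded independently, the three percentages add up
to 99.9 rather than exactly 100.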
C. elegans Operons Stats
------------------------
Live Operons 1388
Genes in Operons 3626
C. elegans GO annotation status
-------------------------------
GO_codes - used for assigning evidence
IBA Inferred from Biological aspect of Ancestor
IC Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IKR Inferred from Key Residues
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
IRD Inferred from Rapid Divergence
ISM Inferred from Sequence Model
ISO Inferred from Sequence Orthology
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
Number of gene<->GO_term associations 143686
Breakdown by annotation provider:
WormBase 77524
UniProt 46016
GO_Central 14603
GOC 3341
IntAct 2073
CACAO 34
ParkinsonsUK-UCL 31
MGI 26
BHF-UCL 24
HGNC 14
Breakdown by evidence code:
IEA 102925
Interpro2GO 22491
Other 80434
non-IEA 40761
IBA 14600
IC 109
IDA 7362
IEP 336
IGI 4163
IKR 4
IMP 8050
IPI 3617
ISM 9
ISO 1
ISS 1738
NAS 190
ND 396
TAS 186
Genes Stats:
Genes with GO_term connections 15166
Non-IEA-only annotation 702
IEA-only annotation 7937
Both IEA and non-IEA annotations 6527
GO_term Stats:
Distinct GO_terms connected to Genes 5821
Associated by non-IEA only 3319
Associated by IEA only 860
Associated by both IEA and non-IEA 1642
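The GO figures above are internally consistent: the provider breakdown and
the evidence code breakdown both add up to the total number of
associations. The Python sketch below re-adds the reported numbers; it is
a sanity check on this summary only and does not query the database.

```python
# All figures are copied from the GO annotation summary above.
total_associations = 143686

by_provider = {
    "WormBase": 77524, "UniProt": 46016, "GO_Central": 14603, "GOC": 3341,
    "IntAct": 2073, "CACAO": 34, "ParkinsonsUK-UCL": 31, "MGI": 26,
    "BHF-UCL": 24, "HGNC": 14,
}
iea = {"Interpro2GO": 22491, "Other": 80434}
non_iea = {
    "IBA": 14600, "IC": 109, "IDA": 7362, "IEP": 336, "IGI": 4163,
    "IKR": 4, "IMP": 8050, "IPI": 3617, "ISM": 9, "ISO": 1,
    "ISS": 1738, "NAS": 190, "ND": 396, "TAS": 186,
}

iea_total = sum(iea.values())          # reported as 102925
non_iea_total = sum(non_iea.values())  # reported as 40761

checks = {
    "providers": sum(by_provider.values()) == total_associations,
    "evidence codes": iea_total + non_iea_total == total_associations,
    "genes": 702 + 7937 + 6527 == 15166,
    "GO terms": 3319 + 860 + 1642 == 5821,
}
for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```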
-=============================================================================-
-=========== Other core species data summary =================================-
-=============================================================================-
Approved gene symbols
---------------------
Brugia malayi 3384
Caenorhabditis brenneri 3912
Caenorhabditis briggsae 6703
Caenorhabditis japonica 5470
Caenorhabditis remanei 6558
Onchocerca volvulus 3256
Pristionchus pacificus 3525
Strongyloides ratti 108
Gene counts
-----------
Brugia malayi 11791 (11078 coding)
Caenorhabditis brenneri 33257 (30660 coding)
Caenorhabditis briggsae 23977 (22379 coding)
Caenorhabditis japonica 32408 (29931 coding)
Caenorhabditis remanei 59161 (57662 coding)
Onchocerca volvulus 12613 (12117 coding)
Pristionchus pacificus 24217 (24216 coding)
Strongyloides ratti 12973 (12464 coding)
Brugia malayi Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 6129 (45.6%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 5901 (43.9%) Some, but not all exon bases are covered by transcript evidence
Predicted 1407 (10.5%) No coverage by mRNA/EST/RNASeq evidence
Caenorhabditis brenneri Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 1573 (5.1%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 5688 (18.5%) Some, but not all exon bases are covered by transcript evidence
Predicted 23411 (76.3%) No coverage by mRNA/EST/RNASeq evidence
Caenorhabditis briggsae Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 16081 (63.3%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 5113 (20.1%) Some, but not all exon bases are covered by transcript evidence
Predicted 4193 (16.5%) No coverage by mRNA/EST/RNASeq evidence
Caenorhabditis japonica Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 1635 (4.5%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 5197 (14.4%) Some, but not all exon bases are covered by transcript evidence
Predicted 29144 (81.0%) No coverage by mRNA/EST/RNASeq evidence
Caenorhabditis remanei Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 962 (3.1%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 5661 (18.0%) Some, but not all exon bases are covered by transcript evidence
Predicted 24827 (78.9%) No coverage by mRNA/EST/RNASeq evidence
Onchocerca volvulus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 424 (3.5%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 1904 (15.6%) Some, but not all exon bases are covered by transcript evidence
Predicted 9897 (81.0%) No coverage by mRNA/EST/RNASeq evidence
Pristionchus pacificus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 229 (0.9%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 4984 (20.6%) Some, but not all exon bases are covered by transcript evidence
Predicted 19004 (78.5%) No coverage by mRNA/EST/RNASeq evidence
Strongyloides ratti Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed 876 (7.0%) Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed 2342 (18.8%) Some, but not all exon bases are covered by transcript evidence
Predicted 9265 (74.2%) No coverage by mRNA/EST/RNASeq evidence
-==============================================================================-
-=========== News for this release ============================================-
-==============================================================================-
New data sets
--------------
New/updated reference genomes
------------------------------------
Strongyloides ratti
The S.ratti assembly has been updated to include the mitochondrial
sequence and annotation. It can be found as the 'SRAE_MITOCHONDRIAL'
sequence in the FASTA / GFF files and includes 36 coding and
non-coding genes.
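
As a rough sketch, the mitochondrial sequence could be pulled out of the
gzipped genomic FASTA file with a few lines of Python. The function names
are hypothetical, and the code assumes that the sequence name is the first
whitespace-delimited token of each FASTA header line.

```python
import gzip
from typing import Iterator, Optional, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a gzipped FASTA file."""
    name = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Assumption: the name is the first token after '>'.
                tokens = line[1:].split()
                name = tokens[0] if tokens else ""
                chunks = []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fetch_sequence(path: str, wanted: str = "SRAE_MITOCHONDRIAL") -> Optional[str]:
    """Return the sequence called `wanted`, or None if it is absent."""
    for name, sequence in read_fasta(path):
        if name == wanted:
            return sequence
    return None
```

Calling fetch_sequence() on the S. ratti genomic.fa.gz file for this
release should return the mitochondrial DNA as a single string, provided
the header matches the name given above.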
Proposed Changes / Forthcoming Data
------------------------------------
Model Changes
--------------
Model changes for this release are documented here:
http://wiki.wormbase.org/index.php/WS259_Models.wrm
For more information, email help@wormbase.org
-==============================================================================-
-=========== Installation guide ===============================================-
-==============================================================================-
Quick installation guide for UNIX/Linux systems
-----------------------------------------------
1. Create a new directory to contain your copy of WormBase,
e.g. /users/yourname/wormbase
2. Unpack and untar all of the database.*.tar.gz files into
this directory. You will need approximately 50-60 GB of disk space.
3. Obtain and install a suitable acedb binary for your system
(available from www.acedb.org).
4. Use the acedb 'xace' program to open your database, e.g.
type 'xace /users/yourname/wormbase' at the command prompt.
5. See the acedb website for more information about acedb and
using xace.
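
Steps 1 and 2 above can also be scripted. The Python sketch below is one
possible approach, not part of the official instructions: the function
name unpack_release is hypothetical, and it assumes the database.*.tar.gz
files have already been downloaded into a single local directory and come
from a trusted source.

```python
import tarfile
from pathlib import Path
from typing import List


def unpack_release(download_dir: str, install_dir: str) -> List[str]:
    """Extract every database.*.tar.gz in download_dir into install_dir.

    Returns the names of the archives that were unpacked, in sorted order.
    """
    dest = Path(install_dir)
    # Step 1: create the directory that will hold the local database.
    dest.mkdir(parents=True, exist_ok=True)

    unpacked = []
    # Step 2: decompress and untar each archive into that directory.
    for archive in sorted(Path(download_dir).glob("database.*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        unpacked.append(archive.name)
    return unpacked


if __name__ == "__main__":
    # Both paths are placeholders; adjust them to your own system.
    done = unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")
    print(f"Unpacked {len(done)} archive(s)")
```

Once the archives are unpacked, continue with steps 3 and 4 to install
acedb and open the database with xace.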
____________ END _____________