New release of WormBase WS259

WS259 was built by gw3

-==============================================================================-
-========= FTP site structure =================================================-
-==============================================================================-
The WS259 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES)
species/G_SPECIES DIR     -  contains a sub dir for each NCBI genome sequencing BioProject (BIOPROJECT) for the species, with the following files:
     - G_SPECIES.BIOPROJECT.WS259.genomic.fa.gz                  - Unmasked genomic DNA
     - G_SPECIES.BIOPROJECT.WS259.genomic_masked.fa.gz           - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.BIOPROJECT.WS259.genomic_softmasked.fa.gz       - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.BIOPROJECT.WS259.protein.fa.gz                  - Current live protein set
     - G_SPECIES.BIOPROJECT.WS259.CDS_transcripts.fa.gz          - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.BIOPROJECT.WS259.mRNA_transcripts.fa.gz         - Spliced cDNA sequence for the full-length (including UTRs) mRNA for transcripts
     - G_SPECIES.BIOPROJECT.WS259.ncrna_transcripts.fa.gz        - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.BIOPROJECT.WS259.pseudogenic_transcripts.fa.gz  - Spliced cDNA sequence for pseudogenic transcripts
     - G_SPECIES.BIOPROJECT.WS259.transposon_transcripts.fa.gz   - Spliced cDNA sequence for mRNAs and pseudogenes located in Transposons
     - G_SPECIES.BIOPROJECT.WS259.transposons.fa.gz              - DNA sequence of curated and predicted Transposons
     - G_SPECIES.BIOPROJECT.WS259.intergenic_sequences.fa.gz     - DNA sequence between pairs of adjacent genes
     - G_SPECIES.BIOPROJECT.WS259.annotations.gff[2|3].gz        - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.BIOPROJECT.WS259.canonical_geneset.gtf.gz       - Genes, transcripts and CDSs in GTF (GFF2) format
     - G_SPECIES.BIOPROJECT.WS259.ests.fa.gz                     - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.BIOPROJECT.WS259.best_blastp_hits.txt.gz        - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins
     - G_SPECIES.BIOPROJECT.WS259.*pep_package.tar.gz            - latest version of the [worm|brig|bren|rema|jap|ppa|brug]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.BIOPROJECT.WS259.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.BIOPROJECT.WS259.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.BIOPROJECT.WS259.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.BIOPROJECT.WS259.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.BIOPROJECT.WS259.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.BIOPROJECT.WS259.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.BIOPROJECT.WS259.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
        - G_SPECIES.BIOPROJECT.WS259.TSS.wig.tar.gz                      - Wiggle plot files of Transcription Start Sites from the papers WBPaper00042246, WBPaper00042529, WBPaper00042354
        - G_SPECIES.BIOPROJECT.WS259.repeats.fa.gz                       - Latest version of the repeat library for the genome, suitable for use with RepeatMasker
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS259.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS259          - the latest database schema (also in above database files)
     - WS259-WS258.dbcomp        - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/     - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
MULTI_SPECIES DIR - miscellaneous files with data for multiple species
     - wormpep_clw.WS259.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, and the corresponding gene association files
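
As an illustration of the naming scheme above, the following Python sketch
assembles the expected file name and relative path for a given species,
BioProject and release. The helper names 'release_filename' and
'species_path', and the example values 'c_elegans' and 'PRJNA13758', are
assumptions made for illustration; they are not part of the release itself.

```python
def release_filename(g_species: str, bioproject: str, release: str, suffix: str) -> str:
    """Join the naming components with dots, e.g. G_SPECIES.BIOPROJECT.WS259.genomic.fa.gz."""
    return ".".join([g_species, bioproject, release, suffix])


def species_path(g_species: str, bioproject: str, release: str, suffix: str) -> str:
    """Relative path under the build directory, assuming the layout
    species/G_SPECIES/BIOPROJECT/<file> described in the listing."""
    name = release_filename(g_species, bioproject, release, suffix)
    return f"species/{g_species}/{bioproject}/{name}"


# Example values are hypothetical placeholders for G_SPECIES and BIOPROJECT.
print(species_path("c_elegans", "PRJNA13758", "WS259", "genomic.fa.gz"))
```

Files listed under annotation/ would sit one directory deeper; the sketch
does not cover that case.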


Release notes on the web:
-------------------------
http://www.wormbase.org/about/release_schedule


-=====================================================================================-
-=========== C. elegans data summary =================================================-
-=====================================================================================-

Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans gene data (48304 genes in total)
----------------------------------------------

Protein-coding (20225 genes):
  Curated description         5117    (25.3%)
  Automated description      14835    (73.3%)
  Human disease association   1504     (7.4%)
  Approved Gene name          9950    (49.2%)
  Reference                  10936    (54.1%)
  RNAi results               18768    (92.8%)
  Microarray results         20136    (99.6%)
  Expression patterns        19918    (98.5%)
  Variations                 20220   (100.0%)
  Interaction data           15826    (78.2%)

Non-coding RNA and pseudogene (26532 genes):
  Curated description          199     (0.8%)
  Automated description       4101    (15.5%)
  Human disease association     15     (0.1%)
  Approved Gene name         16443    (62.0%)
  Reference                   5779    (21.8%)
  RNAi results                1195     (4.5%)
  Microarray results          1684     (6.3%)
  Expression patterns          547     (2.1%)
  Variations                 26491    (99.8%)
  Interaction data             436     (1.6%)

Uncloned (1547 genes):
  Curated description          787    (50.9%)
  Automated description        171    (11.1%)
  Human disease association     10     (0.6%)
  Approved Gene name          1501    (97.0%)
  Reference                   1118    (72.3%)
  RNAi results                   0     (0.0%)
  Microarray results             0     (0.0%)
  Expression patterns           17     (1.1%)
  Variations                  1184    (76.5%)
  Interaction data             120     (7.8%)



Wormpep data set:
----------------------------

There are 28197 CDSs, from 20222 protein-coding loci
The 28197 sequences contain 38992053 base pairs in total.

Modified entries      9
Deleted entries       34
New entries           53
Reappeared entries    3

Net change  +22
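
The net change can be reproduced from the counts above. This short Python
sketch assumes that new and reappeared entries count as gains, deleted
entries count as losses, and modified entries leave the total unchanged;
that reading is consistent with the figures reported here but is not stated
explicitly in the release.

```python
# Counts taken from the Wormpep summary above.
new_entries = 53
deleted_entries = 34
reappeared_entries = 3

# Assumption: modified entries do not alter the number of sequences.
net_change = new_entries + reappeared_entries - deleted_entries
print(f"Net change  {net_change:+d}")  # prints "Net change  +22"
```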

C. elegans Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed             18081 (64.1%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    8046 (28.5%)     Some, but not all exon bases are covered by transcript evidence
Predicted              2070 (7.3%)      No coverage by mRNA/EST/RNASeq evidence
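
The percentages in this table appear to be each status count divided by the
total number of CDSs. The Python sketch below recomputes them under that
assumption; the variable names are illustrative only.

```python
# Counts taken from the C. elegans confirmation status table above.
counts = {
    "Confirmed": 18081,
    "Partially_confirmed": 8046,
    "Predicted": 2070,
}

total = sum(counts.values())  # expected to match the 28197 CDSs in Wormpep
percentages = {status: 100 * n / total for status, n in counts.items()}

for status, n in counts.items():
    print(f"{status:<20}{n:>6} ({percentages[status]:.1f}%)")
```

The same calculation applies to the per-species tables in the section on
other core species.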

C. elegans Operons Stats
------------------------
Live Operons        1388
Genes in Operons    3626

C. elegans GO annotation status
-------------------------------

GO_codes - used for assigning evidence
  IBA Inferred from Biological aspect of Ancestor
  IC  Inferred by Curator
  IDA Inferred from Direct Assay
  IEA Inferred from Electronic Annotation
  IEP Inferred from Expression Pattern
  IGI Inferred from Genetic Interaction
  IKR Inferred from Key Residues
  IMP Inferred from Mutant Phenotype
  IPI Inferred from Physical Interaction
  IRD Inferred from Rapid Divergence
  ISM Inferred from Sequence Model
  ISO Inferred from Sequence Orthology
  ISS Inferred from Sequence (or Structural) Similarity
  NAS Non-traceable Author Statement
  ND  No Biological Data available
  RCA Inferred from Reviewed Computational Analysis
  TAS Traceable Author Statement

Number of gene<->GO_term associations    143686
  Breakdown by annotation provider:
    WormBase             77524
    UniProt              46016
    GO_Central           14603
    GOC                   3341
    IntAct                2073
    CACAO                   34
    ParkinsonsUK-UCL        31
    MGI                     26
    BHF-UCL                 24
    HGNC                    14
  Breakdown by evidence code:
    IEA     102925
      Interpro2GO 22491
      Other       80434
    non-IEA 40761
      IBA   14600
      IC      109
      IDA    7362
      IEP     336
      IGI    4163
      IKR       4
      IMP    8050
      IPI    3617
      ISM       9
      ISO       1
      ISS    1738
      NAS     190
      ND      396
      TAS     186

Genes Stats:
  Genes with GO_term connections  15166 
    Non-IEA-only annotation              702
    IEA-only annotation                 7937
    Both IEA and non-IEA annotations    6527

GO_term Stats:
  Distinct GO_terms connected to Genes   5821
    Associated by non-IEA only               3319
    Associated by IEA only                    860
    Associated by both IEA and non-IEA       1642
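
The GO figures above are internally consistent, which can be checked with a
few sums. The following Python sketch is only a sanity check on the numbers
reported in this section; the dictionary layout is an assumption made for
illustration.

```python
# Figures copied from the GO annotation summary above.
total_associations = 143686
iea = {"Interpro2GO": 22491, "Other": 80434}
non_iea = {
    "IBA": 14600, "IC": 109, "IDA": 7362, "IEP": 336, "IGI": 4163,
    "IKR": 4, "IMP": 8050, "IPI": 3617, "ISM": 9, "ISO": 1,
    "ISS": 1738, "NAS": 190, "ND": 396, "TAS": 186,
}

iea_total = sum(iea.values())
non_iea_total = sum(non_iea.values())

# The evidence-code breakdown should add up to the overall total.
assert iea_total + non_iea_total == total_associations

# Genes with GO_term connections: non-IEA only + IEA only + both.
genes_with_go = 702 + 7937 + 6527

# Distinct GO_terms: non-IEA only + IEA only + both.
distinct_terms = 3319 + 860 + 1642

print(iea_total, non_iea_total, genes_with_go, distinct_terms)
```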

-=============================================================================-
-=========== Other core species data summary =================================-
-=============================================================================-

Approved gene symbols
---------------------
Brugia malayi                 3384
Caenorhabditis brenneri       3912
Caenorhabditis briggsae       6703
Caenorhabditis japonica       5470
Caenorhabditis remanei        6558
Onchocerca volvulus           3256
Pristionchus pacificus        3525
Strongyloides ratti            108

Gene counts
-----------
Brugia malayi                11791 (11078 coding)
Caenorhabditis brenneri      33257 (30660 coding)
Caenorhabditis briggsae      23977 (22379 coding)
Caenorhabditis japonica      32408 (29931 coding)
Caenorhabditis remanei       59161 (57662 coding)
Onchocerca volvulus          12613 (12117 coding)
Pristionchus pacificus       24217 (24216 coding)
Strongyloides ratti          12973 (12464 coding)

Brugia malayi Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              6129 (45.6%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5901 (43.9%)     Some, but not all exon bases are covered by transcript evidence
Predicted              1407 (10.5%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis brenneri Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              1573 (5.1%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5688 (18.5%)     Some, but not all exon bases are covered by transcript evidence
Predicted             23411 (76.3%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis briggsae Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed             16081 (63.3%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5113 (20.1%)     Some, but not all exon bases are covered by transcript evidence
Predicted              4193 (16.5%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis japonica Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              1635 (4.5%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5197 (14.4%)     Some, but not all exon bases are covered by transcript evidence
Predicted             29144 (81.0%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis remanei Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               962 (3.1%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5661 (18.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted             24827 (78.9%)     No coverage by mRNA/EST/RNASeq evidence

Onchocerca volvulus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               424 (3.5%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    1904 (15.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted              9897 (81.0%)     No coverage by mRNA/EST/RNASeq evidence

Pristionchus pacificus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               229 (0.9%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4984 (20.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted             19004 (78.5%)     No coverage by mRNA/EST/RNASeq evidence

Strongyloides ratti Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               876 (7.0%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    2342 (18.8%)     Some, but not all exon bases are covered by transcript evidence
Predicted              9265 (74.2%)     No coverage by mRNA/EST/RNASeq evidence


-==============================================================================-
-=========== News for this release ============================================-
-==============================================================================-

New data sets
--------------


New/updated reference genomes
------------------------------------


Strongyloides ratti

The S.ratti assembly has been updated to include the mitochondrial
sequence and annotation.  It can be found as the 'SRAE_MITOCHONDRIAL'
sequence in the FASTA / GFF files and includes 36 coding and
non-coding genes.
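
To pull the mitochondrial sequence out of the genomic FASTA file, something
along these lines may be sufficient. This is a minimal Python sketch, not a
WormBase tool: the function names are hypothetical, and it assumes that the
sequence identifier is the first whitespace-delimited token on the FASTA
header line.

```python
import gzip
from typing import Iterable, Optional


def extract_sequence(lines: Iterable[str], seq_id: str) -> Optional[str]:
    """Return the sequence of the first record whose ID equals seq_id, or None."""
    wanted = False
    found = False
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if wanted:
                break  # reached the record after the one we wanted
            fields = line[1:].split()
            wanted = bool(fields) and fields[0] == seq_id
            found = found or wanted
        elif wanted:
            chunks.append(line.strip())
    return "".join(chunks) if found else None


def extract_from_gz(path: str, seq_id: str) -> Optional[str]:
    """Same as extract_sequence, reading from a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return extract_sequence(handle, seq_id)


# Usage (the path is a placeholder following the G_SPECIES.BIOPROJECT pattern):
# mito = extract_from_gz("G_SPECIES.BIOPROJECT.WS259.genomic.fa.gz", "SRAE_MITOCHONDRIAL")
```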


Proposed Changes / Forthcoming Data
------------------------------------


Model Changes
--------------

Model changes for this release are documented here:

http://wiki.wormbase.org/index.php/WS259_Models.wrm

For more information mail help@wormbase.org

-==============================================================================-
-=========== Installation guide ===============================================-
-==============================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
        e.g. /users/yourname/wormbase

2. Uncompress and untar all of the database.*.tar.gz files into
        this directory. You will need approximately 50-60 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
        (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
        type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
        using xace.
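
Steps 1 and 2 above can also be scripted. The Python sketch below creates
the target directory and unpacks every database.*.tar.gz archive into it.
The function name 'unpack_release' and both directory arguments are
assumptions made for illustration, and the archives are assumed to come from
a trusted source, since the sketch extracts them without further checks.

```python
import tarfile
from pathlib import Path
from typing import List


def unpack_release(archive_dir: str, target_dir: str) -> List[str]:
    """Extract all database.*.tar.gz archives from archive_dir into target_dir.

    Returns the names of the archives that were unpacked, in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # step 1: create the directory

    unpacked = []
    for archive in sorted(Path(archive_dir).glob("database.*.tar.gz")):
        # step 2: decompress and untar each archive into the target directory
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        unpacked.append(archive.name)
    return unpacked


# Usage (both paths are placeholders):
# unpack_release("/path/to/downloads/acedb", "/users/yourname/wormbase")
```

Installing the acedb binary and opening the database with xace (steps 3 and
4) still have to be done separately.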

____________  END _____________

Last edited by Gary Williams – 114 days ago