WormBase database build overview

From WormBaseWiki


1) The three US-based groups upload their ACeDB databases to the secure Sanger FTP server. These, plus two Sanger databases, are combined in a controlled fashion: only specific data classes, and in some cases specific tags within classes, are included from each database. The merge also includes a full C. briggsae database, which is currently handled solely at Washington University, St Louis.

2) The full chromosomal sequences are regenerated from the component parts in the two sequence curation databases. DNA, GFF and AGP files are dumped at this initial stage.

3) Wormpep, the C. elegans proteome, is generated by translating the coding genes. Wormrna, the set of spliced sequences of non-coding genes, is also generated.
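
As an illustration of the translation step, here is a minimal sketch in Python. It assumes the standard genetic code and a spliced CDS given in the coding orientation; the function name `translate_cds` is hypothetical and this is not the build's own code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a spliced coding sequence into a protein string.

    Unknown codons (e.g. containing N) become 'X'; a trailing stop
    codon is dropped from the returned protein.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = [CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)]
    if protein and protein[-1] == "*":
        protein.pop()
    return "".join(protein)
```

For example, `translate_cds("ATGGCCTAA")` gives `"MA"`.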

4) Wormpep and its C. briggsae equivalent are transferred to the Ensembl pipeline part of the build, along with any DNA clones that have been modified. The BLAST and other analyses performed using the Ensembl infrastructure are then run in parallel with the rest of the build until the results are ready to be dumped and loaded back into the main database.

5) All nematode transcript data (mRNAs, ESTs, etc.) in the public nucleotide databases are aligned to the C. elegans genome using Jim Kent's BLAT program. The C. elegans alignments are then compared to the existing CDS structures, and full-length coding transcripts are built from them and from other relevant evidence such as TSLs and polyadenylation data.

6) Experimental data such as RNAi, alleles and PCR products used in microarrays are then mapped to the genome sequence and cross-referenced to the genes they overlap. Gene clusters based on microarray data are then dynamically determined.
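
The cross-referencing of a mapped feature to overlapping genes can be sketched as a simple interval test. The tuple layout, the 1-based inclusive coordinates and the names below are assumptions made for illustration, not the build's actual data model.

```python
def overlapping_genes(feature, genes):
    """Return the names of genes that overlap a mapped feature.

    feature: (chromosome, start, end)
    genes:   iterable of (name, chromosome, start, end)
    Coordinates are assumed to be 1-based and inclusive.
    """
    chrom, start, end = feature
    return [
        name
        for name, g_chrom, g_start, g_end in genes
        # Two closed intervals overlap when each starts before the other ends.
        if g_chrom == chrom and g_start <= end and start <= g_end
    ]


# Hypothetical genes and a hypothetical PCR product on chromosome I.
genes = [
    ("gene-A", "I", 100, 500),
    ("gene-B", "I", 450, 900),
    ("gene-C", "II", 100, 500),
]
hits = overlapping_genes(("I", 480, 520), genes)
```

Here `hits` contains `gene-A` and `gene-B`. A linear scan like this is fine for a sketch; a genome-scale run would more likely use sorted intervals or an interval tree.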

7) The confirmation status of all coding genes is calculated from the level of transcript coverage, and this information is added to Wormpep.

8) The coordinates of relatively static datasets are transformed from the previous release. These datasets include ab initio gene predictions, Vancouver fosmids, and mass-spectrometry peptides whose coordinates may need updating when small genome sequence changes occur. This saves the step of entirely remapping these datasets with each release. The coordinate transformation is handled by a custom tool (http://www.wormbase.org/wiki/index.php/Converting_Coordinates_between_releases) that we have released to the community.
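
The idea behind coordinate transformation can be sketched as follows. This is not the released tool; it assumes each genome change between releases is recorded as a hypothetical `(position, shift)` pair, meaning that every old coordinate after `position` moves by `shift` bases (positive for an insertion, negative for a deletion).

```python
def remap_coordinate(pos, changes):
    """Map an old-release coordinate to the new release.

    changes: list of (change_pos, shift) pairs in old-release coordinates.
    Coordinates lying inside a deleted region are not handled here; a real
    tool would have to flag such features for full remapping.
    """
    return pos + sum(shift for change_pos, shift in changes if change_pos < pos)


def remap_feature(start, end, changes):
    """Remap both ends of a feature independently."""
    return remap_coordinate(start, changes), remap_coordinate(end, changes)


# Hypothetical edits: 3 bases inserted after position 100,
# 2 bases deleted after position 500.
changes = [(100, 3), (500, -2)]
```

With these edits, a feature at 200-300 moves to 203-303, while one at 50-80 is unchanged.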

9) The data from the Compara, BLAST analysis and automatic protein annotation pipelines are loaded into the main database, and full GFF files for each chromosome are dumped.

10) Several automated annotation procedures make gene function easier for users to determine. A short description is generated from the gene name, protein domains and homology, e.g. "C. elegans TSP-11 protein; contains similarity to Pfam domain PF00335 Tetraspanin family,contains similarity to Interpro domain IPR000301 (CD9/CD37/CD63 antigen)" for tsp-11. GO terms are applied to genes based on protein domains and on phenotypes caused by mutation or RNAi. The InterPro to GO term mapping is taken from a file provided by the InterPro database at the European Bioinformatics Institute (EBI), and the phenotype to GO term connections are determined by Caltech curators.

11) Interpolated genetic map positions are determined for those genes that do not have experimental genetic mapping data.
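
One way to picture the interpolation is a linear estimate between the two nearest flanking markers that have both a physical and a genetic position. The sketch below assumes linear interpolation, which this page does not specify; the function name and marker layout are hypothetical.

```python
from bisect import bisect_left


def interpolate_map_position(physical_pos, markers):
    """Estimate a genetic map position from a physical position.

    markers: list of (physical_bp, genetic_cM) pairs on one chromosome
             for genes with experimental mapping data.
    Returns None when the position lies outside the flanking markers.
    """
    markers = sorted(markers)
    positions = [bp for bp, _ in markers]
    i = bisect_left(positions, physical_pos)
    if i < len(markers) and positions[i] == physical_pos:
        return markers[i][1]  # exact hit on a mapped marker
    if i == 0 or i == len(markers):
        return None  # no marker on one side, so nothing to interpolate between
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    return cm0 + (cm1 - cm0) * (physical_pos - bp0) / (bp1 - bp0)


# Hypothetical markers: 1000 bp at -2.0 cM and 3000 bp at 2.0 cM.
markers = [(1000, -2.0), (3000, 2.0)]
```

A gene midway between the two markers, at 2000 bp, is placed at 0.0 cM.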

12) At this point all data objects are present in the database, so some consistency checks are performed. A class-by-class count comparison with the previous release identifies any major errors. Twelve clones, two from each chromosome, are manually inspected for any missing data or visually obvious errors.
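
A class-by-class count comparison could look something like the sketch below. The 10% tolerance, the function name and the example counts are assumptions chosen for illustration; the page does not state what threshold the build uses.

```python
def compare_class_counts(previous, current, tolerance=0.10):
    """Flag classes whose object count changed by more than `tolerance`.

    previous, current: dicts mapping class name -> object count.
    Classes that appear or disappear entirely are always flagged.
    Returns a dict of class name -> (old_count, new_count).
    """
    flagged = {}
    for cls in sorted(set(previous) | set(current)):
        old = previous.get(cls, 0)
        new = current.get(cls, 0)
        if old == 0 or new == 0:
            if old != new:
                flagged[cls] = (old, new)
        elif abs(new - old) / old > tolerance:
            flagged[cls] = (old, new)
    return flagged


# Hypothetical counts for two consecutive releases.
previous = {"Gene": 1000, "CDS": 500, "RNAi": 200}
current = {"Gene": 1010, "CDS": 300, "Allele": 50}
report = compare_class_counts(previous, current)
```

In this example `Gene` passes (a 1% change), while `CDS`, `RNAi` and `Allele` are flagged for inspection.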

13) Once we are satisfied that there are no problems with the database, GFF files are produced for each chromosome. These files go through several post-processing steps that add extra information for display in the website genome browser (e.g. for SAGE tags, whether the tag is associated with a gene or is unambiguously mapped). Because these additions are precomputed and written into the GFF lines, the GFF-driven Genome Browser can display the information without querying the underlying ACeDB database.
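
The post-processing idea can be sketched as appending precomputed text to the attribute column of a GFF line. The sketch assumes nine tab-separated columns and a " ; " separator between attributes; the example line and the note text are hypothetical and do not reproduce the actual WormBase GFF content.

```python
def annotate_gff_line(line, extra):
    """Append `extra` to the attribute column (column 9) of one GFF line.

    Comment lines and lines with fewer than eight columns are returned
    unchanged. The line is expected without a trailing newline.
    """
    if line.startswith("#"):
        return line
    columns = line.split("\t")
    if len(columns) < 8:
        return line
    if len(columns) == 8:
        columns.append(extra)  # no attribute column yet
    elif columns[8]:
        columns[8] = f"{columns[8]} ; {extra}"
    else:
        columns[8] = extra
    return "\t".join(columns)


# Hypothetical SAGE tag line and precomputed note.
line = "I\tSAGE_tag\tSAGE_tag\t100\t113\t.\t+\t.\tSequence SAGE:example"
annotated = annotate_gff_line(line, 'Note "unambiguously mapped"')
```

Precomputing the note once at build time is what lets a browser show it straight from the GFF line.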

14) A set of frequently requested data sets, such as microarray probe to gene mappings, best BLASTP hits, and GeneID to locus and CDS names, is generated. These and the final release files, including compressed ACeDB database files and GFF, AGP and DNA files, are deposited on the Sanger public FTP site. A release letter is automatically generated detailing genome sequence changes, curation activity progress, Wormpep statistics and other details. Information on new and upcoming data, bug fixes and known errors is added manually before the letter is sent out to the wormbase-dev mailing list. This notifies CSHL that the database is ready for download and installation on the development server.
