Difference between revisions of "Sources and Issues Next Release"

From irefindex
(MPIDB: Updated the MPIDB MITAB notes.)
(Issues: Attempted to tidy up the issues.)
Line 26: Line 26:
  
 
== Issues ==
 
== Issues ==
'''Hard Release date: July 1st.'''
 
  
'''Yeast taxon id changes'''
+
=== Yeast taxon id changes ===
 +
 
 
See http://www.uniprot.org/news/2011/05/03/release
 
See http://www.uniprot.org/news/2011/05/03/release
 +
 +
{{Note|
 +
There appears to be correction of taxids in the build.
 +
|Paul}}
  
 
'''New databases'''
 
'''New databases'''
To Be Discussed
+
 
 +
InnateDB, MatrixDB, MPIDB are added.
  
 
'''BioGrid interaction record ids (pre-build issue)'''
 
'''BioGrid interaction record ids (pre-build issue)'''
To Be Done
 
  
 
Capture Biogrid interaction record ids so iRefWeb can link out to BioGrid.
 
Capture Biogrid interaction record ids so iRefWeb can link out to BioGrid.
* The only interaction id available from the BioGrid files are already being used and also there in the iRefWeb
+
 
e.g <primaryRef db="grid" id="103" refType="identity" refTypeAc="MI:0356" dbAc="MI:0463" />
+
The only interaction id available from the BioGrid files are already being used and also there in the iRefWeb, such as...
 +
 
 +
  <primaryRef db="grid" id="103" refType="identity" refTypeAc="MI:0356" dbAc="MI:0463" />
  
 
'''RIGID recalculation (pre-build issue)'''
 
'''RIGID recalculation (pre-build issue)'''
Line 48: Line 54:
  
 
Taxon specific files should contain interactions ONLY if one or both taxa, taxb have the appropriate taxon (regardless of what the source database said the interaction taxon was.  Change README.
 
Taxon specific files should contain interactions ONLY if one or both taxa, taxb have the appropriate taxon (regardless of what the source database said the interaction taxon was.  Change README.
Example see PMID http://wodaklab.org/iRefWeb/pubReport/detail?pubmed=12565857+
+
For example, see PMID...
 +
 
 +
http://wodaklab.org/iRefWeb/pubReport/detail?pubmed=12565857+
 +
 
 
A "mouse" interaction from HPRD lists only human interactors (the paper is about mouse and they have made a transfer to human without noting what they have done.)  As a result, this human interaction ends up in the mouse MITAB (because HPRD says it was mouse).  BioGRID correctly curates the paper as about mouse.
 
A "mouse" interaction from HPRD lists only human interactors (the paper is about mouse and they have made a transfer to human without noting what they have done.)  As a result, this human interaction ends up in the mouse MITAB (because HPRD says it was mouse).  BioGRID correctly curates the paper as about mouse.
  
Line 54: Line 63:
  
 
Ensure that all CORUM methods (with MI terms) are parsed.
 
Ensure that all CORUM methods (with MI terms) are parsed.
*This is now fixed (Sabry) - please get the latest mapper from CVS
 
  
'''Repeated lines (post-processing issue)''' 
+
{{Note|
 +
This is now fixed - please get the latest mapper from CVS.
 +
|Sabry}}
  
 +
'''Repeated lines (post-processing issue)''' 
  
 
There are multiple lines that are repeated many times.  These appear to arise from BIND 3DBP division (see for example lines 5,13,117,125 in Ecoli MITAB and others arising from BIND ID 92720 - 44 pieces of experimental evidence and 5 PMIDs) because the accessions for the different experimental forms are not present in MITAB.  See Antonio and bug# 245. Could be handled as a post-processing step on MITAB to take the unique set of all MITAB lines.
 
There are multiple lines that are repeated many times.  These appear to arise from BIND 3DBP division (see for example lines 5,13,117,125 in Ecoli MITAB and others arising from BIND ID 92720 - 44 pieces of experimental evidence and 5 PMIDs) because the accessions for the different experimental forms are not present in MITAB.  See Antonio and bug# 245. Could be handled as a post-processing step on MITAB to take the unique set of all MITAB lines.
  
'''MITAB/irefscape canonicalization (post-processing issue)'''
+
{{Note|
 +
This will probably be handled by using <tt>sort -u</tt> with the files.
 +
|Paul}}
 +
 
 +
'''MITAB/iRefScape canonicalization (post-processing issue)'''
  
 
Change this to choose canonical sequence rather than longest sequence (mapping score L).
 
Change this to choose canonical sequence rather than longest sequence (mapping score L).
 
Examples GeneID 84148 and 512564 unnecessarily separates Grid interaction data from interaction data from other databases.
 
Examples GeneID 84148 and 512564 unnecessarily separates Grid interaction data from interaction data from other databases.
  
Decided not to chnage L method...instead:
+
Decided not to change L method...instead:
  
Resolve by distributing non-canonicalized data as before AND a canonicalized MITAB file with complete provenance info (this will become the main MITAB file we release and it will support PSICQUIC services and we will drop non-canonicalised version in future releases).  Also, canonicalize irefscape data and include provenace data for interactors in edge attribute viewer.   
+
Resolve by distributing non-canonicalized data as before AND a canonicalized MITAB file with complete provenance info (this will become the main MITAB file we release and it will support PSICQUIC services and we will drop non-canonicalised version in future releases).  Also, canonicalize iRefScape data and include provenance data for interactors in edge attribute viewer.   
  
 
Requires review of current MITAB file format by Ian.
 
Requires review of current MITAB file format by Ian.
  
 
===Other issues===
 
===Other issues===
*Discuss the way to include I2D -- No I2D will not be included
+
 
*Parse all new datasets to a temporary database and test before homogenizing. -- not required, no new data sources
+
*Discuss the way to include I2D
 +
** I2D will not be included
 +
*Parse all new datasets to a temporary database and test before homogenizing.
 +
** Not required, no new data sources
 
*Whether to use both BIND text and BIND_Translation OR only one of them
 
*Whether to use both BIND text and BIND_Translation OR only one of them
 
+
** Still using both.
 
*The default output data will be the canonical form. The MITAB will have the canonical Accession as the UIDA and UIDB. There will be new columns beforeCanonicalizationReferenceA, beforeCanonicalizationReferenceB. The aliases and the alternative identifiers will be of the canonical group not of a specific protein. With the new columns and all the references, it has to be tested whether the row width will exceed any thresholds (e.g. MySQL maximum row with), (I assume this would not be a problem).   
 
*The default output data will be the canonical form. The MITAB will have the canonical Accession as the UIDA and UIDB. There will be new columns beforeCanonicalizationReferenceA, beforeCanonicalizationReferenceB. The aliases and the alternative identifiers will be of the canonical group not of a specific protein. With the new columns and all the references, it has to be tested whether the row width will exceed any thresholds (e.g. MySQL maximum row with), (I assume this would not be a problem).   
 
 
*For iRefScape, once the canonicalization is performed there will be no "uncanonicalize option" (currently there is a option to use canonical expansion).
 
*For iRefScape, once the canonicalization is performed there will be no "uncanonicalize option" (currently there is a option to use canonical expansion).
  

Revision as of 05:01, 17 August 2011

Note

This is a planning template for the next release. It does not correspond to a released product. See http://irefindex.uio.no/ for the most recent release and related documentation. This page can be used to create the sources page. Check for xxx before copying and pasting to the appropriate sources page for the new release. Do not edit xxx in this page. Leave this page as a template. After making a new release page, update the general Sources for iRefIndex redirect page.

Last edited: 2011-08-17

Applies to iRefIndex release: xxx

Release date: xxx

Authors: Ian Donaldson, Sabry Razick and Paul Boddie

Database: iRefIndex (http://irefindex.uio.no)

Organization: Biotechnology Centre of Oslo, University of Oslo (http://www.biotek.uio.no/)

Description: This file lists interaction and protein sequence related resources used for the current build of the iRefIndex. Statistics for the iRefIndex are available and include a breakdown of interactors and interactions from each data source.

Contents

Issues

Yeast taxon id changes

See http://www.uniprot.org/news/2011/05/03/release

Note (Paul)

There appears to be correction of taxids in the build.

New databases

InnateDB, MatrixDB, MPIDB are added.

BioGrid interaction record ids (pre-build issue)

Capture Biogrid interaction record ids so iRefWeb can link out to BioGrid.

The only interaction ids available in the BioGrid files are already in use and are already present in iRefWeb, such as...

<primaryRef db="grid" id="103" refType="identity" refTypeAc="MI:0356" dbAc="MI:0463" />
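As a rough illustration of capturing such an id, the sketch below pulls the id attribute out of a primaryRef element like the one above. It assumes the element is given as a standalone string without an XML namespace (real PSI-MI files normally declare one), and the function name is hypothetical, not part of the iRefIndex code.

```python
import xml.etree.ElementTree as ET


def biogrid_interaction_id(primary_ref_xml):
    """Return the id of a BioGRID primaryRef element, or None if it is not one."""
    element = ET.fromstring(primary_ref_xml)
    # Assumption: db="grid" marks a BioGRID reference, as in the example above.
    if element.tag == "primaryRef" and element.get("db") == "grid":
        return element.get("id")
    return None


record = ('<primaryRef db="grid" id="103" refType="identity" '
          'refTypeAc="MI:0356" dbAc="MI:0463" />')
print(biogrid_interaction_id(record))
```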

RIGID recalculation (pre-build issue)

See bug 242. Modify the existing RIGID table or lose continuity of iRIGIDs with the last release.

Taxon specific MITAB files (post-processing issue)

Taxon specific files should contain interactions ONLY if one or both of taxa, taxb have the appropriate taxon (regardless of what the source database said the interaction taxon was). Change README. For example, see PMID...

http://wodaklab.org/iRefWeb/pubReport/detail?pubmed=12565857+

A "mouse" interaction from HPRD lists only human interactors (the paper is about mouse and they have made a transfer to human without noting what they have done). As a result, this human interaction ends up in the mouse MITAB (because HPRD says it was mouse). BioGRID correctly curates the paper as about mouse.
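The selection rule described above could be sketched as follows. The row layout (dictionaries with taxa, taxb and source_taxon keys) and the sample records are purely illustrative assumptions, not the actual MITAB column names or real data.

```python
def select_for_taxon(rows, wanted_taxid):
    """Keep rows where interactor A or interactor B has the wanted taxon id.

    The taxon stated by the source database (source_taxon) is deliberately ignored.
    """
    return [row for row in rows if wanted_taxid in (row["taxa"], row["taxb"])]


# Hypothetical records: 9606 is human, 10090 is mouse.
rows = [
    {"source": "HPRD", "taxa": "9606", "taxb": "9606", "source_taxon": "10090"},
    {"source": "BioGRID", "taxa": "10090", "taxb": "10090", "source_taxon": "10090"},
    {"source": "example", "taxa": "9606", "taxb": "10090", "source_taxon": "9606"},
]

mouse_rows = select_for_taxon(rows, "10090")
print([row["source"] for row in mouse_rows])
```

Under this rule the HPRD-style record with two human interactors stays out of the mouse file even though its source taxon says mouse.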

CORUM methods (code change implemented)

Ensure that all CORUM methods (with MI terms) are parsed.

Note (Sabry)

This is now fixed - please get the latest mapper from CVS.

Repeated lines (post-processing issue)

There are multiple lines that are repeated many times. These appear to arise from BIND 3DBP division (see for example lines 5,13,117,125 in Ecoli MITAB and others arising from BIND ID 92720 - 44 pieces of experimental evidence and 5 PMIDs) because the accessions for the different experimental forms are not present in MITAB. See Antonio and bug# 245. Could be handled as a post-processing step on MITAB to take the unique set of all MITAB lines.

Note (Paul)

This will probably be handled by using sort -u with the files.
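For comparison, the same post-processing step can be sketched in Python. This is only an illustration of taking the unique set of lines; it assumes that a MITAB file may start with a single header line beginning with "#", which plain sorting would otherwise move among the data lines. The function name is hypothetical.

```python
def unique_lines(lines):
    """Return the header line (if any) followed by the sorted distinct data lines."""
    lines = list(lines)
    # Assumption: a header, when present, is the first line and starts with "#".
    header = [line for line in lines[:1] if line.startswith("#")]
    body = lines[len(header):]
    return header + sorted(set(body))


sample = ["#uidA\tuidB\n", "b\tc\n", "a\tb\n", "b\tc\n"]
for line in unique_lines(sample):
    print(line, end="")
```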

MITAB/iRefScape canonicalization (post-processing issue)

Change this to choose the canonical sequence rather than the longest sequence (mapping score L). For example, with GeneID 84148 and 512564 the current method unnecessarily separates Grid interaction data from interaction data from other databases.

Decided not to change L method...instead:

Resolve by distributing non-canonicalized data as before AND a canonicalized MITAB file with complete provenance info (this will become the main MITAB file we release, it will support PSICQUIC services, and we will drop the non-canonicalized version in future releases). Also, canonicalize iRefScape data and include provenance data for interactors in the edge attribute viewer.

Requires review of current MITAB file format by Ian.

Other issues

  • Discuss the way to include I2D
    • I2D will not be included
  • Parse all new datasets to a temporary database and test before homogenizing.
    • Not required, no new data sources
  • Whether to use both BIND text and BIND_Translation OR only one of them
    • Still using both.
  • The default output data will be the canonical form. The MITAB will have the canonical accession as the UIDA and UIDB. There will be new columns beforeCanonicalizationReferenceA and beforeCanonicalizationReferenceB. The aliases and the alternative identifiers will be those of the canonical group, not of a specific protein. With the new columns and all the references, it has to be tested whether the row width will exceed any thresholds (e.g. the MySQL maximum row width); I assume this would not be a problem.
  • For iRefScape, once the canonicalization is performed there will be no "uncanonicalize option" (currently there is an option to use canonical expansion).
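The row width test mentioned in the list above could be sketched as below. The limit of 65535 bytes is an assumption based on MySQL's general row size limit and should be checked against the actual table definition and storage engine; the function name and sample lines are hypothetical.

```python
ROW_LIMIT = 65535  # assumed threshold in bytes; verify against the target schema


def widest_row(lines, encoding="utf-8"):
    """Return (line_number, width_in_bytes) of the widest line, or (0, 0) if empty."""
    widest = (0, 0)
    for number, line in enumerate(lines, start=1):
        # Measure the encoded width without the trailing newline.
        width = len(line.rstrip("\n").encode(encoding))
        if width > widest[1]:
            widest = (number, width)
    return widest


sample_rows = ["a\tb\n", "abc\tdef\n"]
number, width = widest_row(sample_rows)
print(number, width, width <= ROW_LIMIT)
```

The same function can be fed an open file object, since it only iterates over lines.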

Build issues

Two BIND Translation files use non-ASCII byte values that are not part of valid UTF-8 byte sequences, but do not declare an encoding explicitly:

  • taxid10090_PSIMI25.xml
  • taxid9606_PSIMI25.xml
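A minimal sketch of how such byte values might be located is given below. It only reports the offset of the first byte sequence that is not valid UTF-8; the function name is hypothetical and the sample bytes are made up for illustration, not taken from the files listed above.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


# 0xE9 is "é" in Latin-1 but is not valid UTF-8 when followed by a space.
print(first_invalid_utf8(b"caf\xe9 protein"))
print(first_invalid_utf8("café protein".encode("utf-8")))
```

To check a file, its contents could be read in binary mode (for example with open(path, "rb").read()) and passed to the function.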

MPIDB

The MPIDB data files are non-standard in various respects and require some special measures to structure the data for iRefIndex use. See iRefIndex MITAB Mapping for details of the way iRefIndex should retain MITAB-originating data.

InnateDB

InnateDB has data from other sources as well. The download page has a link for curated InnateDB data, and we should find out whether this is a collection of all data or only the data curated by InnateDB. Paul has made a parser for the PSI XML and this data will be from 2011-03-06. They say, however, that they update the MITAB version every week.

MatrixDB

Their interactors include non-proteins and protein fragments, not only proteins. This database must be tested before homogenizing.

Interaction related resources

Source Format Location Version (date)
BIND Tab-delimited text file. ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/ (no longer available - see below).

20050525.complex2refs.txt

20050525.ints.txt

20050525.refs.txt

20050525.complexes.txt

20050525.labels.txt

20050525.complex2subunits.txt

These files are no longer available via ftp but are available from the authors. BIND archival content is now managed by Thomson Scientific. See http://bond.unleashedinformatics.com/ and http://bond.unleashedinformatics.com/downloads/data/BIND/

For historical purposes, a snapshot of the Blueprint web-site may be viewed at...

http://web.archive.org/web/20050204013426/www.blueprint.org/index.html

...via the internet archive at...

http://web.archive.org/web/*/http://www.blueprint.org

2005-05-25
BIND Translation PSI-MI 2.5 http://download.baderlab.org/BINDTranslation/release1_0/BINDTranslation_v1_xml_AllSpecies.tar.gz Version 1.0 (2010-12-15)
BioGRID PSI-MI 2.5 http://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.1.77/BIOGRID-ALL-3.1.77.psi25.zip Version 3.1.77 (2011-06-01)
CORUM PSI-MI 2.5 http://mips.gsf.de/genre/proj/corum/index.html
http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip
2009-12-02
DIP PSI-MI 2.5 http://dip.doe-mbi.ucla.edu/dip/Download.cgi


dip20101010.mif25
Note: date on last IMEx release file is from 2008

2010-10-10
HPRD PSI-MI 2.5 http://www.hprd.org/download
HPRD_PSIMI_041310.tar.gz
Release 9 (2010-04-13)
IntAct PSI-MI 2.5 ftp://ftp.ebi.ac.uk/pub/databases/intact/2011-05-23/psi25/pmidMIF25.zip 2011-05-25
MINT PSI-MI 2.5 ftp://mint.bio.uniroma2.it/pub/release/psi/current/psi25/pmid/ 2010-12-21
MPACT PSI-MI 2.5 ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz 2008-01-10
MPPI PSI-MI 1.0 http://mips.gsf.de/proj/ppi/data/mppi.gz 2004-06-01 (from archive)
OPHID PSI-MI 1.0 http://ophid.utoronto.ca/ophid/downloads.html (This service no longer available, please refer to http://ophid.utoronto.ca/ophidv2.201/) 2006-07-07
New for this release
InnateDB PSI-MI 2.5 http://www.innatedb.com/download.jsp
Curated InnateDB Data
2011-03-06
MPIDB MITAB format file http://www.jcvi.org/mpidb (information)

http://www.jcvi.org/mpidb/download.php (general downloads)
http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-LIT (specific download for MPI-LIT)
http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-IMEX (specific download for MPI-IMEX)

Downloaded on 2011-06-14
MatrixDB PSI-MI 2.5 http://matrixdb.ibcp.fr/
MatrixDB_20100826.xml.zip
2010-08-26 (timestamp)

Sequence related resources (not updated yet)

Source Format Location Version (date)
SEGUID Tab-delimited text ftp://bioinformatics.anl.gov/seguid/
seguidannotation
xxxx (timestamp)
UniProt Text http://www.uniprot.org/downloads
UniProtKB/Swiss-Prot (uniprot_sprot.dat.gz)
UniProt Knowledgebase Release 2011_06 (2011-05-31) (Downloaded on 2011-06-11):
UniProtKB/Swiss-Prot
UniProtKB/TrEMBL
(from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt)
UniProt Text http://www.uniprot.org/downloads
UniProtKB/TrEMBL (uniprot_trembl.dat.gz)
UniProt, IsoForms FASTA http://www.uniprot.org/downloads uniprot_sprot_varsplic.fasta.gz
UniProt, SGD Tab-delimited text file. http://www.expasy.org/cgi-bin/lists?yeast.txt
Yeast (Saccharomyces cerevisiae): entries, gene names and cross-references to SGD
UniProt, FLY Tab-delimited text file. http://www.expasy.org/cgi-bin/lists?fly.txt
Drosophila: entries, gene names and cross-references to FlyBase.
NCBI, RefSeq GenPept ftp://ftp.ncbi.nih.gov/refseq/release/complete
see *.protein.gpff.gz files
Release 47 (2011-05-12) (Downloaded on 2011-06-11)
(from http://www.ncbi.nlm.nih.gov/refseq/)
NCBI, MMDB/PDB Tab-delimited text ftp://ftp.ncbi.nih.gov/mmdb/pdbeast/table (Downloaded on 2011-06-11)
NCBI, PDB sequences FASTA ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz (Downloaded on 2011-06-11)
NCBI Gene2Refseq Tab-delimited text ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
gene2refseq.gz
(Downloaded on 2011-06-14)

All iRefIndex Pages

Follow this link for a listing of all iRefIndex related pages (archived and current).