iRefIndex Build Process

Revision as of 17:33, 12 February 2009 by PaulBoddie (The first parts of the original document.)

Downloading the Source Data

Before downloading the source data, a location must be chosen for the downloaded files. For example:

/biotek/prometheus/storage/Sabry/data

Download the files to create local copies. Automated download is not possible for all the data sources: some must be downloaded manually, and some require special links to be obtained from the source administrators via e-mail. The FTPtransfer program will download data from the following sources:

  • RefSeq
  • MMDB
  • PDB
  • gene2refseq
  • IntAct
  • MINT

Manual Downloads

More information can be found at the following location:

ftp://ftp.no.embnet.org/irefindex/data/current/sources.htm

For each manual download, a subdirectory hierarchy must be created in the main data directory using a command of the following form:

mkdir -p <path-to-data>/<source>/<date>/

Here, <path-to-data> should be replaced by the location of the data directory, <source> should be replaced by the name of the source, and <date> should be replaced by the current date.

For example, for BIND this directory might be created as follows:

mkdir -p /biotek/prometheus/storage/Sabry/data/BIND/09_22_2008/
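
The directory creation step can be wrapped in a small shell function so that the date component is filled in automatically. This is only a sketch: make_source_dir is a hypothetical helper, not part of the iRefIndex tools, and it assumes the MM_DD_YYYY date format seen in the BIND example.

```shell
# make_source_dir is a hypothetical helper, not part of the iRefIndex tools.
# Usage: make_source_dir <path-to-data> <source>
make_source_dir() {
    data_root=$1
    source_name=$2
    # Assumption: dates are written as MM_DD_YYYY, as in the BIND example.
    today=$(date +%m_%d_%Y)
    target="$data_root/$source_name/$today"
    # Create the hierarchy and print the resulting path on success.
    mkdir -p "$target" && printf '%s\n' "$target"
}

# Example:
#   make_source_dir /biotek/prometheus/storage/Sabry/data BIND
```

The printed path can be captured in a variable and reused as the destination for the downloaded files.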

BIND

The FTP site was previously available at the following location:

ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/

An archived copy of the data can be found at the following internal location:

/biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp/

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Copy the following files into the newly created data directory:

20060525.complex2refs.txt
20060525.complex2subunits.txt
20060525.ints.txt
20060525.labels.txt
20060525.refs.txt
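
The copy step might be scripted as follows, so that a missing file is reported instead of being silently skipped. copy_bind_files is a hypothetical helper; the archive and target directories are passed as arguments rather than assumed.

```shell
# copy_bind_files is a hypothetical helper, not part of the iRefIndex tools.
# Usage: copy_bind_files <archive-dir> <target-dir>
copy_bind_files() {
    archive_dir=$1
    target_dir=$2
    # The five file names listed above share the 20060525 prefix.
    for suffix in complex2refs complex2subunits ints labels refs; do
        file="$archive_dir/20060525.$suffix.txt"
        if [ ! -f "$file" ]; then
            echo "missing: $file" >&2
            return 1
        fi
        cp "$file" "$target_dir/" || return 1
    done
}

# Example:
#   copy_bind_files /biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp \
#       /biotek/prometheus/storage/Sabry/data/BIND/09_22_2008
```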

BioGrid

The location of BioGrid downloads is as follows:

http://www.thebiogrid.org/downloads.php

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Select the BIOGRID-ORGANISM-XXXXX.psi25.zip file and download it to the newly created data directory.

In the data directory, uncompress the downloaded file; for example:

unzip BIOGRID-ORGANISM-2.0.44.psi25.zip

CORUM

The location of CORUM downloads is as follows:

http://mips.gsf.de/genre/proj/corum/index.html

The specific download file is this one:

http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip

Uncompress the downloaded file:

unzip allComplexes.psimi.zip

Important Note

The CORUM data needs adjusting to work with the StaxPSIXML software. Using a suitable XSLT tool such as xsltproc, transform the uncompressed downloaded file as follows:

mv allComplexes.psimi allComplexes.psimi.orig
xsltproc fix_corum.xsl allComplexes.psimi.orig > allComplexes.psimi

The fix_corum.xsl file can be found in the XSLT directory within StaxPSIXML.
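
The two commands above could be combined into a function that refuses to run twice and restores the original file if the transformation fails. This is a sketch under those assumptions; fix_corum is a hypothetical name, and the stylesheet path must be supplied by the caller.

```shell
# fix_corum is a hypothetical wrapper around the mv and xsltproc commands.
# Usage: fix_corum <path-to-fix_corum.xsl> <path-to-allComplexes.psimi>
fix_corum() {
    stylesheet=$1
    data_file=$2
    # An existing .orig file suggests the transformation was already applied.
    if [ -e "$data_file.orig" ]; then
        echo "$data_file.orig already exists; not transforming again" >&2
        return 1
    fi
    mv "$data_file" "$data_file.orig" || return 1
    # Write to a temporary name first so that a failed run leaves no
    # partial output under the final name.
    if xsltproc "$stylesheet" "$data_file.orig" > "$data_file.tmp"; then
        mv "$data_file.tmp" "$data_file"
    else
        rm -f "$data_file.tmp"
        mv "$data_file.orig" "$data_file"
        return 1
    fi
}

# Example (the StaxPSIXML location is a placeholder):
#   fix_corum <path-to-StaxPSIXML>/XSLT/fix_corum.xsl allComplexes.psimi
```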

DIP

DIP data is accessed via the following location:

http://dip.doe-mbi.ucla.edu/dip/Login.cgi?

You have to register, agree to terms, and get a user account.

Access credentials for internal users are available from Sabry.

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Select the FULL - complete DIP data set from the Files page:

http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=3

Download the latest PSI-MI 2.5 file (dip<date>.mif25) to the newly created data directory. If a compressed version of the file was chosen, uncompress the file using the gunzip tool. For example:

gunzip dip20080708.mif25.gz

HPRD

The location of HPRD downloads is as follows:

http://www.hprd.org/download/

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Download the PSI-MI single file (HPRD_SINGLE_PSIMI_<date>.xml.tar.gz) to the newly created data directory.

Note: you have to register each and every time, unfortunately.

Uncompress the downloaded file. For example:

tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz

OPHID

OPHID is no longer available, so you have to use the local copy of the data:

/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Copy the file ophid1153236640123.xml to the newly created data directory.

MIPS

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

For MPPI, download the following file:

http://mips.gsf.de/proj/ppi/data/mppi.gz

For MPACT, download the following file:

ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz

Uncompress the downloaded files:

gunzip mpact-complete.psi25.xml.gz
gunzip mppi.gz

UniProt

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Visit the following site:

http://www.uniprot.org/downloads

Download the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL files in text format, together with the uniprot_sprot_varsplic.fasta.gz file, either from this site or from the EBI UK mirror.

These files should be moved into the newly created data directory and uncompressed:

gunzip uniprot_sprot.dat.gz
gunzip uniprot_trembl.dat.gz
gunzip uniprot_sprot_varsplic.fasta.gz
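
Since the UniProt files are large, it may help to check that all three are present before uncompressing, and to skip any that were already uncompressed. A minimal sketch, with gunzip_uniprot as a hypothetical helper name:

```shell
# gunzip_uniprot is a hypothetical helper, not part of the iRefIndex tools.
# Usage: gunzip_uniprot <data-directory>
gunzip_uniprot() {
    dir=$1
    for name in uniprot_sprot.dat uniprot_trembl.dat uniprot_sprot_varsplic.fasta; do
        if [ -f "$dir/$name" ]; then
            # Nothing to do if an earlier run already uncompressed this file.
            echo "already uncompressed: $name"
        elif [ -f "$dir/$name.gz" ]; then
            gunzip "$dir/$name.gz" || return 1
        else
            echo "missing: $name.gz" >&2
            return 1
        fi
    done
}
```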

Building FTPtransfer

The FTPtransfer.jar file needs to be obtained or built.

  1. Get the program's source code.

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/FTPtransfer

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the Apache commons-net package, which must be available during compilation. The library can be retrieved from the Apache site or from one of its mirrors.

  3. Extract the dependencies:
    tar zxf commons-net-1.4.1.tar.gz

    This will produce a directory called commons-net-1.4.1 containing a file called commons-net-1.4.1.jar which should be placed in the lib directory in the FTPtransfer directory...

      mkdir lib
      cp commons-net-1.4.1/commons-net-1.4.1.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Customise the output locations. Currently, the output locations are hard-coded, and changing them would involve searching for the following...
    /biotek/prometheus/storage/Sabry/data

    ...and replacing it with the path to the preferred output directory. The source code is found in the following directory within the FTPtransfer directory:

    src/ftptransfer
  5. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the FTPtransfer directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar
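
The search-and-replace in step 4 might be done with grep and sed rather than by hand. The following is a sketch only: replace_output_path is a hypothetical helper, and it assumes the new path contains none of the characters |, & or backslash, which have special meaning in the sed replacement.

```shell
# replace_output_path is a hypothetical helper, not part of FTPtransfer.
# Usage: replace_output_path <source-dir> <new-output-dir>
# Assumption: <new-output-dir> contains no '|', '&' or backslash characters.
replace_output_path() {
    src_dir=$1
    new_path=$2
    old_path=/biotek/prometheus/storage/Sabry/data
    # List every file mentioning the hard-coded path, then rewrite each one
    # via a temporary file (avoiding the non-portable sed -i option).
    grep -rlF -- "$old_path" "$src_dir" | while IFS= read -r file; do
        sed "s|$old_path|$new_path|g" "$file" > "$file.new" && mv "$file.new" "$file"
    done
}

# Example, run from within the FTPtransfer directory before "ant jar":
#   replace_output_path src/ftptransfer /my/preferred/data
```

Running grep -rF with the old path afterwards should produce no matches if the replacement was complete.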

Running FTPtransfer

To run the program, invoke the .jar file as follows:

java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log

The specified log argument can be replaced with a suitable location for the program's execution log.

Building SHA

The SHA.jar file needs to be obtained or built.

  1. Get the program's source code.

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/SHA

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Compile the source code and create the .jar file as follows:
    ant jar

    The SHA.jar file will be created in the dist directory.

Building BioPSI_Suplimenter

The BioPSI_Suplimenter.jar file needs to be obtained or built.

  1. Get the program's source code.

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library (the mysql-connector-java archive extracted in the next step).
  3. Extract the dependencies:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the BioPSI_Suplimenter directory...

      mkdir lib
      cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    The SHA.jar file needs copying from its build location:

    cp ../SHA/dist/SHA.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the BioPSI_Suplimenter directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar
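
Before running ant, it may be worth confirming that both libraries from step 3 are in place, since a missing .jar file would otherwise only show up as a compilation error. A minimal sketch, with check_suplimenter_libs as a hypothetical helper name:

```shell
# check_suplimenter_libs is a hypothetical helper, not part of the build.
# Usage: check_suplimenter_libs <lib-directory>
check_suplimenter_libs() {
    lib_dir=$1
    status=0
    # Report every missing library rather than stopping at the first.
    for jar in mysql-connector-java-5.1.6-bin.jar SHA.jar; do
        if [ ! -f "$lib_dir/$jar" ]; then
            echo "missing: $lib_dir/$jar" >&2
            status=1
        fi
    done
    return $status
}

# Example, run from within the BioPSI_Suplimenter directory:
#   check_suplimenter_libs lib && ant jar
```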

Creating the Database

Enter MySQL using a command like the following:

mysql -h <host> -u <admin> -p -A

Here, <host> should be replaced by the name of the database server host, and <admin> by the name of a user with administrative privileges. For example:

mysql -h myhost -u admin -p -A

Then create a database and user using commands of the following form:

create database <database>;
create user '<username>'@'%' identified by '<password>';
grant all privileges on <database>.* to '<username>'@'%';

For example, with <database> given as irefindex, <username> given as irefindex, and a substitution for <password>:

create database irefindex;
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';
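
The three statements can also be generated from the chosen names, which avoids mistyping one of the substitutions. This is a sketch only: print_create_sql is a hypothetical helper, and it assumes that none of the values contains a single quote.

```shell
# print_create_sql is a hypothetical helper, not part of the iRefIndex tools.
# Usage: print_create_sql <database> <username> <password>
# Assumption: none of the arguments contains a single quote character.
print_create_sql() {
    database=$1
    username=$2
    password=$3
    # %% in a printf format produces a literal % (the MySQL host wildcard).
    printf "create database %s;\n" "$database"
    printf "create user '%s'@'%%' identified by '%s';\n" "$username" "$password"
    printf "grant all privileges on %s.* to '%s'@'%%';\n" "$database" "$username"
}

# Example:
#   print_create_sql irefindex irefindex mysecretpassword
```

The output can be reviewed and then pasted into the MySQL session opened above.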