Running mpiBLAST
From Bootable Cluster CD
mpiBLAST is a tool for searching large databases of nucleotides or proteins. (For more information on mpiBLAST, please check out About mpiBLAST). This page is a walkthrough of using the BCCD to perform an mpiBLAST search. This tutorial assumes that you have booted the BCCD on one or more machines already. It also assumes that you are running MPICH. (MPICH is the default environment on the BCCD. To double check this, or set it up if you've switched to LAM MPI, see Running MPICH.)
Download and Install mpiBLAST
There are two options for running mpiBLAST with the BCCD. One is to download a larger-than-normal iso file that contains mpiBLAST and the D. melanogaster database, available here. If you do that, skip down to the next section.
Otherwise, if you'd like to use a traditional BCCD, first we need to download and install mpiBLAST:
- Log in to the BCCD environment as root. If you have already logged in as bccd, become root by issuing the su - command (su stands for "substitute user"). For example: [bccd@host129]> su - You will be prompted for the root user's password, which should be letmein (or see the login splash screen for your image's root password).
- Type list-packages at the prompt. From the list that appears, select mpi-blast and then OK.
- Type logout to return to the prompt for the bccd user.
Error? Aargh!
It's possible that the download will go smoothly and then you'll receive a message, something like this:
Attempting to download mpiblast.tar.gz from http://bccd.cs.uni.edu/packages/i386/2.2... OK
Attempting to download mpiblast.tar.gz.sig from http://bccd.cs.uni.edu/packages/i386/2.2... OK
Verifying signature for mpiblast...
gpgv: Signature made Wed Jan 17 18:44:27 2007 UTC using DSA key ID 5BDEBA02
gpgv: Good signature from "BCCD Packages (2.2) <bccd@bccd.cs.uni.edu>"
OK
Unpacking mpiblast...
tar: ./bin/testval: Wrote only 6656 of 10240 bytes
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers
tar: Error exit delayed from previous errors
FAILED
This means the system you're running on doesn't have enough RAM to install mpiBLAST. Because the BCCD does not touch the hard drive, the space available for downloads is limited to RAM. mpiBLAST itself is not terribly large, but the databases of nucleotide and amino acid sequences are. A USB flash drive can be used to add to the available space (see Supplementing RAM).
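Before installing or downloading, it can help to check how much space is actually free. The following is a minimal sketch, not part of the BCCD tooling: the function names avail_kb and check_space are made up for illustration, and the 200 MB default threshold is an arbitrary example rather than a documented mpiBLAST requirement.

```shell
# Print the space available (in KB) on the filesystem that holds a path.
# df -P gives POSIX-format output; field 4 of line 2 is "Available".
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Compare the free space under a path with a wanted amount in KB.
# The 204800 KB (200 MB) default is an illustrative assumption only.
check_space() {
    need_kb=${2:-204800}
    have_kb=$(avail_kb "$1")
    if [ "$have_kb" -lt "$need_kb" ]; then
        echo "low: ${have_kb} KB free under $1, want ${need_kb} KB"
        return 1
    fi
    echo "ok: ${have_kb} KB free under $1"
}
```

For example, check_space /home/bccd reports whether the home directory has at least the default amount free; pass a second argument to use a different threshold in KB.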
Using mpiBLAST
This section assumes that you'll be running mpiblast using MPICH, version 2. Unless you've specifically configured your BCCD to run LAM instead of MPICH, you're already running it. If this doesn't sound familiar, you can assume you're ok.
mpiBLAST is used in much the same way as NCBI BLAST, and it uses the same variables that are available for NCBI BLAST, which means that you will need to have a .ncbirc file in your home directory. This file tells mpiBLAST where to find its databases (the Shared variable) and its workspace (the Local variable). To set it up, log in as user bccd with the password you specified when booting up.
The .ncbirc file that is used for this looks like this:
[mpiBLAST]
Shared=/home/bccd/blastdb
Local=/home/bccd/blastdb
If you don't have such a file in your home directory (which you don't if you haven't made one yourself), copy the above into the file ~/.ncbirc using nedit, nano, vi or your other favorite text editor not listed here.
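If you would rather not open an editor, the same file can be written from the shell. This is a small sketch under a few assumptions: write_ncbirc is a hypothetical helper name, it points both Shared and Local at a blastdb directory under the given home directory (as in the example above), and it creates that directory in case it does not exist yet.

```shell
# Write a minimal .ncbirc into the given home directory, unless one exists.
# Both Shared and Local point at <home>/blastdb, mirroring the example file.
write_ncbirc() {
    home_dir=${1:-$HOME}
    rc="$home_dir/.ncbirc"
    if [ -e "$rc" ]; then
        echo "$rc already exists, leaving it alone"
        return 0
    fi
    # Assumption: create the database directory up front so it is there
    # when the database is formatted later.
    mkdir -p "$home_dir/blastdb"
    printf '[mpiBLAST]\nShared=%s/blastdb\nLocal=%s/blastdb\n' \
        "$home_dir" "$home_dir" > "$rc"
    echo "wrote $rc"
}
```

Calling write_ncbirc with no argument uses $HOME. An existing file is never overwritten, so it is safe to run more than once.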
After setting up your .ncbirc file, there are four steps to running mpiblast, and the first has already been done for you using the Drosophila melanogaster database.
Download a database from NIH (National Institutes of Health)
In order to search a database using mpiBLAST, you first have to have a database. If you're running the larger-than-normal BCCD iso or if you used list-packages to download and install mpiBLAST, you should already have one database, the Drosophila melanogaster (fruit fly) nucleotide database. You can download other databases (see the bottom of the page for links to additional databases) using the wget command. For instance, if you were going to download the drosoph.nt database again (which would be pretty boring since you already have a copy), it would look something like this:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
--17:00:38-- ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
=> `drosoph.nt.gz'
Resolving ftp.ncbi.nlm.nih.gov... 165.112.7.10
Connecting to ftp.ncbi.nlm.nih.gov|165.112.7.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /blast/db/FASTA ... done.
==> PASV ... done. ==> RETR drosoph.nt.gz ... done.
Length: 36,924,008 (35M) (unauthoritative)
100%[====================================>] 36,924,008 326.82K/s ETA 00:00
17:02:28 (338.88 KB/s) - `drosoph.nt.gz' saved [36924008]
After downloading, be sure to decompress it, using gunzip <database name>.
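After decompressing, a quick sanity check is to confirm the file looks like FASTA and to count its sequences. The sketch below assumes the usual FASTA convention that every record starts with a ">" header line; the function names are invented for this example.

```shell
# Succeed if the first line of the file is a FASTA header (starts with ">").
looks_like_fasta() {
    head -n 1 "$1" | grep -q '^>'
}

# Count the records in a FASTA file by counting header lines.
count_sequences() {
    grep -c '^>' "$1"
}
```

For the drosoph.nt database, the count from count_sequences should agree with the number of sequence entries that mpiformatdb reports when it formats the database (1170 in the sample output further down this page).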
Format the database using mpiformatdb
Now it's time to separate the database into chunks that can be accessed by different processors. This is done with mpiformatdb. The --nfrags option specifies the number of fragments the database should be divided into; you'll want the same number of fragments as processors you'll use for running mpiBLAST. In this instance, we're splitting it four ways.
gray@proto:~$ mpiformatdb --nfrags=4 -i /fastadb/drosoph.nt -pF --quiet
Reading input file
Done, read 1534943 lines
Reordering 1170 sequence entries
Breaking drosoph.nt (122 MB) into 4 fragments
Executing: formatdb -p F -i /tmp/reorderoUDWYw -N 4 -n /home/bccd/blastdb/drosoph.nt -o T
Removed /tmp/reorderoUDWYw
Created 4 fragments.
gray@proto:~$ ls blastdb
drosoph.nt formatdb.log
If you're using a different database you downloaded, be sure to specify that path rather than /fastadb/drosoph.nt. The output, the different chunks of the database, will then be written to the shared folder specified in the .ncbirc file. (If you used the default above, this is ~/blastdb; verify with ls ~/blastdb.)
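Since the number of fragments should match the number of processors, one option is to derive it from the machines file instead of typing it by hand. The sketch below is only an illustration: it assumes the machines file (such as the ~/machines file passed to mpirun on this page) lists one processor per non-blank, non-comment line, and the helper names count_hosts and format_cmd are hypothetical. It prints the mpiformatdb command rather than running it, using only the flags shown above.

```shell
# Count usable entries in a machines file, skipping blank and "#" lines.
# Assumption: one line per processor you intend to use.
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

# Print an mpiformatdb command whose --nfrags matches the machines file.
# $1 = machines file, $2 = path to the FASTA database.
format_cmd() {
    echo "mpiformatdb --nfrags=$(count_hosts "$1") -i $2 -pF"
}
```

For example, format_cmd ~/machines /fastadb/drosoph.nt prints a command you can inspect and then paste at the prompt.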
Error again?!
If you see a long list of the phrase [formatdb] FATAL ERROR: File write error, you've run out of RAM. Oops! See Customization Tips and Tricks: Supplementing RAM.
Create a test sequence file
Finally we're ready to run mpiBLAST against a test sequence. You can either create your own by pasting it in:
gray@proto:~/blastdb$ cat > blast.in
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
(Remember to press Ctrl-D to end the input from stdin.)
Or you can use the one already on the BCCD. Just cp /fastadb/test.in ~/blastdb to copy it to your working directory.
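If you pasted the sequence by hand, it is worth checking that nothing was lost or mangled on the way in. The sketch below counts the sequence letters in a query file, ignoring the ">" header line and any whitespace. query_length is a made-up helper name, not an mpiBLAST tool.

```shell
# Count sequence letters in a FASTA query file.
# Header lines (starting with ">") and all whitespace are excluded.
query_length() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}
```

Running query_length blast.in on the pasted test sequence lets you compare the count against the "(560 letters)" that BLAST reports for the query in the results file.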
Then, run mpiblast as follows:
gray@proto:~$ mpirun -np 4 -machinefile ~/machines /bin/mpiblast -d drosoph.nt -i blast.in -p blastn -o results.txt
gray@proto:~$ ls
[other stuff..] results.txt
- -np is the number of processors to run on (preferably the same as the number of fragments you divided the database into!)
- -d is the database file to search against
- -i specifies the input file
- -p is the BLAST program name (blastn here, because both the query and the drosoph.nt database are nucleotide sequences)
- -o specifies where to put the output
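The options above can be wrapped in a small helper so that the query file is checked before the job is launched. This is a sketch, not an official wrapper: run_mpiblast and the DRY_RUN variable are invented for this example, while the mpirun arguments, the /bin/mpiblast path, the ~/machines file, and the drosoph.nt database name are taken from the command shown above.

```shell
# Run (or, with DRY_RUN set, just print) the mpiblast command from this page.
# $1 = number of processors, $2 = query file, $3 = output file.
run_mpiblast() {
    np=$1; query=$2; out=$3
    # Fail early if the query file is missing or unreadable.
    if [ ! -r "$query" ]; then
        echo "cannot read query file: $query" >&2
        return 1
    fi
    set -- mpirun -np "$np" -machinefile "$HOME/machines" /bin/mpiblast \
        -d drosoph.nt -i "$query" -p blastn -o "$out"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"    # show the command without executing it
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 before calling run_mpiblast 4 ~/blastdb/blast.in results.txt prints the full command, which is a cheap way to confirm the paths before starting a search.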
The results file should look similar to this:
BLASTN 2.2.10 [Oct-19-2004]
Reference: Aaron E. Darling, Lucas Carey, and Wu-chun Feng,
"The design, implementation, and evaluation of mpiBLAST."
In Proceedings of ClusterWorld 2003, June 24-26 2003, San Jose, CA
Query= Test
(560 letters)
Database: /fastadb/drosoph.nt
1170 sequences; 122,655,632 total letters
Score E
Sequences producing significant alignments: (bits) Value
gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold ... 36 0.86
gb|AE002615.2|AE002615 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003441.1|AE003441 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003525.2|AE003525 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003587.2|AE003587 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003673.2|AE003673 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003648.1|AE003648 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003628.1|AE003628 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003431.2|AE003431 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003484.1|AE003484 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003495.2|AE003495 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE002665.2|AE002665 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003740.2|AE003740 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003723.3|AE003723 Drosophila melanogaster genomic scaffold ... 34 3.4
gb|AE003447.2|AE003447 Drosophila melanogaster genomic scaffold ... 34 3.4
>gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold 142000013386035 section 6 of
105, complete sequence
Length = 329362
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
Query: 96 taaattaaaattttattg 113
||||||||||||||||||
Sbjct: 111644 taaattaaaattttattg 111627
>gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold 142000013385220, complete
sequence
Length = 48123
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
Query: 97 aaattaaaattttattga 114
||||||||||||||||||
Sbjct: 40704 aaattaaaattttattga 40687
>gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold 142000013386035 section 23 of
105, complete sequence
Length = 225827
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
Query: 107 tttattgacttaggtcac 124
||||||||||||||||||
Sbjct: 151021 tttattgacttaggtcac 151004
>gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold 142000013386053 section 10 of
30, complete sequence
Length = 308092
Score = 36.2 bits (18), Expect = 0.86
Identities = 18/18 (100%)
Strand = Plus / Minus
<<snipped>>
Database: /fastadb/drosoph.nt
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,663,804
Number of sequences in database: 292
Database: /fastadb/drosoph.nt.001
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,664,011
Number of sequences in database: 293
Database: /fastadb/drosoph.nt.002
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,664,004
Number of sequences in database: 293
Database: /fastadb/drosoph.nt.003
Posted date: Dec 6, 2006 5:13 PM
Number of letters in database: 30,663,813
Number of sequences in database: 292
Lambda K H
1.37 0.711 1.31
Gapped
Lambda K H
1.37 0.711 1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 35,658
Number of Sequences: 1170
Number of extensions: 35658
Number of successful extensions: 72
Number of sequences better than 10.0: 18
Number of HSP's better than 10.0 without gapping: 18
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 53
Number of HSP's gapped (non-prelim): 19
length of query: 1122
length of database: 122,655,632
effective HSP length: 18
effective length of query: 542
effective length of database: 122,634,572
effective search space: 66467938024
effective search space used: 66467938024
T: 0
A: 0
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 17 (34.2 bits)
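To pull just the hit summary out of a results file like the one above, a short awk filter is enough. This sketch assumes the layout shown in the sample: summary lines start with "gb|" and end with the bit score and the E value. list_hits is a hypothetical helper name, and results from other databases may use different identifier prefixes.

```shell
# Print "id score evalue" for each summary line in a BLAST results file.
# $1 = results file, $2 = optional maximum E value (default 10).
# Alignment headers start with ">gb|" and are therefore not matched.
list_hits() {
    awk -v max="${2:-10}" \
        '/^gb[|]/ && $NF + 0 <= max + 0 { print $1, $(NF-1), $NF }' "$1"
}
```

For example, list_hits results.txt 1 keeps only the hits with an E value of 1 or less.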
FMI
For more information...
- For more detail about mpiformatdb or mpiblast, please refer to the mpiBLAST home page at http://mpiblast.lanl.gov/
- For more detail about the NCBI toolbox, please refer to the NCBI home page at http://www.ncbi.nlm.nih.gov
- For more detail about MPICH, please refer to the MPICH home page at http://www-unix.mcs.anl.gov/mpi/mpich/
- To download a BLAST database from NCBI, please refer to the NCBI BLAST databases at ftp://ftp.ncbi.nih.gov/blast/db, or the FASTA databases at ftp://ftp.ncbi.nih.gov/blast/db/FASTA.
- mpiBLAST comes from Los Alamos National Laboratory (http://www.lanl.gov/)

