Running mpiBLAST

mpiBLAST is a tool for searching large databases of nucleotide or protein sequences. (For more information on mpiBLAST, see About mpiBLAST.) This page walks through using the BCCD to perform an mpiBLAST search. It assumes that you have already booted the BCCD on one or more machines and that you are running MPICH. (MPICH is the default environment on the BCCD. To double-check this, or to set it up if you've switched to LAM MPI, see Running MPICH.)

Download and Install mpiBLAST

There are two options for running mpiBLAST with the BCCD. One is to download a larger-than-normal iso file that contains mpiBLAST and the D. melanogaster database, available here. If you do that, skip down to the next section.

Otherwise, if you'd like to use a traditional BCCD, first we need to download and install mpiBLAST:

  • Log in to the BCCD environment as root. If you have already logged in as bccd, become root with the su - command (su stands for "substitute user"). For example: [bccd@host129]>su -     You will be prompted for the root user's password, which should be letmein (or see the login splash screen for your image's root password).
  • Type list-packages at the prompt. From the list that appears, select mpi-blast and then OK.
  • Type logout to return to the bccd user's prompt.

Error? Aargh!

It's possible that the download will go smoothly and then you'll receive a message, something like this:

Attempting to download mpiblast.tar.gz from http://bccd.cs.uni.edu/packages/i386/2.2... OK
Attempting to download mpiblast.tar.gz.sig from http://bccd.cs.uni.edu/packages/i386/2.2... OK
Verifying signature for mpiblast...
gpgv: Signature made Wed Jan 17 18:44:27 2007 UTC using DSA key ID 5BDEBA02
gpgv: Good signature from "BCCD Packages (2.2) <bccd@bccd.cs.uni.edu>"
OK
Unpacking mpiblast... tar: ./bin/testval: Wrote only 6656 of 10240 bytes
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers
tar: Error exit delayed from previous errors
FAILED

This means the system you're running on doesn't have enough RAM to install mpiBLAST. Because the BCCD does not touch the hard drive, the space available for downloads is limited to RAM. mpiBLAST itself is not terribly large, but the databases of nucleotide and amino acid sequences are. A USB flash drive can be used to add space (see Supplementing RAM).
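
Before installing or downloading anything, it can help to check how much space is left. The following is a minimal sketch, assuming the standard df and awk utilities are present on your image. The helper names avail_kb and enough_space are made up for this example and are not BCCD commands.

```shell
# Print the space (in KB) available on the filesystem that holds directory $1.
# avail_kb is a hypothetical helper name, not a BCCD command.
avail_kb() {
    # -P forces one output line per filesystem; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least $2 KB are free under directory $1.
enough_space() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}
```

For example, enough_space "$HOME" 200000 || echo "low on space" prints a warning when less than roughly 200 MB is free. The 200000 figure is only illustrative; choose a threshold that fits the database you plan to use.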

Using mpiBLAST

This section assumes that you'll be running mpiblast using MPICH, version 2. Unless you've specifically configured your BCCD to run LAM instead of MPICH, you're already running it. If this doesn't sound familiar, you can assume you're ok.

mpiBLAST is used in much the same way as NCBI BLAST. It uses the same variables that are available for NCBI BLAST, which means that you will need a .ncbirc file in your home directory. This file tells mpiBLAST where to find its databases (the Shared variable) and its workspace (the Local variable). To set it up, log in as user bccd with the password you specified when booting up.

The .ncbirc file that is used for this looks like this:

  [mpiBLAST]
  Shared=/home/bccd/blastdb
  Local=/home/bccd/blastdb

If you don't have such a file in your home directory (and you won't unless you've made one yourself), copy the above into the file ~/.ncbirc using nedit, nano, vi, or your favorite text editor.
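
If you would rather not open an editor, the same file can be written from the shell. This is a sketch only: write_ncbirc is a hypothetical helper name, and creating the blastdb directory up front is an assumption made here so that the Shared and Local paths exist.

```shell
# Write the .ncbirc shown above into a home directory, unless one exists.
# write_ncbirc is a hypothetical helper name; $1 defaults to $HOME.
write_ncbirc() {
    home_dir=${1:-$HOME}
    rc="$home_dir/.ncbirc"
    if [ -e "$rc" ]; then
        echo "$rc already exists; leaving it untouched."
        return 0
    fi
    # Assumption: create the database directory so both paths are valid.
    mkdir -p "$home_dir/blastdb"
    # Shared and Local point at the same directory, as in the example above.
    cat > "$rc" <<EOF
[mpiBLAST]
Shared=$home_dir/blastdb
Local=$home_dir/blastdb
EOF
    echo "Wrote $rc"
}
```

Calling write_ncbirc with no argument targets your own home directory, which for the bccd user gives the /home/bccd/blastdb paths shown above.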

After setting up your .ncbirc file, there are four steps to running mpiblast, and the first has already been done for you using the Drosophila melanogaster database.

Download a database from NIH (National Institutes of Health)

In order to search a database using mpiBLAST, you first need a database. If you're running the larger-than-normal BCCD ISO or if you used list-packages to download and install mpiBLAST, you should already have one database, the Drosophila melanogaster (fruit fly) nucleotide database. You can download other databases (see the bottom of the page for links to additional databases) using the wget command. For instance, if you were going to download the drosoph.nt database again (which would be pretty boring since you already have a copy), it would look something like this:

 wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
 --17:00:38--  ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
            => `drosoph.nt.gz'
 Resolving ftp.ncbi.nlm.nih.gov... 165.112.7.10
 Connecting to ftp.ncbi.nlm.nih.gov|165.112.7.10|:21... connected.
 Logging in as anonymous ... Logged in!
 ==> SYST ... done.    ==> PWD ... done.
 ==> TYPE I ... done.  ==> CWD /blast/db/FASTA ... done.
 ==> PASV ... done.    ==> RETR drosoph.nt.gz ... done.
 Length: 36,924,008 (35M) (unauthoritative)
 
 100%[====================================>] 36,924,008   326.82K/s    ETA 00:00
 
 17:02:28 (338.88 KB/s) - `drosoph.nt.gz' saved [36924008]

After downloading, be sure to decompress the file using gunzip <database name>.
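
A quick sanity check after decompressing can catch a truncated or mislabeled download. The sketch below assumes gunzip, head, and grep are available; unpack_fasta is a hypothetical helper name used only for illustration.

```shell
# Decompress a downloaded database and confirm it looks like FASTA.
# unpack_fasta is a hypothetical helper name.
unpack_fasta() {
    gz=$1
    fasta=${gz%.gz}
    # gunzip replaces e.g. drosoph.nt.gz with drosoph.nt.
    gunzip "$gz" || return 1
    # A FASTA file begins with a ">" header line.
    if [ "$(head -c 1 "$fasta")" = ">" ]; then
        echo "$fasta: $(grep -c '^>' "$fasta") sequences"
    else
        echo "$fasta does not start with a FASTA header" >&2
        return 1
    fi
}
```

For example, unpack_fasta drosoph.nt.gz leaves drosoph.nt in the current directory and reports how many header lines it contains.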

Format the database using mpiformatdb

Now it's time to separate the database into chunks that can be accessed by different processors. This is done with mpiformatdb. Its --nfrags option specifies the number of fragments the database should be divided into; you'll want the same number of fragments as processors you'll use for running mpiBLAST. In this instance, we're splitting it into four fragments.

gray@proto:~$ mpiformatdb --nfrags=4 -i /fastadb/drosoph.nt -pF --quiet 
Reading input file
Done, read 1534943 lines
Reordering 1170 sequence entries
Breaking drosoph.nt (122 MB) into 4 fragments
Executing: formatdb -p F -i /tmp/reorderoUDWYw -N 4 -n /home/bccd/blastdb/drosoph.nt -o T 
Removed /tmp/reorderoUDWYw
Created 4 fragments.
gray@proto:~$ ls blastdb
drosoph.nt  formatdb.log

If you're using a different database that you downloaded, be sure to specify its path rather than /fastadb/drosoph.nt. The output, the different chunks of the database, is written to the shared folder specified in the .ncbirc file. (If you used the example above, this is ~/blastdb; verify with ls ~/blastdb.)
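
Since the fragment count should match the number of processors, one option is to derive it from the ~/machines file that the mpirun command later on this page uses. This sketch assumes that file lists one host per line; count_hosts and formatdb_cmd are hypothetical helper names, and formatdb_cmd only prints the command rather than running it.

```shell
# Count usable entries in a machines file (assumed: one host per line).
# count_hosts is a hypothetical helper; blank lines and "#" comments are skipped.
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

# Print (not run) the mpiformatdb command for machines file $1 and FASTA file $2.
formatdb_cmd() {
    n=$(count_hosts "$1")
    echo "mpiformatdb --nfrags=$n -i $2 -pF --quiet"
}
```

Running formatdb_cmd ~/machines /fastadb/drosoph.nt shows the command to use. An entry that stands for several processors still counts as one here, so adjust the number by hand if your machines file is written that way.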

Error again?!

If you see a long list of the phrase [formatdb] FATAL ERROR: File write error, you've run out of RAM. Oops! See Customization Tips and Tricks: Supplementing RAM.

Create a test sequence file

Finally we're ready to run mpiBLAST against a test sequence. You can either create your own by pasting it in:

gray@proto:~/blastdb$ cat > blast.in 
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

(Remember to press Ctrl-D to end input from stdin.)

Or you can use the one already on the BCCD: cp /fastadb/test.in ~/blastdb copies it to your working directory. If you use that file, pass its name (test.in) to the -i option below instead of blast.in.
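
If you pasted the sequence by hand, counting its letters is a cheap way to spot a bad paste. The sample results further down this page report 560 letters for the Test query, so a different count for that sequence suggests something was lost. This is a sketch; query_letters is a hypothetical helper name.

```shell
# Count the residues in a FASTA query, ignoring header lines and line breaks.
# query_letters is a hypothetical helper name.
query_letters() {
    grep -v '^>' "$1" | tr -d '\n\r ' | wc -c | tr -d ' '
}
```

For example, query_letters blast.in prints the number of sequence characters in blast.in.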

Then, run mpiblast as follows:

gray@proto:~$ mpirun -np 4 -machinefile ~/machines /bin/mpiblast -d drosoph.nt -i blast.in -p blastn -o results.txt
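gray@proto:~$ ls
[other stuff..]  results.txt

The options used in the mpirun command are:

  • -np is the number of processors to run on (preferably the same as the number of fragments you divided the database into!)
  • -machinefile names the file that lists the machines to run on
  • -d is the database to search against
  • -i specifies the input (query) file
  • -p is the BLAST program name (blastn here, since a nucleotide query is searched against a nucleotide database)
  • -o specifies where to put the output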
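
To avoid retyping the long command line, its parts can be assembled by a small helper. The sketch below only prints the command, so nothing is launched. mpiblast_cmd is a hypothetical name, and the ~/machines and /bin/mpiblast paths are simply copied from the example above.

```shell
# Assemble the mpirun command line from its parts and print it.
# mpiblast_cmd is a hypothetical helper; it only echoes the command.
mpiblast_cmd() {
    np=$1; db=$2; query=$3; out=$4
    # Refuse to build a command for a query file that cannot be read.
    if [ ! -r "$query" ]; then
        echo "query file $query is missing or unreadable" >&2
        return 1
    fi
    echo "mpirun -np $np -machinefile ~/machines /bin/mpiblast -d $db -i $query -p blastn -o $out"
}
```

For example, mpiblast_cmd 4 drosoph.nt blast.in results.txt prints the command shown earlier, provided blast.in exists in the current directory.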

The results file should look similar to this:

BLASTN 2.2.10 [Oct-19-2004]


Reference: Aaron E. Darling, Lucas Carey, and Wu-chun Feng,
"The design, implementation, and evaluation of mpiBLAST."
In Proceedings of ClusterWorld 2003, June 24-26 2003, San Jose, CA


Query= Test
         (560 letters)

Database: /fastadb/drosoph.nt 
           1170 sequences; 122,655,632 total letters



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE002615.2|AE002615 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003441.1|AE003441 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003525.2|AE003525 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003587.2|AE003587 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003673.2|AE003673 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003648.1|AE003648 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003628.1|AE003628 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003431.2|AE003431 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003484.1|AE003484 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003495.2|AE003495 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE002665.2|AE002665 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003740.2|AE003740 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003723.3|AE003723 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003447.2|AE003447 Drosophila melanogaster genomic scaffold ...    34   3.4  

>gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold 142000013386035 section 6 of
              105, complete sequence
          Length = 329362

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                                
Query: 96     taaattaaaattttattg 113
              ||||||||||||||||||
Sbjct: 111644 taaattaaaattttattg 111627


>gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold 142000013385220, complete
             sequence
          Length = 48123

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                               
Query: 97    aaattaaaattttattga 114
             ||||||||||||||||||
Sbjct: 40704 aaattaaaattttattga 40687


>gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold 142000013386035 section 23 of
              105, complete sequence
          Length = 225827

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                                
Query: 107    tttattgacttaggtcac 124
              ||||||||||||||||||
Sbjct: 151021 tttattgacttaggtcac 151004


>gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold 142000013386053 section 10 of
              30, complete sequence
          Length = 308092

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                                
<<snipped>>


  Database: /fastadb/drosoph.nt
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,663,804
  Number of sequences in database:  292
  
  Database: /fastadb/drosoph.nt.001
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,664,011
  Number of sequences in database:  293
  
  Database: /fastadb/drosoph.nt.002
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,664,004
  Number of sequences in database:  293
  
  Database: /fastadb/drosoph.nt.003
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,663,813
  Number of sequences in database:  292
  
Lambda     K      H
    1.37    0.711     1.31 

Gapped
Lambda     K      H
    1.37    0.711     1.31 


Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 35,658
Number of Sequences: 1170
Number of extensions: 35658
Number of successful extensions: 72
Number of sequences better than 10.0: 18
Number of HSP's better than 10.0 without gapping: 18
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 53
Number of HSP's gapped (non-prelim): 19
length of query: 1122
length of database: 122,655,632
effective HSP length: 18
effective length of query: 542
effective length of database: 122,634,572
effective search space: 66467938024
effective search space used: 66467938024
T: 0
A: 0
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 17 (34.2 bits)
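
For a quick overview of a long results file, the hit summary table can be pulled out with awk. This is a sketch that assumes the layout shown above, where the table follows the "Sequences producing significant alignments" line and ends at the first line starting with ">". list_hits is a hypothetical helper name.

```shell
# Print "id <TAB> score <TAB> e-value" for each row of the hit summary table.
# list_hits is a hypothetical helper name.
list_hits() {
    awk '
        /^Sequences producing significant alignments/ { in_table = 1; next }
        /^>/                                          { in_table = 0 }
        # Rows end with the bit score and the E value.
        in_table && NF >= 3 { print $1 "\t" $(NF-1) "\t" $NF }
    ' "$1"
}
```

Running list_hits results.txt | wc -l gives the number of rows in the table. The sample above has 18 rows, which agrees with its "Number of sequences better than 10.0: 18" line.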

FMI

For more information...
