The parameter for setting the number of mismatches in standalone blast

In a standalone BLAST search, is there a parameter that can be specified to fix the number of mismatches allowed in an alignment? I went through the parameters and could only find parameters to set the score:

-q => Penalty for a nucleotide mismatch (blastn only) [Integer] default = -3

-r => Reward for a nucleotide match (blastn only) [Integer] default = 1

What do I do if I only want hits with a maximum of 2 mismatches or gaps per alignment?


There is no direct option like that, but you can set a percentage identity filter with -perc_identity.

However, this is only for reporting; BLAST will still perform all the alignments. That is why I suggested that you use bowtie.
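
For illustration, a minimal blastn invocation along these lines (the file and database names are placeholders) could look like:

$ blastn -query query.fa -db mydb -perc_identity 98 -out hits.txt

Keep in mind that a 98% identity cutoff only approximates "at most 2 mismatches" for alignments of roughly 100 bases; shorter or longer alignments will behave differently.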

EDIT-1 (getting long for comment)

If you want a word type of search then use the -n option. But it always starts from the 5' end, which is quite justified in the case of miRNA because the seed region is at the 5' end.

Setting -n to 0 means no mismatches in the seed region. The -l option defines the seed length; I would advise using something like 10. -e refers to the maximum mismatches in the non-seed region, but it is not the number of mismatches; it is the maximum tolerated sum of Phred scores at mismatched positions (FASTA files are assigned a default Phred score of 40 at all residues, so for 2 mismatches you have to specify -e 80).
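
As a rough sketch only (assuming bowtie version 1, with placeholder index and file names), the options described above would be combined along these lines:

$ bowtie -f -a -n 0 -l 10 -e 80 genome_index mirna_queries.fa hits.bwt

Here -f selects FASTA input, -a reports all alignments, -n 0 allows no seed mismatches, -l 10 sets the seed length, and -e 80 tolerates roughly 2 mismatches outside the seed at the default Phred score of 40.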

EDIT-2

Yes. Increasing mismatch and gap penalties would make the search more stringent if you have already specified a cutoff on score.


BLAST (biotechnology)

In bioinformatics, BLAST (basic local alignment search tool) [2] is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on sequence similarity.


B. BLAST Search Parameters

Limit by Organism

A BLAST search may be limited by organism. The entry field will suggest completions once a user starts typing. A checkbox will exclude rather than include the organism in the search.

Limit by Entrez Query

A BLAST search can be limited to the result of an Entrez query against the database chosen. This restricts the search to a subset of entries from that database fitting the requirement of the Entrez query. Terms normally accepted by Entrez nucleotide or protein searches are accepted here. Examples are given below.

    protease NOT hiv1[organism]

This will limit a BLAST search to all proteases, except those in HIV 1.

    1000:2000[slen]

This limits the search to entries with lengths between 1000 and 2000 bases for nucleotide entries, or 1000 and 2000 residues for protein entries.

    mouse[organism] AND biomol_mrna[prop]

This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.

    10000:100000[molecular weight]

This is yet another example, which limits the search to protein sequences with calculated molecular weights between 10 kDa and 100 kDa.

    specimen_voucher[properties]

This limits the search to entries that are annotated with a /specimen_voucher qualifier on the source feature.

For help in constructing Entrez queries please see the "Writing Advanced Search Statements" section of the Entrez Help document. Knowing the content of a database and applying the Entrez terms accordingly is important. For example, biomol_mrna[prop] should not be applied to the htgs or chromosome databases, since they do not contain mRNA entries.

Compositional adjustments

Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called "composition-based statistics" (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such "compositional score matrix adjustment" may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.

[1] Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res. 29:2994-3005.
[2] Yu, Y.-K., Wootton, J.C. and Altschul, S.F. (2003) "The compositional adjustment of amino acid substitution matrices," Proc. Natl. Acad. Sci. USA 100:15688-15693.
[3] Yu, Y.-K. and Altschul, S.F. (2005) "The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions," Bioinformatics 21:902-911.
[4] Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.-K. (2005) "Protein database searches using compositionally adjusted substitution matrices," FEBS J 272(20):5101-9.

Filter

This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

It is not unusual for nothing at all to be masked by SEG when applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. This can also lead to a search error when the default setting is used.

This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (e.g., htgs). This filter should be checked for genomic queries to prevent potential problems that may arise from the numerous and often spurious matches to those repeat elements.

For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions.

Filter (Mask for lookup table only)

BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them. This option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked). The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.

With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.

One can use different combinations of the above filter options to achieve optimal search results.

Word-size

BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied. The webpage allows the word-sizes 2, 3, and 6.

Expect

This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.

Reward and Penalty for Nucleotide Programs

Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [1].

[1] States DJ, Gish W, and Altschul SF (1991) METHODS: A companion to Methods in Enzymology 3:66-70.
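
In the standalone BLAST+ blastn application described later in this document, these values are set with the -reward and -penalty options. A sketch for sequences that are roughly 95% conserved (the query file and database name are placeholders):

$ blastn -query query.fsa -db nt -reward 1 -penalty -2 -out results.txt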

Matrix and Gap Costs

A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with (see the BLAST Frequently Asked Questions). See more information on BLAST substitution matrices.

The pull-down menu shows the gap costs available for the chosen matrix; only a limited number of combinations are supported. Increasing the gap costs results in alignments with fewer gaps.

PSI-BLAST can save the Position Specific Score Matrix constructed through iterations. The PSSM thus constructed can be used in searches against other databases with the same query by copying and pasting the encoded text into the PSSM field.

  1. Run a protein BLAST search.
  2. Check the PSI-BLAST box on formatting page.
  3. Click the "Format" Button.
  4. On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
  5. Select the Download link at the top of the page and download the PSSM to your computer.

To use the PSSM in a new protein BLAST search against other databases:

  1. Open a new protein BLAST page.
  2. Select PSI-BLAST as the Algorithm under "Program Selection" (this may already be set).
  3. Select the "+" next to "Algorithm parameters" at the bottom of the search page.
  4. Scroll to the "PSI/PHI/DELTA BLAST" section and use the "Choose File" button to upload the PSSM that you saved in step 5 above.
  5. Select a different target database.
  6. Click the "BLAST" button to start the search.

PHI-BLAST Pattern

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: which other protein sequences also contain an occurrence of P and are homologous to S in the vicinity of that occurrence?


1. Introduction

This manual documents the BLAST (Basic Local Alignment Search Tool) command line applications developed at the National Center for Biotechnology Information (NCBI). These applications have been revamped to provide an improved user interface, new features, and performance improvements compared to their counterparts in the NCBI C Toolkit. Hereafter we shall distinguish the C Toolkit BLAST command line applications from these command line applications by referring to the latter as the BLAST+ applications, which have been developed using the NCBI C++ Toolkit ( http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=toolkit.TOC&depth=2 ).

Please feel free to contact us with any questions, feedback, or bug reports at blast-help@ncbi.nlm.nih.gov.

2. Installation

The BLAST+ applications are distributed in executable and source code format. For the executable formats we provide installers as well as tarballs; the source code is only provided as a tarball. These are freely available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/. Please be sure to use the most recent available version; this will be indicated in the file name (for instance, in the sections below, version 2.2.18 is listed, but this should be replaced accordingly).

2.1 Windows

Download the executable installer ncbi-blast-2.2.18+.exe and double click on it. After accepting the license agreement, select the install location, click “Install”, and then “Close”.

2.2 MacOSX

For users without administrator privileges: Download the ncbi-blast-2.2.18+-universal-macosx.tar.gz tarball and follow the procedure described in Other Unix platforms.

For users with administrator privileges and machines running MacOSX version 10.5 or higher: Download the ncbi-blast-2.2.18+.dmg installer and double click on it. Double click the newly mounted ncbi-blast-2.2.18+ volume, double click on ncbi-blast-2.2.18+.pkg and follow the instructions in the installer. By default the BLAST+ applications are installed in /usr/local/ncbi/blast, overwriting its previous contents (an uninstaller is provided and its use is recommended when upgrading a BLAST+ installation).

2.3 RedHat Linux

Download the appropriate *.rpm file for your platform and either install or upgrade the ncbi-blast+ package as appropriate using the commands:

Install: rpm -ivh ncbi-blast-2.2.18-1.x86_64.rpm
Upgrade: rpm -Uvh ncbi-blast-2.2.18-1.x86_64.rpm

Note: one must have root privileges to run these commands. If you do not have root privileges, please use the procedure described in Other Unix platforms.

2.4 Other Unix platforms

Download the tarball and expand it in the location of your choice.

2.5 Source tarball

Use this approach if you would like to build the BLAST+ applications yourself. Download the tarball, expand it and in the expanded directory type the following commands:

cd c++
./configure --without-debug --with-mt --with-build-root=ReleaseMT
cd ReleaseMT/build
make all_r

The compiled executables will be found in c++/ReleaseMT/bin.

In Windows, extract the tarball and open the appropriate MSVC solution or project file (e.g.: c++\compilers\msvc800_prj\static\build), build the -CONFIGURE- project, click on “Reload” when prompted by the development environment, and then build the -BUILD-ALL- project. The compiled executables will be found in the directory corresponding to the build configuration selected (e.g.: c++\compilers\msvc800_prj\static\bin\debugdll).

3. Quick start

3.1 For users of NCBI C Toolkit BLAST

The easiest way to get started using these command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using

blastall -i query -d nr -o blast.out

one would use

legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin

For more details, refer to the section titled Backwards compatibility script.

3.2 For users of Web BLAST

Users of Web BLAST can take advantage of the search strategies to quickly get started using the BLAST+ applications, as these intend to allow seamless integration between the Web and command line BLAST tools. For more details, refer to the section on BLAST search strategies.

3.3 For new users of BLAST

An introduction to BLAST is outside the scope of this manual; more information on this subject can be found at http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs . Nonetheless, new users will benefit from the examples in the cookbook as well as from reading the user manual.

3.4 Downloading BLAST databases

The BLAST databases are required to run BLAST locally and to support automatic resolution of sequence identifiers. Documentation about these can be found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/README . These databases may be retrieved automatically with the update_blastdb.pl perl script, which is included as part of this distribution.

This script will download multiple tar files for each BLAST database volume if necessary, without having to designate each volume. For example:

update_blastdb.pl htgs

will download all the relevant htgs tar files (htgs.00.tar.gz, …, htgs.N.tar.gz). The script can also compare your local copy of the database tar file(s) and only download tar files if the date stamp has changed, reflecting a newer version of the database. This allows the script to run on a schedule and only download tar files when needed. Documentation for the update_blastdb.pl script can be obtained by running the script without any arguments (perl is required).

4. User manual

4.1 Functionality offered by BLAST+ applications

The functionality offered by the BLAST+ applications has been organized by program type, so as to more closely resemble Web BLAST. The following graph depicts the correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications:

As an example, to run a search of a nucleotide query (translated “on the fly” by BLAST) against a protein database one would use the blastx application instead of blastall. The blastx application will also work in “Blast2Sequences” mode (i.e.: accept FASTA sequences instead of a BLAST database as targets) and can also send BLAST searches over the network to the public NCBI server if desired.

The blastn, blastp, blastx, tblastx, tblastn, psiblast, rpsblast, and rpstblastn applications are considered search applications, as they execute a BLAST search, whereas makeblastdb, blastdb_aliastool, and blastdbcmd are considered BLAST database applications, as they either create or examine BLAST databases.

There is also a new set of sequence filtering applications described in the section Sequence filtering applications and an application to build database indices that greatly speed up megablast in some cases (see section titled Megablast indexed searches).

Please note that the NCBI C Toolkit applications seedtop and blastclust are not available in this release.

4.2 Common options

The following is a listing of options that are common to the majority of BLAST+ applications followed by a brief description of what they do:

4.2.1 best_hit_overhang : Overhang value for Best-Hit algorithm. For more details, see the section Best-Hits filtering algorithm.

4.2.2 best_hit_score_edge : Score edge value for Best-Hit algorithm. For more details, see the section Best-Hits filtering algorithm.

4.2.3 db : File name of BLAST database to search the query against. Unless an absolute path is used, the database will be searched relative to the current working directory first, then relative to the value specified by the BLASTDB environment variable, then relative to the BLASTDB configuration value specified in the configuration file.

4.2.4 dbsize : Effective length of the database.

4.2.5 dbtype : Molecule type stored or to store in a BLAST database.

4.2.6 db_soft_mask : Filtering algorithm ID to apply to the database as soft masking for subject sequences. The algorithm IDs for a given BLAST database can be obtained by invoking blastdbcmd with its -info flag (only shown if such filtering in the BLAST database is available). For more details see the section Masking in BLAST databases.

4.2.7 culling_limit : Ensures that no more than the specified number of HSPs are aligned to the same part of the query. This option was designed for searches with a lot of repetitive matches, but if possible it is probably more efficient to mask the query to remove the repetitive sequences.

4.2.8 entrez_query : Restrict the search of the BLAST database to the results of the Entrez query provided.

4.2.9 evalue : Expectation value threshold for saving hits.

4.2.10 export_search_strategy : Name of the file where to save the search strategy (see section titled BLAST search strategies).

4.2.11 gapextend : Cost to extend a gap.

4.2.12 gapopen : Cost to open a gap.

4.2.13 gilist : File containing a list of GIs to restrict the BLAST database to search. The expect values in the BLAST results are based upon the sequences actually searched and not on the underlying database.

4.2.14 h : Displays the application’s brief documentation.

4.2.15 help : Displays the application’s detailed documentation.

4.2.16 html : Enables the generation of HTML output suitable for viewing in a web browser.

4.2.17 import_search_strategy : Name of the file where to read the search strategy to execute (see section titled BLAST search strategies).

4.2.18 lcase_masking : Interpret lowercase letters in query sequence(s) as masked.

4.2.19 matrix : Name of the scoring matrix to use.

4.2.20 max_target_seqs : Maximum number of aligned sequences to keep from the BLAST database.

4.2.21 negative_gilist : File containing a list of GIs to exclude from the BLAST database.

4.2.22 num_alignments : Number of alignments to show in the BLAST output.

4.2.23 num_descriptions : Number of one-line descriptions to show in the BLAST output.

4.2.24 num_threads : Number of threads to use during the search.

4.2.25 out : Name of the file to write the application’s output. Defaults to stdout.

4.2.26 outfmt : Allows for the specification of the search application’s output format. A listing of the possible format types is available via the search application’s -help option. If a custom output format is desired, this can be specified by providing a quoted string composed of the desired output format (tabular, tabular with comments, or comma-separated value), a space, and a space-delimited list of output specifiers. The list of supported output specifiers is available via the -help command line option. Unsupported output specifiers will be ignored. This should be specified using double quotes if there are spaces in the output format specification (e.g.: -outfmt "7 sseqid sacc qstart qend sstart send qseq evalue bitscore").

4.2.27 parse_deflines : Parse the query and subject deflines.

4.2.28 query : Name of the file containing the query sequence(s), or ‘-‘ if these are provided on standard input.

4.2.29 query_loc : Location of the first query sequence to search in 1-based offsets (Format: start-stop).

4.2.30 remote : Instructs the application to submit the search to NCBI for remote execution.

4.2.31 searchsp : Effective length of the search space.

4.2.32 seg : Arguments to SEG filtering algorithm (use ‘no’ to disable).

4.2.33 show_gis : Show NCBI GIs in deflines in the BLAST output.

4.2.34 soft_masking : Apply filtering locations as soft masks (i.e.: only when finding alignment seeds).

4.2.35 subject : Name of the file containing the subject sequence(s) to search.

4.2.36 subject_loc : Location of the first subject sequence to search in 1-based offsets (Format: start-stop).

4.2.37 strand : Strand(s) of the query sequence to search.

4.2.38 threshold : Minimum word score such that the word is added to the BLAST lookup table.

4.2.39 ungapped : Perform ungapped alignments only.

4.2.40 version : Displays the application’s version.

4.2.41 window_size : Size of the window for multiple hits algorithm, use 0 to specify 1-hit algorithm.

4.2.42 word_size : Word size for word finder algorithm.

4.2.43 xdrop_gap : X-dropoff value (in bits) for preliminary gapped extensions.

4.2.44 xdrop_gap_final : X-dropoff value (in bits) for final gapped alignment.

4.2.45 xdrop_ungap : X-dropoff value (in bits) for ungapped extensions.
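
As a brief illustration of how several of these common options combine, a plausible nucleotide search might look like the following (the query file, database name, and output file are placeholders; the custom tabular format uses specifiers listed by -help):

$ blastn -query query.fsa -db nt -evalue 1e-5 -num_threads 4 -outfmt "7 qseqid sseqid pident length evalue bitscore" -out results.txt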

4.3 Backwards compatibility script

The purpose of the legacy_blast.pl Perl script is to help users make the transition from the C Toolkit BLAST command line applications to the BLAST+ applications. This script produces its own documentation by invoking it without any arguments.

The legacy_blast.pl script supports two modes of operation, one in which the C Toolkit BLAST command line invocation is converted and executed on behalf of the user and another which solely displays the BLAST+ application equivalent to what was provided, without executing the command.

The first mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and optionally providing the --path argument after the command line to convert if the installation path for the BLAST+ applications differs from the default (available by invoking the script without arguments). See example in the first section of the Quick start.

The second mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and appending the --print_only command line option as follows:

$ ./legacy_blast.pl megablast -i query.fsa -d nt -o mb.out --print_only
/opt/ncbi/blast/bin/blastn -query query.fsa -db "nt" -out mb.out
$

4.4 Exit codes

All BLAST+ applications have consistent exit codes to signify the exit status of the application.

The possible exit codes along with their meaning are detailed in the table below:
Exit Code Meaning
0 Success
1 Error in query sequence(s) or BLAST options
2 Error in BLAST database
3 Error in BLAST engine
4 Out of memory
255 Unknown error
In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).

4.5 Improvements over C Toolkit BLAST command line applications
4.5.1 Query splitting

This new feature in the BLAST+ applications provides substantial performance improvements, particularly for blastx searches, and it is automatically enabled by the software when deemed appropriate. Below is a graph comparing the runtime of blastall and blastx when searching different size excerpts of NC_007113 (varying from 10 kbases to about 10 Mbases) against the human genome database (experiments performed in July 2008):

4.5.2 Tasks

The concept of tasks has been added to support the notion of commonly performed tasks via the -task command line option in blastn and blastp. The following tasks are currently available:

Program   Task name       Description
blastp    blastp          Traditional BLASTP to compare a protein query to a protein database
          blastp-short    BLASTP optimized for queries shorter than 30 residues
blastn    blastn          Traditional BLASTN requiring an exact match of 11
          blastn-short    BLASTN program optimized for sequences shorter than 50 bases
          megablast       Traditional megablast used to find very similar (e.g., intraspecies or
                          closely related species) sequences
          dc-megablast    Discontiguous megablast used to find more distant (e.g., interspecies)
                          sequences
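
For example, a discontiguous megablast search could be requested as follows (the query file and database name are placeholders):

$ blastn -task dc-megablast -query query.fsa -db nt -out results.txt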

4.5.3 Megablast indexed searches

Indexed searches for megablast are available and are faster than regular megablast. The application to generate the database indices is called makembindex, which is included in this distribution. More information about it can be found at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/README.usage .

4.5.4 Partial sequence retrieval from BLAST databases

Improvements to the BLAST database reading module allow it to fetch only the relevant portions of the subject sequence that are needed in the gapped alignment stage, providing a substantial improvement in runtime. The following example, which compares 103 mouse EST sequences against the human genome, illustrates this (example run in July 2008 after the database had already been loaded into memory):

$ time megablast -d 9606_genomic -i est.500.in -r 2 -q -3 -F mL -m 9 -o old.out -W 11 -t 18 -G 5 -E 2
341.455u 65.242s 6:47.99 99.6% 0+0k 0+0io 0pf+0w
$ time blastn -task dc-megablast -db 9606_genomic -query est.500.in -outfmt 7 -out new.out
218.540u 11.632s 3:50.53 99.8% 0+0k 0+0io 0pf+0w

Similar gains in performance should be expected in BLAST databases which contain very large sequences and many very short queries.

4.5.5 BLAST search strategies

BLAST search strategies are files which encode the inputs necessary to perform a BLAST search. The purpose of these files is to be able to seamlessly reproduce a BLAST search in various environments (Web BLAST, command line applications, etc).

4.5.5.1 Exporting search strategies on the Web BLAST

Click on "download" next to the RID/saved strategy in the "Recent Results" or "Saved Strategies" tabs.

4.5.5.2 Exporting search strategies with BLAST+ applications

Add the -export_search_strategy along with a file name to the command line options.

4.5.5.3 Importing search strategies on Web BLAST

Go to the "Saved Strategies" tab, click on "Browse" to select your search strategy file, then click on "View" to load it into the submission page.

4.5.5.4 Importing search strategies with BLAST+ applications

Add the -import_search_strategy option along with the name of the file containing the search strategy. Note that, if provided, the -query, -db, -use_index, and -index_name command line options will override the specifications of the search strategy file provided (no other command line options will override the contents of the search strategy file).
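
A minimal sketch of exporting a strategy from one search and reusing it in another (all file names are placeholders):

$ blastn -query query.fsa -db nt -export_search_strategy strategy.asn -out run1.out
$ blastn -import_search_strategy strategy.asn -out run2.out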

4.5.6 Negative GI lists

Negative GI lists are available on search applications and they provide a means to exclude GIs from a BLAST database search. The expect values in the BLAST results are based upon the sequences actually searched and not on the underlying database. For an example, see the cookbook.

4.5.7 Masking in BLAST databases

It is now possible to create BLAST databases which contain filtered sequences (also known as masking information or masks). This filtering information can be used as soft masking for the subject sequences. For instructions on creating masked BLAST databases, please see the cookbook.

4.5.8 Custom output formats for BLAST searches

The BLAST+ search command line applications support custom output formats for the tabular and comma-separated value output formats. For more details see the common options as well as the cookbook.

4.5.9 Custom output formats to extract BLAST database data

blastdbcmd supports custom output formats to extract data from BLAST databases via the outfmt command line option. For more details see the blastdbcmd options as well as the cookbook.

4.5.10 Improved software installation packages

The BLAST+ applications are available via Windows and MacOSX installers as well as RPMs (source and binary) and unix tarballs. For more details about these, refer to the installation section.

4.5.11 Sequence filtering applications

The BLAST+ applications include a new set of sequence filtering applications, namely segmasker, dustmasker, and windowmasker. segmasker is an application that identifies and masks low complexity regions of protein sequences. The dustmasker and windowmasker applications provide similar functionality for nucleotide sequences (see ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/dustmasker/README.dustmasker and ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/README.windowmasker for more information).

4.5.12 Best-Hits filtering algorithm

The Best-Hit filtering algorithm is designed for use in applications that are interested in only the best matches for each query region of the reported matches. Its -best_hit_overhang parameter, H, controls when an HSP is considered short enough to be filtered due to the presence of another HSP. For each HSP A that is filtered, there exists another HSP B such that the query region of HSP A extends each end of the query region of HSP B by at most H times the length of the query region of B.

Additional requirements that must also be met in order to filter A on account of B are:

(i)  evalue(A) >= evalue(B)
(ii) score(A)/length(A) < (1.0 - score_edge) * score(B)/length(B)

We consider 0.1 to 0.25 to be an acceptable range for the -best_hit_overhang parameter and 0.05 to 0.25 to be an acceptable range for the -best_hit_score_edge parameter. Increasing the value of the overhang parameter eliminates a higher number of matches, but increases the running time; increasing the score_edge parameter removes a smaller number of hits.
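
For instance, a protein search with Best-Hit filtering enabled might be invoked as follows (the query file and database are placeholders, with parameter values chosen from the acceptable ranges above):

$ blastp -query query.fsa -db nr -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -out results.txt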

4.5.13 Automatic resolution of sequence identifiers

The BLAST+ search applications support automatic resolution of query and subject sequence identifiers specified as GIs or accessions (see the cookbook section for an example). This feature enables the user to specify one or more sequence identifiers (GIs and/or accessions, one per line) in a file as the input to the -query and -subject command line options.

Upon encountering this type of input, by default the BLAST+ search applications will try to resolve these sequence identifiers in locally available BLAST databases first, then in the BLAST databases at NCBI, and finally in Genbank (the latter two data sources require a properly configured internet connection). These data sources can be configured via the DATA_LOADERS configuration option and the BLAST databases to search can be configured via the BLASTDB_PROT_DATA_LOADER and BLASTDB_NUCL_DATA_LOADER configuration options (see the section on Configuring BLAST).
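
As a sketch, a query file containing one identifier per line (here reusing the GI that appears in the cookbook's ecoli example; the file and database names are placeholders) can be passed directly to -query:

$ cat gis.txt
1786181
$ blastn -query gis.txt -db nt -out results.txt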

4.6 Options by program type
4.6.1 blastp

4.6.1.1 task : Specify the task to execute. For more details, refer to the section on tasks.

4.6.1.2 comp_based_stats : Select the appropriate composition based statistics mode (applicable only to blastp and tblastn). Available choices and references are available by invoking the application with -help option.

4.6.1.3 use_sw_tback : Instead of using the X-dropoff gapped alignment algorithm, use Smith-Waterman to compute locally optimal alignments

4.6.2 blastn

4.6.2.1 task : Specify the task to execute. For more details, refer to the section on tasks.

4.6.2.2 penalty : Penalty for a nucleotide mismatch.

4.6.2.3 reward : Reward for a nucleotide match.

4.6.2.4 use_index : Use a megablast database index.

4.6.2.5 index_name : Name of the megablast database index.

4.6.2.6 perc_identity : Minimum percent identity of matches to report

4.6.2.7 dust : Arguments to DUST filtering algorithm (use ‘no’ to disable).

4.6.2.8 filtering_db : Name of BLAST database containing filtering elements (i.e.: repeats)

4.6.2.9 window_masker_taxid : Enable WindowMasker filtering using a Taxonomic ID [experimental]

4.6.2.10 window_masker_db : Enable WindowMasker filtering using this repeats database. [experimental]

4.6.2.11 no_greedy : Use non-greedy dynamic programming extension.

4.6.2.12 min_raw_gapped_score : Minimum raw gapped score to keep an alignment in the preliminary gapped and traceback stages.

4.6.2.13 template_type : Discontiguous megablast template type.

4.6.2.14 template_length : Discontiguous megablast template length.

4.6.2.15 off_diagonal_range : Maximum number of diagonals separating two hits used to initiate an extension. Increasing values of this parameter lead to a longer run time, but more sensitive results. If this parameter is set, a value of five is suggested. Only discontiguous megablast uses two hits by default.

4.6.3 blastx

4.6.3.1 query_gencode : Genetic code to use to translate the query sequence(s).

4.6.3.2 frame_shift_penalty : Frame shift penalty for use with out-of-frame gapped alignments.

4.6.3.3 max_intron_length : Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

4.6.4 tblastx

4.6.4.1 db_gencode : Genetic code to use to translate database/subjects.

4.6.4.2 max_intron_length : Identical to blastx.
4.6.5 tblastn

4.6.5.1 db_gencode : Identical to tblastx.

4.6.5.2 frame_shift_penalty : Identical to blastx.

4.6.5.3 max_intron_length : Identical to blastx.

4.6.5.4 comp_based_stats : Identical to blastp.

4.6.5.5 use_sw_tback : Identical to blastp.

4.6.5.6 in_pssm : Checkpoint file to initiate PSI-TBLASTN (currently unimplemented).

4.6.6 psiblast

4.6.6.1 comp_based_stats : Identical to blastp with the exception that only composition based statistics mode 1 is valid when a PSSM is the input (either when restarting from a checkpoint file or when performing multiple PSI-BLAST iterations).

4.6.6.2 gap_trigger : Number of bits to trigger gapping.

4.6.6.3 use_sw_tback : Identical to blastp.

4.6.6.4 num_iterations : Number of iterations to perform.

4.6.6.5 out_pssm : Name of the file in which to store the checkpoint file containing a PSSM.

4.6.6.6 out_ascii_pssm : Name of the file in which to store the ASCII version of the PSSM.

4.6.6.7 in_msa : Name of the file containing multiple sequence alignment to restart PSI-BLAST.

4.6.6.8 in_pssm : Checkpoint file to re-start PSI-BLAST.

4.6.6.9 pseudocount : Pseudo-count value used when constructing the PSSM.

4.6.6.10 inclusion_ethresh : E-value inclusion threshold for pairwise alignments to be considered to build the PSSM.

4.6.6.11 phi_pattern : Name of the file containing a PHI-BLAST pattern to search.
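
A plausible psiblast invocation combining several of these options (the query file, database, and output file names are placeholders):

$ psiblast -query protein.fsa -db refseq_protein -num_iterations 3 -inclusion_ethresh 0.002 -out_ascii_pssm pssm.txt -out psiblast.out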

4.6.7 rpstblastn

4.6.7.1 query_gencode : Identical to blastx.

4.6.8 makeblastdb

This application serves as a replacement for formatdb.

4.6.8.1 in : Input file or BLAST database name to use as source; the data type is automatically detected. Note that multiple input files/BLAST databases can be provided; each must be separated by white space in a string quoted with single quotation marks. Multiple input files/BLAST databases which contain white space in them should be quoted with double quotation marks inside the white space-separated, single quoted string (e.g.: -in ‘“C:\My Documents\seqs.fsa” “E:\Users\Joe Smith\myfasta.fsa”’ ).

4.6.8.2 title : Title for the BLAST database to create

4.6.8.3 parse_seqids : Parse the Seq-id(s) in the FASTA input provided. Please note that this option should be provided consistently among the various applications involved in creating BLAST databases. For instance, the filtering applications as well as convert2blastmask should use this option if makeblastdb uses it also.

4.6.8.4 hash_index : Enables the creation of sequence hash values. These hash values can then be used to quickly determine if a given sequence data exists in this BLAST database.

4.6.8.5 mask_data : Comma-separated list of input files containing masking data to apply to the sequences being added to the BLAST database being created. For more information, see Masking in BLAST databases and the examples.

4.6.8.6 out : Name of the BLAST database to create.

4.6.8.7 max_file_sz : Maximum file size for any of the BLAST database files created.

4.6.8.8 logfile : Name of the file to which the program log should be redirected (stdout by default).

4.6.8.9 taxid: Taxonomy ID to assign to all sequences.

4.6.8.10 taxid_map: Name of file which provides a mapping of sequence IDs to taxonomy IDs.
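
A minimal makeblastdb sketch using some of the options above (the input file, database name, and title are placeholders):

$ makeblastdb -in proteins.fsa -dbtype prot -parse_seqids -out my_proteins -title "Example protein database"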

4.6.9 blastdb_aliastool

This application replaces part of the functionality offered by formatdb. When formatting a large input FASTA sequence file into a BLAST database, makeblastdb breaks up the resulting database into optimal sized volumes and links the volumes into a large virtual database through an automatically created BLAST database alias file.

We can use BLAST database alias files under different scenarios to manage collections of BLAST databases and facilitate BLAST searches. For example, we can create an alias file to combine an existing BLAST database with newly generated ones while leaving the original one undisturbed. Also, for an existing BLAST database, we can create a BLAST database alias file based on a GI list so we can search a subset of it, eliminating the need to create a new database. For examples of how to use this application, please see the cookbook section.

This application supports three modes of operation:

1) GI file conversion: Converts a text file containing GIs (one per line) to a more efficient binary format. This can be provided as an argument to the -gilist option of the BLAST search command line binaries or to the -gilist option of this program to create an alias file for a BLAST database (see below).

2) Alias file creation (restricting with GI List): Creates an alias for a BLAST database and a GI list which restricts this database. This is useful if one often searches a subset of a database (e.g., based on organism or a curated list). The alias file makes the search appear as if one were searching a regular BLAST database rather than the subset of one.

3) Alias file creation (aggregating BLAST databases): Creates an alias for multiple BLAST databases. All databases must be of the same molecule type (no validation is done). The relevant options are -dblist and -num_volumes.

4.6.9.1 gi_file_in : Text file to convert, should contain one GI per line.

4.6.9.2 gi_file_out : File name of converted GI file

4.6.9.3 title : Title for BLAST database.

4.6.9.4 gilist : Name of the file containing the GIs to restrict the database provided in -db.

4.6.9.5 out : Identical to makeblastdb.

4.6.9.6 logfile : Identical to makeblastdb.

Please note that when using GI lists, the expect values in the BLAST results are based upon the sequences actually searched and not on the underlying database.
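
A rough sketch of the GI file conversion and alias creation modes (the GI list files, output alias name, and title are placeholders, and the exact set of required flags may vary by version):

$ blastdb_aliastool -gi_file_in my_gis.txt -gi_file_out my_gis.bgl
$ blastdb_aliastool -db nr -dbtype prot -gilist my_gis.bgl -out nr_subset -title "nr restricted to my GI list"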

4.6.10 blastdbcmd

This application is the successor to fastacmd. The following are its supported options:

4.6.10.1 entry : A comma-delimited search string of sequence identifiers, or the keyword ‘all’ to select all sequences in the database.

4.6.10.2 entry_batch : Input file for batch processing, entries must be provided one per line. If input is provided on standard input, a ‘-‘ should be used to indicate this.

4.6.10.3 pig : PIG (Protein Identity Group) to retrieve.

4.6.10.4 info : Print BLAST database information (overrides all other options).

4.6.10.5 range : Selects the range of a sequence to extract in 1-based offsets (Format: start-stop).

4.6.10.6 strand : Strand of nucleotide sequence to extract.

4.6.10.7 outfmt : Output format string. For a list of available format specifiers, invoke the application with its -help option. Note that for all format specifiers except %f, each line of output will correspond to a single sequence. This should be specified using double quotes if there are spaces in the output format specification (e.g.: -outfmt "%g %t").

4.6.10.8 target_only : The definition line of the sequence should contain target GI only.

4.6.10.9 get_dups : Retrieve duplicate accessions

4.6.10.10 line_length : Line length for output (applicable only with FASTA output format).

4.6.10.11 ctrl_a : Use Ctrl-A as the non-redundant defline separator (applicable only with FASTA output format).

4.6.10.12 mask_sequence_with : Allows the specification of a filtering algorithm ID from the BLAST database to apply to the sequence data. Applicable only with FASTA and sequence data output formats (%f and %s respectively).

4.6.10.13 list : Display BLAST databases available in the directory provided as an argument to this option.

4.6.10.14 list_outfmt : Allows for the specification of the output format for the -list option; a listing of the possible format types is available via the application’s -help option. Unsupported output specifiers will be ignored. This option’s argument should be specified using double quotes if there are spaces in the output format specification.

4.6.10.15 recursive : Recursively traverse the directory provided to the -list option to find and display available BLAST databases.

4.6.10.16 show_blastdb_search_path : Displays the default BLAST database search paths (separated by colons).
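
For example, to print the GI and title of a single entry (this reuses the ecoli database and GI from the cookbook example in section 5.1):

$ blastdbcmd -db ecoli -entry 1786181 -outfmt "%g %t"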

4.6.11 convert2blastmask

This application extracts the lower-case masks from its FASTA input and converts them to a file format suitable for specifying masking information to makeblastdb. The following are its supported options:

4.6.11.1 masking_algorithm : The name of the masking algorithm used to create the masks (e.g.: dust, seg, windowmasker, repeat).

4.6.11.2 masking_options : The options used to configure the masking algorithm.

4.6.11.3 in : Name of the input file, by default is standard input.

4.6.11.4 output : Name of the output file, by default is standard output.

4.6.11.5 outfmt : Output file format.

4.6.11.6 parse_seqids : Identical to makeblastdb.

4.6.12 blastdbcheck

This application performs tests on BLAST databases to check their integrity. The following are its supported options:

4.6.12.1 dir : Name of the directory where to look for BLAST databases.

4.6.12.2 recursive : Flag to specify whether to recursively search for BLAST databases in the directory specified above.

4.6.12.3 full : Check every database sequence.

4.6.12.4 stride : Check every Nth database sequence.

4.6.12.5 samples : Check a randomly selected set of N sequences.

4.6.12.6 ends : Check the beginning and ending N sequences in the database.

4.6.12.7 isam : Set to true to perform ISAM file checking on each of the selected sequences.
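
A minimal sketch, assuming the databases to be checked live in the current working directory (option combinations may vary by version):

$ blastdbcheck -dir . -recursive -full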

4.6.13 blast_formatter

This application formats both local and remote BLAST results. An RID is required to format remote BLAST results. The RID may be obtained either from a search submitted to the NCBI BLAST web page or by using the -remote switch with one of the applications mentioned above. The blast_formatter accepts the BLAST archive format for stand-alone formatting. The BLAST archive format can be produced by using the “-outfmt 11” argument with the stand-alone applications. The following are its supported options:

4.6.13.1 rid : BLAST RID of the report to be formatted.

4.6.13.2 archive : File produced by a BLAST application using the -outfmt 11 option.
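
A sketch of the archive-based workflow (file names are placeholders): run a search that saves the BLAST archive, then reformat it later without repeating the search.

$ blastn -query query.fsa -db nt -outfmt 11 -out results.asn
$ blast_formatter -archive results.asn -outfmt 7 -out results.tab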

4.7 Configuring BLAST

The BLAST+ search applications can be configured by means of a configuration file named .ncbirc (on Unix-like platforms) or ncbi.ini (on Windows). This is a plain text file which contains sections and key-value pairs to specify configuration parameters. Lines starting with a semi-colon are considered comments. This file will be searched in the following order and locations:

1. Current working directory
2. User's HOME directory
3. Directory specified by the NCBI environment variable

The search for this file will stop at the first location where it is found and the configuration settings from that file will be applied. If the configuration file is not found, default values will apply. The following are the possible configuration parameters that impact the BLAST+ applications:

DATA_LOADERS : Data loaders to use for automatic sequence identifier resolution. This is a comma-separated list of the following keywords: blastdb, genbank, and none. The none keyword disables this feature and takes precedence over any other keywords specified. The default is blastdb,genbank.

BLASTDB_PROT_DATA_LOADER : Locally available BLAST database name to search when resolving protein sequences using BLAST databases. Ignored if DATA_LOADERS does not include the blastdb keyword.

BLASTDB_NUCL_DATA_LOADER : Locally available BLAST database name to search when resolving nucleotide sequences using BLAST databases. Ignored if DATA_LOADERS does not include the blastdb keyword.

Path to gene information files (NCBI only).

WINDOW_MASKER_PATH : Path to windowmasker files (experimental).

The following is an example with comments describing the available parameters for configuration:

; Start the section for BLAST configuration
[BLAST]
; Specifies the path where BLAST databases are installed
BLASTDB=/home/guest/blast/db
; Specifies the data sources to use for automatic resolution of sequence identifiers
DATA_LOADERS=blastdb
; Specifies the BLAST database to use to resolve protein sequences
BLASTDB_PROT_DATA_LOADER=custom_protein_database
; Specifies the BLAST database to use to resolve nucleotide sequences
BLASTDB_NUCL_DATA_LOADER=/home/some_user/my_nucleotide_db

; Windowmasker settings (experimental)
[WINDOW_MASKER]
WINDOW_MASKER_PATH=/home/guest/blast/db/windowmasker
; end of file

4.7.1 Memory usage

The BLAST search programs can exhaust all memory on a machine if the input is too large or if there are too many hits to the BLAST database. If this is the case, please see your operating system documentation to limit the memory used by a program (e.g.: ulimit on Unix-like platforms).
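
For example, on Unix-like systems running bash, one could cap the virtual memory available to the search before launching it (the 4,000,000 KB limit and file names here are arbitrary illustrations):

$ ulimit -v 4000000
$ blastn -query query.fsa -db nt -out results.txt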

5. Cookbook

5.1 Query a BLAST database with a GI, but exclude that GI from the results

Extract a GI from the ecoli database:

$ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | tee exclude_me
1786181

Run the restricted database search, which shows there are no self-hits:

$ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 -query exclude_me | grep `cat exclude_me`
Query= gi|1786181|gb|AE000111.1|AE000111

5.2 Create a masked BLAST database

Creating a masked BLAST database is a two step process:

a) Generate the masking data using a sequence filtering utility like windowmasker or dustmasker

b) Generate the actual BLAST database using makeblastdb

For both steps, the input file can be a text file containing sequences in FASTA format, or an existing BLAST database created using makeblastdb. We will provide examples for both.

5.2.1 Collect mask information files

For nucleotide sequence data in FASTA files or BLAST database format, we can generate the mask information files using windowmasker or dustmasker. windowmasker masks over-represented sequence data and can also mask low-complexity sequence data using the built-in dust algorithm (through the -dust option). To mask low-complexity sequences only, we will need to use dustmasker.

For protein sequence data in FASTA files or BLAST database format, we need to use segmasker to generate the mask information file.

The following examples assume that the BLAST databases listed in 5.2.3 are available in the current working directory. Note that you should use sequence id parsing consistently. In all our examples, we enable this function by including “-parse_seqids” in the command line arguments.

We can generate the masking information with dustmasker using a single command line:

$ dustmasker -in hs_chr -infmt blastdb -parse_seqids -outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

Here we specify the input is a BLAST database named hs_chr (-in hs_chr -infmt blastdb), enable the sequence id parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt maskinfo_asn1_bin), and name the output file as hs_chr_dust.asnb (-out hs_chr_dust.asnb).

If the input format is the original FASTA file, hs_chr.fa, we need to change the -in and -infmt options as follows:

$ dustmasker -in hs_chr.fa -infmt fasta -parse_seqids -outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

To generate the masking information using windowmasker from the BLAST database hs_chr, we first need to generate a counts file:

$ windowmasker -in hs_chr -infmt blastdb -mk_counts -parse_seqids -out hs_chr_mask.counts

Here we specify the input BLAST database (-in hs_chr -infmt blastdb), request it to generate the counts (-mk_counts) with sequence id parsing (-parse_seqids), and save the output to a file named hs_chr_mask.counts (-out hs_chr_mask.counts).

To use the FASTA file hs_chr.fa to generate the counts, we need to change the input file name and format:

$ windowmasker -in hs_chr.fa -infmt fasta -mk_counts -parse_seqids -out hs_chr_mask.counts

With the counts file we can then proceed to create the file containing the masking information as follows:

$ windowmasker -in hs_chr -infmt blastdb -ustat hs_chr_mask.counts -outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

Here we need to use the same input (-in hs_chr -infmt blastdb) and the output of step 1 (-ustat hs_chr_mask.counts). We set the mask file format to binary asn.1 (-outfmt maskinfo_asn1_bin), enable the sequence ids parsing (-parse_seqids), and save the masking data to hs_chr_mask.asnb (-out hs_chr_mask.asnb).

To use the FASTA file hs_chr.fa, we change the input file name and file type:

$ windowmasker -in hs_chr.fa -infmt fasta -ustat hs_chr.counts -outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

We can generate the masking information with segmasker using a single command line:

$ segmasker -in refseq_protein -infmt blastdb -parse_seqids -outfmt maskinfo_asn1_bin -out refseq_seg.asnb

Here we specify the refseq_protein BLAST database (-in refseq_protein -infmt blastdb), enable sequence ids parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt maskinfo_asn1_bin), and name the out file as refseq_seg.asnb (-out refseq_seg.asnb).

If the input format is the FASTA file, we need to change the command line to specify the input format:

$ segmasker -in refseq_protein.fa -infmt fasta -parse_seqids -outfmt maskinfo_asn1_bin -out refseq_seg.asnb

We can also extract the masking information from a FASTA sequence file with lowercase masking (generated by various means) using the convert2blastmask utility. An example command line follows:

$ convert2blastmask -in hs_chr.mfa -parse_seqids -masking_algorithm repeat -masking_options "repeatmasker, default" -outfmt maskinfo_asn1_bin -out hs_chr_mfa.asnb

Here the input is hs_chr.mfa (-in hs_chr.mfa); we enable parsing of sequence ids (-parse_seqids), specify the masking algorithm name (-masking_algorithm repeat) and its options (-masking_options “repeatmasker, default”), and request asn.1 output (-outfmt maskinfo_asn1_bin) to be saved in the specified file (-out hs_chr_mfa.asnb).

5.2.2 Create BLAST database with the masking information

Using the masking information data files generated in steps 5.2.1.1, 5.2.1.2, 5.2.1.3, and 5.2.1.4, we can create BLAST databases with the masking information incorporated.

Note : we should use “-parse_seqids” in a consistent manner – either use it in both steps or not use it at all.

For example, we can use the following command line to apply the masking information, created in step 5.2.1.2, to the existing BLAST database generated in 5.2.3:

$ makeblastdb -in hs_chr -dbtype nucl -parse_seqids -mask_data hs_chr_mask.asnb -out hs_chr -title "Human Chromosome, Ref B37.1"

Here, we use the existing BLAST database as input file (-in hs_chr), specify its type (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data from step 5.2.1.2 (-mask_data hs_chr_mask.asnb), and name the output database with the same base name (-out hs_chr), overwriting the existing one.

To use the original FASTA sequence file (hs_chr.fa) as the input, we need to use “-in hs_chr.fa” to instruct makeblastdb to use that FASTA file instead.

We can check the “re-created” database to find out if the masking information was added properly, using blastdbcmd with the following command line:

$ blastdbcmd -db hs_chr -info

This command prints out a summary of the target database:

Database: human chromosomes, Ref B37.1
        24 sequences; 3,095,677,412 total bases

Date: Aug 13, 2009  3:02 PM     Longest sequence: 249,250,621 bases

Available filtering algorithms applied to database sequences:

    Algorithm ID   Algorithm name   Algorithm options
              30   windowmasker

Volumes:
        /export/home/tao/blast_test/hs_chr

Extra lines under the “Available filtering algorithms …” describe the masking algorithms available. The “Algorithm ID” field, 30 in our case, is what we need to use if we want to invoke database soft masking during an actual search through the “-db_soft_mask” parameter.

We can apply additional masking data to an existing BLAST database with one type of masking information already added. For example, to apply the dust masking generated in step 5.2.1.1 to the database generated in step 5.2.2.1, we can use this command line:

$ makeblastdb -in hs_chr -dbtype nucl -parse_seqids -mask_data hs_chr_dust.asnb -out hs_chr -title "Human Chromosome, Ref B37.1"

Here, we use the existing database as input file (-in hs_chr), specify its type (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data from step 5.2.1.1 (-mask_data hs_chr_dust.asnb), and name the database with the same base name (-out hs_chr), overwriting the existing one.

Checking the “re-generated” database with blastdbcmd:

$ blastdbcmd -db hs_chr -info

we can see that both sets of masking information are available:

Database: Human Chromosome, Ref B37.1
        24 sequences; 3,095,677,412 total bases

Date: Aug 25, 2009  4:43 PM     Longest sequence: 249,250,621 bases

Available filtering algorithms applied to database sequences:

    Algorithm ID  Algorithm name  Algorithm options
        11        dust            window=64 level=20 linker=1
        30        windowmasker

Volumes:
        /net/gizmo4/export/home/tao/blast_test/hs_chr

A more straightforward approach is to apply multiple sets of masking information in a single makeblastdb run by providing the masking data files as a comma-delimited list:

$ makeblastdb -in hs_chr -dbtype nucl -parse_seqids -mask_data hs_chr_dust.asnb,hs_chr_mask.asnb -out hs_chr

We can use the masking data file generated in step 5.2.1.3 to create a protein BLAST database:

$ makeblastdb -in refseq_protein -dbtype prot -parse_seqids -mask_data refseq_seg.asnb -out refseq_protein -title "RefSeq Protein Database"

Using blastdbcmd, we can check the database thus generated:

$ blastdbcmd -db refseq_protein -info

This produces the following summary, which includes the masking information:

Database: RefSeq Protein Database
        7,044,477 sequences; 2,469,203,411 total residues

Date: Sep 1, 2009  10:50 AM     Longest sequence: 36,805 residues

Available filtering algorithms applied to database sequences:

    Algorithm ID  Algorithm name  Algorithm options
        21        seg             window=12 locut=2.2 hicut=2.5

Volumes:
        /export/home/tao/blast_test/refseq_protein2.00
        /export/home/tao/blast_test/refseq_protein2.01
        /export/home/tao/blast_test/refseq_protein2.02

To create the database from the FASTA file with lowercase masking, using the masking data extracted in step 5.2.1.4, we use the following command line, which is very similar to that given in 5.2.2.1.

$ makeblastdb -in hs_chr.mfa -dbtype nucl -parse_seqids -mask_data hs_chr_mfa.asnb -out hs_chr_mfa -title "Human chromosomes (mfa)"

Here we use the lowercase masked FASTA sequence file as input (-in hs_chr.mfa), specify the database as nucleotide (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data (-mask_data hs_chr_mfa.asnb), and name the resulting database as hs_chr_mfa (-out hs_chr_mfa).

Checking the database thus generated using blastdbcmd, we have:

Database: Human chromosomes (mfa)
        24 sequences; 3,095,677,412 total bases

Date: Aug 26, 2009  11:41 AM     Longest sequence: 249,250,621 bases

Available filtering algorithms applied to database sequences:

    Algorithm ID  Algorithm name  Algorithm options
        40        repeat          repeatmasker lowercase

Volumes:
        /export/home/tao/hs_chr_mfa

The algorithm name and algorithm options are the values we provided in step 5.2.1.4.

5.2.3 Obtaining Sample data for this cookbook entry

For input nucleotide sequences, we use the BLAST database generated from a FASTA input file hs_chr.fa, containing complete human chromosomes from BUILD37.1, generated by inflating and combining the hs_ref_*.fa.gz files located at:

We use this command line to create the BLAST database from the input nucleotide sequences:

$ makeblastdb -in hs_chr.fa -dbtype nucl -parse_seqids -out hs_chr -title "Human chromosomes, Ref B37.1"

For input nucleotide sequences with lowercase masking, we use the FASTA file hs_chr.mfa, containing the complete human chromosomes from BUILD37.1, generated by inflating and combining the hs_ref_*.mfa.gz files located in the same ftp directory.

For input protein sequences, we use the preformatted refseq_protein database from the NCBI blast/db/ ftp directory:

5.3 Search the database with database soft masking information

To enable database masking during a BLAST search, we need to get the Algorithm ID using the -info parameter of blastdbcmd. For the database generated in step 5.2.2.2, we can use the following command line to activate the database soft masking created by windowmasker:

$ blastn -query HTT_gene -task megablast -db hs_chr -db_soft_mask 30 -outfmt 7 -out HTT_megablast_mask.out -num_threads 4

Here, we use the blastn program to search a nucleotide query HTT_gene* (-query HTT_gene) with the megablast algorithm (-task megablast) against the database created in step 5.2.2.1 (-db hs_chr). We invoke database soft masking (-db_soft_mask 30), set the result format to tabular output (-outfmt 7), and save the result to a file named HTT_megablast_mask.out (-out HTT_megablast_mask.out). We also activate the multi-threading feature of blastn to speed up the search by using 4 CPUs$ (-num_threads 4).

*This is a genomic fragment containing the HTT gene from human, including 5 kb up- and down-stream of the transcribed region. It is represented by NG_009378.

$ The number to use in your run will depend on the number of CPUs your system has.

In a test run on a 64-bit Linux machine, the above search takes 9.828 seconds of real time, while the same run without database soft masking takes 31 minutes 44.651 seconds.
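For comparison, the corresponding run without database soft masking simply omits the -db_soft_mask option; a sketch (the output file name is illustrative):

$ blastn -query HTT_gene -task megablast -db hs_chr -outfmt 7 -out HTT_megablast_nomask.out -num_threads 4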

5.4 Extract all human sequences from the nr database

Although one cannot select GIs by taxonomy from a database, a combination of unix command line tools will accomplish this:

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | awk '{ if ($2 == 9606) { print $1 } }' | blastdbcmd -db nr -entry_batch - -out human_sequences.txt

The first blastdbcmd invocation produces two entries per sequence (the GI and the taxonomy ID), the awk command selects from that output the sequences whose taxonomy ID is 9606 (human) and prints their GIs, and finally the second blastdbcmd invocation uses those GIs to print the sequence data for the human sequences in the nr database.
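The same pipeline works for any taxonomy ID; for example, a sketch that extracts mouse sequences (taxid 10090) instead (the output file name is illustrative):

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | awk '{ if ($2 == 10090) { print $1 } }' | blastdbcmd -db nr -entry_batch - -out mouse_sequences.txt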

5.5 Custom data extraction and formatting from a BLAST database

The following examples show how to extract selected information from a BLAST database and how to format it:

Extract the accession, sequence length, and masked locations for GI 71022837:

$ blastdbcmd -entry 71022837 -db Test/mask-data-db -outfmt "%a %l %m"
XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216

Extract the masked FASTA for GI 71022837:

$ blastdbcmd -entry 71022837 -db Test/mask-data-db -mask_sequence_with 20,40 -target_only
>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPP
PSTHGGPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavqklslaNEaaw
vsiGsaaetmedydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdtWGSMGHCY
LMMDDLQRAYTAYQQALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQ
QNKFPASLECFRYILDNPPRPLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAG
FNNQERAIQFLTKSLESDPNDAQSWYLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDA
LDAYSRAIRLNPYISEVWFDLGSLYEACNNQISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVP
QDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLVGPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGT
DDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPGPPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPP
HVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAGPGGPPPPLDHYGRPMGGPMSEREREMEWER
EREREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVYGRDEGRDYYNNSHPGSGPGGPRGGY
ERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFRGPPPGLGRPTPDWERTRAGE
YGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSPWSAKGGVAAPGKNGK
ASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLAARAPSSRMVD
EDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPEDSYKRA
KSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIH
EDPIDAPAKRADEDGAK

5.6 Display BLAST search results with custom output format

The -outfmt option permits formatting arbitrary fields from the BLAST tabular format. Use the -help option on the command-line application (e.g., blastn) to see the supported fields.

5.6.1 Example of custom output format

The following example shows how to display the results of a BLAST search using a custom output format. The tabular output format with comments is used, but only the query accession, subject accession, evalue, query start, query stop, subject start, and subject stop are requested. For brevity, only the first 10 lines of output are shown:

$ echo 1786181 | ./blastn -db ecoli -outfmt "7 qacc sacc evalue qstart qend sstart send"
# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 85 hits found

AE000111 AE000111 0.0 1 10596 1 10596
AE000111 AE000174 8e-30 5565 5671 6928 6821
AE000111 AE000394 1e-27 5587 5671 135 219
AE000111 AE000425 6e-26 5587 5671 8552 8468
AE000111 AE000171 3e-24 5587 5671 2214 2130
$

5.6.2 Trace-back operations (BTOP)

The “Blast trace-back operations” (BTOP) string describes the alignment produced by BLAST. This string is similar to the CIGAR string produced in SAM format, but there are important differences. BTOP is a more flexible format that lists not only the aligned region but also matches and mismatches. BTOP operations consist of 1.) a number with a count of matching letters, 2.) two letters showing a mismatch (e.g., “AG” means A was replaced by G), or 3.) a dash (“-“) and a letter showing a gap. Note also that BTOP always shows the query on the plus strand, whereas the CIGAR string always has the subject on the plus strand.
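For example, reading the first BTOP string in the box below, 7AG39 decodes as 7 matching bases, then a mismatch in which query base A is aligned to subject base G, then 39 more matches; similarly, 7A-39 places the query base A against a gap in the subject at that position.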

The box below shows a blastn run first with BTOP output and then the same run with the BLAST report showing the alignments.

$ blastn -query test_q.fa -subject test_s.fa -dust no -outfmt "6 qseqid sseqid btop" -parse_deflines
query1 q_multi 7AG39
query1 q_multi 7A-39
query1 q_multi 6-G-A41
$ blastn -query test_q.fa -subject test_s.fa -dust no -parse_deflines
BLASTN 2.2.24+

 Score = 82.4 bits (44),  Expect = 9e-22
 Identities = 46/47 (97%), Gaps = 0/47 (0%)

Query  1   ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  47  ACGTCCGGGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  93

 Score = 80.5 bits (43),  Expect = 3e-21
 Identities = 46/47 (97%), Gaps = 1/47 (2%)

Query  1   ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  1   ACGTCCG-GACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  46

 Score = 78.7 bits (42),  Expect = 1e-20
 Identities = 47/49 (95%), Gaps = 2/49 (4%)

Query  1   ACGTCC--GAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||  |||||||||||||||||||||||||||||||||||||||||
Sbjct  94  ACGTCCGAGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  142

5.7 Use blastdb_aliastool to manage the BLAST databases

Often we need to search multiple databases together, or we wish to search a specific subset of sequences within an existing database. At the BLAST search level, we can provide multiple database names to the "-db" parameter, or provide a GI file specifying the desired subset to the "-gilist" parameter. However, a more convenient way to conduct these types of searches is to create virtual BLAST databases for them. Note: When combining BLAST databases, all the databases must be of the same molecule type. The following examples assume that the two databases as well as the GI file are in the current working directory.

5.7.1 Aggregate existing BLAST databases

To combine the two nematode nucleotide databases, named “nematode_mrna” and “nematode_genomic", we use the following command line:

$ blastdb_aliastool -dblist "nematode_mrna nematode_genomic" -dbtype nucl -out nematode_all -title "Nematode RefSeq mRNA + Genomic"

5.7.2 Create a subset of a BLAST database

The nematode_mrna database contains RefSeq mRNAs for several species of round worms. The best subset is from C. elegans. In most cases, we want to search this subset instead of the complete collection. Since the database entries are from NCBI nucleotide databases and the database is formatted with "-parse_seqids", we can use the "-gilist c_elegance_mrna.gi" parameter/value pair to limit the search to the subset of interest. Alternatively, we can create a subset of the nematode_mrna database as follows:

$ blastdb_aliastool -db nematode_mrna -gilist c_elegance_mrna.gi -dbtype nucl -out c_elegance_mrna -title "C. elegans refseq mRNA entries"

Note: one can also specify multiple databases using the -db parameter of blastdb_aliastool.

5.8 Reformat BLAST reports with blast_formatter

It may be helpful to view the same BLAST results in different formats. A user may first parse the tabular format looking for matches meeting certain criteria, then go back and examine the relevant alignments in the full BLAST report. The user may also first look at pairwise alignments, then decide to use a query-anchored view. Viewing a BLAST report in different formats has been possible on the NCBI BLAST web site since 2000, but has not been possible with stand-alone BLAST runs. The blast_formatter allows this, provided the original search produced the BLAST archive format using the -outfmt 11 switch. The query sequence, the BLAST options, the masking information, the name of the database, and the alignment are written out as ASN.1 (a structured format similar to XML). The blast_formatter reads this information and formats a report. The BLAST database used for the original search must be available, or the sequences need to be fetched from the NCBI, assuming the database contains sequences in the public dataset. The box below illustrates the procedure. A blastn run first produces the BLAST archive format, and the blast_formatter then reads the file and produces tabular output.

Blast_formatter will format stand-alone searches performed with an earlier version of a database if both the search and formatting databases are prepared so that fetching by sequence ID is possible. To enable fetching by sequence ID, use the -parse_seqids flag when running makeblastdb, or (if available) download preformatted BLAST databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ using update_blastdb.pl (provided as part of the BLAST+ package). Currently the BLAST archive format and blast_formatter do not work with database-free searches (i.e., when -subject rather than -db was used for the original search).

$ echo 1786181 | blastn -db ecoli -outfmt 11 -out out.1786181.asn
$ blast_formatter -archive out.1786181.asn -outfmt "7 qacc sacc evalue qstart qend sstart send"
# BLASTN 2.2.24+
# Query: gi|1786181|gb|AE000111.1|AE000111 Escherichia coli K-12 MG1655 section 1 of 400
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 85 hits found

AE000111 AE000111 0.0 1 10596 1 10596
AE000111 AE000174 8e-30 5565 5671 6928 6821
AE000111 AE000394 1e-27 5587 5671 135 219

AE000111 AE000425 6e-26 5587 5671 8552 8468
AE000111 AE000171 3e-24 5587 5671 2214 2130
AE000111 AE000171 1e-23 5587 5670 10559 10642
AE000111 AE000376 1e-22 5587 5675 129 42
AE000111 AE000268 1e-22 5587 5671 6174 6090
AE000111 AE000112 1e-22 10539 10596 1 58
AE000111 AE000447 5e-22 5587 5670 681 598
AE000111 AE000344 6e-21 5587 5671 4112 4196
AE000111 AE000490 2e-20 5584 5671 4921 4835
AE000111 AE000280 2e-20 5587 5670 12930 12847
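The same archive can be reformatted again into any other supported report type; for example, a sketch that produces the default pairwise report from the archive (the output file name is illustrative):

$ blast_formatter -archive out.1786181.asn -outfmt 0 -out out.1786181.pairwise.txt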

5.9 Extract lowercase masked FASTA from a BLAST database with masking information

If a BLAST database contains masking information, this can be extracted using the blastdbcmd options -db_mask and -mask_sequence as follows:

$ blastdbcmd -info -db mask-data-db

Database: Mask data test
        10 sequences; 12,609 total residues

Date: Feb 17, 2009  5:10 PM     Longest sequence: 1,694 residues

Available filtering algorithms applied to database sequences:

    Algorithm ID  Algorithm name  Algorithm options
        20        seg             default options used
        40        repeat          -species Desmodus_rotundus

$ blastdbcmd -db mask-data-db -mask_sequence_with 20 -entry 71022837

>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 [Ustilago maydis 521]
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPPP
GGPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavQklslANEaawvsIGsaetm
EdydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdTWGSMGHCYLMMDDLQRAYTYQQ
ALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQQNKFPASLECFRYILDPP
PLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAGFNNQERAIQFLTKSLESDPNDQS
YLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDALDAYSRAIRLNPYISEVWFDLGSLYECN
QISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVPQDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPL
GPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGTDDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGSP
GPPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPPHVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLA
AAGPGGPPPPLDHYGRPMGGPMSEREREMEWEREREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGR
PVYGRDEGRDYYNNSHPGSGPGGPRGGYERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRD
GPFRGPPPGLGRPTPDWERTRAGEYGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSR
AGSPWSAKGGVAAPGKNGKASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSN
EPLAARAPSSRMVDEDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVD
EPEDSYKRAKSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQP
MDVDGREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIHE
DPIDAPAKRADEDGAK
$

5.10 Display the locations where BLAST will search for BLAST databases

This is accomplished by using the -show_blastdb_search_path option in blastdbcmd:

$ blastdbcmd -show_blastdb_search_path
:/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:
$

5.11 Display the available BLAST databases at a given directory:

This is accomplished by using the -list option in blastdbcmd:

$ blastdbcmd -list repeat -recursive
repeat/repeat_3055 Nucleotide
repeat/repeat_31032 Nucleotide
repeat/repeat_35128 Nucleotide
repeat/repeat_3702 Nucleotide
repeat/repeat_40674 Nucleotide
repeat/repeat_4530 Nucleotide
repeat/repeat_4751 Nucleotide
repeat/repeat_6238 Nucleotide
repeat/repeat_6239 Nucleotide
repeat/repeat_7165 Nucleotide
repeat/repeat_7227 Nucleotide
repeat/repeat_7719 Nucleotide
repeat/repeat_7955 Nucleotide
repeat/repeat_9606 Nucleotide
repeat/repeat_9989 Nucleotide
$

The first column of the default output is the file name of the BLAST database (usually provided as the -db argument to other BLAST+ applications), and the second column is the molecule type of the BLAST database. This output is configurable via the -list_outfmt command line option.


Local installation of ncbi-blast+ together with the nr and taxonomy database

It has been a while since I last installed my local nr and taxonomy databases. This week, I need to do this again for a different server, so I think it might be worthwhile to write a brief note recording the whole process for my future reference.

If the ncbi-blast+ software has not been installed on our system, we need to first download and install it before setting up those two databases. For this example, my installed ncbi-blast+ version is 2.2.31, but you can change it to whichever version you like from this list: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/

Add the returned full path to the $PATH environment variable in your shell startup file.

Now that we have installed ncbi-blast+, we need to create an environment variable $BLASTDB that points to the directory where we want to store our NCBI nr database and the taxonomy database. Say we want to set $BLASTDB to "/home/users/DB"; we can do this by typing:
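A minimal sketch, assuming a bash-style shell (add the same line to the shell startup file, e.g. ~/.bashrc, to make it persistent):

$ export BLASTDB=/home/users/DB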

Now we can use the update_blastdb.pl script (shipped with ncbi-blast+) to download and set up the nr database:
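A sketch of one way to do this (the downloaded volumes are tar archives that must be extracted; newer BLAST+ releases also provide a --decompress flag that does this automatically):

$ cd $BLASTDB
$ update_blastdb.pl nr
$ for f in nr.*.tar.gz; do tar -xzf "$f"; done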

We will install the taxonomy database in a similar way:
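A sketch, assuming the taxonomy database is downloaded as taxdb from the same FTP area:

$ cd $BLASTDB
$ update_blastdb.pl taxdb
$ tar -xzf taxdb.tar.gz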

Now you should be able to run local BLAST searches against the nr database by running a command like the following.
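A sketch for a protein query (the blastp program and the file names are illustrative; choose the program matching your query type):

$ blastp -query my_proteins.fasta -db nr -out my_proteins_vs_nr.txt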

To further retrieve the taxonomy information of each hit, you need to set customized output format, such as:
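A sketch using the tabular format with taxonomy columns (the column selection and file names are illustrative; the staxids and sscinames fields require the taxonomy database installed above):

$ blastp -query my_proteins.fasta -db nr -outfmt "6 qseqid sseqid pident length evalue bitscore staxids sscinames" -out my_proteins_vs_nr.tsv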


Results

Outline of the Algorithmic Approach

A read aligner should deliver the original position of the read in the reference genome. Such a position will be called the true position in the following. Optimally scoring local alignments of the read and the reference genome can be used to obtain a possible true position, but because an alignment of the read with the reference genome at the true position does not always have an optimal score according to the chosen scoring scheme, this method does not always work. Nevertheless, there are no better approaches available unless further information about the read is at hand.

We present a new read mapping approach that aims at finding optimally scoring local alignments of a read and the reference genome. It is based on computing inexact seeds of variable length and allows to handle insertions, deletions (indels gaps), and mismatches. Throughout the document the notion of differences refers to mismatches, insertions and deletions in some local alignment of the read and the reference genome, irrespective of whether they arise from technical artifacts or sequence variation. A single difference is either a single mismatch, a single character insertion or a single character deletion. Although not limited to a specific scoring scheme, we have implemented our seed search model in the program segemehl assigning a score of 1 to each match and a score of −1 to each mismatch, insertion or deletion. Our matching strategy derives from a simple and commonly used idea. Assume an optimally scoring local alignment of a read with the reference genome with exactly two differences. If the positions of the differences in the alignment are sufficiently far apart, we can efficiently locate exact seeds which in turn may deliver the position of the optimal local alignment in the reference genome. Likewise, if the distance between the two differences is small, two continuous exact matches at the ends of the read possibly allow to map the read to this position. To exploit this observation, the presented method employs a heuristic based on searches starting at all positions of the read. That is, for each suffix of the read the longest prefix match, i.e. the longest exact match beginning at the first position of the suffix with all substrings of the reference genome is computed. If the longest prefix match is long enough that it only occurs in a few positions of the reference genome, it may be feasible to check all these positions to verify if the longest prefix match is part of a sufficiently good alignment. While this approach works already well for many cases, we need to increase the sensitivity for cases where the computation of the longest prefix match fails to deliver a match at the position of the optimally scoring local alignment. This is the case when a longer prefix match can be obtained at another position of the reference genome by exactly matching characters that would result in a mismatch, insertion or deletion in the optimal local alignment (cf. Fig. 1). Therefore, during the computation of each longest prefix match we check a limited number of differences by enumerating at certain positions all possible mismatches and indels (cf. Fig. 2).

Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, a single insertions or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character, that is inserted, deleted, or mismatched in the best local alignment. Depending on the length of the reference genome and its nucleotide composition the probability is determined by the length of the substring that can be matched to the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P: = cttcttcggc begins at position 3 of the reference genome S: = atacttcttcggcaga. Let Pi denote the i th suffix of the read P. For each Pi, the starting positions of the longest match in S comprise the position of Pi in the best local alignment (solid green lines). That is, the longest match of P0 begins at position 3, the longest match of P1 begins at position 4, the longest match of P2 begins at position 5 and so forth. (B) For the read P: = cttc g tcggc, the retrieval of the best local alignment fails for all Pi, i<5 (dashed red line) due to the inclusion of a character that results in a mismatch in the optimally scoring local alignment. (C) The read P: = cttct g cggc contains, with respect to the best local alignment, a mismatch at position 5 of the read. Here the position 5 of the read is not included in the longest prefix match and nearly all substrings align correctly to the reference genome.

We give an explanation based on a suffix trie which is equivalent to the suffix interval tree shown in Fig. 5 (see Methods). The suffix trie for S$ with S: = acttcttcggc (left) holds twelve leaves. Each numbered leaf corresponds to exactly one suffix in S. Nodes with only one child are not explicitly shown. Note, that internal nodes implicitly represent all leafs in their respective subtree. Thus, internal nodes can be regarded as sets of suffixes. The right panel holds the longest matches for different matching paths in the trie. Matching the first three suffixes of the read P: = cgtcggc results in three different paths in the suffix trie. Each path is equivalent to a sequence of suffix intervals, a matching stem, in the enhanced suffix array. Let denote the matching stem for Pi = i th suffix of P. The q th interval in , denoted by , implicitly represents the set of suffixes in S matching P[ii+q−1]. The path for the first suffix P0 is of length two (green solid line). Hence, the equivalent matching stem is a sequence of three intervals: , and . Since only represents the suffix S7, the longest prefix match of P0 is of length 2 occurring at position 7 of the reference sequence (right panel). The matching stem for P1 (red solid line) ends with . Therefore, matches of length one occur at positions 8 and 9 in S. The longest prefix match for P3 occurs at position 6 of S (dashed orange line). Note, that the intervals of equivalently represent S6. An alternative path leads to a match with position 4. The branch denotes the alternative that accepts the mismatch of g and t at position 1 of P0.

To efficiently compute the longest prefix matches, we exploit their properties for two consecutive suffixes of a read, i.e. for two suffixes starting at positions i and i+1. If the suffix starting at position i has a longest prefix match of length ℓ, then the suffix starting at position i+1 has a longest prefix match of length at least ℓ−1. For example, assume a read ACTGACTG. If the second suffix has a longest prefix match of length 4, i.e. CTGA, with the reference genome, we immediately see that the third suffix has a longest prefix match not shorter than 3, because we already know that the substring TGA exists in the reference genome. Using an enhanced suffix array of the reference sequence, we can easily exploit this fact and determine the longest prefix match of the next suffix without rematching the first ℓ−1 characters. Likewise, the enumeration of mismatches and indels is also restricted to the remaining characters of the suffix in our model.

For each suffix of a read, we thus obtain a set of exact matches and alternative inexact matches and their respective positions in the reference sequence. These exact and inexact matches act as seeds. If a seed occurs more than t times in the reference genome, then it is omitted, where t is a user specified parameter ( segemehl option –maxocc ). The heuristics rigorously selects the exact or inexact seed with the smallest E-value, computed according to the Blast-statistics [14]. If this E-value is smaller than some user defined threshold ( segemehl option -E ), the bitvector algorithm of [1] is applied to a region around the genomic position of the seed to obtain an alignment of the read and the reference sequence. While the score based search for local alignment seeds controls the sensitivity of our matching model, the bitvector alignment controls its specificity: if the alignment has more matching characters than some user specified percentage a of the read ( segemehl option -A ) the corresponding genomic position is reported (see Methods).

The computation of the longest prefix match is implemented by a top-down traversal of a conceptual suffix interval tree, guided by the characters of the read. The suffix interval tree is equivalent to a suffix trie (see Methods). The traversal delivers a matching stem. Note that for the DNA alphabet there are at most four edges outgoing from each node of the suffix interval tree. To introduce mismatches, the traversal is simply continued with alternative edges, i.e. edges diverging from the matching stem. To introduce insertions, the traversal is not regularly continued, but characters of the read are skipped. Deletions are simulated by skipping nodes of the suffix interval tree and continuing the search at their child nodes (see Methods). We refer to these alternative paths that branch off from the matching stem as branches. The maximum number of branches to be considered is controlled by the seed differences threshold k ( segemehl option -D ). Note, that while matching character by character along a suffix of a read, the number of branches is expected to decrease quickly.

Performance Tests

segemehl constructs indices either for each chromosome of a genome and the matching is performed chromosome-wise or, depending on the available RAM , chromosomes are combined to larger sequences. Compared to other methods, the index structure used by segemehl is significantly larger. For example, the enhanced suffix array of human chromosome 1 occupies approximately 3 GB of space. As it is stored on disk, the index only needs to be computed once. The construction of the index requires linear time. For example, on a single CPU, the construction of the complete enhanced suffix array for human chromosome 1 takes approximately 15 minutes. For our comparison, we ran segemehl with maximum occurrence parameter t = 500. The maximum E-value for seeds was set to 0.5 and minimum identity threshold to a = 85% which corresponds to a maximum of ⌈0.15·m⌉ differences in an alignment of the read of length m.

We compared segemehl to Bowtie v0.9.7 with option –all, BWA v0.2.0, MAQ v0.7.1, PatMaN v1.2.1 and SOAP v1.11 with option –r 2. MAQ and SOAP are based on ungapped alignments which are computed by hash lookups [7],[8],[13]. Due to length restrictions, MAQ is limited to Illumina (and SOLiD) reads. It additionally takes quality scores into account. The quality values needed by MAQ were, for all nucleotides, uniformly set to a value corresponding to the error rate. Bowtie [11] and BWA [13] index the reference genome with the Burrows-Wheeler transform. BWA allows a limited number of indels. PatMaN [12] matches the reads by traversing a non-deterministic suffix automaton constructed from the reference genome. Except for PatMaN , all programs only report matches with the smallest edit distance. BWA and Bowtie each need about 10 minutes to build their index. The fastq files needed by MAQ are built in approximately 2 minutes. PatMaN and SOAP require no indexing steps. The options for the other programs were chosen so as to achieve results similar to segemehl . For our comparison, we performed tests on simulated as well as real-life read data sets. For the simulation we generated read sets representing different error rates, types and distributions. We used three distinct error sets, one containing only mismatches, one containing only indels and a last one representing reads with mismatches and indels at a ratio of 1∶1. Additionally, different error distributions were used to model error scenarios such as terminal contamination (e.g. linker, poly-A tails) or decreasing read quality. We chose uniform, 5′, 3′ and terminal error distributions.

Each simulated dataset contained 500 000 simulated reads, each of length 35 bp, sampled from a 50 MB large region of the human genome (chromosome 21). We introduced errors to each simulated read according to previously defined rates, error types and distributions. For the 50 MB region we constructed the indexes required for segemehl and Bowtie . For MAQ we constructed the index for the read set under consideration. Index construction took approximately one minute for Bowtie and BWA . The construction for the enhanced suffix array for segemehl took 3.5 minutes. The binary fastq files for MAQ were created in about 20 seconds.


BLAST alternatives

Depending on your sequencing result and desired analysis, BLAST may not always be your optimal choice. For difficult sequence alignments that BLAST is unable to handle, Clustal is our frequent choice for pairwise or multiple sequence alignments of nucleotide or protein sequences. We also use COBALT for aligning multiple protein sequences, particularly for comparing different isoforms. In addition to our favorites, there are a number of sequence alignment tools available.

Try these resources for lists of alternatives to BLAST:


Background

The Basic Local Alignment Search Tool (BLAST) from NCBI has been a popular tool for analyzing the large data sets of genetic sequences that have become common when working with new generation sequencing technologies. BLAST has been the preferred utility for sequence alignment and identification in bioinformatics and genomics research and workflows for almost 30 years [1]. One of the main challenges for researchers utilizing the NCBI BLAST is interpreting the huge amount of output data produced when analyzing large numbers of input sequences. While BLAST does allow for multiple output formats as well as limiting the number of top hit results (using -outfmt and -max_target_seqs, respectively) [2], for some purposes such as pushing results down a workflow or pipeline, these tools may not be enough to ensure results that can be meaningfully and reasonably interpreted. The controversy raised by Shah et al. [3] in their 2018 paper outlining a bug in the functionality of the -max_target_seqs parameter has started a discussion in the BLAST community over the usage and potential for misuse of the parameter. NCBI published a response stating that the utility of this parameter is simply misunderstood by the community and that the bug seen by Shah et al. was the result of “overly aggressive optimization” introduced in 2012, and patched the issue following the release of BLAST+ 2.8.1 in 2018 [4]. However, follow up test cases and posts, including those by Peter Cock [5], have shown that this issue is much more complex than simply “BLAST returns the first N hits that exceed the specified e-value threshold”. While the update 2.8.1 fixed 9/10 of Shah et al’s test cases, according to the post by Peter Cock, 1/10 remained invalid, due to an error with the internal candidate sequence limit introduced by -max_target_seqs 1. This is because, as was stated by Shah et al. [3], the -max_target_seqs parameter is applied much earlier in the search, before the final gapped alignment stage. This means that the use of this parameter can change the number of sequences processed as well as the statistical significance of a hit if using composition-based statistics [2]. This is contrary to the popular assumption that the parameter is simply a filter applied post search [6]. This intuition is false, and may lead to errors in the resulting data of a BLAST search if the value of -max_target_seqs is too small. The use of -max_target_seqs in this way is not advised. As a result of the misinformation and confusion over ‘-max_target_seqs’ and other intricacies of the BLAST heuristic search and filtering process, there has been a push towards more detailed documentation of these processes and effects of parameters on the BLAST algorithm [6], with NCBI adding a warning to the BLAST command-line application if the value of -max_target_seqs is less than 5 [7]. The community has also moved towards better methods of narrowing the results of a large search, as opposed to using BLAST parameters that may affect the actual search process. These methods include resources like Bio-Perl and BioPython that can be used to create scripts to parse and filter result files. A few community written scripts can be found available online, such as the Perl script created by Dr. Xiaodong Bai and published online by Ohio State [8], a version of this script produced by Erin Fichot [9], and a XML to tabular parser by Peter Cock [10]. 
While all of these scripts (and others like them) can potentially be very useful for parsing BLAST XML results into a concise tabular format, most have drawbacks that leave much to be desired. First and most importantly, for Bai and Fichot, the programs require Perl and Bio-Perl modules which can be unwieldy and slow for use in high-throughput workflows and pipelines, especially those which are built on a modern Python framework. Furthermore, both scripts contain a bug, found on lines 77 and 93 respectively, that causes the query frame value to be lost through the parsing process, setting the value to 0. Our team sought to correct this and other errors and to provide a verified solution that can be soundly applied for research purposes. Secondly, the team saw a need for increased functionality, particularly the ability to filter results by threshold values input by the user. The only program that implements a threshold other than BLAST-QC is the script by Fichot, but only a bit score threshold is implemented. Our team sought to provide a single solution that would let researchers determine the best combination of values that would be optimal for any given experiment without the need to change parsers between runs. The team's central motivation was to create a dedicated piece of quality control software for use in research workflows: a solution that solely utilizes Python 3, streamlines the process, and reduces run times for parsing large data sets of BLAST results.

Implementation

BLASTQC has been implemented in a single python file, requiring only that Python3 be installed for all functionalities to be used. The team felt that an implementation in Python was important for the simplicity and ease of use that comes with the Python environment. Python is also one of the most popular and well understood languages for research purposes, and thus is a perfect choice for a tool that is designed for portability and integration into other research processes. Python is also capable of very fast runtimes when compared to other interpreted languages, such as Perl, and while it may be slower than a compiled language like C, the benefits in ease of use and portability outweigh the minor increase in runtimes. For example, C requires the use of dependencies like libxml2 for parsing, requiring a higher level of knowledge to make modifications to source code, and as such is less desirable as a simple addition to bioinformatic workflows already built within the Python framework. With Python 3 the parsing step of the workflow is simplified to a single file. Furthermore, the use of a standalone script rather than the use of a command line sorting option such as GNU sort not only provides a great increase in possible functionality, as implementing filtering parameters in bash on the command line can be cumbersome, but also allows for a better user experience for researchers who don’t want to memorize long sort commands that need to be changed constantly as experiment goals change. The BLAST-QC script implements thresholds on e-value, bit-score, percentage identity, and the number of ‘taxids’ (level of taxonomic or functional gene detail) in the definition of a hit (<hit_def > in BLAST XML results). It is also possible for the user to choose which of these values the output should be ordered by and how many top matches should be returned per query sequence in the input. Thus, the behavior of the -max_target_seqs parameter may be implemented with ease without altering the search process. Additionally, if the researcher decides that a higher bit-score is more important for a certain experiment, it is trivial to change the parsing process to return the highest bit-score hit, whereas max_target_seqs only supports returning top hits by e-value. Further, the Python script is also capable of setting a range on the threshold values, and selecting those sequences that produced a more detailed hit definition within that range. This is useful for researchers because it avoids the problem of finding a high scoring sequence that provides no relevant information, as there may be little use in knowing that a hit accurately matches an unknown sequence. For example, a BLAST search may return a hit sequence with a taxid of “protein of unknown function DUF1680”. This may not be a useful result for a study on the function of a specific protein, regardless of how low the evalue of the hit. BlastQC allows researchers to define the reasonable evalue for their application using input parameters, and returns hits with more informative taxids that still fit within the chosen parameters. Increased definition information is useful for narrowing the taxonomy of a species (for BLAST nucleotide) or the type/functionality of a protein sequence in a protein BLAST query. The team also found an issue in many of the available community parsers regarding the number of HSPs (high scoring pairs) per hit. 
In some cases BLAST may return multiple HSPs per hit sequence, and the BLAST-QC script handles this by considering it a separate hit that retains the same id and definition. This case was not covered in any of the scripts the team encountered online, causing hits with multiple HSPs to lose any data from the additional HSPs. The BLAST-QC Python script is compatible with all BLAST types (BLASTP, BLASTN, BLASTX, etc.) as well as both the tabular and XML output formats (−outfmt 6 and -outfmt 5, respectively) and reports all relevant data produced in a BLAST results file: query name, query length, accession number, subject length, subject description, e-value, bit-score, query frame, query start, query end, hit start, hit end, percent identity, and percent conserved (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore and salltitles (optional) for tabular output). Information on these values can be found in the BLAST glossary and manual [2, 11], and the two percentage values (percent identity and percent conserved) have been calculated using the identity (Hsp_identity), positive (Hsp_positive) and align length values (Hsp_align-len). Percent identity is defined as the percent of the sequences that have identical residues at the same alignment positions and is calculated by the number of identical residues divided by the length of the alignment multiplied by 100 (100*(hsp_identity/hsp_align-len)). Percent conserved (positive) is defined as the percent of the sequences that have ‘positive’ residues (chemically similar) at the same alignment positions and is calculated by the number of positive residues divided by the length of the alignment multiplied by 100 (100*(hsp_positive/hsp_align-len)). Additionally, BLAST-QC supports parallel processing of results, using Python’s multiprocessing module. The number of concurrent BLASTQC processes defaults to the number of CPU cores present on the machine, but this value may be adjusted using the -p command line option. If sequential processing is desired, the number of processes may be set to 1 using “-p 1”. This along with the ability to pipe input from stdin allow for replication of some of GNU sort’s main features.
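As a worked illustration with hypothetical numbers: for an HSP with Hsp_align-len = 200, Hsp_identity = 150 and Hsp_positive = 170, BLAST-QC would report a percent identity of 100*(150/200) = 75% and a percent conserved of 100*(170/200) = 85%.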


Browse the manual

Clicking Next leads to the parameters for the read mapping (see figure 27.3).


Figure 27 . 3 : Setting parameters for the mapping.

  • Match score The positive score for a match between the read and the reference sequence. It is set by default to 1 but can be adjusted up to 10.
  • Mismatch cost The cost of a mismatch between the read and the reference sequence. Ambiguous nucleotides such as "N", "R" or "Y" in read or reference sequences are treated as mismatches and any column with one of these symbols will therefore be penalized with the mismatch cost.

After setting the mismatch cost you need to choose between linear gap cost and affine gap cost, and depending on the model you choose, you need to set two different sets of parameters that control how gaps in the read mapping are penalized.

  • Linear gap cost The cost of a gap is computed directly from the length of the gap and the insertion or deletion cost. This model often favors small, fragmented gaps over long contiguous gaps. If you choose linear gap cost, you must set the insertion cost and the deletion cost:
    • Insertion cost. The cost of an insertion in the read (a gap in the reference sequence). The cost of an insertion of length ℓ will be ℓ × Insertion cost.
    • Deletion cost. The cost of a deletion in the read (a gap in the read sequence). The cost of a deletion of length ℓ will be ℓ × Deletion cost.
    • Insertion open cost. The cost of opening an insertion in the read (a gap in the reference sequence).
    • Insertion extend cost. The cost of extending an insertion in the read (a gap in the reference sequence) by one column.
    • Deletion open cost. The cost of opening a deletion in the read (a gap in the read sequence).
    • Deletion extend cost. The cost of extending a deletion in the read (gap in the read sequence) by one column.

    Using affine gap cost, an insertion of length ℓ is penalized by
    Insertion open cost + ℓ × Insertion extend cost

    and a deletion of length ℓ is penalized by
    Deletion open cost + ℓ × Deletion extend cost

    Adjusting the cost parameters above can improve the mapping quality, especially when the read error rate is high or the reference is expected to differ significantly from the sequenced organism. For example, if the reads are expected to contain many insertions and/or deletions, it can be a good idea to lower the insertion and deletion costs to allow more of such errors. However, one should also consider the possible drawbacks when adjusting these settings: reducing the insertion and deletion cost increases the risk of mapping reads to the wrong positions in the reference.


    Figure 27 . 4 : An alignment of a read where a region of 35bp at the start of the read is unaligned while the remaining 57 nucleotides matches the reference.

    Figure 27.4 shows an example using linear gap cost where the read mapper is unable to map a region in a read due to insertions in the read and mismatches between the read and the reference. The aligned region of the read has a total of 57 matching nucleotides, which results in an alignment score of 57; this is optimal when using the default costs for mismatches and insertions/deletions (2 and 3, respectively). If the mapper had aligned the remaining 35 bp of the read as shown in figure 27.5 using the default scoring scheme, the resulting score would be lower, since the additional matches would not outweigh the penalties for the extra mismatches and deletions.

    In this case, the alignment shown in figure 27.4 is optimal since it has the highest score. However, if either the cost of deletions or the cost of mismatches were reduced by one, the score of the alignment shown in figure 27.5 would become 61 or 58, respectively, and thus make it optimal.
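    As a worked illustration with hypothetical numbers: an alignment containing 90 matches, 3 mismatches and 2 deletions scores 90x1 - 3x2 - 2x3 = 78 under the default costs; lowering the deletion cost to 2 raises this to 80, while lowering the mismatch cost to 1 raises it to 81.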


    Figure 27 . 5 : An alignment of a read containing a region with several mismatches and deletions. By reducing the default cost of either mismatches or deletions the read mapper can make an alignment that spans the full length of the read.

    Once the optimal alignment of the read is found, based on the cost parameters described above, a filtering process determines whether this match is good enough for the read to be included in the output. The filtering threshold is determined by two factors:

    • Length fraction The minimum percentage of the total alignment length that must match the reference sequence at the selected similarity fraction. A fraction of 0.5 means that at least half of the alignment must match the reference sequence before the read is included in the mapping (if the similarity fraction is set to 1). Note that the minimal seed (word) size for read mapping is 15 bp, so reads shorter than this will not be mapped.
    • Similarity fraction The minimum percentage identity between the aligned region of the read and the reference sequence. For example, if the identity should be at least 80% for the read to be included in the mapping, set this value to 0.8. Note that the similarity fraction relates to the length fraction, i.e. when the length fraction is set to 50% then at least 50% of the alignment must have at least 80% identity (see figure 27.6).


    Figure 27 . 6 : A read containing 59 nucleotides where the total alignment length is 60. The part of the alignment that gave rise to the optimal score has length 58 which excludes 2 bases at the left end of the read. The length fraction of the matching region in this example is therefore 58/60 = 0.97. Given a minimum length fraction of 0.5, the similarity fraction of the alignment is computed as the maximum similarity fraction of any part of the alignment which constitute at least 50% of the total alignment. In this example the marked region in the alignment with length 30 (50% of the alignment length) has a similarity fraction of 0.83 which will satisfy the default minimum similarity fraction requirement of 0.8.



    Staphopia includes a primer-based SCCmec typing scheme. Unfortunately, there was a disconnect between running the analysis and making the call. The primers are aligned by BLAST in the Staphopia Analysis Pipeline, but the logic making the type and subtype calls was in the Staphopia website.

    staphopia-sccmec takes this process and merges it into a single command.

    You can either provide results already produced by Staphopia or give an assembly and a path to SCCmec data (included in this repo).

    staphopia-sccmec has two ways of being used, with Staphopia results or with an assembly.

    Part of the Staphopia Analysis Pipeline is to BLAST the SCCmec primers against your assembly. As mentioned previously, the logic to actually use those results and make a type call was included on the database side.

    staphopia-sccmec allows you to type using your existing Staphopia results. You just need to provide the path to a set of samples processed by staphopia and use the --staphopia parameter.

    Below is an example of typing the Staphopia results for SRX085180.
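    For instance, a sketch of such a run (the exact invocation may differ between versions; here the directory of Staphopia sample results is assumed to be passed via --staphopia):

    $ staphopia-sccmec --staphopia /path/to/staphopia/samples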

    As an alternative to Staphopia results, you can provide an assembly, or a directory of assemblies. When an assembly is provided, a BLAST database is created from the assembly and the primers are blasted against it.

    Below is an example of typing the GCF_001580515 assembly, available in the test/ directory.
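    For instance, a sketch of such a run (the parameter name for the assembly, here assumed to be --assembly, and the exact file name under test/ are illustrative):

    $ staphopia-sccmec --assembly test/GCF_001580515.fna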