Spaln (space-efficient spliced alignment) is a
stand-alone program that maps and aligns a set of cDNA or protein sequences
onto a whole genomic sequence in a single job. Spaln also performs spliced
or ordinary alignment after rapid similarity search against a protein sequence
database, if a genomic segment or an amino acid sequence is given as a query.
From Version 1.4, spaln supports a combination of protein sequence database and
a given genomic segment. From Version 2.2, spaln also performs rapid similarity
search and (semi-)global alignment of a set of protein sequence queries again
a protein sequence database. Spaln adopts multi-phase
heuristics that makes it possible to perform the job on a conventional personal
computer running under Unix/Linux with limited memory. The program is written
in C++ and distributed as source codes and also as executables for a few platforms.
Unless binaries are not provided, users must compile the program on their
own system. Although the program has been tested only on a Linux operating
system, it is likely to be portable to most Unix systems with little or no
modifications. The accessory program sortgrcd sorts the gene loci found
by spaln in the order of chromosomal position and orientation. From
version 2.3.2, spaln and sortgrcd can handle some gzipped files
without prior expansion if USE_ZLIB mode is activated upon compilation.From version 2.3.2a, compressed query sequence file(s) may also be
accepted.
[1] Gotoh, O. "
A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence"
Nucleic Acids Research 36 (8) 2630-2638 (2008).
[2] Gotoh, O. "
Direct mapping and alignment of protein sequences onto genomic sequence"
Bioinformatics 24 (21) 2438-2444 (2008).
[3] Iwata, H. and Gotoh, O. "
Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features"
Nucleic Acids Research 40 (20) e161 (2012)
[4] Nagasaki, H., Arita, M., Nishizawa, T., Suwa, M., Gotoh, O. "
Automated classification of alternative splicing and transcriptional initiation and construction of a visual database of the classified patterns" Bioinformatics 22 (10) 1211-1216 (2006).
From source
To compile the source codes in the default settings, follow the instructions below. If you download the source file (spaln2.3.2a) in the directory download, five directories will be generated under download/spalnXX/ after installation, where XX is a version code. We assume work is your workspace, which may or may not be identical to download. Note: The default destinations of the installed files of version 1 are somewhat different from those described here.
- bin : binaries
- doc : documents
- seqdb : sample sequences. In this directory you should format genomic or database files
- src : source codes
- table : parameter files used by spaln
To modify the location of executables and/or other settings, run 'configure --help' at step 6 below. (Warning: Full path name rather than relative path name must be given for executables or other directories as the arguments of the configure command.) These locations are hard coded in spaln. The locations of the 'seqdb' and 'table' directories will be respectively denoted by seqdb and table below. Hence, seqdb=download/spalnXX/seqdb, and table=download/spalnXX/table in the default settings.
- % mkdir download
- % cd download
- Download spalnXX.tar.gz
- % tar xfz spalnXX.tar.gz
- % cd ./spalnXX/src
- % ./configure [--help]
- Please manually edit Makefile if $(CC) does not indicate a C++ compiler or
- % CC=g++ ./configure [other options]
- To make spaln and sorgrcd handle gzipped files, ./configure command should be run with --use_zlib=1. Alternatively, you may manusally edit the generated Makefile so that -DUSE_ZLIB=1 is included in the compile option.
- % make
- % make install
- Executables are copied to ../bin
- makmdm program makes mutation data matrices of various PAM levels in the ../table directory
- % make clearall
- Add download/spalnXX/bin to your PATH
% setenv PATH $PATH:download/spalnXX/bin (csh/tsh)
$ export PATH=$PATH:download/spalnXX/bin (sh/bsh)
Preferably, you may add the above line in your start up rc file (e.g. ~/.bashrc)
Alternatively, move or copy download/spalnXX/bin/* to a directory on your PATH, if you have not specified the location of executables at step 6 above.
- If you have changed the location of table and/or seqdb directory after installation, set the env variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
- Proceed to Sequence data formation.
From binaries
Binaries for a 32 bit (spaln2.0.4.linux32) or 64 bit (spaln2.3.2a.linux64) Linux machine are available. To use the binaries, follow the instructions below.
Case I: Assume the directory work is your workspace where every material is stored. In this case, seqdb=work.
- % mkdir work
- % cd work
- Download spalnXX.PC.tar.gz, where PC is a platform code
- % tar xfz spalnXX.PC.tar.gz
- Add work/bin to your PATH
Or move or copy work/bin/* to a directory on your PATH
- % mv ./table/* .; rmdir ./table
- % mv ./seqdb/* .; rmdir ./seqdb
- Now proceed to Sequence data formation.
Case II: Assume your workspace work is distinct from seqdb
- % mkdir download
- % cd download
- Download spalnXX.PC.tar.gz, where PC is a platform code
- % tar xfz spalnXX.PC.tar.gz
- Add download/bin to your PATH
Or move or copy download/bin/* to a directory on your PATH
- % setenv ALN_TAB download/table (csh/tsh)
$ export ALN_TAB=download/table (sh/bsh)
- % setenv ALN_DBS download/seqdb (csh/tsh)
$ export ALN_DBS=download/seqdb (sh/bsh)
- Add the above lines to your rc file, so that you don't have to repeat the commands at every login time.
- Now proceed to Sequence data formation
If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format.
To perform genome mapping, the genomic sequence must be formatted before use. Formatting is optional for amino acid sequence database search.
- % cd seqdb
- Download or copy genomic sequences or protein database sequence in multi-fasta format. If spaln is accordingly compiled, gzipped file need not be uncompressed (the file name must be X.gz).
- Chromosomal sequences should be concatenated into a single file. Alternatively, you can use multiple chromosomal files without concatenation. This procedure will be described at the end of this section. To render the 'make' command effective, the extension of the genomic sequence file should be '.mfa'or '.gf', and protein database sequence should be '.faa'. Hereafter, the file name is assumed to be xxxgnm.mfa or prosdb.faa.
The total number of residues in a file must not be greater than or equal to 2**32.
- % ./makeidx.pl -i[n|p|np] [spaln options] xxxgnm.mfa(.gz) or
% ./makeidx.pl -i[a] [spaln options] prosdb.faa(.gz)
- These commands are shortcuts that replace the following series of operations 5-7, if (i) the input is a single sequence file.
and (ii) the default parameter setting is used. In that case, you can skip following instructions.The block size and k-mer size are estimated from the genome size. The -ix option specifies the "block file(s)" .bkx to be constructed, where x is 'a', 'n' or 'p'. The -inp option will construct both .bkn and .bkp files together with the .idx and associated files. If -ix is omitted or x is empty, no block file is constructed.
- % make xxxgnm.idx (for genomic sequence) or
% make prosdb.idx (for protein database sequence)
- This command converts the sequence into a binary format. Four or five files, xxxgnm.seq, xxxgnm.idx, xxxgnm.ent, xxxgnm.grp, and optionally xxxgnm.odr are constructed (prosdb instead of xxxgnm in case of make prosdb.idx). It may take several tens of minutes to construct the files for mammalian genome.
- % make xxxgnm.bkn (for cDNA queries) or
% make xxxgnm.bkp (for protein queries) or
% make prosdb.bka (for protein database)
- This command makes the block index table. This process may take another several tens of minutes.
- Internally, the make command invokes
'spaln -Wxxxgnm.bkn -KD [Options] xxxgnm.mfa' or
'spaln -Wxxxgnm.bkp -KP [Options] xxxgnm.mfa' or
'spaln -Wprosdb.bka -KA [Options] prosdb.faa'.
- If xxxgnm.grp or prosdb.grp were successfully constructed at step 5 above, the option values below would be automatically calculated by script makblk.pl. WARNING: The estimated maximal gene size can be
inadequately small if only a part of the genome (e.g a single chromosome)
is formatted. At that time, explicitly specify the maximal gene size by
the -XGN option of makblk.pl or at the runtime of spaln.
N can have suffix 'k' and 'M' to indicate that the number is measured in kbp and Mbp, respectively.
- Options: (default value)
- -XAN: alphabet size of the reduced amino acids: 6 < N <= 20 (20)
- -XBS: bit patterns of the spaced seeds. The pattern should be asymmetric when the number of patterns > 2.
- -XCN: number of seed patterns: 0 <= N <= 5 (0: contiguous seed)
- -XGN: maximum gene length (262144)
- -XaN: a parameter used to filter excessively abundant words (10)
- -XbN: block size (4096)
N < 65536 and 'genome size' < 65536 * N must be satisfied. An estimate of N is sqrt(genome size). For mammals, N is nearly equal to 54000.
- -XgN: maximal distance in block number between 5' terminal and 3' terminal blocks (16)
- -XkN: word size (11 for DNA, 5 for protein)
- -XsN: distance between neighboring seeds (= k)
- It is possible to generate xxxgnm.idx and other three files directly from the input files without concatenation:
% makdbs -nxxxgnm -KD file1 ... fileN
% make xxxgnm.bkn (for cDNA queries) or
% make xxxgnm.bkp (for protein queries)
This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
- Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta
format (denoted by query below). From 2.3.2a, zgipped fasta file(s) may
be used as the query without prior expansion. Note, however, that compressed
query can considerably slow down the execution rate.
- Store query to work
- % cd work
- Run spaln in one of the following four modes. Spaln
does not support comparison between two genomic segments.
(A) % spaln -Q[0|1|2|3] [-ON] [other options]
genome_segment query
(B) % spaln -Q[4|5|6|7] [-ON] [other options]
-dxxxgnm query
(C) % spaln -Q[4|5|6|7] [-ON] [other options]
-aprosdb query
(D) % spaln -Q[4|5|6|7] [-ON] [other options]
prosdb.faa query
In the last case, prosdb.faa will be internally formatted, and the formatted results will be discarded after the end of execution.
Only a subset of queries may be examined if query is replaced with 'query (from to)', where 'from' and 'to' are the first and last entry numbers in query to be examined.
To run spaln on multiple CPUs, for example, the following commands may be used and the results may be summarized with sortgrcd, as explained later.
(A) % spaln -Q7 -O12 -oxxxO1 -dxxxgnm 'query (1 1000)'
(B) % spaln -Q7 -O12 -oxxxO2 -dxxxgnm 'query (1001 2000)'
(C) % spaln -Q7 ...
However, the procedure will be simplified if a multi-thread operation is
used as follows:
(D) % spaln -Q7 -O12 -oxxx -dxxxgnm -t[N] query
Options: (default value)
- -CN: Use the genetic code specified by the "transl_table number" defined in NCBI transl_table (1).
- -HN: Output is suppressed if the alignment score is less than N. See also -pw. (35)
- -K[D|P|A]: Format either genomic DNA for sequence search with DNA (D) or Protein (P) queries or protein database sequences for search with a genomic segment (A). Use in combination with -W option below.
- -LS: Smith-Waterman-type local alignment. This option may prune out weakly matched terminal regions.
- -M[N]: Single or multiple output for each query
- No option (default): single locus
- No argument: Multiple loci up to the maximum number specified by the program (4 in the present implementation)
- N=1: Re-search the query region not aligned in the first trial. May be useful to detect chimera or fragmented genomic region
- N>1: Output multiple loci maximally up to N
- -ON: Select output format (4)
- N=0: Gff3 gene format
- N=1: Alignment
- N=2: Gff3 match format
- N=3: Bed format
- N=4: Exon-oriented format similar to the output of megablast -D 3
- N=5: Intron-oriented output
- N=6: Concatenated exon sequence
- N=7: Translated amino-acid sequence. Presently not very useful for cDNA queries because the entire exon rather than an ORF is translated
- N=8: Mapping (block) information only. Use with -Q4
- N=10: SAM format
- N=12: Output the same information as -O4 in the
binary format. If -oOutput is set, three files named
Output.grd, Output.erd, and Output.qrd will be created. Otherwise,
query.grd, query.erd, and query.qrd will be created.
- -QN: Select algorithm (3)
- 0<=N<=3: Genomic segment in the fasta format given by the first argument vs. query given by the second argument. One may skip the formatting step described above if only this mode of operation is used.
- 4<=N<=7: Genome mapping and alignment. The genomic sequence must be formatted beforehand.
- N=0,4: DP procedure without HSP search. Considerably slow
- N=1,2,3,5,6,7: Recursive HSP search up to the level of (N % 4)
- -RS: Read block index table from file S.
If omitted, the xxxgnm.bkn, xxxgnm.bkp,
or prosdb.bka file will be read depending on the type of query. The
appropriate file is searched for in the current directory, the directory
specified by the env variable 'ALN_DBS', and the
'seqdb' directory specified at the compile time in this order.
- -SN: Orientation of the DNA query sequence (0)
- N=0: The
orientation is inferred from the phrases (e.g. 5' end) in the header
line of each entry within a fasta file. If no information is available,
both orientations are examined, and the result with the better score is
reported.
- N=1: Forward orientation only
- N=2: Reverse-complement orientation only
- N=3: Examine both orientations
- -Txxx: Specify the species. For genome vs. DNA comparison, -yS flag should also be set in combination with this option. xxx corresponds to the subdirectory in the table directory.
- -VN: Minimum space to induce Hirschberg's algorithm (16M)
- -WS: Write block index table to file S.
- -g: gzipped output used in combination with -W or -O12 option.
- -i[a|p]: Input mode with -Q[0..3].
- -ia: Alternative mode; a genomic segment of an odd numbered entry in the input file is aligned with the query of the following entry.
- -ip: Parallel mode; the i-th entry in the file specified by the first argument is aligned with the i-th entry in the file specified by the second argument.
- default: The genomic segment specified by the first argument is aligned with each entry in the file specified by the second argument.
- -oS: Destination of output file name (stdout)
- -pa: Terminal polyA or polyT sequence is not trimmed.
- -pi: Mark exon-intron junctions by color in the alignment (-O1).
- -pq: Suppress warning messages sent to stderr.
- -pw: Report result even if alignment score is below threshold value.
- -px: Suppress self-comparisons in the execution mode (C) or (D).
- -xBS: Bit pattern of seeds used for HSP search at level 1
- -xbS: Bit pattern of seeds used for HSP search at level 3
- -uN: Gap-extension penalty (3, 2, 2)
- -vN: Gap-opening penalty (8, 6, 9)
- -yaN: Dinucleotide pairs at the ends of an intron (0)
- N=0: Accept only the canonical pairs (GT..AG,GC..AG,AT..AC)
- N=1: accept also AT..AN
- N=2: allow up to one mismatch from GT..AG
- N=3: accept any pairs. An omission of N implies N = 3
- -yiN: Intron penalty (11, 8, 11)
- -yjN: Incline of long gap penalty (0.6)
- -ykN: Flex point where the incline of gap penalty changes (7)
- -ylN: Double affine gap penalty if N=3; otherwise affine gap penalty
- -ymN: Score for a nucleotide match (2, 2)
- -ynN: Penalty for nucleotide mismatch (6, 2)
- -yoN: Penalty for an in-frame termination codon (100)
- -ypN: PAM level used in the alignment (third) phase (150)
- -yqN: PAM level used in the second phase (50)
- -yxN: Penalty for a frame shift (100)
- -yyN: Relative contribution of splicing signal (8)
- -yzN: Relative contribution of coding potential (2)
- -yAN: Relative contribution of the translational initiation or termination signal (8)
- -yBN: Relative contribution of branch point signal (0)
- -yEN: Minimum exon length (2)
- -yIS: Intron distribution parameters
- -yJN: Relative contribution of the bonus given
to a conserved intron position
- -yLN: Minimum intron length (30, 30)
- -yS: For
cDNA queries, use species-specific exon-intron boundary signals. For
protein queries, invoke the 'salvage' procedure to examine all blocks
with positive scores.
- -ySN: N specifies the percentile
contribution of the species-specific splice signal. The other part is
derived from the universal signal given to the dinucleotides at the ends
of an intron. An omission of N implies N = 100.
- -yX: For a DNA query, this option sets parameter
values for cross-species comparison. The actual values are given as the
second number in the parentheses of the above lines. -yS100 is also
automatically invoked. Conversely, this option specifies an intra-species
mode for a protein query, whereby -yS30 is invoked.
- -yYN: Relative contribution of length-dependent part of intron penalty (8)
- -yZN: Relative contribution of oligomer composition within an intron (0)
- Sortgrcd
Sortgrcd is used to recover the output of spaln with -O12 option, to apply some filtering, and also to rearrange the output of multiple spaln runs. It is invoked by:
% sortgrcd [options] X.grd [Y.grd ...] or
% sortgrcd [options] X.grd.gz [Y.grd.gz ...]
- Options:
- -CN: Minimum cover rate = % nucleotides in predicted exons / length of query (x 3 if query is protein) (0-100)
- -FN: Filter level (N=0: no; N=1: mild; N=2: medium; N=3: stringent)
- -IN: Minimum sequence identity (0-100)
- -HN: Minimum alignment score (35)
- -ON: Output mode. N=0 or 3-7: same as -ON of spaln; N=15: -O5 format for unique introns (default: N=4)
- -SC: Sort chromosomes/contigs in the order of C=a: alphabetical, b: abundance, c: appearance in genome database, r: reverse order for minus strand
- -VN: Internal memory size used for core sort. If the
data size is greater than N, the sorting procedure will be
done in pieces.
- -mN: Maximum number of mismatches within 20 bp from the nearest exon-intron boundary
- -nN: Maximum number of non-canonical (other than GT..AG, GC..AG, AT..AC) intron ends
- -uN: Maximum number of unpaired (gap) sites within 20 bp from the nearest exon-intron boundary
- By default, no filter listed above is applied.
- When the output of spaln is separated into several files, the combined results are subjected to the sorting. Although *.grd files are assigned as the argument, there must be corresponding *.erd and *.qrd files in the same directory.
- In the default output format, the gene structure corresponding to each transcript is delimited by a line starting with '@', whereas each gene locus is delimited by a line starting with '!' [4]. Two transcripts belong to the same locus if their corresponding genomic regions overlap by at least one nucleotide on the same strand.
- With -O0 option, the outputs follow the instruction of Gff3 (http://www.sequenceontology.org/gff3.shtml) where a gene locus is defined as described above.
- To experience the flow of procedures with the samples in seqdb, type in the following series of commands after moving to seqdb.
% make dictdisc.cf
% make dictdisc.faa
% make dictdisc_g.gf
% ./makeidx.pl -inp dictdisc_g.gf
% make dictdisc.srd
% make dictdisc.spn
Added/modified in Ver. 2.3.2a (2018-09-03):
- From this version, compressed query fasta file(s) may be accepted.
- The new option of spaln '-g' directly generates compressed output(s) when used in combination with -W or -O12 option.
- makeidx.pl and makblk.pl have been modified to accord with gzipped genome/database fasta files.
- A small bug in makdbs.c has been fixed.
Added/modified in Ver. 2.3.2 (2018-07-26):
- From this version, input fasta files (X.mfa, X.gf, or X.faa), formatted data files (X.seq, X.bka, X.bkn, and X.bkp) for spaln, and X.grd, X.erd, and X.qrd for sortgrcd may be gzipped if USE_ZLIB mode is activated upon compilation. The compressed file name must end with .gz (e.g. X.mfa.gz). Note: other data files (X.ent, X.grp, X.idx, and X.odr) must not be compressed.
- A serious bug concerning with multiple queries has been fixed. This has considerably improved mapping sensitivity particularly when -M option is set under single thread operation.
- Fixation of several smaller bugs and fine tuning of codes further enhanced mapping sensitivity and specificity particularly for short protein queries.
- -ON option of sortgrcd has been extended to support -O3 (BED), -O6 (cDNA = concatenated exons), and -O7 (translated amino acid sequence).
Added/modified in Ver. 2.3.1 (2017-06-22):
- The data structure of a block file (X.bkn, X.bkp, X.bka) has been simplified to reduce its size. According to this modification, upper compatibility of database is maintained, but earlier versions of spaln cannot read block files formatted with this version of spaln
- The internal representation of splicing junctions has been modified, which has led to considerable reduction in code size and memory requirement, and slight reduction in execution time.
- The algorithm for limiting the range of blocks to be passed to alignment phases has been modified. This modification will facilitate detection of the first or the last exon that is separated from the other exons by more than one block size.
- The default value for nucleotide mismatch penalty with -yX option has been restored to that of earlier versions.
Added/modified in Ver. 2.3.0 (2017-02-15):
- The limitation on the total number of residues in a database sequence less than 2**32 has been removed. To realize this update, the database format has been slightly changed. Spaln ver.2.3 can read database sequence formatted by a previous version of spaln, but the opposite is not true. In some cases, reformatting the database would result in better performance, even if the data size is less than 2**32.
- Several 'new' non-standard genetic codes are supported, following the update of NCBI's transl_table.
- Makeidx.pl has been updated. Formatting options such as -XCN may be transferred to spaln through this command. The default size of k-mers with -p option has been slightly modified.
- Miscellaneous small modifications have been done to facilitate better maintenance.
Added/modified in Ver. 2.2.2a (2016-11-19):
- In the mapping mode, an approximate genic range is now inffered from the query, if its gene structure is provided in the extended fasta format. Examples of the extended fasta format will be obtained by running spaln with -O6 or -O7 option. This modification relies on the observations that orthorogous introns tend to have similar lengths, and is proven to be effective when the first or a few orfollowing introns are so longer that the 5' end of the gene is hard to be located by the previous version of spaln.
- The expression of genomic site numbers in an alignment is corrected.
- Excessive new line after sequence identifier is suppressed, when the header line of the query has no comment.
Added/modified in Ver. 2.2.2 (2016-05-06):
- Although implemented earlier, Spaln now formally supports rapid sequence similarity search against a protein sequence database with a set of protein sequence queries. Unlike blastp and some other programs for local sequence similarity search, spaln calculates (semi-)global alignments. Currently, p-values are not estimated, as spaln intends to find only a small number of strong similarities (to use the found sequence as a template for spliced alignment, when the query is a genomic segment).
- Specification of species-specific parameter set is made easier. An eight character genus-species identifier (e.g. mus_musc) or genus name (e.g. Mus) is now allowed as the argument to the -T option for specifying the corresponding parameter set (Tetrapod in this example). To activate this facility, new "gnm2tab" file must be installed in the "table" directory.
- A rare problem that mapping routine becomes much slow down due to internal repeats in query sequences have been fixed.
- Estimation of maximum gene length in a genome from the total genome size has been modified. Remember that you must set the expected maximum gene length by -GXn option (n is the expected maximum gene length) at the formatting phase or at each run time if your "genome" is not full size.
- Intron length distribution is modelled by a combination of up to three (formally two) Frechet distributions.
Added/modified in Ver. 2.2.1a-e (2016-02-22):
- Surplus lines in seqdb/makblk.pl have been removed
- configure script has been modified.
- A few modifications have been made to accord with Mac OS X.
- Several bugs in minor run modes have been fixed.
- The problem of skipping the first sequence line in some input fasta files has been fixed.
Added/modified in Ver. 2.2.1 (2015-12-02):
- A few bugs have been fixed. In particular, output in SAM format has been corrected.
- For mapping amino acid sequences, the numbers of matches and mismatches are displayed correctly.
Added/modified in Ver. 2.2.0 (2015-08-14):
- One-time run of database search has been enabled. In this mode, you need not format database sequences beforehand. However, the formatted results are not retained.
- A few primitive bugs have been fixed. In particular, the problem with long header line has been fixed. Output formats with -O3 (BED format) and -O10 (SAM format) are also rectified.
Added/modified in Ver. 2.1.4 (2015-01-30):
- Three-byte addressing has been abolished. Accordingly, data files formatted with an older version of spaln can be used by the current version, but the opposite will result in a failure.
- A few primitive bugs have been fixed.
Added/modified in Ver. 2.1.3 (2014-12-15):
- -CN option is added to deal with non-standard genetic codes, where N denotes the "transl_table number" defined in NCBI transl_table. For example, -C6 option enables proper translation of Tetrahymena or Paramecium genes.
- A failure to identify the 3' splice junction followed by a long gap has been amended.
- A few bugs with -O1 option have been fixed.
Added/modified in Ver. 2.1.2 (2013-10-17):
- A few bugs in spaln and sortgrcd have been fixed.
- Inching toward more c++ style coding.
Added/modified in Ver. 2.1.1 (2013-7-1):
- The bugs in makdbs have been fixed.
- The placements of exon-intron boundaries are modified when spaln is run under -ya[N] (N = 2 or 3) options with protein queries.
Added/modified in Ver. 2.1.0 (2013-5-15):
From this version, the formats of the database files and output of -O12 option of spaln have been changed. These modifications are intended to eliminate the limitations on the lengths of identifiers of both genomic and query sequences (formally up to 20 letters). To accord with these modifications, the associated programs, makdbs and sortgrcd, have been updated. Despite these modifications in the file formats, the user interface of each program is unchanged. spaln_2.1 can read database sequences formatted by an older version, provided that the length limitations on the identifiers are satisfied. However, sortgrcd_2 can no longer process outputs from an older version of spaln. From Version 2.1.3, spaln can deals with non-standard genetic codes.
The major revision number of spaln has been updated
from 1 to 2. The new version incorporates additional features
for intron recognition in addition to other revisions. The major points of revision are as follows.
- The codes have been extensively rewritten in C++, so that they are no longer compiled by a C compiler.
- A heuristic routine is added after the HSP search to reduce the number of calls of the restricted DP routine.
- -t[N] option specifies the number of CPUs involved in a multi-thread operation. If N is omitted, all available CPUs are used.
- A branch point signal and an intron propensity based on its oligomer compositions are incorporated in the scoring system.
- The splice junction signals are composed of two terms: (1) universal signal that depends on the dinucleotide pair at the ends of an intron and (2) species-specific signal that depends on the sequence around each splice junction. The relative contributions of these two terms can be freely adjusted by -ySN option.
- The intron penalty is now automatically adjusted depending on the
values of other parameters.
- Species specific parameters are available for 61 divergent eukaryotes. Those parameters are stored in each subdirectory in the 'table' directory.
- A set of benchmark data named "SPAliBASE" is ready for download.
- The limitations on the length of the sequence id of both genomic and query sequences are removed. According to this modification, makdbs and sortgrcd programs are updated.
- The default order of output from sortgrcd has been changed.
- A few bugs under - M option are fixed.
Added/modified in Ver. 2.0.6 (2012-01-23):
- The limitation on the length of the sequence id (formally max. 20 letters) has been removed. Note that the length of entry id in a database file is still limited up to 14 letters. This limitation will be removed in the next release.
- -XsN option is added to allow for overlapping k-mer seeds at the block search phase, where N (1<=N<=k) indicates the distance between adjacent seeds. The sensitivity of block search will be improved with a small N, which is particularly beneficial for mapping of short queries. Compared with the default setting (N=k, i.e. tiling seeds), however, the memory consumption for the k-mer table will be increased by the factor of k/N.
- Spaln now supports mapping of paired-end reads. Use -ip option for paired input files (matching 5?f and 3?f reads must appear in the same order) or -ia option for a single input file (matching 5?f and 3?f reads must appear alternatively).
- The performance of the GvsA mode (query = genomic segment, database = amino acid sequences) has been improved.
- Memory leak errors have been fixed (20130226)
Added/modified in Ver. 2.0.4 (2011-9-10, 2011-10-11, 2011-11-17)
- A bug when the database file is constructed from multiple input sequence files has been fixed. Makefile and makblk.pl in 'seqdb' directory are also modified to accord for multiple input files.
- The output with the -O3 option is changed from Psl-like to Bed format.
- The default parameters used in polyA detection have been
modified to reduce the chance of removing genome-encoded
polyA-like sequences.
- The procedure to collect related HSPs within and adjacent significant
blocks has been simplified.
- The procedure to assemble HSPs has been improved to reduce the chance
of chimera formation among tandemly repeated paralogous genes.
- A bug concerning with very long genes has been fixed.
Added/modified in Ver. 1.4.5 (2010-04-23, 2010-05-08, 2010-08-04)
- Graphical output is supplemented in the Web interface.
- -U option is added to spaln. With this option, alignment is computed without splicing. It may be useful when genomic fragment(s) are mapped/aligned with whole genome.
- Several small bugs in sortgrcd have been fixed.
- Occasional core dump of spaln with -M option has been remedied.
- A problem of makdbs command when the character '|' is contained in a second or later word in the header line of fasta file was fixed.
Added/modified in Ver. 1.4.4 (2009-12-01, 2010-01-25, 2010-02-6, 2010-02-23)
- -SN option is added to spaln. When N = 3 (default) both orientations
of query sequence(s) are examined. When N = 1 and N = 2, only forward and reverse
direction is examined, respectively. When N = 0, the orientation to be examined
depends on the annotation (header line of fasta format) of the query sequence.
The list of phrases that indicate the orientation is given in the file
"StrandPhrase" in the table/spalnXX/seqdb. If no phrase in this list is found in the
annotation, both orientations are examined.
- -yaN option of spaln has been modified. When N = 0 (default), only
canonical intron boundary sequences (GT..AG, GC..AG, AT..AC) are allowed.
When N = 1, the third consensus is relaxed as AT..AN. When N = 2, one mismatch
from GT..AG is allowed in addition to the relaxed consensus of AT..AN.
When N = 3, any sequence is allowed as a boundary.
- Many bugs of sortgrcd have been fixed.
- -O15 option of sortgrcd outputs information about unique introns merged
from the mappings of several transcripts.
- Unnecessary spaces in Gff3 format have been removed from outputs of
both spaln and sortgrcd.
Added/modified in Ver. 1.4.3c (2009-08-21)
- Some incompatibilities between documents and programs have been rectified.
- Species-specific parameters in table/spalnXX/seqdb have been updated.
Added/modified in Ver. 1.4.3b (2009-07-10) (2009-07-24)
- Sortgrcd has been modified so that files with may chromosome/contig entries should be handled.
- Spaln's write mode now accepts plural arguments each corresponding to a piece of the whole genome (e.g. each chromosomal sequence). This modification has made it possible for spaln to format whole genomic sequence on a platform that cannot handle a large file (e.g. > 2GB). For example, the human genome is composed of 24 chromosomes. The total genome size exceeds 2 Gb, while the maximum chromosome size is less than 300 Mb. Even though a system fails to read the file of whole genome at a time, it can support spaln to process chromosomal sequences sequentially. In the example bellow, "my_genome" is the genome name, and chr1, chr2, ... chrn are chromosomal sequence files.
% makdbs -nmy_genome -KD chr1 chr2 ... chrn
% spaln -Wmy_genome.bkn -Xk12 -Xb54272 -XG2802688 -Xg64 -KD chr1 chr2 ... chrn
Then, you can use spaln in a usual way:
% spaln -Q7 -dmy_genome cDNAs
- A few bugs are fixed to prevent core dumps under some conditions.
Added/modified in Ver. 1.4.2 (2009-05-16) (2009-06-17)
- Bugs of sortgrcd has been fixed. sortgrcd now supports -O5 (intron-oriented format) output.
- Several bugs that were conditionally problematic have also been fixed.
Added/modified in Ver. 1.4.1 (2009-02-16) (2009-03-16)
- A combination of a protein sequence database and a genomic segment(s) is now supported. The protein sequence database must be formatted beforehand. At the run time, a set of ORFs in the genomic segment are translated and rapidly searched against the database sequences, and then the best-hit sequence is used as the template of the spliced alignment between the relevant portion of the genomic segment and the protein sequence retrieved.
- The format of .bkn and .bkp files is slightly changed. This has introduced some incompatibility between formatted files and software; spaln Ver.1.4 can use files formatted by spaln Ver.1.3, but the inverse would make an error.
- The output form now includes Gff3 (gene or match form) and Psl-like format.
- Sortgrcd also supports Gff3 gene format in addition to the native format.
- The default intron-penalty function has been modified from the well-shaped to a 'generic' function derived from intron-length distributions of several species.
- Many small modifications have been made to improve the stability of the software. In particular, multiple outputs of the same locus have been suppressed when -M option is set.
Added/modified in Ver. 1.3.2. (2008-09-12) (2008-09-25) (2008-10-21)
- Divide by zero error with makblk.pl when genome size is less than 1Mb has been fixed.
- Missing genomic fragment name in -Q[0-3] mode has been recovered.
- A few bugs have been fixed to avoid core dumps due to incompatibility between HSP coordinates and sequence ends.
- Known exon boundary information can be incorporated into the objective function when a cDNA, as well as a protein, is used as a template. To do so, simply add the -yBN option, where N is the bonus given to each matched location of exon-exon boundary.
- Compile time errors on some systems have been fixed.
Added/modified in Ver. 1.3.1. (2008-08-21)
- The portability has been improved for spaln to run on 64-bit machines.
- The codes are cleaned up so that less number of warnings should be output at the compilation time.
- 'make install' command is modified so that species-specific parameter files should be copied to the working table/spalnXX/seqdb directory.
- The test procedure described at the end of this document now includes the case of protein sequence queries.
Added/modified in Ver. 1.3.0
- Protein sequences can now be used as queries. If properly formatted, users do not need to be conscious of the difference in the type of queries at the run time.
- Bugs with -O12 option was fixed to regain normal output.
- The best amino acid substitution matrices are now chosen in different phases of execution. In this connection, the format of mdm files has been changed. Accordingly, the revised makmdm command must be run before the use of the new version of spaln.
- Combination of relevant blocks has been modified to improve sensitivity.
- The 'Salvage' procedure was added to find homologues that are missed by the formal criteria.
Source
Binaries
Data
Copyright (c) 2007-2018 Osamu Gotoh all rights reserved