ASplice Software
ASplice is a scalable and memory-efficient algorithm for de novo
transcriptome assembly that constructs splicing graphs for RNA-Seq libraries,
which recover alternative splicing information. For each node in each splicing
graph, the expression level is reported as the number of reads per kilobase of
node per million reads (RPKM) with respect to each library.
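The RPKM definition above can be written out as a short calculation. This is a minimal sketch of the standard formula only: how ASplice counts the reads assigned to a node and which library total it uses are not stated here, and the function name and example numbers are hypothetical.

```python
def rpkm(reads_on_node: int, node_length_bp: int, total_reads_in_library: int) -> float:
    """Reads per kilobase of node per million reads in the library."""
    if node_length_bp <= 0 or total_reads_in_library <= 0:
        raise ValueError("node length and library size must be positive")
    kilobases = node_length_bp / 1_000
    millions_of_reads = total_reads_in_library / 1_000_000
    return reads_on_node / (kilobases * millions_of_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp node, library of 10 million reads.
# 500 / (2 kb * 10 million-read units) = 25.0
example_value = rpkm(500, 2_000, 10_000_000)
```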
The source code consists of four files: extractfa.c,
extractfa2.c, asplicek.c,
and asplicel.c. They can be compiled with the commands
"gcc -O3 -o extractfa extractfa.c", "gcc -O3 -o extractfa2 extractfa2.c",
"gcc -O3 -o asplicek asplicek.c", and "gcc -O3 -o asplicel asplicel.c".
Steps
- For each RNA-Seq library, extract and trim each read based on the quality
score with the following command (arguments within brackets are optional, but
an earlier argument cannot be skipped if a later argument is supplied):
- Single-end reads: extractfa input_fastq_filename output_fasta_filename [number_of_parts] [minimum_sequence_length] [minimum_quality_score]
- Paired-end reads: extractfa2 forward_fastq_filename reverse_fastq_filename output_fasta_filename [number_of_parts] [minimum_sequence_length] [minimum_quality_score]
When number_of_parts is larger than one, the output filenames are of the form
output_fasta_filename.part, where part is the part number ranging from 1 to
number_of_parts. The default values are: number_of_parts = 1,
minimum_sequence_length = 1, and minimum_quality_score = 15.
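The argument-ordering rule and the part naming scheme for this step can be sketched as a small helper that builds the command line. The function names are hypothetical, and the single-part case (output written to output_fasta_filename itself) is an assumption, since only the multi-part naming is described above.

```python
from typing import List, Optional


def extract_command(inputs: List[str], output_fasta: str,
                    number_of_parts: Optional[int] = None,
                    minimum_sequence_length: Optional[int] = None,
                    minimum_quality_score: Optional[int] = None) -> List[str]:
    """Build the argument list for extractfa (one input) or extractfa2 (two inputs)."""
    if len(inputs) == 1:
        program = "extractfa"
    elif len(inputs) == 2:
        program = "extractfa2"
    else:
        raise ValueError("expected one FASTQ file (single-end) or two (paired-end)")
    optional = [number_of_parts, minimum_sequence_length, minimum_quality_score]
    # Trailing optional arguments may be omitted ...
    while optional and optional[-1] is None:
        optional.pop()
    # ... but an earlier one cannot be skipped when a later one is supplied.
    if any(value is None for value in optional):
        raise ValueError("an earlier optional argument cannot be skipped")
    return [program, *inputs, output_fasta, *[str(value) for value in optional]]


def expected_outputs(output_fasta: str, number_of_parts: int = 1) -> List[str]:
    """List the FASTA files the extraction step should produce."""
    if number_of_parts == 1:
        return [output_fasta]  # assumption: no part suffix for a single part
    return [f"{output_fasta}.{part}" for part in range(1, number_of_parts + 1)]
```

For example, extract_command(["r1.fastq", "r2.fastq"], "lib.fa", 4, 25) selects extractfa2 with four parts and a minimum sequence length of 25, leaving the quality score at its default.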
- Perform the parallel stage by running the following command on each part
(or each library):
- asplicek k filename [part_number]
Each part (or each library) can be run on one processor on the same or different
computing nodes. The range of k is 13 ≤ k ≤ 45 with k odd. Omit
part_number if number_of_parts is equal to one. For a library with multiple
parts, asplicek should be run number_of_parts times with part_number ranging
from 1 to number_of_parts. The output filename is of the form filename.k if
part_number is not supplied, and filename.part.k otherwise. If asplicek has
been run before with k' < k, then the previous result with the largest k' will
be utilized by the iterative algorithm. When multiple assemblies with different
values of k are needed, asplicek should be run multiple times with the desired
values of k in increasing order.
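The rules for this stage (k odd with 13 ≤ k ≤ 45, one run per part, values of k in increasing order) can be sketched as a helper that lists the runs for one library. The function names are hypothetical, and filename is assumed to be the FASTA name given to the extraction step, without a part number.

```python
from typing import Iterable, List, Optional


def check_k(k: int) -> None:
    """Reject values of k outside the documented range."""
    if not 13 <= k <= 45 or k % 2 == 0:
        raise ValueError("k must be odd with 13 <= k <= 45")


def asplicek_commands(filename: str, number_of_parts: int,
                      k_values: Iterable[int]) -> List[List[str]]:
    """List the asplicek runs for one library, smallest k first."""
    commands = []
    for k in sorted(set(k_values)):
        check_k(k)
        if number_of_parts == 1:
            # part_number is omitted when there is only one part
            commands.append(["asplicek", str(k), filename])
        else:
            for part in range(1, number_of_parts + 1):
                commands.append(["asplicek", str(k), filename, str(part)])
    return commands


def asplicek_output(filename: str, k: int, part: Optional[int] = None) -> str:
    """Name of the file that one asplicek run writes."""
    return f"{filename}.{k}" if part is None else f"{filename}.{part}.{k}"
```

Runs that share the same k are independent across parts and can be dispatched to different processors. For a given part, runs with a smaller k should finish before a larger k is started so that the earlier result can be reused.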
- Perform the sequential stage to obtain the assembly by running the
following command:
- asplicel k c filename_1 ... filename_n > output_file
Each assembly with a given setting of k and the k-mer coverage cutoff c can be
run on one processor on the same or different computing nodes. The range of k
is 13 ≤ k ≤ 45 with k odd. For a library with multiple parts, only one
filename is needed without the part number. The program will automatically
determine the number of parts for each library. One RPKM value will be reported
for each library (not each part) within each node.
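A minimal sketch of launching the sequential stage from a script, with standard output redirected to the assembly file as in the command above. The helper names and the example filenames are hypothetical, and asplicel is assumed to be on the PATH (otherwise pass a path such as ./asplicel as the first element).

```python
import subprocess
from typing import List, Sequence, Union


def asplicel_command(k: int, c: Union[int, float],
                     library_filenames: Sequence[str]) -> List[str]:
    """Build the asplicel argument list: one filename per library, no part numbers."""
    if not 13 <= k <= 45 or k % 2 == 0:
        raise ValueError("k must be odd with 13 <= k <= 45")
    if not library_filenames:
        raise ValueError("at least one library filename is required")
    # c is passed through as given; which numeric forms asplicel accepts is not checked here
    return ["asplicel", str(k), str(c), *library_filenames]


def run_to_file(argv: Sequence[str], output_path: str) -> None:
    """Run a command and write its standard output to output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(list(argv), stdout=handle, check=True)


# Example (not executed here):
# run_to_file(asplicel_command(25, 2, ["libA.fa", "libB.fa"]), "assembly_k25_c2.fa")
```

Because each (k, c) setting is an independent run, several such calls can be started on different processors, each writing to its own output file.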
- Perform downstream analysis using output_file.
Output
The assembly is represented in an annotated fasta format, in which
each splicing graph is given as a collection of nodes, with
connecting normal and paired edges and RPKM values for each library embedded
within the name of each node. Different splicing graphs are separated by blank
lines.
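For downstream analysis, the blank-line separation makes it straightforward to group nodes by splicing graph. The reader below is a sketch, not part of ASplice: the function name is hypothetical, and it accepts sequences on one line or wrapped over several lines because the line layout is not specified here.

```python
from itertools import chain
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (node name without the leading ">", sequence)


def read_splicing_graphs(lines: Iterable[str]) -> Iterator[List[Record]]:
    """Yield one list of (name, sequence) records per splicing graph."""
    graph: List[Record] = []
    header: Optional[str] = None
    chunks: List[str] = []
    # A trailing blank line is appended so the last graph is flushed too.
    for raw in chain(lines, [""]):
        line = raw.strip()
        if line and not line.startswith(">"):
            chunks.append(line)  # sequence line of the current node
            continue
        # A new header or a blank line ends the node in progress.
        if header is not None:
            graph.append((header, "".join(chunks)))
        header, chunks = (line[1:], []) if line else (None, [])
        # A blank line also ends the current splicing graph.
        if not line and graph:
            yield graph
            graph = []
```

A typical use is `with open("output_file") as handle: graphs = list(read_splicing_graphs(handle))`, where output_file is the file written by asplicel.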
Each node name is given as >NODE_u:v_1,v_2,...,v_p,(w_1),(w_2),...,(w_q),
where u is the ID of the current node, u -> v_1, u -> v_2, ..., u -> v_p are
normal edges, and u -> w_1, u -> w_2, ..., u -> w_q are paired edges. The
edge list is followed by one RPKM value for each library, listed in the same
order as the library files.
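The node name format can be parsed along the following lines. This is a sketch under stated assumptions: the separator between the edge list and the RPKM values is not specified above and is assumed here to be whitespace, with the RPKM values themselves also whitespace-separated. The class and function names are hypothetical, and node IDs are kept as strings.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeHeader:
    node_id: str
    normal_edges: List[str] = field(default_factory=list)
    paired_edges: List[str] = field(default_factory=list)
    rpkm: List[float] = field(default_factory=list)


def parse_node_header(line: str) -> NodeHeader:
    """Parse '>NODE_u:v_1,...,v_p,(w_1),...,(w_q)' followed by RPKM values."""
    if not line.startswith(">NODE_"):
        raise ValueError("not a node header: " + line)
    name, _, rest = line[1:].rstrip().partition(":")
    header = NodeHeader(node_id=name[len("NODE_"):])
    # Assumption: whitespace separates the edge list from the RPKM values.
    fields = re.split(r"\s+", rest, maxsplit=1)
    for token in fields[0].split(","):
        if not token:
            continue  # tolerate an empty edge list or a trailing comma
        if token.startswith("(") and token.endswith(")"):
            header.paired_edges.append(token[1:-1])  # parenthesized: paired edge
        else:
            header.normal_edges.append(token)  # plain ID: normal edge
    if len(fields) > 1:
        header.rpkm = [float(value) for value in fields[1].split()]
    return header
```

If the actual output separates the RPKM values differently, only the two lines that split on whitespace need to change.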
SNPs are reported within the sequences as IUPAC ambiguity codes, that is,
letters other than A, C, G, and T.
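Candidate SNP positions can be located by scanning each node sequence for IUPAC ambiguity codes. The sketch below uses the standard IUPAC nucleotide table; the function name is hypothetical, positions are reported 1-based, and N is treated like any other non-ACGT letter, which may or may not match how ASplice uses it.

```python
from typing import List, Tuple

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def snp_sites(sequence: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, code, possible bases) for each ambiguity code."""
    return [
        (position, base, IUPAC_AMBIGUITY[base])
        for position, base in enumerate(sequence.upper(), start=1)
        if base in IUPAC_AMBIGUITY
    ]


# Example: R at position 4 (A or G) and Y at position 7 (C or T).
sites = snp_sites("ACGRTTYA")
```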
Reference
Sze S.-H., Pimsler M.L., Tomberlin J.K., Jones C.D. and Tarone A.M.
A scalable and memory-efficient algorithm for de novo transcriptome
assembly of non-model organisms.