ASplice Software

ASplice is a scalable and memory-efficient algorithm for de novo transcriptome assembly. It constructs splicing graphs from RNA-Seq libraries, and these graphs capture alternative splicing information. For each node in each splicing graph, the expression level with respect to each library is reported as the number of reads per kilobase of node per million reads (RPKM).
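
For orientation, the RPKM unit can be sketched as below. This assumes the conventional RPKM definition; how ASplice assigns reads to a node is not described in this file, and the function and parameter names are illustrative only.

```python
def rpkm(reads_on_node: int, node_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of node per million reads in the library.

    Conventional definition (an assumption here, not taken from the ASplice source):
        RPKM = reads_on_node / (node_length_bp / 1e3) / (total_reads / 1e6)
    """
    if node_length_bp <= 0 or total_reads <= 0:
        raise ValueError("node length and library size must be positive")
    return reads_on_node * 1e9 / (node_length_bp * total_reads)


if __name__ == "__main__":
    # Hypothetical numbers: 50 reads on a 500 bp node, 2 million reads in the library.
    print(rpkm(50, 500, 2_000_000))
```

Because the value is normalized by both node length and library size, it can be compared across nodes and across libraries of different depth.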

The source code consists of four files: extractfa.c, extractfa2.c, asplicek.c, and asplicel.c. They can be compiled with the following commands:

  gcc -O3 -o extractfa extractfa.c
  gcc -O3 -o extractfa2 extractfa2.c
  gcc -O3 -o asplicek asplicek.c
  gcc -O3 -o asplicel asplicel.c

Steps

  1. For each RNA-Seq library, extract and trim each read based on the quality score with the following command (arguments within brackets are optional, but an earlier argument cannot be skipped if a later argument is supplied):

     When number_of_parts is larger than one, the output filenames are of the form output_fasta_filename.part, where part is the part number ranging from 1 to number_of_parts. The default values are: number_of_parts = 1, minimum_sequence_length = 1, and minimum_quality_score = 15.

  2. Perform the parallel stage by running the following command on each part (or each library):

     Each part (or each library) can be run on one processor, on the same or on different computing nodes. The range of k is 13 ≤ k ≤ 45 with k odd. Omit part_number if number_of_parts is equal to one. For a library with multiple parts, asplicek should be run number_of_parts times, with part_number ranging from 1 to number_of_parts. The output filename is of the form filename.k if part_number is not supplied, and filename.part.k otherwise. If asplicek has been run before with k' < k, the previous result with the largest k' will be reused by the iterative algorithm. When multiple assemblies with different values of k are needed, asplicek should be run multiple times with the desired values of k in increasing order.

  3. Perform the sequential stage to obtain the assembly by running the following command:

     Each assembly with a given setting of k and the k-mer coverage cutoff c can be run on one processor, on the same or on different computing nodes. The range of k is 13 ≤ k ≤ 45 with k odd. For a library with multiple parts, only one filename is needed, without the part number; the program automatically determines the number of parts for each library. One RPKM value will be reported for each library (not each part) within each node.

  4. Perform downstream analysis using output_file.
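
The constraints on k and the output naming scheme in the steps above can be checked with a small helper before launching jobs. This is a minimal sketch: the function names are hypothetical and are not part of ASplice, and it only encodes the rules stated in the steps (13 ≤ k ≤ 45, k odd, increasing k order, and the filename.k and filename.part.k forms).

```python
from typing import Iterable, List


def check_k(k: int) -> None:
    """Raise ValueError unless k is odd and within 13..45, as the steps require."""
    if not 13 <= k <= 45 or k % 2 == 0:
        raise ValueError(f"k must be odd with 13 <= k <= 45, got {k}")


def k_schedule(ks: Iterable[int]) -> List[int]:
    """Return the requested k values in increasing order, validating each one."""
    ordered = sorted(set(ks))
    for k in ordered:
        check_k(k)
    return ordered


def asplicek_outputs(filename: str, k: int, number_of_parts: int = 1) -> List[str]:
    """Expected asplicek output names for one library at a given k."""
    check_k(k)
    if number_of_parts == 1:
        # part_number is omitted, so the name has the form filename.k
        return [f"{filename}.{k}"]
    # one run per part, each named filename.part.k
    return [f"{filename}.{part}.{k}" for part in range(1, number_of_parts + 1)]


if __name__ == "__main__":
    # "lib1.fa" is a made-up library name used only for illustration.
    for k in k_schedule([31, 21, 25]):
        print(k, asplicek_outputs("lib1.fa", k, number_of_parts=3))
```

Listing the expected files this way makes it easy to confirm that every part has finished the parallel stage before the sequential stage is started.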

Output

The assembly is represented in an annotated fasta format, in which each splicing graph is given as a collection of nodes, with connecting normal and paired edges and RPKM values for each library embedded within the name of each node. Different splicing graphs are separated by blank lines.
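
For downstream analysis, the blank-line separation makes it straightforward to group nodes by splicing graph. The reader below is a sketch under the assumptions that each node name occupies one line starting with ">" and that a sequence may span several lines; read_splicing_graphs is a hypothetical helper, not part of ASplice.

```python
from typing import Iterable, Iterator, List, Tuple

Node = Tuple[str, str]  # (node name without the leading ">", sequence)


def read_splicing_graphs(lines: Iterable[str]) -> Iterator[List[Node]]:
    """Yield one list of (name, sequence) pairs per splicing graph."""
    graph: List[Node] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new node begins; store the previous node of this graph, if any.
            if name is not None:
                graph.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
        else:
            # A blank line closes the current splicing graph.
            if name is not None:
                graph.append((name, "".join(chunks)))
                name, chunks = None, []
            if graph:
                yield graph
                graph = []
    # Flush the last node and graph if the file does not end with a blank line.
    if name is not None:
        graph.append((name, "".join(chunks)))
    if graph:
        yield graph


if __name__ == "__main__":
    import sys

    # Usage: python read_graphs.py output_file
    with open(sys.argv[1]) as handle:
        for index, nodes in enumerate(read_splicing_graphs(handle), start=1):
            print(f"graph {index}: {len(nodes)} nodes")
```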

Each node name is given as >NODE_u:v_1,v_2,...,v_p,(w_1),(w_2),...,(w_q), where u is the ID of the current node, u -> v_1, u -> v_2, ..., u -> v_p are normal edges, and u -> w_1, u -> w_2, ..., u -> w_q are paired edges (written in parentheses). The edge list is followed by one RPKM value for each library, listed in the same order as the library files.
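
The edge list in a node name can be separated into normal and paired edges as sketched below. Two assumptions are made that this file does not confirm: node IDs are non-negative integers, and the exact way the RPKM values are delimited after the edge list is unknown, so that tail is returned unparsed. If the RPKM values were written as bare comma-separated integers, they would be indistinguishable from normal edges here. parse_node_name is a hypothetical helper, not part of ASplice.

```python
import re
from typing import List, Tuple

_NAME = re.compile(r">?NODE_(\d+):")
# One edge token: "(w)" for a paired edge or "v" for a normal edge,
# terminated by a comma, by whitespace, or by the end of the name.
_EDGE = re.compile(r"(\(\d+\)|\d+)(?:,|(?=\s)|$)")


def parse_node_name(name: str) -> Tuple[int, List[int], List[int], str]:
    """Return (node_id, normal_edges, paired_edges, unparsed_tail)."""
    head = _NAME.match(name)
    if head is None:
        raise ValueError(f"not a node name: {name!r}")
    node_id = int(head.group(1))
    normal: List[int] = []
    paired: List[int] = []
    pos = head.end()
    while True:
        edge = _EDGE.match(name, pos)
        if edge is None:
            break
        token = edge.group(1)
        if token.startswith("("):
            paired.append(int(token[1:-1]))  # paired edge u -> w
        else:
            normal.append(int(token))        # normal edge u -> v
        pos = edge.end()
    # Whatever follows the edge list (the RPKM values) is left untouched.
    return node_id, normal, paired, name[pos:]


if __name__ == "__main__":
    # Made-up node name containing only the edge list.
    print(parse_node_name(">NODE_5:7,9,(12),(15)"))
```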

SNPs are reported within the sequences as IUPAC ambiguity codes, that is, as letters other than A, C, G, and T.
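
Reported SNP positions can be located by scanning a node sequence for ambiguity codes. The sketch below uses the standard IUPAC nucleotide code table; find_snps is a hypothetical helper, positions are 0-based, and whether ASplice ever emits N (or which subset of codes it uses) is not stated in this file.

```python
from typing import List, Tuple

# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_snps(sequence: str) -> List[Tuple[int, str, str]]:
    """Return (0-based position, code, possible bases) for each ambiguity code."""
    return [
        (position, base, IUPAC_AMBIGUITY[base])
        for position, base in enumerate(sequence.upper())
        if base in IUPAC_AMBIGUITY
    ]


if __name__ == "__main__":
    # Made-up sequence: R at position 2 stands for A or G.
    print(find_snps("ACRTTGYA"))
```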

Reference

Sze S.-H., Pimsler M.L., Tomberlin J.K., Jones C.D. and Tarone A.M. A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms.