VDJSeq-Solver - Manual

Download and Install

Download and decompress the archive VDJSeq-Solver.tar.gz from the 'Releases' Section into a local folder.

tar xzf VDJSeq-Solver.tar.gz

VDJSeq-Solver is built on top of the following programs:

TopHat (1.0.14)
BEDTools (2.16.2)
Bowtie (0.12.8)
Blast (2.2.25)
SHRiMP (2.2.2)
Samtools (0.1.18)

You need to install the aforementioned software to run VDJSeq-Solver. If you don't want to install these programs on your machine, you can instead use the TOOLS directory included in the VDJSeq-Solver.tar.gz distribution by adding the programs it contains to your PATH. Newer versions of all these programs have since been released, but because VDJSeq-Solver was developed against the versions listed above, it is recommended to use those versions.
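A minimal sketch of the PATH approach follows. It assumes the archive was unpacked into the current directory and that the executables sit directly inside TOOLS; if the directory uses one subfolder per tool, each subfolder would have to be added instead. VDJSEQ_HOME and check_tools are hypothetical names introduced here for illustration, not part of the distribution.

```shell
# Assumption: VDJSEQ_HOME is a hypothetical variable naming the unpacked folder.
VDJSEQ_HOME="${VDJSEQ_HOME:-$PWD/VDJSeq-Solver}"

# Prepend the bundled tools so they take precedence over other installed versions.
PATH="$VDJSEQ_HOME/TOOLS:$PATH"
export PATH

# check_tools NAME...: print where each program resolves on PATH.
# Returns 1 if at least one of the named programs cannot be found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s -> %s\n' "$tool" "$(command -v "$tool")"
        else
            printf '%s -> NOT FOUND\n' "$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_tools tophat bowtie samtools afterwards shows which copy of each program would be used. The executable names for BEDTools, Blast and SHRiMP depend on how those packages were built, so pass whichever names your copies provide.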

Requirements

The entire VDJSeq-Solver pipeline is written in Java and Perl, so make sure both are installed on your machine. The recommended versions are 1.6.0_16 and 5.8.8, respectively.

You can check the installed version of each by typing:

perl -version
java -version

Datasets

In order to run VDJSeq-Solver you need to provide the Bowtie index of the human genome, the VJ and D gene sequences in FASTA format, and a .bed file containing the locations of the heavy chain locus genes. These files are included in the VDJSeq-Solver distribution in the directory called REFERENCE.

In particular:

REFERENCE/hg19/hg19.*: pre-built Bowtie index of the human genome hg19;
REFERENCE/VDJ_genes/VJ_genes.fasta: FASTA file of the heavy chain V and J genes;
REFERENCE/VDJ_genes/D_genes.fasta: FASTA file of the heavy chain D genes;
REFERENCE/IG.bed: BED file with the locations of the heavy chain genes on the human hg19 genome.

You can also build the genome index yourself by following the bowtie-build instructions; the other files can be downloaded from online repositories.
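As a rough sketch of building the index by hand, the helper below wraps the standard bowtie-build call (reference FASTA first, output prefix second). The function name build_hg19_index and the input file name hg19.fa are assumptions made for illustration; the default output prefix mirrors the REFERENCE/hg19/hg19.* layout listed above.

```shell
# build_hg19_index FASTA [OUT_PREFIX]
# Hypothetical helper: builds a Bowtie index from a genome FASTA file.
build_hg19_index() {
    fasta=$1
    prefix=${2:-REFERENCE/hg19/hg19}

    # Fail early with a clear message instead of letting bowtie-build error out.
    if [ ! -r "$fasta" ]; then
        echo "cannot read genome FASTA: $fasta" >&2
        return 1
    fi

    # Create the target directory, then write the index files as OUT_PREFIX.*
    mkdir -p "$(dirname "$prefix")"
    bowtie-build "$fasta" "$prefix"
}
```

For example, build_hg19_index hg19.fa run from the VDJSeq-Solver main folder would write the index files under REFERENCE/hg19.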

We also provide a test dataset of paired-end 100 bp RNA-Seq reads from MCL (Mantle Cell Lymphoma), located in the VDJSeq-Solver main folder (sequence_1.fastq, sequence_2.fastq). It is a subset of the reads from a real sample, reduced because the complete dataset had not yet been published. The main clone recombination still involves the IGHV2-70*11, IGHD1-26*01 and IGHJ4*02 alleles. The reconstructed recombined sequence (with g3 set to 4, as detailed below) is:

CGGGTCACCTTGAGGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAG CACTAGTGGAATGTGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACGCATTGATTGGGATGATGATAAAT ACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAGACCAGGTGGTCCTTACAATGACCAACATGGACCCT GTGGACACAGCCACGTATTACTGTGCACGGTCTGCAGGTGGGAGCTACCCCGATGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTC CTCAG

Run

To run VDJSeq-Solver, launch the VDJSeq-Solver_run.pl script from inside the VDJSeq-Solver directory with the following options:

-g1: First mate file in .fastq format
-g2: Second mate file in .fastq format
-g3: Number of VJ recombinations for which the D analysis has to be performed
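For the bundled test reads, a run could look like the sketch below. It assumes each option is passed as a flag followed by its value separated by a space, which this manual does not spell out, so check the comments in the script if the call is rejected. run_vdjseq is a hypothetical wrapper, not part of the distribution; it only validates its arguments before handing them to the script.

```shell
# run_vdjseq MATE1 MATE2 N_VJ
# Hypothetical wrapper around VDJSeq-Solver_run.pl; call it from inside
# the VDJSeq-Solver directory so the script can find its resources.
run_vdjseq() {
    mate1=$1
    mate2=$2
    n_vj=$3

    # Both mate files must be readable.
    for f in "$mate1" "$mate2"; do
        [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    done

    # -g3 is a count, so accept digits only.
    case $n_vj in
        ''|*[!0-9]*) echo "-g3 must be a whole number" >&2; return 2 ;;
    esac

    # Assumption: options are given as "-flag value" pairs.
    perl VDJSeq-Solver_run.pl -g1 "$mate1" -g2 "$mate2" -g3 "$n_vj"
}
```

With the test dataset this becomes run_vdjseq sequence_1.fastq sequence_2.fastq 4, where 4 is the g3 value quoted for the reconstructed test sequence.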

An out directory is created automatically in the main folder. It contains:

tophat_run directory: contains two files, overlapping_reads.bam and overlapping_reads.sam, holding the reads that TopHat correctly mapped against the heavy chain gene locus, and a subdirectory containing the TopHat log files;
bowtie_run directory: contains a file called unmapped_def.fa with the reads that Bowtie could not map;
blast_run directory: contains blast_result.txt, with the reads mapped by Blast on the V and J genes, and blast_reference.txt, with the reads mapped by Blast on the virtual VDJ reference sequences created;
cardinality_global_sorted.txt file: a .txt file reporting all the VJ recombinations identified in the sample under study, each with its score (number of encompassing VJ reads). For instance:

num_supporting_reads J_allele-0-V_allele

log directory: contains the .log files produced by the VJ recombination search;
1 2 3 ... X directories: the g3 parameter specifies the number of VJ recombinations for which the D allele search is performed and the virtual reference is built. The program automatically creates that many subdirectories, one for the D allele analysis of each chosen recombination (details are given as comments in the script). The more encompassing reads support a VJ recombination, the higher the progressive number assigned to its directory. Choose the value of g3 according to the kind of analysis you want to perform. Each of these subdirectories contains a log directory for the D search and reference creation steps, plus two files:

1) output_classify.txt file: a .txt file reporting every D allele that the SHRiMP alignment identifies as recombined with the VJ pair, each with its score (number of mates supporting it). For instance:

dir_name

num_supporting_VJ_reads VJ_couple

D_allele_name num_supporting_D_mates

If a D allele can't be detected from the reads, the last line of the file will report this information.

2) output_def.txt file: a sorted .txt file containing the number of mates supporting each recombination sequence created by VDJSeq-Solver that involves the highest-scoring D allele for the specific VJ recombination. For instance:

num_supporting_VDJ_sequence_reads VDJ_alleles VDJ_sequence
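To inspect the VJ ranking before choosing g3, the line layout of cardinality_global_sorted.txt described above can be split into its parts. The sketch below assumes the read count and the allele pair are separated by whitespace and that the literal string -0- appears only as the separator between the J and V alleles. list_vj_pairs is a hypothetical helper, not part of the distribution.

```shell
# list_vj_pairs FILE
# Print "reads<TAB>V_allele<TAB>J_allele" for each line of FILE that follows
# the "num_supporting_reads J_allele-0-V_allele" layout.
list_vj_pairs() {
    awk '
        NF >= 2 {
            cut = index($2, "-0-")      # position of the J/V separator
            if (cut == 0) next          # skip lines that do not match the layout
            j = substr($2, 1, cut - 1)  # text before the separator: J allele
            v = substr($2, cut + 3)     # text after the separator: V allele
            printf "%s\t%s\t%s\n", $1, v, j
        }
    ' "$1"
}
```

Running list_vj_pairs on the cardinality_global_sorted.txt file in the out directory gives one tab-separated row per recombination, which is easier to sort or filter than the original joined allele string.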