Reconciliator 2.0 -- The tool for merging assemblies
1. Installation.
Download reconciliator2.tgz from www.genome.umd.edu. Untar the file using the command:
tar xvzf reconciliator2.tgz
Reconciliator uses Nucmer software for sequence alignment. Nucmer is part of the MUMmer package and can be obtained here: http://mummer.sourceforge.net/ Please make sure that the nucmer and show-coords programs from the MUMmer package are installed and added to the PATH.
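A quick way to confirm that both programs are visible on the PATH before starting a run is sketched below. This is only an illustrative check and not part of Reconciliator; the function name missing_tools is hypothetical, and the only tool names assumed are the two mentioned above.

```python
import shutil

# Program names taken from the installation notes above.
REQUIRED_TOOLS = ("nucmer", "show-coords")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("nucmer and show-coords were both found on PATH")
```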
2. Preparing the assemblies for reconciliation.
The assemblies to be reconciled must be converted into the Sanger/WashU format. Each assembly must have the following files:
1. contigs.bases: fasta file for contig bases
2. contigs.quals: fasta file for contig quality scores
3. supercontigs: structure of supercontigs (scaffolds) -- see below
4. reads.placed: provides the locations of reads which were placed in the assembly -- see below
5. libraries.txt: provides the re-estimation of library size distributions -- see below
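Before running the reconciliation it can help to verify that an assembly directory contains all five files. The sketch below is a hypothetical helper (missing_assembly_files is not part of Reconciliator) and assumes the files carry exactly the names listed above.

```python
from pathlib import Path

# File names as listed above; each assembly directory is assumed to hold all five.
REQUIRED_FILES = (
    "contigs.bases",
    "contigs.quals",
    "supercontigs",
    "reads.placed",
    "libraries.txt",
)


def missing_assembly_files(assembly_dir):
    """Return the required file names that are absent from `assembly_dir`."""
    base = Path(assembly_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling missing_assembly_files on the reference directory and again on the supplementary directory should return an empty list for each when the directories are complete.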
Descriptions of the files:
1 and 2. contigs.bases and contigs.quals: fasta files with bases and quality scores for contigs. All contigs must be oriented forward, in correspondence with the supercontigs file (see next).
3. supercontigs: structure of supercontigs (scaffolds)
This file contains a description of the supercontigs (also called scaffolds). They are ordered lists of contigs, with approximately known gaps between them. By assumption, all contigs in a supercontig are oriented in the same direction (forward). The file consists of lines, each starting with a keyword, which is supercontig, contig, or gap:
(a) supercontig line format:
supercontig [supercontig name]
(b) contig line format:
contig [contig name] [contig size]
(c) gap line format:
gap [gap length] [gap length standard deviation, optional, put * if not available]
* contig name is consistent with the naming in contigs.bases and contigs.quals
* gap length is the estimated gap length between contigs (negative if overlap predicted)
* gap length standard deviation: estimated standard deviation for gap length value (desired). Standard deviation can be replaced by * if unknown.
Example:
supercontig s1
contig c1 5500
gap 200 * * 2
contig c7 2000
gap 2235 * * 5
contig c3 182882
supercontig s2
contig c2 2828
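To make the layout concrete, here is a minimal sketch of a reader for this file. It is an illustration rather than the parser Reconciliator itself uses. The function names are hypothetical, and the handling of any fields after the standard deviation on a gap line (the example above carries extra fields) is an assumption: they are simply ignored.

```python
# The example from the text above, one line per list element.
EXAMPLE = [
    "supercontig s1",
    "contig c1 5500",
    "gap 200 * * 2",
    "contig c7 2000",
    "gap 2235 * * 5",
    "contig c3 182882",
    "supercontig s2",
    "contig c2 2828",
]


def parse_supercontigs(lines):
    """Map each supercontig name to its ordered list of entries.

    Entries are ("contig", name, size) or ("gap", length, stddev_or_None).
    Assumption: fields after the gap standard deviation are ignored.
    """
    scaffolds = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        keyword = fields[0]
        if keyword == "supercontig":
            current = scaffolds.setdefault(fields[1], [])
        elif current is None:
            raise ValueError("line before first supercontig: " + raw)
        elif keyword == "contig":
            current.append(("contig", fields[1], int(fields[2])))
        elif keyword == "gap":
            stddev = None
            if len(fields) > 2 and fields[2] != "*":
                stddev = float(fields[2])
            current.append(("gap", int(fields[1]), stddev))
        else:
            raise ValueError("unknown keyword: " + raw)
    return scaffolds


def scaffold_span(entries):
    """Sum contig sizes and gap lengths (negative gaps shorten the span)."""
    return sum(e[2] if e[0] == "contig" else e[1] for e in entries)
```

With the example input, parse_supercontigs(EXAMPLE) yields two supercontigs, s1 with three contigs and two gaps, and s2 with a single contig.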
4. reads.placed: provides the locations of reads which were placed in the assembly
This is a file with one line per read placed in the assembly. Each line has white-space-separated fields, as follows:
(a) NCBI ti number for read (or *, if none known)
(b) read name
(c) start of trimmed read on original read
(d) number of bases in trimmed read
(e) orientation on contig (0 = forward, 1 = reverse)
(f) contig name
(g) supercontig name
(h) approximate start of trimmed read on contig
(i) approximate start of trimmed read on supercontig.
For c, h, and i, the first position is always 1 (not 0). For h, the start of a read on a contig is always the smallest position on the contig which the read covers, regardless of its orientation. This applies to i as well. For i, positions on supercontigs are measured so as to take account of gaps.
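The nine fields and the 1-based convention can be illustrated with a short sketch. The names PlacedRead, parse_placed_read and contig_interval are hypothetical, and the sample line in the usage note is invented for illustration; it is not taken from a real assembly.

```python
from typing import NamedTuple, Optional


class PlacedRead(NamedTuple):
    ti: Optional[str]        # (a) NCBI ti number, None when the field is *
    name: str                # (b) read name
    trim_start: int          # (c) start of trimmed read on original read, 1-based
    trimmed_length: int      # (d) number of bases in trimmed read
    reverse: bool            # (e) False for 0 (forward), True for 1 (reverse)
    contig: str              # (f) contig name
    supercontig: str         # (g) supercontig name
    contig_start: int        # (h) approximate start on contig, 1-based
    supercontig_start: int   # (i) approximate start on supercontig, 1-based


def parse_placed_read(line):
    """Parse one white-space-separated reads.placed line into a PlacedRead."""
    f = line.split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, got %d" % len(f))
    if f[4] not in ("0", "1"):
        raise ValueError("orientation must be 0 or 1, got " + f[4])
    return PlacedRead(
        ti=None if f[0] == "*" else f[0],
        name=f[1],
        trim_start=int(f[2]),
        trimmed_length=int(f[3]),
        reverse=(f[4] == "1"),
        contig=f[5],
        supercontig=f[6],
        contig_start=int(f[7]),
        supercontig_start=int(f[8]),
    )


def contig_interval(read):
    """Approximate 1-based inclusive (start, end) covered on the contig.

    The start is the smallest covered position regardless of orientation,
    so the end is start + length - 1 for forward and reverse reads alike.
    """
    return read.contig_start, read.contig_start + read.trimmed_length - 1
```

For the made-up line "* read001 15 700 1 c7 s1 120 5820", the read is reverse-oriented, has no ti number, and covers roughly positions 120 through 819 of contig c7.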
5. libraries.txt : provides the re-estimation of library size distributions
This is a file with one line per library. Each line has white-space separated fields as follows:
(a) Library ID: any ID compatible with mateinfo.txt
(b) estimated mean insert length
(c) estimated standard deviation of insert length
In addition to these files for each assembly, one file containing mate pair data (mateinfo.txt) must be provided. This is a file with one line per mate pair. Each line has white-space separated fields as follows:
(a) forward_read_name (compatible with the second field in reads.placed)
(b) reverse_read_name (compatible with the second field in reads.placed)
(c) library_ID (compatible with the first field in libraries.txt)
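The compatibility requirements between the mate pair file, reads.placed and libraries.txt can be checked with a few lines of code. The sketch below is a hypothetical consistency check, not something shipped with Reconciliator. It assumes that a read named in the mate pair file but absent from reads.placed is worth reporting but not necessarily an error, since not every read is placed in an assembly.

```python
def check_mateinfo(mate_lines, placed_read_names, library_ids):
    """Cross-check mate pair lines against read names and library IDs.

    mate_lines        -- iterable of lines: forward_read reverse_read library_ID
    placed_read_names -- set of names from the second field of reads.placed
    library_ids       -- set of IDs from the first field of libraries.txt
    """
    report = {
        "malformed_lines": [],       # line numbers without exactly 3 fields
        "unknown_libraries": set(),  # library IDs missing from libraries.txt
        "unplaced_reads": set(),     # read names missing from reads.placed
    }
    for lineno, line in enumerate(mate_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            report["malformed_lines"].append(lineno)
            continue
        forward, reverse, library = fields
        for read in (forward, reverse):
            if read not in placed_read_names:
                report["unplaced_reads"].add(read)
        if library not in library_ids:
            report["unknown_libraries"].add(library)
    return report
```

An empty "unknown_libraries" set and an empty "malformed_lines" list suggest the three files use consistent naming.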
A. read.pairs: where a read came from on its sequencing clone. A convenience file -- independent of assembly and strictly not necessary but useful.
3. Running the reconciliation.
There are three steps involved in running the reconciliation:
a. prepare the reference and supplementary assemblies in the format described above and place them into two separate directories
b. prepare the mateinfo.txt file (see above)
c. cd to the directory where the reconciliation software is installed and run:
./reconcile2.sh /path_to_reference_assembly/ /path_to_supplementary_assembly/ /path_to_mateinfo/mateinfo.txt num_bases_per_alignment_batch max_cpus_to_use
arguments:
/path_to_reference_assembly/ -- full path to the directory where reads.placed and other files for the reference assembly are located
/path_to_supplementary_assembly/ -- full path to the directory where reads.placed and other files for the supplementary assembly are located
/path_to_mateinfo/mateinfo.txt -- location of the mateinfo file
num_bases_per_alignment_batch -- integer specifying the number of bases to align in a single process. The right value depends on your memory size. Aligning 100000000 bases takes about 4gb of memory, and memory use scales proportionally, so if your computer has 4 cpus and 32gb of memory, then you can run 4 alignment processes with 200000000 bases in each.
max_cpus_to_use -- integer specifying the number of cpus to use.
In general I recommend setting num_bases_per_alignment_batch to 50000000 and max_cpus_to_use to the number of compute cores in your computer.
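The memory arithmetic above can be written out as a small calculation. This is a rough sketch based only on the figure quoted in the text (about 4gb per 100000000 bases, scaling proportionally); the function names are hypothetical and actual memory use may differ.

```python
# From the text: aligning 100000000 bases takes about 4gb of memory.
GB_PER_100M_BASES = 4
BASES_PER_UNIT = 100_000_000


def max_bases_per_batch(total_memory_gb, processes):
    """Largest batch size (in bases) such that `processes` batches fit in memory."""
    return int(total_memory_gb * BASES_PER_UNIT // (GB_PER_100M_BASES * processes))


def estimated_memory_gb(bases_per_batch, processes):
    """Approximate total memory (gb) used by `processes` concurrent batches."""
    return bases_per_batch * GB_PER_100M_BASES * processes / BASES_PER_UNIT


if __name__ == "__main__":
    # The example from the text: 4 cpus and 32gb of memory.
    print(max_bases_per_batch(32, 4))
    # The recommended batch size of 50000000 on the same 4 cpus.
    print(estimated_memory_gb(50_000_000, 4))
```

With 32gb and 4 processes this reproduces the 200000000 bases per batch given in the text, while the recommended setting of 50000000 bases on 4 cores needs about 8gb in total under the same assumption.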
4. Interpreting the results.
The reconciliation routine will create a reconciled_assembly directory. That directory will normally contain the following files:
all_successful_joins.txt -- listing of all joins that were made
assembly_reconciliation_stats.txt -- before and after reconciliation stats
ce_problem_points_remaining.txt -- listing of all statistically detected compressions/expansions in the reconciled assembly; there are lots of false positives here
compressions_remaining.txt -- compressions that were fixed
contig_name_correspondence.txt -- maps the names of the reference contigs to the names of the reconciled contigs
contigs.bases -- reconciled contig sequences
contigs.quals -- reconciled contig quals
gaps.txt -- listing of gaps "after" contigs. If the gap is zero then the contig is the end of a scaffold
reads.placed -- reconciled assembly read placements -- reads placed uniquely
reads.placed.with_surrogates -- full reconciled assembly read placements -- some reads are placed multiply
supercontigs -- reconciled assembly supercontigs