Reconciliator 2.0 -- The tool for merging assemblies

1. Installation.

Download reconciliator2.tgz from www.genome.umd.edu and untar it:

    tar xvzf reconciliator2.tgz

Reconciliator uses the Nucmer software for sequence alignment.  Nucmer is part of the Mummer package, which can be obtained here: http://mummer.sourceforge.net/ .  Please make sure that the nucmer and show-coords programs from the Mummer package are installed and added to the PATH.
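
A quick way to confirm the PATH requirement before running anything is sketched below.  This is only an illustration and not part of Reconciliator; the helper name missing_tools is hypothetical, and the only assumption is that the two Mummer programs are installed under the names nucmer and show-coords, as stated above.

```python
import shutil

# Program names as given in the installation notes above.
REQUIRED_TOOLS = ("nucmer", "show-coords")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("nucmer and show-coords are both on PATH")
```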

2. Preparing the assemblies for reconciliation.

The assemblies to be reconciled must be converted into the Sanger/WashU format.  Each assembly must have the following files:

1. contigs.bases: fasta file for contig bases
2. contigs.quals: fasta file for contig quality scores
3. supercontigs: structure of supercontigs (scaffolds) -- see below
4. reads.placed: provides the locations of reads which were placed in the assembly -- see below
5. libraries.txt: provides the re-estimation of library size distributions -- see below
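
Because every assembly directory needs the same five files, a small check can catch a missing one early.  The sketch below is an illustration only: the function name missing_assembly_files is hypothetical, and the demonstration runs on a temporary directory rather than a real assembly.

```python
import os
import tempfile

# File names taken from the list above.
REQUIRED_FILES = (
    "contigs.bases",
    "contigs.quals",
    "supercontigs",
    "reads.placed",
    "libraries.txt",
)


def missing_assembly_files(assembly_dir):
    """Return the required files that are absent from assembly_dir."""
    return [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(assembly_dir, name))
    ]


if __name__ == "__main__":
    # Demonstration on a throwaway directory holding only two of the files.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("contigs.bases", "supercontigs"):
            open(os.path.join(demo_dir, name), "w").close()
        print(missing_assembly_files(demo_dir))
```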

Descriptions of the files:

1 and 2. contigs.bases and contigs.quals: fasta files with bases and quality scores for contigs.  All contigs must be oriented forward, in correspondence with the supercontigs file (see next).

3.  supercontigs: structure of supercontigs (scaffolds)

This file contains a description of the supercontigs (also called scaffolds). They are ordered lists of contigs, with approximately known gaps between them.  By assumption, all contigs in a supercontig are oriented in the same direction (forward). The file consists of lines, each starting with a keyword, which is supercontig, contig, or gap:

    (a) supercontig line format:
    supercontig [supercontig name]

    (b) contig line format:
    contig [contig name] [contig size]

    (c) gap line format:
    gap [gap length] [gap length standard deviation, optional, put * if not available]

    * contig name is consistent with the naming in contigs.bases and contigs.quals
    * gap length is the estimated gap length between contigs
      (negative if overlap predicted)
    * gap length standard deviation is the estimated standard deviation of the gap length value (desired)

    Standard deviation can be replaced by * if unknown.

Example:

supercontig s1
contig c1 5500
gap 200 * * 2
contig c7 2000
gap 2235 * * 5
contig c3 182882

supercontig s2
contig c2 2828
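
The three line types described above can be read with a short parser.  The sketch below is an illustration under stated assumptions, not the code Reconciliator itself uses: the function names are hypothetical, the contig size is assumed to sit on the same line as the contig name (as in format (b)), and any fields after the gap standard deviation are ignored.

```python
def parse_supercontigs(lines):
    """Map each supercontig name to its ordered list of entries.

    Entries are ("contig", name, size) or ("gap", length, stddev_or_None).
    """
    scaffolds = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # blank lines separate supercontigs
        keyword = fields[0]
        if keyword == "supercontig":
            current = scaffolds.setdefault(fields[1], [])
        elif current is None:
            raise ValueError("entry before any supercontig line: " + raw.strip())
        elif keyword == "contig":
            current.append(("contig", fields[1], int(fields[2])))
        elif keyword == "gap":
            stddev = None
            if len(fields) > 2 and fields[2] != "*":
                stddev = float(fields[2])
            current.append(("gap", int(fields[1]), stddev))
        else:
            raise ValueError("unknown keyword: " + keyword)
    return scaffolds


def scaffold_span(entries):
    """Sum contig sizes and gap lengths (a negative gap shortens the span)."""
    return sum(e[2] if e[0] == "contig" else e[1] for e in entries)


if __name__ == "__main__":
    sample = [
        "supercontig s1",
        "contig c1 5500",
        "gap 200 *",
        "contig c7 2000",
        "",
        "supercontig s2",
        "contig c2 2828",
    ]
    for name, entries in parse_supercontigs(sample).items():
        print(name, scaffold_span(entries))
```

With this sample, s1 spans 5500 + 200 + 2000 = 7700 positions and s2 spans 2828.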

4.  reads.placed: provides the locations of reads which were placed in the assembly

 This is a file with one line per read placed in the assembly.  Each line has white-space-separated fields, as follows:

   (a) NCBI ti number for read (or *, if none known)
   (b) read name
   (c) start of trimmed read on original read
   (d) number of bases in trimmed read
   (e) orientation on contig (0 = forward, 1 = reverse)
   (f) contig name
   (g) supercontig name
   (h) approximate start of trimmed read on contig
   (i) approximate start of trimmed read on supercontig.

 For c, h, and i, the first position is always 1 (not 0).  For h, the start of a read on a contig is always the smallest position on the contig which the read covers, regardless of its orientation.  This applies to i as well.  For i, positions on supercontigs are measured so as to take account of gaps. 
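
The nine fields and the 1-based convention can be illustrated with a short parser.  This is a sketch only: the names PlacedRead, parse_read_line and contig_interval are hypothetical, the sample line is made up, and the end position is derived here as start + length - 1, which inherits the "approximate" nature of field (h).

```python
from collections import namedtuple

# One field per column (a) through (i) of reads.placed.
PlacedRead = namedtuple(
    "PlacedRead",
    "ti name trim_start trim_length orientation "
    "contig supercontig contig_start supercontig_start",
)


def parse_read_line(line):
    """Parse one white-space-separated line of reads.placed."""
    f = line.split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, got %d" % len(f))
    return PlacedRead(
        f[0], f[1], int(f[2]), int(f[3]), int(f[4]),
        f[5], f[6], int(f[7]), int(f[8]),
    )


def contig_interval(read):
    """Approximate 1-based, inclusive interval the read covers on its contig.

    Field (h) is the smallest covered position regardless of orientation,
    so the interval is computed the same way for forward and reverse reads.
    """
    return read.contig_start, read.contig_start + read.trim_length - 1


if __name__ == "__main__":
    # Made-up example line: no ti number, reverse-oriented read on contig c7.
    read = parse_read_line("* read_0001 12 650 1 c7 s1 101 5801")
    print(read.name, read.contig, contig_interval(read))
```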

5.  libraries.txt: provides the re-estimation of library size distributions

This is a file with one line per library.  Each line has white-space separated fields as follows:

    (a) Library ID: any ID compatible with mateinfo.txt
    (b) estimated mean insert length
    (c) estimated standard deviation of insert length

In addition to these files for each assembly, one file containing mate pair data (mateinfo.txt) must be provided.  This is a file with one line per mate pair.  Each line has white-space separated fields as follows:

    (a) forward_read_name (compatible with the second field in reads.placed)
    (b) reverse_read_name (compatible with the second field in reads.placed)
    (c) library_ID (compatible with the first field in libraries.txt)
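
The cross-references between the mate pair file, reads.placed and libraries.txt can be checked with a few lines of code.  The sketch below is an illustration with hypothetical function names and made-up sample data.  It reports library IDs that do not appear in libraries.txt, and separately lists reads that are not in reads.placed; the latter is reported for information only, since this document does not say that every mated read must be placed.

```python
def library_ids(library_lines):
    """Collect the first field (library ID) of each libraries.txt line."""
    return {line.split()[0] for line in library_lines if line.split()}


def placed_read_names(placed_lines):
    """Collect the second field (read name) of each reads.placed line."""
    return {line.split()[1] for line in placed_lines if line.split()}


def check_mate_pairs(mate_lines, placed_names, known_libraries):
    """Return (unknown_libraries, unplaced_reads) as two sets."""
    unknown_libraries = set()
    unplaced_reads = set()
    for line in mate_lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError("expected 3 fields: " + line.strip())
        forward, reverse, library = fields
        if library not in known_libraries:
            unknown_libraries.add(library)
        for name in (forward, reverse):
            if name not in placed_names:
                unplaced_reads.add(name)
    return unknown_libraries, unplaced_reads


if __name__ == "__main__":
    # Made-up sample data in the three formats described above.
    libraries = ["lib1 3000 300", "lib2 40000 4000"]
    placed = [
        "* r1.f 1 600 0 c1 s1 10 10",
        "* r1.r 1 600 1 c1 s1 2500 2500",
    ]
    mates = ["r1.f r1.r lib1", "r2.f r2.r lib9"]
    print(check_mate_pairs(mates, placed_read_names(placed), library_ids(libraries)))
```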

A.  read.pairs: where a read came from on its sequencing clone.  A convenience file -- independent of assembly and not strictly necessary, but useful.

3. Running the reconciliation.

There are three steps involved in running the reconciliation:

    a. prepare the reference and supplementary assemblies in the format described above and place them into two separate directories
    b. prepare the mateinfo.txt file (see above)
    c. cd to the directory where the reconciliation software is installed and run:

        ./reconcile2.sh /path_to_reference_assembly/ /path_to_supplementary_assembly/ /path_to_mateinfo/mateinfo.txt num_bases_per_alignment_batch  max_cpus_to_use

        arguments:
        /path_to_reference_assembly/  -- full path to the directory where reads.placed and other files for the reference assembly are located
        /path_to_supplementary_assembly/  -- full path to the directory where reads.placed and other files for the supplementary assembly are located
        /path_to_mateinfo/mateinfo.txt -- location of the mateinfo file
        num_bases_per_alignment_batch -- integer specifying the number of bases to align in a single process; the right value depends on your memory size.  Aligning 100000000 bases takes about 4 GB of memory, and memory use scales proportionally.
        max_cpus_to_use -- integer specifying the number of CPUs to use.  For example, if your computer has 4 CPUs and 32 GB of memory, then you can run 4 alignment processes with 200000000 bases in each.  In general I recommend setting num_bases_per_alignment_batch to 50000000 and max_cpus_to_use to the number of compute cores in your computer.
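
The memory arithmetic above can be written out as a small calculation.  The sketch below is an illustration only: it takes the figure of about 4 GB per 100000000 bases and the proportional scaling from the text, the function names are hypothetical, and the paths in the demonstration are placeholders.  It gives the upper bound that fits in memory, not the recommended default of 50000000.

```python
# From the text above: aligning 100000000 bases takes about 4 GB of memory.
BASES_PER_4_GB = 100_000_000


def max_bases_per_batch(total_memory_gb, cpus):
    """Largest batch size such that `cpus` parallel alignments fit in memory."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return int(total_memory_gb * BASES_PER_4_GB // (4 * cpus))


def build_command(reference_dir, supplementary_dir, mateinfo, bases, cpus):
    """Assemble the reconcile2.sh argument list in the order shown above."""
    return [
        "./reconcile2.sh",
        reference_dir,
        supplementary_dir,
        mateinfo,
        str(bases),
        str(cpus),
    ]


if __name__ == "__main__":
    cpus = 4
    bases = max_bases_per_batch(32, cpus)  # 32 GB shared by 4 processes
    print(bases)
    # Placeholder paths, as in the usage line above.
    print(" ".join(build_command(
        "/path_to_reference_assembly/",
        "/path_to_supplementary_assembly/",
        "/path_to_mateinfo/mateinfo.txt",
        bases,
        cpus,
    )))
```

For 32 GB and 4 CPUs this gives 200000000 bases per batch, matching the example in the argument description.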

4. Interpreting the results.

The reconciliation routine will create a reconciled_assembly directory.  That directory will normally contain the following files:

all_successful_joins.txt -- this file has the listing of all joins that were made
assembly_reconciliation_stats.txt -- this file has the before and after reconciliation stats
ce_problem_points_remaining.txt -- this file has the listing of all statistically detected compressions/expansions in the reconciled assembly; there are lots of false positives here
compressions_remaining.txt -- compressions that were fixed
contig_name_correspondence.txt -- maps the names of the reference contigs to the names of the reconciled contigs
contigs.bases -- reconciled contig sequences
contigs.quals -- reconciled contig quals
gaps.txt -- listing of gaps "after" contigs.  If the gap is zero, then the contig is the end of a scaffold
reads.placed -- reconciled assembly read placements -- reads placed uniquely
reads.placed.with_surrogates -- full reconciled assembly read placements -- some reads are placed multiply
supercontigs -- reconciled assembly supercontigs