UMD Overlapper description

1. UMD_Overlapper is a high-precision overlapper.

An “overlapper” takes a set of reads and quality data and determines which read pairs plausibly overlap; that is, it finds the pairs of reads containing subsequences similar enough that they may have come from overlapping parts of the genome. In particular, UMD_Overlapper is a “repeat discriminator”: it can frequently distinguish between reads lying in different copies of a repeat, even when the copies differ in only a small number of positions. It does this by looking at patterns of differences between collections of reads. It has been run on datasets ranging in size from BACs up to mammalian-sized genomes.

It is designed to be used as the first step in creating an assembly, and it will typically produce more sequence with fewer errors when used with

1) phrap (use phrapUMD, a version of phrap that incorporates our overlapper; see phrapUMD), or

2) the Celera Assembly program, or

3) the Baylor College of Medicine’s assembly program Atlas (use AtlasUMD, a version of Atlas that incorporates our overlapper and replaces phrap with phrapUMD).

The input to the UMD_Overlapper consists of (trimmed or untrimmed) read and quality data in either FASTA format or Celera’s FRG format. Reads do not have to be trimmed for quality or vector because UMD_Overlapper can perform these functions.

It outputs two sets of plausible overlaps: a set called “All” and a smaller set called “Reliable”.

If two reads come from overlapping parts of the genome then, with very high probability, the pair will appear in the collection “All”, unless they come from a very highly repetitive region. This collection will also include some spurious overlaps.

For example, if one end of a read A plausibly overlaps both reads B and C, but B and C clearly don’t overlap each other, then at least one of the overlaps A-B or A-C must be spurious. We call this combination A-B-C a fork.

To avoid deleting a correct overlap, both A-B and A-C are included in the set of “All” overlaps. In contrast, the set of “Reliable” overlaps will contain at most one of these overlaps, and quite likely neither.
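
The fork condition can be illustrated with a short Python sketch. This is only an illustration of the definition above, not the code UMD_Overlapper uses: the function name find_forks, the end labels, and the two input structures are assumptions made for the example.

```python
from itertools import combinations


def find_forks(end_overlaps, overlapping_pairs):
    """Return a list of (a, end, b, c) tuples, one per fork.

    end_overlaps: dict mapping (read, end) to the set of reads that
        plausibly overlap that end of the read (hypothetical layout).
    overlapping_pairs: set of frozenset({x, y}), one per plausible overlap.
    """
    forks = []
    for (read, end), partners in sorted(end_overlaps.items()):
        # Two reads that overlap the same end of `read` but not each other
        # form a fork: at least one of the two overlaps must be spurious.
        for b, c in combinations(sorted(partners), 2):
            if frozenset((b, c)) not in overlapping_pairs:
                forks.append((read, end, b, c))
    return forks


# A-B-C is a fork (B and C do not overlap); D-E-F is not (E and F overlap).
end_overlaps = {("A", "3'"): {"B", "C"}, ("D", "3'"): {"E", "F"}}
overlapping_pairs = {
    frozenset(pair)
    for pair in [("A", "B"), ("A", "C"), ("D", "E"), ("D", "F"), ("E", "F")]
}
forks = find_forks(end_overlaps, overlapping_pairs)
print(forks)  # [('A', "3'", 'B', 'C')]
```

In this toy data, both A-B and A-C would stay in “All”, while “Reliable” would keep at most one of them.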

Assembly programs usually use overlaps to put together “unitigs”, regions of the genome which can be assembled in a unique manner based on overlaps. They then use mate pair information to combine these unitigs into bigger structures. The set of reliable overlaps is designed to make the unitigs as large as possible without having misassemblies.

The current version of this software can be downloaded HERE.

2. Usage

To install, run
Install.perl
with no arguments.

To run, use
runUMDOverlapper.perl (flags) inputFilename(s) outputFilenameSpecifier

See below for a discussion of outputFilenameSpecifier.

To get help on running the script, use the -h flag.

The output consists of the following results:
a) Output read and quality data, in either FASTA format or Celera .frg file
format. By default the output data is corrected; to get uncorrected data,
use the flag '-use-uncleaned-reads'.

b) A set of overlaps. (*.overlaps)

c) A set of "reliable" overlaps (*.reliable.overlaps) which, when used with
phrapUMD, frequently results in a better assembly than the regular
overlaps (from b).


There are 3 possible combinations of formats for input and output files:
1) For input frg and output frg files, use command
runUMDOverlapper.perl (flags) input.frg output.frg
2) For input frg and output seq and qual files, use command
runUMDOverlapper.perl (flags) input.frg outputPrefix
3) For input seq and qual files and output seq and qual files, use command
runUMDOverlapper.perl (flags) inputSeq inputQual outputPrefix
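
The three combinations above can be captured in a small Python helper that assembles the argument list. This is a sketch for wrapper scripts, not part of the distribution: the function name build_command and its checks are assumptions, and only the argument order and the .frg suffix rule come from this document.

```python
def build_command(inputs, output, flags=()):
    """Assemble the argument list for one of the three documented cases.

    inputs: ["input.frg"] for frg input, or [seq_file, qual_file].
    output: a name ending in .frg for frg output, otherwise an output prefix.
    flags:  extra command-line flags, passed through unchanged.
    """
    inputs = list(inputs)
    frg_in = len(inputs) == 1 and inputs[0].endswith(".frg")
    seq_qual_in = len(inputs) == 2 and not any(
        name.endswith(".frg") for name in inputs
    )
    if not (frg_in or seq_qual_in):
        raise ValueError("expected one .frg file, or a seq file plus a qual file")
    if output.endswith(".frg") and not frg_in:
        # The document lists frg output only together with frg input.
        raise ValueError("frg output is only documented for frg input")
    return ["runUMDOverlapper.perl", *flags, *inputs, output]


case1 = build_command(["input.frg"], "output.frg")
case2 = build_command(["input.frg"], "outputPrefix")
case3 = build_command(["inputSeq", "inputQual"], "outputPrefix")
print(" ".join(case3))  # runUMDOverlapper.perl inputSeq inputQual outputPrefix
```

The returned list can be passed directly to subprocess.run if the script is on the PATH.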

3. Notes

The program looks for a filename ending in .frg to determine which type
of file is specified.

Currently, if the input file is an .frg file, the trim points specified
in the file are used to trim the reads. Later we will arrange the
pipeline so that vector and/or quality trim can be done by our
routines.

We (currently) only use our vector trimmer for case 3. Our vector
trimming routines are under development. The routines correctly trim
when there is enough vector sequence included in the read. If (most of)
the vector has been trimmed off, however, the routines are not reliable.
NOTE: In case 3 the program also trims for quality. By default it trims
at an error rate of 2%. The flag -trim-error-rate changes this error rate.
For example, the flag
-trim-error-rate .05
sets the error rate to 5%.
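
To give a feel for what an error-rate threshold means, here is a minimal Python sketch of one possible trimming rule. It is not the algorithm UMD_Overlapper uses, which this document does not describe. It assumes the quality values are Phred scores (error probability 10^(-Q/10)) and keeps the longest stretch of the read whose mean error probability does not exceed the threshold; the function names are made up for the example.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_by_error_rate(quals, max_error_rate=0.02):
    """Return (start, end), the longest half-open interval of the read whose
    mean per-base error probability is at most max_error_rate.

    Illustrative rule only; simple O(n^2) search over all intervals.
    """
    errors = [phred_to_error(q) for q in quals]
    prefix = [0.0]  # prefix[i] = sum of errors[:i]
    for e in errors:
        prefix.append(prefix[-1] + e)
    best = (0, 0)
    n = len(errors)
    for start in range(n):
        for end in range(start + 1, n + 1):
            length = end - start
            total = prefix[end] - prefix[start]
            if length > best[1] - best[0] and total <= max_error_rate * length:
                best = (start, end)
    return best


quals = [5, 8, 30, 35, 40, 40, 30, 6, 4]
strict = trim_by_error_rate(quals, 0.02)  # default 2% threshold
loose = trim_by_error_rate(quals, 0.05)   # as with -trim-error-rate .05
print(strict, loose)  # (2, 7) (1, 7)
```

Under this rule, raising the threshold from 2% to 5% keeps one more low-quality base at the start of the toy read.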