The newest version is MaSuRCA 2.3.2, released September 19, 2014.
Versions 2.3.0 and 2.3.1 have severe bugs that render their assemblies unreliable.
Please rerun any assemblies made with those versions using version 2.3.2.

MaSuRCA assembler

MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).

Compiling

The MaSuRCA assembler is written in C++ and Perl. It is developed and tested on x86_64 Linux systems. It might work on other UNIX-like systems, but it is not well tested there. The following is required (all current major Linux distributions include this software, but it may need to be installed with the built-in package manager):

  • GNU C++ compiler g++ version 4.7 or higher.
  • GNU make.
  • Perl version 5.8 or higher.
  • Development files for the bz2 library (usually packaged as libbz2-dev or libbz2-devel).
  • Perl Statistics::Descriptive library.
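
The requirements above can be checked before building. The following is a minimal sketch and is not part of the MaSuRCA distribution: the helper names have_cmd and note_problem are hypothetical, and probing for the bz2 development files by preprocessing a one-line include of bzlib.h is an assumption made for illustration. It does not check the compiler version.

```shell
#!/bin/sh
# Hypothetical pre-build check; not shipped with MaSuRCA.

# have_cmd NAME: succeeds if NAME is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

problems=0
# note_problem MESSAGE: print the message and count it.
note_problem() {
    echo "problem: $1"
    problems=$((problems + 1))
}

# Required build tools.
for tool in g++ make perl; do
    have_cmd "$tool" || note_problem "$tool not found on PATH"
done

# Perl 5.8 or higher, plus the Statistics::Descriptive module.
if have_cmd perl; then
    perl -e 'require 5.008;' 2>/dev/null \
        || note_problem "perl is older than 5.8"
    perl -MStatistics::Descriptive -e 1 2>/dev/null \
        || note_problem "Perl module Statistics::Descriptive not installed"
fi

# bz2 development files: try to preprocess an include of bzlib.h.
if have_cmd g++; then
    echo '#include <bzlib.h>' | g++ -E -x c++ - >/dev/null 2>&1 \
        || note_problem "bzlib.h not found (install the bz2 development package)"
fi

echo "$problems problem(s) found"
```

Any reported problem would normally be resolved with the distribution's package manager before running the installation script.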

To install, download the latest source and run the installation script './install.sh'. To install in a different directory, say '/opt/MaSuRCA', pass the DEST environment variable, like this:

  DEST=/opt/MaSuRCA ./install.sh

A recent version of the compiler is available from the "Developer Toolset" on RedHat and Scientific Linux/CentOS. Pass the variables CC and CXX to the install script, pointing to your compiler:

  CC=/path/to/gcc47 CXX=/path/to/g++47 ./install.sh
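
Whether a newer compiler is needed can be checked before running the install script. The following is a minimal sketch, assuming the compiler accepts the GCC option -dumpversion. The helper name version_ge is hypothetical, and only the major and minor version numbers are compared.

```shell
#!/bin/sh
# Hypothetical helper; not part of MaSuRCA.

# version_ge A B: succeeds if dotted version A is at least B,
# comparing only the major and minor components (e.g. 4.8.5 vs 4.7).
version_ge() {
    echo "$1 $2" | awk '{
        split($1, a, "."); split($2, b, ".")
        if (a[1] + 0 != b[1] + 0) exit !(a[1] + 0 > b[1] + 0)
        exit !(a[2] + 0 >= b[2] + 0)
    }'
}

# Use the compiler named by CXX if it is set, otherwise plain g++.
cxx=${CXX:-g++}
if command -v "$cxx" >/dev/null 2>&1; then
    ver=$("$cxx" -dumpversion)
    if version_ge "$ver" 4.7; then
        echo "$cxx $ver is recent enough"
    else
        echo "$cxx $ver is too old; point CC and CXX at a newer compiler"
    fi
else
    echo "$cxx not found"
fi
```

If the default compiler is too old, the same check can be repeated with CXX set to the newer compiler before passing CC and CXX to the install script.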

Running the assembler

This is only a quick-start guide. Refer to the documentation for more details. The assembly is driven by a configuration file that specifies the location of the read files and some parameters. From this configuration, a shell script is generated that runs the actual assembler. The steps are as follows, assuming that the variable $MASURCA contains the directory where the code was compiled.

Generate a sample configuration file named 'configuration.txt':

  $MASURCA/bin/masurca -g configuration.txt

Then edit the configuration file and start the assembly:

  $MASURCA/bin/masurca configuration.txt
  ./assemble.sh
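
The same steps can be wrapped in a small script. The following is a minimal sketch: the function names are hypothetical, and it uses only the invocations given above (masurca -g, masurca with the configuration file, then ./assemble.sh). Each function takes the directory where the code was compiled as its argument.

```shell
#!/bin/sh
# Hypothetical wrapper around the quick-start steps; not part of MaSuRCA.

# masurca_bin DIR: print the path of the masurca script under DIR,
# or fail if it is missing or not executable.
masurca_bin() {
    if [ -x "$1/bin/masurca" ]; then
        echo "$1/bin/masurca"
    else
        echo "masurca not found under '$1'" >&2
        return 1
    fi
}

# Step 1: generate a sample configuration file to edit by hand.
generate_config() {
    bin=$(masurca_bin "$1") || return 1
    "$bin" -g configuration.txt
}

# Step 2: after editing, generate assemble.sh and run it.
run_assembly() {
    bin=$(masurca_bin "$1") || return 1
    "$bin" configuration.txt && ./assemble.sh
}
```

Typical use would be to call generate_config "$MASURCA", edit configuration.txt, and then call run_assembly "$MASURCA". Keeping the two steps separate leaves room for the manual edit in between.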

Contact

For any questions or comments, contact Aleksey Zimin or Guillaume Marçais.

MaSuRCA publication

The paper describing the assembler is published in the Oxford journal Bioinformatics. If you use MaSuRCA in your work, please cite:
Zimin, A. et al. The MaSuRCA genome Assembler. Bioinformatics (2013). doi:10.1093/bioinformatics/btt476

MaSuRCA assembled a 22Gb pine genome

Plant genomes, including conifer trees, are often very large and repetitive, hence difficult to assemble. The loblolly pine (Pinus taeda) has a genome of 22Gb, the largest genome ever assembled.

The genome group at UMD is part of the Pine Reference Sequence group and MaSuRCA assembled the loblolly pine genome.

Change log

Version 2.3.2 (Bug fix version)

  • Fixed bug in generated assemble.sh script

Version 2.3.0

  • Improved jumping library filter: more stable and better performing.
  • Newer version of QuorUM.

Version 2.2.2 (Bug fix version)

  • Added GC bias calculation and adjustment for computing the coverage and distinguishing between unique and repeat genome regions
  • Limit the number of short linking mates used in the assembly: their utility quickly diminishes as we use more, but the assembly run time increases
  • Improved technique to choose k-mer sizes for super-reads and for the jumping library filtering

Version 2.2.1 (Bug fix version)

  • Fix compilation errors on CentOS/RedHat.
  • Many bug fixes.
  • Experimental binary distribution for some platforms, available on the ftp site.

Version 2.2.0

  • The error correction with Quorum is much faster and slightly improved.
  • The jumping libraries are filtered using variable k-mer sizes.
  • The gap filling procedure is faster.
  • Parameters for scaffolding have been fine-tuned.

Version 2.1.0

  • Introduced an additional filtering step for the circularization-based libraries: we now localize the paired end reads around each jumping pair and attempt to merge the two mates in the pair, pretending it is a non-junction short innie. The merge fails for the correct junction-containing pairs. This is done in the work2.1 folder, and the additional non-junction (chimeric) mate pairs detected are listed in work2.1/output.txt
  • Rewrote renaming/filtering of initial fastq files.
  • Set USE_LINKING_MATES=0 by default, force USE_LINKING_MATES=0 if OTHER long reads are supplied.
  • Set DO_HOMOPOLYMER_TRIM=0 by default.
  • Renamed runSRCA.pl to masurca.
  • The assemble.sh script can be regenerated with './assemble.sh -r'
  • The PATHS section in the configuration file is now deprecated, all paths to the binaries are automatically determined based on the location of the masurca script.
  • Improved the speed of the main jumping library filter code and correctly implemented the --join-aggressive flag in the mate joiner code. This flag joins the mate pair into a single read if any path through the k-mer graph exists leading from one mate to the other.
  • Updated the scaffold merging logic in the CA scaffolder (cgw) to improve speed
  • Changed the logic of handling low k-mer counts in the error corrector -- now if the current k-mer count is below the Poisson threshold, but the alternative counts are low as well, such that the probability of an error is lower than 10e-6 (computed from the binomial distribution), the base is not corrected

Version 2.0.3.1 (Bug fix version)

  • Fix compilation issues

Version 2.0.3

  • Keep skip mers in a Jellyfish hash in the Celera overlapper: the overlapper should not die anymore because of the number of skip mers
  • Fix race condition bug in gap closing: a few more gaps will successfully be reported as closed
  • Various bug fixes