MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).
The MaSuRCA assembler is written in C++ and Perl. It is developed and tested on x86_64 Linux systems. It might work on other UNIX-like systems, but it is not well tested there. The following is required (all current major Linux distributions include this software, though it may need to be installed with the built-in package manager):
- GNU C++ compiler g++ version 4.4 or higher.
- GNU make.
- Perl version 5.8 or higher.
- Development files for the bz2 library (usually packaged as libbz2-dev or libbz2-devel).
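The requirements above can be checked before building with a small script. This is only a sketch and not part of the MaSuRCA distribution: the function names (version_ge, report, check_prereqs and the has_* helpers) are made up for illustration, and the bz2 check assumes that preprocessing a file that includes bzlib.h is a good enough test for the development files.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not shipped with MaSuRCA.

# version_ge HAVE WANT: succeed when dotted version HAVE is at least WANT.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$2" "$1" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
    [ "$lowest" = "$2" ]
}

# GNU make identifies itself on the first line of --version.
has_gnu_make() { make --version 2>/dev/null | grep -q 'GNU Make'; }

# Assumption: if the preprocessor can find bzlib.h, the bz2 development
# files are installed.
has_bzlib() { echo '#include <bzlib.h>' | g++ -E -x c++ - >/dev/null 2>&1; }

# report LABEL COMMAND...: run COMMAND and print one status line for it.
report() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok       $label"
    else
        echo "MISSING  $label"
    fi
}

check_prereqs() {
    # A missing tool is treated as version 0 so the comparison fails.
    gxx=$(g++ -dumpversion 2>/dev/null) || gxx=0
    perlv=$(perl -e 'printf "%vd", $^V' 2>/dev/null) || perlv=0
    report "g++ >= 4.4 (found: $gxx)"   version_ge "$gxx" 4.4
    report "GNU make"                   has_gnu_make
    report "Perl >= 5.8 (found: $perlv)" version_ge "$perlv" 5.8
    report "bz2 development files"      has_bzlib
}

check_prereqs
```

The script only reports; it never aborts, so every missing item is listed in a single run.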
To install, download the latest source and run the installation
script './install.sh'. To install in a different directory, say
'/opt/MaSuRCA', pass the DEST environment variable, like this:
DEST=/opt/MaSuRCA ./install.sh
To compile on RedHat 5 and 6 (or CentOS 5 and 6), make sure that version 4.4 of gcc is installed:
sudo yum install gcc44-c++
and pass the variables CC and CXX to the install script, as follows:
CC=gcc44 CXX=g++44 ./install.sh
- Latest source code and binaries.
- Usage documentation.
- Older versions are available from the ftp site.
Running the assembler
This is only a quick-start guide; refer to the documentation for more details. The assembly is driven by a configuration file that specifies the location of the read files and some parameters. A shell script that runs the actual assembler is generated from this configuration. The steps are as follows, assuming that the variable $MASURCA contains the directory where the code was compiled.
Generate a sample configuration file named 'configuration.txt':
$MASURCA/bin/masurca -g configuration.txt
Then edit the configuration file and start the assembly:
$MASURCA/bin/masurca configuration.txt
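The two steps above can be wrapped in a small helper so that the generate, edit, run order is hard to get wrong. This is a sketch under stated assumptions, not something shipped with MaSuRCA: the functions masurca_path and quick_start are hypothetical, and the only masurca invocations used are the two shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two quick-start commands.

# masurca_path: print the expected location of the masurca executable,
# or fail if $MASURCA is unset or empty.
masurca_path() {
    [ -n "${MASURCA:-}" ] || { echo "MASURCA is not set" >&2; return 1; }
    printf '%s\n' "$MASURCA/bin/masurca"
}

# quick_start [CONFIG]: if CONFIG does not exist yet, generate a sample
# and stop so it can be edited by hand; if it already exists, start the
# assembly with it.
quick_start() {
    config=${1:-configuration.txt}
    bin=$(masurca_path) || return 1
    if [ ! -f "$config" ]; then
        "$bin" -g "$config" || return 1
        echo "Edit $config, then call quick_start again." >&2
    else
        "$bin" "$config"
    fi
}
```

After sourcing the file, the first call to quick_start configuration.txt writes the sample configuration, and a second call, made once the file has been edited, starts the assembly.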