MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).
The MaSuRCA assembler is written in C++ and Perl. It is developed and tested on x86_64 Linux systems. It might work on other UNIX-like systems, but it is not well tested there. The following software is required (all current major Linux distributions include these packages, but they may need to be installed with the built-in package manager):
- GNU C++ compiler g++ version 4.7 or higher.
- GNU make.
- Perl version 5.8 or higher.
- Development files for the bz2 library (usually packaged as libbz2-dev or libbz2-devel).
- Perl Statistics::Descriptive library.
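The requirements above can be checked before installing with a short shell sketch like the following. The function names `version_ge` and `check_prereqs` are hypothetical helpers, not part of MaSuRCA. The sketch assumes GNU `sort` (for the `-V` version ordering) and does not check for the bz2 development files, whose location varies between distributions.

```shell
#!/bin/sh
# Hypothetical pre-install check; not part of the MaSuRCA distribution.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU sort, which provides -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reports which of the listed requirements are present.
# Returns 0 if everything checked was found, 1 otherwise.
check_prereqs() {
    status=0

    for tool in g++ make perl; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done

    # g++ must be version 4.7 or higher.
    if command -v g++ >/dev/null 2>&1; then
        gxx_version=$(g++ -dumpversion)
        if version_ge "$gxx_version" 4.7; then
            echo "ok:      g++ $gxx_version"
        else
            echo "too old: g++ $gxx_version (need 4.7 or higher)"
            status=1
        fi
    fi

    # Loading the module with -M fails if it is not installed.
    if perl -MStatistics::Descriptive -e 1 >/dev/null 2>&1; then
        echo "found:   Statistics::Descriptive"
    else
        echo "missing: Statistics::Descriptive"
        status=1
    fi

    return $status
}
```

After loading these definitions into a shell, run `check_prereqs` and inspect the report; anything listed as missing can be installed with the distribution's package manager.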
To install, download the latest source and run the installation
script './install.sh'. To install in a different directory, say
'/opt/MaSuRCA', pass the DEST environment variable, like this:
DEST=/opt/MaSuRCA ./install.sh
A recent version of the compiler is available from the "Developer
Toolset" for Red Hat Enterprise Linux/CentOS. Pass the variables CC
and CXX to the install script, pointing to your compiler:
CC=/path/to/gcc47 CXX=/path/to/g++47 ./install.sh
Running the assembler
This is only a quick-start guide. Refer to the documentation for more details. The assembly is driven by a configuration file that specifies the location of the read files and some parameters. A shell script that runs the actual assembler is generated from this configuration. The steps are as follows, assuming that the variable $MASURCA contains the directory where the code was compiled.
Generate a sample configuration file named 'configuration.txt':
$MASURCA/bin/masurca -g configuration.txt
Then edit the configuration file and start the assembly:
$MASURCA/bin/masurca configuration.txt
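The two steps above can be wrapped in a small shell sketch that fails early when something is missing. The function names `require_masurca`, `generate_config` and `run_masurca` are hypothetical helpers, not part of MaSuRCA; only the two masurca invocations themselves are taken from this guide.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the two quick-start commands.
# $1: directory where MaSuRCA was compiled (the $MASURCA variable).
# $2: configuration file name (defaults to configuration.txt).

# Fail with a readable message if the masurca executable is missing.
require_masurca() {
    if [ ! -x "$1/bin/masurca" ]; then
        echo "error: no executable found at $1/bin/masurca" >&2
        return 1
    fi
}

# Step 1: generate a sample configuration file.
generate_config() {
    require_masurca "$1" || return 1
    "$1/bin/masurca" -g "${2:-configuration.txt}"
}

# Step 2: run masurca on the edited configuration file.
run_masurca() {
    config=${2:-configuration.txt}
    require_masurca "$1" || return 1
    if [ ! -f "$config" ]; then
        echo "error: $config not found; generate and edit it first" >&2
        return 1
    fi
    "$1/bin/masurca" "$config"
}
```

With these definitions loaded, `generate_config "$MASURCA"` performs the first step and `run_masurca "$MASURCA"` the second, once the configuration file has been edited by hand.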