Cloud Infrastructure Applications – MuG Virtual Research Environment

Available applications

Virtual Machine name	Operating System	Software installed	PMES Application / URL
serial_maker+	Debian (6.0.5)	Maker (2.28), Exonerate (2.2.0), Snap (2006-07-28), Augustus (2.5.5), Blast (2.2.28+), RepeatMasker (1.295), TRF (4.07b)	Maker (help) Exonerate (help) Augustus (help)
bwapipeline	Debian (6.0.5)	BWA (0.1.17), bcftools (0.1.18), samtools (0.7.5a)	Bwa (help)
bowtie+	Debian (6.0.5)	Bowtie (2-2.2.3), tophat (2.0.12), boost (1_55), samtools (0.1.8)	Bowtie (help) Tophat (help)
bignasim	Ubuntu (8.04)	MongoDB v.2.6.2, Cassandra 2.1, Curves+ 2.0, R 2.15.0, Gnuplot 4.2, Grace 5.1.21, GROMACS 5.0, JSMol 14.0.5, PCASuite 1.1, Ambertools 14, Netpbm 10.0, VMD 1.8.5, ffmpeg 2.5, MDAnalysis, RnamlView	BigNASim (help)
nucleosomedynamics	Debian (8.2)	R 2.15.0, Gnuplot 4.2, Grace 5.1.21, JBrowse 1.11.6, MongoDB 3.0.7	Nucleosome Dynamics (help)

Description of the applications

PMES applications allow the user to configure and launch the software installed within each cloud virtual machine. This section details the arguments required by each application, the specific software used by each wrapper or pipeline, and the input files that each virtual machine needs to access in order to run the specified software. More detailed information about the execution process and the general I/O management can be found in the Dashboard documentation.

Genome Annotation

MAKER

Application Type: stand alone

This application runs the genome annotation pipeline MAKER 2.

MAKER: identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions and automatically synthesizes these data into gene annotations.

Arguments	configuration file Opts	Maker configuration file detailing general options and input files
	configuration file Bopts	Maker configuration file detailing the similarity parameters
	cpus blast	Number of CPUs used by BLAST. This should correspond to the total number of CPUs reserved.
	basename	Base-name of the pipeline output
Input Files	The two configuration files need to be uploaded, together with any other file referenced in those control files.
Output Files	The application returns a compressed folder called [BASENAME].tar.gz
Sample Files	Configuration files: http://transplantdb.bsc.es/documents/samples/maker/maker_opts.ctl http://transplantdb.bsc.es/documents/samples/maker/maker_bopts.ctl Files referenced in the Opts configuration file: http://transplantdb.bsc.es/documents/samples/maker/dpp_contig.fasta http://transplantdb.bsc.es/documents/samples/maker/dpp_est_fasta
Special requirements

AUGUSTUS

Application Type: stand alone

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences. It can be used as an ab initio program, but it may also incorporate hints on the gene structure coming from extrinsic sources such as EST, MS/MS, protein alignments and syntenic genomic alignments.

Arguments	query sequence	The query file contains the DNA input sequence and must be in uncompressed (multiple) fasta format
	specie	Choose one of the following species, for which Augustus has been trained: human, fly, arabidopsis, brugia, aedes, tribolium, schistosoma, tetrahymena, galdieria, maize, toxoplasma, caenorhabditis, aspergillus_fumigatus, aspergillus_nidulans, aspergillus_oryzae, aspergillus_terreus, botrytis_cinerea, candida_albicans, candida_guilliermondii, candida_tropicalis, chaetomium_globosum, coccidioides_immitis, coprinus, coprinus_cinereus, cryptococcus_neoformans_gattii, cryptococcus_neoformans_neoformans_B, cryptococcus_neoformans_neoformans_JEC21, debaryomyces_hansenii, encephalitozoon_cuniculi_GB, eremothecium_gossypii, fusarium_graminearum, histoplasma_capsulatum, kluyveromyces_lactis, laccaria_bicolor, lamprey, leishmania_tarentolae, lodderomyces_elongisporus, magnaporthe_grisea, neurospora_crassa, phanerochaete_chrysosporium, pichia_stipitis, rhizopus_oryzae, saccharomyces_cerevisiae_S288C, saccharomyces_cerevisiae_rm11-1a_1, schizosaccharomyces_pombe, trichinella, ustilago_maydis, yarrowia_lipolytica, nasonia, tomato, chlamydomonas, amphimedon, pneumocystis
	optional parameters	Original optional Augustus parameters. Default: --strand=both --genemodel=partial --maxDNAPieceSize=200000. See http://augustus.gobics.de/binaries/README.TXT to modify the default parameters or add others.
	output	Base-name of the GFF that will be generated
Input Files	Augustus only requires the FASTA file corresponding to the ‘query sequence’ parameter.
Output Files	The pipeline generates an output file called [OUTPUT].gff
Sample Files	query sequence: http://transplantdb.bsc.es/documents/samples/augustus/sequence.fa specie: arabidopsis
Special requirements
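
As an illustration of how the arguments above fit together, the following minimal sketch assembles an Augustus command line from the form values. The function name build_augustus_command is hypothetical, and the way the PMES wrapper actually invokes Augustus is not documented here, so this is only an assumption about the general shape of the call. The defaults are written with plain double hyphens, as the Augustus command line expects.

```python
import shlex

# Default optional parameters, as listed for the 'optional parameters' argument.
DEFAULT_OPTIONS = "--strand=both --genemodel=partial --maxDNAPieceSize=200000"


def build_augustus_command(query, species, output, options=DEFAULT_OPTIONS):
    """Return (argv, gff_name) for a hypothetical Augustus run.

    query   -- uncompressed (multi-)FASTA file with the DNA input sequence
    species -- value of the 'specie' argument, e.g. "arabidopsis"
    output  -- base-name of the GFF file to generate
    options -- optional Augustus parameters as a single string
    """
    argv = ["augustus", f"--species={species}", *shlex.split(options), query]
    # Augustus writes its predictions to standard output; the caller is
    # expected to redirect that stream into [OUTPUT].gff.
    return argv, f"{output}.gff"


argv, gff_name = build_augustus_command("sequence.fa", "arabidopsis", "results")
print(shlex.join(argv), ">", gff_name)
```

The command is built as a list rather than a single string so that file names containing spaces are passed through unchanged.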

Pairwise Sequence Alignment

EXONERATE

Application Type: stand alone

Exonerate is a generic tool for pairwise sequence comparison. It allows you to align sequences using many alignment models, either exhaustive dynamic programming, or a variety of heuristics.

Arguments

Query

query sequence/s required. These must be in a FASTA format file. Single or multiple query sequences may be supplied in one or more files.

Target

target sequence/s required. These must also be in a FASTA format file. As with the query sequences, single or multiple target sequences and files may be supplied.

The Ensembl Plants genomes (release 20) and the GRCh37 human genome are also available through the shared storage. In order to use them, specify one of the following options:

arabidopsis_lyrata

arabidopsis_thaliana

brachypodium_distachyon

brassica_rapa

chlamydomonas_reinhardtii

cyanidioschyzon_merolae

glycine_max

hordeum_vulgare

medicago_truncatula

musa_acuminata

oryza_brachyantha

oryza_glaberrima

oryza_indica

oryza_sativa

physcomitrella_patens

populus_trichocarpa

selaginella_moellendorffii

setaria_italica

solanum_lycopersicum

solanum_tuberosum

sorghum_bicolor

triticum_aestivum

triticum_urartu

vitis_vinifera

zea_mays

Homo_sapiens.chromosome.NUM (where NUM is the chromosome number, X, Y or MT)

Options

original optional Exonerate parameters. They are described at: http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html

Default: --model ungapped --bestn 0 --score 100 --exhaustive FALSE --showtargetgff yes

Chunks Query

equivalent to querychunktotal. Number of chunks into which the query will be split in order to run on different nodes (*).

Chunks Target

equivalent to targetchunktotal. Number of chunks into which the target will be split in order to run on different nodes (*).

Output basename

basename of the output

Input Files

The Query FASTA file/s need to be uploaded. The Target FASTA file/s must be uploaded as well, unless the target corresponds to a sequence included in the Ensembl Plants genomes (release 20) or in the GRCh37 human release. In such cases, only the “Target” parameter needs to be specified.
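
The rule above can be sketched as a small check that tells whether a given 'Target' value refers to a genome on the shared storage or to a file that has to be uploaded. The function name target_needs_upload is hypothetical, and the assumption that NUM ranges over 1 to 22, X, Y and MT is ours; the application may accept other values.

```python
import re

# Option names of the plant genomes available through the shared storage.
SHARED_PLANT_GENOMES = {
    "arabidopsis_lyrata", "arabidopsis_thaliana", "brachypodium_distachyon",
    "brassica_rapa", "chlamydomonas_reinhardtii", "cyanidioschyzon_merolae",
    "glycine_max", "hordeum_vulgare", "medicago_truncatula", "musa_acuminata",
    "oryza_brachyantha", "oryza_glaberrima", "oryza_indica", "oryza_sativa",
    "physcomitrella_patens", "populus_trichocarpa",
    "selaginella_moellendorffii", "setaria_italica", "solanum_lycopersicum",
    "solanum_tuberosum", "sorghum_bicolor", "triticum_aestivum",
    "triticum_urartu", "vitis_vinifera", "zea_mays",
}

# Homo_sapiens.chromosome.NUM, assuming NUM is 1-22, X, Y or MT.
HUMAN_CHROMOSOME = re.compile(
    r"^Homo_sapiens\.chromosome\.([1-9]|1[0-9]|2[0-2]|X|Y|MT)$"
)


def target_needs_upload(target):
    """Return True when 'target' is not one of the shared genomes."""
    is_shared = target in SHARED_PLANT_GENOMES or bool(HUMAN_CHROMOSOME.match(target))
    return not is_shared


for value in ("arabidopsis_lyrata", "Homo_sapiens.chromosome.X", "my_target.fa"):
    print(value, "-> upload needed:", target_needs_upload(value))
```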

Output Files

The application generates a GZIP file containing the concatenation of all Exonerate outfiles. File: BASENAME.gz

Sample Files

Query: http://transplantdb.bsc.es/documents/samples/exonerate/TAIR_partial.fa
Target: arabidopsis_lyrata
Chunks Query: 1
Chunks Target: 4
Advanced tab → Cores: 4 (*)
(*) Consider that the total number of cores reserved in the cloud should correspond to: Chunks-Query multiplied by Chunks-Target. If, for example, you wish to split the target database into 3 parts and the query into 2, 6 exonerate jobs would run, so 6 cores need to be reserved. The granularity of the chunk goes down to a single sequence.
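
The arithmetic in the note above can be written as a short sketch. The function names cores_required and chunk_jobs are hypothetical, and the 1-based numbering of the chunk pairs is only illustrative.

```python
from itertools import product


def cores_required(chunks_query, chunks_target):
    """Number of cores to reserve: one per (query chunk, target chunk) pair."""
    if chunks_query < 1 or chunks_target < 1:
        raise ValueError("both chunk counts must be at least 1")
    return chunks_query * chunks_target


def chunk_jobs(chunks_query, chunks_target):
    """List every (query chunk, target chunk) pair, i.e. one entry per job."""
    return list(product(range(1, chunks_query + 1), range(1, chunks_target + 1)))


# Example from the note: query split into 2 parts, target into 3 parts.
print(cores_required(2, 3))
print(chunk_jobs(2, 3))

# Sample run: Chunks Query = 1, Chunks Target = 4.
print(cores_required(1, 4))
```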

Special requirements

The application requires access to the DATA2 data storage if no Target file is uploaded and an Ensembl or GRCh37 database is specified instead.

NGS Alignment

BWA

Application Type: stand alone

This application is a sequential pipeline that uses BWA to align paired-end reads against a reference genome and converts the resulting alignment into a BAM file using SAM Tools.

BWA (Burrows-Wheeler Alignment): software package for mapping low-divergent sequences against a large reference genome.
SAM Tools: provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Arguments

fastq1

paired-end reads file 1 in fastq format.

fastq2

paired-end reads file 2 in fastq format.

Reference Genome

indexed reference genome (Ensembl release 20). Options:

arabidopsis_lyrata

arabidopsis_thaliana

brachypodium_distachyon

brassica_rapa

chlamydomonas_reinhardtii

cyanidioschyzon_merolae

glycine_max

hordeum_vulgare

medicago_truncatula

musa_acuminata

oryza_brachyantha

oryza_glaberrima

oryza_indica

oryza_sativa

physcomitrella_patens

populus_trichocarpa

selaginella_moellendorffii

setaria_italica

solanum_lycopersicum

solanum_tuberosum

sorghum_bicolor

triticum_aestivum

triticum_urartu

vitis_vinifera

zea_mays

Output basename

Base-name of the pipeline output

Input Files

The files required to run the application correspond to the arguments fastq1 and fastq2.

Output Files

The pipeline generates an output file called [BASENAME].bam

Sample Files

Fastq1: http://transplantdb.bsc.es/documents/samples/bwa/1.fastq.gz
Fastq2: http://transplantdb.bsc.es/documents/samples/bwa/2.fastq.gz
Reference Genome: arabidopsis_thaliana
Output base-name: results

Special requirements

The application requires access to the DATA2 data storage, where the Ensembl database is stored.

TopHat

Application Type: stand alone

This application executes the TopHat program. Additionally, it runs bowtie2-build to build the genome bowtie2 indexes.
[ bowtie2-build ] → TopHat

TopHat: a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie.

The application, as the original software, behaves differently according to the given arguments. For instance:

Align reads:
Build transcriptome from GTF:
Resume:

Arguments	read	A comma-separated list of files containing reads in FASTQ or FASTA format. For paired-end reads, this should be the *_1 files.
	read2	A comma-separated list of files containing reads in FASTQ or FASTA format. Only used for paired-end reads. It contains the _2 set of files, which must appear in the same order as the _1 files.
	index	Genome to be searched. The parameter accepts two types of values: (1) Bowtie2 index basename: the program will look for the [index].*.bt2 and [index].rev.*.bt2 files, which need to be uploaded (Input tab). (2) A comma-separated list of FASTA files containing the genome sequences, which will be indexed using the bowtie2-build program.
	cpus	Number of threads used to align reads. This should correspond to the number of cores reserved in the ‘advanced’ tab. Notice that bowtie2-build does not parallelise.
	output	Basename of the directory in which TopHat will write all of its output.
	topHat options	Native options of the TopHat program. Notice that some options take input files (e.g. -j file.juncs), which therefore need to be uploaded to the cloud through the ‘input’ tab. Check the options at: http://ccb.jhu.edu/software/tophat/manual.shtml
Input Files	When running TopHat to align RNA-Seq reads, the reads need to be uploaded to the virtual machine in FASTQ or FASTA format. The target name or target path (3rd column) should correspond to the arguments ‘read’ and ‘read2’. When using pre-built indexes in ‘index’, the .1.bt2 and .rev.1.bt2 files need to be uploaded. When new indexes are to be built, the original genomic FASTA files should be transferred. Additionally, when users supply their own insertions, deletions, or list of known transcripts, the corresponding .GTF, .BED, .JUNCS, etc., files need to be correctly specified within ‘TopHat options’, as well as uploaded through the ‘input’ tab.
Output Files	The application returns [OUTPUT].tar.gz, a compressed version of the standard TopHat output directory. When a new transcriptome index is created (--GTF and --transcriptome-index within ‘TopHat options’), it is included in [OUTPUT].tar.gz, so it can be reused in other TopHat runs.
Sample Files	read : http://transplantdb.bsc.es/documents/samples/bowtie/reads_1.fq read2: http://transplantdb.bsc.es/documents/samples/bowtie/reads_2.fq index: arabidopsis_lyrata cpus: 8 topHat options: -r 20
Special requirements

Bowtie

Application Type: stand alone

This application launches the fast aligner Bowtie2. Additionally, the wrapper includes the option to build the indexes from input reference genomes, for cases where the pre-built indexes of the Ensembl Plants full-version genomes are not suitable. The application includes the following software:
[ bowtie2-build ] → bowtie2 → samtools

BOWTIE2: It is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes.
BOWTIE2-build: The program indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small. This step is only performed when the user supplies a genome sequence to be indexed.
SAM Tools: provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Arguments

read

unpaired reads to be aligned OR paired-end reads containing mate 1s. FASTQ is the default format. A list of comma-separated read files is also accepted.

read2

paired-end reads containing mate 2s, used when mate 1s are specified in the argument ‘read’. FASTQ is the default format. A list of comma-separated read files is also accepted.

reference

reference genome against which reads are aligned. The parameter accepts two possible types of data:

genome sequence: comma-separated list of FASTA files containing the reference sequences to be aligned to. From them, FM indexes will be created by running the bowtie2-build program.
genome index: base-name of the index of a pre-built reference genome. The base-name is the name of any of the index files up to but not including the final *.1.bt2, *.rev.1.bt2, etc. Users can upload their own index files, or use those available on the shared disk. The base-names of the pre-built genomes are the following:

arabidopsis_lyrata

arabidopsis_thaliana

brachypodium_distachyon

brassica_rapa

chlamydomonas_reinhardtii

cyanidioschyzon_merolae

glycine_max

hordeum_vulgare

medicago_truncatula

musa_acuminata

oryza_brachyantha

oryza_glaberrima

oryza_indica

oryza_sativa

physcomitrella_patens

populus_trichocarpa

selaginella_moellendorffii

setaria_italica

solanum_lycopersicum

solanum_tuberosum

sorghum_bicolor

triticum_aestivum

triticum_urartu

vitis_vinifera

zea_mays
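
The base-name rule for the 'genome index' value can be illustrated with a minimal sketch that strips the trailing index suffix from a file name. The helper name index_basename is hypothetical, and the pattern only covers the *.N.bt2 and *.rev.N.bt2 naming described above.

```python
import re

# Matches the final ".1.bt2", ".2.bt2", ".rev.1.bt2", ... of an index file.
_INDEX_SUFFIX = re.compile(r"\.(rev\.)?[0-9]+\.bt2$")


def index_basename(index_file):
    """Return the value to use in 'reference' for a given Bowtie2 index file."""
    basename, replaced = _INDEX_SUFFIX.subn("", index_file)
    if replaced == 0:
        raise ValueError(f"not a Bowtie2 index file name: {index_file}")
    return basename


print(index_basename("my_genome.1.bt2"))
print(index_basename("indexes/my_genome.rev.2.bt2"))
```

All the index files of one genome share the same base-name, so any of them can be used to derive it.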

only index

yes|no. Return only the indexes built from the ‘reference’ sequence/s. As no reads will be aligned, the following arguments will be ignored: ‘read’, ‘read2’, ‘cpus’, ‘bowtie2-parameters’. Default: no

cpus

number of threads created by bowtie2. This should correspond to the number of cores reserved in the ‘advanced’ tab. Notice that bowtie2-build does not parallelise.

output

base-name of the final packed and compressed output

bowtie2-build parameters

native options of the bowtie2-build program. They are described at: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer

Default: -f --offrate 5

bowtie2 parameters

native options of bowtie2. They are described at: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line

Default: --end-to-end --sensitive --un-gz unpaired.sam.gz --met-file metrics.log --time

Input Files

Read files in FASTQ or Illumina’s QSEQ format (bowtie2 parameters = --qseq) need to be uploaded to the virtual machine.

The target name or target path (3rd column) should correspond to the arguments ‘read’ and ‘read2’.

Additionally, when users supply their own indexes, all *.1.bt2 and *.rev.1.bt2 files need to be uploaded, and their path and base-name set in the argument ‘reference’. However, if the ‘reference’ argument refers to genomic sequences, the files to upload are the FASTA reference genome sequences.

Output Files

The application returns a [OUTPUT].tar.gz file.

It will contain a variable number of files, created according to the provided options:

index files (reference*.bt2 and reference.rev.*.bt2)
SAM and BAM read alignments (alignment.sam, alignment.bam)
file containing unpaired reads that fail to align (named according to the --un options)
file containing unpaired reads that align at least once (named according to the --al options)
file containing paired-end reads that fail to align concordantly (named according to the --un-conc options)
file containing paired-end reads that align concordantly at least once (named according to the --al-conc options)
bowtie2 metrics file (named according to --met-file)

Sample Files

read: http://transplantdb.bsc.es/documents/samples/bowtie/reads_1.fq
read2: http://transplantdb.bsc.es/documents/samples/bowtie/reads_2.fq
reference: arabidopsis_lyrata
only index: no

Special requirements

The application requires access to the DATA2 data storage if an Ensembl genome is used as the reference genome.