Cloud Infrastructure Applications


Available applications

 

Virtual Machine name | Operating System | Software installed | PMES Application / URL
serial_maker+ | Debian (6.0.5) | Maker (2.28), Exonerate (2.2.0), Snap (2006-07-28), Augustus (2.5.5), Blast (2.2.28+), RepeatMasker (1.295), TRF (4.07b) | Maker (help), Exonerate (help), Augustus (help)
bwapipeline | Debian (6.0.5) | BWA (0.1.17), bcftools (0.1.18), samtools (0.7.5a) | Bwa (help)
bowtie+ | Debian (6.0.5) | Bowtie (2-2.2.3), tophat (2.0.12), boost (1_55), samtools (0.1.8) | Bowtie (help), Tophat (help)
bignasim | Ubuntu (8.04) | MongoDB v.2.6.2, Cassandra 2.1, Curves+ 2.0, R 2.15.0, Gnuplot 4.2, Grace 5.1.21, GROMACS 5.0, JSMol 14.0.5, PCASuite 1.1, Ambertools 14, Netpbm 10.0, VMD 1.8.5, ffmpeg 2.5, MDAnalysis, RnamlView | BigNASim (help)
nucleosomedynamics | Debian (8.2) | R 2.15.0, Gnuplot 4.2, Grace 5.1.21, JBrowse 1.11.6, MongoDB 3.0.7 | Nucleosome Dynamics (help)

 

Description of the applications

 

PMES applications allow the user to configure and launch the software installed within each cloud virtual machine. This section details the arguments required by each application, the specific software used by each wrapper or pipeline, and the input files that each virtual machine needs to access in order to run the specified software. More detailed information about the execution process and the general I/O management can be found in the Dashboard documentation.

 

Genome Annotation

 

MAKER

Application Type: stand alone

This application runs the genome annotation pipeline MAKER 2.

  • MAKER: identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions and automatically synthesizes these data into gene annotations.
Arguments
  • configuration file Opts: Maker configuration file detailing general options and input files.
  • configuration file Bopts: Maker configuration file detailing the similarity parameters.
  • cpus blast: Number of CPUs used by BLAST2. This should match the total number of CPUs reserved.
  • basename: Base-name of the pipeline output.
Input Files
  The two configuration files need to be uploaded, together with any other file referenced in those control files.
Output Files
  The application returns a compressed folder called [BASENAME].tar.gz
Sample Files
  • Configuration files:
    http://transplantdb.bsc.es/documents/samples/maker/maker_opts.ctl
    http://transplantdb.bsc.es/documents/samples/maker/maker_bopts.ctl
  • Files referenced in the Opts configuration file:
    http://transplantdb.bsc.es/documents/samples/maker/dpp_contig.fasta
    http://transplantdb.bsc.es/documents/samples/maker/dpp_est_fasta
Special requirements
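
Because every file referenced in the control files has to be uploaded as well, it can help to list those files before launching the job. The following is a minimal sketch, not part of the PMES application: it assumes the usual "key=value #comment" layout of maker_opts.ctl, and the set of keys treated as file references (FILE_KEYS) is an assumption chosen for illustration that should be adjusted to the options actually used.

```python
# Keys of maker_opts.ctl assumed, for this sketch only, to hold input file paths.
FILE_KEYS = {"genome", "est", "protein", "rmlib", "repeat_protein"}


def referenced_files(ctl_text, file_keys=FILE_KEYS):
    """Return the file paths given for the selected keys of a MAKER control file."""
    files = []
    for line in ctl_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in file_keys:
            # A value may hold several comma-separated files.
            files.extend(v.strip() for v in value.split(",") if v.strip())
    return files


# Hypothetical excerpt of a control file, using the sample file names above.
sample = """\
#-----Genome
genome=dpp_contig.fasta #genome sequence
est=dpp_est_fasta #set of ESTs
protein= #protein sequence file
"""

if __name__ == "__main__":
    for path in referenced_files(sample):
        print("upload:", path)
```

Empty values (such as protein= above) are skipped, so only files that are actually set end up in the upload list.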


 

 

AUGUSTUS

Application Type: stand alone

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences. It can be used as an ab initio program, but it may also incorporate hints on the gene structure coming from extrinsic sources such as EST, MS/MS, protein alignments and syntenic genomic alignments.

Arguments
  • query sequence: The query file contains the DNA input sequence and must be in uncompressed (multiple) FASTA format.
  • specie: Choose one of the following, for which Augustus has been trained:

    human, fly, arabidopsis, brugia, aedes, tribolium, schistosoma, tetrahymena, galdieria, maize, toxoplasma, caenorhabditis, aspergillus_fumigatus, aspergillus_nidulans, aspergillus_oryzae, aspergillus_terreus, botrytis_cinerea, candida_albicans, candida_guilliermondii, candida_tropicalis, chaetomium_globosum, coccidioides_immitis, coprinus, coprinus_cinereus, cryptococcus_neoformans_gattii, cryptococcus_neoformans_neoformans_B, cryptococcus_neoformans_neoformans_JEC21, debaryomyces_hansenii, encephalitozoon_cuniculi_GB, eremothecium_gossypii, fusarium_graminearum, histoplasma_capsulatum, kluyveromyces_lactis, laccaria_bicolor, lamprey, leishmania_tarentolae, lodderomyces_elongisporus, magnaporthe_grisea, neurospora_crassa, phanerochaete_chrysosporium, pichia_stipitis, rhizopus_oryzae, saccharomyces_cerevisiae_S288C, saccharomyces_cerevisiae_rm11-1a_1, schizosaccharomyces_pombe, trichinella, ustilago_maydis, yarrowia_lipolytica, nasonia, tomato, chlamydomonas, amphimedon, pneumocystis

  • optional parameters: Original optional Augustus parameters. Default: --strand=both --genemodel=partial --maxDNAPieceSize=200000
    Consult http://augustus.gobics.de/binaries/README.TXT to modify the default parameters or add others.
  • output: Base-name of the GFF file that will be generated.
Input Files
  Augustus only requires the FASTA file corresponding to the ‘query sequence’ parameter.
Output Files
  The pipeline generates an output file called [OUTPUT].gff
Sample Files
  • query sequence: http://transplantdb.bsc.es/documents/samples/augustus/sequence.fa
  • specie: arabidopsis
Special requirements
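
To show how the arguments above relate to a plain Augustus call, here is a minimal sketch that assembles the command line from the ‘query sequence’, ‘specie’ and ‘optional parameters’ values. The function names are hypothetical, and the way the PMES wrapper actually invokes Augustus is not documented here, so this is only an illustration of the mapping.

```python
import shlex

# Default value of the 'optional parameters' argument, as listed above.
DEFAULT_OPTIONS = "--strand=both --genemodel=partial --maxDNAPieceSize=200000"


def augustus_command(query, specie, optional=DEFAULT_OPTIONS):
    """Build an argument list: augustus --species=... [options] query.fa"""
    return ["augustus", f"--species={specie}", *shlex.split(optional), query]


def output_name(output):
    """Name of the GFF file derived from the 'output' base-name."""
    return f"{output}.gff"


if __name__ == "__main__":
    cmd = augustus_command("sequence.fa", "arabidopsis")
    # Augustus writes its predictions to standard output, so a caller would
    # redirect them to the GFF file, e.g. with subprocess.run(cmd, stdout=fh).
    print(" ".join(cmd), ">", output_name("results"))
```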

 

Pairwise Sequence Alignment

 

EXONERATE

Application Type: stand alone

Exonerate is a generic tool for pairwise sequence comparison. It allows you to align sequences using many alignment models, either exhaustive dynamic programming, or a variety of heuristics.

Arguments
  • Query: query sequence/s (required). These must be in a FASTA format file. Single or multiple query sequences may be supplied in one or more files.
  • Target: target sequence/s (required). These must also be in a FASTA format file. As with the query, single or multiple target sequences and files may be supplied.

    The Plant Ensembl genomes (release 20) and the GRCh37 human genome are also available through the shared storage. To use them, specify one of the following options:

    arabidopsis_lyrata
    arabidopsis_thaliana
    brachypodium_distachyon
    brassica_rapa
    chlamydomonas_reinhardtii
    cyanidioschyzon_merolae
    glycine_max
    hordeum_vulgare
    medicago_truncatula
    musa_acuminata
    oryza_brachyantha
    oryza_glaberrima
    oryza_indica
    oryza_sativa
    physcomitrella_patens
    populus_trichocarpa
    selaginella_moellendorffii
    setaria_italica
    solanum_lycopersicum
    solanum_tuberosum
    sorghum_bicolor
    triticum_aestivum
    triticum_urartu
    vitis_vinifera
    zea_mays
    Homo_sapiens.chromosome.NUM [where NUM = chromosome number, X, Y or MT]
  • Options: original optional Exonerate parameters. Consult them at: http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.html
    Default: --model ungapped --bestn 0 --score 100 --exhaustive FALSE --showtargetgff yes
  • Chunks Query: equivalent to querychunktotal. Number of chunks into which the query will be split in order to run on different nodes (*).
  • Chunks Target: equivalent to targetchunktotal. Number of chunks into which the target will be split in order to run on different nodes (*).
  • Output basename: basename of the output.
Input Files
  The Query FASTA file/s need to be uploaded, and so do the Target FASTA file/s, unless the target corresponds to a sequence included in the Plant Ensembl genomes (release 20) or the GRCh37 human release. In such cases, only the “Target” parameter needs to be specified.
Output Files
  The application generates a GZIP file containing the concatenation of all Exonerate outfiles. File: BASENAME.gz
Sample Files
  • Query: http://transplantdb.bsc.es/documents/samples/exonerate/TAIR_partial.fa
  • Target: arabidopsis_lyrata
  • Chunks Query: 1
  • Chunks Target: 4
  • Advanced tab → Cores: 4 (*)
(*) The total number of cores reserved in the cloud should equal Chunks Query multiplied by Chunks Target. If, for example, you split the target database into 3 parts and the query into 2, 6 exonerate jobs will run, so 6 cores need to be reserved. The granularity of the chunk goes down to a single sequence.
Special requirements
  The application requires access to the DATA2 data storage if no Target file is uploaded and the Ensembl or GRCh37 databases are specified instead.
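
The relation between chunks and cores in note (*) can be sketched as follows. This is an illustration only, assuming one Exonerate process per (query chunk, target chunk) pair and using Exonerate's own querychunkid/querychunktotal and targetchunkid/targetchunktotal options; the function name and the file paths are hypothetical, and the real wrapper may distribute the jobs differently.

```python
import shlex
from itertools import product


def exonerate_jobs(query, target, chunks_query, chunks_target, options=""):
    """Return one exonerate argument list per (query chunk, target chunk) pair."""
    jobs = []
    # Chunk identifiers are 1-based: 1..chunks_query and 1..chunks_target.
    for q, t in product(range(1, chunks_query + 1), range(1, chunks_target + 1)):
        jobs.append([
            "exonerate",
            "--query", query,
            "--target", target,
            "--querychunkid", str(q), "--querychunktotal", str(chunks_query),
            "--targetchunkid", str(t), "--targetchunktotal", str(chunks_target),
            *shlex.split(options),
        ])
    return jobs


if __name__ == "__main__":
    # Query split in 2 and target in 3: 2 * 3 = 6 jobs, so 6 cores to reserve.
    jobs = exonerate_jobs("query.fa", "target.fa", 2, 3, "--model ungapped")
    print("cores to reserve:", len(jobs))
```

With the sample values above (Chunks Query 1, Chunks Target 4) the same function yields 4 jobs, which matches the 4 cores set in the Advanced tab.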

 

NGS Alignment

 

BWA

Application Type: stand alone

This application is a sequential pipeline that uses BWA to align paired-end reads against a reference genome and converts the resulting alignment into a BAM file using SAM Tools.

  • BWA (Burrows-Wheeler Alignment): software package for mapping low-divergent sequences against a large reference genome.
  • SAM Tools: provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
Arguments
  • fastq1: paired-end reads file 1 in fastq format.
  • fastq2: paired-end reads file 2 in fastq format.
  • Reference Genome: indexed reference genome (Ensembl release 20). Options:

    arabidopsis_lyrata
    arabidopsis_thaliana
    brachypodium_distachyon
    brassica_rapa
    chlamydomonas_reinhardtii
    cyanidioschyzon_merolae
    glycine_max
    hordeum_vulgare
    medicago_truncatula
    musa_acuminata
    oryza_brachyantha
    oryza_glaberrima
    oryza_indica
    oryza_sativa
    physcomitrella_patens
    populus_trichocarpa
    selaginella_moellendorffii
    setaria_italica
    solanum_lycopersicum
    solanum_tuberosum
    sorghum_bicolor
    triticum_aestivum
    triticum_urartu
    vitis_vinifera
    zea_mays
  • Output basename: Base-name of the pipeline output.
Input Files
  The files required to run the application correspond to the arguments fastq1 and fastq2.
Output Files
  The pipeline generates an output file called [BASENAME].bam
Sample Files
  • Fastq1: http://transplantdb.bsc.es/documents/samples/bwa/1.fastq.gz
  • Fastq2: http://transplantdb.bsc.es/documents/samples/bwa/2.fastq.gz
  • Reference Genome: arabidopsis_thaliana
  • Output base-name: results
Special requirements
  The application requires access to the DATA2 data storage, where the Ensembl database is stored.
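
As a rough picture of the two sequential steps (alignment with BWA, then conversion to BAM with SAM Tools), the sketch below lists the commands such a pipeline could run. It is an assumption-laden illustration: the use of "bwa mem" and of "samtools view -b -S" is a choice made for this example, the exact sub-commands run by the PMES wrapper are not documented here, and reference_index stands for the path prefix of an already indexed genome.

```python
def bwa_pipeline(fastq1, fastq2, reference_index, basename):
    """Return (command, stdout_file) pairs for an align-then-convert pipeline."""
    sam = f"{basename}.sam"
    bam = f"{basename}.bam"
    return [
        # Step 1 (assumed): align the paired-end reads, writing SAM to stdout.
        (["bwa", "mem", reference_index, fastq1, fastq2], sam),
        # Step 2 (assumed): convert the SAM alignment to BAM.
        (["samtools", "view", "-b", "-S", sam], bam),
    ]


if __name__ == "__main__":
    # "reference_prefix" is a hypothetical index path, not a real PMES location.
    for command, stdout_file in bwa_pipeline(
        "1.fastq.gz", "2.fastq.gz", "reference_prefix", "results"
    ):
        print(" ".join(command), ">", stdout_file)
```

Each step only starts once the previous one has written its output, which is why the pipeline is described as sequential.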


 

 

TopHat

Application Type: stand alone

This application executes the TopHat program. Additionally, it runs bowtie2-build to build the genome bowtie2 indexes.
[ bowtie2-build ] → TopHat

  • TOPHAT: a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie.

The application, as the original software, behaves differently according to the given arguments. For instance:

  • Align reads:
  • Build transcriptome from GTF:
  • Resume:
Arguments
  • read: A comma-separated list of files containing reads in FASTQ or FASTA format. For paired-end reads, this should be the *_1 files.
  • read2: A comma-separated list of files containing reads in FASTQ or FASTA format. Only used for paired-end reads. It contains the *_2 set of files, which must appear in the same order as the *_1 files.
  • index: Genome to be searched. The parameter accepts two types of values:
    1. Bowtie2 index basename. The program will look for index*bt2 and index*rev.bt2 files, which need to be uploaded (Input tab).
    2. Comma-separated list of files containing the genomic sequences in FASTA format. They will be indexed using the bowtie2-build program.
  • cpus: Number of threads used to align reads. It should correspond to the number of cores reserved in the ‘advanced’ tab. Notice that bowtie2-build does not parallelise.
  • output: Basename of the directory in which TopHat will write all of its output.
  • topHat options: Native options of the TopHat program. Notice that some options take input files (e.g. -j file.juncs); such files need to be uploaded to the cloud through the ‘input’ tab.
    Check the options at: http://ccb.jhu.edu/software/tophat/manual.shtml

Input Files
  When running TopHat to align RNA-Seq reads, the reads need to be uploaded to the virtual machine in FASTQ or FASTA format. The target name or target path (3rd column) should correspond to the arguments ‘read’ and ‘read2’.

  When using pre-built indexes in ‘index’, the *.1.bt2 and *.rev.1.bt2 files need to be uploaded. When new indexes are to be built, the original genomic FASTA files should be transferred.

  Additionally, when users supply their own insertions, deletions, or list of known transcripts, the corresponding .GTF, .BED, .JUNCS, etc., files need to be correctly specified within ‘TopHat options’, as well as uploaded through the ‘input’ tab.
Output Files
  The application returns a [OUTPUT].tar.gz, a compressed version of the standard TopHat output directory.

  When a new transcriptome index is created (--GTF and --transcriptome-index within ‘TopHat options’), it is included in the [OUTPUT].tar.gz, so it can be reused in other TopHat runs.
Sample Files
  • read: http://transplantdb.bsc.es/documents/samples/bowtie/reads_1.fq
  • read2: http://transplantdb.bsc.es/documents/samples/bowtie/reads_2.fq
  • index: arabidopsis_lyrata
  • cpus: 8
  • topHat options: -r 20
Special requirements
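
The two kinds of ‘index’ value lead to two different command sequences, which the following sketch tries to make concrete. It is a hedged illustration: deciding by file extension whether ‘index’ is a FASTA list is an assumption of this example (the wrapper's real rule is not documented here), and the names needs_index_build, tophat_commands and the "genome_index" basename are hypothetical.

```python
import shlex

# Extensions assumed, for this sketch, to identify genomic FASTA files.
FASTA_EXTENSIONS = (".fa", ".fasta", ".fna")


def needs_index_build(index):
    """True when every comma-separated item of 'index' looks like a FASTA file."""
    parts = [p.strip() for p in index.split(",") if p.strip()]
    return bool(parts) and all(p.lower().endswith(FASTA_EXTENSIONS) for p in parts)


def tophat_commands(read, index, output, cpus=1, read2=None, options=""):
    """Return the argument lists to run, in order: [bowtie2-build], tophat."""
    commands = []
    index_base = index
    if needs_index_build(index):
        index_base = "genome_index"  # hypothetical basename for the new index
        commands.append(["bowtie2-build", index, index_base])
    tophat = ["tophat", "-p", str(cpus), "-o", output,
              *shlex.split(options), index_base, read]
    if read2:
        tophat.append(read2)  # *_2 files, same order as the *_1 files
    commands.append(tophat)
    return commands


if __name__ == "__main__":
    for cmd in tophat_commands("reads_1.fq", "arabidopsis_lyrata", "out",
                               cpus=8, read2="reads_2.fq", options="-r 20"):
        print(" ".join(cmd))
```

Only the TopHat step receives the thread count, since bowtie2-build does not parallelise.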


 

 

Bowtie

Application Type: stand alone

This application launches the fast aligner Bowtie2. Additionally, the wrapper includes the option to build the indexes from input reference genomes, in case the pre-built indexes of the ENSEMBL Plants full-version genomes are not suitable. The application includes the following software:
[ bowtie2-build ] → bowtie2 → samtools

  • BOWTIE2: It is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes.
  • BOWTIE2-build: The program indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small. This step is only performed when the user supplies a genome sequence to be indexed.
  • SAM Tools: provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
Arguments
  • read: unpaired reads to be aligned OR paired-end reads containing mate 1s. FASTQ is the default format. A list of comma-separated read files is also accepted.
  • read2: paired-end reads containing mate 2s when mate 1s are specified in the argument read. FASTQ is the default format. A list of comma-separated read files is also accepted.
  • reference: reference genome against which reads are aligned. The parameter accepts two possible types of data:

    1. genome sequence: comma-separated list of FASTA files containing the reference sequences to be aligned to. FM indexes will be created from them by running the bowtie2-build program.
    2. genome index: base-name of the index of a pre-built reference genome. The base-name is the name of any of the index files up to but not including the final *.1.bt2, *.rev.1.bt2, etc. Users can upload their own index files, or use those available in the shared disk. The base-names of the pre-built genomes are:

       arabidopsis_lyrata
       arabidopsis_thaliana
       brachypodium_distachyon
       brassica_rapa
       chlamydomonas_reinhardtii
       cyanidioschyzon_merolae
       glycine_max
       hordeum_vulgare
       medicago_truncatula
       musa_acuminata
       oryza_brachyantha
       oryza_glaberrima
       oryza_indica
       oryza_sativa
       physcomitrella_patens
       populus_trichocarpa
       selaginella_moellendorffii
       setaria_italica
       solanum_lycopersicum
       solanum_tuberosum
       sorghum_bicolor
       triticum_aestivum
       triticum_urartu
       vitis_vinifera
       zea_mays
  • only index: yes|no. Return only the indexes built from the ‘reference’ sequence/s. As no reads will be aligned, the following arguments will be ignored: ‘read’, ‘read2’, ‘cpus’, ‘bowtie2-parameters’. Default: no
  • cpus: number of threads created by bowtie2. It should correspond to the number of cores reserved in the ‘advanced’ tab. Notice that bowtie2-build does not parallelise.
  • output: base-name of the final packed and compressed output.
  • bowtie2-build parameters: native options of the bowtie2-build program. Consult them at: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer
    Default: -f --offrate 5
  • bowtie2 parameters: native options of bowtie2. Consult them at: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line
    Default: --end-to-end --sensitive --un-gz unpaired.sam.gz --met-file metrics.log --time

Input Files
  Read files in FASTQ or Illumina’s QSEQ format (bowtie2 parameters = --qseq) need to be uploaded to the virtual machine.

  The target name or target path (3rd column) should correspond to the arguments ‘read’ and ‘read2’.

  Additionally, when users supply their own indexes, all *.1.bt2 and *.rev.1.bt2 files need to be uploaded, and their path and base-names set in the argument ‘reference’. However, if the ‘reference’ argument refers to genomic sequences, the files to upload are the FASTA reference genome sequences.
Output Files
  The application returns a [OUTPUT].tar.gz file.

  It will contain a variable number of files, created according to the provided options:

  • index files (reference*.bt2 and reference.rev.*.bt2)
  • read alignments in SAM and BAM format (alignment.sam, alignment.bam)
  • file containing unpaired reads that fail to align (named according to the --un options)
  • file containing unpaired reads that align at least once (named according to the --al options)
  • file containing paired-end reads that fail to align concordantly (named according to the --un-conc options)
  • file containing paired-end reads that align concordantly at least once (named according to the --al-conc options)
  • bowtie2 metrics file (named according to --met-file)
Sample Files
  • read: http://transplantdb.bsc.es/documents/samples/bowtie/reads_1.fq
  • read2: http://transplantdb.bsc.es/documents/samples/bowtie/reads_2.fq
  • reference: arabidopsis_lyrata
  • only index: no
Special requirements
  The application requires access to the DATA2 data storage if the Ensembl database is used as reference genome.
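
The rule for the index base-name (the file name up to, but not including, the final .1.bt2, .rev.1.bt2, etc.) can be illustrated with a short sketch. It assumes the standard six-file layout of a small Bowtie2 index (.1.bt2 to .4.bt2 plus .rev.1.bt2 and .rev.2.bt2); the function names are hypothetical and large indexes with the .bt2l extension are not handled.

```python
import re

# Suffixes of the six files that make up a small Bowtie2 index.
INDEX_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")
_SUFFIX_RE = re.compile(r"(\.rev)?\.[1-4]\.bt2$")


def index_basename(filename):
    """Strip the trailing .N.bt2 or .rev.N.bt2 to obtain the 'reference' base-name."""
    base, replaced = _SUFFIX_RE.subn("", filename)
    if replaced == 0:
        raise ValueError(f"not a Bowtie2 index file: {filename}")
    return base


def missing_index_files(basename, uploaded):
    """List the index files expected for 'basename' that are not in 'uploaded'."""
    expected = {basename + suffix for suffix in INDEX_SUFFIXES}
    return sorted(expected - set(uploaded))


if __name__ == "__main__":
    # "my_genome" is a hypothetical user-supplied index.
    uploaded = ["my_genome.1.bt2", "my_genome.rev.1.bt2"]
    base = index_basename(uploaded[0])
    print("reference:", base)
    print("still to upload:", missing_index_files(base, uploaded))
```

Running such a check before submitting a job helps catch an incomplete upload of user-supplied index files.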