Scripting for large-scale sequencing based on Hadoop

André Schumacher; Luca Pireddu; Aleksi Kallio; Matti Niemenmaa; Eija Korpelainen; Gianluigi Zanetti; Keijo Heljanko

doi:10.14806/ej.19.A.628

Authors

André Schumacher (ICSI, Berkeley, USA; Helsinki Institute for Information Technology HIIT, Helsinki, Finland; Aalto University, Espoo, Finland)
Luca Pireddu (CRS4, Pula)
Aleksi Kallio (CSC-IT Center for Science, Helsinki)
Matti Niemenmaa (Aalto University, Espoo)
Eija Korpelainen (CSC-IT Center for Science, Helsinki)
Gianluigi Zanetti (CRS4, Pula)
Keijo Heljanko (Aalto University, Espoo, Finland; Helsinki Institute for Information Technology HIIT, Helsinki, Finland)

DOI:

https://doi.org/10.14806/ej.19.A.628

Keywords:

bioinformatics, NGS, data analysis, cloud computing, high-performance computing

Abstract

The large volumes of data generated by modern sequencing experiments are challenging to manipulate and analyze, and traditional approaches are often difficult to scale. We describe our ongoing work on SeqPig, a tool that lets users manipulate, analyze and query sequencing data with the Pig Latin distributed scripting language, applying advances in data-intensive computing motivated by the “big data revolution”. SeqPig provides access to popular data formats and implements a number of custom sequencing-specific functions. Most importantly, it gives users access to the scalable Hadoop platform from a high-level scripting language.

Published

2013-04-08

Issue

Vol. 19: Supplement A

Section

Posters

License

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
