fastq; RNA-seq; Quality control; Preprocessing; Sequence data
- Pérez-Rubio, P.
- Lottaz, C.
- Engelmann, J.C.
Pérez-Rubio et al. BMC Bioinformatics (2019) 20:226
https://doi.org/10.1186/s12859-019-2799-0

SOFTWARE
Open Access

FastqPuri: high-performance preprocessing of RNA-seq data

Paula Pérez-Rubio¹, Claudio Lottaz¹ and Julia C. Engelmann²*

Abstract

Background: RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. Now, the most time demanding step in the analysis of RNA-seq data is preprocessing the raw sequence data, such as running quality control and adapter, contamination and quality filtering before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software that covers all preprocessing steps is currently missing.

Results: We here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable to be run within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness.

Conclusions: FastqPuri is a new tool which covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data to meet the needs for fast preprocessing of fastq data to allow transcript and gene counting, but it is suitable to process any short read sequencing data of which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. FastqPuri is most flexible in filtering undesired biological sequences by offering two approaches to optimize speed and memory usage dependent on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3.
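
To make the notion of quality filtering of fastq data concrete, the following is a minimal Python sketch of one common filtering rule: discard a read when too large a fraction of its bases falls below a Phred quality threshold. It is not FastqPuri's implementation (FastqPuri is written in C and R), and the function names, the Phred+33 encoding, the threshold of 20, and the 10% cutoff are all assumptions chosen for illustration only.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# Hypothetical record type for this sketch: (header, sequence, quality string).
Record = Tuple[str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from an uncompressed, four-line-per-read fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(c) - offset for c in qual]


def passes_quality(qual: str, min_q: int = 20,
                   max_low_fraction: float = 0.1) -> bool:
    """Keep a read if at most max_low_fraction of its bases score below min_q."""
    scores = phred_scores(qual)
    if not scores:
        return False
    low = sum(1 for q in scores if q < min_q)
    return low / len(scores) <= max_low_fraction


def filter_reads(handle: TextIO, min_q: int = 20,
                 max_low_fraction: float = 0.1) -> Iterator[Record]:
    """Yield only the reads that pass the quality rule above."""
    for header, seq, qual in read_fastq(handle):
        if passes_quality(qual, min_q, max_low_fraction):
            yield header, seq, qual


if __name__ == "__main__":
    # Two toy reads: 'I' decodes to Q40, '!' decodes to Q0.
    demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
    for header, seq, qual in filter_reads(demo):
        print(header, seq)
```

A production tool would additionally handle compressed input, paired-end files, adapter trimming and contamination screening, which this sketch deliberately leaves out.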