From software engineering point of view, when you skip fancy nomenclature and go beyond scientific paper abstracts, bioinformatics is mostly about handling tabular data. Loads of it. Compressed output of a typical sequencing run is around few gigabytes. But once you start analysing this data you can get as much as 10 times the size of raw input data. When working with larger projects’ datasets sizes of hundreds of gigabytes or a few terabytes are not unusual.
Just recently we completed a very interesting project, which also provided us with a valuable lesson. Without going into much details, one of the steps of the analysis was a typical example of the large dataset vs a simple bioinformatics algorithms. For smaller datasets the solution would be trivial, but in this case about 5GB of input data grew to 2.5TB. That needed to be parsed, filtered and aggregated into intermediate result for the next step of processing pipeline. What could we do with TSV that is larger than the size of local hard drives? The obvious approach was to just “gzipp” it. We got a nice 200GB file. That appeared to be a very smart choice until we had to parse this file a few times. Parallel processing of the large compressed files is next to impossible. So we had to transform this into something more wieldy. Everybody knows there are no silver bullets but there is Apache Arrow. And it has a very simple API that plays nice with Pandas.
The idea behind the project Arrow is to provide language-agnostic in-memory format for efficient handling of the columnar data. It is backed by major players in every data processing ecosystem (Hadoop, Spark, Impala, Drill, Pandas, …) and supports a variety of popular programming languages: Java , Python, C/C++.
To tame the input and the output files we used Apache Parquet, which is popular in Hadoop ecosystem and is the cool technology behind tools like Facebook’s Presto or Amazon Athena. When paired with fast compression like Snappy from Google, Parquet provides a good balance of performance and the file size. It makes parallel processing a breeze.
Our dataset was a bit too large for a single machine but too small to bother with spinning a cluster. To save time we decided to go with single server and Python’s multiprocessing module.
Split large, compressed file into Parquet chunks:
import multiprocessing as mp
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class_name']
reader = pd.read_table('iris.data.gz', chunksize=1e7, names=column_names, compression='gzip')
for chunk_no, chunk in enumerate(reader):
Do some work in parallel:
chunk = pq.read_table(os.path.join('in_dir', filename), nthreads=2).to_pandas()
# Some processesing ...
pool = mp.Pool()
for filename in glob.glob('in_dir/*.parquet'):
Load and aggregate results:
df = pq.read_table('out_dir', nthreads=mp.cpu_count()).to_pandas()
# Aggregate the results into final report ...
To give you an idea of storage and performance benefits of using Parquet over text files let me show you some stats. Please keep in mind that this is not a rigorous benchmark, just a crude comparison.
Here are the file sizes for a small sample of our data set (~2 millions of rows):
||Size in MB
|CSV with GZip compression
|Parquet with Snappy compression
So the size difference is significant but not as significant as performance increase:
||Time in seconds
|CSV with GZip compression
|Parquet with Snappy compression – 1 loading thread
|Parquet with Snappy compression – 2 loading threads
|Parquet with Snappy compression – 2 threads, only first two columns
For our calculation we were using only selected columns from our input files, but we wanted to preserve all of the original data. You know. “Just in case”. It later proved to be a great decision.
Because we switched to Parquet our analysis got almost order of magnitude faster when loading input files and then it go better. Parquet gives you metadata, quick way to check table dimensions without loading it, columns selection, it stores row indices, merging of partitions and much more…
Working with large tabular files does not have to be a headache if done in a smart way. Next time when you need to store or quickly process large tables, consider Arrow/Parquet combination and save CSV file format for the final report.