From a software engineering point of view, once you skip the fancy nomenclature and look beyond scientific paper abstracts, bioinformatics mostly deals with handling tabular data – loads of it. The compressed output of a typical sequencing run is a few gigabytes, but once you start analyzing that data, intermediate files can grow to as much as 10 times the size of the raw input. On larger projects, dataset sizes of hundreds of gigabytes or a few terabytes are not unusual.
Aggregating a Large Amount of Input Data – Problem and Solution
Just recently we completed an exciting project that also taught us a valuable lesson. Without going into much detail, one step of the analysis was a typical case of a large dataset meeting a simple bioinformatics algorithm.
For smaller datasets the solution would be trivial, but in this case about 5 GB of input data grew to 2.5 TB that needed to be parsed, filtered, and aggregated into intermediate results for the next step of the processing pipeline. What could we do with a TSV file larger than our local hard drives? The obvious approach was to just “gzip” it. We got a nice 200 GB file.
That seemed like a very smart choice until we had to parse the file a few times. Parallel processing of a large gzip file is next to impossible, because the compressed stream cannot be split into independently readable chunks. So we had to transform it into something more manageable. Everybody knows there are no silver bullets – but there is Apache Arrow, and it has a very simple API that plays nicely with Pandas.
What Is Apache Arrow? An Apache Solution That Works with Other Technologies
The idea behind the Arrow project is to provide a language-agnostic in-memory format for efficient handling of columnar data. It is backed by major players in every data processing ecosystem (Hadoop, Spark, Impala, Drill, Pandas, …) and supports a variety of popular programming languages, including Java, Python, and C/C++.
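To give a feel for the API, here is a minimal sketch (with a made-up toy DataFrame) of moving tabular data between Pandas and Arrow's columnar in-memory representation:
import pandas as pd
import pyarrow as pa

# A toy DataFrame standing in for one chunk of tabular data.
df = pd.DataFrame({'sample_id': ['s1', 's2', 's3'],
                   'coverage': [31.5, 28.9, 35.1]})

# Convert to Arrow's columnar in-memory format ...
table = pa.Table.from_pandas(df)
print(table.schema)    # column names and types
print(table.num_rows)  # 3

# ... and back to Pandas when you need DataFrame semantics.
df_again = table.to_pandas()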
To tame the input and output files we used Apache Parquet, a columnar on-disk format that is popular in the Hadoop ecosystem and is one of the technologies behind tools like Facebook's Presto and Amazon Athena. When paired with a fast compression codec like Google's Snappy, Parquet provides a good balance of performance and file size, and it makes parallel processing a breeze.
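As a rough illustration of that trade-off (the data and file names below are made up for the example), pyarrow lets you pick the compression codec per file, so it is easy to compare codecs on a sample of your own data:
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A hypothetical chunk of data, just for comparing codecs.
df = pd.DataFrame({'position': range(1000000), 'depth': [30] * 1000000})
table = pa.Table.from_pandas(df)

# Snappy: fast to compress and decompress, moderate file size.
pq.write_table(table, 'chunk_snappy.parquet', compression='snappy')
# Gzip: smaller files, but noticeably slower to read and write.
pq.write_table(table, 'chunk_gzip.parquet', compression='gzip')

for path in ('chunk_snappy.parquet', 'chunk_gzip.parquet'):
    print(path, os.path.getsize(path) // 1024, 'KB')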
How We Used Apache Arrow for Our Big Dataset
Our dataset was too large to process naively on a single machine, but too small to bother with spinning up a cluster. To save time, we decided to go with a single server and Python's multiprocessing module.
Split large, compressed files into Parquet chunks:
import glob
import multiprocessing as mp
import os.path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class_name']

# The iris data file is comma-separated, so set the separator explicitly
# and read the gzipped file in large chunks.
reader = pd.read_table('iris.data.gz', sep=',', chunksize=1e7,
                       names=column_names, compression='gzip')

# Write each chunk as a separate Snappy-compressed Parquet file.
for chunk_no, chunk in enumerate(reader):
    pq.write_table(pa.Table.from_pandas(chunk),
                   os.path.join('in_dir', 'iris-{:04d}.parquet'.format(chunk_no)),
                   compression='snappy')
Do some work in parallel:
def do_work(filename):
    # Load a single Parquet chunk into a DataFrame, decoding with two threads.
    chunk = pq.read_table(os.path.join('in_dir', filename), nthreads=2).to_pandas()
    # Some processing...
    # Write the processed chunk back out as Snappy-compressed Parquet.
    pq.write_table(pa.Table.from_pandas(chunk),
                   os.path.join('out_dir', filename),
                   compression='snappy')

# Hand each chunk to a worker process; do_work expects a bare file name,
# so strip the directory from the globbed paths.
pool = mp.Pool()
for path in glob.glob('in_dir/*.parquet'):
    pool.apply_async(do_work, args=(os.path.basename(path),))
pool.close()
pool.join()
Load and aggregate results:
# Read all processed chunks from out_dir into a single DataFrame,
# using one loading thread per CPU core.
df = pq.read_table('out_dir', nthreads=mp.cpu_count()).to_pandas()
# Aggregate the results into the final report ...
df.to_csv('report.txt')
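What exactly gets aggregated depends on the analysis; as a purely illustrative stand-in using the iris columns from the earlier example, it could be as simple as a groupby:
# Illustrative stand-in for the real aggregation: per-class means of the numeric columns.
report = df.groupby('class_name').mean()
report.to_csv('report.txt')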
The Benefits of Using Parquet over Text Files
Let me show you some stats to give you an idea of the storage and performance benefits of using Parquet over text files. Please keep in mind that this is not a rigorous benchmark, just a crude comparison.
Here are the file sizes for a small sample of our data set (~2 million rows):
Format                          | Size in MB
CSV                             | 173
CSV with GZip compression       | 22
Parquet with Snappy compression | 32
So the size difference is significant but not as significant as the performance increase:
Loading method                                                      | Time in seconds
CSV                                                                 | 3.51
CSV with GZip compression                                           | 3.54
Parquet with Snappy compression – 1 loading thread                  | 1.53
Parquet with Snappy compression – 2 loading threads                 | 1.31
Parquet with Snappy compression – 2 threads, only first two columns | 0.53
For our calculations we used only a few selected columns from the input files, but we wanted to preserve all of the original data. You know, “just in case”. It later proved to be a great decision.
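Parquet makes that choice cheap at read time, because the reader can pull just the columns a given step needs while the rest of the data stays untouched on disk. A minimal sketch (the column names follow the iris example above):
# Read only the two columns this particular step needs.
subset = pq.read_table('in_dir/iris-0000.parquet',
                       columns=['sepal_length', 'sepal_width']).to_pandas()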
Because we switched to Parquet, loading the input files got almost an order of magnitude faster, and then it got even better. Parquet gives you file metadata, a quick way to check a table's dimensions without loading it, column selection, stored row indices, the merging of partitions, and much more…
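For example, here is a minimal sketch of checking a chunk's dimensions and schema without reading any of the row data (the file name follows the chunking example above):
# Inspect the footer metadata of one chunk without loading the table itself.
pf = pq.ParquetFile('in_dir/iris-0000.parquet')
print(pf.metadata.num_rows)     # row count
print(pf.metadata.num_columns)  # column count
print(pf.schema)                # column names and types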
Apache Arrow and Parquet – Large Tables Processed with Ease
Bioinformatics is a field that routinely has to deal with vast amounts of data, especially around sequencing runs and the analysis that turns them into valuable insights. The files involved are usually extremely large, which makes it difficult to process them and extract useful scientific information.
However, working with such large tabular files does not have to be a headache if it is done smartly. Next time you need to store or quickly process large tables, consider the Arrow/Parquet combination and save the CSV format for the final report.