Parallel Batch Processing

When compressing multiple files, I was once again disappointed that gzip, bzip2, lzop, … cannot use multiple cores to speed up the compression process.

Fortunately, there is a small program that at least makes it possible to spawn multiple processes (by default, as many as there are CPU cores) to process multiple files in parallel. Surprisingly, the name of the program is “parallel”. 🙂 On Debian, it can be installed with a simple

apt-get install parallel
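If one job per core is too many (or too few), the job count can be set explicitly. The following is a minimal sketch, assuming GNU parallel (the program the Debian package above provides) and gzip are installed; the function name compress_with_jobs and the file names in the usage comment are made up for illustration.

```shell
#!/usr/bin/env bash
# compress_with_jobs JOBS FILE...
# Hypothetical helper: gzip the given files, running at most JOBS
# gzip processes at the same time (-j overrides the default job count).
compress_with_jobs() {
    local jobs="$1"
    shift
    parallel -j "$jobs" gzip ::: "$@"
}

# Example call (file names are placeholders):
#   compress_with_jobs 2 a.log b.log c.log
```

Adding --dry-run to the parallel call should print the commands instead of running them, which is handy for checking what would be executed.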

Here is a little bash script which illustrates the difference between sequential and parallel file compression:

#!/usr/bin/env bash
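# mktemp needs -d to create a directory instead of a file
TEMP_DIR="$(mktemp -d)"

# stop here if the directory is not usable, so no files land elsewhere
cd "$TEMP_DIR" || exit 1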
for i in {0000..1023}; do dd if=/dev/urandom of=testfile$i bs=1M count=1; done

time bzip2 testfile*
time bunzip2 testfile*
time parallel bzip2 ::: testfile*
time parallel bunzip2 ::: testfile*

rm -rf "$TEMP_DIR"

This script generates 1024 files of 1 MiB each in a temporary directory. First, it compresses and decompresses all files sequentially. After that, it performs the same steps in parallel. Each step is timed. In the parallel version, all cores are used, which cut the wall-clock ("real") time to roughly a third for compression and roughly half for decompression. The following times were measured in an Arch Linux VM within VirtualBox with four streams on an Intel i7 quad core CPU:

Sequential compression time:

real	2m53.668s
user	2m50.713s
sys	0m2.887s

Sequential decompression time:

real	1m17.762s
user	1m14.490s
sys	0m1.990s

Parallel compression time:

real	1m1.950s
user	3m27.610s
sys	0m6.707s

Parallel decompression time:

real	0m39.234s
user	1m52.003s
sys	0m5.477s
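To put a number on the gain, the "real" times above can be divided by each other. The snippet below is only a small sketch of that arithmetic; the helper names to_seconds and speedup are made up for this example, and awk is assumed to be available.

```shell
#!/usr/bin/env bash
# to_seconds converts a `time`-style value such as 2m53.668s into seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = seconds plus "s"
        sub(/s$/, "", p[2])       # drop the trailing "s"
        printf "%.3f\n", p[1] * 60 + p[2]
    }'
}

# speedup prints sequential time divided by parallel time, two decimals.
speedup() {
    awk -v s="$(to_seconds "$1")" -v p="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", s / p }'
}

speedup 2m53.668s 1m1.950s     # compression:   prints 2.80
speedup 1m17.762s 0m39.234s    # decompression: prints 1.98
```

Both factors stay below the ideal of four for this setup, and the "user" times show that the parallel runs consumed more CPU time in total than the sequential ones.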

The real power of this little program is unleashed when it is combined with find. To transcode all wav files in a directory and its subdirectories, find can be used with the -exec option:

find /path/to/parent/folder -type f -name '*.wav' -exec flac -8 {} +

Again, this command runs sequentially, so it uses only one core and takes a long time overall. To transcode multiple files at once, simply change the command to this:

find /path/to/parent/folder -type f -name '*.wav' | parallel flac -8

This variant spawns multiple instances of the flac encoder, so the CPU operates at full capacity.
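One caveat with the pipe: parallel reads one file name per line, so a name containing a newline would be split. A possible variant, sketched here under the assumption that GNU find and GNU parallel are used, passes the names NUL-separated instead (-print0 on the find side, -0 on the parallel side). The wrapper name run_on_wavs is made up for this example.

```shell
#!/usr/bin/env bash
# run_on_wavs DIR COMMAND [ARGS...]
# Hypothetical helper: run COMMAND once per .wav file below DIR, in parallel.
# File names travel NUL-separated, so spaces and newlines in names are safe.
run_on_wavs() {
    local dir="$1"
    shift
    find "$dir" -type f -name '*.wav' -print0 | parallel -0 "$@"
}

# Example call, equivalent to the flac command above:
#   run_on_wavs /path/to/parent/folder flac -8
```

Since no replacement string is given, parallel appends each file name to the end of the command, just as in the line-based version.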