Reading ZIP files from Hadoop Map/Reduce
Important Update
This post has been obsoleted by my update here: Hadoop: Processing ZIP files in Map/Reduce
The Problem
One of the first use-cases I had for playing with Apache Hadoop involved extracting and parsing the contents of thousands of ZIP files. Hadoop has no built-in reader for ZIP files; it just sees them as opaque binary blobs.
The Solution
To solve the problem I wrote two small utility classes, ZipFileInputFormat.java and ZipFileRecordReader.java. They extend the default FileInputFormat and RecordReader classes to add this new functionality.
Effectively, your Mapper class now receives two parameters:
- Key<Text>: the name of the file within the ZIP
- Value<BytesWritable>: the complete contents of that file as a binary blob
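To make the key/value pairing concrete, here is a minimal, self-contained sketch of what a record reader like ZipFileRecordReader does internally: walk a ZIP stream with java.util.zip and emit one (entry name, entry bytes) pair per file. The class and method names here are illustrative, not the actual code from ZipFileRecordReader.java.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntrySketch {

    // Read every entry of a ZIP stream into (name -> bytes) pairs,
    // mirroring the Text key / BytesWritable value a Mapper would see.
    static Map<String, byte[]> readEntries(byte[] zipBytes) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buf = new byte[4096];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) > 0) {
                    bos.write(buf, 0, n);
                }
                entries.put(entry.getName(), bos.toByteArray());
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Build a small ZIP in memory for demonstration.
        ByteArrayOutputStream zipOut = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipOut)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes());
            zos.closeEntry();
        }
        for (Map.Entry<String, byte[]> e : readEntries(zipOut.toByteArray()).entrySet()) {
            System.out.println(e.getKey() + "=" + new String(e.getValue()));
        }
    }
}
```

In the real record reader, each iteration of the outer loop corresponds to one call of your Mapper's map() method.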
Usage
To use them, you just need to set the InputFormat appropriately on your job:
job.setInputFormatClass(ZipFileInputFormat.class);
You can still use all the sexy features from the FileInputFormat base implementation:
ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));
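Putting those two calls together, a job driver might look like the sketch below. MyMapper, the output types, and the output path are hypothetical placeholders; only the ZipFileInputFormat lines come from this post.

```java
// Hypothetical driver assembling the two ZipFileInputFormat calls from the post.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZipJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "zip-example");
        job.setJarByClass(ZipJobDriver.class);

        // Plug in the custom input format described in this post.
        job.setInputFormatClass(ZipFileInputFormat.class);
        ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));

        // MyMapper (hypothetical) extends Mapper<Text, BytesWritable, Text, Text>,
        // receiving the entry name as the key and the entry bytes as the value.
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is job configuration rather than standalone code, so it only compiles against the Hadoop client libraries and the ZipFileInputFormat class from this post.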
Results and Limitations
The result is that your Mapper class now receives the uncompressed contents of each file within the ZIP. It's worth noting that each ZIP file is assigned to a single Mapper, and that Mapper processes every file within the ZIP; the entries are not redistributed across your cluster.
Related Posts
Hadoop: Processing ZIP files in Map/Reduce
Updated ZipFileInputFormat framework for processing thousands of ZIP files in Hadoop with failure tolerance and comprehensive examples
Consuming Twitter streams from Java
Build a Java utility class to consume Twitter Streaming API data for offline analysis in Hadoop with automatic file segmentation
The problem with Big Data is not the Data
The real problem with Big Data isn't volume—it's knowing what you want to achieve and starting with clear business challenges, not technology.