Reading ZIP files from Hadoop Map/Reduce
Important Update
This post has been obsoleted by my update here: Hadoop: Processing ZIP files in Map/Reduce
The Problem
One of the first use-cases I had for playing with Apache Hadoop involved extracting and parsing the contents of thousands of ZIP files. Hadoop has no built-in reader for ZIP files; it just sees them as opaque binary blobs.
The Solution
To solve the problem I wrote two small utility classes, ZipFileInputFormat.java and ZipFileRecordReader.java. They extend the default FileInputFormat and RecordReader classes to add this new functionality.
Effectively, your Mapper class now receives two parameters:
- Key<Text>: the name of the file within the ZIP
- Value<BytesWritable>: the complete contents of that file as a binary blob
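To make the key/value pairing concrete, here is a minimal, self-contained sketch of what a record reader like ZipFileRecordReader does internally: walk a ZIP stream with java.util.zip and emit one (entry name, entry bytes) pair per file. The class and method names here are illustrative, not the actual code from ZipFileRecordReader.java.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntrySketch {

    // Read every entry of a ZIP stream into (name -> bytes) pairs,
    // mirroring the Text key / BytesWritable value a Mapper would see.
    static Map<String, byte[]> readEntries(byte[] zipBytes) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buf = new byte[4096];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) > 0) {
                    bos.write(buf, 0, n);
                }
                entries.put(entry.getName(), bos.toByteArray());
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Build a small ZIP in memory for demonstration.
        ByteArrayOutputStream zipOut = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipOut)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes());
            zos.closeEntry();
        }
        for (Map.Entry<String, byte[]> e : readEntries(zipOut.toByteArray()).entrySet()) {
            System.out.println(e.getKey() + "=" + new String(e.getValue()));
        }
    }
}
```

In the real record reader, each iteration of the outer loop corresponds to one call of your Mapper's map() method.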
Usage
To use them, you just need to set the InputFormat appropriately on your job:
job.setInputFormatClass(ZipFileInputFormat.class);
You can still use all the sexy features from the FileInputFormat base implementation:
ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));
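Putting those two calls together, a job driver might look like the sketch below. MyMapper, the output types, and the output path are hypothetical placeholders; only the ZipFileInputFormat lines come from this post.

```java
// Hypothetical driver assembling the two ZipFileInputFormat calls from the post.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZipJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "zip-example");
        job.setJarByClass(ZipJobDriver.class);

        // Plug in the custom input format described in this post.
        job.setInputFormatClass(ZipFileInputFormat.class);
        ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));

        // MyMapper (hypothetical) extends Mapper<Text, BytesWritable, Text, Text>,
        // receiving the entry name as the key and the entry bytes as the value.
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is job configuration rather than standalone code, so it only compiles against the Hadoop client libraries and the ZipFileInputFormat class from this post.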
Results and Limitations
The result is that your Mapper class now receives the uncompressed contents of each file within the ZIP. It's worth noting that each ZIP file is assigned to a single Mapper, and that Mapper processes every file within the ZIP; the entries are not redistributed across your cluster.
Related Posts
Hadoop: Processing ZIP files in Map/Reduce
Updated ZipFileInputFormat framework for processing thousands of ZIP files in Hadoop with failure tolerance and comprehensive examples
Consuming Twitter streams from Java
Build a Java utility class to consume Twitter Streaming API data for offline analysis in Hadoop with automatic file segmentation
The problem with Big Data is not the Data
The real problem with Big Data isn't volume—it's knowing what you want to achieve and starting with clear business challenges, not technology.