This post has been obsoleted by my update here: Hadoop: Processing ZIP files in Map/Reduce
One of the first use-cases I had for playing with Apache Hadoop involved extracting and parsing the contents of thousands of ZIP files. Hadoop doesn’t have a built-in reader for ZIP files; it just sees them as binary blobs.
To solve the problem I wrote two small utility classes, ZipFileInputFormat.java and ZipFileRecordReader.java. They extend the default FileInputFormat and RecordReader classes to add this new functionality. Effectively, your Mapper class now receives two parameters: a Text key, which is the name of the file within the ZIP, and a BytesWritable value, which is the complete contents of that file as a binary blob.
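As a sketch of what a consuming Mapper looks like (assuming the org.apache.hadoop.mapreduce API; the class name and the emitted output here are illustrative, not part of the original utilities):

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: the key is the entry name inside the ZIP,
// the value is that entry's uncompressed bytes.
public class ZipContentsMapper
        extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text filename, BytesWritable contents, Context context)
            throws IOException, InterruptedException {
        // getBytes() may return a padded buffer; copy only getLength() bytes.
        byte[] data = new byte[contents.getLength()];
        System.arraycopy(contents.getBytes(), 0, data, 0, contents.getLength());
        // Illustrative output: emit each filename with its uncompressed size.
        context.write(filename, new Text(Integer.toString(data.length)));
    }
}
```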
To use them, you just need to set the InputFormat appropriately on your job:
job.setInputFormatClass(ZipFileInputFormat.class);
You can still use all the sexy features from the FileInputFormat base implementation:
ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));
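Putting the two calls together, a minimal driver might look like the following. The job name, output path, and Mapper class are illustrative assumptions; the output key/value classes depend on what your own Mapper emits:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZipDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "zip-extract");      // Job.getInstance(conf, ...) on newer APIs

        job.setJarByClass(ZipDriver.class);
        job.setInputFormatClass(ZipFileInputFormat.class);
        job.setMapperClass(ZipContentsMapper.class); // hypothetical Mapper name
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Globs work exactly as with the FileInputFormat base class.
        ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));
        FileOutputFormat.setOutputPath(job, new Path("/data/Out")); // illustrative

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```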
The result is that your Mapper class gets the uncompressed contents of each file within the ZIP. It’s worth noting that each ZIP file is assigned to a single Mapper, and that Mapper processes every file within the ZIP – the individual entries are not redistributed across your cluster.
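The extraction the record reader has to perform is essentially what the JDK’s java.util.zip package already provides. The following standalone helper (a hypothetical name, shown over an in-memory byte[] for illustration) sketches that loop: entry names become keys, uncompressed bytes become values, mirroring the Text/BytesWritable pairs handed to the Mapper:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: the same kind of extraction a ZIP record reader performs.
public class ZipExtractor {
    public static Map<String, byte[]> extract(byte[] zipBytes) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                // Drain the current entry's uncompressed bytes.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zis.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                // Entry name -> uncompressed contents, like the key/value pair
                // the Mapper receives.
                entries.put(entry.getName(), buf.toByteArray());
            }
        }
        return entries;
    }
}
```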