
This post has been superseded by my update here: Hadoop: Processing ZIP files in Map/Reduce
One of the first use cases I had for playing with Apache Hadoop involved extracting and parsing the contents of thousands of ZIP files. Hadoop doesn't have a built-in reader for ZIP files; it just sees them as opaque binary blobs.
To solve the problem I wrote two small utility classes, ZipFileInputFormat.java and ZipFileRecordReader.java. They extend the default FileInputFormat and RecordReader classes to add this new functionality. Effectively, your Mapper class now receives two parameters: a Text key, which is the name of the file within the ZIP, and a BytesWritable value, which is the complete contents of that file as a binary blob.
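For example, a Mapper consuming this input might look like the sketch below (the class name and the byte-counting logic are illustrative; only the Text/BytesWritable input types come from the InputFormat):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example: emits the size in bytes of each file inside the ZIP.
// The Text key is the entry's filename, the BytesWritable value its contents.
public class ZipEntryMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
    @Override
    protected void map(Text filename, BytesWritable contents, Context context)
            throws IOException, InterruptedException {
        // getLength() returns the number of valid bytes in the buffer
        context.write(filename, new IntWritable(contents.getLength()));
    }
}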
To use them, you just need to set the InputFormat appropriately on your job:
job.setInputFormatClass(ZipFileInputFormat.class);
You can still use all the sexy features from the FileInputFormat base implementation:
ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));
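Putting both calls together, a minimal driver might look something like this (the job name, the output path handling, and the ZipEntryMapper from the sketch above are all illustrative; Job.getInstance assumes a Hadoop 2.x-style API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZipDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "zip-example");
        job.setJarByClass(ZipDriver.class);
        job.setMapperClass(ZipEntryMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read every ZIP under /data/Test with the custom InputFormat
        job.setInputFormatClass(ZipFileInputFormat.class);
        ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}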
And the result is that your Mapper class now gets the uncompressed contents of each file within the ZIP. It's worth noting that each ZIP file is assigned to a single Mapper, and that Mapper processes every file within the ZIP; the individual entries are not redistributed across your cluster.
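That one-ZIP-per-Mapper behaviour falls out of the InputFormat declaring its input unsplittable. The following is a minimal sketch of the idea, not the exact code of the classes above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ZipFileInputFormat extends FileInputFormat<Text, BytesWritable> {
    // A ZIP archive cannot be decompressed from an arbitrary offset, so each
    // file becomes exactly one split, and therefore exactly one Mapper.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // ZipFileRecordReader walks the archive's entries, emitting one
        // (filename, contents) record per entry to the Mapper.
        return new ZipFileRecordReader();
    }
}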

I would like the ability to cart around SequenceFiles with metadata tags. This would work great if it were able to ignore some files. Would you consider adding a feature for ignoring some files?
Or, can I just copy this to Hadoop/Mahout and fiddle with it? “Accept” or “Reject” lists would both be useful.
Can you please share the client/calling code?
Hello, in response to your feedback I’ve updated the code and written a new post with example implementations.
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/