One of the first use cases I had when playing with Apache Hadoop involved extracting and parsing the contents of thousands of ZIP files.  Hadoop doesn’t have a built-in reader for ZIP files; it just sees them as binary blobs.

To solve the problem I wrote two small utility classes: ZipFileInputFormat and its companion record reader.  They extend the default FileInputFormat and RecordReader classes to add this new functionality.  Effectively, your Mapper class now receives two parameters: Key<Text>, which is the name of the file within the ZIP, and Value<BytesWritable>, which is the complete contents of that file as a binary blob.
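To give a feel for the idea, here is a minimal sketch of the core loop such a record reader might run over one archive, using only java.util.zip from the standard library. It is not the actual utility class: the names ZipEntrySketch, readEntries and sampleZip are made up for illustration, and the real reader hands each pair to the Mapper one at a time rather than collecting them in a map.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class, not part of Hadoop or of the utility classes.
public class ZipEntrySketch {

    /** Reads every file in a ZIP stream into (entry name -> uncompressed bytes). */
    public static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> records = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no content
                }
                // Drain the current entry; read() returns -1 at the end of the entry.
                ByteArrayOutputStream contents = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    contents.write(buffer, 0, n);
                }
                records.put(entry.getName(), contents.toByteArray());
            }
        }
        return records;
    }

    /** Builds a small in-memory ZIP so the sketch runs without any files on disk. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("dir/b.txt"));
            zip.write("hadoop".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> records = readEntries(new ByteArrayInputStream(sampleZip()));
        for (Map.Entry<String, byte[]> record : records.entrySet()) {
            System.out.println(record.getKey() + " -> " + record.getValue().length + " bytes");
        }
    }
}
```

Each (name, bytes) pair here corresponds to one key/value record in the Hadoop version. Note that every entry is held fully in memory, so very large files inside an archive would need a different approach.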

To use them, you just need to set the InputFormat appropriately on your job:
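A minimal sketch of that call, assuming `job` is a new-API org.apache.hadoop.mapreduce.Job set up in your driver and that ZipFileInputFormat is on the classpath (both are assumptions here, and the rest of the driver is omitted):

```java
// Fragment from a job driver, not a complete program.
// Assumes `job` is an org.apache.hadoop.mapreduce.Job created earlier.
job.setInputFormatClass(ZipFileInputFormat.class);
```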


You can still use all the sexy features from the FileInputFormat base implementation:

ZipFileInputFormat.setInputPaths(job, new Path("/data/Test/*"));

And the result is that your Mapper class now gets the uncompressed contents of each file within the ZIP.  It’s worth noting that each ZIP file is assigned to a single Mapper, and that Mapper processes every file within the ZIP in turn – the files are not redistributed across your cluster.


  1. I would like the ability to cart around SequenceFiles with metadata tags. This would work great if it was able to ignore some files. Would you consider adding a feature for ignoring some files?
    Or, can I just copy this to Hadoop/Mahout and fiddle with it? “Accept” or “Reject” lists would both be useful.

