Category: Hadoop

  • Hadoop: Processing ZIP files in Map/Reduce

    By popular request, I’ve updated my simple framework for processing ZIP files in Hadoop Map/Reduce jobs. Previously the only easy solution was to unzip files locally and then upload them to the Hadoop Distributed File System (HDFS) for processing. This adds a lot of unnecessary complexity when you are dealing with thousands of ZIP files; Java…

  • Consuming Twitter streams from Java

    A while ago I was playing with the Twitter Streaming API, and one of the first things I wanted to do was collect a lot of data for off-line analysis (in Hadoop). I wrote a hacky little utility class called TwitterConsumer.java that did just the trick. Basically you just initialise it with a valid Twitter account…

  • Reading ZIP files from Hadoop Map/Reduce

    This post has been obsoleted by my update here: Hadoop: Processing ZIP files in Map/Reduce. One of the first use-cases I had for playing with Apache Hadoop involved extracting and parsing the contents of thousands of ZIP files. Hadoop doesn’t have a built-in reader for ZIP files; it just sees them as binary blobs. To solve…