Consuming Twitter streams from Java

Background
A while ago I was playing with the Twitter Streaming API. One of the first things I wanted to do was collect a lot of data for off-line analysis (in Hadoop).
The Solution
I wrote a hacky little utility class called TwitterConsumer.java that did just the trick. You initialise it with a valid Twitter account (username/password) and give it the URL of the stream you would like to consume. This can be any of the generic sample streams or a more sophisticated filter-based stream.
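For readers who want a feel for what such a class does internally, here is a minimal sketch of the connection step. It is not the original TwitterConsumer.java: the class name StreamReaderSketch and its methods are hypothetical, and it assumes the endpoint accepts HTTP Basic authentication (which the username/password constructor suggests) and delivers one JSON message per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.function.Consumer;

// Hypothetical sketch, not the original TwitterConsumer implementation.
class StreamReaderSketch {

    // Builds the value of an HTTP Basic "Authorization" header.
    static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Opens the stream and hands every non-empty line to the callback.
    // Assumption: the server sends one JSON message per line and keeps
    // the connection open until either side closes it.
    static void consume(String streamUrl, String username, String password,
                        Consumer<String> onLine) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(streamUrl).toURL().openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(username, password));
        conn.setReadTimeout(90_000); // give up if the stream stalls for 90 seconds
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) { // skip blank keep-alive lines
                    onLine.accept(line);
                }
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would pass the stream URL, the credentials, and a callback such as `System.out::println`, or one that writes each line to disk.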
Usage
TwitterConsumer t = new TwitterConsumer("username", "password", "http://stream.twitter.com/1/statuses/sample.json", "sample");
t.start();
Results
The result is that you get the sample stream (~1% of everything said on Twitter) written out into a series of files called sample-<timestamp>.json. I have segmented them at 64MB boundaries for convenience when storing them in HDFS.
At the sample stream's usual rate, you get a new 64MB file every 45-60 minutes.
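The 64MB segmentation could be handled by a small rolling writer along these lines. This is a minimal sketch under stated assumptions, not the original code: the class name RollingJsonWriter, its constructor parameters, and the choice to never split a message across two files are all hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: writes lines to <prefix>-<timestamp>.json files
// and starts a new file before the size limit would be exceeded.
class RollingJsonWriter implements AutoCloseable {
    private final Path dir;
    private final String prefix;
    private final long maxBytes;
    private OutputStream out;   // current segment, null until the first write
    private long written;       // bytes written to the current segment
    private long lastTimestamp; // keeps file names unique within one millisecond

    RollingJsonWriter(Path dir, String prefix, long maxBytes) {
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytes = maxBytes;
    }

    // Appends one message plus a newline, rolling to a new file first
    // if this message would push the current file past maxBytes.
    void writeLine(String json) throws IOException {
        byte[] bytes = (json + "\n").getBytes(StandardCharsets.UTF_8);
        if (out == null || (written > 0 && written + bytes.length > maxBytes)) {
            roll();
        }
        out.write(bytes);
        written += bytes.length;
    }

    // Closes the current segment (if any) and opens a fresh one.
    private void roll() throws IOException {
        close();
        lastTimestamp = Math.max(System.currentTimeMillis(), lastTimestamp + 1);
        Path file = dir.resolve(prefix + "-" + lastTimestamp + ".json");
        out = new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.CREATE_NEW));
        written = 0;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
            out = null;
        }
    }
}
```

With a limit of `64L * 1024 * 1024` and the prefix "sample", this produces files named like the ones described above. Rolling on whole-message boundaries keeps every file parseable on its own, at the cost of segments that end slightly under the limit.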