Consuming Twitter streams from Java

twitter fail image (Photo credit: Wikipedia)

A while ago I was playing with the Twitter Streaming API. One of the first things I wanted to do was collect a lot of data for offline analysis (in Hadoop), so I wrote a hacky little utility class called TwitterConsumer that did just the trick.

Basically, you initialise it with the credentials of a valid Twitter account (username and password) and give it the URL of the stream you would like to consume. This could be any of the generic sample streams or a more sophisticated filter-based stream.

TwitterConsumer t = new TwitterConsumer("username", "password", "", "sample");

The result is that you get the sample stream (~1% of everything said on Twitter) written out into a series of files called sample-<timestamp>.json. I have segmented them at 64MB boundaries, which is convenient for storing them in HDFS. At the usual rate of the sample stream, you get a new 64MB file every 45-60 minutes.
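
For anyone curious how the segmenting can work, here is a minimal sketch of the idea. It is not the original TwitterConsumer code: the class name SegmentedStreamWriter, its constructor parameters and the consume method are all made up for illustration, and it assumes the stream delivers one JSON document per line. It reads lines from any InputStream and starts a new <prefix>-<timestamp>.json file whenever the next line would push the current file past the size limit.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not the original utility): copies a line-delimited stream into size-capped files. */
final class SegmentedStreamWriter {
    private final Path dir;
    private final String prefix;
    private final long maxBytes;
    private int filesWritten = 0;

    SegmentedStreamWriter(Path dir, String prefix, long maxBytes) {
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytes = maxBytes;
    }

    /** Reads lines until the stream ends and returns the number of files written. */
    int consume(InputStream in) throws IOException {
        BufferedWriter out = null;
        long written = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines, which a stream may send as keep-alives
                }
                long size = line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
                // Roll to a new file before the limit is exceeded, so a line is never split.
                if (out == null || written + size > maxBytes) {
                    if (out != null) {
                        out.close();
                    }
                    out = Files.newBufferedWriter(nextFile(), StandardCharsets.UTF_8);
                    written = 0;
                }
                out.write(line);
                out.write('\n');
                written += size;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return filesWritten;
    }

    /** Picks prefix-<millis>.json, bumping the number if that name is already taken. */
    private Path nextFile() {
        long stamp = System.currentTimeMillis();
        Path candidate;
        do {
            candidate = dir.resolve(prefix + "-" + stamp++ + ".json");
        } while (Files.exists(candidate));
        filesWritten++;
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        // Demo with an in-memory stream and a tiny limit; a real run would pass
        // the HTTP response stream and 64L * 1024 * 1024 instead.
        Path dir = Files.createTempDirectory("segments");
        byte[] data = "{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n".getBytes(StandardCharsets.UTF_8);
        int files = new SegmentedStreamWriter(dir, "sample", 20)
                .consume(new ByteArrayInputStream(data));
        System.out.println(files + " file(s) written to " + dir);
    }
}
```

Rolling over before a line is written, rather than after the limit has been crossed, keeps every JSON document whole within a single file, at the cost of files that come in slightly under the limit.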


2 responses to “Consuming Twitter streams from Java”

  1. How can I filter twitter firehose for all tweets containing $TICKER?…

    Technically speaking, “Gnip” and “DataSift” are the only companies licensed to resell the Twitter firehose data. Both have filtering and tracking functionality, and both charge for tweets at the same price ($0.10 per 1000 tweets). In…

  2. Edgar

    Awesome! Thanks
