Consuming Twitter streams from Java


A while ago I was playing with the Twitter Streaming API, and one of the first things I wanted to do was collect a lot of data for offline analysis (in Hadoop). I wrote a hacky little utility class called TwitterConsumer.java that did just the trick.

Basically you just initialise it with a valid Twitter account (username/password) and give it the URL of the stream you would like to consume. This can be any of the generic sample streams or a more sophisticated filter-based stream.

TwitterConsumer t = new TwitterConsumer("username", "password", "http://stream.twitter.com/1/statuses/sample.json", "sample");
t.start();

The result is that you get the sample stream (~1% of everything said on Twitter) written out into a series of files called sample-<timestamp>.json. I have segmented them at 64MB boundaries, which is convenient for storing in HDFS. Generally, the rate of the sample stream means you get a new 64MB file every 45-60 minutes.
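For anyone curious how this kind of consumer can be put together, here is a minimal sketch. It is not the actual TwitterConsumer.java. The class name RollingStreamSketch and its methods are hypothetical, and it assumes an endpoint that accepts HTTP Basic authentication and emits one JSON message per line, with blank lines treated as keep-alives.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical sketch of a rolling stream consumer (not the original utility class).
class RollingStreamSketch {

    // Builds the value of an HTTP Basic "Authorization" header.
    static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Opens the stream URL. Assumes the server accepts Basic auth.
    static InputStream openStream(String url, String username, String password)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(username, password));
        return conn.getInputStream();
    }

    // Picks an unused file name of the form <prefix>-<timestamp>.json.
    static Path nextFile(Path dir, String prefix) {
        long ts = System.currentTimeMillis();
        Path p = dir.resolve(prefix + "-" + ts + ".json");
        while (Files.exists(p)) {
            ts++;
            p = dir.resolve(prefix + "-" + ts + ".json");
        }
        return p;
    }

    // Copies newline-delimited messages into rolling files and returns the
    // number of files written. A new file is started before a message once
    // the current file has reached maxBytes, so messages are never split.
    static int consume(InputStream in, Path dir, String prefix, long maxBytes)
            throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        OutputStream out = null;
        long written = 0;
        int files = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank keep-alive lines
                }
                if (out == null || written >= maxBytes) {
                    if (out != null) {
                        out.close();
                    }
                    out = new BufferedOutputStream(
                            Files.newOutputStream(nextFile(dir, prefix)));
                    written = 0;
                    files++;
                }
                byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
                out.write(bytes);
                out.write('\n');
                written += bytes.length + 1;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return files;
    }
}
```

Passing the result of openStream(...) to consume(...) with the prefix "sample" and a limit of 64L * 1024 * 1024 would mirror the segmentation described above. Because the roll happens between messages, a file can overshoot the limit by up to one message.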


Comments

2 responses to “Consuming Twitter streams from Java”

  1. How can I filter twitter firehose for all tweets containing $TICKER?…

    Technically speaking, “Gnip” (www.gnip.com) and “DataSift” are the only companies licensed to resell the Twitter firehose data. Both have filtering and tracking functionality, and both charge for tweets at the same price ($0.10 per 1000 tweets). In…

  2. Edgar

    Awesome! Thanks
