Consuming Twitter streams from Java
Background
A while ago I was playing with the Twitter Streaming API. One of the first things I wanted to do was collect a large volume of data for off-line analysis (in Hadoop).
The Solution
I wrote a hacky little utility class called TwitterConsumer.java that did the trick. You initialise it with a valid Twitter account (username/password) and the URL of the stream you would like to consume; this can be any of the generic sample streams or a more sophisticated filter-based stream.
Usage
// username, password, stream URL, output file prefix
TwitterConsumer t = new TwitterConsumer("username", "password", "http://stream.twitter.com/1/statuses/sample.json", "sample");
t.start();
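The original TwitterConsumer.java isn't reproduced here, but a minimal sketch of such a consumer might look like the following: connect with HTTP basic auth, read the stream line by line, and roll the output file at a 64 MB boundary. The helper method names (outputFileName, shouldRoll) and internal structure are assumptions, not the original implementation.

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TwitterConsumer extends Thread {
    // 64 MB boundary, convenient for storing in HDFS
    private static final long MAX_FILE_SIZE = 64L * 1024 * 1024;

    private final String username;
    private final String password;
    private final String streamUrl;
    private final String filePrefix;

    public TwitterConsumer(String username, String password,
                           String streamUrl, String filePrefix) {
        this.username = username;
        this.password = password;
        this.streamUrl = streamUrl;
        this.filePrefix = filePrefix;
    }

    // Builds names of the form <prefix>-<timestamp>.json
    static String outputFileName(String prefix, long timestampMillis) {
        return prefix + "-" + timestampMillis + ".json";
    }

    // Roll to a new file once the current one reaches the 64 MB boundary
    static boolean shouldRoll(long bytesWritten) {
        return bytesWritten >= MAX_FILE_SIZE;
    }

    @Override
    public void run() {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(streamUrl).openConnection();
            String auth = Base64.getEncoder().encodeToString(
                    (username + ":" + password).getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));

            Writer out = null;
            long bytesWritten = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null || shouldRoll(bytesWritten)) {
                    if (out != null) out.close();
                    out = new BufferedWriter(new FileWriter(
                            outputFileName(filePrefix, System.currentTimeMillis())));
                    bytesWritten = 0;
                }
                out.write(line);
                out.write('\n');
                bytesWritten += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
            if (out != null) out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

Extending Thread keeps the usage above (construct, then start()) working; each consumer runs on its own thread so you can tail several streams at once.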
Results
The result is that you get the sample stream (roughly 1% of everything said on Twitter) written out as a series of files named sample-<timestamp>.json. I segment them at 64MB boundaries for convenient storage in HDFS.
At the sample stream's typical rate, you get a new 64MB file every 45-60 minutes.
Related Posts
Hadoop: Processing ZIP files in Map/Reduce
Updated ZipFileInputFormat framework for processing thousands of ZIP files in Hadoop with failure tolerance and comprehensive examples
Reading ZIP files from Hadoop Map/Reduce
Custom utility classes to extract and parse ZIP file contents in Hadoop MapReduce jobs using ZipFileInputFormat and ZipFileRecordReader
Social TV is Dead?
Despite claims that Social TV is dead, data from 486,659 Zeebox tweets and 4.3M Miso tweets reveals a more complex reality in the second-screen battle.