Consuming Twitter streams from Java
Background
A while ago I was playing with the Twitter Streaming API. One of the first things I wanted to do was collect a large volume of data for off-line analysis (in Hadoop).
The Solution
I wrote a hacky little utility class called TwitterConsumer.java that did the trick. You initialise it with a valid Twitter account (username/password) and the URL of the stream you would like to consume; this can be any of the generic sample streams or a more sophisticated filter-based stream.
Usage
// username, password, stream URL, output file prefix
TwitterConsumer t = new TwitterConsumer("username", "password", "http://stream.twitter.com/1/statuses/sample.json", "sample");
t.start();
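The original TwitterConsumer.java isn't reproduced here, but a minimal sketch of such a consumer might look like the following: connect with HTTP basic auth, read the stream line by line, and roll the output file at a 64 MB boundary. The helper method names (outputFileName, shouldRoll) and internal structure are assumptions, not the original implementation.

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TwitterConsumer extends Thread {
    // 64 MB boundary, convenient for storing in HDFS
    private static final long MAX_FILE_SIZE = 64L * 1024 * 1024;

    private final String username;
    private final String password;
    private final String streamUrl;
    private final String filePrefix;

    public TwitterConsumer(String username, String password,
                           String streamUrl, String filePrefix) {
        this.username = username;
        this.password = password;
        this.streamUrl = streamUrl;
        this.filePrefix = filePrefix;
    }

    // Builds names of the form <prefix>-<timestamp>.json
    static String outputFileName(String prefix, long timestampMillis) {
        return prefix + "-" + timestampMillis + ".json";
    }

    // Roll to a new file once the current one reaches the 64 MB boundary
    static boolean shouldRoll(long bytesWritten) {
        return bytesWritten >= MAX_FILE_SIZE;
    }

    @Override
    public void run() {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(streamUrl).openConnection();
            String auth = Base64.getEncoder().encodeToString(
                    (username + ":" + password).getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));

            Writer out = null;
            long bytesWritten = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null || shouldRoll(bytesWritten)) {
                    if (out != null) out.close();
                    out = new BufferedWriter(new FileWriter(
                            outputFileName(filePrefix, System.currentTimeMillis())));
                    bytesWritten = 0;
                }
                out.write(line);
                out.write('\n');
                bytesWritten += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
            if (out != null) out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

Extending Thread keeps the usage above (construct, then start()) working; each consumer runs on its own thread so you can tail several streams at once.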
Results
The result is that you get the sample stream (roughly 1% of everything said on Twitter) written out as a series of files named sample-<timestamp>.json. I segment them at 64MB boundaries for convenient storage in HDFS.
At the sample stream's typical rate, you get a new 64MB file every 45-60 minutes.
Related Posts
Hadoop: Processing ZIP files in Map/Reduce
Updated ZipFileInputFormat framework for processing thousands of ZIP files in Hadoop with failure tolerance and comprehensive examples
Reading ZIP files from Hadoop Map/Reduce
Custom utility classes to extract and parse ZIP file contents in Hadoop MapReduce jobs using ZipFileInputFormat and ZipFileRecordReader
Social TV is Dead?
Despite claims that Social TV is dead, data from 486,659 Zeebox tweets and 4.3M Miso tweets reveals a more complex reality in the second-screen battle.