What Netflix knows about you and why it's a lesson to others...
Math.random()
), or simply may not capture it in a useful form. So when it comes to connecting these data points back to something that can measure the success of the recommendations, it gets a little tricky. "No problem!" they might say, "we can always conduct (subjective) user satisfaction surveys." It doesn't matter if you are serving video content, news articles or adverts: this is known (in the biz) as #FAIL. If you do nothing else right, you absolutely must capture data about what was served to customers.

Measuring user activity is key

So your swanky recommendation system is surfacing what you think is relevant content, but how do you determine if the user even saw your suggestions, let alone interacted with them? This is the interesting part, and I find it fascinating that so many implementations only appear to measure click-throughs or purchases. The fact of the matter is that if the recommended content was in front of the user, they've already interacted with it in an implicit way; you just need to figure it out.
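To make the point about recording what was served a little more concrete, here is a minimal Python sketch of what becomes possible once impressions are logged alongside clicks. Everything in it is an assumption for illustration: the `Impression` schema, the `click_through_by_rank` helper and the sample data are made up, with field names loosely modelled on the Netflix ping shown further down. It is not how Netflix or any particular product does it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Impression:
    """One recommendation that was actually put in front of a user (hypothetical schema)."""
    request_id: str
    video_id: int
    row: int
    rank: int


def click_through_by_rank(impressions, clicked):
    """Return {rank: clicks / impressions} for every rank that was served.

    `clicked` is a set of (request_id, video_id) pairs the user acted on.
    """
    shown = Counter()
    hits = Counter()
    for imp in impressions:
        shown[imp.rank] += 1
        if (imp.request_id, imp.video_id) in clicked:
            hits[imp.rank] += 1
    return {rank: hits[rank] / shown[rank] for rank in shown}


# Made-up data: two page loads, three slots each, one click per load.
served = [
    Impression("req-1", 101, 0, 0),
    Impression("req-1", 102, 0, 1),
    Impression("req-1", 103, 0, 2),
    Impression("req-2", 104, 0, 0),
    Impression("req-2", 101, 0, 1),
    Impression("req-2", 105, 0, 2),
]
clicks = {("req-1", 101), ("req-2", 101)}

rates = click_through_by_rank(served, clicks)
print(rates)
```

Without the `served` list there is no denominator: you can count clicks, but you cannot say how often a slot was shown and ignored, which is exactly the signal that is lost when only click-throughs are recorded.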
esn=null&country=GB&application_name=MERCHWEB&application_v=1.2&data=[
{"time":1340138780754,"request_id":"4fc...019","video_id":70095139,"track_id":50000085,"row":0,"rank":6,"location":"WATCHNOW"},
{"time":1340138780754,"request_id":"4fc...019","video_id":60020817,"track_id":50000085,"row":0,"rank":7,"location":"WATCHNOW"},
{"time":1340138780754,"request_id":"4fc...019","video_id":70139996,"track_id":50000085,"row":0,"rank":8,"location":"WATCHNOW"},
{"time":1340138780754,"request_id":"4fc...019","video_id":70213047,"track_id":50000085,"row":0,"rank":9,"location":"WATCHNOW"},
{"time":1340138780754,"request_id":"4fc...019","video_id":70102778,"track_id":50000085,"row":0,"rank":0,"location":"WATCHNOW"},
{"time":1340138780754,"request_id":"4fc...019","video_id":70105820,"track_id":50000085,"row":0,"rank":1,"location":"WATCHNOW"}]
This is an example of the ping sent from my browser a moment ago; it's quite clear to see 'video_id', 'row' and 'rank' going back to be processed and recorded by Netflix. Just to reiterate this point, I haven't clicked any mouse button. I simply hovered over the right/left arrows to scroll around the UI. This sort of activity wouldn't be captured by clickstream or web logs; it has to be engineered into the user interface.

Implicit user activity data is King

If all you are capturing is click-throughs and transactions, you are missing most of the data. Going a step beyond the scenario illustrated above, say I've browsed around a few genres, picked something and clicked 'Watch Now'. Is that the end of the story? No! The next step up in data collection is capturing how users interact with your content; I will stick with the Netflix example for now. Most web browsers, smartphones and application platforms allow you to tightly control and monitor their video playback widgets. You can capture details like:
- Is the content playing, or is it paused?
- How far through the content are they?
- Is the video widget having buffering problems?
- Is the video widget playing anything at all?
- Is the user fast-forwarding through sections of the content?
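As a rough sketch of how the signals in the list above might be boiled down once the player reports them, the Python below reduces a stream of playback heartbeats to a few session-level measures. The `PlaybackEvent` schema, the `summarise` function and the `skip_threshold` parameter are all hypothetical names invented for this example; a real player would expose its own events and fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackEvent:
    """A heartbeat reported by the video player (hypothetical schema)."""
    wall_clock: float  # seconds since the session started
    state: str         # "playing", "paused" or "buffering"
    position: float    # seconds into the content


def summarise(events, skip_threshold=5.0):
    """Reduce time-ordered heartbeats to a handful of session-level measures."""
    watched = 0.0
    buffering = 0.0
    forward_skips = 0
    for prev, cur in zip(events, events[1:]):
        elapsed = cur.wall_clock - prev.wall_clock
        moved = cur.position - prev.position
        if prev.state == "playing":
            watched += elapsed
            # Position advancing much faster than the clock suggests
            # a fast-forward or a seek rather than normal playback.
            if moved - elapsed > skip_threshold:
                forward_skips += 1
        elif prev.state == "buffering":
            buffering += elapsed
    furthest = max((e.position for e in events), default=0.0)
    return {
        "watched_s": watched,
        "buffering_s": buffering,
        "forward_skips": forward_skips,
        "furthest_position_s": furthest,
    }


# Made-up session: play, stall, skip ahead, then pause.
session = [
    PlaybackEvent(0, "playing", 0),
    PlaybackEvent(10, "playing", 10),
    PlaybackEvent(20, "buffering", 20),
    PlaybackEvent(25, "playing", 20),
    PlaybackEvent(35, "playing", 90),
    PlaybackEvent(45, "paused", 100),
    PlaybackEvent(60, "paused", 100),
]

summary = summarise(session)
print(summary)
```

The design choice here is to have the client send small, regular heartbeats and do the interpretation afterwards, so the definition of "watched" or "skipped" can change later without shipping a new player.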
Again, most video content on the internet is delivered over Content Delivery Networks (CDNs). Inferring this sort of activity from CDN logs is not impossible, but it's right up there with cold fusion and faster-than-light space travel. It makes far more sense to build this measurement into the application, be it a web-based application, smartphone app or some monolithic Flash/Silverlight application.

Holy cr@p, we've now got 200TB of implicit activity data!?!?!

Firstly, congratulations! You've overcome the first hurdles and are now merrily collecting a potential gold mine of data. Now comes the hard part: making sense of it all! This is the area where you need a bit of a head for statistics, and a set of tools that help you 'discover' trends in behaviours, build 'models' based on the captured data and start to make 'predictions'. This isn't a particularly new field; it's existed in one form or another for decades. At the moment the people capable of doing these things are gravitating around the name "Data Scientists".

So what is a Data Scientist?

The best description I've seen comes from @josh_wills (Data Scientist @ Cloudera):