There has recently been a flurry of articles about Netflix and how they are analysing user behaviour and habits to enhance the recommendations they make. Last week Mohammad Sabah (Senior Data Scientist @ Netflix) presented at Hadoop Summit in San Jose, describing some of the ways they do this and the impact of making good recommendations – measurable customer satisfaction.
The underlying principle is beautifully simple – capture data about ‘What’ you offer your users, and ‘How’ they interact with it, then you can infer ‘Why’ things work or don’t work.
Miss out any of these steps and you’re fundamentally doing it wrong. This is likely the first hurdle in any organisation that was not born in the internet age and is not already wholly reliant on data-driven decisions.
Capturing data *can* be hard too
These days it’s not uncommon for large enterprises to buy in services to make their lives easier. After all they already have a *real* business, this internet stuff is just a ‘fad’. So naturally, in doing so, large enterprises may not know what their 3rd-party recommendation system has suggested to their customers.
Indeed the 3rd-party may not know either (Math.random()), or simply may not capture it in a useful form. Therefore when it comes to connecting these data-points back to something that can measure the success of the recommendations – it gets a little tricky. “No problem!” they might say, we can always conduct (subjective) user satisfaction surveys.
It doesn’t matter if you are serving video content, news articles or adverts .. this is known (in the biz) as #FAIL
If you do nothing else right, you absolutely must capture data about what was *served* to customers.
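To make that concrete, here is a minimal sketch in Python of what "capturing what was served" might look like. The schema is entirely hypothetical (the names `ServedImpression`, `request_id`, `row`, `rank` and so on are my own assumptions, not any vendor's format); the only point is that every item put in front of a user gets its own record, with enough context to join it to activity data later.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ServedImpression:
    """One recommended item as it was shown to one user (hypothetical schema)."""
    request_id: str    # ties together everything served in one response
    user_id: str
    item_id: int
    row: int           # which row/shelf of the page
    rank: int          # position within the row
    served_at_ms: int  # when the recommendations were served


def log_served(request_id, user_id, rows, now_ms=None):
    """Turn a page of recommendations into one JSON line per item served.

    `rows` is a list of rows, each a list of item ids in display order.
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    lines = []
    for row_index, items in enumerate(rows):
        for rank, item_id in enumerate(items):
            record = ServedImpression(
                request_id, user_id, item_id, row_index, rank, now_ms
            )
            lines.append(json.dumps(asdict(record), sort_keys=True))
    return lines


# Two rows of recommendations served in a single response.
for line in log_served("req-1", "user-42", [[101, 102], [201]]):
    print(line)
```

Where those lines end up (a file, a queue, a table) matters far less than the fact that they exist at all.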
Measuring user activity is key
So your swanky recommendation system is surfacing what you think is relevant content, but how do you determine if the user even saw your suggestions, let alone interacted with them? This is the interesting part, and I find it fascinating that so many implementations only appear to measure click-throughs or purchases. The fact of the matter is that if the recommended content was in front of the user, they’ve already interacted with it in an implicit way – you just need to figure it out.
Take for example the above screenshot, from my own Netflix account. Netflix have around 60,000 pieces of content (movies, TV programmes etc.) available to me here in the UK. Out of all that content they’ve distilled it down to just a “Top 10” to recommend to me. Cunningly, I can only see 4-5 on the screen; I have to scroll to the right/left to see the other choices. It is the same with the Genres: they might surface 50 movies in the Thriller genre, and I as a user interact with the UI to browse the available titles.
I’m already subconsciously interacting with their recommendations, indeed the first 4 recommendations above were “meh” enough that I didn’t even mouse-over them, I just scrolled to the right to see what else was on offer. This is an implicit interaction with these recommendations. Netflix know this and if you watch the traffic from your Web Browser you will see discrete ‘pings’ going back to ‘presentationtracking.netflix.com’ that describe what row of content was being browsed and how many places I went to the right/left.
esn=null&country=GB&application_name=MERCHWEB&application_v=1.2&data=[ {"time":1340138780754,"request_id":"4fc...019","video_id":70095139,"track_id":50000085,"row":0,"rank":6,"location":"WATCHNOW"}, {"time":1340138780754,"request_id":"4fc...019","video_id":60020817,"track_id":50000085,"row":0,"rank":7,"location":"WATCHNOW"}, {"time":1340138780754,"request_id":"4fc...019","video_id":70139996,"track_id":50000085,"row":0,"rank":8,"location":"WATCHNOW"}, {"time":1340138780754,"request_id":"4fc...019","video_id":70213047,"track_id":50000085,"row":0,"rank":9,"location":"WATCHNOW"}, {"time":1340138780754,"request_id":"4fc...019","video_id":70102778,"track_id":50000085,"row":0,"rank":0,"location":"WATCHNOW"}, {"time":1340138780754,"request_id":"4fc...019","video_id":70105820,"track_id":50000085,"row":0,"rank":1,"location":"WATCHNOW"}]
This is an example of the ping sent from my browser a moment ago. It’s quite clear to see ‘video_id’, ‘row’ and ‘rank’ going back to be processed and recorded by Netflix.
Just to re-iterate this point, I haven’t clicked any mouse button – I simply hovered over the right/left arrows to scroll around the UI. This sort of activity wouldn’t be captured by Clickstream or Web Logs, it has to be engineered into the User Interface.
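For the curious, a ping like the one above is easy enough to pick apart. The following is a rough Python sketch that assumes the body is an ordinary form-style string with a JSON array in its `data` field, which is how it appears in my browser; I have no knowledge of how Netflix actually process it. The helper names `parse_ping` and `seen_positions` are my own.

```python
import json
from urllib.parse import parse_qs

# Shortened copy of the ping body shown above (two events instead of six).
# The elided request_id ("4fc...019") is kept exactly as it appears in the post.
SAMPLE_BODY = (
    'esn=null&country=GB&application_name=MERCHWEB&application_v=1.2&data=['
    '{"time":1340138780754,"request_id":"4fc...019","video_id":70095139,'
    '"track_id":50000085,"row":0,"rank":6,"location":"WATCHNOW"},'
    '{"time":1340138780754,"request_id":"4fc...019","video_id":60020817,'
    '"track_id":50000085,"row":0,"rank":7,"location":"WATCHNOW"}]'
)


def parse_ping(body):
    """Split the form-style body into its fields and decode the JSON `data` array."""
    # parse_qs returns a list per key; each key appears once here, so take the first.
    fields = {key: values[0] for key, values in parse_qs(body).items()}
    events = json.loads(fields.pop("data"))
    return fields, events


def seen_positions(events):
    """Return (row, rank, video_id) tuples in the order they were reported."""
    return [(e["row"], e["rank"], e["video_id"]) for e in events]


fields, events = parse_ping(SAMPLE_BODY)
print(fields["country"], seen_positions(events))
# prints: GB [(0, 6, 70095139), (0, 7, 60020817)]
```

On the wire the body would normally be percent-encoded; `parse_qs` decodes that as well, so the same function should cope with either form.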
Implicit user activity data is King
If all you are capturing is click-throughs and transactions – you are missing *most* of the data. Going a step beyond the scenario illustrated above, say I’ve browsed around a few genres, picked something and clicked ‘Watch Now’. Is that the end of the story? No!
The next step up in data collection is capturing how users interact with your content – I will stick with the Netflix example for now. Most web browsers, smart phones and application platforms allow you to tightly control and monitor their video playback widgets. You can capture details like:
- Is the content playing, or is it paused?
- How far through the content are they?
- Is the video widget having buffering problems?
- Is the video widget playing anything at all?
- Is the user fast-forwarding through sections of the content?
Again, most video content on the internet is delivered over Content Delivery Networks (CDN). Inferring this sort of activity from CDN logs is not impossible, but it’s right up there with Cold Fusion and Faster-than-Light space travel.
It makes far more sense to build this measurement into the application, be it a web-based application, smart phone app or some monolithic Flash/Silverlight application.
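As a rough illustration of why application-level measurement is so much easier to work with, suppose the player reports a small "heartbeat" every few seconds. The event shape below (`t`, `position`, `state`) is purely an assumption of mine, not any real player's API, and the fast-forward rule is a simple heuristic: if the playback position moved on by noticeably more than the clock did, the user skipped ahead.

```python
from collections import defaultdict


def summarise_playback(heartbeats, seek_tolerance=2.0):
    """Summarise a viewing session from player heartbeats (hypothetical event shape).

    Each heartbeat is a dict with:
      t        - wall-clock time in seconds
      position - playback position in seconds
      state    - "playing", "paused" or "buffering"
    """
    seconds_in_state = defaultdict(float)
    forward_seeks = 0

    ordered = sorted(heartbeats, key=lambda h: h["t"])
    for prev, cur in zip(ordered, ordered[1:]):
        elapsed = cur["t"] - prev["t"]
        # Attribute the gap to the state the player reported at its start.
        seconds_in_state[prev["state"]] += elapsed
        # Position advancing by much more than the clock implies a skip ahead.
        if cur["position"] - prev["position"] > elapsed + seek_tolerance:
            forward_seeks += 1

    furthest = max((h["position"] for h in ordered), default=0)

    return {
        "seconds_in_state": dict(seconds_in_state),
        "forward_seeks": forward_seeks,
        "furthest_position": furthest,
    }


session = [
    {"t": 0, "position": 0, "state": "playing"},
    {"t": 10, "position": 10, "state": "buffering"},
    {"t": 15, "position": 10, "state": "playing"},
    {"t": 25, "position": 300, "state": "paused"},   # jumped ahead ~5 minutes
    {"t": 40, "position": 300, "state": "playing"},
]
print(summarise_playback(session))
```

Five tiny events already answer most of the questions in the list above; good luck getting the same out of a CDN access log.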
Holy cr@p, we’ve now got 200TB of implicit activity data!?!?!
Firstly, congratulations! You’ve overcome the first hurdles and are now merrily collecting a potential gold-mine of data. Now comes the hard part, making sense of it all!
This is the area where you need a bit of a head for statistics, and a set of tools that help you ‘discover’ trends in behaviours, build ‘models’ based on the captured data and start to make ‘predictions’. This isn’t a particularly new field, it’s existed in one form or another for decades. At the moment the skills and people capable of doing these things are gravitating around the name “Data Scientists”.
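To give a flavour of the very first rung on that ladder, here is a deliberately simple Python sketch: join what was served with what was engaged with, and ask how engagement varies by position in the row. The tuple layouts and the function name `engagement_by_rank` are assumptions for illustration only, and this is a descriptive statistic, not a model.

```python
from collections import Counter


def engagement_by_rank(served, engaged):
    """Share of served impressions at each rank that led to any engagement.

    `served`  - iterable of (request_id, item_id, rank) tuples
    `engaged` - iterable of (request_id, item_id) pairs the user acted on
    """
    engaged_keys = set(engaged)
    shown = Counter()
    hits = Counter()
    for request_id, item_id, rank in served:
        shown[rank] += 1
        if (request_id, item_id) in engaged_keys:
            hits[rank] += 1
    return {rank: hits[rank] / shown[rank] for rank in sorted(shown)}


served = [("r1", 1, 0), ("r1", 2, 1), ("r2", 1, 0), ("r2", 3, 1)]
engaged = [("r1", 1), ("r2", 1), ("r2", 3)]
print(engagement_by_rank(served, engaged))
# prints: {0: 1.0, 1: 0.5}
```

Note that without the ‘served’ side of the join there is no denominator, and so no rate to speak of. The interesting work starts when you ask *why* the numbers look the way they do.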
So what is a Data Scientist? The best description I’ve seen comes from @josh_wills (Data Scientist @ Cloudera):
I guess the key skills are being able to ask the right questions of the data, get the necessary results and tell a story/communicate the findings. This is exactly the sort of field Mohammad Sabah is working in at Netflix and it’s going to become increasingly important for large enterprises to understand that “Data Science” is not synonymous with “Business Reporting” functions.
Of course it’s fashionable to talk about ‘Big Data’ at the moment, and just about every man and his dog has some sort of ‘Big Data’ related offering. It is crucial to remember that ‘Big Data’ and all the tools that come with it are just the plumbing. They let you collect and store masses of data, and give you a framework to process that data in a parallel / scalable way. However, these tools do not know what your data looks like or what it means, and most importantly they won’t solve your business-related challenges – to do this you need skilled people that know how to use the data to solve your problems.
Summary
It comes as no surprise to me that Netflix, Facebook, Google etc. all collect masses of data – “Massive Data” even – it is core to their businesses. Traditional enterprises are now struggling to catch up. Consultancies and software vendors, large and small, are all eager to jump on the bandwagon and sell you just about anything they can get away with branding ‘Big Data’.
However, it’s important to note that you can have all the shiny ‘Big Data’ plumbing you like, with connectors to just about every other appliance you already have – none of this will help you solve your real-world business challenges.
If 2012 is the year of ‘Big Data’, then 2013 will almost certainly become the year of ‘Data Science’. I’m sorry guys, but if you are in the ‘Big Data’ space, it’s already heavily commoditized by services like Amazon Elastic Map/Reduce – and it’s only going to get more so.
Solving problems in new and innovative ways is what’s needed – it’s what I’m doing!
Thanks for reading! Feel free to engage in discussion in the comments below or on Twitter (@cotdp).
If some of the points raised in this post hit a little close to home… it’s not too late to contact us at TUMRA – we’re a Data Science agency and I’m its CTO.