Collaborative Filtering
Tyler McMullen
... which for the purposes of this talk means:
Recommendations
Netflix
Google Reader
Pandora
Last.fm
...and of course... Amazon
(shameless plug)
I like to think of it as a fill-in-the-blank puzzle.
        Bob  Suzie  Joe
Item A   5     1     5
Item B   5     1     ?
Item C   1     5     1
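One way to picture the puzzle in code is a sparse ratings table where the blank is simply a missing entry. The sketch below assumes the grid reads row by row (Item A: 5, 1, 5; Item B: 5, 1, ?; Item C: 1, 5, 1) and uses hypothetical names (`ratings`, `blanks`). It only locates the blank; it does not fill it.

```python
# Ratings keyed by user, then item. A missing key is a blank in the puzzle.
# Assumption: the values are read row by row from the grid.
ratings = {
    "Bob":   {"Item A": 5, "Item B": 5, "Item C": 1},
    "Suzie": {"Item A": 1, "Item B": 1, "Item C": 5},
    "Joe":   {"Item A": 5, "Item C": 1},  # Item B is the "?"
}

def blanks(ratings):
    """Return sorted (user, item) pairs that have no rating yet."""
    all_items = {item for user_ratings in ratings.values() for item in user_ratings}
    return sorted(
        (user, item)
        for user, user_ratings in ratings.items()
        for item in all_items
        if item not in user_ratings
    )

print(blanks(ratings))
```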
Dataset (x3) -> Correlate (x3) -> Recommendations -> Content Booster -> Output
Data
Data > Algorithms
Amazon uses a simple item-to-item correlation system.
How can they get away with that?
~20 million items, n million users
If every user bought 200 items, their user-item matrix would be 0.001% full.
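The arithmetic behind that figure, as a quick check. The variable names are illustrative; the numbers are the ones from the slides. The fraction does not depend on how many users there are, since every user's row has the same fill.

```python
items = 20_000_000          # ~20 million items in the catalog
purchases_per_user = 200    # the hypothetical "every user bought 200 items"

# Each user's row has 200 filled cells out of 20 million, so the whole
# matrix has the same fill fraction regardless of the number of users.
fill_fraction = purchases_per_user / items
fill_percent = fill_fraction * 100

print(f"{fill_percent:.3f}% full")
```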
purchases
ratings
views
shopping cart
votes
wishlists
baby registry
wedding registry
tell-a-friend
anything you can measure!
Data > Algorithms
more different data > more of the same data
Correlation
Find patterns in the data sets
Pearson
Singular Value Decomposition
Kendall tau coefficient
Spearman's rho
point biserial correlation coefficient
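As an illustration of the first option, here is a minimal pure-Python sketch of Pearson correlation between two users, computed over the items both have rated. The function name, the sample users, and the choice to return 0.0 for degenerate cases are assumptions made for this sketch, not something taken from the talk.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two users' ratings over co-rated items.

    a, b: dicts mapping item -> rating. Returns 0.0 when there are fewer
    than two shared items or when either user's shared ratings have no
    variance (a convention chosen for this sketch).
    """
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        return 0.0
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings, made up for illustration.
alice = {"Item A": 5, "Item B": 1, "Item C": 4}
carol = {"Item A": 4, "Item B": 2, "Item C": 5}
dave = {"Item A": 1, "Item B": 5, "Item C": 2}

print(round(pearson(alice, carol), 3))  # similar tastes: close to +1
print(round(pearson(alice, dave), 3))   # opposite tastes: close to -1
```

If SciPy is available, `scipy.stats` offers `pearsonr`, `spearmanr`, `kendalltau`, and `pointbiserialr` for the coefficients listed above.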
Word of caution: watch for O(n^2) here.
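To see why the caution matters: correlating every user (or item) against every other one means n(n-1)/2 pairs. A small sketch, with a hypothetical `pair_count` helper:

```python
from itertools import combinations

def pair_count(n):
    """Number of distinct unordered pairs among n things: n * (n - 1) / 2."""
    return n * (n - 1) // 2

users = ["Bob", "Suzie", "Joe"]
pairs = list(combinations(users, 2))  # every pair that needs a correlation
assert len(pairs) == pair_count(len(users))

for n in (1_000, 1_000_000):
    print(f"{n:>9,} users -> {pair_count(n):,} pairs")
```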
Recommendation
This is the part where we figure out what you'll like.
So we have all these correlation matrices, one for each of the datasets that we correlated.
        Bob     Suzie   Joe
Bob             0.87    0.74
Suzie   -0.74           -0.9
Joe     0.856   0.1
So let's say we have a user named Fred...
Fred's correlation with each of the other users:
Joe 0.9, Bob 0.75, Suzie 0.5
What each of those users rated:
Joe: Item A 5, Item B 4
Bob: Item B 5, Item C 2
Suzie: Item C 2, Item A 2
The same ratings, grouped by item:
Item A: Joe 5, Suzie 2
Item B: Joe 4, Bob 5
Item C: Bob 2, Suzie 2
Predicted ratings for Fred (each rating weighted by Fred's correlation with the user who gave it):
Item A: 3.93
Item B: 4.45
Item C: 2
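The numbers above are consistent with a correlation-weighted average: for each item, multiply each neighbor's rating by Fred's correlation with that neighbor, sum, and divide by the sum of the correlations of the neighbors who rated it. For Item A that is (0.9 × 5 + 0.5 × 2) / (0.9 + 0.5) ≈ 3.93. A minimal sketch under that assumption follows; the function and variable names are illustrative and not taken from the linked repository.

```python
def predict(similarities, ratings):
    """Correlation-weighted average rating per item.

    similarities: neighbor -> correlation with the target user
    ratings: neighbor -> {item: rating}
    Assumes positive correlations; negative weights would need extra care.
    """
    weighted_sum = {}
    weight_total = {}
    for neighbor, sim in similarities.items():
        for item, rating in ratings.get(neighbor, {}).items():
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
            weight_total[item] = weight_total.get(item, 0.0) + sim
    return {item: weighted_sum[item] / weight_total[item] for item in weighted_sum}

fred_similarities = {"Joe": 0.9, "Bob": 0.75, "Suzie": 0.5}
neighbor_ratings = {
    "Joe": {"Item A": 5, "Item B": 4},
    "Bob": {"Item B": 5, "Item C": 2},
    "Suzie": {"Item C": 2, "Item A": 2},
}

predictions = predict(fred_similarities, neighbor_ratings)
# Highest predicted rating first.
for item in sorted(predictions, key=predictions.get, reverse=True):
    print(item, round(predictions[item], 2))
```

Sorting by predicted rating puts Item B first, so that is what Fred would be recommended.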
Content Boosting
Your users reveal their preferences in their actions.
If I mark every horror movie in your system as a "1"... I don't like horror movies.
If I rate every Will Smith movie as "5 stars"... I probably like Will Smith.
All items have properties.
Movies have genres, actors, studios, locations, etc.
Comics have genres, writers, artists, publishers, etc.
Kittens have colors, genders, breeds, cute captions, etc.
Movie                 Rating  Genre    Will Smith?
I Am Legend           5       Action   Will Smith
Cloverfield           4       Action   No Will Smith
Independence Day      4       Action   Will Smith
Sleepless in Seattle  1       Romance  No Will Smith

So what do my preferences say about me?
My mean rating is 3.5, so...
Action: +0.8
Romance: -2.5
Will Smith: +1
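These offsets match a simple calculation: take the mean rating of the items that have a property and subtract the overall mean rating (3.5 here). A hedged sketch of that calculation, with hypothetical names (`property_offsets`, `my_ratings`). How the offsets are then folded into the recommendations is not shown on the slides, so it is left out.

```python
def property_offsets(rated_items):
    """Mean rating per property minus the user's overall mean rating.

    rated_items: list of (rating, set_of_properties) tuples.
    """
    overall_mean = sum(rating for rating, _ in rated_items) / len(rated_items)
    by_property = {}
    for rating, properties in rated_items:
        for prop in properties:
            by_property.setdefault(prop, []).append(rating)
    return {
        prop: sum(values) / len(values) - overall_mean
        for prop, values in by_property.items()
    }

my_ratings = [
    (5, {"Action", "Will Smith"}),   # I Am Legend
    (4, {"Action"}),                 # Cloverfield
    (4, {"Action", "Will Smith"}),   # Independence Day
    (1, {"Romance"}),                # Sleepless in Seattle
]

offsets = property_offsets(my_ratings)
for prop in sorted(offsets):
    print(f"{prop}: {offsets[prop]:+.1f}")
```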
Your recommendations are only as good as the amount and quality of your data.
Content Boosting is thus especially useful if you have limited data.
Output
I have nothing interesting to say about output...
Moving on.
Now let's look at some code.
http://github.com/tyler/collaborative_filter