Demystifying Facebook Friend Recommendation using graph mining.
Elucidating the features required for Facebook friend recommendation.
In this blog, I am going to discuss Facebook friend recommendation, with an emphasis on featurization, using this dataset. So, let us start with exploratory data analysis first.
Table of Contents:
- Problem definition and graph overview.
- Mapping the problem and Business constraints and metrics
- Exploratory data analysis
- Posing the problem as a classification problem
- Jaccard and Cosine similarity
- PageRank
- Shortest Path
- Connected-components
- Adar Index
- Katz Centrality
- HITS Score
- SVD
- Weight Features
- Preferential Attachment
- SVD_dot
- Modeling
- Conclusion
- References
PROBLEM DEFINITION AND GRAPH OVERVIEW
We are given a directed social graph and must predict missing links in order to recommend friends/connections and followers. This will be clearer with the image below:
From the image, we can observe 5 users (U1, U2, U3, U4, U5). The lines connecting the users are called edges. A graph is a set of vertices and edges; mathematically, it is written as G = (V, E), where V is the vertex set and E is the edge set. Each edge is a pair of two vertices, edge = (Ui, Uj). A directed edge means that one user is following the other. These types of directed graphs are most visible on Instagram and Twitter. Look at the picture above: we see U5 is a new user who only follows U1. Since we have to predict missing links, can we say U5 follows U4 and U3 and vice versa? Let's move forward.
The highlighted part shows the features given. We can see node 1 is following three other nodes. There are 1.86 million nodes (people) and 9.43 million edges. We could have had metadata, such as the city a person lives in, the college they studied at, whether two people studied at the same college, or other additional information. But as far as this task is concerned, we only have the nodes and directed edges of the graph to work with.
MAPPING THE PROBLEM and BUSINESS CONSTRAINT and METRICS
Let us say that we have vertices Ui and Uj. If Ui is following Uj, i.e., there is a directed edge from Ui to Uj, the pair is labeled "1". If Ui is not following Uj, i.e., there is no such edge, the pair is labeled "0". So the problem naturally maps to a binary classification problem.
Now let us look at the figures below and understand the situation:
We see that U1 follows → {U3, U4, U5} and U2 follows → {U3, U4, U6}. Here, U1 and U2 have overlapping sets of choices, so there is a high chance that U1 follows U2 or vice versa, or that U1 follows U6 or U2 follows U5. Also, the fact that U1 is following U2 signifies a high chance that U2 will follow U1 back. These are called graph features.
Before designing the solution, we keep track of a few constraints:
No low-latency requirement: Friend suggestions can be precomputed beforehand and stored in a hash-table-like structure, so serving a suggestion should not take much time.
Assign a probability value to each suggestion: We will recommend the highest-probability links to a user, so we need to predict probabilities for candidate links. A user Ui could have 100 candidate users to follow, each with a probability value, but only 5 or 10 slots in which to show the most probable top 5 friends that Ui may want to follow.
Ideally, high precision and recall are required while suggesting friends.
Performance Metrics:
Both precision and recall are important; hence the F1 score, their harmonic mean F1 = 2PR/(P + R), is a good choice here. Let's say K = 10 and the top 10 probable vertices that Ui may want to follow are {U1, U2, …, U10}. Precision matters because it tells us, out of these 10 suggestions, how many are actually correct.
EXPLORATORY DATA ANALYSIS
Here we will extensively use the NetworkX library, one of the most popular libraries for performing computation on a network or graph. A straightforward function, nx.info(graph), gives us information about the graph. The output is:
Name:
Type: DiGraph
Number of nodes: 1862220
Number of edges: 9437519
Average in degree: 5.0679
Average out-degree: 5.0679
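For reference, here is a minimal, hedged sketch of how such a graph can be loaded, assuming train.csv is a header-less edge list of "source_node,destination_node" integer pairs (note: recent NetworkX versions have removed nx.info, so the counts are printed directly here):

import networkx as nx

# Build the directed train graph from the edge-list file
graph = nx.read_edgelist('train.csv', delimiter=',',
                         create_using=nx.DiGraph(), nodetype=int)
print(nx.number_of_nodes(graph), nx.number_of_edges(graph))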
To understand the in-degree and out-degree values in this output, let's look at the image below:
In-degree refers to incoming links, and out-degree refers to outgoing links. Let us look at a subgraph of the large dataset. From the result for this subgraph, we see there are some missing links or edges:
Name:
Type: DiGraph
Number of nodes: 66
Number of edges: 50
Average in degree: 0.7576
Average out degree: 0.7576
Now we will perform in-degree and out-degree analyses.
In-degree analysis: The number of followers of a user is exactly its in-degree. Let's look very closely at how many followers each user has. From the graph, we see that many people have 0 followers. If you notice the sharp spike, a few users in our dataset have more than 500 followers.
Let us zoom in a bit, since millions of users are squashed together and hard to tell apart.
We can infer that users ranked roughly 200000 to 700000 have 1 follower, and the count grows to about 7 followers over the range 800000 to 1.4 million. Let us observe the distribution with box plots:
We see that most users have 1 or very few followers. The plot shows outliers; these outliers are the few users who have many followers.
Let's look at the 90th to 100th percentiles. We can observe that 90% of users have fewer than 12 followers, and 99% of users have 40 or fewer. There is one user with 552 followers.
Zooming in from the 99th to the 100th percentile, only 0.1% of users have more than 112 followers.
We also obtain a similar result on drawing PDFs.
Out-degree analysis: Out-degree is the number of people each person is following. From the graph, we see there is one person following more than 1500 people, while most users follow far fewer.
Let's zoom in on the graph from the 0th user to the 1.5 millionth. We can observe that around 1.5 million users follow fewer than 7 people.
The box plot looks a little better than for in-degree: there are fewer outliers, and the following counts are smaller.
One person is following 1566 people, and 99% of people are following fewer than 40.
99.9% of people are following fewer than 123 people.
The PDF of the out-degree shows a similar picture.
From the dataset, we can make the following observations:
No of persons not following anyone: 274512 (14.74%).
No of persons having zero followers: 188043 (10.10%).
No of persons who are not following anyone and also not having any followers are 0.
The same analysis can be performed for the total degree (followers + following): degree(Ui) = in_degree(Ui) + out_degree(Ui). Let us look directly at a zoomed version of the graph.
- The min no of (followers + following) is 1; 334291 people have this minimum.
- The max no of (followers + following) is 1579; only 1 person has this maximum.
- No of persons having (followers + following) less than 10: 1320326.
- No of weakly connected components: 45558.
- No of weakly connected components with 2 nodes: 32195.
If all the nodes of a subgraph are connected to one another by some path when edge directions are ignored, that subgraph is a weakly connected component.
POSING THE PROBLEM AS A CLASSIFICATION PROBLEM
We want to convert this into a binary classification problem in which existing edges are labeled "1" and non-existing edges are labeled "0". In the original graph above, there is no edge between U1 and U2, U1 and U3, or U1 and U4; such pairs should be labeled "0" in the data. But this is not provided in the train.csv file; we have no zero-labeled data in the dataset. So let us sample some data.
We need to generate data for the label "0". Each vertex can connect to up to (n − 1) other vertices, so the graph can have up to n × (n − 1) directed edges in total. With n = 1.86×10⁶ nodes, that is roughly 3.46×10¹² possible edges, which is far too many. We will therefore randomly sample 9.43 million node pairs that are not present in train.csv, the same number as the positive edges, and label them "0". We only keep pairs whose shortest-path distance in the graph is greater than 2, as sketched below.
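Here is a hedged sketch of this negative-edge sampling, assuming graph is the NetworkX DiGraph built earlier (this naive loop illustrates the idea; the real notebook code is more optimized):

import random
import networkx as nx

def sample_missing_edges(graph, n_samples, seed=42):
    # Draw random (u, v) pairs, keep only pairs with no existing edge
    # whose shortest-path distance is greater than 2 (or unreachable)
    random.seed(seed)
    nodes = list(graph.nodes())
    edges = set(graph.edges())
    missing = set()
    while len(missing) < n_samples:
        u, v = random.sample(nodes, 2)
        if (u, v) in edges or (u, v) in missing:
            continue
        try:
            if nx.shortest_path_length(graph, u, v) > 2:
                missing.add((u, v))
        except nx.NetworkXNoPath:
            missing.add((u, v))  # unreachable pairs also qualify as "0"
    return list(missing)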
Train and test split
In the real world, if you are an employee of Facebook or Instagram, you have to deal with the temporal nature of the data, so you would need to carry out time-based splitting.
But we are only provided with a graph at a single timestamp. Since we don't have timestamps, we will carry out random splitting.
- We split the data into an 80:20 ratio.
- We split positive and negative links separately, because we only need positive training data to create the graph and generate features (a code sketch follows this list).
- For positive links : X_train_pos, X_test_pos, y_train_pos, y_test_pos
- For negative links: X_train_neg, X_test_neg, y_train_neg, y_test_neg
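A hedged sketch of this split, where df_pos/df_neg hold the positive and negative node pairs and y_pos/y_neg the matching 1/0 labels (these names are assumptions for illustration):

from sklearn.model_selection import train_test_split

# 80:20 split, done separately for positive and negative links
X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(
    df_pos, y_pos, test_size=0.2, random_state=9)
X_train_neg, X_test_neg, y_train_neg, y_test_neg = train_test_split(
    df_neg, y_neg, test_size=0.2, random_state=9)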
The final result is
Number of data points in the train data with edges: 7550015
Number of data points in the train data without edges: 7550015
Number of data points in the test data with edges: 1887504
Number of data points in the test data without edges: 1887504
- Now, we appended X_train_neg to X_train_pos.
- We concatenated (y_train_pos,y_train_neg).
- We appended X_test_neg to X_test_pos.
- We concatenated (y_test_pos,y_test_neg).
The resulting train graph:
Name:
Type: DiGraph
Number of nodes: 1780722
Number of edges: 7550015
Average in degree: 4.2399
Average out degree: 4.2399
The resulting test graph:
Name:
Type: DiGraph
Number of nodes: 1144623
Number of edges: 1887504
Average in degree: 1.6490
Average out degree: 1.6490
No of people common in train and test: 1063125
No of people present in train but not in test: 717597
No of people present in test but not in train: 81498
% of people in the test data who are not present in the train data: 7.12%
The final datasets stand at:
Data points in train data: (15100030, 2)
Data points in test data: (3775008, 2)
Shape of target variable in train: (15100030,)
Shape of target variable in test: (3775008,)
SIMILARITY MEASURES:
Now we will convert these pairs of vertices into numerical, categorical, or binary features so that machine learning can be applied, since we cannot run algorithms directly on sets of vertices.
For any graph-based machine learning problem, featurization is the most important part of the project. Remember, we do feature engineering on existing features, and creating new features must be done after the train-test split to avoid data leakage.
- Jaccard index: The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity between two sets, ranging from 0 to 1 (0% to 100%): the higher the value, the more similar the two sets. For sets X and Y, J(X, Y) = |X ∩ Y| / |X ∪ Y|. Although it is easy to interpret, it is sensitive to small sample sizes and may give erroneous results for very small sets or data with missing observations.
Let's look at the image again:
X = U1's followees = {U3, U4, U5}, Y = U2's followees = {U3, U4, U6}; computing the Jaccard index we get J(X, Y) = |{U3, U4}| / |{U3, U4, U5, U6}| = 2/4. The larger the Jaccard index, the higher the chance that a relation exists between U1 and U2. A sketch of this feature follows.
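A minimal, hedged sketch, assuming graph is the NetworkX DiGraph built earlier; swapping successors for predecessors gives the follower-side version:

def jaccard_for_followees(graph, a, b):
    # Jaccard index over the sets of accounts a and b follow
    x, y = set(graph.successors(a)), set(graph.successors(b))
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)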
2. Cosine similarity (Otsuka-Ochiai coefficient):
Cosine similarity, cos(X, Y) = (X · Y) / (||X|| ||Y||), is used when X and Y are vectors. But here X and Y are sets, so we will use a simple extension of cosine similarity, the Otsuka-Ochiai coefficient:
K(X, Y) = |X ∩ Y| / √(|X| × |Y|)
- So, the Otsuka-Ochiai coefficient will be high when there is more overlap between sets X and Y.
- Here also, we will compute it for followers as well as followees, as sketched below.
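A hedged sketch of the Otsuka-Ochiai coefficient over followee sets, mirroring the Jaccard helper above:

import math

def cosine_for_followees(graph, a, b):
    # Otsuka-Ochiai coefficient over the sets of accounts a and b follow
    x, y = set(graph.successors(a)), set(graph.successors(b))
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))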
PAGE RANK
- PageRank is a very, very popular featurization technique for directed graphs as Google used it.
- PageRank was named after Larry Page, one of the founders of Google.
- It is an algorithm used by Google Search to rank web pages in its search engine results.
- It is a way of measuring the importance of website pages.
- PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the page is.
- If many pages link to page "B", then "B" must be important. And if page "B" is linked from many important pages like C, D, and E, the value of page "B" increases further.
- If a user Ui has a high PageRank score, it means that other users, including important ones, are linking to Ui.
- PageRank can tell us about relative importance.
- Bill Gates is a celebrity. He is followed by important people like Mark Zuckerberg and Warren Buffett as well as by common people like me, so it is quite certain that he is an important person. There is, therefore, a significantly higher probability that "you" will also end up following Bill Gates.
- Now, for the 7.12% of test-data users not present in the train data, we can use the mean PageRank value as their PageRank, as sketched below.
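A hedged sketch, assuming train_graph is the DiGraph built from the training edges:

# PageRank on the train graph; unseen test nodes fall back to the mean
pr = nx.pagerank(train_graph, alpha=0.85)
mean_pr = float(sum(pr.values())) / len(pr)

def pagerank_feature(node):
    return pr.get(node, mean_pr)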
SHORTEST PATH:
We can compute the shortest path between two nodes. If the nodes are directly connected, we remove that edge and then calculate the shortest path.
- If there is no path between two vertices, we use the default value "-1".
- Suppose there is an edge between two vertices; their shortest path would then trivially have length 1, which tells us nothing. In that case, we remove that edge and calculate the shortest path between those two vertices.
- The shortest path measures how far apart, at minimum, two vertices are (a helper sketch follows this list).
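A hedged sketch of this helper, which the community check further below also uses; train_graph is assumed from earlier:

import networkx as nx

def compute_shortest_path_length(a, b):
    # If a direct edge a -> b exists, drop it temporarily so the trivial
    # length-1 path does not mask the real (indirect) distance
    had_edge = train_graph.has_edge(a, b)
    if had_edge:
        train_graph.remove_edge(a, b)
    try:
        p = nx.shortest_path_length(train_graph, source=a, target=b)
    except nx.NetworkXNoPath:
        p = -1  # default value when no path exists
    if had_edge:
        train_graph.add_edge(a, b)  # restore the removed edge
    return p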
CONNECTED COMPONENTS
- Strongly Connected Component
A graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are strongly connected.
- Weakly connected component
A weakly connected component is one in which all components are connected by some path, ignoring direction.
- Checking for same weakly connected component (Community)
When we ignore directions, there is a path from each vertex to every other vertex within S1 and within S2. So the subgraphs S1 and S2 are weakly connected components of the graph.
My college friends and my work-related friends can be completely different. So weakly connected components actually create something called “Community.”
A group of similar users or people with similar interests or people from the same college, or people with the same interest in Machine learning can probably form the same community. So, if Ui and Uj belong to the same weakly connected component, then it is quite probable that they both may have some interest in something or belong to the same city or can be anything in common.
If there were even one edge between S1 and S2, we would have a single weakly connected component instead of two.
# Getting weakly connected components from the train graph
wcc = list(nx.weakly_connected_components(train_graph))

def belongs_to_same_wcc(a, b):
    """Return 1 if a and b belong to the same weakly connected
    component (community), else 0."""
    # Find the component containing a, then check whether b is in it too
    index = []
    for comp in wcc:
        if a in comp:
            index = comp
            break
    if b not in index:
        return 0  # different communities
    # If a direct edge exists, remove it temporarily and require an
    # indirect path; otherwise the edge alone would decide the answer.
    for u, v in ((b, a), (a, b)):  # check b -> a first, then a -> b
        if train_graph.has_edge(u, v):
            train_graph.remove_edge(u, v)
            still_reachable = compute_shortest_path_length(u, v) != -1
            train_graph.add_edge(u, v)  # restore the removed edge
            return 1 if still_reachable else 0
    # No direct edge: belonging to the same component is enough
    return 1
In the above code:
- First, we find the weakly connected component (community) that contains a and check whether b belongs to it too. If not, we return 0.
- If a and b are in the same wcc and there is a direct edge from b to a (or from a to b), we temporarily remove that edge and calculate the shortest path between the two nodes. If no path remains, we declare that they are not in the same community, even though they technically belong to the same wcc, and return 0.
- However, if a shortest path still exists after removing the direct edge, it means they could genuinely be friends, and hence we declare them to be in the same wcc/community and return 1.
- If there is no direct edge between a and b in the first place, belonging to the same wcc is enough, and we return 1.
- Why do we check for the existence of a direct edge from b to a and from a to b? In our positive data points there is always an edge from a to b or from b to a, so without this check the feature would return 1 for almost every data point. Hence, if a direct edge exists, we remove it and check whether one node is still reachable from the other via an indirect path of length 2 or more; only then do we declare them to be in the same wcc/community.
ADAR INDEX
- The Adamic/Adar Index is a measure designed to predict links in a social network.
- The neighborhood N(x) is the subset of vertices/nodes connected in either direction to x.
- The formula for the Adamic/Adar index is: A(x, y) = Σ 1/log(|N(u)|), summed over all u in N(x) ∩ N(y).
- That is, we apply the function to every vertex u belonging to the intersection of the neighborhoods of nodes x and y.
- Let's say U1 and U2 are two vertices belonging to the intersection of N(x) and N(y), so both are connected to both x and y.
U1 has a huge neighborhood, so it is probably a celebrity; that gives minimal evidence that x and y are related.
U2, on the other hand, has a small neighborhood, a common person like us. A small neighborhood could be a college group or a workgroup, so x and y may well be related in this case.
- As the size of N(u) increases, log(|N(u)|) increases and 1/log(|N(u)|) decreases. So nodes like U1 with a large number of neighbors contribute little to the Adar index, and vice versa. A code sketch follows.
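A hedged sketch, taking neighborhoods in either direction (followers plus followees) as defined above:

import math

def adar_index(graph, x, y):
    n_x = set(graph.successors(x)) | set(graph.predecessors(x))
    n_y = set(graph.successors(y)) | set(graph.predecessors(y))
    score = 0.0
    for u in n_x & n_y:
        deg = len(set(graph.successors(u)) | set(graph.predecessors(u)))
        if deg > 1:  # skip log(1) = 0 to avoid division by zero
            score += 1.0 / math.log(deg)
    return score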
DOES THE PERSON FOLLOW BACK?
If b is following a, then there is a very high probability that a will follow b back.
So if there is an edge from b to a, the function returns 1, else 0. This is called the follows_back feature, and it is a very useful feature on social media like Instagram. A minimal sketch follows.
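A minimal sketch, assuming train_graph from earlier:

def follows_back(a, b):
    # 1 if b already follows a, i.e. the directed edge b -> a exists
    return 1 if train_graph.has_edge(b, a) else 0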
KATZ CENTRALITY
- It is similar in spirit to Google's PageRank, but a much older technique, dating from 1953.
- Katz centrality computes the centrality of a node based on the centrality of its neighbors. It is a generalization of eigenvector centrality.
- It is used to measure a node's relative degree of influence within a graph or social network.
- The Katz centrality for node i is:
x_i = α Σ_j A_ij x_j + β,
where A is the adjacency matrix of graph G with largest eigenvalue λ_max. The parameter β controls the initial centrality, and α < 1/λ_max. A sketch using NetworkX follows.
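A hedged sketch; the α value here is illustrative (it must be smaller than 1/λ_max of the adjacency matrix), and the mean is kept as a fallback for unseen test nodes, as with PageRank:

katz = nx.katz_centrality(train_graph, alpha=0.005, beta=1.0)
mean_katz = float(sum(katz.values())) / len(katz)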
HITS SCORE
- HITS stands for Hyperlink-Induced Topic Search (often referred to as hubs and authorities); it is a link analysis algorithm that rates web pages.
- The HITS algorithm computes two numbers for a node. Authorities estimate the node's value based on its incoming links. Hubs estimate the node's value based on its outgoing links.
- Hubs → many outgoing links, e.g., yahoo.com.
- Authorities → many incoming links, e.g., cnn.com, bbc.com, mit.edu.
- So, HITS gives every web page wi two scores: <hub, authority>.
- Step 1: This is an iterative algorithm. Initially, for every page p, we set auth(p) = 1 and hub(p) = 1.
- Step 2: Authority update rule: for each p, we update auth(p) to the sum of hub(q) over all pages q that link to p.
- Step 3: Hub update rule: for each p, we update hub(p) to the sum of auth(q) over all pages q that p links to.
- Step 4: If we kept running these update rules unchecked, the scores would grow without bound. So after each update of the authority and hub scores, we normalize them so that they do not become infinitely large.
- We run these steps iteratively; it is mathematically proven that the authority and hub scores converge. So we iterate until the scores no longer change. NetworkX implements this directly, as sketched below.
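A hedged sketch; nx.hits returns two dicts (node → hub score, node → authority score), and both become features for each vertex:

hubs, authorities = nx.hits(train_graph, max_iter=100, normalized=True)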
SINGULAR VALUE DECOMPOSITION
- So you are given a graph G with directed edges Ui → Uj.
- There is a concept called adjacency matrix that represents all of the information we have in our graph(which has around 1.78 million users) in the form of a matrix.
- The adjacency matrix changes from graph to graph; let's stick to the graph we have.
- The adjacency matrix A of G is of size 1.78M x 1.78M.
- It is an n x n matrix whose ijth cell is 1 if there is a directed edge from Ui to Uj, else 0.
- In our case, the adjacency matrix is a binary matrix.
- There are other variations, such as weighted adjacency matrices for weighted graphs; for now, let's stick to our binary directed graph.
- We will get a huge and sparse matrix as most of the values will be zero.
- We will decompose the adjacency matrix using SVD.
- A = U Σ V^T, where
U is of size 1.78M x 6,
Σ is of size 6 x 6,
V^T (transpose) is of size 6 x 1.78M.
- The ith row of U and the ith column of V^T are same-sized, 6-dimensional vector representations of the ith vertex of the graph, obtained from the left singular matrix U and the right singular matrix V^T.
- So, for each pair of vertices for which we need to predict an edge, we get 24 features: 6 from U and 6 from V^T for Ui, and 6 from U and 6 from V^T for Uj.
- Matrix factorization helps to incorporate implicit feedback, i.e., information that is not directly given but can be derived by analyzing user behavior. A decomposition sketch follows.
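A hedged sketch, not the exact notebook code: a rank-6 truncated SVD of the sparse adjacency matrix of train_graph, plus an index mapping each node to its row of U / column of V^T:

import networkx as nx
from scipy.sparse.linalg import svds

A = nx.adjacency_matrix(train_graph).astype(float)
U, s, Vt = svds(A, k=6)  # U: n x 6, s: 6, Vt: 6 x n

# Row i of U / column i of Vt embed the ith node in nodelist order
node_index = {n: i for i, n in enumerate(train_graph.nodes())}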
WEIGHT FEATURES
In the figure, 1M represents 1 million. Since Ui is a celebrity, the chance that any two of its random followers know each other is minimal. But since Uj is not a celebrity, just a common person like us, the chance that any two of its random followers know each other is quite high.
How do we encode whether a point is a celebrity or not, using just the number of links coming into it?
There is a weight feature "W" for any point Ui: W(Ui) = 1/√(1 + |X|), where X is the set of neighbors of Ui (computed separately for in-links and out-links).
For every vertex Ui, we have these 6 features.
- weight of incoming edges
- weight of outgoing edges
- weight of incoming edges + weight of outgoing edges
- weight of incoming edges * weight of outgoing edges
- 2*weight of incoming edges + weight of outgoing edges
- weight of incoming edges + 2*weight of outgoing edges
To determine the similarity of nodes, an edge weight value is calculated between nodes; the edge weight decreases as the neighbor count goes up. As we have directed edges, we calculate the weighted-in and weighted-out values separately, as sketched below.
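A hedged sketch of the base weight W = 1/√(1 + |neighbors|); the six features above are sums and products of these two dictionaries:

import math

weight_in, weight_out = {}, {}
for u in train_graph.nodes():
    # in-weight from follower count, out-weight from followee count
    weight_in[u] = 1.0 / math.sqrt(1 + train_graph.in_degree(u))
    weight_out[u] = 1.0 / math.sqrt(1 + train_graph.out_degree(u))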
PREFERENTIAL ATTACHMENT
One well-known phenomenon in social networks is that users with many friends tend to create more connections in the future: as in finance, the rich get richer. We estimate how "rich" our two vertices are by multiplying the numbers of friends or followers (|Γ(x)|) the vertices have: PA(x, y) = |Γ(x)| × |Γ(y)|. Note that this similarity index requires only the degree of each node, not the full neighborhood, so it has the lowest computational complexity. A sketch follows.
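A hedged sketch; using in-degree (follower counts) here is an assumption, and the followee-side version works the same way with out_degree:

def preferential_attachment(graph, x, y):
    # product of follower counts; only degrees are needed, no neighbor sets
    return graph.in_degree(x) * graph.in_degree(y)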
SVD_dot
SVD_dot is the dot product between the SVD feature vectors of the source node and those of the destination node; a sketch follows.
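A hedged sketch, reusing U, Vt, and node_index from the SVD block above:

import numpy as np

def svd_dot(a, b):
    # dot products of the 6-d left and right SVD vectors of a and b
    ia, ib = node_index[a], node_index[b]
    return np.dot(U[ia], U[ib]), np.dot(Vt[:, ia], Vt[:, ib])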
So let's look at the features we have
MODELLING
I used Random Forest and XGBoost, and also performed hyperparameter tuning. The confusion matrices for train and test are given below; for the full code, refer to my GitHub. A minimal training sketch follows.
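A hedged sketch; X_train/X_test here stand for the engineered feature matrices (not the raw node pairs), and the hyperparameters shown are illustrative rather than the tuned values:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

clf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)
clf.fit(X_train, y_train)
for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = clf.predict(X)
    print(name, "f1:", f1_score(y, pred))
    print(confusion_matrix(y, pred))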
The accuracy table is listed below.
CONCLUSION
By performing this case study, the following observations were made:
1) Initially, we had only a couple of features in our dataset. First, we performed exploratory data analysis on the given data, such as the number of followers and followees of each person.
2) Then we generated data points that were not present in the given dataset, since we had only class-label-1 data.
3) We did feature engineering on the dataset, computing features like the shortest path, Katz centrality, the Jaccard index, PageRank, preferential attachment, etc.
4) After the exploratory data analysis and feature engineering, we split the whole dataset into train and test and trained Random Forest and XGBoost models, taking the F1 score as our metric.
5) In the end, we plotted the confusion matrices and a pretty-table of scores for both algorithms and found the best hyperparameters.
THANK YOU SO MUCH for going through this long document. Please feel free to comment if you have any suggestions or queries. You can also reach me on LinkedIn. Here is a link to my GitHub. Happy reading!
REFERENCES
- https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/3170/problem-definition/7/module-6-machine-learning-real-world-case-studies
- http://www.statisticshowto.com/jaccard-index/
- https://en.wikipedia.org/wiki/Cosine_similarity
- https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
- https://en.wikipedia.org/wiki/PageRank
- https://en.wikipedia.org/wiki/Adamic/Adar_index
- https://en.wikipedia.org/wiki/Katz_centrality
- https://en.wikipedia.org/wiki/HITS_algorithm
- http://be.amazd.com/link-prediction/
- https://storage.googleapis.com/kaggle-forum-message-attachments/2594/supervised_link_prediction.pdf