Demystifying Facebook Friend Recommendation using graph mining.

Elucidating the features required in Facebook friend recommendation.


In this blog, I am going to discuss Facebook friend recommendation, with an emphasis on featurization for this dataset. So, let us start with exploratory data analysis first.

  1. Problem definition and graph overview.
  2. Mapping the problem and Business constraints and metrics
  3. Exploratory data analysis
  4. Posing the problem as a classification problem
  5. Jaccard and Cosine similarity
  6. PageRank
  7. Shortest Path
  8. Connected-components
  9. Adar Index
  10. Katz Centrality
  11. HITS Score
  12. SVD
  13. Weight Features
  14. Preferential Attachment
  15. SVD_dot
  16. Modeling
  17. Conclusion
  18. References

We are given a directed social graph and have to predict missing links in order to recommend friends/connections and followers. This becomes clearer with the image below:

Sample network of users and their connections

From the image, we can observe 5 users (U1, U2, U3, U4, U5). The lines connecting the users are called edges. A graph is a set of vertices and edges; mathematically, it is written as G = (V, E), where V is the set of vertices and E is the set of edges. Each edge is a pair of two vertices, edge = (Ui, Uj). A directed edge means that one user is following the other. These types of directed graphs are most visible on Instagram and Twitter. Look at the picture above: U5 is a new user following only U1. Since we have to predict missing links, can we say U5 will follow U4 and U3, and vice-versa? Let's move forward.
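As a sketch, the picture above can be encoded as a networkx directed graph. Note that the exact edge list here is an assumption read off the figure; only U5 → U1 is stated explicitly in the text:

```python
import networkx as nx

# Toy encoding of the 5-user network in the figure.
# Edge (u, v) means "u follows v"; edges other than U5 -> U1 are assumed.
G = nx.DiGraph()
G.add_edges_from([("U5", "U1"), ("U1", "U2"), ("U2", "U3"), ("U3", "U4")])

print(G.number_of_nodes(), G.number_of_edges())
```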

snapshot of dataset

The highlighted part shows the features given. We can see node 1 is following three other nodes. There are 1.86 million nodes (people) and 9.43 million edges. We could have had metadata, such as the city a person lives in, the college they studied at, or whether two people studied at the same college. But as far as this task is concerned, we only have the nodes and directed edges of the graph to work with.

ONE node following three other nodes

Let us say that we have vertices Ui and Uj. If Ui is following Uj, i.e., there is a directed edge from Ui to Uj, the pair is labeled “1”. If Ui is not following Uj, i.e., there is no edge from Ui to Uj, the pair is labeled “0”. So the problem maps naturally to a binary classification problem.

Now let us look at the figures below and understand the situation:

a simple network

We see that U1 follows → {U3, U4, U5} and U2 follows → {U3, U4, U6}. Here, U1 and U2 have an overlapping set of choices. There is a high chance that U1 follows U2 or vice-versa, or that U1 follows U6 or U2 follows U5. Also, if U1 is following U2, there is a high chance that U2 will follow U1 back. These are called graph features.

Before designing the solution, we keep track of a few constraints:

No low-latency requirement: Friend suggestions can be precomputed beforehand and stored in a hash-table-like structure, so serving a suggestion does not take much time.

Assign a probability value to each suggestion: We will recommend the highest-probability links to a user, so we need to predict the probability of each candidate link. There could be 100 users that Ui might follow, but only 5 or 10 slots in which to show the most probable friends Ui may want to follow.

Ideally, high precision and recall are required while suggesting friends.

Performance Metrics:

Both precision and recall are important; hence the F1 score is a good choice here. Let's say K = 10 and the top 10 probable vertices that Ui may want to follow are {U1, U2, …, U10}. Precision matters because it measures how many of these 10 suggestions are actually correct.

Here we will be extensively using the networkx library, one of the most popular libraries for performing computation on a network or graph. A straightforward function gives us basic information about the graph. The output is:

To understand this output, let's look at the image below:

Basic information of the graph

In-degree refers to incoming links, and out-degree refers to outgoing links. Let us see a subgraph of the large dataset. From the result of the subgraph, we see there are some missing links or edges.
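As a minimal sketch of how such a graph can be loaded and inspected with networkx (the toy edge list below is a stand-in for the real train.csv, which has ~1.86M nodes and 9.43M edges):

```python
import networkx as nx

# Hypothetical rows of the edge list: (follower, followee).
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
G = nx.DiGraph(edges)

print("nodes:", G.number_of_nodes())
print("edges:", G.number_of_edges())
print("in-degree of 3:", G.in_degree(3))    # incoming links to node 3
print("out-degree of 1:", G.out_degree(1))  # outgoing links from node 1
```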


Now we will perform in-degree and out-degree analyses.

In-degree analysis: The number of followers is exactly equal to the in-degree. Let's look closely at how many followers each user has. From the graph, we see that many people have 0 followers. If you notice the sharp spike, a few users have more than 500 followers in our dataset.

No. of follower VS No. of users.

Let us zoom in a bit, as millions of users are squashed together, making the details hard to see.

No. of follower VS No. of users.

We can infer that users ranked 200,000 to 700,000 have 1 follower. Follower counts increase to 7 over the range 800,000 to 1.4 million. Let us observe the distribution with box plots:

boxplot for data distribution

We see that most users have 1 or very few followers. The plot shows outliers, which correspond to the few users having many followers.

Let's look at the 90th to 100th percentiles. We can observe that 90% of the users have fewer than 12 followers, and 99% of the users have 40 or fewer. There is one user with 552 followers.

Let’s zoom in from the 99th percentile to the 100th percentile. Only 0.1% of users have more than 112 followers.

We also obtain a similar result on drawing PDFs.

Out-degree analysis: Out-degree is the number of people each person is following. From the graph, we see there is one person who is following more than 1500 people. Most users are following fewer people.

Let's zoom in on the graph from the 0th user to the 1.5 millionth. We can observe from the graph that around 1.5 million users follow fewer than 7 people.

Looking at the box plot, the situation is a little better than for in-degree: there are fewer outliers, and the following counts are smaller.

One person is following 1566 people, and the 99th-percentile user follows fewer than 40 people.

99.9% of people are following fewer than 123 people.

PDF of the outdegree analyses

From the dataset, we can make the following observations:

The number of people having zero followers is 274512 (14.74%).

The number of people following no one is 188043 (10.10%).

The number of people who are not following anyone and also have no followers is 0.

The same analysis can be performed for the total degree (followers + following): degree(Ui) = in_degree(Ui) + out_degree(Ui). Let us look directly at the zoomed version of the graph.

  • Min of no of (followers + following) is 1.
    334291 people have a minimum no of (followers + following).
  • The Max of no of (followers + following) is 1579.
    Only 1 person has a maximum no of (followers + following).
  • No of persons having (followers + following) less than 10 are 1320326.
  • Number of weakly connected components: 45558;
    weakly connected components with 2 nodes: 32195.
    If all the nodes in a subgraph are connected to each other by some path, ignoring edge direction, that subgraph forms one weakly connected component.


Dividing the dataset into “1” and “0”.

We want to convert this problem into a binary classification problem such that existing edges are labeled “1” and non-existing edges are labeled “0”. In the original graph above, there is no edge between U1 and U2, U1 and U3, or U1 and U4; these pairs should be labeled “0” in the data. But this is not provided in the train.csv file: we don't have any zero-labeled data in the dataset. Let us sample some data.

We need to generate data for the label “0”. Each vertex can have up to (n-1) edges, so the total number of possible edges is n × (n-1). With 1.86×10⁶ nodes, this comes to 1.86×10⁶ × (1.86×10⁶ − 1) possible edges, which is far too large. We will instead randomly sample 9.43 million node pairs that are not present as edges in train.csv (matching the number of positive edges) and label them “0”. We will only consider pairs whose shortest-path distance is greater than 2.
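A minimal sketch of this negative sampling on a toy graph, assuming the rule is that a sampled “0” pair must either be unreachable or more than 2 hops apart:

```python
import random
import networkx as nx

random.seed(42)

# Toy directed graph standing in for the real 1.86M-node graph.
G = nx.DiGraph([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)])
nodes = list(G.nodes())

missing = set()
while len(missing) < 3:
    u, v = random.sample(nodes, 2)
    if G.has_edge(u, v):
        continue  # existing edges are label "1", skip them
    try:
        if nx.shortest_path_length(G, u, v) > 2:
            missing.add((u, v))
    except nx.NetworkXNoPath:
        missing.add((u, v))  # unreachable pairs also get label "0"

print(missing)
```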

Train and test split

In the real world, as an employee of Facebook or Instagram, you would have to deal with the temporal nature of the data and carry out a time-based split. But we are only provided with a graph at a single timestamp. Since we don't have timestamps, we will carry out a random split.

  • We split the data into an 80:20 ratio.
  • We split the positive and negative links separately, because we need only the positive training edges to build the graph and generate features.
  • For positive links : X_train_pos, X_test_pos, y_train_pos, y_test_pos
  • For negative links: X_train_neg, X_test_neg, y_train_neg, y_test_neg

The final result is

  • Now, we appended X_train_neg to X_train_pos.
  • We concatenated (y_train_pos,y_train_neg).
  • We appended X_test_neg to X_test_pos.
  • We concatenated (y_test_pos,y_test_neg).
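The split-and-concatenate steps above can be sketched as follows (toy lists stand in for the real positive and negative edge sets):

```python
import random

random.seed(0)

pos = [(i, i + 1) for i in range(10)]    # toy label-1 pairs
neg = [(i, i + 100) for i in range(10)]  # toy label-0 pairs

def split_80_20(pairs):
    # Random split, since we have no timestamps for a time-based split.
    pairs = pairs[:]
    random.shuffle(pairs)
    cut = int(0.8 * len(pairs))
    return pairs[:cut], pairs[cut:]

X_train_pos, X_test_pos = split_80_20(pos)
X_train_neg, X_test_neg = split_80_20(neg)

# Append negatives to positives, and concatenate the labels.
X_train = X_train_pos + X_train_neg
y_train = [1] * len(X_train_pos) + [0] * len(X_train_neg)
X_test = X_test_pos + X_test_neg
y_test = [1] * len(X_test_pos) + [0] * len(X_test_neg)

print(len(X_train), len(X_test))
```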

Final datasets stand:


Now we will convert each pair of vertices into numerical, categorical, or binary features, since machine-learning algorithms cannot operate on the vertex pairs directly.

For any graph-based machine learning problem, featurization is the most important part of the project. Remember: feature engineering on existing features and creation of new features should be done after the train-test split to avoid data leakage.

  1. Jaccard index: The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity for the two sets, ranging from 0% to 100%: the higher the percentage, the more similar the two sets. Although it is easy to interpret, it is sensitive to small sample sizes and may give erroneous results for tiny sets or data with missing observations.

Let's look at the image again:

small network

X = U1's follow-set = {U5, U3, U4}, Y = U2's follow-set = {U3, U4, U6}; computing the Jaccard index we get J(X, Y) = 2/4. The larger the Jaccard index, the higher the chance that a relation exists between U1 and U2.
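A minimal sketch of the Jaccard computation on the follow-sets above:

```python
# Jaccard index: |X ∩ Y| / |X ∪ Y|, with 0 for two empty sets.
def jaccard(x, y):
    if not (x | y):
        return 0.0
    return len(x & y) / len(x | y)

X = {"U5", "U3", "U4"}  # who U1 follows
Y = {"U3", "U4", "U6"}  # who U2 follows
print(jaccard(X, Y))  # 0.5, i.e. 2/4 as computed above
```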

2. Cosine distance (Otsuka-Ochiai coefficient):

Cosine similarity is

cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖)

which is used when X and Y are vectors. But we will use a simple extension of it, the Otsuka-Ochiai coefficient, which is used when X and Y are sets:

K(X, Y) = |X ∩ Y| / √(|X| × |Y|)
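A minimal sketch of the set-based Otsuka-Ochiai coefficient, using the same follow-sets as in the Jaccard example:

```python
import math

# Otsuka-Ochiai coefficient: |X ∩ Y| / sqrt(|X| * |Y|), 0 if a set is empty.
def otsuka_ochiai(x, y):
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

X = {"U5", "U3", "U4"}
Y = {"U3", "U4", "U6"}
print(otsuka_ochiai(X, Y))  # 2 / sqrt(9) = 2/3
```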

  • So, the cosine distance (Otsuka-Ochiai coefficient) will be high when there is more overlap between sets X and Y.
  • Here also, we will compute it for followers as well as for followees.
  • PageRank is a very popular featurization technique for directed graphs, as Google used it.
  • PageRank was named after Larry Page, one of the founders of Google.
  • It is an algorithm used by Google Search to rank web pages in its search results.
  • It is a way of measuring the importance of website pages.
  • PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the page is.
  • If many pages link to “B,” then “B” must be important.
    If my page “B” is linked from many important pages like C, D, and E, the value of page “B” increases.
  • If a user has a high PageRank score, it means that other users, including significant ones, link to them.
  • PageRank tells us about relative importance.
  • Bill Gates is a celebrity.
    He is followed by important people like Mark Zuckerberg and Warren Buffett as well as common people like me, so it is quite certain that he is an important person.
    There is then a significantly higher probability that “you” will follow Bill Gates.
  • For the 7.12% of test nodes not present in the train data, we can assign the mean PageRank value as their PageRank score.
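A minimal sketch of computing PageRank with networkx on a toy directed graph (alpha=0.85 is the usual damping factor, an assumed choice here):

```python
import networkx as nx

# Toy directed graph: users 1, 2, and 4 all link to user 3.
G = nx.DiGraph([(1, 3), (2, 3), (4, 3), (3, 5)])
pr = nx.pagerank(G, alpha=0.85)

# Mean score: the fallback value for test nodes unseen in training.
mean_pr = sum(pr.values()) / len(pr)

# Node 3, with three incoming links, scores higher than the nodes linking to it.
print(round(pr[3], 3), round(mean_pr, 3))
```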


We can compute the shortest path between two nodes. If the nodes are directly connected, we first remove that edge and then calculate the path.

  • If there is no path at all between two vertices, we will use the default value “-1”.
  • Suppose we have an edge between 2 vertices; their shortest path is then of length “1”, which carries no information. In that case, we remove that edge and calculate the shortest path between those two vertices.
  • The shortest path measures how far apart two vertices are at the least.
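A minimal sketch of this shortest-path feature: remove the direct edge first, return -1 when no path exists, and restore the graph afterwards:

```python
import networkx as nx

def shortest_path_feature(G, a, b):
    # If a direct edge exists, drop it so the feature measures the
    # indirect distance rather than the trivial length-1 path.
    had_edge = G.has_edge(a, b)
    if had_edge:
        G.remove_edge(a, b)
    try:
        d = nx.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        d = -1  # default value when no path exists
    if had_edge:
        G.add_edge(a, b)  # restore the graph
    return d

G = nx.DiGraph([(1, 2), (2, 3), (1, 3)])
print(shortest_path_feature(G, 1, 3))  # 2: via node 2, ignoring the direct edge
print(shortest_path_feature(G, 3, 1))  # -1: no path from 3 back to 1
```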


  • Strongly Connected Component
    A graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are strongly connected.
  • Weakly connected component
    A weakly connected component is a subgraph in which all nodes are connected to each other by some path, ignoring edge direction.
  • Checking for same weakly connected component (Community)

When we ignore directions, there is a path from each vertex to every other vertex; so subgraphs S1 and S2 are weakly connected components of the graph.

My college friends and my work-related friends can be completely different. So weakly connected components actually create something called “Community.”

A group of similar users or people with similar interests or people from the same college, or people with the same interest in Machine learning can probably form the same community. So, if Ui and Uj belong to the same weakly connected component, then it is quite probable that they both may have some interest in something or belong to the same city or can be anything in common.

If there were even one edge between S1 and S2, we would have a single weakly connected component instead of two.

In the above code:

  • First, we check if there is a direct edge from b to a. If yes, we check whether both b and a belong to the same wcc/community.
  • If they do, we remove the direct edge from b to a and calculate the shortest path from b to a. If no such path exists, we declare that they are not in the same wcc/community, even though technically they are.
  • However, if a shortest path from b to a does exist, it means they could be friends, and hence we declare them to be in the same wcc/community.
  • Secondly, we check if there is a direct edge from a to b, follow the same approach as above, and return 1 or 0.
  • If there is no direct edge from a to b or from b to a in the first place, we simply check whether they belong to the same wcc: if yes, return 1, else 0.
  • Why did we check for the existence of a direct edge from b to a and from a to b?
    In our dataset, almost all of the data points have a path from a to b or from b to a. If we didn't check for this condition, the feature would return 1 for most of the data points. Hence, if there is a direct edge, we remove it and check whether a is reachable from b (or b from a) via an indirect path of length greater than 2; only then do we declare them to be in the same wcc/community.
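A minimal sketch of this same-community check, using a and b for the two endpoints (the toy graph and the component-lookup structure are illustrative assumptions):

```python
import networkx as nx

def belongs_to_same_wcc(G, wcc_list, a, b):
    # Map each node to its weakly connected component (a set of nodes).
    index = {}
    for comp in wcc_list:
        for n in comp:
            index.setdefault(n, comp)
    if a not in index or b not in index or index[a] is not index[b]:
        return 0  # different communities
    if G.has_edge(a, b) or G.has_edge(b, a):
        # Remove the direct edge(s) and require an indirect path.
        removed = []
        for u, v in [(a, b), (b, a)]:
            if G.has_edge(u, v):
                G.remove_edge(u, v)
                removed.append((u, v))
        connected = nx.has_path(G, a, b) or nx.has_path(G, b, a)
        G.add_edges_from(removed)  # restore the graph
        return 1 if connected else 0
    return 1  # same wcc, no direct edge to worry about

G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 5)])
wccs = [set(c) for c in nx.weakly_connected_components(G)]
print(belongs_to_same_wcc(G, wccs, 1, 3))  # 1: indirect path 1 -> 2 -> 3 exists
print(belongs_to_same_wcc(G, wccs, 1, 4))  # 0: different components
```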


  • The Adamic/Adar index is a measure designed to predict links in a social network.
  • The neighborhood N(x) is the subset of vertices/nodes connected to x in either direction.
  • Now, let's write the formula for the Adamic/Adar index:

    A(x, y) = Σ over u ∈ N(x) ∩ N(y) of 1 / log |N(u)|

  • Here, we sum over each vertex u belonging to the intersection of the neighborhoods of x and y.

Does X relate to Y?

  • Let's say U1 and U2 are two vertices belonging to the intersection of N(x) and N(y), i.e., both are connected to both x and y.
    U1 has a huge neighborhood, so it is probably a celebrity; this gives minimal evidence that x and y are related.
    U2, on the other hand, has a small neighborhood: a common person like us. A small group could be a college group or a workgroup, so x and y may well be related in this case.
  • As the size of N(u) increases, log |N(u)| increases and 1/log |N(u)| decreases. So the contribution of nodes like U1, which have a large number of neighbors, to the Adamic/Adar index is small, and vice versa.
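A minimal sketch of the Adamic/Adar computation, with the neighborhood N(·) taken as nodes connected in either direction:

```python
import math
import networkx as nx

def adar_index(G, x, y):
    # N(v): neighbours of v in either direction.
    N = lambda v: set(G.successors(v)) | set(G.predecessors(v))
    score = 0.0
    for u in N(x) & N(y):
        deg = len(N(u))
        if deg > 1:  # log(1) = 0 would cause division by zero
            score += 1 / math.log(deg)
    return score

G = nx.DiGraph([(1, 3), (2, 3), (3, 4), (1, 4)])
# The only common neighbour of 1 and 2 is node 3, which has 3 neighbours:
print(adar_index(G, 1, 2))  # 1 / ln(3) ≈ 0.91
```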


If b is following a, then there is a very high probability that a will follow b back.
So if there is an edge from b to a, the function returns 1, else 0. This is called the follows_back feature, and it is a very useful feature on social media like Instagram.
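A minimal sketch of the follows_back feature:

```python
import networkx as nx

# 1 if b already follows a (edge b -> a exists), else 0.
def follows_back(G, a, b):
    return 1 if G.has_edge(b, a) else 0

G = nx.DiGraph([(2, 1)])  # user 2 follows user 1
print(follows_back(G, 1, 2))  # 1: 2 already follows 1
print(follows_back(G, 2, 1))  # 0: 1 does not follow 2
```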


  • It is similar to what we have seen in Google's PageRank, but it is quite an old technique, from 1953.
  • Katz centrality computes the centrality of a node based on the centrality of its neighbors. It is a generalization of eigenvector centrality.
  • It is used to measure a node's relative degree of influence within a graph or social network.
  • The Katz centrality for node i is:

    x_i = α Σ_j A_ij x_j + β

where A is the adjacency matrix of graph G with largest eigenvalue λ_max. The parameter β controls the initial centrality, and α must satisfy α < 1/λ_max.
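A minimal sketch of computing Katz centrality with networkx (the alpha and beta values here are illustrative assumptions; alpha must stay below 1/λ_max):

```python
import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3)])
# Small alpha keeps the iteration well below 1/λ_max, so it converges.
katz = nx.katz_centrality(G, alpha=0.05, beta=1.0)
print({n: round(v, 3) for n, v in katz.items()})
```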


  • HITS stands for Hyperlink-Induced Topic Search (often referred to as hubs and authorities); it is a link analysis algorithm that rates web pages.
  • The HITS algorithm computes two numbers per node: the authority score estimates the node's value based on its incoming links, and the hub score estimates its value based on its outgoing links.
  • Hubs → pages with many outgoing links.
  • Authorities → pages with many incoming links.
  • So, HITS gives every web page wi two scores: <hub, authority>.
  • Step 1: This is an iterative algorithm. Initially, for every page p, auth(p) = 1 and hub(p) = 1.
  • Step 2: Authority update rule: for each p, we update auth(p) to
    auth(p) = Σ hub(q), summed over all pages q that link to page p.
  • Step 3: Hub update rule: for each p, we update hub(p) to
    hub(p) = Σ auth(q), summed over all pages q that p links to.
  • Step 4: If we kept running these update rules alone, the scores would grow without bound. So after each update, we normalize the authority and hub scores so that they do not become infinitely large.
  • We run these steps iteratively; it is mathematically proven that the authority and hub scores eventually converge. So we iterate until the scores no longer change.
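A minimal sketch of computing hub and authority scores with networkx's HITS implementation:

```python
import networkx as nx

# Toy web graph: pages 1, 2, and 4 all link to page 3; page 3 links to 5.
G = nx.DiGraph([(1, 3), (2, 3), (4, 3), (3, 5)])
hubs, auths = nx.hits(G, max_iter=500)

# Page 3 receives the most in-links, so it gets the top authority score.
print(max(auths, key=auths.get))  # 3
```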


  • We are given a graph G with directed edges Ui → Uj.
  • The adjacency matrix represents all of the information in our graph (which has around 1.78 million users) in the form of a matrix.
  • The adjacency matrix changes from graph to graph; let's stick to the graph we have.
  • The adjacency matrix A of G has size 1.78M x 1.78M.
  • It is an n x n matrix whose ij-th cell is 1 if there is a directed edge from Ui to Uj, else 0.
  • In our case, the adjacency matrix is a binary matrix.
  • There are other variations, such as weighted matrices for weighted graphs; for now, let's stick to our binary directed graph.
  • We get a huge and sparse matrix, as most of its values are zero.
  • We will decompose the adjacency matrix using a rank-6 truncated SVD:
    A ≈ U Σ VT
    where U has size 1.78M x 6,
    Σ has size 6 x 6, and
    VT (transpose) has size 6 x 1.78M.
  • The i-th row of U and the i-th column of VT are both 6-dimensional vector representations of the i-th vertex of the graph, obtained from the left singular matrix U and the right singular matrix VT.
  • So, for each pair of vertices where we need to predict an edge, we have 24 features: 6 from U and 6 from VT for Ui, and 6 from U and 6 from VT for Uj.
  • Matrix factorization helps to incorporate implicit feedback, i.e., information that is not given directly but can be derived from analyzing user behavior.
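A minimal sketch of these SVD features on a toy adjacency matrix (the real matrix is 1.78M x 1.78M and would need a sparse truncated SVD such as scipy's svds; a small dense graph stands in here):

```python
import numpy as np
import networkx as nx

# Toy directed graph on 6 nodes, so a full rank-6 SVD fits exactly.
G = nx.DiGraph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)])
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes()))

U, S, Vt = np.linalg.svd(A)
k = 6
U_k, Vt_k = U[:, :k], Vt[:k, :]

def svd_features(i, j):
    # 6 features from U and 6 from V^T for each endpoint: 24 in total.
    return np.concatenate([U_k[i], Vt_k[:, i], U_k[j], Vt_k[:, j]])

print(svd_features(0, 1).shape)  # (24,)
```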


1M represents 1 million. Since Ui is a celebrity, the chance that any two random neighbors of Ui know each other is minimal. But since Uj is not a celebrity but a common person like us, the chance that any two random neighbors of Uj know each other is very high.

How do we encode whether a point is a celebrity or not using just the number of links coming into it?

There is a weight feature “W” for any point Ui:

W(Ui) = 1 / √(1 + |X|)

where X is the set of vertices linking into Ui (for the in-weight) or linking out from Ui (for the out-weight).

For every vertex Ui, we have these 6 features.

  • weight of incoming edges
  • weight of outgoing edges
  • weight of incoming edges + weight of outgoing edges
  • weight of incoming edges * weight of outgoing edges
  • 2*weight of incoming edges + weight of outgoing edges
  • weight of incoming edges + 2*weight of outgoing edges

To determine the similarity of nodes, an edge-weight value is calculated between nodes; the weight decreases as the neighbor count goes up. Since the edges are directed, we calculate the in-weight and out-weight separately.
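A minimal sketch of these weight features; the W = 1/√(1 + |X|) form is an assumption here, chosen to match the "weight decreases as neighbor count goes up" behavior described above:

```python
import math
import networkx as nx

# Assumed weight form: 1 / sqrt(1 + |X|), computed separately per direction.
def weight_in(G, u):
    return 1.0 / math.sqrt(1 + G.in_degree(u))

def weight_out(G, u):
    return 1.0 / math.sqrt(1 + G.out_degree(u))

G = nx.DiGraph([(1, 3), (2, 3), (3, 4)])
w_in, w_out = weight_in(G, 3), weight_out(G, 3)

# The 6 combined features listed above.
features = [w_in, w_out, w_in + w_out, w_in * w_out,
            2 * w_in + w_out, w_in + 2 * w_out]
print([round(f, 3) for f in features])
```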


One well-known concept in social networks is that users with many friends tend to create more connections in the future. This is because, in some networks, as in finance, the rich get richer. We estimate how "rich" our two vertices are by multiplying the numbers of friends (|Γ(x)|) or followers the two vertices have. Note that this similarity index requires no information about the neighbors' own neighborhoods, so it has the lowest computational complexity.

The link between A and C is more probable than the link between A and B as C has many more neighbors than B
The formula for feature Preferential Attachment
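A minimal sketch of the preferential-attachment score |Γ(x)| · |Γ(y)|, with Γ(·) taken as the neighborhood in either direction:

```python
import networkx as nx

def preferential_attachment(G, x, y):
    # |Γ(v)|: number of neighbours of v, ignoring edge direction.
    deg = lambda v: len(set(G.successors(v)) | set(G.predecessors(v)))
    return deg(x) * deg(y)

G = nx.DiGraph([(1, 2), (1, 3), (1, 4), (5, 1), (2, 5)])
# |Γ(1)| = 4 ({2, 3, 4, 5}) and |Γ(2)| = 2 ({1, 5}):
print(preferential_attachment(G, 1, 2))  # 8
```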


svd_dot is the dot product between the source node's and the destination node's SVD features.
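A minimal sketch of svd_dot, with toy 6-dimensional vectors standing in for the real SVD rows:

```python
import numpy as np

# Hypothetical 6-d SVD vectors for a source and a destination node.
src_u = np.array([0.1, 0.2, 0.0, 0.3, 0.1, 0.0])
dst_u = np.array([0.2, 0.1, 0.1, 0.0, 0.3, 0.1])

svd_dot = float(np.dot(src_u, dst_u))
print(round(svd_dot, 2))  # 0.07
```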

So let's look at the features we have

Total features for training


I have used Random Forest and XGBoost and also performed hyperparameter tuning. The confusion matrices for train and test are given below. For the code, refer to my GitHub.

Train confusion_matrix from random forest
Test confusion_matrix from random forest
Train confusion_matrix from XGBOOST
Test confusion_matrix from XGBOOST

The accuracy table is listed below.

table displaying results


By performing this case study, the following observations were made:

1) Initially, we had only a couple of features in our dataset. First, we performed exploratory data analysis on the given dataset, such as the number of followers and followees of each person.

2) Then we generated some data points that were not present in the given dataset, since we had only class-label "1" data.

3) We did some feature engineering on the dataset, such as finding the shortest path, Katz centrality, Jaccard distances, PageRank, preferential attachment, etc.

4) After exploratory data analysis and feature engineering, we split the whole dataset into train and test sets and ran Random Forest and XGBoost, taking the F1-score as our metric.

5) In the end, we plotted the confusion matrices and a pretty-table for both algorithms and found the best hyperparameters.

Thank you so much for going through this long document. Please feel free to comment if there are any suggestions or queries. You can also reach out to me on LinkedIn. Here is a link to my GitHub. Happy reading!




