PCA for Visualization and Dimension Reduction….

Sagor Saha
11 min readJul 28, 2021


Explanation and derivation of PCA from a geometric point of view

source: https://i.stack.imgur.com/lNHqt.gif

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.’’ Among the number of techniques available, we will thoroughly look into PCA (Principal Component Analysis).

Table of Contents:

  1. PCA intuition (geometric)
  2. Column standardization
  3. Optimization problem
  4. Calculation of covariance matrix
  5. Eigen values and Eigen vectors
  6. Interpretation of Eigen values
  7. Summary
  8. Implementation of PCA from scratch
  9. Implementation of PCA using scikit learn
  10. Limitations of PCA
  11. Conclusion
  12. References

PCA intuition:

Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. Let us consider a 2-D dataset with feature1(f1) in the x-axis and feature2(f2) in the y-axis.

Data points on f1 and f2 axis

The figure alongside depicts two features f1 and f2. It can be seen that the variance on f2 is maximum. This indicates that information captured by f2 is greater than f1. If we convert 2D to 1D, we can safely drop f1 as f1 captures less information.

Column Standardization:

Now, let us standardize the data points.

converting data points

such that,

Column standardization

On performing column standardization, the orientation of data points gets shifted to the center. This is not the exact orientation but can be assumed as an approximate one.

Column-centered graph

We observe that both the axis is equally responsible for preserving the knowledge. So, in order to convert into 1D, we have to rotate the f1 axis in a direction such that f1 has maximum variance on it and f2 is perpendicular to f1.

maximum variance position

Now we can drop f2' as f1' has a maximum variance. Hence, 1D conversion is done. You must be thinking, how did I come up with the direction of maximum variance. Well, that's the optimization problem that PCA is going to solve.

Optimization problem:

We now want to find the direction u1 such that the variance of Xi’s projected onto f1' is maximized. Let us understand this statement by looking at the picture below.

data point projected on the plane

here, Xi = datapoint, U1 = unit vector in the direction of maximum variance, and let Xi’s = projection of Xi on U1. U1 is considered a direction in which all Xi is projected. Let's clear our goal

‘‘ Here, we want a unit vector U1 that aligns in the direction of maximum variance. Meaning if any point is projected on U1, the variance is maximized.’’

The projection of data points Xi’s on unit vector U1 is given by:

Now, let us calculate the mean of projection:

After this, let us calculate the Covariance matrix S for d dimensional data with n number of data points, which is given by

The variance of point Xi’s projected on U1 is given by

Calculation of co-variance matrix:

As the data column is standardized, hence the second term is zero. So, our final equation becomes,

Our final objective function for PCA is

Here S is a known quantity, so we need to find a unit vector that maximizes the above objective function.

Eigen values and Eigen vectors:

Using the power of Lagrange multipliers, we can rewrite the above equation as

Taking the derivative of the above equation with respect to U1, we get

U1 is the eigenvector of Covariance matrix S, and lambda is the corresponding eigenvalue. U1 is the unit vector that gives the direction of maximum variance. For a d x d covariance matrix, we get d eigenvector and eigenvalue pair. For a given covariance matrix, every pair of eigenvector Ui, Uj is perpendicular to each other.

Now, let us calculate eigenvalues and eigenvectors of a covariance matrix S; we make use of

We can write the above equation as,

For each eigenvalue, we have a specific eigenvalue equation.

Now, let us find the determinant of the matrix.

Once we get the eigenvalues, we can replace these eigenvalues in equation(4) and solve the equation to get the eigenvector for each eigenvalue.

Now, the question arises on which eigenvalues to select?

From the above equation, we can see that in order to maximize the L.H.S term, we need to find the largest eigenvalues, lambda. In this scenario, the largest eigenvalue corresponds to the direction of maximum variance.

The question arises that how do we convert d dimension to d’ dimension such that d’ <<d and also preserves the maximum information.

We will decompose the original covariance matrix S into eigenvalues and eigenvectors as we know that, for a d x d covariance matrix, we get d eigenvalues and eigenvectors pairs from equation 4.

The above equation is for eigen-decomposition of a matrix where,

here, U is d x d square matrix containing all eigenvectors of S, /\ is a diagonal matrix where non-diagonal elements are 0. The diagonals of matrix /\ consist of all eigenvalues of S. The number of eigenvalues and eigenvectors are proportional to the data. Eigenvectors, Ui are unit vectors whose magnitude is equal to 1. Eigenvectors point in the direction of maximum variance. Eigenvalues stretch the eigenvectors in the direction the vector is pointing or in the opposite direction without rotating it. For example, a negative eigenvalue may reverse the direction of the eigenvector as part of scaling it. Eigenvalues are proportional to the variance.

The total variance is explained by:

Suppose we want to convert d dimension to d’ dimensions such that d’<<d, we need to select the top d’ eigenvalues, which will retain maximum information. We will be more clear of the statement below.

Hence, the amount of information retained can be given by

By taking a subset of the eigenvector and eigenvalue pairs, we can retain maximum information.

Now, the question arises on which pairs of eigenvalues and eigenvectors to choose. Well, for the d x d matrix, we first determine the d eigenvector and eigenvalues using eigen-decomposition. Then we sort these pairs in descending order.

Since the eigenvalues are proportional to the variance retained, we select only the top d’ pairs with the largest eigenvalues.

Now, how do we perform transformation? Well, remember equation 1….

U’ is the final transformation matrix. Here, matrix U’ consists of top d’ eigenvectors of covariance matrix S. We are using linear transformation U on input Xi to transform it into a reduced feature space.

Interpretation of eigenvalues(ƛ):

Let us understand the geometric interpretation of eigenvalues by looking at different cases.


100% data representation on f1'


75% data representation on f1'


60% data representation on f1'


50% data representation on f1'

In the 4th case, if one feature is dropped, we lose 50% of the information. Hence, it is suggested to use top eigenvalues for maximum variance.


  1. Perform column standardization.
  2. Calculate the covariance matrix.
  3. Compute the eigenvalues and eigenvectors from the covariance matrix.
  4. Sort these eigenvalue pairs in descending order of magnitude.
  5. Select the top eigenvalues which retain maximum variance.
  6. Perform data transformation original data by using the eigenvectors corresponding to top eigenvalues.

Implementation of PCA from scratch:

Lets work with digit recognizer dataset. It has 42000 rows and 784 columns. Let me break this dataset. Say, you have dataset of digit images and classes ranging from 0–9. Each image has dimension of 28 x 28 (784). I hope you can correlate now. Each row contains a label from 0–9 and 784 pixel values as a column.

#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# reading the data
d0 = pd.read_csv('train.csv')
# save the labels into a variable l.
labels = d0['label']
# Drop the label feature and store the pixel data in d.
data = d0.drop("label",axis=1)
# Finding the size of data

The data size appears to be (42000, 784) and (42000,). Let us check an index from the dataset.

# display or plot a number.
idx = 1
# reshape from 1d to 2d pixel array
grid_data = data.iloc[idx].to_numpy().reshape(28,28)
plt.imshow(grid_data, interpolation = "none", cmap = "gray")
The value stored in the first index

Performing column standardization

# Data-preprocessing: Standardizing the datafrom sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
(42000, 784)

Followed by Covariance matrix

#find the co-variance matrix which is : A^T * A
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T , sample_data)
print ( "The shape of variance matrix = ", covar_matrix.shape)The shape of variance matrix = (784, 784)

Now, finding the top two eigen-values and corresponding eigen-vectors and
projecting onto a 2-Dim space.

from scipy.linalg import eigh# the parameter 'eigvals' is defined (low value to heigh value) 
# eigh function will return the eigen values in asending order
# this code generates only the top 2 (782 and 783) eigenvalues.
values, vectors = eigh(covar_matrix, eigvals=(782,783))
print("Shape of eigen vectors = ",vectors.shape)
# converting the eigen vectors into (2,d) shape for easyness of further computations
vectors = vectors.T
print("Updated shape of eigen vectors = ",vectors.shape)Shape of eigen vectors = (784, 2)
Updated shape of eigen vectors = (2, 784)

Here the vectors[1] represent the eigen vector corresponding 1st principal eigen vector and vectors[0] represent the eigen vector corresponding 2nd principal eigen vector.

Projecting the original data sample on the plane formed by two principal eigen vectors by vector-vector multiplication.

import matplotlib.pyplot as plt
new_coordinates = np.matmul(vectors, sample_data.T)
print (" resultant new data points' shape ", vectors.shape, "X", sample_data.T.shape," = ", new_coordinates.shape)resultant new data points' shape (2, 784) X (784, 42000) = (2, 42000)

Lets have a look 2 component values

import pandas as pd# appending label to the 2d projected data
new_coordinates = np.vstack((new_coordinates, labels)).T
# creating a new data frame for ploting the labeled points.
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
1st_principal 2nd_principal label
0 -5.226445 -5.140478 1.0
1 6.032996 19.292332 0.0
2 -1.705813 -7.644503 1.0
3 5.836139 -0.474207 4.0
4 6.024818 26.559574 0.0

Plot the 2D data with seaborn

import seaborn as sn
sn.FacetGrid(dataframe, hue="label", size=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()

Implementation of PCA using Scikit-learn:

Lets perform the same operation with Scikit-learn

# initializing the pca
from sklearn import decomposition
pca = decomposition.PCA()
# configuring the parameteres
# the number of components = 2
pca.n_components = 2
pca_data = pca.fit_transform(sample_data)
# pca_reduced will contain the 2-d projects of simple data
print("shape of pca_reduced.shape = ", pca_data.shape)
# attaching the label for each 2-d data point
pca_data = np.vstack((pca_data.T, labels)).T
# creating a new data fram which help us in ploting the result data
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
sn.FacetGrid(pca_df, hue="label", size=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()

We can also check the maximum features needed to receive higher accuracy.

# PCA for dimensionality redcution (non-visualization)pca.n_components = 784
pca_data = pca.fit_transform(sample_data)
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_);cum_var_explained = np.cumsum(percentage_var_explained)# Plot the PCA spectrum
plt.figure(1, figsize=(6, 4))
plt.plot(cum_var_explained, linewidth=2)

If we take 200-dimensions, approx. 90% of variance is expalined.

Limitations of PCA:


We can see that ƛ1 = ƛ2 or almost equal, so if we drop any one feature, there will be maximum loss of information, which is not desirable.


PCA does not take local structure present in the data into consideration. Both the cluster C1 and C2 will be projected in the same area on f1'. So, it will be difficult to make out the class label while transforming.


Looking at distribution of datasets, it is found that they are non-linear in nature. But PCA relies on linear assumptions. It assumes the data points are linearly correlated. If we project all the points on f1' and convert from 2D to 1D, it will lose entire structure of the data.


PCA is simple algorithm which assumes linearly correlated dataset and convert high dimension data into low dimension. It is observed that PCA performs good during conversion of data but not in visualization. However, there are better algorithm like T-SNE, which assumes non- linear dataset and preserves local structure of the dataset.

Thank you for going through this long chapter. Please feel free to comment if there is any suggestions or queries. You can also knock me in Linkedin. Happy reading……….




Sagor Saha

Machine learning enthusiast