PCA for Visualization and Dimension Reduction

source: https://i.stack.imgur.com/lNHqt.gif

Table of Contents:

  1. PCA intuition (geometric)
  2. Column standardization
  3. Optimization problem
  4. Calculation of covariance matrix
  5. Eigenvalues and eigenvectors
  6. Interpretation of eigenvalues
  7. Summary
  8. Implementation of PCA from scratch
  9. Implementation of PCA using scikit-learn
  10. Limitations of PCA
  11. Conclusion
  12. References

PCA intuition:

[Figure: data points on the f1 and f2 axes]

Column Standardization:

[Figures: converting data points; column standardization; column-centered graph; maximum variance position]
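
Column standardization shifts every feature (column) to mean 0 and rescales it to standard deviation 1, so that no feature dominates just because of its units. Here is a minimal sketch on a made-up array; the helper name `standardize_columns` and the data values are illustrative assumptions, not part of the original post:

```python
import numpy as np

def standardize_columns(X):
    """Return X with every column shifted to mean 0 and scaled to std 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Guard against constant columns, whose standard deviation is 0.
    std[std == 0] = 1.0
    return (X - mean) / std

# Hypothetical data: 4 points with two features, f1 and f2.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_std = standardize_columns(X)
print(X_std.mean(axis=0))  # each column mean is (numerically) 0
print(X_std.std(axis=0))   # each column standard deviation is 1
```

After this step both features live on the same scale, which is what lets PCA compare their variances fairly.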

Optimization problem:

[Figure: data point projected on the plane]

Calculation of covariance matrix:
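
For data whose columns have been centered, the sample covariance matrix is the matrix product of the transposed data with the data, divided by n - 1. The sketch below checks this against `np.cov` on randomly generated data; the data, the seed, and the variable names are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 points, 3 features.
X = rng.normal(size=(200, 3))

# Center each column so that X.T @ X measures spread around the mean.
X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]

# Sample covariance matrix: (X^T X) / (n - 1), shape (3, 3).
S = X_centered.T @ X_centered / (n - 1)

# np.cov treats rows as variables by default, hence rowvar=False.
S_numpy = np.cov(X, rowvar=False)
print(np.allclose(S, S_numpy))
```

Dropping the division only multiplies every eigenvalue by the same constant; the eigenvectors, and therefore the principal directions, stay the same.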

Eigenvalues and eigenvectors:

Interpretation of eigenvalues (λ):

[Figures: 100%, 75%, 60%, and 50% data representation on f1']
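
The share of the variance kept along a principal direction is its eigenvalue divided by the sum of all eigenvalues. A small sketch with made-up eigenvalues (the numbers are assumptions chosen to give a round result):

```python
import numpy as np

# Hypothetical eigenvalues of a 2x2 covariance matrix, largest first.
eigenvalues = np.array([3.0, 1.0])

# Fraction of the total variance carried by each principal direction.
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)  # [0.75 0.25]
```

With these values, projecting onto the first direction alone keeps 75% of the variance.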


Summary:

  1. Perform column standardization.
  2. Calculate the covariance matrix.
  3. Compute the eigenvalues and eigenvectors of the covariance matrix.
  4. Sort the eigenvalue-eigenvector pairs in descending order of eigenvalue.
  5. Select the top eigenvalues, which retain the most variance.
  6. Transform the original data using the eigenvectors that correspond to the top eigenvalues.
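
The six steps above can be sketched end to end on a small synthetic dataset. This is a sketch under stated assumptions rather than a definitive implementation: the function name `pca_from_steps`, the random data, and the choice of dividing by n in the covariance are all illustrative.

```python
import numpy as np

def pca_from_steps(X, k):
    """Project X (n samples x d features) onto its top-k principal directions."""
    X = np.asarray(X, dtype=float)
    # 1. Column standardization (assumes no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data, shape (d, d).
    S = Z.T @ Z / Z.shape[0]
    # 3. Eigenvalues and eigenvectors; eigh is meant for symmetric matrices
    #    and returns the eigenvalues in ascending order.
    values, vectors = np.linalg.eigh(S)
    # 4. Sort the pairs in descending order of eigenvalue.
    order = np.argsort(values)[::-1]
    values, vectors = values[order], vectors[:, order]
    # 5. Keep the top-k eigenvectors (stored as columns).
    top_vectors = vectors[:, :k]
    # 6. Transform the data: one row per sample, one column per component.
    return Z @ top_vectors, values

rng = np.random.default_rng(42)
# Hypothetical data: 100 points, 3 features, the second built to follow the first.
f1 = rng.normal(size=100)
X = np.column_stack([f1, 2 * f1 + 0.1 * rng.normal(size=100), rng.normal(size=100)])

projected, values = pca_from_steps(X, k=2)
print(projected.shape)        # (100, 2)
print(values / values.sum())  # share of variance per direction, largest first
```

Because two of the three features are almost perfectly correlated, most of the variance ends up in the first component.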

Implementation of PCA from scratch:

# importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# reading the data
d0 = pd.read_csv('train.csv')
# save the labels into a variable called labels
labels = d0['label']
# drop the label column and store the pixel data in data
data = d0.drop("label", axis=1)
# finding the size of the data
print(data.shape)

# display or plot a number
idx = 1
# reshape from a 1d array of 784 pixels to a 2d 28x28 pixel array
grid_data = data.iloc[idx].to_numpy().reshape(28, 28)
plt.imshow(grid_data, interpolation="none", cmap="gray")
plt.show()
# the value stored at this index
print(labels[idx])

# Data-preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)
# output: (42000, 784)
# find the covariance matrix, which is A^T * A up to a constant factor of 1/n
# (the factor rescales the eigenvalues but leaves the eigenvectors unchanged)
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data)
print("The shape of variance matrix = ", covar_matrix.shape)
# output: The shape of variance matrix = (784, 784)
from scipy.linalg import eigh
# the parameter 'subset_by_index' takes the indices (low value to high value) of the
# eigenvalues to return; it replaces 'eigvals', which newer SciPy releases no longer accept
# the eigh function returns the eigenvalues in ascending order
# this code computes only the top 2 eigenvalues (indices 782 and 783)
values, vectors = eigh(covar_matrix, subset_by_index=[782, 783])
print("Shape of eigen vectors = ", vectors.shape)
# converting the eigenvectors into (2,d) shape for ease of further computation
vectors = vectors.T
print("Updated shape of eigen vectors = ", vectors.shape)
# output:
# Shape of eigen vectors = (784, 2)
# Updated shape of eigen vectors = (2, 784)
# projecting the original data onto the two eigenvectors
new_coordinates = np.matmul(vectors, sample_data.T)
print("resultant new data points' shape ", vectors.shape, "X", sample_data.T.shape, " = ", new_coordinates.shape)
# output: resultant new data points' shape (2, 784) X (784, 42000) = (2, 42000)
# appending label to the 2d projected data
new_coordinates = np.vstack((new_coordinates, labels)).T
# creating a new data frame for plotting the labeled points.
# eigh returned the eigenvalues in ascending order, so the first column is the
# projection onto the 2nd principal direction and the second column onto the 1st
dataframe = pd.DataFrame(data=new_coordinates, columns=("2nd_principal", "1st_principal", "label"))
print(dataframe.head())
# output:
#    2nd_principal  1st_principal  label
# 0      -5.226445      -5.140478    1.0
# 1       6.032996      19.292332    0.0
# 2      -1.705813      -7.644503    1.0
# 3       5.836139      -0.474207    4.0
# 4       6.024818      26.559574    0.0
import seaborn as sn
# 'height' sets the size of the plot (older seaborn versions called it 'size')
sn.FacetGrid(dataframe, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()

Implementation of PCA using Scikit-learn:

# initializing the pca
from sklearn import decomposition
pca = decomposition.PCA()
# configuring the parameters
# the number of components = 2
pca.n_components = 2
pca_data = pca.fit_transform(sample_data)
# pca_data will contain the 2-d projections of sample_data
print("shape of pca_data = ", pca_data.shape)
# attaching the label for each 2-d data point
pca_data = np.vstack((pca_data.T, labels)).T
# creating a new data frame which helps us in plotting the result data
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
# 'height' sets the size of the plot (older seaborn versions called it 'size')
sn.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
# PCA for dimensionality reduction (non-visualization)
pca.n_components = 784
pca_data = pca.fit_transform(sample_data)
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)
# Plot the PCA spectrum
plt.figure(1, figsize=(6, 4))
plt.plot(cum_var_explained, linewidth=2)
plt.xlabel('n_components')
plt.ylabel('cumulative explained variance')
plt.show()
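
The cumulative curve can also be read off in code to pick how many components to keep. The sketch below uses made-up explained variances in place of `pca.explained_variance_`, and the 90% target is an assumed threshold, not a value from the original post:

```python
import numpy as np

# Hypothetical explained variances, largest first (in the post these would
# come from pca.explained_variance_).
explained_variance = np.array([4.0, 2.0, 1.0, 0.5, 0.5])
percentage_var_explained = explained_variance / explained_variance.sum()
cum_var_explained = np.cumsum(percentage_var_explained)

# Smallest number of components whose cumulative share reaches the target.
target = 0.90  # assumed threshold; choose what suits the task
n_components = int(np.argmax(cum_var_explained >= target)) + 1
print(n_components)  # 4
```

Here the first three components cover 87.5% of the variance, so a fourth is needed to pass the 90% mark.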

Limitations of PCA:






Machine learning enthusiast

Sagor Saha
