How to perform clustering with dimensionality reduction for large datasets

Tina Akiiki
3 min read · Jun 9, 2022

This post demonstrates how to perform clustering with dimensionality reduction using Principal Component Analysis (PCA), a technique I learnt while completing the Introduction to Machine Learning course by Udacity.

PCA transforms a large set of features into a smaller set of components that still retains most of the information in the original dataset, which allows ML algorithms to run faster with minimal loss in accuracy.

For the demonstration, I used data from the India Census 2011 on Kaggle, which contains raw counts per district for a variety of features such as male and female population, housing type, and source of water, among others.

I began by exploring the dataset and found that it has 118 features and no missing values, so no preprocessing was needed to handle missing values.
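A minimal sketch of this exploration step with pandas (the CSV filename is an assumption, not taken from the original notebook):

```python
import pandas as pd

# Load the India Census 2011 district-level data
# (filename is an assumption; use whatever the Kaggle download is called)
census = pd.read_csv("india_census_2011.csv")

print(census.shape)                 # number of districts x 118 features
print(census.isnull().sum().sum())  # total missing values -> 0, so no imputation needed
```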

Since PCA works with numerical data, I checked for categorical features so that I could encode them before scaling the data:
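A quick way to do this check, assuming the text columns are stored with the pandas object dtype:

```python
# Identify the non-numeric columns that need encoding before scaling and PCA
categorical_cols = census.select_dtypes(include="object").columns
print(categorical_cols)  # expected: the state name and district name columns
```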

I used label encoding to map each state name and district name to an integer, as opposed to one-hot encoding, which would create a separate 0/1 column for every unique state and district name, greatly increasing the number of columns and slowing down the clustering algorithm.
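A sketch of the encoding step with scikit-learn (the column names "State name" and "District name" are assumptions about the Kaggle file):

```python
from sklearn.preprocessing import LabelEncoder

# Map each state and district name to a single integer code instead of
# one-hot encoding, to avoid adding hundreds of extra 0/1 columns
for col in ["State name", "District name"]:  # column names are assumptions
    census[col + "_encoded"] = LabelEncoder().fit_transform(census[col])
```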

Next, I dropped the original text columns (keeping their label-encoded versions) and scaled the dataset using MinMaxScaler, which rescales every feature, including the large integer label codes, onto the same [0, 1] range. Scaling the values is important because PCA creates new dimensions based on the variance in the values, which can be skewed if some features have much larger values than others, e.g. average age versus the number of women in a district.
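Continuing the earlier sketch, this step could look as follows (column names are again assumptions):

```python
from sklearn.preprocessing import MinMaxScaler

# Keep only numeric columns: the original text columns are dropped,
# their label-encoded versions remain
features = census.drop(columns=["State name", "District name"])

# Rescale every feature to the [0, 1] range before running PCA
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)
```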

I then ran PCA with 80 components and computed the cumulative explained variance across the components, in order to choose the optimal number of components. I found that by using only 10 components, I could still retain a dataset that explained 90% of the variability in the larger dataset.
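A minimal sketch of how the cumulative explained variance can be computed with scikit-learn, continuing the variables from the earlier snippets:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with a large number of components first, then inspect how much
# variance is explained as components are added
pca = PCA(n_components=80)
pca.fit(scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 90% of the variance
n_components = int(np.argmax(cumulative_variance >= 0.90)) + 1
print(n_components, cumulative_variance[n_components - 1])
```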

I then re-fitted PCA with the selected number of components, to create the reduced dataset used for the clustering process.
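For example:

```python
# Re-fit PCA with the chosen number of components and project the scaled data
pca = PCA(n_components=10)
reduced = pca.fit_transform(scaled)
```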

Using the 10 components, I selected the optimal number of clusters from an elbow plot of the KMeans inertia over a range of cluster counts, and then fitted the KMeans model on the PCA dimensions.
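A sketch of the elbow plot and the final fit, assuming the elbow suggests two clusters as described below:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow plot: inertia for a range of cluster counts
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(reduced).inertia_
    for k in range(1, 11)
]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# Fit the final model with the number of clusters read off the elbow (2 here)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(reduced)
```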

In order to interpret the differences between the two clusters, I used inverse transforms to reverse the PCA projection and the scaling, and then compared the mean value of each feature between the clusters. Inverse transforms are useful for interpreting results because the PCA dimensions themselves are hard to interpret directly.
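A minimal sketch, reusing the scaler, PCA model, and cluster labels from the earlier snippets:

```python
# Undo the PCA projection, then undo the scaling, to get values back
# in the original feature units for each district
restored = scaler.inverse_transform(pca.inverse_transform(reduced))

restored_df = pd.DataFrame(restored, columns=features.columns)
restored_df["cluster"] = labels

# Compare the average value of every feature across the two clusters
print(restored_df.groupby("cluster").mean())
```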

After performing the inverse transform above, I noticed that districts in the first cluster typically have a larger female population than those in the second cluster. This could be an insight to drill into further in a typical population segmentation project.

This concludes this post on how to use dimensionality reduction in a clustering project.

