How to Perform a Cluster Analysis in R

Written by Coursera Staff • Updated on

Building skills in data analysis techniques, such as cluster analyses, can help you analyze and interpret information more effectively. Learn what a cluster analysis is and how to perform your own.

[Featured image] A marketer uses cluster analysis in R to identify specific groups of customers.

Key takeaways

  • To conduct a cluster analysis in R, you prepare your data, normalize it, choose your variables, select a cluster method, and visualize the clusters.

  • Three types of cluster analysis in R are hierarchical clustering, k-means clustering, and density-based spatial clustering of applications with noise (DBSCAN).

  • To use R for your cluster analysis, first install R, then set up the environment, and learn a few basic commands.

Explore different types of cluster analyses, how to start learning R, and the basic steps to perform a cluster analysis with this programming language and software.

What is cluster analysis?

Cluster analysis is a data science technique for grouping structured or unstructured data so that the objects within each group (cluster) are more similar to one another than to the objects in other groups. This technique is popular with professionals in various fields, including marketing, biology, and social sciences, for uncovering patterns and relationships in data.

You can perform cluster analysis with statistical programming languages such as SAS and R. One benefit of R is that it’s a free, open-source programming environment specifically designed for statistical computing and graphics. Using R software can make cluster analysis more straightforward thanks to a comprehensive set of packages and functions. These tools simplify the process of clustering and interpreting complex data sets.


Cluster analysis examples in R: Different types

When performing a cluster analysis, you can use a few different methods in R. Three of the most popular methods are as follows.

1. Hierarchical clustering 

This method builds a hierarchy of clusters, either by starting with individual points and merging them into progressively larger clusters (agglomerative) or by starting with the entire data set and splitting it into smaller clusters (divisive). Agglomerative clustering is typically a good choice if you want to identify small clusters, while divisive clustering is better if you are looking for large clusters. The result is usually shown as a tree diagram called a dendrogram, which you cut at a chosen level to obtain the final clusters. How clusters are merged depends on the linkage criterion you select, such as single, complete, average, or Ward's linkage. For a given data set, distance measure, and linkage criterion, the method returns the same hierarchy every time, which you may want to consider if reproducibility matters for your purpose.

2. K-means cluster analysis in R 

K-means is one of the most popular clustering methods. When using this approach, you specify the number of clusters you want, which is your “k” value. The algorithm then assigns each object to the cluster whose mean (centroid) is nearest, recalculates each centroid from its assigned objects, and repeats these two steps until the assignments stop changing. Each pass reduces the variance within the clusters. Because the starting centroids are usually chosen at random, results can differ between runs unless you fix a random seed. Another limitation of this approach is that it is sensitive to outliers, so it’s important that you understand the structure of your data before deciding on the approach.
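The assign-and-update loop described above can be sketched in a few lines of base R. This is a minimal illustration rather than a production implementation: the built-in iris data set, the choice of k = 3, the seed, and the 20-iteration cap are all assumptions made for the example. In practice you would normally call R's built-in kmeans() function instead.

```r
# Minimal k-means sketch on the built-in iris measurements (illustrative choices).
set.seed(42)                                   # fix the random starting centroids
x <- scale(iris[, 1:4])                        # standardized numeric columns
k <- 3                                         # assumed number of clusters

# Start from k randomly chosen rows as initial centroids.
centroids <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:20) {
  # Squared Euclidean distance from every point to every centroid (n x k matrix).
  d <- sapply(1:k, function(j) rowSums(sweep(x, 2, centroids[j, ])^2))

  # Assignment step: each point joins its nearest centroid.
  assignment <- max.col(-d, ties.method = "first")

  # Update step: each centroid moves to the mean of its members.
  new_centroids <- centroids
  for (j in 1:k) {
    members <- x[assignment == j, , drop = FALSE]
    if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
  }

  # Stop once the centroids no longer move.
  if (all(abs(new_centroids - centroids) < 1e-9)) break
  centroids <- new_centroids
}

table(assignment)                              # cluster sizes
```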

3. Density-based spatial clustering of applications with noise (DBSCAN) 

DBSCAN groups points that sit close together in high-density regions and labels points in low-density regions as outliers or noise. This helps you see where the values in your data cluster together. With DBSCAN, you don’t need to specify the number of clusters. However, your results depend on two parameters: the neighborhood radius (commonly called eps) and the minimum number of points required to form a dense region (commonly called minPts).

Getting started with a cluster analysis in R

If you decide to use R for your cluster analysis, start by installing R, then set up your environment, and finally learn a few basic commands.

  • Installing R: First, you need to install R from the Comprehensive R Archive Network (CRAN). You can also install RStudio, a popular integrated development environment for R.

  • Setting up: Once installed, you can set up your environment by installing packages like cluster, factoextra, and dendextend, which are commonly used for clustering and data visualization.

  • Learning basic commands: Next, familiarize yourself with basic R commands and syntax. For cluster analysis, understanding data import (read.csv, read.table), data manipulation (such as handling missing values), and basic statistical functions may benefit you.
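As a rough sketch of the setup and import steps above, the following writes a tiny CSV file to a temporary location and reads it back, so the example runs without any external file. The column names and values are invented for illustration, and the install.packages() line is commented out because it needs an internet connection and only has to run once.

```r
# One-time package setup (uncomment to run; requires internet access):
# install.packages(c("cluster", "factoextra", "dendextend"))

# Create a small example CSV so the import step has something to read.
# The columns "id" and "score" are hypothetical.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, NA, 4.1)),
          csv_path, row.names = FALSE)

# Import the data and take a first look at it.
df <- read.csv(csv_path)
str(df)                 # column types and a preview of values
summary(df)             # basic statistics, including the count of NAs
colSums(is.na(df))      # number of missing values per column
```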

How to do a cluster analysis in R

While individual steps may vary depending on your analysis needs, following these basic steps is a good starting point. Consider the following method to prepare your data and perform a basic cluster analysis. 

1. Prepare your data.

Before diving into cluster analysis in R, prepare your data correctly so that your results are meaningful. Start by cleaning your data. This involves dealing with missing values, organizing your columns and rows, and correcting errors such as duplicates. In R, na.omit() removes rows with missing values, duplicated() identifies duplicate rows, and unique() removes them.
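A small sketch of these cleaning functions, using a made-up data frame (the height and weight columns are placeholders, not data from this article):

```r
# Hypothetical raw data with two missing values and one repeated row.
raw <- data.frame(
  height = c(1.70, 1.82, NA, 1.65, 1.82),
  weight = c(65, 80, 72, NA, 80)
)

complete  <- na.omit(raw)          # drop rows that contain any NA
dup_flags <- duplicated(complete)  # TRUE for rows that repeat an earlier row
clean     <- unique(complete)      # keep only the first copy of each row

print(dup_flags)
print(clean)
```

Dropping rows is only one way to handle missing values. Depending on your data, imputing them may be more appropriate.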

2. Normalize your data.

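Clustering algorithms that rely on distances are sensitive to the scale of your variables: a variable measured in large units can dominate one measured in small units. Normalizing your data lets each feature contribute equally to the results. In R, the scale() function standardizes each column to a mean of zero and a standard deviation of one by default.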
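For instance, standardizing the four numeric columns of the built-in iris data set (chosen here only as a convenient example) might look like this:

```r
# Standardize the numeric columns; the species label is left out.
features <- iris[, 1:4]
scaled   <- scale(features)        # center to mean 0, scale to sd 1

round(colMeans(scaled), 10)        # column means after scaling
apply(scaled, 2, sd)               # column standard deviations after scaling
```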

3. Choose your variables.

Once you clean and normalize your data, you’ll choose relevant variables for clustering. You can base this selection on domain knowledge to avoid irrelevant variables, or use statistical methods like principal component analysis (PCA) to see which variables account for most of the variation in your data.
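One way to explore this with PCA is base R's prcomp() function. The sketch below again assumes the built-in iris data. The loadings in the rotation matrix show how strongly each original variable contributes to each principal component.

```r
# PCA on the standardized iris measurements.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: contribution of each original variable to each component.
round(pca$rotation, 3)
```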

4. Decide on your cluster method.

For hierarchical clustering, first compute a distance matrix with the dist() function, then pass it to hclust() to build the hierarchy.
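A minimal agglomerative example with these two functions is shown below. The iris data, Ward's linkage ("ward.D2"), and the choice to cut the tree into three groups are assumptions made for illustration.

```r
# Agglomerative hierarchical clustering on standardized iris measurements.
features <- scale(iris[, 1:4])
d  <- dist(features, method = "euclidean")   # pairwise distance matrix
hc <- hclust(d, method = "ward.D2")          # build the hierarchy

groups <- cutree(hc, k = 3)                  # cut the tree into 3 clusters
table(groups)                                # cluster sizes

# plot(hc) draws the dendrogram in an interactive session.
```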

For k-means clustering, you can use the kmeans() function that ships with R, which requires you to specify the number of clusters. The factoextra package adds helper functions for choosing that number and visualizing the results.
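A short k-means sketch on the same kind of data follows. The seed, centers = 3, and nstart = 25 are illustrative choices rather than recommendations.

```r
# k-means with base R's kmeans() on standardized iris measurements.
set.seed(123)                                # make the random starts repeatable
features <- scale(iris[, 1:4])
km <- kmeans(features, centers = 3, nstart = 25)  # 25 random starts, keep the best

table(km$cluster)                            # cluster sizes
km$centers                                   # centroid of each cluster
km$tot.withinss                              # total within-cluster sum of squares
```

If factoextra is installed, its fviz_nbclust() and fviz_cluster() functions can help you pick the number of clusters and plot the result.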

For DBSCAN, you can use the dbscan package in R for basic DBSCAN functions. It’s good for data containing clusters of similar density.
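The sketch below assumes the dbscan package is installed and skips the clustering step if it is not. The eps and minPts values are placeholders picked for this example. Suitable values depend on your data.

```r
# DBSCAN on standardized iris measurements (runs only if dbscan is installed).
features <- scale(iris[, 1:4])
db_fit <- NULL

if (requireNamespace("dbscan", quietly = TRUE)) {
  # eps = neighborhood radius, minPts = minimum points for a dense region.
  # Both values here are illustrative assumptions.
  db_fit <- dbscan::dbscan(features, eps = 0.6, minPts = 5)

  # Cluster label 0 marks points treated as noise.
  print(table(db_fit$cluster))
}
```

The dbscan package also provides kNNdistplot(), which plots nearest-neighbor distances and can help you choose a value for eps.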

5. Visualize your clusters.

Visualization is a key step in interpreting the results of cluster analysis. You can use charts such as scatter plots, dendrograms, pie charts, bar plots, and pair plots to visualize clusters. To make these visualizations, you can use ggplot2, a visualization package for R, to create sophisticated graphics customized to your needs.

With ggplot2, you can enhance scatter plots with color to represent clusters and use faceting to display multiple dimensions of data. Visualizations help assess the clustering tendency of your data, understand the shape and size of clusters, and identify any outliers or anomalies.
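As a sketch of a color-coded cluster scatter plot, the following code runs k-means on the iris data and builds a ggplot2 chart only if the package is installed. The data set, the two plotted variables, and the hypothetical plot_data column names are all choices made for this example.

```r
# Cluster the data, then color a scatter plot by cluster membership.
set.seed(123)
features <- scale(iris[, 1:4])
km <- kmeans(features, centers = 3, nstart = 25)

plot_data <- data.frame(
  petal_length = iris$Petal.Length,
  petal_width  = iris$Petal.Width,
  cluster      = factor(km$cluster)          # factor so colors are discrete
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    plot_data,
    ggplot2::aes(x = petal_length, y = petal_width, color = cluster)
  ) +
    ggplot2::geom_point() +
    ggplot2::labs(x = "Petal length", y = "Petal width", color = "Cluster")
  # print(p) draws the plot in an interactive session.
  # Adding ggplot2::facet_wrap(~ cluster) would show one panel per cluster.
}
```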

Which graph is best to show correlation?

A scatter plot is the most effective graph for illustrating the correlation between two variables. You can use a scatter plot to identify patterns within data. These patterns can help you better understand the relationships that might exist between the variables within the data. For example, you might create a scatter plot to determine if higher temperatures correlate with an increase in ice cream sales.

