
Text mining helps us extract useful information from large amounts of text. R is a strong tool for text mining because it has many packages designed for this purpose. These packages help you clean, analyze, and visualize text.
Installing and loading R packages
First, you need to install these packages. You can do this with simple commands in R. Here are some key packages to install:
- tm (text mining): Provides text preprocessing and mining tools.
- textclean: Used to clean and prepare text data for analysis.
- wordcloud: Generates word cloud visualizations of text data.
- SnowballC: Provides stemming tools (reducing words to their root forms).
- ggplot2: A widely used data visualization package.
Install the necessary packages using the following commands:
install.packages("tm")
install.packages("textclean")
install.packages("wordcloud")
install.packages("SnowballC")
install.packages("ggplot2")
Load them into your R session after installation:
library(tm)
library(textclean)
library(wordcloud)
library(SnowballC)
library(ggplot2)
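As a convenience, the install-and-load steps above can be collapsed into one snippet that installs only the packages that are missing. A minimal sketch using base R only:

```r
pkgs <- c("tm", "textclean", "wordcloud", "SnowballC", "ggplot2")

# Install any packages that are not yet available, then load them all
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```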
Data collection
Text mining requires raw text data. Here’s how to import a CSV file in R:
# Read the CSV file
text_data <- read.csv("IMDB_dataset.csv", stringsAsFactors = FALSE)
# Extract the column containing the text
text_column <- text_data$review
# Create a corpus from the text column
corpus <- Corpus(VectorSource(text_column))
# Display the first document in the corpus
corpus[[1]]$content
Text preprocessing
Raw text requires cleaning before analysis. First, convert all text to lowercase and remove punctuation and numbers. Then remove common stopwords that add little meaning, and stem the remaining words to their base forms. Finally, clean up any extra whitespace. Here is a typical preprocessing pipeline in R:
# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Stem words
corpus <- tm_map(corpus, stemDocument)
# Remove white space
corpus <- tm_map(corpus, stripWhitespace)
# Display the first document in the preprocessed corpus
corpus[[1]]$content
Creating a Document Term Matrix (DTM)
After preprocessing the text, create a document-term matrix (DTM). A DTM is a table that counts how often each term occurs in each document.
# Create Document-Term Matrix
dtm <- DocumentTermMatrix(corpus)
# View matrix summary
inspect(dtm)
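On a large corpus, inspect() prints a great deal of output. The tm package's findFreqTerms() is a handier way to list only the terms above a frequency threshold. A minimal sketch on a small made-up corpus (the sentences below are illustrative, not from the IMDB data):

```r
library(tm)

# A tiny corpus for illustration
docs <- c("data mining finds patterns in data",
          "text mining extracts information from text",
          "mining text data reveals patterns")
small_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Terms that occur at least 3 times across all documents
findFreqTerms(small_dtm, lowfreq = 3)
```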
Visualization of results
Visualization helps you better understand the results. Word clouds and bar charts are popular methods for visualizing text data.
Word cloud
One popular way to visualize word frequency is a word cloud, which displays the most common words in a larger font. This makes it easy to see which words are most important.
# Convert DTM to matrix
dtm_matrix <- as.matrix(dtm)
# Get word frequencies
word_freq <- sort(colSums(dtm_matrix), decreasing = TRUE)
# Create word cloud
wordcloud(names(word_freq), freq = word_freq, min.freq = 5, colors = brewer.pal(8, "Dark2"), random.order = FALSE)
Bar chart
After you create a document term matrix (DTM), you can visualize word frequency in a bar chart. This will display the most common terms used in your text data.
library(ggplot2)
# Get word frequencies
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
# Convert word frequencies to a data frame for plotting
word_freq_df <- data.frame(term = names(word_freq), freq = word_freq)
# Sort the word frequency data frame by frequency in descending order
word_freq_df_sorted <- word_freq_df[order(-word_freq_df$freq), ]
# Filter for the top 5 most recurrent words
top_words <- head(word_freq_df_sorted, 5)
# Create a bar chart of the top words
ggplot(top_words, aes(x = reorder(term, -freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top 5 Word Frequencies", x = "Terms", y = "Frequency")
Topic modeling with LDA
Latent Dirichlet Allocation (LDA) is a common topic modeling technique. It finds hidden topics in large collections of text. The topicmodels package in R helps you apply LDA.
library(topicmodels)
# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
# Apply LDA
lda_model <- LDA(dtm, k = 5)
# View topics
topics <- terms(lda_model, 10)
# Display the topics
print(topics)
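Beyond the top terms per topic, the topicmodels package also exposes per-document topic probabilities via posterior(). A minimal sketch on a tiny made-up corpus (the four documents and k = 2 are illustrative assumptions, not the IMDB data):

```r
library(tm)
library(topicmodels)

# Tiny illustrative corpus: two documents about pets, two about finance
docs <- c("cats dogs pets animals cats",
          "stocks market trading finance stocks",
          "dogs animals pets cats dogs",
          "finance market stocks trading bonds")
dtm_small <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Fit a two-topic model
lda_small <- LDA(dtm_small, k = 2, control = list(seed = 42))

# Per-document topic probabilities: one row per document, one column per topic
round(posterior(lda_small)$topics, 2)

# The single most likely topic for each document
topics(lda_small)
```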
Conclusion
Text mining is an effective way to gather insights from text, and R offers many useful tools and packages for this purpose. You can easily clean and prepare text data, then analyze it and visualize the results. You can also explore hidden topics using methods such as LDA. Overall, R makes it straightforward to extract valuable information from text.
Jayita Gulati is a machine learning enthusiast and technical writer with a passion for building machine learning models. She holds a Master's degree in Computer Science from the University of Liverpool.
