
Text mining helps us extract useful information from large amounts of text. R is a strong tool for text mining because it has many packages designed for this purpose. These packages help you clean, analyze, and visualize text.
Installing and loading R packages
First, you need to install these packages. You can do this with simple commands in R. Here are some key packages to install:
- tm (text mining): Provides text preprocessing and mining tools.
- textclean: Used to clean and prepare text data for analysis.
- wordcloud: Generates word cloud visualizations of text data.
- SnowballC: Provides stemming tools (reducing words to their root forms).
- ggplot2: A widely used data visualization package.
Install the necessary packages using the following commands:
install.packages("tm")
install.packages("textclean")
install.packages("wordcloud")
install.packages("SnowballC")
install.packages("ggplot2")
Load them into your R session after installation:
library(tm)
library(textclean)
library(wordcloud)
library(SnowballC)
library(ggplot2)
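As a convenience, the install-and-load steps above can be collapsed into one snippet that installs only the packages that are missing. A minimal sketch using base R only:

```r
pkgs <- c("tm", "textclean", "wordcloud", "SnowballC", "ggplot2")

# Install any packages that are not yet available, then load them all
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```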
Data collection
Text mining requires raw text data. Here’s how to import a CSV file in R:
# Read the CSV file
text_data <- read.csv("IMDB_dataset.csv", stringsAsFactors = FALSE)
# Extract the column containing the text
text_column <- text_data$review
# Create a corpus from the text column
corpus <- Corpus(VectorSource(text_column))
# Display the first document in the corpus
corpus[[1]]$content
Text preprocessing
Raw text requires cleaning before analysis. First, convert all text to lowercase and remove punctuation and numbers. Then remove common stopwords that add little meaning, and stem the remaining words to their base forms. Finally, clean up any extra whitespace. Here is a typical preprocessing pipeline in R:
# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Stem words
corpus <- tm_map(corpus, stemDocument)
# Remove white space
corpus <- tm_map(corpus, stripWhitespace)
# Display the first document in the preprocessed corpus
corpus[[1]]$content
Creating a Document Term Matrix (DTM)
After preprocessing the text, create a document-term matrix (DTM). A DTM is a table that counts how often each term occurs in each document.
# Create Document-Term Matrix
dtm <- DocumentTermMatrix(corpus)
# View matrix summary
inspect(dtm)
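On a large corpus, inspect() prints a great deal of output. The tm package's findFreqTerms() is a handier way to list only the terms above a frequency threshold. A minimal sketch on a small made-up corpus (the sentences below are illustrative, not from the IMDB data):

```r
library(tm)

# A tiny corpus for illustration
docs <- c("data mining finds patterns in data",
          "text mining extracts information from text",
          "mining text data reveals patterns")
small_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Terms that occur at least 3 times across all documents
findFreqTerms(small_dtm, lowfreq = 3)
```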
Visualization of results
Visualization helps you better understand the results. Word clouds and bar charts are popular methods for visualizing text data.
Word cloud
One popular way to visualize word frequency is a word cloud, which displays the most common words in a larger font. This makes it easy to see which words are most important.
# Convert DTM to matrix
dtm_matrix <- as.matrix(dtm)
# Get word frequencies
word_freq <- sort(colSums(dtm_matrix), decreasing = TRUE)
# Create word cloud
wordcloud(names(word_freq), freq = word_freq, min.freq = 5, colors = brewer.pal(8, "Dark2"), random.order = FALSE)
Bar chart
After you create a document term matrix (DTM), you can visualize word frequency in a bar chart. This will display the most common terms used in your text data.
library(ggplot2)
# Get word frequencies
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
# Convert word frequencies to a data frame for plotting
word_freq_df <- data.frame(term = names(word_freq), freq = word_freq)
# Sort the word frequency data frame by frequency in descending order
word_freq_df_sorted <- word_freq_df[order(-word_freq_df$freq), ]
# Filter for the top 5 most recurrent words
top_words <- head(word_freq_df_sorted, 5)
# Create a bar chart of the top words
ggplot(top_words, aes(x = reorder(term, -freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top 5 Word Frequencies", x = "Terms", y = "Frequency")
Topic modeling with LDA
Latent Dirichlet Allocation (LDA) is a common topic modeling technique. It finds hidden topics in large collections of text. The topicmodels package in R helps you apply LDA.
library(topicmodels)
# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
# Apply LDA
lda_model <- LDA(dtm, k = 5)
# View topics
topics <- terms(lda_model, 10)
# Display the topics
print(topics)
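Beyond the top terms per topic, the topicmodels package also exposes per-document topic probabilities via posterior(). A minimal sketch on a tiny made-up corpus (the four documents and k = 2 are illustrative assumptions, not the IMDB data):

```r
library(tm)
library(topicmodels)

# Tiny illustrative corpus: two documents about pets, two about finance
docs <- c("cats dogs pets animals cats",
          "stocks market trading finance stocks",
          "dogs animals pets cats dogs",
          "finance market stocks trading bonds")
dtm_small <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Fit a two-topic model
lda_small <- LDA(dtm_small, k = 2, control = list(seed = 42))

# Per-document topic probabilities: one row per document, one column per topic
round(posterior(lda_small)$topics, 2)

# The single most likely topic for each document
topics(lda_small)
```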
Conclusion
Text mining is an effective way to gather insights from text, and R offers many useful tools and packages for this purpose. You can easily clean and prepare text data, then analyze it and visualize the results. You can also explore hidden topics using methods such as LDA. Overall, R makes it straightforward to extract valuable information from text.
Jayita Gulati is a machine learning enthusiast and technical writer with a passion for building machine learning models. She holds a Master's degree in Computer Science from the University of Liverpool.
