Alan Turing (1950) opened his article "Computing Machinery and Intelligence" with the statement "I propose to consider the question, 'Can machines think?'" (p. 433). The Turing Test has since become a foundation of natural language processing.
The essence of the Capstone project is to create an application that uses NLP techniques and predictive analytics and, like SwiftKey's applications, takes in a phrase and returns the predicted next word.
The project is developed in partnership with SwiftKey, a company well known for its predictive text analytics. As of March 1, 2016, SwiftKey is part of the Microsoft family of products. SwiftKey applications are used on Android and iOS, anticipating and offering next-word choices as the user types, through Natural Language Processing (NLP) techniques. Microsoft's Word Flow technology is another example of NLP in action.
The purpose of this milestone report is to identify the initial steps taken towards the overall capstone project: a Natural Language Processing word-prediction application.
Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.
The data used for this analysis is available from:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
This analysis is based on three text files from HC Corpora, containing internet blog posts, news feeds, and Twitter tweets.
This report outlines the process of loading and sampling the data, cleaning it and building a corpus, and performing an exploratory N-gram analysis, before describing the planned next steps towards the prediction application.
Having downloaded and extracted the zip file, we then read the files and adjust for encoding.
A sample of each text is taken to optimise the performance of the analysis.
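The download and extraction step can be scripted as shown below; this is a sketch that assumes the working directory is the project root, so the archive unpacks into the final/en_US/ folder used in the subsequent code.

# Download and extract the dataset (only if not already present)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
  unzip("Coursera-SwiftKey.zip")
}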
# Libraries used throughout this report
library(readr); library(tm); library(knitr)
library(ggplot2); library(wordcloud); library(RColorBrewer); library(sqldf)

# Read the raw text files
data.blogs   <- read_file("final/en_US/en_US.blogs.txt")
data.news    <- read_file("final/en_US/en_US.news.txt")
data.twitter <- read_file("final/en_US/en_US.twitter.txt")

# Strip non-ASCII characters left over from the Latin-1 encoding
data.blogs   <- iconv(data.blogs, "latin1", "ASCII", sub = "")
data.news    <- iconv(data.news, "latin1", "ASCII", sub = "")
data.twitter <- iconv(data.twitter, "latin1", "ASCII", sub = "")

# Draw a sample from each text with the smaller() helper to keep the analysis fast
sample.blogs   <- smaller(data.blogs)
sample.news    <- smaller(data.news)
sample.twitter <- smaller(data.twitter)
sample.all     <- c(sample.blogs, sample.news, sample.twitter)
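The smaller() helper used above is a custom sampling function rather than a standard package function. A minimal sketch of what such a helper might look like follows, assuming it splits a text blob into lines and keeps a random fraction of them; the 5% fraction and the implementation details are illustrative assumptions, not the definitive version.

# Illustrative sketch of a sampling helper such as smaller(): split a single
# text blob into lines and keep a random fraction of them (assumed to be 5%)
smaller <- function(text, fraction = 0.05) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  set.seed(1234)                               # for reproducibility
  keep  <- rbinom(length(lines), 1, fraction) == 1
  lines[keep]
}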
# Summarise the raw files by the number of lines in each
data_summary <- data.frame(
  Blogs   = length(strsplit(data.blogs,   "\n", fixed = TRUE)[[1]]),
  News    = length(strsplit(data.news,    "\n", fixed = TRUE)[[1]]),
  Twitter = length(strsplit(data.twitter, "\n", fixed = TRUE)[[1]]))
kable(data_summary)
Blogs | News | Twitter |
---|---|---|
899288 | 1010242 | 2360148 |
# Combine the samples into one character vector and normalise the text
group_texts <- c(sample.blogs, sample.news, sample.twitter)
group_texts <- tolower(group_texts)
group_texts <- removeNumbers(group_texts)
group_texts <- removePunctuation(group_texts, preserve_intra_word_dashes = TRUE)
# Drop URLs and collapse repeated whitespace
group_texts <- gsub("http[[:alnum:]]*", "", group_texts)
group_texts <- stripWhitespace(group_texts)
# Repair Windows-1252 smart quotes that survived the encoding conversion
group_texts <- gsub("\u0092", "'", group_texts)
group_texts <- gsub("\u0093|\u0094", "", group_texts)
# Build a corpus, then remove hashtags, e-mail addresses, Twitter handles,
# and any remaining URLs; the patterns are regular expressions
corpus.data <- VCorpus(VectorSource(group_texts))
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus.data <- tm_map(corpus.data, toEmpty, "#\\w+")
corpus.data <- tm_map(corpus.data, toEmpty, "(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)")
corpus.data <- tm_map(corpus.data, toEmpty, "@\\w+")
corpus.data <- tm_map(corpus.data, toEmpty, "http[^[:space:]]*")
corpus.data <- tm_map(corpus.data, toSpace, "/|@|\\|")
# Download a profanity list, remove those words, and stem the corpus
download.file("https://goo.gl/To9w5B", "bad_word_list.txt")
bad_words <- readLines("./bad_word_list.txt")
corpus.data <- tm_map(corpus.data, removeWords, bad_words)
corpus.data <- tm_map(corpus.data, stemDocument)
Jurafsky and Martin (2000) provide a seminal work in the domain of NLP. The authors present a key approach for building prediction models, the N-gram, which relies on the (N-1) prior words of a sequence. An N-gram model is a language model that counts word sequences in a corpus to estimate the probability of the next word.
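For example, in the bigram case (N = 2) the maximum-likelihood estimate of the next-word probability comes directly from corpus counts:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n)}{C(w_{n-1})}$$

where $C(\cdot)$ denotes the number of times a word or word pair occurs in the corpus.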
Here we will tokenize the sample texts to generate unigrams, bigrams, and trigrams.
This will allow us to see the most frequently used words and expressions.
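The tokenization step itself is not shown; the following is a minimal sketch of how the term-document matrices (tdm1, tdm2, tdm3) and frequency tables (gram1freq, gram2freq, gram3freq) used below might be built, assuming RWeka's NGramTokenizer together with the tm package. The helper names and the choice of the top 30 terms are illustrative assumptions.

library(RWeka)

# N-gram tokenizer built on RWeka; returns a term-document matrix of n-grams
make_tdm <- function(corpus, n) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
}

# Collapse a term-document matrix into a frequency table of the top terms
make_freq <- function(tdm, top = 30) {
  freqs <- head(sort(rowSums(as.matrix(tdm)), decreasing = TRUE), top)
  d <- data.frame(word = names(freqs), freq = unname(freqs))
  d$word <- factor(d$word, levels = d$word)   # keep frequency order for plots
  d
}

tdm1 <- make_tdm(corpus.data, 1)
tdm2 <- make_tdm(corpus.data, 2)
tdm3 <- make_tdm(corpus.data, 3)

gram1freq <- make_freq(tdm1)
gram2freq <- make_freq(tdm2)
gram3freq <- make_freq(tdm3)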
g1 <- ggplot(gram1freq, aes(x = word, y = freq)) +
geom_bar(stat = "identity", fill = "red") +
ggtitle("1-gram") +
xlab("1-grams") + ylab("Frequency")
g1
g2 <- ggplot(gram2freq, aes(x = word, y = freq)) +
geom_bar(stat = "identity", fill = "yellow") +
ggtitle("2-gram") +
xlab("2-grams") + ylab("Frequency")
g2
g3 <- ggplot(gram3freq, aes(x = word, y = freq)) +
geom_bar(stat = "identity", fill = "blue") +
ggtitle("3-gram") +
xlab("3-grams") + ylab("Frequency")
g3
Following are word clouds of the unigrams, bigrams, and trigrams, each built from the 100 most frequent terms of the corresponding N-gram analysis. The more pronounced the word or phrase, the more frequently it appears in the texts.
create_wordcloud <- function(tdm, palette = "Dark2") {
mtx = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(mtx), decreasing = TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word = names(word_freqs), freq = word_freqs)
dm <- sqldf("select * from dm limit 100")
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, palette))
}
create_wordcloud(tdm1, "Set1")
create_wordcloud(tdm2, "Set2")
create_wordcloud(tdm3, "Set3")
Using the data from this analysis, a Katz back-off n-gram model will be used to build a predictive text input application.
The Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this by "backing off" to models with shorter histories under certain conditions, so that the model with the most reliable information about a given history is the one used to make the prediction.
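In outline, for a trigram history the back-off estimate takes the form:

$$
P_{bo}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
d_{w_{i-2} w_{i-1} w_i} \, \dfrac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > k \\
\alpha_{w_{i-2} w_{i-1}} \, P_{bo}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
$$

where $C(\cdot)$ is a corpus count, $d$ is a discount factor (typically derived from Good-Turing estimation), $k$ is a count threshold (often 0), and $\alpha$ redistributes the discounted probability mass to the lower-order model so the distribution still sums to one.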
This will be hosted in a Shiny web application that accepts user input and outputs the predicted next words.
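A minimal sketch of what such an app might look like follows; predict_next_word() is a hypothetical placeholder for the Katz back-off lookup built from the n-gram tables above, not an existing function.

library(shiny)

# Minimal sketch of the planned app; predict_next_word() is a placeholder
# for the Katz back-off model trained on the n-gram tables
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next_word(input$phrase)
  })
}

shinyApp(ui = ui, server = server)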