Word prediction with Natural Language Processing


Introduction

This report outlines the methodology for building a word prediction application using Natural Language Processing techniques as part of the Coursera Data Science Specialization.

The essence of the Capstone project is to create an application that uses NLP techniques and predictive analytics and, like SwiftKey’s applications, takes in a word phrase and returns the predicted next word.

The project is developed in partnership with SwiftKey, a company well known for its predictive text analytics. As of March 1, 2016, SwiftKey became part of the Microsoft family of products. SwiftKey applications are used on Android and iOS to anticipate and offer next-word choices while typing on the keyboard, using Natural Language Processing (NLP) techniques. Microsoft’s Word Flow technology is another example of NLP in action.

Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.

The milestone report outlines the initial approach to building a series of ngram models from a range of text documents.

Overview of the application

The application was developed in R using a number of packages and the Shiny web framework.

The sections below outline the methodology used to build, predict with, and evaluate the application.

NGram model

  1. Sample text was taken from the SwiftKey corpus data (15% of the original)
  2. The text was cleaned of non-ASCII characters, derogatory language, punctuation, non-words and extra whitespace
  3. The corpus was processed to produce 5 ngram models, which were combined into a data table together with the totalled frequency of each ngram (bag of words); a sketch of this pipeline follows
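
As a rough sketch of steps 1-3 (not the project’s actual code), the snippet below uses base R with data.table; the corpus vector ‘corpus_lines’ and the profanity list ‘bad_words’ are assumed to be loaded elsewhere.

  library(data.table)

  # Clean a line of text: drop non-ASCII characters, derogatory words,
  # punctuation, non-words and extra whitespace.
  clean_text <- function(x, bad_words = character(0)) {
    x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
    x <- tolower(x)
    x <- gsub("[^a-z' ]", " ", x)        # strip punctuation, digits, other non-words
    x <- gsub("\\s+", " ", trimws(x))    # collapse extra whitespace
    words <- unlist(strsplit(x, " ", fixed = TRUE))
    paste(words[!words %in% bad_words], collapse = " ")
  }

  # Build a frequency table (bag of words) for one ngram size from cleaned lines.
  build_ngrams <- function(lines, size) {
    grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
      if (length(w) < size) return(character(0))
      sapply(seq_len(length(w) - size + 1),
             function(i) paste(w[i:(i + size - 1)], collapse = " "))
    }))
    dt <- data.table(ngram = grams)[, .(frequency = .N), by = ngram]
    dt[, ngram_size := size]
    dt
  }

  set.seed(1234)
  sampled <- sample(corpus_lines, size = round(0.15 * length(corpus_lines)))  # 15% sample
  cleaned <- vapply(sampled, clean_text, character(1), USE.NAMES = FALSE)
  ngram_tables <- rbindlist(lapply(1:5, function(s) build_ngrams(cleaned, s)))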

Prediction

A prediction function then takes a sentence as input and executes the steps below:

  1. Validates and cleans the input sentence (using the same ‘clean_text’ function used to build the ngram models)
  2. For each ngram model, takes the last N words of the input text, where N is the size of the ngram model minus 1. For example:
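
A sketch of this step with a made-up input sentence, reusing the ‘clean_text’ sketch above; the helper ‘last_words’ is illustrative rather than the project’s actual function.

  # Return the last n words of a cleaned input sentence.
  last_words <- function(text, n) {
    words <- unlist(strsplit(text, " ", fixed = TRUE))
    if (length(words) < n) return(NA_character_)
    paste(tail(words, n), collapse = " ")
  }

  input <- clean_text("Thanks for inviting me, I had a great")
  last_words(input, 4)   # key for the 5-gram table: "i had a great"
  last_words(input, 3)   # key for the 4-gram table: "had a great"
  last_words(input, 2)   # key for the 3-gram table: "a great"
  last_words(input, 1)   # key for the 2-gram table: "great"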

Description of the algorithm used to make the prediction

With a data table containing the ngram model, sentence, frequency and predicted word, the top 3 most probable words are predicted using a Stupid Backoff smoothing strategy.
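
As an illustration only (the column names and values below are assumptions, not the project’s exact schema), a few matched rows of such a data table might look like this:

  library(data.table)

  # Illustrative rows only: the ngram model each match came from, the matched
  # input words, the predicted next word and the frequency of the full ngram.
  matches <- data.table(
    ngram     = c(5, 4, 3),
    sentence  = c("i had a great", "had a great", "a great"),
    predicted = c("time", "time", "deal"),
    frequency = c(12, 40, 25)
  )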

A pseudo-code description of how the ‘score’ for each word is calculated follows:

if the row’s ngram model was 5

  score = matched 5-gram count / input 4-gram count

else if the row’s ngram model was 4

  score = 0.4 * matched 4-gram count / input 3-gram count

else if the row’s ngram model was 3

  score = 0.4 * 0.4 * matched 3-gram count / input 2-gram count

else if the row’s ngram model was 2

  score = 0.4 * 0.4 * 0.4 * matched 2-gram count / input 1-gram count
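
The same rule expressed as a minimal R sketch, assuming ‘matched_count’ (frequency of the matched ngram) and ‘prefix_count’ (frequency of the input (N-1)-gram) have already been looked up from the frequency tables:

  # Stupid Backoff score for a row matched in an ngram model of the given size.
  # Each step down from the 5-gram model multiplies the score by another 0.4.
  backoff_score <- function(ngram_size, matched_count, prefix_count, alpha = 0.4) {
    alpha ^ (5 - ngram_size) * matched_count / prefix_count
  }

  backoff_score(5, matched_count = 12, prefix_count = 40)   # 12/40         = 0.3
  backoff_score(4, matched_count = 12, prefix_count = 40)   # 0.4   * 12/40 = 0.12
  backoff_score(3, matched_count = 12, prefix_count = 40)   # 0.4^2 * 12/40 = 0.048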

Finally, we group and sum the scores for matching predicted words.

For example, if the predicted word ‘you’ was found in the 4-gram model (and thus also the 3-gram and 2-gram models), the scored rows may look like:

ngram  predicted  score
4      you        0.2
3      you        0.1
2      you        0.05

The total score for the predicted word ‘you’ is (0.2 + 0.1 + 0.05) = 0.35

The scores are aggregated and summed for each predicted word, and the top 3 words with the highest totals are selected.

When no results are found, the 3 most common words in the English language (‘the’, ‘be’, ‘to’) are returned as a response.
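
Putting the grouping, ranking and fallback together, a sketch of this final step might look as follows; the ‘scored’ table is assumed to hold one row per matched ngram model with ‘predicted’ and ‘score’ columns, as in the example above.

  library(data.table)

  # Sum scores per predicted word, return the top 3, or fall back to common words.
  predict_top3 <- function(scored, fallback = c("the", "be", "to")) {
    if (nrow(scored) == 0) return(fallback)
    totals <- scored[, .(total = sum(score)), by = predicted]
    setorder(totals, -total)
    head(totals$predicted, 3)
  }

  scored <- data.table(ngram = c(4, 3, 2), predicted = "you", score = c(0.2, 0.1, 0.05))
  predict_top3(scored)   # returns "you" (total score 0.2 + 0.1 + 0.05 = 0.35)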

Evaluation

The prediction model was evaluated using the Benchmark.R tool (see references for source).

Initial prediction accuracy was quite high, but the predictions were also quite slow. The decision to use only the 1-3 ngram models to speed up the search cut the prediction time in half and dropped the accuracy by only 10%.


Instructions

To use the application, navigate to the following URL:

https://chrismckelt.shinyapps.io/datascience-capstone/

Start typing in text; the top 3 predicted words are displayed.


For access to the code please contact the author using one of the contact links on the site.


Trello with GetCorrello as an Agile Kanban Tool

After initially using Trello with ‘Scrum for Trello’ to run our company’s Agile process, and having to manually create burndowns and release estimates in Excel (after mangling data from the Trello API and a Google spreadsheet), we moved to Trello with GetCorrello (and switched to Kanban after a few months with the team’s consent).

Charts from this tool are listed below. The GetCorrello team were able to add features on demand. Would use them again!

 https://getcorrello.com

Some samples:

  - Burndown
  - Cumulative Flow Diagram
  - Estimated time to complete per label
  - Control chart
  - Cycle times
  - Stats
  - Slack integration

 

Our Kanban process

https://www.dropbox.com/s/fj1mr93gqhxv8nc/Fair%20Go%20Kanban%20Process.pptx?dl=0