Introduction
Introduction to The Post
Have you ever wondered how your email filters out spam messages? Or how autocorrect on your phone knows what you're trying to type? In this post, we'll cover some basics of natural language processing, like reading in and creating structure in messy text data, and then cleaning and tokenizing that data. The post will then cover some more advanced topics like lemmatising, stemming, and vectorizing the data, in other words, converting it from text into a numeric matrix. It does this with a focus on preparing the data to build a machine learning classifier on top of it. We'll also learn how to build two different types of machine learning models, while thoroughly testing and evaluating different variations of those models. By the end, we'll have the tools to go from a messy dataset to concise and accurate predictions from a machine learning model, delivering solutions to complex business problems.
Introduction to Natural Language Processing (NLP)
Natural language processing is a field concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language. By human language, we're simply referring to any language used for everyday communication. This can be English, Spanish, French, anything like that. Now it's worth noting that Python doesn't naturally know what any given word means. All it will see is a string of characters. For instance, it has no idea what natural actually means. It sees that it's seven characters long, but the individual characters don't mean anything to Python, and certainly the collection of those characters together doesn't mean anything, either. We, on the other hand, know what an N is, what an A is, and we know that together those seven characters make up the word natural, and we know what that means. So NLP is the field of getting the computer to understand what natural actually signifies, and from there we can get into the manipulation or potentially even generation of that human language.
You probably experience natural language processing on a daily basis, even if you don't realize it. Here are a few examples that you may see day to day. The first is a spam filter, where your email server determines whether an incoming email is spam or not based on the content of the body, the subject, and maybe the email domain. The second is auto-complete, where Google predicts what you're interested in searching for based on what you've already entered and what others commonly search for with those same phrases. So if I search for natural language processing, it knows that many other people are interested in learning NLP with Python, or learning it through a course, or looking for jobs related to natural language processing, so it can auto-complete your search for you. The last is auto-correct, where, say, the iPhone is trying to help you correct a misspelling. This example shows how auto-correct has evolved over time and continues to evolve and learn as the operating system is upgraded. With iOS 6, if you tried to say, "I'll be ill tomorrow," it wouldn't necessarily correct it, and the message would go out as "I'll be I'll tomorrow." With iOS 7, it auto-completes tomorrow and corrects the second I'll into ill, so the message correctly sends as "I'll be ill tomorrow." That shows how NLP is still evolving and how a system like iOS is still learning what natural language even means.
Now NLP is a very broad umbrella that encompasses many topics. A few of those might be sentiment analysis, topic modeling, text classification, and sentence segmentation or part-of-speech tagging. The core component of natural language processing is extracting all the information from a block of text that is relevant to a computer understanding the language. This is task specific, as well: different information is relevant for a sentiment analysis task than for a topic modeling task. So that's a very quick introduction into what natural language processing is.
Introduction to NLTK
The natural language toolkit is the most utilized package for handling natural language processing tasks in Python. Usually called NLTK for short, it is a suite of open-source tools originally created in 2001 at the University of Pennsylvania for the purpose of making it easier to build NLP processes in Python. This package has been expanded through the extensive contributions of open-source users in the years since its original development. NLTK is great because it basically provides a jumpstart to building any NLP process by giving you the basic tools that you can then chain together to accomplish your goal, rather than having to build all those tools from scratch; a lot of tools come packaged into NLTK.
NLP Basics
How to install NLTK on a local machine
Both sets of instructions below assume you already have Python installed. These instructions are taken directly from http://www.nltk.org/install.html.
Mac/Unix
From the terminal:
- Install NLTK: run pip install -U nltk
- Test installation: run python, then type import nltk
Windows
- Install NLTK: http://pypi.python.org/pypi/nltk
- Test installation: Start>Python35, then type import nltk
Download NLTK data
1 | import nltk |
The following images show the NLTK downloader:
In the above image, select 'all packages' and click the 'download' button to start installing all NLTK packages.
The above image shows that all NLTK packages have been installed.
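If you prefer not to use the graphical downloader, a hedged sketch of the same step from code is below. The specific package names passed to nltk.download are an assumption; they are the two corpora used later in this post, and calling nltk.download() with no arguments opens the graphical downloader shown above.

```python
import nltk

# nltk.download()  # opens the graphical downloader shown in the images above

# Or download specific packages non-interactively; these two are used later in the post
nltk.download("stopwords")
nltk.download("wordnet")
```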
Small example of using NLTK package:
1 | from nltk.corpus import stopwords |
Output:
1 | ['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won'] |
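The code for that small example is truncated above. A hedged sketch that produces a similar sample of the English stopword list follows; the exact slicing step is an assumption, chosen because taking every 25th word reproduces the output shown.

```python
from nltk.corpus import stopwords

# Full list of English stopwords (around 180 words)
stopwords_list = stopwords.words("english")

# Print every 25th word to get a small sample like the output above
print(stopwords_list[0::25])
```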
Reading in Text Data
What is Unstructured Data
Unstructured data could mean binary data, data with no delimiters, or data with no indication of rows. A few examples might be an email, a PDF file, or a social media post; these may just get dumped into a file with no indication of where, say, the subject of an email ends and the body of the email begins, or even where one email ends and the next begins. It can also get cluttered by things like HTML tags, and it can get really messy. It's important to note that Python is pretty smart, but ultimately, unless it's told otherwise, it basically sees everything as a string of characters. It needs to be told what those characters mean.
Read Semi-structured data
The following image presents the semi-structured SMS data:
This dataset is a collection of text messages, each with a label of either spam or ham. It’s not a clean CSV file, but it’s not terribly unstructured, either. Each row has a distinct text message and a distinct label as either spam or ham. So, in the context of text datasets, this is actually pretty well structured, so this shouldn’t be too difficult.
Read in the data and print it out:
1 | # Read in the raw text |
Output:
1 | "ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid" |
You can see that it's basically just a block of text, with \t and \n separators. The \t's are between the labels and the text message bodies, and the \n's are typically at the end of each line.
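For reference, a minimal sketch of that read step is below. The file name matches the one used with pandas later in the post; reading the whole file into one raw string is the assumption here.

```python
# Read the entire file in as one raw string
rawData = open("SMSSpamCollection.tsv").read()

# Peek at the first 500 characters to see the \t and \n separators
print(rawData[0:500])
```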
The following code is going to replace \n with \t and then split this into a list:
1 | parsedData = rawData.replace("\t", "\n").split("\n") |
Output:
1 | ['ham', |
Split the label and the text into lists:
1 | labelList = parsedData[0::2] |
Print the results:
1 | print(labelList[0:5]) |
Output:
1 | ['ham', 'spam', 'ham', 'ham', 'ham'] |
Combine both lists and put them into a pandas DataFrame:
1 | import pandas as pd |
Output:
1 | ValueError: arrays must all be same length |
Spot the error:
1 | print(len(labelList)) |
Output:
1 | 5571 |
Print the last 5 values:
1 | print(labelList[-5:]) |
Output:
1 | ['ham', 'ham', 'ham', 'ham', ''] |
Correction:
1 | fullCorpus = pd.DataFrame({ |
Output:
label | body_list | |
---|---|---|
0 | ham | I’ve been searching for the right words to tha… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup fina… |
2 | ham | Nah I don’t think he goes to usf, he lives aro… |
3 | ham | Even my brother is not like to speak with me. … |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
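The correction itself is truncated above. A hedged sketch of one common fix, which drops the trailing empty string left over from the final \n, is below; the textList slice is an assumption that mirrors the labelList slice shown earlier.

```python
import pandas as pd

# The other slice: every second element starting at index 1 (assumed)
textList = parsedData[1::2]

# labelList ends with an empty string from the trailing "\n", so trim it off
fullCorpus = pd.DataFrame({
    "label": labelList[:-1],
    "body_list": textList
})

print(fullCorpus.head())
```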
Read the file using pandas:
1 | dataset = pd.read_csv("SMSSpamCollection.tsv", sep = "\t", header = None) |
Output:
0 | 1 | |
---|---|---|
0 | ham | I’ve been searching for the right words to tha… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup fina… |
2 | ham | Nah I don’t think he goes to usf, he lives aro… |
3 | ham | Even my brother is not like to speak with me. … |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
Exploring The Dataset
Before diving into any in-depth analysis, data cleaning or model building, we want to do some very high-level exploration of our data to understand what we’re working with. So we might ask questions like what is the shape of our data, how many ham or spam are in our data set, and are there any missing values. So this will inform the decisions that we make as we move forward.
Read the data:
1 | import pandas as pd |
Output:
label | body_list | |
---|---|---|
0 | ham | I’ve been searching for the right words to tha… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup fina… |
2 | ham | Nah I don’t think he goes to usf, he lives aro… |
3 | ham | Even my brother is not like to speak with me. … |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
What is the shape of the dataset?
1 | print("Input data has {} rows and {} columns".format(len(fullCorpus), len(fullCorpus.columns))) |
Output:
1 | Input data has 5568 rows and 2 columns |
How many spam/ham are there?
1 | print("Out of {} rows, {} are spam, {} are ham".format(len(fullCorpus), |
Output:
1 | Out of 5568 rows, 746 are spam, 4822 are ham |
How much missing data is there?
1 | print("Number of null in label: {}".format(fullCorpus["label"].isnull().sum())) |
Output:
1 | Number of null in label: 0 |
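The exploration code above is truncated; a hedged sketch that reproduces all three checks (shape, class balance, and missing values), assuming the column names label and body_list from the DataFrame built earlier, is:

```python
# Shape of the dataset
print("Input data has {} rows and {} columns".format(len(fullCorpus), len(fullCorpus.columns)))

# Class balance
print("Out of {} rows, {} are spam, {} are ham".format(
    len(fullCorpus),
    len(fullCorpus[fullCorpus["label"] == "spam"]),
    len(fullCorpus[fullCorpus["label"] == "ham"])))

# Missing values
print("Number of null in label: {}".format(fullCorpus["label"].isnull().sum()))
print("Number of null in text: {}".format(fullCorpus["body_list"].isnull().sum()))
```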
This helped us better understand our data once we actually got it read in. The insights pulled from this very basic exploration will help dictate how we approach the rest of our data cleaning and model building.
Regular Expressions
Introduction to Regular Expression
A regular expression, or regex for short, is a text string used for describing a certain search pattern. So if you're familiar with wildcards for search, like searching for any CSV file on your computer using *.csv, a regex is basically just a supercharged version of that. Regular expressions can take various forms.
To give a very quick example of what it means, the regular expression 'nlp' will just search for the explicit "nlp" string within some other string. This isn't so much a search pattern as it is an explicit command for what we want to find. So if the string was "I love nlp", then this search pattern would just capture and return "nlp". Another way to identify the "nlp" string would be to use the expression '[j-q]'. This will just search for all single characters between 'j' and 'q' in whatever text we're looking at, but it will find all characters between 'j' and 'q', not just 'n', 'l', and 'p'. The other consideration here is that this will only return single characters at a time. So this would return 'n', then 'l', then 'p', and also whatever other characters between 'j' and 'q' appear in your text string. This isn't usually what we're looking for. We can solve the issue of only returning a single character by simply placing a plus sign outside of our brackets, like '[j-q]+'. That tells Python that it can search for strings longer than one character. So this will look for any character between 'j' and 'q', just with the added flexibility of returning strings of multiple characters together that are between 'j' and 'q'. Switching gears a little bit, '[0-9]+' will return all numbers with the flexibility of returning sequences of more than one digit. So if there's a year, like 2017, it will return the full year, rather than each digit individually. Then lastly, to combine these two concepts, '[j-q0-9]+' will search for sequences of characters between 'j' and 'q', or numbers between 0 and 9. So if you had a course name that was "nlp2017" without any spaces, then it would return that full string, but if you had "nlp 2017" with a space in between, then that would return them as two separate sequences. These are just five very quick examples, but there is literally an infinite number of patterns that you could come up with.
Regexes give you the power and flexibility to search for almost any kind of pattern you could imagine. The examples are useful, but why do we actually care about this? Regexes are particularly useful when dealing with text data because a lot of the data is unstructured, and you need to be able to use these patterns to create some structure within the document. For instance, you could use a regex to identify the white space between words or tokens, or even let Python know how to split up a certain sentence. Another use is to identify delimiters between columns or end-of-line escape characters that indicate the end of one line and the beginning of another, like we saw in our SMS Spam Collection dataset. They can also be used to remove punctuation or numbers, clean HTML tags, or just identify some other pattern that you're interested in. A few examples of common regex use cases might be confirming passwords that meet some criteria. So maybe a company requires one capital letter, one lower case letter, and one special character in their passwords; you can create a regex to confirm that each new password matches that criteria. The last three all fall under the same broad category of searching for a certain pattern: filenames, so find all the CSVs that meet this criteria; some portion of a URL, so maybe whatever follows .com or .org; or scraping key information from a larger document, like package version numbers from a technical report.
How to Use Regular Expressions
The primary reason that we're talking about regexes is in order to tokenize sentences, or split a sentence into a list of words, so that Python can understand what it needs to be looking at. Right now, Python just sees a string of characters, so we need to tell it what to focus on and how to organize those characters. For our machine learning model, Python will need to split the string into what we call tokens, or words, so that the model can learn how those tokens relate to the response variable.
Python's re package is the most commonly used regex resource. More details can be found here.
Import the re package and define 3 sentences:
1 | import re |
Splitting a sentence into a list of words:
First Sentence:
1 | re.split("\s", re_test) |
Output:
1 | ['This', |
Second Sentence:
1 | # Split on a space to find words |
Output:
1 | ['This', |
You can see that the above code doesn't work well if there is more than one space between words. The following code solves this problem:
1 | # Split on maybe more than one spaces to find words |
Output:
1 | ['This', |
Let's try the third sentence:
1 | re.split("\s+", re_test_messy1) |
Output:
1 | ['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'] |
As you can see, the code above only looks for whitespace, but the third sentence contains lots of special characters, so it won't work on this sentence. The following code fixes the problem by splitting on non-word characters instead.
1 | re.split("\W+", re_test_messy1) |
Output:
1 | ['This', |
There are two other options that search for the words themselves instead of the separators, but they give the same result in the end. The following code shows how:
1 | re.findall("\S+", re_test) |
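Only the first of those lines survives above. A hedged sketch of the whole comparison, using the uppercase variants that match the tokens themselves, is below; the variable name re_test_messy for the second sentence is an assumption based on the other two names, and raw strings are used for the patterns.

```python
import re

# \S+ matches runs of non-whitespace characters (works for the cleanly spaced sentence)
print(re.findall(r"\S+", re_test))

# \S+ also handles the sentence with extra spaces (variable name assumed)
print(re.findall(r"\S+", re_test_messy))

# \w+ matches runs of word characters, which is what the messy sentence needs
print(re.findall(r"\w+", re_test_messy1))
```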
Note: an uppercase letter in a regex usually means the opposite of its lowercase counterpart. The lowercase version matches the specified instances (\s for whitespace, \w for word characters), while the uppercase version matches everything except those instances (\S, \W).
So that's how we can use two different methods from the re package, along with several different regexes, to properly tokenize messy sentences. Now that we've covered some basic regex usage for the purpose of tokenizing, there are a few takeaways to keep in mind. There are two methods from the re package that can be used for tokenizing: findall() searches for the actual words while ignoring the characters that separate them, while split() searches for the characters that split the words while ignoring the words themselves. As for the regexes that are most useful for tokenizing, keep in mind that anything using W is based on word characters, while anything using S is based on whitespace. In our daily work, it's much more common to use the W regexes because they allow the flexibility for words to be separated by spaces or special characters, but having an understanding of what S offers is a nice tool to hold in your back pocket.
Regular Expression Replacements
The following sentences are examples where we need to capture a section of the sentence and replace it with other words:
1 | pep8_test = 'I try to follow PEP8 guidelines' |
We need to replace PEP8, PEP7, and PEEP8 with "PEP8 Python Styleguide". The following code is a first experiment toward this replacement:
1 | import re |
Output:
1 | ['try', 'to', 'follow', 'guidelines'] |
The above code finds all the lowercase words, but we need the uppercase ones, so we change it like this:
1 | re.findall("[A-Z]+", pep8_test) |
Output:
1 | ['I', 'PEP'] |
As you can see from the output, this captures all uppercase words. However, we also need the trailing digits to match PEP8, PEP7, and PEEP8, so we simply append [0-9]+ to the pattern.
1 | re.findall("[A-Z]+[0-9]+", peep8_test) |
Output:
1 | ['PEEP8'] |
Here is our final search result. But how do we replace the matched section?
1 | re.sub("[A-Z]+[0-9]+", "PEP8 Python Styleguide", peep8_test) |
Output:
1 | 'I try to follow PEP8 Python Styleguide guidelines' |
You can see that the sub function replaces the matched section with another string that we define.
Now this regex certainly isn't perfect; you can imagine scenarios it would miss. For instance, if there's a space between pep and 8, or if it was lowercase, it would miss both of those, so you'd likely need to spend some time refining your regex. However, the point is to illustrate how you can use regexes and to give a practical example of a case where you might use them. Up to this point we've explored three different regex methods with some practical applications. But there are a lot of other regex methods within the re package. A few are listed below: search, match, fullmatch, finditer, and escape. Even more broadly, the regexes we explored are very, very simple; they can get very complex. The best way to learn is by defining your own string with the goal of identifying some substring, pulling up a regex cheat sheet, and exploring different patterns to try to identify that substring.
Other examples of regex methods
- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
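As a quick hedged illustration of the first two methods in that list (this example string and pattern are not from the original post, just reused from the PEP8 example above):

```python
import re

text = "I try to follow PEP8 guidelines"

# re.search scans the whole string and returns the first match
print(re.search(r"[A-Z]+[0-9]+", text))   # <re.Match object; span=(16, 20), match='PEP8'>

# re.match only matches at the very beginning of the string
print(re.match(r"[A-Z]+[0-9]+", text))    # None, since the string does not start with the pattern
```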
Machine Learning Pipeline
Up to this point, we’ve learned some basics of NLP and NLTK. We’ve learned how to read in messy text, and we’ve learned how to use regular expressions to search for and manipulate that text. Now, we’ll take a step back to understand how this all fits together in the broader machine learning pipeline before we dive into each step individually. This section is going to introduce some new topics as well and we’ll cover each of these topics later. This is meant only to provide the proper context for how this all fits together.
In a typical machine learning text pipeline, you’ll start with some document with raw text in it, like the SMS data set that we’re working with. It’s important to note that at this stage, the computer has no idea what it’s looking at. All it sees is a collection of characters. It doesn’t know the word ham from the word spam. The characters mean nothing. It doesn’t even know a space from a number or a letter. They’re all the same.
So the first thing we need to do is tokenize our text. We did this earlier in this chapter by splitting on white space or special characters. So you would take the sentence, "I am learning NLP," and it would split into a list with four tokens: I, then am, then learning, and lastly NLP. So now, instead of seeing one long string of characters like before, Python will see a list with four distinct tokens, and it knows what to look at.
However, some of the words might be a little bit more important than other words. For instance, the words the, and, of, or, appear very frequently but don’t really offer much information about the sentence itself. These are what’s called stop words. We took a quick look at these earlier in this post. Typically, you will remove these words to allow Python to really focus in on the most pivotal words in our sentence. So in the example we used previously, instead of a list with I, am, learning, NLP, once you remove stop words, now you’re just left with learning and NLP. This still gets across the most important point of the sentence, but now you’re only looking at half the amount of tokens. Also, the process of stemming helps Python realize that words like learn, learned and learning all have basically the same semantic meaning. You may not think this is a big deal, and in a small sample it’s really not, but when you all of a sudden have a million text messages, and a corpus of 150,000 words, any words that you can remove to allow Python to focus on the most pivotal words can really make a big difference.
So now Python sees a list of tokens you care about, and the key words that we think are useful for building some kind of machine learning model. Even though Python now knows what you care about, it still only sees characters. It doesn’t know what learning or NLP even means. So we have to convert it to a format that a machine learning algorithm can actually ingest and use to build a model. This is a process called vectorizing. It’s basically converting the text to a numeric representation of that text, where you are essentially counting the occurrences of each word in each text message using a matrix with one row per text message and one column per word.
Now that you have this numeric matrix, you can fit your actual machine learning model by feeding in your vectorized data along with your spam or ham labels. The model will then learn the relationships between the words and the labels in order to train a model to make predictions on text messages that it has never seen before and determine whether they are spam or not. There are various types of machine learning models, and it will be up to you to select a few of them to try out. You'll tailor your choices based on the type of input data you're giving it, what you're trying to predict, how much compute power you have, things like that. You'll typically test out a number of what are called candidate models before selecting which model performs best. Once you select the best model, you'll evaluate it on a holdout test set, which is typically a set of data that you set off to the side at the very beginning for the purpose of testing your final model, to see how it will perform on data that it has never seen or touched before. If it passes this final test, then you'll prepare to implement it within whatever framework you're working with. In this example, it's an illustration of a spam filter, trying to filter out whether an incoming email is spam or not.
Implementation
In the previous part, we put those together at a conceptual level, laying out what the full machine learning pipeline looks like. In this part, we're going to actually write the code to handle the cleaning portion, or the pre-processing as it's typically referred to, of this machine learning program. There are four steps below that you'll see in a lot of text cleaning pipelines: removing the punctuation, tokenization, removing stop words, and lemmatising or stemming. We're going to focus on the first three steps in this session, then we'll cover lemmatising and stemming in the next chapter of the post, as those are a little bit more advanced and not implemented in every pipeline.
The following sample is our target output for all:
1 | # What does the cleaned version look like? |
Output:
label | body_text | body_text_nostop | |
---|---|---|---|
0 | ham | I’ve been searching for the right words to thank you for this breather. I promise i wont take yo… | [‘ive’, ‘searching’, ‘right’, ‘words’, ‘thank’, ‘breather’, ‘promise’, ‘wont’, ‘take’, ‘help’, '… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive … | [‘free’, ‘entry’, ‘2’, ‘wkly’, ‘comp’, ‘win’, ‘fa’, ‘cup’, ‘final’, ‘tkts’, ‘21st’, ‘may’, '2005… |
2 | ham | Nah I don’t think he goes to usf, he lives around here though | [‘nah’, ‘dont’, ‘think’, ‘goes’, ‘usf’, ‘lives’, ‘around’, ‘though’] |
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [‘even’, ‘brother’, ‘like’, ‘speak’, ‘treat’, ‘like’, ‘aids’, ‘patent’] |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [‘date’, ‘sunday’] |
Removing Punctuation
Import data:
1 | import pandas as pd |
Output:
label | body_text | |
---|---|---|
0 | ham | I’ve been searching for the right words to thank you for this breather. I promise i wont take yo… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive … |
2 | ham | Nah I don’t think he goes to usf, he lives around here though |
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
Import the string package and show what string.punctuation contains:
1 | import string |
Output:
1 | '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' |
This is really helpful for allowing Python to identify what we're looking for. The reason we care is that periods, parentheses, and other punctuation look like just another character to Python, but realistically a period doesn't really help pull the meaning out of a sentence. In the following example, to us "I like NLP.", with a period, is exactly the same as "I like NLP". They mean the same thing to us, but when you give them to Python, Python says those are not equivalent things. And Python isn't saying that they're really close but one has a period and one doesn't; to Python, these might as well be "I like NLP" versus "I hate NLP". It knows they're different without any ability to understand how different they are.
1 | "I like NLP." == "I like NLP" |
Output:
1 | False |
Here is the function to remove the punctuation:
1 | import re |
Output:
label | body_text | body_text_clean |
---|---|---|---|
0 | ham | I’ve been searching for the right words to thank you for this breather. I promise i wont take yo… | Ive been searching for the right words to thank you for this breather I promise i wont take your… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive … | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e… |
2 | ham | Nah I don’t think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though |
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL |
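The body of that function is truncated above. A hedged sketch of what it likely looks like, using string.punctuation and a lambda to apply it, follows; the function name remove_punct is an assumption, and the DataFrame is assumed to be named data, as it is in the vectorising section later in the post.

```python
import string

def remove_punct(text):
    # Keep only the characters that are not punctuation, then rejoin them
    return "".join([char for char in text if char not in string.punctuation])

# Apply the function to each message to create the cleaned column
data["body_text_clean"] = data["body_text"].apply(lambda x: remove_punct(x))
```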
Tokenisation
As we discussed previously, tokenizing is splitting some string or sentence into a list of words. We learned that you have to account for extra cases in your strings, like when words are separated by special characters or multiple spaces. So we'll just use what we learned in our lesson about regexes, and combine that with the approach from the last lesson, where we removed punctuation by writing our own function and then applying it to our data set using a lambda function, in order to tokenize our text.
The following code is going to tokenise text data:
1 | import re |
Output:
label | body_text | body_text_clean | body_text_tokenised | |
---|---|---|---|---|
0 | ham | I’ve been searching for the right words to thank you for this breather. I promise i wont take yo… | Ive been searching for the right words to thank you for this breather I promise i wont take your… | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, … |
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive … | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e… | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to… |
2 | ham | Nah I don’t think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] |
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] |
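Again the code is truncated above. A hedged sketch of that tokenising step, splitting on non-word characters and lowercasing (the function name tokenise and the data variable are assumptions), is:

```python
import re

def tokenise(text):
    # Split on one or more non-word characters and lowercase everything,
    # since "NLP" == "nlp" is False to Python
    return re.split(r"\W+", text.lower())

data["body_text_tokenised"] = data["body_text_clean"].apply(lambda x: tokenise(x))
```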
Example of case sensitivity:
1 | "NLP" == "nlp" |
Output:
1 | False |
Removing Stop Words
The last step in cleaning up this data is to remove stopwords. Now we’ve discussed stopwords previously. They are commonly-used words like the, but, if, that don’t contribute much to the meaning of a sentence. So we want to remove them, to limit the number of tokens Python actually has to look at when building our model. For instance, take the sentence, I am learning NLP. After tokenizing, it would have four tokens, I, am, learning, and NLP. Then after removing stopwords, instead of a list with four tokens, you’re now left with just learning and NLP. So it gets across the same message, and now, your machine learning model only has to look at half the number of tokens.
Get all stop words:
1 | import nltk |
Remove all stop words:
1 | def remove_stopwords(tokenised_list): |
Output:
label | body_text | body_text_clean | body_text_tokenised | body_text_nostop | |
---|---|---|---|---|---|
0 | ham | I’ve been searching for the right words to thank you for this breather. I promise i wont take yo… | Ive been searching for the right words to thank you for this breather I promise i wont take your… | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, … | [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom… |
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive … | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e… | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to… | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv… |
2 | ham | Nah I don’t think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] | [nah, dont, think, goes, usf, lives, around, though] |
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] | [even, brother, like, speak, treat, like, aids, patent] |
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] | [date, sunday] |
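A hedged sketch of the remove_stopwords function outlined above, using NLTK's English stopword list and a list comprehension (the data variable name is assumed, as before):

```python
import nltk

stopword = nltk.corpus.stopwords.words("english")

def remove_stopwords(tokenised_list):
    # Keep only the tokens that are not in the stopword list
    return [word for word in tokenised_list if word not in stopword]

data["body_text_nostop"] = data["body_text_tokenised"].apply(lambda x: remove_stopwords(x))
```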
So now we have a cleaned column that has been tokenised, we’ve removed the punctuation and we’ve removed the stopwords. So that is a very abbreviated look at what a pre-processing pipeline looks like as you’re preparing to get your raw text into a format that a machine learning model can actually use. In the next chapter, we’ll explore some extra, slightly more advanced cleaning techniques and concepts that we can apply to our text to further help a machine learning model focus on the things that are really important.
Supplemental Data Cleaning
Introducing Stemming
The formal definition of stemming is the process of reducing inflected or derived words to their word stem or root. More simply put, the process of stemming means often crudely chopping off the end of a word, to leave only the base. So this means taking words with various suffixes and condensing them under the same root word. Recall when we removed stop words, it was to reduce the number of words Python has to look at or consider. Stemming is shooting for the same goal by reducing variations of the same root word.
Examples:
before | after |
---|---|
Stemming / stemmed | Stem |
Electricity / electrical | Electr |
Berries / berry | berri |
Connection / connected / connective | Connect |
So this seems pretty useful, but stemming uses very crude rules, so it isn't perfect. For instance, look at meaning and meanness. These words aren't really all that closely related, but they'll both be stripped down to a base of mean, and thus Python will think that meanness and meaning are the same exact thing. So stemmers are correct in most cases, but the trade-off with these simple rules is that they won't always be right.
So this all seems interesting, but why do we really care about this? Why does this actually help us for model building? If Python sees grew, grow, and growing as three separate things, that means it has to keep those three separate words in memory. Imagine every variation of every root word: maybe we have a thousand root words, but in our corpus we have two thousand total words with every suffix added to the root words. The alternative in this grew, grow, and growing example is applying the stemmer, so that Python only has to know what grow means, as each variation of grow is replaced simply by grow. Without a stemmer, Python has to look at a lot more tokens, and it doesn't know that these separate tokens are even related. So the benefits of a stemmer are that it reduces the corpus of words the model is exposed to, so it's just grow instead of grew, grow, and growing, and that it explicitly correlates words with similar meaning. Python could learn through the training process, which we'll discuss later, that grow, grew, and growing are similar in meaning, but it also may not; that depends on a lot of different factors. In this case, we're not leaving it up to Python. We're being explicit by replacing similar words with just one common root word. There are a number of different types of stemmers that use various algorithms and methods to generate the stemmed version of words. A few that are included in the NLTK package are the Porter Stemmer, the Snowball Stemmer, the Lancaster Stemmer, and a Regex-Based Stemmer. We'll be focusing on the most popular stemmer in this list, the Porter Stemmer.
Using Stemming
To make use of stemming, there are two stages. First, we'll test out the stemmer on specific words to understand how it works. Then we'll apply the stemmer to the SMS Spam Collection data set to further clean up our data.
Import package:
1 | import nltk |
First example of using the function:
1 | print(ps.stem("grows")) |
Output:
1 | grow |
So that reduces them all to the proper root word of grow. So now these three words can be treated as the same word, rather than Python seeing them as three distinctly different words.
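A hedged sketch of that first example, assuming the Porter stemmer instance is named ps as in the snippets above:

```python
import nltk

ps = nltk.PorterStemmer()

# All three variations reduce to the same root
print(ps.stem("grows"))     # grow
print(ps.stem("grow"))      # grow
print(ps.stem("growing"))   # grow
```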
We showed before how the stemmer isn’t perfect, where it stemmed both meaning and meanness down to mean, even though they don’t represent the same thing. However, if you look at a different example that could be a little difficult, we’ll do run, running, and runner. You could see how all three of these might be reduced down to just run. Even though the first two are actions and the last one describes a person.
Second example of using the function:
1 | print(ps.stem("run")) |
Output:
1 | run |
So the stemmer can actually tell that the first two are different than the last one in some way. So stemmers certainly aren’t perfect, but they still do a pretty good job of identifying words that have the same meaning.
Let’s use it for our SMS Spam example!
Import packages and read the file:
1 | import pandas as pd |
Output:
label | body_text | |
---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … |
1 | ham | Nah I dont think he goes to usf he lives around here though |
2 | ham | Even my brother is not like to speak with me They treat me like aids patent. |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… |
Clean up text:
1 | def clean_text(text): |
Output:
label | body_text | body_text_nostop | |
---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv… |
1 | ham | Nah I dont think he goes to usf he lives around here though | [nah, dont, think, goes, usf, lives, around, though] |
2 | ham | Even my brother is not like to speak with me They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr… |
We saw that the ps.stem method is what stems each word. So the column that we’ll be operating on from this data frame is this tokenized list. So we’ll want to iterate through the list and stem each word and then return the stemmed version back to the list. So this should be starting to sound familiar at this point. We’ll again write our own function using ps.stem within list comprehension in order to stem each word.
The following code is the way to stem the text:
1 | def stemming(tokenised_text): |
Output:
label | body_text | body_text_nostop | body_text_stemmed | |
---|---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv… | [free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,… |
1 | ham | Nah I dont think he goes to usf he lives around here though | [nah, dont, think, goes, usf, lives, around, though] | [nah, dont, think, goe, usf, live, around, though] |
2 | ham | Even my brother is not like to speak with me They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] | [even, brother, like, speak, treat, like, aid, patent] |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] | [date, sunday] |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr… | [per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, press, … |
Now it's worth noting that the stemmer won't do a great job with slang or abbreviations, so it's probably not a great fit for a text message data set. Notice that entry is changed to entri with an i, so it can also accommodate the plural, entries. Same thing with wkli. Another one is on the second line, where lives is reduced down to live. So now we know what stemming represents and how to actually apply it. Stemming helps us reduce the corpus of words that the models are exposed to, and it explicitly correlates words with similar meaning.
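For reference, a hedged sketch of the stemming step described above, iterating over the tokenised list with a list comprehension (ps is the Porter stemmer instance and data is the DataFrame, both assumed from the earlier snippets):

```python
def stemming(tokenised_text):
    # Stem each token and return the stemmed list
    return [ps.stem(word) for word in tokenised_text]

data["body_text_stemmed"] = data["body_text_nostop"].apply(lambda x: stemming(x))
```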
Introducing Lemmatising
The formal definition is that it’s the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word’s lemma. The lemma is the canonical form of a set of words. For instance, type, typed, and typing would all be forms of the same lemma. More simply put, lemmatising is using vocabulary analysis of words to remove inflectional endings and return to the dictionary form of a word. So again, type, typed, and typing would all be simplified down to type, because that’s the root of the word. Each variation carries the same meaning just with slightly different tense. So you might be thinking that that sounds an awful lot like stemming, and you wouldn’t be wrong. They are aiming to accomplish the same thing, but they are doing it in just slightly different ways. And in practical terms, there’s an accuracy and speed trade-off that you’re making when you opt for one over the other.
The goal of both is to condense derived words down into their base form, to reduce the corpus of words that the model's exposed to, and to explicitly correlate words with similar meaning. The difference is that stemming takes a more crude approach by just chopping off the ending of a word using heuristics, without any understanding of the context in which a word is used. Because of that, stemming may or may not return an actual word in the dictionary, and it's usually less accurate, but the benefit is that it's faster because the rules are quite simple. Lemmatising leverages more informed analysis to create groups of words with similar meaning based on the context around the word, part of speech, and other factors. Lemmatisers will always return a dictionary word, and because of the additional context that's considered, lemmatising is typically more accurate. The downside is that it may be more computationally expensive. So this is a very brief introduction into lemmatising.
Using Lemmatising
To make use of lemmatising, there are two stages. First, we're going to test out the lemmatiser on specific words to understand how it works, and then we'll apply it to the SMS Spam Collection Data Set to further clean it up. This is the same process we followed for stemming. Just like we saw with stemmers, there are a few different lemmatisers as well, and they handle words in slightly different ways. We're going to use the WordNet lemmatiser, which is probably the most popular one. WordNet is a collection of nouns, verbs, adjectives and adverbs that are grouped together in sets of synonyms, each expressing a distinct concept. This lemmatiser runs off of this corpus of synonyms, so given a word, it will track that word to its synonyms, and then to the distinct concept that that group of words represents.
Test out WordNet lemmatiser (read more about WordNet here)
Import the nltk package and apply both lemmatiser and stemmer function:
1 | import nltk |
Comparison between Lemmatiser and Stemmer
First Example
Stemmer:
1 | print(ps.stem("meanness")) |
Output:
1 | mean |
As we mentioned, meanness and meaning are different, but the stemmer still chops off the suffix and returns the same root word for both.
Lemmatiser:
1 | print(wn.lemmatize("meanness")) |
Output:
1 | meanness |
Two things need to be highlighted here. First, stemming uses an algorithmic approach, so it's only concerned with the string it's given and will essentially chop off the suffix. Lemmatising is a little more complex in that it searches the corpus to find related words and condenses them down to the core concept. The problem is that if the word isn't in the corpus, the lemmatiser will just return the original word, which is what's happening in this example. With that said, not condensing it in this case is probably better than incorrectly stemming it with the Porter stemmer.
Second Example
Stemmer:
1 | print(ps.stem("goose")) |
Output:
1 | goos |
You can see that the stemmer doesn't quite know what to do here with goose and geese, so it returns two different root words. So again, Python will still view goose and geese as two different things even if you use the stemmer.
Lemmatiser:
1 | print(wn.lemmatize("goose")) |
Output:
1 | goose |
You can see that the lemmatiser correctly maps both of these back to goose. Python will then be able to now realize that these are the same words, so a lemmatiser can be quite powerful in some relatively complex situations.
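A hedged sketch of both comparisons side by side, assuming the lemmatiser and stemmer instances are named wn and ps as in the snippets above:

```python
import nltk

wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()

# The stemmer crudely chops suffixes; the lemmatiser looks the word up in WordNet
print(ps.stem("meanness"), ps.stem("meaning"))            # mean mean
print(wn.lemmatize("meanness"), wn.lemmatize("meaning"))  # meanness meaning

print(ps.stem("goose"), ps.stem("geese"))                 # goos gees
print(wn.lemmatize("goose"), wn.lemmatize("geese"))       # goose goose
```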
Apply lemmatiser to SMS Spam Collection Data
Read the text file:
1 | import pandas as pd |
Output:
label | body_text | |
---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … |
1 | ham | Nah I dont think he goes to usf he lives around here though |
2 | ham | Even my brother is not like to speak with me They treat me like aids patent. |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… |
Clean up text:
1 | def clean_text(text): |
Output:
label | body_text | body_text_nostop | |
---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv… |
1 | ham | Nah I dont think he goes to usf he lives around here though | [nah, dont, think, goes, usf, lives, around, though] |
2 | ham | Even my brother is not like to speak with me They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr… |
Lemmatise text using body_text_nostop data:
1 | def lemmatising(tokenised_text): |
Output:
label | body_text | body_text_nostop | body_text_lemmatised | 
---|---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv… | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv… |
1 | ham | Nah I dont think he goes to usf he lives around here though | [nah, dont, think, goes, usf, lives, around, though] | [nah, dont, think, go, usf, life, around, though] |
2 | ham | Even my brother is not like to speak with me They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] | [even, brother, like, speak, treat, like, aid, patent] |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] | [date, sunday] |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr… | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre… |
5 | spam | WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c… | [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170… | [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170… |
6 | spam | Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came… | [mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile… | [mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, … |
7 | ham | I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried … | [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] | [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] |
8 | spam | SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, … | [six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,… | [six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t… |
9 | spam | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM… | [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk… | [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk… |
Just like the stemmer, the lemmatiser won't do particularly well with slang or abbreviations, so it's not ideal for this data set. It might be much more effective on a collection of book reports or journal articles. There are a couple of things that the lemmatiser was able to impact: it transitioned lives into life, and it also transitioned mobiles into mobile. You'll notice that these aren't super interesting examples necessarily, but as you saw above, the lemmatiser can do some relatively sophisticated things, at least more sophisticated than the stemmer. Now you've learned what lemmatising is and how to actually apply it. Both stemming and lemmatising help us reduce the corpus of words that the model is exposed to, and they explicitly correlate words with similar meaning. The lemmatiser is typically more accurate than the stemmer, but the trade-off is that it takes a little bit longer to run. Based on your machine learning pipeline, if the lemmatiser is going to be a bottleneck, then you may opt for the simpler stemmer.
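For reference, a hedged sketch of the lemmatising step applied above, mirroring the stemming function but calling wn.lemmatize (wn and data are assumed from the earlier snippets):

```python
def lemmatising(tokenised_text):
    # Lemmatise each token and return the new list
    return [wn.lemmatize(word) for word in tokenised_text]

data["body_text_lemmatised"] = data["body_text_nostop"].apply(lambda x: lemmatising(x))
```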
Vectorising Raw Data
Introducing Vectorising
The process that we use to convert text to a form that Python and a machine learning model can understand is called vectorizing. This is defined as the process of encoding text as integers to create feature vectors. Now if you don’t have much machine learning experience, you may be wondering what a feature vector is. A feature vector is an n-dimensional vector of numerical features that represent some object. So in our context, that means we’ll be taking an individual text message and converting it to a numeric vector that represents that text message.
Based on the following example, what we’re doing when we vectorize text is we’re taking this dataset that has one line per document with the cell entry as the actual text message and then we’re converting it to a matrix that still has one line per document, but then you have every word used across all documents as the columns of your matrix. And then within each cell is counting how many times that certain word appeared in that document. And this is called your document term matrix. We’ll be referring to this term quite a bit. Then once we have this numeric representation of each text message, then we can carry on down the pipeline and fit and train a model.
body_text | call | claim | free | txt | label |
---|---|---|---|---|---|
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive … | 0 | 0 | 1 | 1 | spam |
Nah I dont think he goes to usf he lives around here though | 0 | 0 | 0 | 0 | ham |
Even my brother is not like to speak with me They treat me like aids patent. | 0 | 0 | 0 | 0 | ham |
I HAVE A DATE ON SUNDAY WITH WILL!! | 0 | 0 | 0 | 0 | ham |
As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your call… | 0 | 0 | 0 | 0 | ham |
WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c… | 1 | 2 | 0 | 0 | spam |
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came… | 1 | 0 | 2 | 0 | spam |
I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried … | 0 | 0 | 0 | 0 | ham |
SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, … | 0 | 0 | 0 | 1 | spam |
URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM… | 0 | 1 | 1 | 1 | spam |
To understand the motivation behind this process: as we mentioned previously, when looking at a word, Python only sees characters. So we need to convert the text into a format that Python can understand in order for our machine learning models to start to learn what certain words indicate about the overall sentence or document or label that we're trying to predict. So we vectorize the text to create a matrix that only has numeric entries, in our case by counting how many times each word appears in each text message. The machine learning algorithm understands these counts, so if it sees a one or a two or a three in a cell, the model can start to correlate that with whatever we're trying to predict (in our case, spam) and roughly understand what the words, sentences, and documents represent. In our context, this means it can use how frequently certain words appear to determine whether an individual text message is spam or not.
Let's classify each row of the following document-term matrix as either spam or ham. We're just focusing on two words used in text messages, offer and lol, along with the label of either spam or ham. So that's what this looks like after vectorizing. Now how does a machine learning model use this information to learn what these words mean? It was mentioned before that by looking at the counts in the cells, the model can start to correlate which words happen in combination with certain labels.
Original:
id | offer | lol | label |
---|---|---|---|
1 | 0 | 4 | ham |
2 | 0 | 1 | ham |
3 | 4 | 0 | spam |
4 | 1 | 2 | ham |
5 | 2 | 0 | spam |
6 | 1 | 1 | spam |
So let’s isolate just the non-spam messages in following table. You can start to notice that offer occurs very infrequently, but lol occurs in a lot of these non-spam text messages. From the numbers here, the model could pretty easily pick up on the fact that lol occurs quite frequently with non-spam text messages and offer occurs very infrequently. You could see how this would allow a model to start to learn how to predict when a text is spam or not, based just on the text body.
Filtered ham:
id | offer | lol | label |
---|---|---|---|
1 | 0 | 4 | ham |
2 | 0 | 1 | ham |
4 | 1 | 2 | ham |
So now, let’s do the same thing, but we’ll jump over to the spam messages in the following table. You can pretty quickly notice that it’s the opposite. Offer occurs quite frequently, while lol occurs quite infrequently. So the model would pick up on the fact that offer occurs quite frequently in spam messages and lol occurs quite infrequently.
Filtered spam:
id | offer | lol | label |
---|---|---|---|
3 | 4 | 0 | spam |
5 | 2 | 0 | spam |
6 | 1 | 1 | spam |
So now, only considering these two words, the model has learned that offer occurs frequently with spam and infrequently with non-spam, while lol occurs frequently with non-spam and infrequently with spam. So you could see how maybe with even just these two words, the model could start making ham or spam predictions about new text messages based only on the number of times these two words occur. With that said, this is an extremely simple and exaggerated example for the purpose of illustration. In reality, the model would need to learn the relationships of much more than two words to make an accurate prediction. But this example was meant to show how vectorizing helps the model roughly learn what words correlate with which labels.
So far, the post has been talking about the entry of each cell in the document-term matrix containing the count of how many times a given word appears in that text message, but that's only one method of vectorization, called, not surprisingly, count vectorization. There are two other variations of count vectorization, called N-grams and term frequency-inverse document frequency, which is often referred to as TF-IDF. We're going to cover each of these three methods of vectorization in more detail.
All three of these methods will generate very similar document-term matrices where there's one row per document, or text message in our case, and the columns will represent each word or potentially a combination of words. The main difference between the three is what's in the actual cells of the matrix. So we'll start with count vectorization.
Count Vectorisation
Count vectorization creates the document-term matrix and then simply counts the number of times each word appears in that given document, or text message in our case, and that count is what's stored in the given cell, so it's pretty straightforward.
We’re going to use our SMSSpamCollection dataset, and then we’ll build a function to clean it up.
1 | import pandas as pd |
Create function to remove punctuation, tokenize, remove stopwords, and stem:
1 | def clean_text(text): |
The difference here is, we’re not going to use a lambda function to apply it to our data, like we have in the past. The CountVectorizer actually allows you to pass in a function to clean and tokenize your data.
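To make that concrete, here's a minimal sketch of what this step might look like. The file name SMSSpamCollection.tsv, the tab-separated label/body_text layout, and the details of the cleaning function are assumptions for illustration, not the post's exact code:

```python
import re
import string

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# nltk.download('stopwords')  # uncomment on the first run
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

# Assumed file name and layout: tab-separated, label then message body
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None,
                   names=['label', 'body_text'])

def clean_text(text):
    # Remove punctuation, tokenize on non-word characters, drop stopwords, and stem
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r'\W+', text.lower())
    return [ps.stem(word) for word in tokens if word not in stopwords]

# Pass the cleaning function straight into the vectorizer via the analyzer parameter
count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['body_text'])
print(X_counts.shape)  # (number of messages, number of unique tokens)
print(count_vect.get_feature_names_out()[:10])  # get_feature_names() on older scikit-learn
```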
Apply CountVectorizer:
1 | from sklearn.feature_extraction.text import CountVectorizer |
Output:
1 | (5567, 8104) |
This tells us there are 5567 text messages, and across those 5567 text messages there are 8104 unique words, which means our document-term matrix has 5567 rows and 8104 columns. The get_feature_names call then gives us the names of the columns of our document-term matrix.
Smaller example:
1 | data_sample = data[0:20] |
Output:
1 | (20, 192) |
Instead of 5,567 by 8,104, our matrix is just 20 rows by 192 columns, and here are the new feature names. There are some numbers in there, but you can also see words like around, brother, call, and caller.
One thing to note is that the raw output of the CountVectorizer is what's called a sparse matrix. So what is a sparse matrix? When you have a matrix in which a very high percentage of the entries are zero, as we do in this case, then instead of storing all of those zeros in the full matrix, which would be extremely inefficient, it's converted to storing only the locations and values of the non-zero elements, which is much more efficient for storage.
So if we just try to print out this X_counts_sample, you might expect it to give us our matrix, but instead it'll just tell us this is a sparse matrix object with 218 stored elements, as in the following example.
1 | X_counts_sample |
Output:
1 | <20x192 sparse matrix of type '<class 'numpy.int64'>' |
1 | X_counts_df = pd.DataFrame(X_counts_sample.toarray()) |
Output:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | … |
6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | … |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
8 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | … |
9 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | … |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | … |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
20 rows × 192 columns
You’ll notice that the column names don’t contain the word they actually represent. They’re just numbered from zero to 191. Now again, for Python, this doesn’t really matter because it doesn’t know the difference between a column name of 5, and a column name of text. So it’s just going to learn from the entries in that column, and its relationship with our label, to figure out how it can contribute to the model.
To be able to see what words those columns actually represent:
1 | X_counts_df.columns = count_vect_sample.get_feature_names() |
Output:
… | want | wap | watch | way | week | wet | win | winner | wkli | word | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
6 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | … | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
9 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
10 | … | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | … | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
13 | … | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
20 rows × 192 columns
Now you'll see this is exactly the same data frame, but now we have the actual column names.
N-gram Vectorising
Creates a document-term matrix where counts still occupy the cell but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in your text.
“NLP is an interesting topic”
n | Name | Tokens |
---|---|---|
2 | bigram | [“nlp is”, “is an”, “an interesting”, “interesting topic”] |
3 | trigram | [“nlp is an”, “is an interesting”, “an interesting topic”] |
4 | four-gram | [“nlp is an interesting”, “is an interesting topic”] |
The n-grams process creates a document-term matrix like we saw before. We still have one row per text message and counts still occupy the individual cells, but instead of the columns representing single terms like in the previous method, they now represent all combinations of adjacent words of length n in your text. Using the above example, let's take the string "NLP is an interesting topic." The table shows how that would break down. If n equals two, that's called a bigram, and it pulls all combinations of two adjacent words in our string. In "NLP is an interesting topic", it pulls out four tokens: NLP is, is an, an interesting, interesting topic. When n equals three, that's called a trigram. It pulls all combinations of three adjacent words, creating three tokens: NLP is an, is an interesting, and an interesting topic. When n equals four, that's called a four-gram, and it pulls all combinations of four adjacent words, as you see with the two tokens in the table.

You can make n as large or as small as you'd like. You could think of count vectorization as n-grams with n equal to one: it just pulls out the unigrams. When you use n-grams, there's usually an optimal n value or range that will yield the best performance, and generally you'll tune this value to see what generates the best model. The value here is that you get a little more context around your words. Rather than only seeing one word at a time, you'll see two or three or four. Just to tie this back to something we talked about earlier, NLP in everyday life, Google's auto-complete uses an n-gram-like approach. If you type "natural language" into Google, it knows that a very commonly used trigram starting with "natural language" is "natural language processing", so it might suggest that as a full phrase you'd like to search for.
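As a small, hedged illustration, scikit-learn's CountVectorizer exposes n-grams through its ngram_range parameter; the example below just reuses the sentence from the table above:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["NLP is an interesting topic"]

# ngram_range=(2, 2) keeps only bigrams; (1, 1) would reproduce plain count vectorization
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(doc)
print(bigram_vect.get_feature_names_out())
# ['an interesting' 'interesting topic' 'is an' 'nlp is']

trigram_vect = CountVectorizer(ngram_range=(3, 3))
trigram_vect.fit(doc)
print(trigram_vect.get_feature_names_out())
# ['an interesting topic' 'is an interesting' 'nlp is an']
```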
Read the data:
1 | import pandas as pd |
Create function to remove punctuation, tokenize, remove stopwords, and stem:
1 | def clean_text(text): |
Output:
label | body_text | cleaned_text | |
---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s | free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18 |
1 | ham | Nah I don’t think he goes to usf, he lives around here though | nah dont think goe usf live around though |
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | even brother like speak treat like aid patent |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | date sunday |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune | per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun |
Apply CountVectorizer with N-Grams:
1 | from sklearn.feature_extraction.text import CountVectorizer |
Output:
1 | (5567, 31260) |
You can see that we still have the same 5,567 rows, but now instead of the roughly 8,000 columns we saw before, there are over 31,000 columns. That means 31,000 unique combinations of two words. In these feature names, you'll see two-word combinations.
A smaller example:
1 | data_sample = data[0:20] |
Output:
1 | (20, 198) |
Instead of over 31,000 columns, now we're back down to 198. If you remember, that's still higher than the 192 unigrams that we saw in our last notebook. At this point, it's worth noting that using an n-gram range can end up creating a matrix with a ton of features. Say you used an n-gram range of (1, 2), so you're grabbing all the unigrams and bigrams. In the previous session we saw that there are over 8,000 unigrams in the full data set, and in this session we saw that there are over 31,000 bigrams in the full data set. Together that's roughly 39,000 columns using only unigrams and bigrams. Imagine if we added trigrams on top of that. So just be careful with n-grams. It's definitely worth experimenting to see both what will fit within memory and what helps you generate the best model.
Vectorizers output sparse matrices:
Sparse Matrix: A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix will be stored by only storing the locations of the non-zero elements.
1 | X_counts_df = pd.DataFrame(X_counts_sample.toarray()) |
Output:
vettam set | want talk | wap link | way feel | way gota | way meet | week free | win cash | win fa | winner valu | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
20 rows × 198 columns
This is a brief introduction to how you implement n-grams. The value of n-grams over regular count vectorization is that they provide a little more context around words. There is certainly a trade-off when you're choosing your n value. If you select only bigrams, maybe that's not enough to provide useful context. If you go all the way up to seven-grams, you're going to have a ton of features, and you'll only see the same seven-gram, in other words the same sequence of seven words, in maybe one text message. What that means is that every seven-gram column would be non-zero in only one row, and you'd have a massive matrix. N-grams can be powerful, but they require a little more care to implement.
Term Frequency - Inverse Document Frequency Weighting (TF-IDF)
TF-IDF creates a document term matrix, where there’s still one row per text message and the columns still represent single unique terms. But instead of the cells representing the count, the cells represent a weighting that’s meant to identify how important a word is to an individual text message.
The weighting is determined by the formula w(i, j) = tf(i, j) × log(N / df(i)), where tf(i, j) is how often word i appears in text message j relative to the length of that message, N is the total number of text messages, and df(i) is the number of text messages that contain word i. It may look a little intimidating, but it's actually quite simple.
You start with the tf term, which is just the number of times that term i occurs in text message j, divided by the total number of terms in text message j. It's simply the percent of terms in the given text message that are this specific word.
For example (walked through below):
If we use "I like NLP" and the word we're focused on is NLP, then the tf term would be 1 divided by 3, or 0.33. The second part of the equation measures how frequently the word occurs across all of the text messages. It takes the number of text messages in the data set divided by the number of text messages that the word appears in, and then takes the log of that. Let's say we have 20 text messages, so that's N in this case, and only one of those contains NLP, so df is 1. The second part of the equation would then be log of 20 divided by 1. As the fraction inside the log gets larger, the log of that fraction also gets larger. Now say you have 40 text messages instead of 20, but NLP still only occurs in one of them, so the denominator is still 1. Now the fraction is 40 over 1: the term NLP is relatively rarer, so this part of the weight gets larger. Basically, all this says is that the rarer a word is, the higher this value will be. If a word occurs very frequently within a particular text message (that's the tf term) but very infrequently elsewhere (that's the second term), then a very large weight will be assigned, and the word will be assumed to be very important for differentiating that text message from others. In summary, this method helps you pull out important but seldom-used words.
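To make those numbers concrete, here's a tiny sketch of the same calculation in Python. The choice of log base 10 is an assumption for illustration, and scikit-learn's TfidfVectorizer uses a slightly different, smoothed formula:

```python
import math

# Worked example: the word "nlp" in the message "I like NLP"
tf = 1 / 3                # "nlp" is 1 of the 3 terms in this message
N = 20                    # total number of text messages in the data set
df = 1                    # number of messages that contain "nlp"

idf = math.log10(N / df)  # rarer words produce a larger idf
print(round(tf * idf, 3))                      # 0.434

# With 40 messages and "nlp" still in only one of them, the idf term grows
print(round((1 / 3) * math.log10(40 / 1), 3))  # 0.534
```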
The following code is going to read the data:
1 | import pandas as pd |
Create function to remove punctuation, tokenize, remove stopwords, and stem:
1 | def clean_text(text): |
Apply TfidfVectorizer:
1 | from sklearn.feature_extraction.text import TfidfVectorizer |
Output:
1 | (5567, 8104) |
Apply TfidfVectorizer to smaller sample:
1 | data_sample = data[:20] |
Output:
1 | (20, 192) |
Vectorizers output sparse matrices:
Sparse Matrix: A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix will be stored by only storing the locations of the non-zero elements.
1 | X_tfidf_df = pd.DataFrame(X_tfidf_sample.toarray()) |
Output:
100000 | 11 | 12 | 150pday | 16 | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0.231645 | 0 | 0 |
6 | 0 | 0.197682 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0.224905 | 0.197695 |
9 | 0.252972 | 0 | 0 | 0 | 0 |
20 rows × 192 columns
You'll notice this looks a lot different from our other matrices. Instead of regular integer counts in the cells, you have decimals. For example, the 0.2316 in column '12' for the message at index 5 is likely more important than the 0.1977 in column '11' for the message at index 6. What that means is that either '12' occurs more frequently within the index-5 message than '11' does within the index-6 message, or '12' occurs less frequently across all the other text messages than '11' does.
So in summary, we've presented these as three different ways to vectorize, but that's a bit of a false choice. They're all very closely related, and some can actually be used together. TF-IDF is basically a count vectorizer that includes some consideration for the length of the document and for how common each word is across the other text messages. And n-grams can be used within either of the other two methods to look at groups of adjacent words instead of just single terms. They're all just slight modifications of each other, and typically you'll test different vectorization methods depending on your problem and let the results determine which one you use.
Feature Engineering
Introducing Feature Engineering
Feature engineering is the process of creating new features and/or transforming existing features to get the most out of your data. So up to this point, we’ve just been talking about what we’re given without really imagining what other features we might be able to extract from this data that would be helpful to predict spam or ham. The absence of this step could mean we’re potentially leaving some significant value on the table. So the model will now see the words in the text as represented by the vectorization, but nothing else.
What else could we extract from that text that would be helpful for the model to decipher spam from ham? For instance, maybe we could include the length of the text field; maybe spam tends to be a little longer than real text messages. Or maybe we could include what percent of the characters in the text message are punctuation; maybe real text messages underuse punctuation. Or maybe the percent of characters that are capitalized is indicative of whether it's spam or not. So those are a couple of ideas for features you could create that would help our model identify spam from non-spam.
So given these new features, or really any of the already existing features, maybe you need to apply some sort of transformation to your data to make it more well-behaved. One broad, popular class of transformations is the power transformations. These include squaring your data, taking the square root, et cetera. One example of why you might need to transform your data would be a very skewed data set with a very long right tail, where you have a lot of outliers. In that case, you might want to apply a log transformation, which basically pulls that long tail and all those outliers back towards the bulk of the data. What this does is help the model draw correlations and better understand the data without trying to overfit to that long tail and those outliers.
Another topic that falls under transformations is standardizing your data, or transforming it all to be on the same scale. Some models perform better when all features are on the same scale. It’s especially important as you get to this phase that you’re keeping the problem context in mind. When you do feature creation, you’re always trying to imagine what additional information might be helpful to the model within the context and understanding of what exactly it’s trying to predict. For instance, as an extreme example in our spam detection problem, the number of As that appear in a text message likely isn’t predictive of whether this is spam or ham. But maybe the amount of punctuation or the length of a text message is. So always keep your problem in mind. And this is the stage where you’re allowed to get a little bit creative to try to extract as much value out of your data as possible.
Feature Creation
Read in the data:
1 | import pandas as pd |
Create feature for text message length:
1 | # Be careful of the white spaces, they need to be subtracted |
Output:
label | body_text | body_len | |
---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s | 128 |
1 | ham | Nah I don’t think he goes to usf, he lives around here though | 49 |
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | 62 |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28 |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune | 135 |
Create feature for % of text that is punctuation:
1 | import string |
Output:
label | body_text | body_len | punct% | |
---|---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s | 128 | 4.7 |
1 | ham | Nah I don’t think he goes to usf, he lives around here though | 49 | 4.1 |
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | 62 | 3.2 |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28 | 7.1 |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune | 135 | 4.4 |
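For reference, here's a minimal sketch of how these two features might be computed, assuming data has the body_text column shown above; the rounding is only there to match the displayed values:

```python
import string

def count_punct(text):
    # Percent of non-space characters that are punctuation
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / (len(text) - text.count(" ")), 3) * 100

# Message length, excluding white space, and punctuation percentage
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(count_punct)
print(data.head())
```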
The goal here is to generate new features that help a model distinguish spam from real text messages. So it’s always useful to find some way to see if your new features appear to be predictive, or correlated to the response in some way.
Feature Evaluation
We'll use overlaid histograms to look at the value of these created features. So we're going to import pyplot from matplotlib, import numpy and store it as np, and then tell the notebook to render matplotlib plots inline.
1 | from matplotlib import pyplot |
We’re going to build two histograms. The first is going to look at the distribution of our body length for spam, and then the second one is going to look at the distribution of body length for non-spam, and that’s how we’re going to use these histograms to see if this new created feature is helpful for distinguishing spam from non-spam.
The first parameter that we have to pass in is the thing that we actually want to plot, so that’s body_len.
1 | # We're going to pass in the starting point, which is going to be zero, |
Output:
You can see that body length is very different for ham versus spam. So spam text messages seem to be quite a bit longer than regular text messages. So it appears that this extra feature could be really helpful for the model to distinguish ham from spam. So if we didn’t create this feature, the model may not necessarily pick up on this difference.
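Here's roughly what that plotting code might look like; the bin edges and the density/alpha settings are assumptions for illustration:

```python
import numpy as np
from matplotlib import pyplot
# %matplotlib inline  (when running in a notebook)

bins = np.linspace(0, 200, 40)

# Overlay normalized histograms of body length for spam vs. ham
pyplot.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, density=True, label='spam')
pyplot.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, density=True, label='ham')
pyplot.legend(loc='upper left')
pyplot.show()
```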
Evaluating the Punctuation Percentage Feature:
1 | # Change the upper bound to 50 |
Output:
You can see there’s not nearly as much of a difference in punctuation use. You can also see that spam might be a little bit more concentrated here on the left, whereas ham tends to have more of a tail over to the right-hand side.
It's pretty clear which one of these new features is likely to help out the model the most. So in terms of our original hypotheses, our hypothesis that spam messages tend to be longer than non-spam messages seems to be correct based on this evaluation, and this feature is likely to provide some value to the model. However, our hypothesis that ham messages contain less punctuation than spam doesn't appear to be accurate, and it isn't quite clear whether this feature will provide value to the model. Now, in cases like this where there is some separation between the distributions, typically we'll err on the side of leaving the feature in the model just to see what kind of value the model itself may be able to extract out of it. So this is an example of how you might evaluate whether some newly created features will be useful to the model.
Identifying Features for Transformation
Reading the raw data:
1 | import pandas as pd |
Create the two new features:
1 | import string |
Output:
label | body_text | body_len | punct% | |
---|---|---|---|---|
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s | 128 | 4.7 |
1 | ham | Nah I don’t think he goes to usf, he lives around here though | 49 | 4.1 |
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | 62 | 3.2 |
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28 | 7.1 |
4 | ham | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune | 135 | 4.4 |
In order to determine whether a transformation might be helpful, we can look at the distribution of our data using a histogram. Now, in the previous session, we looked at the normalized overlaid histograms, but we didn't look at the full histograms, so we're still not exactly sure what the full distribution looks like for these new features; we only know what it looks like when split by label. The first thing we'll do is look at those full distributions, and then we can determine which one might be a fit for transformation. What we're looking for here is a dramatic skew with a really long tail, or maybe a few outliers. Those are the scenarios that make a feature a prime candidate for transformation.
Import some necessary packages:
1 | from matplotlib import pyplot |
Plot the two new features:
1 | bins = np.linspace(0, 200, 40) |
Output:
We saw that spam messages are mostly long, so those are the ones concentrated on the higher end, while real texts are mostly short, so those are concentrated on the lower end. So we see this bimodal distribution with two different spikes. It seems that this isn't really a great candidate for transformation, because it's not heavily skewed and there aren't really any clear outliers.
1 | bins = np.linspace(0, 50, 40) |
Output:
This one could very well be a nice candidate for a transformation. It's fairly skewed: we see a lot of values close to zero and then a tail extending all the way up to around 40, with some outliers. A model might dig too much into this skewed tail and end up a little biased. We're going to focus on this feature for our transformations.
Box-Cox Power Transformation
A transformation is a process that alters each data point in a certain column in a systematic way that makes it cleaner for a model to use. For instance, that could mean squaring each value, or maybe taking the square root of each value in a given column. So let’s say a distribution for a certain feature has a long right tail like this one does in the following image. Then the transformation would aim to pull that tail in to make it a more compact distribution like we see in the example here using a log transformation. We do this so that the model doesn’t get distracted trying to chase down outliers in that tail.
The family of transformations that we'll be working with is called the Box-Cox power transformations. This is a very common type of transformation. The base form of this type of transformation is y to the x power, where y is the value in an individual cell and x is the exponent of the power transformation you're applying. The following table shows some common power transformations using exponents from negative two up to positive two. For the first line in the table, with an exponent of negative two, that translates to y to the negative two, which is the same as one over y squared.
Base Form:
X | Base Form | Transformation |
---|---|---|
-2 | y^-2 | 1/y^2 |
-1 | y^-1 | 1/y |
-0.5 | y^-0.5 | 1/√y |
0 | y^0 | log(y) |
0.5 | y^0.5 | √y |
1 | y^1 | y |
2 | y^2 | y^2 |
So let's introduce an example, shown in the following table. Let's say that 50% of the characters in a given text message are punctuation, so the value in that cell is 50. Let's go through these different transformations and see how each one would change the value. Starting with the first line, one over y squared: in this example that's one over 50 squared, or one over 2,500, which gives you 0.0004. The next transformation is just one over 50, then one over the square root of 50, and so on.
Original Value | Transformation | Transformed Value |
---|---|---|
50 | 1/50^2 | 0.0004 |
50 | 1/50 | 0.02 |
50 | 1/√50 | 0.14 |
50 | log(50) | 1.7 |
50 | √50 | 7.07 |
50 | 50 | 50 |
50 | 50^2 | 2500 |
This gives you an idea of how different power transformations alter the original values. In practice, the process looks like the steps in the following Process section. First, you determine what range of exponents you want to test out; in our example that's a range from negative two to positive two, which is a commonly used range. Then you apply each of these transformations to every value in the feature you'd like to transform. Then you use some criterion to determine which of the transformations yielded the best distribution. You can read about the different criteria you can use to decide on the best distribution, but we're just going to plot each one in a histogram and pick the one that looks the most like a normal distribution, because that means it'll be a nice, compact distribution that's easier for the model to use.
Process
- Determine what range of exponents to test.
- Apply each transformation to each value of your chosen feature.
- Use some criteria to determine which of the transformations yield the best distribution.
So what we want to do is apply a bunch of different power transformations and pick the histogram that looks the most like a normal distribution.
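Here's a rough sketch of what that loop might look like, assuming the punct% feature created earlier; each pass raises the values to the power 1/i and plots the resulting histogram:

```python
from matplotlib import pyplot

for i in range(1, 6):
    # Exponents 1, 1/2, 1/3, 1/4, 1/5; i = 1 reproduces the original distribution
    pyplot.hist(data['punct%'] ** (1 / i), bins=40)
    pyplot.title('Transformation: 1/{}'.format(i))
    pyplot.show()
```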
1 | for i in range(1, 6): |
Output:
For the first loop through you’ll see that this is the exact distribution that we saw before.
As we move to the next one, the square root transformation, or exponent of one half, you can see it’s kind of starting to pull this tail in. You can see how much the scale has changed and you see it looks a little bit nicer, a little bit more compact, a little bit more like a normal distribution.
Then you go to one third, it’s even more so. More compact, more like a normal distribution.
One fourth is a little bit better and then one fifth is even better. Again you’ll notice that the outliers continue to get pulled in closer and closer to the center of the distribution. So given this view either one fourth or one fifth would be chosen as a transformation. Both of those look pretty good.
Before we move on, note that there is still a stack of values at the left, and those are all just zeros, meaning messages with no punctuation. Any power transformation of zero is just going to stay at zero, so that stack on the left will remain. What we're mostly concerned with is the rest of the distribution and how the transformation affects it. Power transformations are a commonly used method for transforming skewed data, or data that isn't behaving particularly well. They help your model key in on the data and leverage it to make predictions in a cleaner way.
Building Machine Learning Classifiers
What is Machine Learning
If you ask 10 people, you may get 10 different definitions of what machine learning is. As an illustration of that, here’s a few widely accepted definitions.
- The field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959
- A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. - Tom Mitchell, 1998
- Algorithms that can figure out how to perform important tasks by generalizing from examples. - University of Washington, 2012
- Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world. - NVIDIA, 2016
There are a few different viewpoints on how to formally define machine learning. The point is that there's no clear, agreed-upon definition, but hopefully these four definitions start to give you an idea of what machine learning is.
Two Broad Types of Machine Learning
Supervised learning is the type of machine learning that we’ll be doing in this post. Unsupervised learning is where you don’t have any explicit labels and it’s about deriving structure from the data where you don’t know the effect of any of the features. In other words, you don’t have anything you’re trying to predict, necessarily. You’re just trying to back out some sort of information or structure using the variables that you are given. A couple of quick examples. An example of supervised learning would be a spam filter. The machine learning model would predict whether any given email is spam based on known information about the email, so email content, maybe sender, recipient, what time it was sent, maybe the structure of the email, things like that. An example of unsupervised learning would be grouping together similar emails into distinct folders based on the content. There is no right or wrong answer necessarily, but if a model can identify that these 15 emails are all in regards to a vacation to Italy and those other ones are about planning a certain family gathering, and then it can group those together into their own little bundles. That’s what unsupervised learning is.
Cross-validation and Evaluation Metrics
There are a lot of different methods and metrics that you can use. The first thing we need to define is a holdout test set. This is a sample of data that is set aside and not used in any of the fitting of the model, for the purpose of evaluating the model's ability to generalize to unseen data. It's meant to simulate how the model will perform in real-world scenarios, which is the entire point of building these models: to generalize and say something about the world. Keep that concept of a holdout test set in the back of your mind for a minute. We'll primarily be using k-fold cross-validation to evaluate our models. In this process, the full data set is divided into k subsets and the holdout method is repeated k times. That is, in each iteration one of the k subsets is treated as the holdout test set and the other k-1 subsets are put together to train the model. The purpose is that this gives you a more robust read on the performance of the model than a single holdout test set would. Now you have k test sets and k evaluation metrics to understand the potential performance outcomes.
The above image is an example of what cross-validation actually looks like. In this example, we start with a full data set of 10,000 examples and we want to run five-fold cross-validation, so k equals five. The first step is to split that 10,000-example data set into k, or five, subsets. You'll see that we now have five subsets of data, each with 2,000 examples. This is sampling without replacement, so all 10,000 examples are still accounted for, and it's worth noting that these subsets will remain the same throughout this entire process; an example in subset one will remain in subset one all the way to the end of the cross-validation. We assign one of these subsets as the test set, which is subset five in red here, and then we assign the other four in blue to the training set. Now we fit a model on the 8,000 examples in blue, evaluate the model on the 2,000-example test set in red, and then record the performance metric and store it away in an array. To be clear, this is all handled under the hood in scikit-learn. You don't have to manually implement these steps, as you'll see when we actually write the code, but it's important to understand what it's doing when it carries out this process. Again, you'll pick an evaluation metric, and after the first iteration it will store the performance of the model on the holdout test set; here we're saying that's 0.867. Next we move on to the second iteration, where the fourth subset is now the test set and subsets one through three, along with the fifth one, are the training set. So again, we re-fit a brand new model on those 8,000 training examples, evaluate it on the 2,000 test examples in the fourth subset, and store the performance metric; here we're saying that's 0.884. Then for the third iteration, the third subset is our test set, and the model is trained on the 8,000 examples in the first, second, fourth, and fifth subsets. So we train a brand new model, evaluate it on the third subset, and store the performance metric; here we'll say it's 0.901. For the fourth iteration, same process as before, but now the second subset is our holdout test set. Last but not least, the first subset is the test set and subsets two through five are the training data. Fit the model, evaluate it on the first subset, and store the evaluation metric. It's worth noting that at this point, every subset, and thus every example, has been used in a training set four times and in an evaluation set once. So we've now used this model configuration to fit a model on all the different combinations of these examples and evaluated it on every single point in the data set, and you can see why this can be a really powerful tool to gauge a model's ability to generalize. Lastly, you would normally output either the full array of scores, so all five of them, or just the simple average; the average here is 0.885. Now we have a more robust gauge of the potential outcome of this model. Say we were using a single holdout test set to gauge the performance of the model for our business. In the first iteration you'd get a score of 0.867, but the average over all five is 0.885 and the highest is 0.901. That's a difference of 0.018 from the lowest score to the average, and 0.034 from the lowest score to the highest score.

That may not seem like a big difference, but in a business setting where this could impact millions of dollars, it's a huge difference. So having a read on how this model performs over five different test sets is a huge advantage in truly understanding how your model might perform in the real world. This gives us more confidence that the model will perform around 0.885 and gives us a range of plausible outcomes, or some error bars, on the projection.
So let’s talk about actual evaluation metrics now. So for a classification problem like this spam ham data set that we’re working with, we’ll generally use three main performance metrics.
The first is accuracy, shown above: the number of observations predicted correctly divided by the total number of observations. So if you have 10,000 observations and 8,000 of them are labeled correctly, then your accuracy is 80%.

The second metric is precision, as shown above. Within the context of the problem we're working with, that's the number the model predicted as spam that are actually spam, divided by the total number that the model predicted as spam.

The last evaluation metric is called recall, presented above: the number predicted by the model to be spam that are actually spam, so the same numerator as precision, but now divided by the total number that are actually spam instead of the total number predicted as spam. So that's accuracy, precision, and recall.

The above image is the visual representation of precision and recall. You can see that the numerator in both is the same, the amount that you correctly identified, but the denominator is different. Precision is over all the things you said were relevant, while recall is over all the things that are actually relevant; in our example, relevant means it's spam. Precision and recall give you the ability to tailor the aggressiveness of your algorithm to your business problem. For instance, if false positives are really costly, then you'll want to optimize your model for precision. But if false negatives are really costly, then you'll want to optimize the model for recall. Only knowing accuracy may not give you insight into this kind of trade-off.
Introducing Random Forest
Random forest is one type of a machine learning algorithm that falls into a broader category of ensemble learners. This takes advantage of the ensemble method, which is a technique that creates multiple models and then combines them to produce better results than any of the single models individually. The idea behind ensemble learning is that you can combine a lot of weak models to create a single strong model. The basic idea is that this leverages the aggregate opinion of many over the isolated opinion of one. This method has a very strong theoretical motivation.
Random forest is an ensemble learning method that constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction. So in this case, your weak models are the individual decision trees, and then those are combined into the strong model that is the aggregated random forest model. So you may say, I want to build a random forest model to predict ham or spam. Let’s just say that the random forest has one hundred decision trees in it. Then each of the hundred decision trees are built independently of one another, and each will output a prediction of either spam or ham. So let’s say 60 of those decision trees vote spam and 40 vote ham. Then the final prediction of the random forest model will be spam. So it’s really just a simple voting method for the trees.
There are a lot of benefits to using random forest; it's a very versatile and powerful machine learning algorithm. Its benefits include:
- Can be used for classification or regression
- Easily handles outliers, missing values, etc.
- Accepts various types of inputs (continuous, ordinal, etc.)
- Less likely to overfit
- Outputs feature importance
So random forest is really versatile, and it often makes a terrific first pass at your data, because you rarely have to do a lot of data cleaning, because it can accept pretty much anything. Beyond that, it’s powerful, and it outputs feature importance to help you get a feel for which of your features are really useful, and which aren’t.
Building a Random Forest Model
Clean up the data as in previous chapters, this time using tf-idf to vectorise it, then split the data into features and labels:
1 | import nltk |
Output:
body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|---|---|
0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8106 columns
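For reference, a hedged sketch of how this feature matrix might be assembled, combining the two engineered features with the TF-IDF columns; the file name and cleaning details are assumptions carried over from earlier sections:

```python
import re
import string

import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

# Assumed file name and layout, as in earlier sections
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None,
                   names=['label', 'body_text'])

def count_punct(text):
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / (len(text) - text.count(" ")), 3) * 100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(count_punct)

def clean_text(text):
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r'\W+', text.lower())
    return [ps.stem(word) for word in tokens if word not in stopwords]

# Vectorize with TF-IDF, then bolt the two engineered features onto the front
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'],
                        pd.DataFrame(X_tfidf.toarray())], axis=1)
print(X_features.head())
```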
Import the necessary package and print out the available methods and attributes, as well as the hyperparameters:
1 | from sklearn.ensemble import RandomForestClassifier |
Output:
1 | ['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_estimator_type', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_required_parameters', '_set_oob_score', '_validate_X_predict', '_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params'] |
The first is feature_importances_. This is what outputs the value of each feature to the model. It's an awesome tool; most algorithms do not provide this, and it's really helpful. Then, fit is what allows you to fit your actual model, and you'll store that fit model as an object. You can then use the predict method on that fit model object to make predictions on your test set.
Now for the hyperparameters contained within our random forest classifier, which were printed out above. The first is max_depth; this is how deep each one of your decision trees will be. You'll notice that the default is None, which basically means it will build each decision tree until it minimizes some loss criterion. The second is n_estimators; this is how many decision trees will be built within your random forest, and the default here is 10. With these defaults, your random forest would build 10 decision trees of unlimited depth, and then there would be a vote among those 10 trees to determine the final prediction.
Import packages:
1 | from sklearn.model_selection import KFold, cross_val_score |
What these functions do: KFold facilitates splitting your full data set into the subsets we saw earlier, and cross_val_score is what gives us the actual scoring.
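A minimal sketch of how those two pieces might fit together; the hyperparameter values and the accuracy scoring choice are assumptions for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Hyperparameter values here are just illustrative starting points
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
k_fold = KFold(n_splits=5)

# cross_val_score handles the splitting, fitting, and scoring for each fold
scores = cross_val_score(rf, X_features, data['label'], cv=k_fold,
                         scoring='accuracy', n_jobs=-1)
print(scores)
```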
1 | # Setting n_jobs to negative one allows this to run faster by building the individual decision trees in parallel. |
Notice: Scikit-learn always expects you to input your X_features and your label separately in this way, so often people will just split up their original data frame into X_features and Y, before they even run anything through scikit-learn.
Output:
1 | array([0.96678636, 0.97307002, 0.96945193, 0.95507637, 0.95867026]) |
Random Forest with Holdout Test Set
Reading in data, creating new features, cleaning that data, and then vectorizing it:
1 | import nltk |
Output:
body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|---|---|
0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8106 columns
The first step is to import the precision_recall_fscore_support function from the sklearn.metrics module. We'll store it as score so that we don't have to type out the entire function name every time we want to use it. Then we'll import the train_test_split function from sklearn.model_selection.
1 | from sklearn.metrics import precision_recall_fscore_support as score |
In the code below, the first step is to split our data into a training set and a test set. We call the train_test_split function and pass in the X features first, then our label, data['label'], and then the last parameter we need to pass in is the test size, in other words, what percent of our original dataset we want to allocate to the test set. A commonly used value is 20%, so 0.2. This train_test_split call outputs four datasets, so we have to tell it what to name them. The first one it outputs is X_train, then X_test, then y_train, and lastly y_test. It's always in that order, so it's very important that you name your data frames accordingly.
1 | X_train, X_test, y_train, y_test = train_test_split(X_features, data["label"], test_size = 0.2) |
Import random forest classifier and setting hyperparameter settings as well as fitting the model:
1 | from sklearn.ensemble import RandomForestClassifier |
Getting the feature importances:
1 | # zip pairs the two arrays together element by element |
Output:
1 | [(0.04540639892466463, 'body_len'), |
Now you'll see all these numbers here; as we saw in our full data frame, when you vectorize, the actual words don't become the column names, they're just assigned a number, and that's what these represent. But the main point is that body length is pretty clearly the most important feature, which is not surprising based on the feature evaluation that we did.
If you don’t see body length here, as your most important feature, that’s okay. Random forest includes some random sampling for each of the decision trees, so if the sampling in your trees are slightly different than the sampling in my trees, you may come out with different feature importances. So you may run your model a few different times, and each time you could get slightly different feature importances.
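For completeness, a small sketch of how that feature-importance listing might be produced, assuming rf_model is the fitted classifier from above and X_train keeps its column labels:

```python
# Pair each importance score with its column, sort by importance, and take the top ten
top_features = sorted(zip(rf_model.feature_importances_, X_train.columns),
                      key=lambda pair: pair[0], reverse=True)[0:10]
print(top_features)
```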
Make predictions and check the scores:
1 | y_pred = rf_model.predict(X_test) |
Output:
1 | Precision: 1.0 / Recall: 0.616 / Accuracy: 0.95 |
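For completeness, here's roughly what the elided prediction-and-scoring step might look like, using the score alias imported earlier; the pos_label and rounding choices are assumptions for illustration:

```python
y_pred = rf_model.predict(X_test)

# pos_label tells the scorer which class counts as the "positive" class
precision, recall, fscore, support = score(y_test, y_pred,
                                           pos_label='spam', average='binary')
accuracy = (y_pred == y_test).sum() / len(y_pred)
print('Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(precision, 3), round(recall, 3), round(accuracy, 3)))
```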
So just as a reminder of what that actually means in the context of a spam filter: 100% precision means that when the model identified something as spam, it actually was spam 100% of the time, which is great. The 61.6% recall means that of all the spam that came into your email, 61.6% of it was properly placed in the spam folder, which means the other 38.4% went into your inbox, so that's not great. And lastly, the 95% accuracy just means that of all the emails that came in, spam or non-spam, they were identified correctly as one or the other 95% of the time. So in summary, the amount of spam still making it to our inbox tells us that our model's not quite aggressive enough in identifying spam.
Random Forest Model with Grid Search
In the last session, we fit just a single model with a single set of hyperparameter settings, and then we generated a single set of evaluation metrics. But aren’t you a little curious to see if we can maybe make our model better, simply by changing the hyperparameter settings, like the number of estimators, or the max depth? We mentioned the last model wasn’t quite aggressive enough. Could we capture more spam by altering the hyperparameter settings? That’s where grid-search comes in. Grid-search basically means defining a grid of hyperparameter settings, and then exploring a model fit with each combination of those hyperparameter settings. So in our case, that means setting a range of number of estimators, and a range of max depth, that you’d like to explore. And then grid-search will test every combination of those, and fit a model and evaluate it, to see which hyperparameter combination generates the best model.
Clean the data:
1 | import nltk |
Output:
body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|---|---|
0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8106 columns
Import some necessary packages:
1 | from sklearn.ensemble import RandomForestClassifier |
Splitting the data into training and test set:
1 | X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2) |
The function to train and predict the data:
1 | def train_RF(n_est, depth): |
So now we have the function above that accepts number of estimators, and depth, constructs a random forest classifier object, using those number of estimators and depth, fits that model, predicts, and then generates some results metrics, and prints out those results at the very end.
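A hedged sketch of what that function might look like, reusing the score alias and the train/test split from above:

```python
def train_RF(n_est, depth):
    # Fit a random forest with the given hyperparameters, then score it on the test set
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred,
                                               pos_label='spam', average='binary')
    accuracy = (y_pred == y_test).sum() / len(y_pred)
    print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        n_est, depth, round(precision, 3), round(recall, 3), round(accuracy, 3)))
```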
The following nested for loop will iterate through each combination of these parameter settings, and train a model, evaluate that model on a test set, and then print out the results. That’s essentially what grid-search is. So the first time through, we’ll grab number of estimators equal to 10, and depth equal to 10, it’ll pass it into our function above, it’ll fit a model with those parameter settings and print out the results, and then it’ll iterate through each combination.
1 | for n_est in [10, 50, 100]: |
Output:
1 | Est: 10 / Depth: 10 ---- Precision: 1.0 / Recall: 0.408 / Accuracy: 0.934 |
So in this example, as the depth increases from 10, to 20, to 30, and eventually to none, the recall increases quite drastically, while you see the precision doesn’t really drop. So the model is getting much better and more aggressive as the depth increases. On the other side, you’ll notice that adding estimators might be helping a little bit, but the improvement isn’t as drastic as adding depth to the individual trees. So by looking at these results, we can immediately eliminate any model that has a limited max depth. 10 is clearly pretty bad, no matter how many estimators you have. 20 isn’t really great either. Once you get towards 30, it starts to level out. So we know that the best random forest model is one with very high max depth, probably no limit. And number of estimators plays a little bit of a role, but not nearly as much as max depth. So this is a very broad view into how we would typically approach model fitting and evaluation. In general, we would explore a much wider range for each parameter setting, and we would probably explore more parameters also.
Evaluate Random Forest Model Performance
We're going to combine grid search and cross-validation to create a very powerful model tuning and evaluation tool, one that is often the default for tuning and evaluating machine learning models. To recap very quickly, grid search means setting up the different parameter settings you want to test and then exhaustively searching that entire grid to determine the best model. Cross-validation takes your data set, divides it into k subsets, and then repeats the holdout method k times, training on some of the data and evaluating on a separate subset in each iteration. So, in each iteration, you're using a different subset of data as the test set and all the rest of the data as the training set. Combining these ideas into GridSearchCV, this method lets you define a grid of parameters that you want to explore, and then within each setting it runs cross-validation.
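Here's a minimal sketch of how GridSearchCV might be wired up for this problem. The grid values mirror the 12 combinations shown in the results below, but the variable name X_tfidf_features and the exact settings are assumptions for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None],
}

# Run 5-fold cross-validation for every one of the 12 combinations in the grid;
# X_tfidf_features is an assumed name for the tf-idf feature matrix built earlier
gs = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_tfidf_features, data['label'])

tfidf_result = pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score',
                                                            ascending=False)
```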
Clean the data and use two vectorisation methods - tf-idf and count vectoriser:
1 | import nltk |
Output:
body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|---|---|
0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8106 columns
Import some necessary packages:
1 | from sklearn.ensemble import RandomForestClassifier |
1 | # Instantiate random forest classifier |
Same process as above, but swap the tf-idf vectoriser for the count vectoriser:
1 | rf = RandomForestClassifier() |
1 | tfidf_result[["mean_fit_time", "mean_score_time", "mean_test_score", "param_max_depth", "param_n_estimators"]] |
Output:
mean_fit_time | mean_score_time | mean_test_score | param_max_depth | param_n_estimators | |
---|---|---|---|---|---|
8 | 47.2303 | 0.59588 | 0.974852 | 90 | 300 |
7 | 24.3235 | 0.403694 | 0.974313 | 90 | 150 |
10 | 25.5253 | 0.390608 | 0.973954 | None | 150 |
11 | 36.2061 | 0.268437 | 0.972876 | None | 300 |
3 | 3.63283 | 0.183171 | 0.972517 | 60 | 10 |
There are 12 parameter combinations, so there are 12 lines in this result set, but we're only looking at the top five. mean_fit_time is the average time it takes each model to fit, mean_score_time is the average time it takes each model to make predictions on the test set, mean_test_score is the average accuracy on the test set, and mean_train_score (not shown here) is the average accuracy on the training set. In terms of parameter combinations, you'll notice that the best-performing models are the ones with the deepest individual decision trees, and that the number of estimators doesn't seem to matter quite as much for the top models. If you look at mean_fit_time, you'll also see that it's much, much faster for 10 estimators than it is for 150 or 300.
1 | count_result[["mean_fit_time", "mean_score_time", "mean_test_score", "param_max_depth", "param_n_estimators"]] |
Output:
mean_fit_time | mean_score_time | mean_test_score | param_max_depth | param_n_estimators | |
---|---|---|---|---|---|
7 | 18.9076 | 0.260673 | 0.973056 | 90 | 150 |
8 | 36.1114 | 0.410019 | 0.972517 | 90 | 300 |
11 | 32.4232 | 0.290319 | 0.972337 | None | 300 |
10 | 20.0984 | 0.265436 | 0.971978 | None | 150 |
5 | 29.0064 | 0.329088 | 0.970181 | 60 | 300 |
So now let's take a look at how the mean_test_score on this tf-idf data set compares to the count vectorizer data set. This is exactly the same setup as the prior section, just with the CountVectorizer instead of the TfidfVectorizer. The call-out here is that the best mean_test_score, at 97.3%, is just a tad below the 97.4% for tf-idf, so tf-idf is doing slightly better.
In practice, we would usually explore a lot more settings than we’re exploring here. We would also test n-grams, whether we should include stop words, whether removing punctuation is helpful, different parameters within the vectorizer, and four or five other hyperparameters within random forest. As you can imagine, exploring all these different combinations can quickly generate an enormous number of candidate models. It’s not uncommon to test over a hundred, or even a thousand in some cases. So, that’s a quick introduction to random forest and to tuning and evaluating the model using two different vectorizing frameworks.
Introducing Gradient Boosting
Gradient boosting has some similarities to random forest, but some key differences as well. Gradient boosting is also an ensemble method, just like random forest. As a quick review, an ensemble method is a technique that creates multiple models and then combines them to produce better results than any of the single models individually. Gradient boosting is an ensemble method that takes an iterative approach to combining weak learners to create a strong learner, by focusing on the mistakes of prior iterations. In the previous section, we learned how random forest builds a certain number of fully grown decision trees simultaneously and then uses a voting method to combine the predictions of each. Gradient boosting uses decision trees as well, but they’re incredibly basic, like a decision stump. It evaluates what it gets right and what it gets wrong on that first tree, and then with the next iteration it places a heavier weight on the observations that it got wrong. It does this over and over again, focusing on the examples it doesn’t quite understand yet, until it has minimized the error as much as possible.
So this is an incredibly powerful technique, but you might be thinking, how is this really different from random forest? They are the same in that they’re both ensemble methods based on decision trees, but there are a lot of differences as well. For instance, gradient boosting uses a method called “boosting” while random forest uses a method called “bagging”. Both of these methods include sampling for each tree that is built. The five-second version of the difference is that bagging samples randomly, while boosting samples with an increased weight on the examples it got wrong previously. Because all the trees in a random forest are built without any consideration for the other trees, random forest is incredibly easy to parallelize, which means it can train really quickly. So if you have 100 trees, you could train them all at the same time. Gradient boosting, on the other hand, is iterative: it relies on the results of the previous tree in order to apply a higher weight to the examples that tree got incorrect. So boosting can’t be parallelized, and it takes much longer to train. As you get into massive training sets, this becomes a serious consideration.
Another difference is that the final predictions for random forest are typically an unweighted average or an unweighted vote, while boosting uses weighted voting. And lastly, random forest is easier to tune, faster to train, and harder to overfit, while gradient boosting is harder to tune, slower to train, and easier to overfit. So why would you go with gradient boosting? The trade-off is that gradient boosting is typically more powerful and better-performing, if tuned properly.
Enough about random forest; let’s focus on gradient boosting. What are the benefits? Well, it’s one of the most powerful machine learning classifiers out there. It also accepts various types of inputs, just like random forest, which makes it very flexible. It can be used for classification or regression, and it outputs feature importances, which can be super useful. But it’s not perfect. Some of the drawbacks are that it takes longer to train because it can’t be parallelized, it’s more likely to overfit because it obsesses over the examples that it got wrong, and it can get lost pursuing outliers that don’t really represent the overall population. It’s also harder to tune because there are more parameters. That’s a very, very brief introduction to gradient boosting.
Gradient-boosting Grid Search
Read in and clean data using tf-idf:
1 | import nltk |
Output:
 | body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|---|---|
0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8106 columns
Import the package for the classifier:
1 | from sklearn.ensemble import GradientBoostingClassifier |
Print out the attributes:
1 | print(dir(GradientBoostingClassifier)) |
Output:
1 | ['_SUPPORTED_LOSS', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_initialized', '_check_params', '_clear_state', '_estimator_type', '_fit_stage', '_fit_stages', '_get_param_names', '_get_tags', '_init_state', '_is_initialized', '_make_estimator', '_raw_predict', '_raw_predict_init', '_required_parameters', '_resize_state', '_staged_raw_predict', '_validate_estimator', '_validate_y', 'apply', 'decision_function', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'staged_decision_function', 'staged_predict', 'staged_predict_proba'] |
Now, for the attributes and methods, you can see that they’re almost exactly the same as they are with random forest. You’ll notice fit, you’ll notice predict, you’ll notice the feature importances. This is one of the great things about scikit-learn: once you learn how to use one classifier or object, you pretty much know how to use them all, even down to the naming conventions of the actual models. So we have the random forest classifier, we have the gradient boosting classifier, and if we were to expand from there, we’d have logistic regression and so on. Looking at the default settings and the hyperparameter settings for gradient boosting, you’ll again see max depth, and you’ll also see n estimators.
If you recall back to random forest, the default setting for max depth was None, so it could build each tree as deep as it wanted. Here, the max depth default setting is three. Then again, for random forest the default for the number of estimators was 10, and for gradient boosting it’s 100. That calls out one of the big differences between random forest and gradient boosting: random forest is built with a relatively small number of fully grown trees, whereas gradient boosting uses a lot of very basic trees, which speaks to how these models are built differently, and even how they optimize differently. Another thing you’ll notice is that there’s no n_jobs parameter here like we saw for random forest. Remember, that parameter made your model train faster by parallelizing it. But as we mentioned before, you can’t parallelize gradient boosting because each iteration builds on the prior iteration, so that’s why we don’t have an n_jobs parameter. There is one more parameter worth calling out: the learning rate. Learning rate determines how quickly the algorithm optimizes, but it also has performance implications, because it can cause the model to optimize too quickly without truly finding the best model.
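A quick way to confirm those defaults yourself is get_params(); a small sketch:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
defaults = gb.get_params()
# Pull out just the hyperparameters discussed above
print({k: defaults[k] for k in ['n_estimators', 'max_depth', 'learning_rate']})
# {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}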
Import packages:
1 | from sklearn.metrics import precision_recall_fscore_support as score |
Split the data:
1 | X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2) |
The following code defines a function to train and evaluate a single gradient boosting model:
1 | def train_GB(est, max_depth, lr): |
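Only the def line survives above; here is a hedged sketch of what the full function might look like, assuming the 'ham' / 'spam' string labels used throughout this post and the score alias imported earlier:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score

def train_GB(est, max_depth, lr):
    # Build one gradient boosting model for this combination of settings
    gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)
    gb_model = gb.fit(X_train, y_train)
    y_pred = gb_model.predict(X_test)
    # average='binary' with pos_label='spam' reports precision/recall for the spam class
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
    print('Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        est, max_depth, lr, round(precision, 3), round(recall, 3),
        round((y_pred == y_test).sum() / len(y_pred), 3)))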
Using grid search to train the model:
1 | for n_est in [50, 100, 150]: |
Because gradient boosting can’t be parallelized, and because we’re testing so many parameter settings, this could take an hour or two to run on your exercise notebooks.
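The loop itself is also cut off above; it would look something like the following (the exact value lists are assumptions inferred from the results discussed below):

for n_est in [50, 100, 150]:
    for max_depth in [3, 7, 11, 15]:
        for lr in [0.01, 0.1, 1]:
            train_GB(n_est, max_depth, lr)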
Output:
1 | Est: 50 / Depth: 3 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.866 |
You might see some warnings, which mean that at least one of the models didn’t predict a single text message to be spam. If a model doesn’t predict any messages to be spam, then precision can’t be calculated, so the warning is just saying that precision is set to zero in those cases.
So first let’s take a look at some of the worst models. Now, keep in mind that the only thing we care about here are the actual results. We’re looking at these results to dictate which hyperparameter settings generated the best model. You can think of the hyperparameter settings as instructions to the model that dictate how that model is built. For the worst models on this list, you can see that a learning rate of 0.01 is a very common theme, and in general, the worst models also had a low number of estimators. Now let’s take a look at some of the best models. Based on the results here, all of the best models had a learning rate of 0.1. So we can draw the distinction that this instruction to the model, whether the learning rate is 0.01 or 0.1, makes a big difference in the results. For this problem, it appears that a learning rate of 0.1 is ideal. You’ll also note that the number of estimators and the max depth are on the high end of the ranges that we tested.

If this process doesn’t make a lot of sense yet, you can think of it like baking a cake. We have our different ingredients, our flour, our chocolate, and our icing, and we’re tweaking the recipe just a little bit at a time to determine which recipe will generate the best-tasting cake. So we’ll bake a bunch of different cakes, select the best-tasting ones, and from that we’ll be able to say which specific recipes generate the best cakes.
Evaluate Gradient-boosting Model Performance
We’re starting to notice that model building is an iterative process. Unfortunately, we don’t just write a couple lines of code, fit a model, and then wipe our hands of it. We start with an initial exploration, then we take what we learned and dive a little deeper to learn a little more. Then we take what we learned in that phase and dive even deeper on a more specific model, until eventually we’ve built a model that we’re confident in and that we’ve thoroughly evaluated. With that context, we’re going to enter the next phase of our build, using GridSearchCV to explore our model while allowing our prior exploration to point us in the right direction.
So just to recap very quickly: grid search means setting up different parameter settings that you want to test and then exhaustively searching that entire grid to determine the best model. Cross-validation means you take your dataset, divide it into k subsets, and then repeat the holdout test method k times, training on some of the data and evaluating on a separate subset. In each iteration you use a different subset of the data as the test set and all the rest of the data as the training set. GridSearchCV combines these ideas: it allows you to define a grid of parameters that you want to explore, and then within each hyperparameter setting, it will run cross-validation.
Read in the file, clean the data, and vectorise using both TF-IDF and Count Vectoriser:
1 | import nltk |
Output:
 | body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|---|---|
0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8106 columns
Import the packages:
1 | from sklearn.ensemble import GradientBoostingClassifier |
If you remember back to when we did GridSearchCV with random forest, we leave the classifier’s settings at their defaults and define what we want to test in a parameter grid. The parameter grid is a dictionary where the keys are your hyperparameter names and the values are lists of settings that you want to try out.
The following code applies this to the TF-IDF features:
1 | gb = GradientBoostingClassifier() |
The n_jobs=-1 means that we’ll train models on different subsets and parameter settings in parallel. It does not mean that each of the models themselves will be trained in parallel. Again, that’s because gradient boosting cannot be trained in parallel, because each iteration depends on the prior iteration.
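Only the first line of that cell appears above; a hedged sketch of the full grid search is shown below. The feature matrix X_tfidf_feat and the result name tfidf_gb_result are assumptions carried over from earlier steps, and the grid values are inferred from the six combinations in the results below:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd

gb = GradientBoostingClassifier()
param = {'n_estimators': [100, 150],
         'max_depth': [7, 11, 15]}

# n_jobs=-1 parallelizes across parameter settings and CV folds;
# each individual gradient boosting model is still built sequentially
gs = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = gs.fit(X_tfidf_feat, data['label'])
tfidf_gb_result = pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)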
Output:
 | mean_fit_time | mean_score_time | mean_test_score | param_max_depth | param_n_estimators |
---|---|---|---|---|---|
1 | 241.798 | 0.261987 | 0.970361 | 7 | 150 |
3 | 387.388 | 0.259972 | 0.969822 | 11 | 150 |
4 | 361.938 | 0.237765 | 0.969822 | 15 | 100 |
5 | 425.78 | 0.196678 | 0.968744 | 15 | 150 |
2 | 282.125 | 0.283908 | 0.968385 | 11 | 100 |
The following code does the same thing but with the count vectoriser:
1 | gb = GradientBoostingClassifier() |
Output:
 | mean_fit_time | mean_score_time | mean_test_score | param_max_depth | param_n_estimators |
---|---|---|---|---|---|
3 | 364.091 | 0.253393 | 0.969822 | 11 | 150 |
5 | 416.869 | 0.170237 | 0.969463 | 15 | 150 |
1 | 239.225 | 0.304996 | 0.968565 | 7 | 150 |
0 | 176.768 | 0.287938 | 0.968205 | 7 | 100 |
4 | 352.215 | 0.23997 | 0.967846 | 15 | 100 |
For random forest, the most time-consuming model took under a minute to fit. These models are taking roughly three minutes at a minimum, and the slowest take over 400 seconds, so you can really see the difference in how long it takes to fit. You can see that the mean_test_scores are all right around 97 percent or just below. The best models are the ones with 150 estimators, and the very best model has 150 estimators and a max_depth of 7. Now let’s look at the results for the count vectorizer really quick. The fit times and mean_test_scores are pretty much right in line with tf-idf, so it doesn’t look like there’s too much of a difference; all the test scores are right around 97 percent or just below, and the best model is the one with 150 estimators and a max_depth of 11. Okay, so now we’ve tested two different algorithms, random forest and gradient boosting, on two different vectorization methods, tf-idf and count vectorization, across a variety of hyperparameter settings. So now we have a pretty good idea of what the models look like and how they’re performing.
Model Selection: Data Prep
Now we’ve gone through pretty much the entire machine learning process. We’ve read in raw text, cleaned that text, created and transformed features in feature engineering, fit a simple model and evaluated it on a holdout test set, and tuned hyperparameters and evaluated each combination using GridSearchCV. Now we’re going to cap it all off by comparing our best performing models to select the very best model. But before we do that, one thing that has to be mentioned is that we’ve been bending the rules just a little bit in regards to our vectorizers. Vectorizers are like models: they need to be fit on a training set and then stored in order to transform the test set. When we say fit on the training set, in the context of a vectorizer, it basically just means it stores all of the words in the training set. Then when it transforms the test set, it will only create columns for the words that were in the training set. Any words that appear in the test set but not in the training set will not show up in the vectorized version of the test set; the vectorizer will only recognize words that it saw in the training set. Up to this point, we’ve been training the vectorizer on the entire data set instead of just the training set, because it makes things easier with grid search, and breaking them apart would require an introduction to scikit-learn pipelines, which is beyond the scope of this post. Now we’re going to tweak our process just a little bit as we go into final model selection. We’ll split into a training and test set, train the vectorizer on the training set, and use it to transform the test set. Then we’ll fit the very best gradient boosting model and the best random forest model on the training set and predict on the test set. And lastly, we’ll thoroughly evaluate the results of these two models to select the very best one.
Read in the data, clean it, and create the features:
1 | import nltk |
Splitting into train / test set:
1 | from sklearn.model_selection import train_test_split |
We’re going to vectorize using TF-IDF, as that generated results that were slightly ahead of the count vectorizer. It’s worth noting that the difference was very, very slight, so this is more of a judgment call than anything. Just like with our other scikit-learn models, we instantiate our vectorizer, pass our cleaning function into the analyzer parameter, and then fit it on body_text from X_train:
1 | tfidf_vect = TfidfVectorizer(analyzer=clean_text) |
In the code above, we have a stored vectorizer in tfidf_vect_fit that was fit on only our training data, so it has basically stored all the words from our training set, and it’ll keep those to be used to create the columns once we transform both our training and our test set. So again, keep in mind, all we did was use the training set to fit our vectorizer object and then store that vectorizer object. We still haven’t transformed anything. The training set still has its [‘body_text’, ‘body_len’, ‘punct%’] columns in it; it doesn’t have any vectorized columns yet. The following code creates those vectorised columns by extending the code above.
1 | # we're first fitting and then we're transforming this time |
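Only the comment line of that cell survives above; a hedged sketch of the fit-then-transform step might look like the following, with the vectorized output re-joined to the body_len and punct% features (the X_train_vect / X_test_vect names are assumptions):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])          # learn the vocabulary from the training set only

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])   # transform both sets with that stored vocabulary
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

# Re-attach the engineered features alongside the vectorised columns
X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True),
                          pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True),
                         pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()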
Output:
 | body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|---|---|
0 | 130 | 3.8 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 116 | 12.9 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 55 | 1.8 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 7199 columns
The vectorizer on the full data set generated over 8,000 features. In other words, it recognized over 8,000 unique words. Now it only contains just over 7,000. Again, that’s because this vectorizer was fit only on the training data. So what this tells us is that there are around 1,000 words in the test set that won’t be recognized by the vectorizer. So they’ll essentially be ignored. So that’s how you properly fit your vectorizer and prepare your data for your final model selection.
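If you want to see that behaviour in isolation, here’s a tiny, self-contained illustration (the two sentences are made up just for this example):

from sklearn.feature_extraction.text import TfidfVectorizer

# Words that never appeared during fit get no column at transform time
toy_vect = TfidfVectorizer()
toy_vect.fit(['free entry to win a prize'])
print(toy_vect.vocabulary_)                              # vocabulary learned from the "training" sentence only
print(toy_vect.transform(['win a brand new phone']).toarray())
# only the 'win' column is non-zero; 'brand', 'new', and 'phone' are silently ignored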
Model Selection: Results
Now we have all of our data prepared. So we have our training set with our vectorized data and our created features, and then we also have our test set that was transformed using the vectorizer trained only on the training set. Let’s jump into final model selection.
Import some packages, plus one more function: time. This will allow us to track how long it takes for our models to train and predict, because, as you’ll see in just a few minutes, that’s one factor that comes into play in final model selection. Go ahead and run this:
1 | from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier |
The reason that we care about this predict time in particular is that if you’re using your model in an environment where it needs to make very high-frequency predictions, then even if it’s taking a second, or maybe even half a second, to make a prediction, that might be too long and it’ll create a bottleneck in your process.
The following code compares the random forest classifier and the gradient boosting classifier.
Random Forest:
1 | rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1) |
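The line above is only the start of that cell; here is a hedged sketch of the fit/predict timing and scoring that produces the output below (it assumes the X_train_vect / X_test_vect matrices from the previous section and 'spam' as the positive label):

import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score

rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)       # time how long the model takes to fit
fit_time = time.time() - start

start = time.time()
y_pred = rf_model.predict(X_test_vect)         # time how long it takes to predict on the test set
pred_time = time.time() - start

precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3),
    round((y_pred == y_test).sum() / len(y_pred), 3)))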
Output:
1 | Fit time: 2.782 / Predict time: 0.168 ---- Precision: 1.0 / Recall: 0.844 / Accuracy: 0.979 |
Gradient Boosting:
1 | gb = GradientBoostingClassifier(n_estimators=150, max_depth=11) |
Output:
1 | Fit time: 202.739 / Predict time: 0.116 ---- Precision: 0.925 / Recall: 0.844 / Accuracy: 0.97 |
For the final comparison, what we want to do is we want to compare Fit time, Predict time, Precision, Recall, and Accuracy between these two models with a particular focus on Predict time, Precision, and Recall. The reason being that once these models are fit, you generally store them for the purpose of making predictions later on. They wouldn’t really ever be refit or retrained again until you decide that your current model needs to be replaced. That’s why we care more about predict time than we do fit time. And then precision and recall just give you a more in-depth look into how your model’s performing than accuracy does. That’s why we really focus on those three metrics.
Let’s take a note of the results. You’ll notice that even though GradientBoosting takes way longer than RandomForest does to fit, it actually takes less time to predict. In terms of precision and recall, our RandomForest has much better precision at 100%, but GradientBoosting has slightly better recall. Now we find ourselves in a situation where no matter which model we pick we’re making some kind of trade off. If we pick RandomForest, that means that we care more about precision than we do predict time or recall, and vice versa. This kind of trade off is very common, which brings me to a couple of very important points.
First, generally you’ll dive into the metrics much more than we are here. We wouldn’t base it only on overall precision, recall, and predict time. We’d split our test set in a variety of different ways to understand how it does across a number of different dimensions. We might say let’s look only at text messages that have a length greater than 50, see how our model does there. Or, let’s look at text messages that have zero punctuation and see how our model does there. We’d slice it in a variety of different ways to really understand where the model’s doing well, and maybe where it doesn’t do well. That would also include looking at specific text messages that the model is getting wrong.
The second point is that after a thorough training and evaluation process, you usually end up in a place where you have some kind of trade-off, like we have here between performance and predict time. And in this case, and this is very important, you make your decision based on the business problem or the business context. What that means is asking: is a longer predict time going to create a huge bottleneck in your process? In some business contexts, a model that takes a bit longer to predict might be a deal breaker, so you might have no choice but to go with the GradientBoosting model.
Additionally, most problems either have a higher cost on false positives, which means you would prioritize precision, or on false negatives, which means you’d prioritize recall. For instance, for a spam filter, you can probably deal with some spam in your inbox here and there, but you don’t want your spam filter to capture real emails, so we’d prioritize precision: when it says something is spam, it had better be spam. In this case, false positives are very costly. The second case would be something like anti-virus software. False positives, where it says you have a virus but you really don’t, can be scary without a doubt, but if you’re getting hacked and your software doesn’t catch it, that’s much, much worse. In this case, we should optimize for recall, so that if there’s a breach, the model had better be able to catch it. With all that said, assuming that predict time is not a deal breaker for your business problem, and you don’t have a super-clear answer on whether false positives or false negatives are more costly, the model that you’d probably select here is the RandomForest model. That’s because its precision is so much better than the GradientBoosting model’s, and its recall is very close.
Now that you’ve gone through the entire pipeline of reading in text, cleaning it, vectorizing it, creating features, and exploring various models to pick an optimal one, you can confidently say that given some messy text with some kind of labels, you can clean it up and build a model that returns concise predictions about that text.
Summary
We’ve now learned how to read in text data when we don’t know how many columns there are, what separates the columns, or even what format it’s in. Then we learned how to clean up that data using regular expressions and tools from various packages to remove punctuation, tokenize, remove stop words, and stem and lemmatize our data. Then we learned a few different methods to vectorize our data to get it into a format that a machine learning model can understand and train on. Then we learned how to create and transform features, and lastly, we learned how to build and evaluate a few different types of models using the text and created features. We now have the tools to go from a messy, unstructured data set to concise predictions. This is an incredibly powerful toolkit for natural language processing, but don’t stop here. We only touched on some of the highest-level topics. Use this as a road map to continue learning and building your own powerful models.