6.6. Natural Language Processing

This section covers some tools for processing and working with text.

6.6.1. TextBlob: Processing Text in One Line of Code

Processing text doesn’t need to be hard. If you want to find the sentiment of a text, tokenize it, extract noun phrases, compute word frequencies, correct spelling, etc., in one line of code, try TextBlob.

!pip install textblob
!python -m textblob.download_corpora
[nltk_data] Downloading package brown to /home/khuyen/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /home/khuyen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/khuyen/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/khuyen/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /home/khuyen/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/khuyen/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
from textblob import TextBlob

text = "Today is a beautiful day"
blob = TextBlob(text)

blob.words # Word tokenization
WordList(['Today', 'is', 'a', 'beautiful', 'day'])
blob.noun_phrases # Noun phrase extraction
WordList(['beautiful day'])
blob.sentiment # Sentiment analysis
Sentiment(polarity=0.85, subjectivity=1.0)
blob.word_counts # Word counts
defaultdict(int, {'today': 1, 'is': 1, 'a': 1, 'beautiful': 1, 'day': 1})
# Spelling correction
text = "Today is a beutiful day"
blob = TextBlob(text)
blob.correct()
TextBlob("Today is a beautiful day")

Link to TextBlob.

Link to my article about TextBlob.

6.6.2. Convert Names into a Generalized Format

!pip install mlxtend

Names collected from different sources might have different formats. To convert names into the same format for further processing, use mlxtend’s generalize_names.

from mlxtend.text import generalize_names

generalize_names("Tran, Khuyen")
'tran k'
generalize_names("Khuyen Tran")
'tran k'
generalize_names("Khuyen Tran", firstname_output_letters=2)
'tran kh'
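
Since generalize_names works on one string at a time, you can map it over a pandas column to normalize a whole dataset. A minimal sketch (the names column is hypothetical):

import pandas as pd
from mlxtend.text import generalize_names

df = pd.DataFrame({"names": ["Tran, Khuyen", "Khuyen Tran"]})
df["names"].apply(generalize_names) # both rows become 'tran k'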

Link to mlxtend.

6.6.3. sumy: Summarize Text in One Line of Code

!pip install sumy

If you want to summarize text using Python or command line, try sumy.

What makes sumy stand out from other summarization tools is that it is easy to use and offers 7 different summarization methods.

Below is how sumy summarizes the article How to Learn Data Science (Step-By-Step) in 2020 on Dataquest.

$ sumy lex-rank --length=10 --url=https://www.dataquest.io/blog/learn-data-science/ 
So how do you start to learn data science?
If I had started learning data science this way, I never would have kept going.
I learn when I’m motivated, and when I know why I’m learning something.
There’s some science behind this, too.
If you want to learn data science or just pick up some data science skills, your first goal should be to learn to love data.
But it’s important to find that thing that makes you want to learn.
By working on projects, you gain skills that are immediately applicable and useful, because real-world data scientists have to see data science projects through from start to finish, and most of that work is in fundamentals like cleaning and managing the data.
And so on, until the algorithm worked well.
Find people to work with at meetups.
For more information on these, you can take a look at our Data Scientist learning path , which is designed to teach all of the important data science skills for Python learners.
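
sumy can also be called from Python. Below is a minimal sketch of the same LexRank method applied to a plain string, based on sumy's parser and summarizer interface:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = (
    "Sumy is a library for automatic text summarization. "
    "It supports several summarization methods. "
    "You can use it from the command line or from Python."
)
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, sentences_count=2):
    print(sentence)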

Link to Sumy.

6.6.4. Spacy_streamlit: Create a Web App to Visualize Your Text in 3 Lines of Code

!pip install spacy-streamlit

If you want to quickly create an app to visualize the structure of a text, try spacy_streamlit.

To understand how to use spacy_streamlit, we add the code below to a file called streamlit_app.py:

# streamlit_app.py
import spacy_streamlit 

models = ['en_core_web_sm']
text = "Today is a beautiful day"
spacy_streamlit.visualize(models, text)

Download the spaCy model the app uses, then run the app from your terminal:

$ python -m spacy download en_core_web_sm
$ streamlit run streamlit_app.py

Output:

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://192.168.1.90:8501


Click the URL and you should see something like below:

[Image: spacy_streamlit app visualizing the structure of the text]

Link to spacy-streamlit.

6.6.5. textacy: Extract a Contiguous Sequence of 2 Words

!pip install spacy textacy
!python -m spacy download en_core_web_sm

If you want to extract a contiguous sequence of 2 words, for example ‘data science’ rather than just ‘data’, what should you do? That is when extracting n-grams from text becomes useful.

A really useful tool for extracting n-grams with a specified number of words in the sequence is textacy.

import pandas as pd 
import spacy 
from textacy.extract import ngrams

nlp = spacy.load('en_core_web_sm')

text = nlp('Data science is an inter-disciplinary field that uses'
' scientific methods, processes, algorithms, and systems to extract'
' knowledge and insights from many structured and unstructured data.')

n_grams = 2 # number of words in each contiguous sequence
min_freq = 1 # only keep n-grams that appear at least this many times

pd.Series([n.text for n in ngrams(text, n=n_grams, min_freq=min_freq)]).value_counts()
Data science          1
disciplinary field    1
uses scientific       1
scientific methods    1
extract knowledge     1
unstructured data     1
dtype: int64
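
To pull longer sequences, just change n. A quick sketch extracting trigrams from the same document:

pd.Series([n.text for n in ngrams(text, n=3, min_freq=1)]).value_counts()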

Link to textacy.

6.6.6. Convert Number to Words

!pip install num2words

If a text contains both the number 105 and the words ‘one hundred and five’, they deliver the same meaning. How can we map 105 to ‘one hundred and five’? There is a Python library that converts numbers to words called num2words.

from num2words import num2words

num2words(105)
'one hundred and five'
num2words(105, to='ordinal')
'one hundred and fifth'

The library also supports multiple languages!

num2words(105, lang='vi')
'một trăm lẻ năm'
num2words(105, lang='es')
'ciento cinco'
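
num2words also offers other converters through the to argument, such as 'year' and 'currency' (the exact output wording can vary between versions):

num2words(2023, to='year') # e.g. 'twenty twenty-three'
num2words(10.5, to='currency', lang='en') # e.g. 'ten euro, fifty cents'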

Link to num2words.

6.6.7. texthero.clean: Preprocess Text in One Line of Code

!pip install texthero

If you want to preprocess text in one line of code, try texthero. The texthero.clean method will:

  • fill missing values

  • convert upper case to lower case

  • remove digits

  • remove punctuation

  • remove stopwords

  • remove whitespace

The code below shows an example of texthero.clean.

import numpy as np
import pandas as pd
import texthero as hero

df = pd.DataFrame(
    {
        "text": [
            "Today is a    beautiful day",
            "There are 3 ducks in this pond",
            "This is. very cool.",
            np.nan,
        ]
    }
)

df.text.pipe(hero.clean)
0    today beautiful day
1             ducks pond
2                   cool
3                       
Name: text, dtype: object
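
If you only want some of these steps, you can pass a custom pipeline of preprocessing functions to texthero.clean. A minimal sketch using functions from texthero's preprocessing module:

from texthero import preprocessing

custom_pipeline = [
    preprocessing.fillna,
    preprocessing.lowercase,
    preprocessing.remove_whitespace,
]
df.text.pipe(hero.clean, pipeline=custom_pipeline)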

Texthero also provides other useful methods to process and visualize text.

Link to texthero.

6.6.8. texthero: Reduce Dimension and Visualize Text in One Line of Code

!pip install texthero gdown

If you want to visualize the text column in your pandas DataFrame in 2D, you first need to clean, encode, and reduce the dimension of your text, which could be time-consuming.

Wouldn’t it be nice if you could do all of the steps above in 2 lines of code? texthero allows you to do exactly that.

In the code below, I use texthero to visualize the descriptions of CNN news downloaded from Kaggle. Each point is an article and is colored by its category.

import pandas as pd
import texthero as hero
import gdown 

gdown.download('https://drive.google.com/uc?id=1QPGCZ8mud5ptt8qJR79XQ6KoQnJuT-4D')
df = pd.read_csv("small_CNN.csv")
df["pca"] = df["Description"].pipe(hero.clean).pipe(hero.tfidf).pipe(hero.pca)
hero.scatterplot(df, col="pca", color="Category", title="CNN News")

Link to texthero.

6.6.9. wordfreq: Estimate the Frequency of a Word in 36 Languages

!pip install wordfreq

If you want to look up the frequency of a certain word in your language, try wordfreq.

wordfreq supports 36 languages and even covers rare words that appear as seldom as once per 10 million words.

import matplotlib.pyplot as plt
import seaborn as sns
from wordfreq import word_frequency

word_frequency("eat", "en")
0.000135
word_frequency("the", "en")
0.0537
sentence = "There is a dog running in a park"
words = sentence.split(" ")
word_frequencies = [word_frequency(word, "en") for word in words]

sns.barplot(x=words, y=word_frequencies)
plt.show()
../_images/natural_language_processing_69_1.png
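
wordfreq also provides zipf_frequency, which reports the same data on a human-friendly logarithmic scale:

from wordfreq import zipf_frequency

zipf_frequency("the", "en") # very common words score around 7
zipf_frequency("eat", "en") # less common words score lower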

Link to wordfreq.

6.6.10. newspaper3k: Extract Meaningful Information From an Article in 2 Lines of Code

!pip install newspaper3k nltk

If you want to quickly extract meaningful information from an article in a few lines of code, try newspaper3k.

from newspaper import Article
import nltk

nltk.download("punkt")
[nltk_data] Downloading package punkt to /home/khuyen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
url = "https://www.dataquest.io/blog/learn-data-science/"
article = Article(url)
article.download()
article.parse()
article.title
'How to Learn Data Science (A step-by-step guide)'
article.publish_date
datetime.datetime(2020, 5, 4, 7, 1, tzinfo=tzutc())
article.top_image
'https://www.dataquest.io/wp-content/uploads/2020/05/learn-data-science.jpg'
article.nlp()
article.summary
'How to Learn Data ScienceSo how do you start to learn data science?\nIf you want to learn data science or just pick up some data science skills, your first goal should be to learn to love data.\nRather, consider it as a rough set of guidelines to follow as you learn data science on your own path.\nI personally believe that anyone can learn data science if they approach it with the right frame of mind.\nI’m also the founder of Dataquest, a site that helps you learn data science in your browser.'
article.keywords
['scientists',
 'guide',
 'learning',
 'youre',
 'science',
 'work',
 'skills',
 'youll',
 'data',
 'learn',
 'stepbystep',
 'need']
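
The parsed Article also exposes other useful attributes, such as the detected authors and the full article text:

article.authors # list of detected author names
article.text[:100] # first 100 characters of the article body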

Link to newspaper3k.

6.6.11. Questgen.ai: Question Generator in Python

!pip install git+https://github.com/ramsrigouthamg/Questgen.ai
!pip install git+https://github.com/boudinfl/pke.git

!python -m nltk.downloader universal_tagset
!python -m spacy download en 
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
!tar -xvf  s2v_reddit_2015_md.tar.gz

It can be time-consuming to generate questions for a document. Wouldn’t it be nice if you could generate them automatically using Python? That is when Questgen.ai comes in handy.

With a few lines of code, the questions for your document are automatically generated.

from pprint import pprint
import nltk
nltk.download('stopwords')
from Questgen import main
payload = {
    "input_text": """The weather today was nice so I went for a walk. I stopped for a quick chat with my neighbor.
    It turned out that my neighbor just got a dog named Pepper. It is a black Labrador Retriever."""
}

With Questgen.ai, you can either generate boolean (yes/no) questions:

qe = main.BoolQGen()
output = qe.predict_boolq(payload)
pprint(output)
{'Boolean Questions': ['Is there a dog in my neighborhood?',
                       "Is pepper my neighbor's dog?",
                       'Is pepper the same as a labrador?'],
 'Count': 4,
 'Text': 'The weather today was nice so I went for a walk. I stopped for a '
         'quick chat with my neighbor.\n'
         '    It turned out that my neighbor just got a dog named Pepper. It '
         'is a black Labrador Retriever.'}

… or generate FAQ questions:

qg = main.QGen()
output = qg.predict_shortq(payload)
pprint(output)
Running model for generation
{'questions': [{'Answer': 'chat',
                'Question': 'What was the purpose of the stop?',
                'context': 'I stopped for a quick chat with my neighbor.',
                'id': 1},
               {'Answer': 'neighbor',
                'Question': 'Who got a dog named Pepper?',
                'context': 'It turned out that my neighbor just got a dog '
                           'named Pepper. I stopped for a quick chat with my '
                           'neighbor.',
                'id': 2}],
 'statement': 'The weather today was nice so I went for a walk. I stopped for '
              'a quick chat with my neighbor. It turned out that my neighbor '
              'just got a dog named Pepper. It is a black Labrador Retriever.'}
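
Questgen can also generate multiple-choice questions with the same QGen object. A sketch, assuming the predict_mcq method described in the project's README:

mcq_output = qg.predict_mcq(payload)
pprint(mcq_output)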

Link to Questgen.ai.

6.6.12. Word Ninja: Slice Your Lumped-Together Words

!pip install wordninja 

If you want to slice your lumped-together words, use Word Ninja. You will be surprised how well it works.

Below are some examples.

import wordninja 

wordninja.split("honeyinthejar")
['honey', 'in', 'the', 'jar']
wordninja.split("ihavetwoapples")
['i', 'have', 'two', 'apples']
wordninja.split("aratherblusterday")
['a', 'rather', 'bluster', 'day']
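
Word Ninja's default vocabulary is English, but the README also documents a LanguageModel class that accepts your own gzipped word list, one word per line, sorted by frequency (my_lang.words.gz is a hypothetical file):

lm = wordninja.LanguageModel("my_lang.words.gz")
lm.split("customvocabularytext")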

Link to wordninja.

6.6.13. textstat: Calculate Statistics From Text

!pip install textstat

If you want to compute some important statistics from text, such as a readability score or reading time, use textstat.

To get the readability score, use automated_readability_index. The ARI (Automated Readability Index) approximates the grade level needed to comprehend the text. If the ARI is 10.8, the text requires a 10th to 11th grade reading level.

import textstat 
text = "The working memory system is a form of conscious learning. But not all learning is conscious. Psychologists have long marveled at children’s ability to acquire perfect pronunciation in their first language or recognize faces."
textstat.automated_readability_index(text)
10.8

To measure the reading time in seconds, use reading_time. The reading time of the text above is 2.82s.

textstat.reading_time(text, ms_per_char=14.69)
2.82
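
textstat ships with many more statistics; two more examples from its API:

textstat.flesch_reading_ease(text) # higher scores mean easier-to-read text
textstat.lexicon_count(text) # number of words in the text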

Link to textstat.

6.6.14. RapidFuzz: Rapid String Matching in Python

!pip install rapidfuzz

If you want to find strings that are similar to another string above a certain threshold, use RapidFuzz. RapidFuzz is a Python library that allows you to quickly match strings.

from rapidfuzz import fuzz

fuzz.ratio calculates the normalized Indel similarity between two strings:

fuzz.ratio("Let's meet at 10 am tomorrow", "Let's meet at 10 am tommorrow")
98.24561403508771
fuzz.ratio("here you go", "you go here")
54.54545454545454

fuzz.token_sort_ratio sorts the words in both strings before calculating the ratio between them:

fuzz.token_sort_ratio("here you go", "you go here")
100.0
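
To search a whole list of candidates and keep only matches above a threshold, use the process module's extract function with a score_cutoff:

from rapidfuzz import process

choices = ["Let's meet at 10 am tomorrow", "here you go", "something else entirely"]
process.extract("lets meet at 10 am tomorow", choices, scorer=fuzz.ratio, score_cutoff=80)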

Link to RapidFuzz.

6.6.15. Checklist: Create Data to Test Your NLP Model

!pip install checklist torch

It can be time-consuming to create data that tests the edge cases of your NLP model. If you want to generate such test data quickly, use Checklist.

In the code below, I use Checklist’s Editor to create multiple examples of negation in one line of code.

import checklist
from checklist.editor import Editor

editor = Editor()
editor.template("{mask} is not {a:pos} option.", pos=["good", "cool"], nsamples=5).data
['that is not a good option.',
 'War is not a cool option.',
 'Windows is not a good option.',
 'Facebook is not a cool option.',
 'Sleep is not a cool option.']
editor.template("{mask} is not {a:neg} option.", neg=["bad", "awful"], nsamples=5).data
['There is not a bad option.',
 'Closure is not an awful option.',
 'TPP is not a bad option.',
 'Security is not an awful option.',
 'Change is not an awful option.']
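
Editor also ships with built-in lexicons such as first names and countries that you can reference directly in templates (the {first_name} placeholder follows Checklist's documentation):

editor.template("{first_name} is not {a:pos} option.", pos=["good"], nsamples=3).data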

Link to Checklist.

6.6.16. Top2Vec: Quick Topic Modeling in Python

!pip install top2vec

If you want to quickly detect topics present in your text and generate jointly embedded topic, document, and word vectors, use Top2Vec.

In the code below, I use Top2Vec to quickly find topics and create a wordcloud of words in the first 3 topics.

from top2vec import Top2Vec
from sklearn.datasets import fetch_openml
news = fetch_openml("Fake-News")
text = news.data["text"].to_list()
model = Top2Vec(documents=text, speed="learn", workers=8)
2022-05-25 08:35:13,293 - top2vec - INFO - Pre-processing documents for training
2022-05-25 08:35:22,285 - top2vec - INFO - Creating joint document/word embedding
2022-05-25 08:53:03,023 - top2vec - INFO - Creating lower dimension embedding of documents
2022-05-25 08:53:23,522 - top2vec - INFO - Finding dense areas of documents
2022-05-25 08:53:23,656 - top2vec - INFO - Finding topics
model.get_num_topics()
82
topic_words, word_scores, topic_nums = model.get_topics(3)

Returns:

  • topic_words: For each topic, the top 50 words are returned, ordered by semantic similarity to the topic.

  • word_scores: For each topic, the cosine similarity scores between the top 50 words and the topic are returned.

  • topic_nums: The unique index of every topic is returned.

for topic in topic_nums:
    model.generate_topic_wordcloud(topic)
../_images/natural_language_processing_137_0.png ../_images/natural_language_processing_137_1.png ../_images/natural_language_processing_137_2.png
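
Once the model is trained, you can also retrieve the documents most representative of a topic with search_documents_by_topic:

documents, document_scores, document_ids = model.search_documents_by_topic(
    topic_num=0, num_docs=3
)
for doc, score in zip(documents, document_scores):
    print(f"{score:.3f}: {doc[:100]}")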

Link to Top2Vec.