Natural Language Processing: L02 words
Words and sentences are the basic units of text. In this lecture we discuss basic operations on words and sentences, such as tokenization, text normalization, tf-idf, cosine similarity, vector space models and word representation.
Published on: Mar 3, 2016
Transcripts - Natural Language Processing: L02 words
Natural Language Processing
Unit 1 – Words
Anantharaman Narayana Iyer
narayana dot anantharaman at gmail dot com
7th and 8th Aug 2015
Tools pertaining to this lecture
Words and Sentences
• Text documents are constituted by words and sentences
• The words and their composition provide us the meaning
• Each class of documents (for example: technology, sports, politics, films,
etc.) has some common words and also some distinguishing words. For
example, the term LBW is more likely to occur in a document pertaining to
cricket. Likewise, terms like comedy, hero, climax, etc. occur often in documents about films
• By analysing the occurrence and frequency of words we can predict which
document the words came from.
• Notion of: term frequency (tf), inverse document frequency (idf) and tf-idf
Some basic concepts to get started
• We will now introduce a few concepts most of which will be covered
in detail in later classes
• Word and Sentence segmentation
• Pre-processing the text and normalization: stop word removal, stemming, etc.
• Term Frequency (tf)
• Inverse document Frequency (idf)
• Bag of words (BOW)
• Vector Space models
• Cosine Similarity
• Language Models
Vocabulary – ref: Dan Jurafsky and Christopher Manning
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Corpus                               Tokens = N      Types = |V|
Switchboard phone conversations      2.4 million     20 thousand
Shakespeare                          884,000         31 thousand
Google N-grams                       1 trillion      13 million
Church and Gale (1990): |V| > O(N½)
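As a quick illustration of tokens versus types, here is a minimal Python sketch using naive whitespace tokenization (the sample sentence is arbitrary):

text = "to be or not to be that is the question"
tokens = text.lower().split()     # every occurrence counts as a token
types = set(tokens)               # the distinct words form the vocabulary V
print("N =", len(tokens), "|V| =", len(types))   # N = 10, |V| = 8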
Issues in tokenization - Dan Jurafsky and Christopher Manning
• Finland’s capital → Finland? Finlands? Finland’s?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case, lowercase, lower case?
• San Francisco → one token or two?
• m.p.h., Ph.D. → ?? (a tokenizer example follows below)
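One way to see how a standard tokenizer resolves some of these cases is a minimal sketch with NLTK (assuming NLTK and its 'punkt' models are installed; the output shown is that of NLTK's default Treebank-style tokenizer):

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("what're, I'm, isn't")
['what', "'re", ',', 'I', "'m", ',', 'is', "n't"]
>>> word_tokenize("San Francisco is state-of-the-art")
['San', 'Francisco', 'is', 'state-of-the-art']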
A motivating example: Pre-processing and Text
Normalization challenges from tweet data
• Consider the following tweets:
• GOD's been very kind as HE didn't create FRIENDS with price tags.If He did, I
really couldn't afford U #HappyFriendshipDay @KimdeLeon_
• Why are wishing friendship day today? Aren't we sppsd to be friends for all
the 365 days?#HappyFriendshipDay
• eVrYthin ChnGZ & nthin StaYs D SMe bt aS v grOw uP 1 thnG vl reMain i ws
wT u b4& vL b tiLl D end #HappyFriendshipDay pic.twitter.com/Ym3ZAnFiFn
• How do we tokenize and normalize the above text? (A small normalization sketch follows at the end of this list.)
• Often the answer to the above is application dependent
• How do we identify word boundaries and tokenize the words?
• Don’t, would’ve, b4 (if we have a rule/logic where we consider valid English words to consist only of the letters
[a-zA-Z], then the word boundary would break at the digit 4)
• Normalizing lowercase, uppercase
• Some news headlines may be capitalized: “PM Modi Announces US $1 Billion Concessional Line of Credit to Nepal”
• Some tweets may be in all capital letters
• Normalizing the same word expressed with differing syntax
• This is needed for applications like information retrieval so that the query string can be more appropriately matched
with the content being searched, e.g., U.K. and UK may be resolved to one term, similarly Mr and Mr., Ph.D. and PhD, etc.
• How do we handle words that are distorted versions of a valid English word?
• eVrYthin, ChnGZ, D, etc. – if we have a vocabulary of a given corpus that has the terms everything, changes, the,
would we count the distorted forms the same as the valid terms or would we consider them different?
• “I am toooooooo happpppppppppy” – If we correct these into “I am too happy” we may lose the implied and
hidden emotion of the strength of happiness expressed in the phrase.
• How do we handle new words not part of the language?
• Tweeple, selfie, bestie etc – consider a spell checker application that we would like to build. Would we
correct these? Are these candidate corrections of someone’s misspellings, e.g., Selfy instead of selfie?
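A minimal normalization sketch for such tweets is shown below; the function name and the specific rules (lower-casing, dropping URLs, stripping # and @, collapsing repeated letters) are illustrative assumptions rather than a recommended pipeline:

import re

def normalize_tweet(text):
    text = text.lower()                                          # case folding
    text = re.sub(r'http\S+|pic\.twitter\.com/\S+', '', text)    # drop URLs
    text = re.sub(r'[@#](\w+)', r'\1', text)                     # strip @ and # but keep the word
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)                   # "toooooo" -> "too"; note this loses the emphasis
    return text.split()

print(normalize_tweet("I am toooooooo happpppppppppy #HappyFriendshipDay"))
# ['i', 'am', 'too', 'happy', 'happyfriendshipday']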
• Sentence segmentation is concerned with splitting the given input text into sentences
• Characters like !, ?, . may indicate sentence boundaries
• However, there could be ambiguity, for instance:
• The quarterly results of Yahoo! showed a promising trend
• Microsoft announced updates to .NET framework
• Indian rupee appreciated by 0.5% against USD during today’s trade
• Who moved my cheese? is a great read.
• The ambiguity may also arise due to spelling mistakes
• Possible ways to resolve:
• Traditional rule-based regular expressions (a naive sketch follows below)
• Decision trees encoding some rules
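A minimal rule-based sketch of sentence splitting; the single regex rule is an illustrative assumption. It happens to handle the Yahoo! and .NET examples above, but it would wrongly split after abbreviations such as "Mr." when followed by a capitalised name:

import re

def split_sentences(text):
    # split after . ! or ? when followed by whitespace and an uppercase letter
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

print(split_sentences("The quarterly results of Yahoo! showed a promising trend. "
                      "Microsoft announced updates to .NET framework."))
# ['The quarterly results of Yahoo! showed a promising trend.',
#  'Microsoft announced updates to .NET framework.']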
Challenges with social data
Some key challenges are:
• How do we tokenize words like: ireally, l0ve, #unever etc.
• How to handle tagging problems like: POS tagging, NE tagging and so on
• More broadly: How to address the traditional NLP core tasks and applications?
Tokenization - Tools
Term frequency, inverse document frequency and tf-idf
• Term frequency denotes the count of occurrences of a given term in the given document:
• Term frequency of a term t in a document d: tf(t, d)
• Term Frequency signifies the importance of the given term t in a given document d
• Suppose we have N documents and we count the occurrences of a term t across all the
N documents. The Inverse Document Frequency (idf) is the log of the ratio of the total number of documents to the
number of documents that contain the term t.
• IDF of a term t across the corpus D of N documents: idf(t, D) = log(N / n_t), where n_t is the number of documents in D that contain the term t
• IDF indicates whether the given term t is a common term that may be found in a large number of documents in the corpus
of N documents or it is a rare term found only in a small subset of documents in the corpus
• Thus, IDF is a measure of how much information the given term provides. A common term provides very little information,
while terms that are unique to a given document help distinguish the documents.
• Tfidf is the product of tf and idf
• tfidf(t, d, D) = tf(t, d) * idf(t, D)
• Tfidf assigns a weight to term t in a document d, where d belongs to a corpus of N documents
• Terms with relatively high tfidf help discriminate one document from the other
• Given a query q and a document d, we can compute a score by summing the tf-idf values of the query terms: score(q, d) = Σ (over t in q) tfidf(t, d, D)
Algorithm: Computing document ranks with respect to a query (a Python sketch follows the steps below)
- Input: query string; corpus of N documents
- Output: rank list [(di, si), (dj, sj), …] # di is the top ranked document, dj has rank 2, etc.
1. Word tokenize the query to a set of words: q1, q2…, qk
2. Set ranks to an empty list # ranks = [ ] : output will be produced here
3. Compute for each document di in the document corpus D:
1. Initialize score of di = 0
2. Compute for each query term t in q:
1. n = Tf(t, di) # compute term frequency
2. m = idf(t, D) # compute IDF for t across documents; NOTE: apply add-1 smoothing if needed
3. tfidf = n * m
4. Add tfidf to score of di
3. Add the tuple (index of di, score for di) to ranks # (di, si)
4. Sort the ranks list by score in descending order
5. Return ranks
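A Python sketch of the algorithm above (the helper names are ours; tokenization is a naive lower-cased whitespace split, tf is a raw count and no smoothing is applied):

import math

def tf(term, doc_tokens):
    # term frequency: raw count of the term in the document
    return doc_tokens.count(term)

def idf(term, docs_tokens):
    # log10(N / number of documents containing the term); 0 if the term occurs nowhere
    n_containing = sum(1 for d in docs_tokens if term in d)
    return math.log10(len(docs_tokens) / n_containing) if n_containing else 0.0

def compute_ranks(query, documents):
    docs_tokens = [d.lower().split() for d in documents]
    q_terms = query.lower().split()
    ranks = []
    for i, d in enumerate(docs_tokens):
        score = sum(tf(t, d) * idf(t, docs_tokens) for t in q_terms)
        ranks.append((i, score))
    return sorted(ranks, key=lambda x: x[1], reverse=True)   # highest score first

Raw counts are used for tf here; as noted in the illustration that follows, some applications normalize tf by document length.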
Rank ordering the 2 documents
• Consider the two documents shown in the next slide
• Let the query string be: q = “Python Web Server”
• The above query is made up of 3 words: (Python, Web, Server)
• We are required to rank the documents in terms of their match for this query
• compute_ranks(q, documents): returns the ranks of the documents
Illustration of tfidf computation
Document 1: From Python official web site
Document 2: From Wikipedia
Tfidf computation and rank ordering
Query term   tf(t, d1)   tf(t, d2)   tfidf(t, d1, D)     tfidf(t, d2, D)
Python       4           0           4 * log 2 = 1.2     0
Web          9           4           0                   0
Server       2           3           0                   0
- As this is just an illustration, the tf values above are just counts and not normalized. Since the document sizes
can vary, we may need to normalize in some applications
- This illustration uses log with base 10
- We have not used a smoothing approach. If a query term is not found in a document, it is excluded from
the tfidf score computation.
- Tfidf = 0 for the terms web and server because both documents contain these terms and hence idf = log 1,
which is zero. Hence tf * idf = 0
As per our algorithm, score(q, d1) = 1.2 and score(q, d2) = 0
Hence the output is: ranks = [(d1, 1.2), (d2, 0)]
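The numbers in the table can be verified directly with a base-10 log and N = 2 documents:

>>> import math
>>> round(4 * math.log10(2 / 1), 2)   # 'Python': tf = 4 in d1, present in 1 of 2 documents
1.2
>>> round(9 * math.log10(2 / 2), 2)   # 'Web': present in both documents, so idf = log10(1) = 0
0.0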
Text Normalization – Wiki Definition
• The goal of text normalization is to transform the text into a form that makes it suitable for further processing.
• The kind of processing that we might do on the normalized text depends on the goals of the application.
• Hence there may not be one single best way to normalize the text.
• Natural language text generated in social media adds to the challenges of NLP. Social media
texts may contain:
• Misspelt words
• Peter Norvig reports an accuracy of 98.9% for his spell correction program that computes edit distance of 2 on a
standard text corpus (http://norvig.com/spell-correct.html). But some of the tweets in social media may have
words that cannot be corrected with an edit distance of 2, e.g., “vL” corresponds to “we will” in some tweets!
• New words and terminologies
• URLs, emoticons, hashtags, @names etc as part of the text
• Named entities that contain non-alphabetic symbols as part of the name: e.g., Yahoo!
• Sentence and word boundaries not well marked
• For example, tokenizing words for “vL b” yields 2 word tokens while “we will be” yields 3 tokens
Text tokenization and normalization: Motivating example
Consider 2 problems:
• Suppose we need to pick the tweets from the list below that have distinct information (as against the same
information expressed in different forms); which ones would we pick?
• Suppose we need to draw a tag cloud of terms from the tweets, which words are to be shown?
• Key questions:
• What is the contribution of twitter ids, hash tags, URLs etc?
• Which words contribute most? Least?
• What kind of text processing may bring best results?
Sample Tweets (time stamp: 8 Aug 2014, 5:20 pm IST):
1. RT @IndianCricNews: Pujara was dismissed for a duck for the first time in his Test career. 1st duck in FC cricket since November 2008
2. 4th Test Match' England Vs India' Eng 160-5 over 48 Root 9* M Ali 10* Lead by 8 runs' *GEO PAKISTAN*
3. RT @CricketAus: If you're playing at home, #EngvInd day two is starting. Can India fight back? Follow LIVE http://t.co/4FFMDOcJBV
4. Virat Kohli going through worst patch of his career - Zee News http://t.co/MHUll6zi6a
5. India vs England Live Score: 4th Test, Day 2 - IBNLive - IBNLiveIndia vs England Live Score: 4th Test, Day 2IBNLiv... http://t.co/ZIUTD5Y94j
6. England v India: Fourth Test, Old Trafford, day twoLive - BBC Sport http://t.co/OHOEiRj9QX
7. RT @cricbuzz: .@ECB_cricket move into the lead now! Follow here: http://t.co/GdlOLvNgmb #EngvInd
• Text normalization is required for most NLP applications
• Consider the problem of generating the index of a corpus of text, where the index
contains the word and the locations in the text where the word is found.
• This involves:
• Splitting the text into sentences
• Word tokenization that yields (word, location of the word)
• Word types and tokens
• Similar to the token types and token values that we encounter during the lexical
analysis of programming languages
Lemmatisation and Stemming (ref: Wikipedia definition)
• Lemmatisation is the process of grouping
together the different inflected forms of a
word so they can be analysed as a single item
• Lemmatization is sensitive to POS: E.g: Meeting
(noun), meeting (verb)
• Stemming: In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.
Note: We will revisit this topic and discuss
Porter’s algorithm for stemming in a later class
>>> from nltk.stem import WordNetLemmatizer   # requires the WordNet corpus (nltk.download('wordnet'))
>>> lemmatizer = WordNetLemmatizer()
>>> pos = {'noun': 'n', 'verb': 'v'}          # map readable names to WordNet POS tags
>>> print(lemmatizer.lemmatize('running', pos['noun']))
>>> print(lemmatizer.lemmatize('running', pos['verb']))
>>> print(lemmatizer.lemmatize('holding', pos['verb']))
Stemming and Lemmatization Differences
• Both lemmatization and stemming attempt to bring a canonical form for a set of related word forms
• Lemmatization takes the part of speech into consideration. For example, the term
‘meeting’ may either be returned as ‘meeting’ or as ‘meet’ depending on the part of speech
• Lemmatization often uses a tagged vocabulary (such as Wordnet) and can perform more
sophisticated normalization. E.g. transforming mice to mouse or foci to focus.
• Stemming implementations, such as the Porter stemmer, use heuristics that truncate
or transform the end letters of words with the goal of producing a normalized form.
Since this is purely algorithm based, no vocabulary is required.
• Some stemming implementations may combine a vocabulary along with the algorithm.
Such an approach may, for example, convert ‘cars’ to ‘automobile’ or even map ‘Honda City’ and
‘Mercedes Benz’ to the common word ‘automobile’
• A stem produced by typical stemmers may not be a word that is part of the language
vocabulary, whereas a lemmatizer transforms the given word forms to a valid lemma (a short comparison sketch follows below)
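A small comparison sketch with NLTK (assuming NLTK and the WordNet data are installed); the comments show the typical outputs of the Porter stemmer and the WordNet lemmatizer:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['meeting', 'mice', 'running']:
    print(word,
          stemmer.stem(word),                      # meeting -> meet, mice -> mice, running -> run
          lemmatizer.lemmatize(word, pos='n'),     # meeting -> meeting, mice -> mouse, running -> running
          lemmatizer.lemmatize(word, pos='v'))     # meeting -> meet, mice -> mice, running -> run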
Vector Space Representation of Sentences
Bag of words and Vector Space Model
• Bag of words is a representation of a document as an
unordered collection of words together with their respective frequencies
• This approach of modelling a document by the list of
words it contains, without regard to the order, is
adequate for several applications (such as those that
use Naïve Bayes Classifier or query matching) but is a
severe limitation for certain other applications.
• Let us model the two statements in this example (detailed in the word-count tables below) as
bags of words
Vocabulary size: |V| = |V1 U V2|
We have 29 unique words – hence 29 dimensional vector space
a = (1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
b = (2,1,2,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1)
Word      Count in a   Count in b
Broad 1 2
to 1 1
Rahane 2 2
no 1 1
run 1 1
touch 1 0
Short 1 0
Outside 1 0
Off 1 0
And 1 0
fiddles 1 0
At 1 1
It 1 0
Stuck 1 0
In 1 0
The 1 0
crease 1 0
Word      Count in a   Count in b
snorting 0 1
bouncer 0 1
from 0 1
flying 0 1
through 0 1
throat 0 1
height 0 1
limbos 0 1
for 0 1
his 0 1
life 0 1
• We can map each document in a corpus to an n-dimensional vector, where n is the size of the vocabulary.
• Here, we represent each unique word as a dimension and the magnitude along this dimension is the count of that word in the document.
• Given such vectors a, b, …, we can
compute the vector dot product and
cosine of the angle between them.
• The angle is a measure of alignment
between 2 vectors and hence similarity.
• An example of its use in information
retrieval: vectorize both the query
string and the documents and compute
similarity(q, di) for each document di in the corpus.
Cosine Similarity – handwritten notes
If the 2 vectors are orthogonal, the dot product
is 0 and hence they are dissimilar.
If the 2 vectors are fully aligned (theta = 0),
cosine is 1 and hence the documents are similar
(not necessarily identical)
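A minimal sketch of this computation, cos θ = (a · b) / (|a| |b|), over bag-of-words count vectors (the function name and the whitespace tokenization are our own simplifications):

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)                       # only words common to both contribute
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("no run to Rahane", "snorting bouncer from Broad"))   # 0.0 (no common words, orthogonal)
print(cosine_similarity("no run to Rahane", "no run to Rahane"))              # 1.0 (fully aligned)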