This presentation covers the basics of probability theory and the Naive Bayes classifier, with practical examples. It also introduces graph models for representing joint probability distributions.

Published on: **Mar 3, 2016**

Published in: Software

- 1. Mathematical Background For Natural Language Processing. Anantharaman Narayana Iyer (Narayana dot Anantharaman at gmail dot com), 14 Aug 2015
- 2. Why Probability for NLP? • Consider the following example scenarios: • You are travelling in an autorickshaw on a busy road in Bangalore and are on a call with your friend. • We are watching a Hollywood film in English. Often we may not understand every word that is spoken, either because of the speaker's accent or because a word is slang that not everyone outside the context can relate to. • We are reading tweets that are cryptic, with several misspelled words, emoticons, hashtags and so on. • The commonality in all the above cases is the presence of noise along with the signal. • The noise or ambiguities result in uncertainty of interpretation. • In order for an NLP application to process such text, we need appropriate mathematical machinery. • Probability theory is our tool to handle such cases.
- 3. What is a Random Variable? • A is a Boolean-valued RV if A denotes an event and there is some degree of uncertainty as to whether A occurs. • Example: it will rain in Manchester during the 4th Cricket test match between India and England. • The probability of A is the fraction of possible worlds in which A is true. • (Diagram: a rectangle of total area 1 representing all possible worlds, split into the worlds where A is true and the worlds where A is false.) • A Random Variable is not a variable in the traditional sense; it is rather a function mapping outcomes to values.
- 4. Axioms of Probability The following axioms always hold: • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) Note: we can represent these diagrammatically and verify them.
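These axioms can be checked numerically with a toy "possible worlds" model. The sketch below is illustrative only: the ten equally likely worlds and the events A and B are assumptions made up for this example, not part of the slides.

```python
# Toy "possible worlds" model: 10 equally likely worlds (an assumption
# for this sketch), with events A and B as subsets of worlds.
worlds = set(range(10))
A = {w for w in worlds if w % 2 == 0}    # event A: even-numbered worlds
B = {w for w in worlds if w < 4}         # event B: worlds 0..3

def prob(event):
    """P(E) = fraction of possible worlds in which E is true."""
    return len(event) / len(worlds)

p_a, p_b = prob(A), prob(B)
p_or, p_and = prob(A | B), prob(A & B)

assert 0 <= p_a <= 1                              # 0 <= P(A) <= 1
assert prob(worlds) == 1                          # P(True) = 1
assert prob(set()) == 0                           # P(False) = 0
assert abs(p_or - (p_a + p_b - p_and)) < 1e-12    # P(A or B) = P(A) + P(B) - P(A and B)
```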
- 5. Multivalued Discrete Random Variables Examples of multivalued RVs • Number of hashtags in a Tweet • Number of URLs in a tweet • Number of tweets that mention the term iPhone in a given day • Number of tweets sent by Times Now channel per day • The values a word position can take in a natural language text sentence of a given vocabulary
- 6. Random Variable Example • We can treat the number of hashtags contained in a given tweet (h) as a Random Variable because: • A tweet has a non-negative number of hashtags, 0 to n, where n is the maximum number of hashtags bounded by the size of the tweet. For example, if a tweet has 140 characters and the minimum string length required to specify a hashtag is 3 (one # character, one character for the tag, one delimiter), then n = floor(140 / 3) = 46. • As we observe, h is a discrete RV taking values 0 <= h <= 46. • Thus the probability of finding a hashtag in a tweet satisfies 0 < P(hashtag in a tweet) < 1, and: P(h = 0 or h = 1 or ... or h = 46) = 1, where h is the number of hashtags in the tweet; P(h = hi, h = hj) = 0 when i ≠ j, since hi and hj are 2 different values of the hashtag count and a tweet cannot take both at once. • If we are given a set of tweets, e.g. 500, we can compute the average number of hashtags per tweet. This is real-valued and hence a continuous variable. • We can find the probability distribution for the continuous variable also (see later slides).
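As a small sketch of treating the hashtag count as a discrete RV, the snippet below counts hashtags in a few tweets and builds the empirical distribution of h. The tweets and the `#\w+` pattern are illustrative assumptions, not data from the slides' corpora.

```python
import re

# Hypothetical tweets (made up for illustration; not from the slides' corpora).
tweets = [
    "Watching the #budget debate live",
    "No tags in this one",
    "#elections #india turnout looks high",
    "Great session on #NLP #ML #AI today",
]

def hashtag_count(tweet):
    """h: the number of hashtags in a tweet (a discrete RV)."""
    return len(re.findall(r"#\w+", tweet))

counts = [hashtag_count(t) for t in tweets]              # [1, 0, 2, 3]
# Empirical distribution P(h = k): fraction of tweets with k hashtags.
dist = {k: counts.count(k) / len(counts) for k in set(counts)}

assert abs(sum(dist.values()) - 1.0) < 1e-12             # probabilities sum to 1
```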
- 7. Joint Distribution of Discrete Random Variables • Consider 2 RVs X and Y, where X and Y can take discrete values. The joint distribution is given by P(X = x, Y = y), which satisfies: 1. P(X, Y) >= 0 2. Σi Σj P(X = xi, Y = yj) = 1, where the summation runs over all i and all j. • This concept can be generalized to n variables. Sample per-tweet feature counts:

| HASHTAG | RT | URL | SNAME |
|---------|----|-----|-------|
| 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
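Using the per-tweet feature rows shown on this slide, a minimal sketch of estimating a joint distribution by relative frequency (here for the HASHTAG and URL columns) and checking both properties:

```python
from collections import Counter

# Per-tweet feature rows from the slide: (HASHTAG, RT, URL, SNAME) counts.
rows = [
    (0, 0, 1, 0), (3, 0, 1, 1), (1, 0, 0, 0), (0, 0, 0, 0),
    (0, 0, 1, 0), (0, 0, 0, 0), (1, 0, 0, 1), (0, 0, 0, 0),
    (0, 0, 0, 0), (0, 0, 0, 0), (1, 0, 1, 0), (1, 1, 0, 0),
    (1, 0, 1, 0), (1, 0, 0, 0), (0, 0, 1, 0), (3, 0, 0, 0),
    (0, 1, 0, 0),
]

# Joint P(HASHTAG = x, URL = y) estimated from relative frequencies.
pairs = Counter((r[0], r[2]) for r in rows)
joint = {xy: n / len(rows) for xy, n in pairs.items()}

assert all(p >= 0 for p in joint.values())            # P(X, Y) >= 0
assert abs(sum(joint.values()) - 1.0) < 1e-12         # sums to 1 over all (x, y)
```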
- 8. Frequency Distributions for our Tweet Corpora • Some observations: • The frequency distributions show distinguishable patterns across the different corpora. For example, in the case of the Union budget, the number of tweets with one or more hashtags is roughly 50% of total tweets, while for the Railway budget it is roughly 25%. • The expectation values of hashtags/retweets/URLs/screen names per tweet can be treated as a vector of real-valued Random Variables and used for classification. • (Charts: frequency distributions of htfreq, rtfreq, urlfreq and snfreq for Dataset 10, Tweets on General Elections; Dataset 1, Tweets on Railway Budget; and Dataset 19, Tweets on Union Budget.)
- 9. Random Variables: Illustration • Let us illustrate the definition of Random Variables and derive the sum and product rules of conditional probability with the example below: • Suppose we follow tweets from 2 media channels, say, Times News (x1) and Business News (x2). x1 is a general news channel and x2 is focused more on business news. • Suppose we measured the tweets generated by these 2 channels over a 1-month duration and observed the following: • Total tweets generated (union of tweets from both channels) = 1200 • Breakup of the tweets received, by source and tweet type:

|    | y1  | y2  |
|----|-----|-----|
| x1 | 600 | 200 |
| x2 | 40  | 360 |
- 10. Random Variables • We can model the stated example using 2 random variables: the source of a tweet (X) and the type of the tweet (Y). • X and Y are 2 RVs that can take values X = x where x ∈ {x1, x2}, and Y = y where y ∈ {y1, y2}. • From the given data, we can compute the prior probabilities and likelihoods: P(X = x1) = (600 + 200)/1200 = 2/3 ≈ 0.67 P(X = x2) = 1 - 2/3 = 1/3 ≈ 0.33 P(Y = y1 | X = x1) = 600 / (600 + 200) = 3/4 = 0.75 P(Y = y2 | X = x1) = 0.25 P(Y = y1 | X = x2) = 40 / (40 + 360) = 0.1 P(Y = y2 | X = x2) = 0.9
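The slide's numbers can be reproduced directly from the 2x2 count table; a minimal sketch:

```python
# The 2x2 count table from the slide: rows are sources (x1, x2),
# columns are tweet types (y1, y2).
counts = {("x1", "y1"): 600, ("x1", "y2"): 200,
          ("x2", "y1"): 40,  ("x2", "y2"): 360}
total = sum(counts.values())                       # 1200 tweets in all

def prior(x):
    """P(X = x): fraction of all tweets coming from source x."""
    return sum(n for (xi, _), n in counts.items() if xi == x) / total

def likelihood(y, x):
    """P(Y = y | X = x): fraction of source-x tweets that are of type y."""
    row_total = sum(n for (xi, _), n in counts.items() if xi == x)
    return counts[(x, y)] / row_total

assert abs(prior("x1") - 2 / 3) < 1e-12            # P(X = x1), roughly 0.67
assert abs(likelihood("y1", "x1") - 0.75) < 1e-12  # P(Y = y1 | X = x1)
assert abs(likelihood("y1", "x2") - 0.10) < 1e-12  # P(Y = y1 | X = x2)
```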
- 11. Conditional Probability • Conditional probability is the probability of one event given that another event has occurred. • Example: assume that we know the probability of finding a hashtag in a tweet. Suppose we have a tweet corpus C on a particular domain in which there is an increased probability of finding a hashtag. We begin with some prior notion of the probability of finding a hashtag in a tweet. When given the additional fact that the tweet was drawn from corpus C, we can revise our probability estimate of a hashtag to P(hashtag|C). This is called the posterior probability.
- 12. Sum Rule The sum rule (marginalization): P(X) = Σy P(X, Y = y). In our example: P(X = x1) = P(X = x1, Y = y1) + P(X = x1, Y = y2) We can calculate the other probabilities similarly: P(x2), P(y1), P(y2) Note: P(X = x1) + P(X = x2) = 1
- 13. Product Rule and Generalization From the product rule, we have: P(X, Y) = P(Y|X) P(X) We can generalize this into the chain rule: P(An, ..., A1) = P(An | An-1, ..., A1) P(An-1, ..., A1) For example, with n = 4: P(A4, A3, A2, A1) = P(A4 | A3, A2, A1) P(A3 | A2, A1) P(A2 | A1) P(A1)
- 14. Bayes Theorem From the product rule, we have: P(X, Y) = P(Y|X) P(X) Since P(X, Y) = P(Y, X), it follows that: P(Y|X) P(X) = P(X|Y) P(Y) From the above, we derive: P(Y|X) = P(X|Y) P(Y) / P(X) This is Bayes' Theorem.
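Applying the theorem to the two-channel example gives the posterior probability that a tweet came from x1 given that it is of type y1; a small sketch using the priors and likelihoods computed earlier:

```python
# Priors and likelihoods from the two-channel example.
p_x1, p_x2 = 2 / 3, 1 / 3                  # P(X = x1), P(X = x2)
p_y1_x1, p_y1_x2 = 0.75, 0.10              # P(Y = y1 | X = x1), P(Y = y1 | X = x2)

# Sum rule: P(Y = y1) by marginalizing over X.
p_y1 = p_y1_x1 * p_x1 + p_y1_x2 * p_x2

# Bayes: P(X = x1 | Y = y1) = P(Y = y1 | X = x1) P(X = x1) / P(Y = y1)
posterior = p_y1_x1 * p_x1 / p_y1

# Sanity check against the raw counts: of the 640 y1 tweets, 600 are from x1.
assert abs(posterior - 600 / 640) < 1e-9
```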
- 15. Independence • Independent variables: knowing Y does not alter our belief about X. From the product rule, we know P(X, Y) = P(X|Y) P(Y). If X and Y are independent random variables, P(X|Y) = P(X), hence: P(X, Y) = P(X) P(Y) We write X ⊥ Y to denote that X and Y are independent. • Conditional Independence • Informally, suppose X and Y are not independent on their own, but become independent once another variable Z is observed. This is denoted by X ⊥ Y | Z. • Definition: let X, Y, Z be discrete random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value of Z: P(X|Y, Z) = P(X|Z) It follows that: P(X, Y | Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)
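A hand-built numeric example (the 0.9/0.1 probabilities below are assumptions chosen for illustration) makes the distinction concrete: X and Y are conditionally independent given Z by construction, yet far from independent marginally.

```python
from itertools import product

def p_z(z):
    """P(Z = z): a fair binary variable."""
    return 0.5

def p_x_given_z(x, z):
    p = 0.9 if z == 1 else 0.1          # X tends to copy the value of Z
    return p if x == 1 else 1 - p

p_y_given_z = p_x_given_z               # Y behaves the same way, independently

def joint(x, y, z):
    """P(X, Y, Z) built so that X is independent of Y given Z."""
    return p_z(z) * p_x_given_z(x, z) * p_y_given_z(y, z)

# Conditional independence: P(X, Y | Z) = P(X | Z) P(Y | Z) for every (x, y, z).
for x, y, z in product((0, 1), repeat=3):
    p_xy_given_z = joint(x, y, z) / p_z(z)
    assert abs(p_xy_given_z - p_x_given_z(x, z) * p_y_given_z(y, z)) < 1e-12

# ...but X and Y are NOT marginally independent.
p_x1 = sum(joint(1, y, z) for y in (0, 1) for z in (0, 1))   # P(X = 1) = 0.5
p_y1 = sum(joint(x, 1, z) for x in (0, 1) for z in (0, 1))   # P(Y = 1) = 0.5
p_xy11 = sum(joint(1, 1, z) for z in (0, 1))                 # P(X = 1, Y = 1) = 0.41
assert abs(p_xy11 - p_x1 * p_y1) > 0.1   # 0.41 vs 0.25: dependence remains
```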
- 16. Graph Models: Bayesian Networks • Graph models are also called Bayesian networks, belief networks and probabilistic networks. • A graph consists of nodes and edges between the nodes. • Each node corresponds to a random variable X and is associated with the probability distribution of X. • A directed edge from X to Y means that X has an influence on Y. • This influence is specified by the conditional probability P(Y|X). • The graph is a DAG (directed acyclic graph). • Nodes and edges define the structure of the network, and the conditional probabilities are the parameters given the structure.
- 17. Examples • Preparation for an exam (R) and the marks obtained in the exam (M) • Marketing budget (B) and the advertisements (A) • Nationality of a team (N) and its chance of qualifying for a World Cup quarter final (Q) • In all cases, the probability distribution P respects the graph G
- 18. Representing the joint distributions • Consider P(A, B, C) = P(A) P(B|A) P(C|A, B). This can be represented as a graph (fig a, the fully connected DAG over A, B, C). • Key concept: factorization; each vertex contributes a factor. • Exercise: draw the graph for P(X1, X2, X3, X4, X5). • A joint probability distribution with conditional independence assumptions respects the associated graph, which has correspondingly fewer edges (fig b). • The graph describing the distribution is useful in 2 ways: • Visualization of conditional dependencies • Inferencing • Determining the conditional independences of a distribution is vital for tractable inference.
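A minimal sketch of factorization, assuming made-up conditional probability tables for binary A, B, C: the product of the per-vertex factors is itself a valid joint distribution.

```python
from itertools import product

p_a = {0: 0.6, 1: 0.4}                                   # P(A)
p_b_a = {(0, 0): 0.7, (1, 0): 0.3,                       # P(B = b | A = a), keyed (b, a)
         (0, 1): 0.2, (1, 1): 0.8}
p_c_ab = {(c, a, b): 0.5                                 # P(C = c | A = a, B = b): a fair coin
          for c, a, b in product((0, 1), repeat=3)}

def joint(a, b, c):
    """Factorization: P(A, B, C) = P(A) P(B|A) P(C|A, B)."""
    return p_a[a] * p_b_a[(b, a)] * p_c_ab[(c, a, b)]

total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12                          # a valid joint distribution
```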
- 19. Continuous Random Variables • A random variable X is said to be continuous if it takes continuous values, for example, a real-valued number representing some physical quantity that can vary at random. • A probability density f satisfies: f(x) >= 0 for all x; ∫ f(x) dx = 1; and, for example, P(0 <= X <= 1) = ∫ from 0 to 1 of f(x) dx. • Suppose there is a sentiment analyser that outputs a sentiment polarity value as a real number in the range (-1, 1). The output of this system is an example of a continuous RV. • While the polarity in the above example is a scalar variable, we also encounter vectors of continuous random variables. For example, some sentic computing systems model emotions as a multi-dimensional vector of real numbers. • Likewise, a vector whose elements are the average values of hashtags, URLs, screen names and retweets per tweet, averaged over a corpus, constitutes a vector of continuous Random Variables.
- 20. Notes on Probability density P(x)
- 21. Canonical Cases • Head to tail: A → B → C • Tail to tail: A ← B → C • Head to head: A → B ← C
- 22. Joint Distribution of Continuous Random Variables • Average feature values per tweet, by corpus:

| HASHTAG | RT | SCREEN NAME | URL | CORPUS |
|---------|------|-------------|-------|-----------|
| 0.462 | 0.65 | 0.534 | 0.336 | Elections |
| 0.434 | 0.482 | 0.41 | 0.548 | Elections |
| 0.452 | 0.454 | 0.398 | 0.58 | Elections |
| 0.232 | 0.254 | 0.146 | 0.284 | Railway |
| 0.248 | 0.296 | 0.218 | 0.32 | Railway |
| 0.64 | 0.312 | 0.18 | 0.158 | Union |
| 0.38 | 0.572 | 0.45 | 0.34 | Elections |
| 0.646 | 0.296 | 0.164 | 0.178 | Union |
| 0.314 | 0.262 | 0.162 | 0.338 | Railway |
| 0.672 | 0.318 | 0.176 | 0.188 | Union |
| 0.32 | 0.31 | 0.196 | 0.314 | Railway |
| 0.542 | 0.586 | 0.422 | 0.334 | Elections |
| 0.638 | 0.31 | 0.172 | 0.2 | Union |
| 0.41 | 0.636 | 0.37 | 0.348 | Elections |
| 0.382 | 0.58 | 0.418 | 0.344 | Elections |
| 0.416 | 0.492 | 0.414 | 0.58 | Elections |
| 0.546 | 0.63 | 0.434 | 0.318 | Elections |
| 0.512 | 0.676 | 0.47 | 0.368 | Elections |
| 0.672 | 0.314 | 0.166 | 0.158 | Union |
| 0.422 | 0.678 | 0.4 | 0.37 | Elections |
| 0.694 | 0.298 | 0.112 | 0.182 | Union |
| 0.44 | 0.46 | 0.442 | 0.582 | Elections |
| 0.502 | 0.668 | 0.462 | 0.388 | Elections |
| 0.418 | 0.49 | 0.388 | 0.562 | Elections |
| 0.478 | 0.62 | 0.41 | 0.308 | Elections |
| 0.5 | 0.624 | 0.516 | 0.334 | Elections |
- 23. Expectation Value • For discrete variables: E[f(X)] = Σ f(x) p(x) • If a random sample is picked from the distribution, the expectation is simply the average value of f(x) over the sample. • Thus, the values of hashtags, screen names etc. shown in the previous slide correspond to their respective expectation values. • For continuous variables: E[f(X)] = ∫ f(x) p(x) dx over the range of values X can take.
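A small sketch of the discrete case, using an assumed distribution over hashtag counts: the weighted sum equals the plain average over a sample drawn in exactly those proportions.

```python
# Assumed distribution over hashtag counts h (illustrative numbers).
dist = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}

# Discrete expectation: E[h] = sum over k of k * p(k)
expectation = sum(k * p for k, p in dist.items())        # 0.75

# The same quantity as an average over a sample drawn in those proportions.
sample = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
average = sum(sample) / len(sample)

assert abs(expectation - 0.75) < 1e-9
assert abs(average - expectation) < 1e-9
```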
- 24. Common Distributions • Normal: X ~ N(μ, σ²), e.g. the height of an entire population f(x) = (1 / sqrt(2πσ²)) exp(-(x - μ)² / (2σ²))
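A direct transcription of the density as a sketch; `normal_pdf` is just an illustrative helper name:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))"""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The standard normal peaks at the mean with density 1/sqrt(2*pi).
assert abs(normal_pdf(0.0) - 1.0 / math.sqrt(2.0 * math.pi)) < 1e-12
# The density is symmetric about the mean.
assert abs(normal_pdf(1.5, mu=0.5) - normal_pdf(-0.5, mu=0.5)) < 1e-12
```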
- 25. Naïve Bayes Classifier • Derive on the board.
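For reference, a sketch of the derivation typically done on the board: apply Bayes' theorem to the class posterior, drop the denominator (which is constant across classes), and use the "naive" conditional independence assumption on the features.

```latex
% Posterior over class c for features x_1, ..., x_n (Bayes' theorem):
P(c \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid c)\, P(c)}{P(x_1, \ldots, x_n)}

% The denominator does not depend on c, so for classification:
\hat{c} = \arg\max_c \; P(x_1, \ldots, x_n \mid c)\, P(c)

% Naive assumption: features are conditionally independent given c:
\hat{c} = \arg\max_c \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```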
- 26. Naïve Bayes Classifier: Worked out example • Classify the tweets into one of three corpora (Railway, Union, Elections), given a dataset. • Split the dataset into 75% training and 25% test. • Estimate the parameters using the training data. • Evaluate the model with the test data. • Report the classifier accuracy.
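One possible realization of this exercise, as a sketch: a from-scratch Gaussian Naive Bayes over a subset of the per-tweet average feature vectors shown earlier. For brevity it trains and evaluates on the same small sample rather than doing the 75/25 split the slide proposes, and the variance smoothing term 1e-6 is an added assumption.

```python
import math
from collections import defaultdict

# A subset of the (HASHTAG, RT, SCREEN NAME, URL) rows from the earlier slide.
data = [
    ((0.462, 0.650, 0.534, 0.336), "Elections"),
    ((0.434, 0.482, 0.410, 0.548), "Elections"),
    ((0.452, 0.454, 0.398, 0.580), "Elections"),
    ((0.232, 0.254, 0.146, 0.284), "Railway"),
    ((0.248, 0.296, 0.218, 0.320), "Railway"),
    ((0.640, 0.312, 0.180, 0.158), "Union"),
    ((0.380, 0.572, 0.450, 0.340), "Elections"),
    ((0.646, 0.296, 0.164, 0.178), "Union"),
    ((0.314, 0.262, 0.162, 0.338), "Railway"),
    ((0.672, 0.318, 0.176, 0.188), "Union"),
    ((0.320, 0.310, 0.196, 0.314), "Railway"),
    ((0.542, 0.586, 0.422, 0.334), "Elections"),
    ((0.638, 0.310, 0.172, 0.200), "Union"),
]

def fit(rows):
    """Estimate per-class priors and per-feature mean/variance."""
    by_class = defaultdict(list)
    for x, y in rows:
        by_class[y].append(x)
    params = {}
    for label, xs in by_class.items():
        n = len(xs)
        means = [sum(col) / n for col in zip(*xs)]
        varis = [sum((v - m) ** 2 for v in col) / n + 1e-6   # smoothing (assumption)
                 for col, m in zip(zip(*xs), means)]
        params[label] = (n / len(rows), means, varis)
    return params

def log_gauss(x, mu, var):
    """Log of the normal density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(params, x):
    """Class with the highest log posterior (the shared evidence term dropped)."""
    def score(label):
        prior, means, varis = params[label]
        return math.log(prior) + sum(
            log_gauss(v, m, s2) for v, m, s2 in zip(x, means, varis))
    return max(params, key=score)

params = fit(data)
accuracy = sum(predict(params, x) == y for x, y in data) / len(data)
assert accuracy >= 0.9   # the three corpora are well separated in this sample
```

On this sample the three classes occupy tight, well-separated regions of the feature space (e.g. Union tweets have high HASHTAG and low URL averages), so the classifier recovers the labels essentially perfectly; with a proper held-out test split the reported accuracy would be the honest figure of merit.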