To download slides:
http://www.intelligentmining.com/category/knowledge-base/ These are my notes for a presentation I did internally at IM. It covers both the multinomial and multi-variate Bernoulli event models in Naive Bayes text classification.

Published on: **Mar 3, 2016**

Published in:
Technology Business

- 1. NAÏVE BAYESIANTEXT CLASSIFIEREVENT MODELSNOTES AND HOW TO USE THEMAlex LinSenior ArchitectIntelligent Miningalin@intelligentmining.com
- 2. Outline Quick refresher for Naïve Bayesian Text Classifier Event Models Multinomial Event Model Multi-variate Bernoulli Event Model Performance characteristics Why are they important? 2
- 3. Naïve Bayes Text Classifier Supervised Learning Labeled data set for training to classify the unlabeled data Easy to implement and highly scalable It is often the first thing to try Successful cases: Email spam filtering, News article categorization, Product classification, Social content categorization 3
- 4. N.B. Text Classifier - Trainingd1= w1w9w213w748w985...wn|y=1d2= w12w19w417w578...wn|y=0d3= w53w62w23w78w105...wn|y=1 Train y=0 y=1d4= w46w58w293w348w965...wn|y=1d5= w8w9w33w678w967...wn|y=0d6= w71w79w529w779...wn|y=1……dn= w49w127w23w93...wn|y=0 P(w1|y=1) P(w1|y=0) P(w3|y=1) P(w2|y=0) P(wn|y=1) P(w3|y=0) w1 w2 w3 … wn 4
- 5. N.B. Text Classifier - Classifying d1= w1w3|y=? Classify w1 w2 w3 … wn To classify a document, pick the class with “maximum posterior probability” Argmax P(Y = c | w1,w 3 ) c = Argmax P(w1,w 3 |Y = c)P(Y = c) c P(w1|y=0) = Argmax P(Y = c)∏ P(w n |Y = c) P(w3|y=1) c n P(w3|y=0) P(w1|y=1) P(Y = 0)P(w1 |Y = 0)P(w 3 |Y = 0) P(Y = 1)P(w1 |Y = 1)P(w 3 |Y = 1) 5 y=0 y=1€€
- 6. Bayesian Theorem Likelihood Class Prior P(X |Y )P(Y ) P(Y | X) = P(X) Posterior Evidence€ Generative learning algorithms model “Likelihood” P(X|Y) and “Class Prior” P(Y) then make prediction based on Bayes Theorem. 6
- 7. Naïve Bayes Text Classifier P(X |Y )P(Y ) P(Y | X) = P(X) Y ∈ {0,1} d = w1,w 6 ,...,w n Apply Bayes’ Theorem: € P(w1,w 6 ,...,w n |Y = 0)P(Y = 0) €P(Y = 0 | w1,w 6 ,...,w n ) = P(w1,w 6 ,...,w n ) P(w1,w 6 ,...,w n |Y = 1)P(Y = 1) P(Y = 1 | w1,w 6 ,...,w n ) = P(w1,w 6 ,...,w n )€ €€ 7 €
- 8. Naïve Bayes Text Classifier P(X |Y )P(Y ) P(Y | X) = P(X) Y ∈ {0,1} d = w1,w 6 ,...,w n To find most likely class label for d: € Argmax P(Y = c | w1,w 6 ,...,w n )€ € c P(w1,w 6 ,...,w n |Y = c)P(Y = c) = Argmax c P(w1,w 6 ,...,w n ) € = Argmax P(w1,w 6 ,...,w n |Y = c)P(Y = c) c How do we estimate likelihood? € 8 €
- 9. Estimate Likelihood P(X|Y) How to estimate Likelihood P(w1,w 6 ,...,w n |Y = c) ? Naïve Bayes’ assumption - assume that the words (wn) are conditionally independent given y. P(w1,w 6 ,...,w n |Y € c) = = P(w1 |Y = c) * P(w 6 |Y = c) * ...* P(w n |Y = c) = ∏ P(w i |Y = c) i∈n Different event models€ = ∑ log(P(w i |Y = c)) i∈n 9€
- 10. Differences of Two Models ∏ P(w i |Y = c) Multinomial Event model i∈n tf wi ,c Term Freq. of wi in Class c P(w i |Y = c) = c € Sum of all Term Freq. in Class c Multi-variate Bernoulli Event model€ df wi ,c Doc Freq. of wi in Class c P(w i |Y = c) = # of Docs in Class c Nc when wi not exists in d 10 P(w i |Y = c) = 1− P(w i |Y = c)€
- 11. Comparison of Two Models tf wi ,cMultinomial: P(w i |Y = c) = c Label w1 w2 w3 w4 … wn Y=0 tfw1,0 tfw2,0 tfw2,0 tfw2,0 … tfwn,0 |c| € Y=1 tfw1,1 tfw2,1 tfw2,1 tfw2,1 … tfwn,1 |c| df wi ,cMulti-variate Bernoulli: P(w i |Y = c) = Nc Label w1 w2 w3 w4 … wn Y=0 dfw1,0 dfw2,0 dfw3,0 dfw4,0 … dfwn,0 N0 € Y=0 !dfw1,0 !dfw2,0 !dfw3,0 !dfw4,0 … !dfwn,0 N0 Y=1 dfw1,1 dfw2,1 dfw3,1 dfw4,1 … dfwn,1 N1 11 Y=1 !dfw1,1 !dfw2,1 !dfw3,1 !dfw4,1 … !dfwn,1 N1
- 12. Comparison of Two Models tf wi ,c ∏ P(w |Y = c)Multinomial: P(w i |Y = c) = c i∈n id = {w1,w 2 ,...,w n } n: # of words in d w n ∈ {1,2,3,..., V } c €Argmax ∑ log(P(w i |Y = c)) + log( ) € c i∈n D € df wi ,cMulti-variate Bernoulli: P(w i |Y = c) = Ncd = {w1,w 2 ,...,w n } n: # of words in vocabulary |V| w n ∈ {0,1} € cArgmax ∑ log(P(w i |Y = c) (1− P(w i |Y = c)) wi 1−w i ) + log( ) c i∈n D 12 €
- 13. Multinomial For each word in docw1 w2 w3 w4 .. wn w1 w2 w3 w4 .. w5 Y=0 Y=1 A% B% 13
- 14. Multi-variate Bernoulli For each word in doc Y=1 A% W Y=0 1-A% When W does exists in doc Y=1 B% W’ Y=0 1-B% When W does not exists in doc 14
- 15. Performance Characteristics Multinomial would perform better Multi-variate Bernoulli in most text classification tasks, especially when vocabulary size |V| >= 1K Multinomial perform better when handling data sets that have large variance in document length Multivariate Bernoulli could have the advantage of dense data set Non-text features could be added as additional Bernoulli variables. However it should not be added to the vocabulary in Multinomial model 15
- 16. Why are they interesting? Certaindata points are more suitable for one event model than the other. Example: Web page text + “Domain” + “Author” Social content text + “Who” Product name / desc + “Brand” We can create a Naïve Bayes Classifier that combines event models Most importantly, try both on your data set 16
- 17. Thank you Any question or comment? 17
- 18. Appendix Laplace Smoothing Generative vs Discriminative Learning Multinomial Event Model Multi-variate Bernoulli Event Model Notation 18
- 19. Laplace Smoothing Multinomial: tf wi ,c + 1 P(w i |Y = c) = c+V Multi-variate Bernoulli:€ df wi ,c + 1 P(w i |Y = c) = Nc + 2 19€
- 20. Generative vs. DiscriminativeDiscriminative Learning Algorithm:Try to learn P(Y | X) directly or try to map input Xto labels Y directlyGenerative Learning Algorithm:Try to model P(X |Y) and € P(Y ) first, then useBayes theorem to find out P(X |Y )P(Y ) P(Y | X) = € P(X) € 20
- 21. Multinomial Event Model tf wi ,c P(w i |Y = c) = c d = {w1,w 2 ,...,w n } n: # of words in d w n ∈ {1,2,3,..., V } = Argmax P(w1,w 6 ,...,w n |Y = c)P(Y = c) c€ € c tf wi ,c = Argmax ∑ log( ) + log( ) c i∈n c D 21
- 22. Multi-variate Bernoulli Event Model df wi ,c P(w i |Y = c) = Nc d = {w1,w 2 ,...,w n } n: # of words in vocabulary |V| w n ∈ {0,1} = Argmax P(w1,w 6 ,...,w n |Y = c)P(Y = c) c€ df wi ,c df € wi ,c 1−wi c = Argmax ∑ log(( ) )(1− ( wi ) ) + log( ) c i∈n Nc Nc D 22
- 23. Notation d: a document w: a word in a document X: observed data attributes Y: Class Label |V|: number of terms in vocabulary |D|: number of docs in training set |c|: number of docs in class c tf: Term frequency df: document frequency !df: 23
- 24. References McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classiﬁcation. In: Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press (1998) 41–48 Technical Report WS-98-05 Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam Filtering with Naive Bayes – Which Naive Bayes (2006) Third Conference on Email and Anti-Spam (CEAS) Schneider, K: On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification (2004) España for Natural Language Processing, EsTAL 24