Vectorizing the item names and descriptions (English-language text)

Notebook based on:
Getting Started with Text Vectorization
How to Vectorize Text in DataFrames for NLP Tasks — 3 Simple Techniques
Multimodal deep learning to predict movie genres

0. Cleaning data

We use the Texthero library. To apply the default text-cleaning pipeline, run hero.clean(pandas.Series). By default, clean() runs the following seven functions: fillna (replace missing values with empty strings), lowercase, remove_digits, remove_punctuation, remove_diacritics, remove_stopwords, and remove_whitespace.

We also apply word lemmatization using NLTK (a separate step, not part of Texthero's default clean() pipeline). The goal of lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For example, 'finds' is reduced to 'find'.

Texthero documentation
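The cleaning pipeline above can be approximated without Texthero. Below is a minimal stdlib sketch of the same steps; the function name, the tiny stop-word list, and the exact regexes are simplifications for illustration, not Texthero's implementation:

```python
import re
import unicodedata

# A small stop-word list for illustration; Texthero uses a much fuller list.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "are"}

def clean_text(text):
    """Approximate Texthero's default clean() pipeline on a single string."""
    text = text or ""                      # fillna: treat missing values as empty
    text = text.lower()                    # lowercase
    text = re.sub(r"\d+", "", text)        # remove digits
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    # remove diacritics (e.g. 'cafés' -> 'cafes')
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    return " ".join(tokens)                # collapse extra whitespace

print(clean_text("The 2 Cafés are GREAT!"))  # cafes great
```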

1. Binary Term Frequency

Binary Term Frequency captures the presence (1) or absence (0) of a term in a document. With TfidfVectorizer, we set the binary parameter to True so that only presence (1) or absence (0) is recorded, the norm parameter to None so that no normalization is applied, and use_idf to False so that IDF weighting is disabled.
TfidfVectorizer documentation

2. Bag of Words (BoW) Term Frequency

Bag of Words (BoW) Term Frequency captures the raw frequency of a term in a document. With TfidfVectorizer, we set the binary parameter to False so that actual term frequencies are counted, the norm parameter to None, and use_idf to False.
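On the same toy corpus as above, only the binary flag changes, and the matrix now holds raw counts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["cat cat dog", "dog fish"]

# binary=False keeps the actual counts; norm=None and use_idf=False leave them raw.
vec = TfidfVectorizer(binary=False, norm=None, use_idf=False)
X = vec.fit_transform(corpus)

print(X.toarray())
# [[2. 1. 0.]
#  [0. 1. 1.]]
```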

3. (L1) Normalized Term Frequency

(L1) Normalized Term Frequency captures the L1-normalized BoW term frequency in a document, so each row sums to 1. With TfidfVectorizer, we set the binary parameter to False so that actual term frequencies are counted, the norm parameter to 'l1', and use_idf to False.
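Continuing the same toy corpus, switching norm to 'l1' divides each row of counts by its sum:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["cat cat dog", "dog fish"]

# norm='l1' divides each row by its L1 norm (the total token count per document).
vec = TfidfVectorizer(binary=False, norm="l1", use_idf=False)
X = vec.fit_transform(corpus)

print(X.toarray())
# row 0: [2/3, 1/3, 0]; row 1: [0, 1/2, 1/2]
```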

4. (L2) Normalized TF-IDF

(L2) Normalized TF-IDF (Term Frequency-Inverse Document Frequency) captures the L2-normalized TF-IDF weight of each term in a document. The formula for computing TF-IDF is given below.
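A sketch of the weighting scikit-learn applies with its defaults (smooth_idf=True), which differs slightly from the classic textbook ln(n/df) form:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)$$

where tf(t, d) is the count of term t in document d, n is the number of documents, and df(t) is the number of documents containing t. The resulting row vectors are then L2-normalized.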
With TfidfVectorizer, we set the binary parameter to False so that actual term frequencies are counted, and the norm parameter to 'l2' (both are the defaults).
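On the same toy corpus, with IDF enabled and L2 normalization, each row vector has unit length, and rarer terms (here 'cat' and 'fish') get higher weights than the common term 'dog':

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["cat cat dog", "dog fish"]

# binary=False, norm='l2', use_idf=True are all scikit-learn defaults.
vec = TfidfVectorizer(binary=False, norm="l2", use_idf=True)
X = vec.fit_transform(corpus)

print(X.toarray())  # each row has L2 norm 1.0
```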

5. Word2Vec

Word2Vec provides an embedded (dense vector) representation of words. Word2Vec starts with one vector per word in the corpus and trains a shallow neural network (with one hidden layer) on a very large corpus of data. Python's spaCy package provides pre-trained models we can use to see how Word2Vec-style embeddings work.
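spaCy exposes a pre-trained vector per token (token.vector) and averages them into a document vector (doc.vector). A stdlib sketch of that averaging step, using a tiny hand-made vector table; the 3-dimensional values below are illustrative assumptions (real Word2Vec/spaCy vectors have ~300 dimensions):

```python
# Toy word-vector table (illustrative values, not a trained model).
VECTORS = {
    "cat":  [0.8, 0.1, 0.0],
    "dog":  [0.7, 0.2, 0.1],
    "fish": [0.1, 0.0, 0.9],
}

def doc_vector(text):
    """Average the vectors of known words, like spaCy's doc.vector."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

print(doc_vector("cat dog"))
```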

6. Choosing the vectorization method

7. Encoding the labels (y_train)
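Assuming the labels are categorical strings, scikit-learn's LabelEncoder is one common choice; the label values below are made-up examples, not this notebook's actual data:

```python
from sklearn.preprocessing import LabelEncoder

y_train = ["shoes", "books", "shoes", "toys"]  # illustrative labels

le = LabelEncoder()
y_encoded = le.fit_transform(y_train)  # maps classes to integers 0..n_classes-1

print(le.classes_)   # classes in alphabetical order: ['books' 'shoes' 'toys']
print(y_encoded)     # [1 0 1 2]
```

le.inverse_transform recovers the original string labels from predictions.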