Vectorizers
One of the most important choices you have to make for a topic model is the vectorizer. A vectorizer is responsible for extracting term features from text: it determines which terms word-importance scores will be calculated for.
By default, Turftopic uses sklearn's CountVectorizer, which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might require a different or more sophisticated approach. This is why we provide a vectorizers module, where a wide range of useful options is available to you.
How is this different from preprocessing?
You might think that preprocessing the documents would have the same effect as some of these vectorizers, but this is not entirely the case. When you remove stop words or lemmatize texts in preprocessing, you remove a lot of valuable information that your topic model then can't use. By defining a custom vectorizer you limit the vocabulary of your model, so word-importance scores are only learned for certain terms, while your documents stay fully intact.
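As a rough sketch of this distinction (using KeyNMF purely for illustration; corpus stands for your list of raw documents, as in the examples below), the vectorizer restricts the vocabulary, while the documents themselves are passed to the model untouched:
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF
# Restrict the vocabulary: drop English stop words and terms occurring in fewer than 10 documents
vectorizer = CountVectorizer(stop_words="english", min_df=10)
# The raw, unprocessed documents are still what the model gets to see
model = KeyNMF(10, vectorizer=vectorizer)
model.fit(corpus)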
Phrase Vectorizers
You might want to get phrases in your topic descriptions instead of individual words. This can be a very reasonable choice, as it is often not words in themselves but the phrases they form that describe a topic most accurately. Turftopic supports multiple ways of using phrases as fundamental terms.
N-gram Features with CountVectorizer
CountVectorizer supports n-gram extraction right out of the box. Just define a custom vectorizer with an ngram_range.
Tip
While this option is naive, and will likely yield the lowest quality results, it is also incredibly fast compared to other phrase vectorization techniques. It might, however, be slower if the topic model encodes its vocabulary when fitting.
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF
vectorizer = CountVectorizer(ngram_range=(2,3), stop_words="english")
model = KeyNMF(10, vectorizer=vectorizer)
model.fit(corpus)
model.print_topics()
Topic ID | Highest Ranking |
---|---|
0 | bronx away sank, blew bronx away, blew bronx, bronx away, sank manhattan, stay blew bronx, manhattan sea, away sank manhattan, said queens stay, queens stay |
1 | faq alt atheism, alt atheism archive, atheism overview alt, alt atheism resources, atheism faq frequently, archive atheism overview, alt atheism faq, overview alt atheism, titles alt atheism, readers alt atheism |
2 | theism factor fanatism, theism leads fanatism, fanatism caused theism, theism correlated fanaticism, fanatism point theism, fanatism deletion theism, fanatics tend theism, fanaticism said fanatism, correlated fanaticism belief, strongly correlated fanaticism |
3 | alt atheism, atheism archive, alt atheism archive, archive atheism, atheism atheism, atheism faq, archive atheism introduction, atheism archive introduction, atheism introduction alt, atheism introduction |
... |
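If you would also like single words alongside phrases in your topic descriptions, you can widen the n-gram range. This is a variant of the example above; the resulting topics will of course differ:
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF
# Include unigrams, bigrams and trigrams in the vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)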
Noun phrases with NounPhraseCountVectorizer
Turftopic can also use noun phrases by utilizing the SpaCy package. For noun-phrase vectorization to work, you will have to install SpaCy.
pip install turftopic[spacy]
You will also need to install a relevant SpaCy pipeline for the language you intend to use. The default pipeline is the English en_core_web_sm, and you should install it before attempting to use NounPhraseCountVectorizer. You can find a model that fits your needs here.
python -m spacy download en_core_web_sm
Using SpaCy pipelines will substantially slow down model fitting, but the results may be more accurate and of higher quality than with naive n-gram extraction.
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
model = KeyNMF(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
Topic ID | Highest Ranking |
---|---|
0 | atheists, atheism, atheist, belief, beliefs, theists, faith, gods, christians, abortion |
1 | alt atheism, usenet alt atheism resources, usenet alt atheism introduction, alt atheism faq, newsgroup alt atheism, atheism faq resource txt, alt atheism groups, atheism, atheism faq intro txt, atheist resources |
2 | religion, christianity, faith, beliefs, religions, christian, belief, science, cult, justification |
3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
... |
Keyphrases with KeyphraseVectorizers
You can extract candidate keyphrases from text using KeyphraseVectorizers. KeyphraseVectorizers uses POS-tag patterns to identify phrases, rather than the word dependency graphs that NounPhraseCountVectorizer relies on. It can therefore potentially be faster, as the dependency parser component is not needed in the SpaCy pipeline. This vectorizer is not part of the Turftopic package, but it can easily be used with it out of the box.
pip install keyphrase-vectorizers
from keyphrase_vectorizers import KeyphraseCountVectorizer
from turftopic import KeyNMF
vectorizer = KeyphraseCountVectorizer()
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)
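KeyphraseCountVectorizer also lets you customize which POS-tag patterns count as a phrase. The pos_pattern argument below (zero or more adjectives followed by one or more nouns) reflects the default documented by keyphrase-vectorizers; treat the exact pattern as an assumption and consult that package's documentation:
from keyphrase_vectorizers import KeyphraseCountVectorizer
from turftopic import KeyNMF
# Match phrases consisting of optional adjectives followed by nouns
# (pos_pattern assumed from the keyphrase-vectorizers docs)
vectorizer = KeyphraseCountVectorizer(pos_pattern="<J.*>*<N.*>+")
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)
model.print_topics()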
Lemmatizing and Stemming Vectorizers
Since the same word can appear in multiple forms in a text, you can sometimes obtain higher quality results by stemming or lemmatizing words before processing them.
Warning
You should NEVER lemmatize or stem texts before passing them to a topic model in Turftopic, but rather, use a vectorizer that limits the model's vocabulary to the terms you are interested in.
Extracting lemmata with LemmaCountVectorizer
Similarly to NounPhraseCountVectorizer, LemmaCountVectorizer relies on a SpaCy pipeline to extract lemmata from a piece of text. This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.
pip install turftopic[spacy]
python -m spacy download en_core_web_sm
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import LemmaCountVectorizer
model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
Topic ID | Highest Ranking |
---|---|
0 | atheist, theist, belief, christians, agnostic, christian, mythology, asimov, abortion, read |
1 | morality, moral, immoral, objective, society, animal, natural, societal, murder, morally |
2 | religion, religious, christianity, belief, christian, faith, cult, church, secular, christians |
3 | atheism, belief, agnosticism, religious, faq, lack, existence, theism, atheistic, allah |
4 | islam, muslim, islamic, rushdie, khomeini, bank, imam, bcci, law, secular |
... |
Stemming words with StemmingCountVectorizer
You might find that lemmatization isn't aggressive enough for your purposes and that many forms of the same word still make it into topic descriptions. In that case you should try stemming! Stemming is available in Turftopic via the Snowball Stemmer, which has to be installed before you can use stemming vectorization.
Should I choose stemming or lemmatization?
In almost all cases you should prefer lemmatization over stemming, as it provides higher quality and more correct results. You should only use a stemmer if:
- You need something fast (lemmatization is slower due to a more involved pipeline)
- You know what you want and it is definitely stemming.
pip install turftopic[snowball]
Then you can initialize a topic model with this vectorizer:
from turftopic import KeyNMF
from turftopic.vectorizers.snowball import StemmingCountVectorizer
model = KeyNMF(10, vectorizer=StemmingCountVectorizer(language="english"))
model.fit(corpus)
model.print_topics()
Topic ID | Highest Ranking |
---|---|
0 | atheism, belief, alt, theism, agnostic, stalin, lack, sceptic, exist, faith |
1 | religion, belief, religi, cult, faith, theism, secular, theist, scientist, dogma |
2 | bronx, manhattan, sank, queen, sea, away, said, com, bob, blew |
3 | moral, human, instinct, murder, kill, law, behaviour, action, behavior, ethic |
4 | atheist, theist, belief, asimov, philosoph, mytholog, strong, faq, agnostic, weak |
Non-English Vectorization
You may find that, especially with non-Indo-European languages, CountVectorizer does not perform that well. In these cases we recommend that you use a vectorizer with its own language-specific tokenization rules and stop-word list:
Vectorizing Any Language with TokenCountVectorizer
The SpaCy package includes language-specific tokenization and stop-word rules for just about any language. We provide a vectorizer that you can use with the language of your choice.
pip install turftopic[spacy]
Note
You do not have to install any SpaCy pipelines for this to work. No pipelines or models will be loaded with TokenCountVectorizer, only a language-specific tokenizer.
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import TokenCountVectorizer
# CountVectorizer for Arabic
vectorizer = TokenCountVectorizer("ar", min_df=10)
model = KeyNMF(
    n_components=10,
    vectorizer=vectorizer,
    encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
)
model.fit(corpus)
Extracting Chinese Tokens with ChineseCountVectorizer
The Chinese language does not separate tokens by whitespace, unlike most Indo-European languages. You thus need to use special tokenization rules for Chinese. Turftopic provides tools for Chinese tokenization via the Jieba package.
Note
We recommend that you use Jieba over SpaCy for topic modeling with Chinese.
You will need to install the package in order to be able to use our Chinese vectorizer.
pip install turftopic[jieba]
You can then use the ChineseCountVectorizer object, which comes preloaded with the Jieba tokenizer along with a Chinese stop-word list.
from turftopic import KeyNMF
from turftopic.vectorizers.chinese import ChineseCountVectorizer
vectorizer = ChineseCountVectorizer(min_df=10, stop_words="chinese")
model = KeyNMF(10, vectorizer=vectorizer, encoder="BAAI/bge-small-zh-v1.5")
model.fit(corpus)
model.print_topics()
Topic ID | Highest Ranking |
---|---|
0 | 消息, 时间, 科技, 媒体报道, 美国, 据, 国外, 讯, 宣布, 称 |
1 | 体育讯, 新浪, 球员, 球队, 赛季, 火箭, nba, 已经, 主场, 时间 |
2 | 记者, 本报讯, 昨日, 获悉, 新华网, 基金, 通讯员, 采访, 男子, 昨天 |
3 | 股, 下跌, 上涨, 震荡, 板块, 大盘, 股指, 涨幅, 沪, 反弹 |
... |
API Reference
turftopic.vectorizers.spacy.NounPhraseCountVectorizer
Bases: CountVectorizer
Extracts noun phrases from text using SpaCy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nlp | Union[Language, str] | A SpaCy pipeline or its name. | 'en_core_web_sm' |
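Since nlp accepts either a pipeline name or an already loaded spacy.Language object, you can also pass in a pipeline you have configured yourself. A minimal sketch, assuming the English pipeline has been downloaded as shown earlier:
import spacy
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
# Pass a loaded pipeline object instead of its name
nlp = spacy.load("en_core_web_sm")
vectorizer = NounPhraseCountVectorizer(nlp)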
turftopic.vectorizers.spacy.LemmaCountVectorizer
Bases: CountVectorizer
Extracts lemmata from text using SpaCy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nlp | Union[Language, str] | A SpaCy pipeline or its name. | 'en_core_web_sm' |
turftopic.vectorizers.spacy.TokenCountVectorizer
Bases: CountVectorizer
Tokenizes text with SpaCy using its language-specific tokenization rules and stop-word lists.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
language_code | str | Language code for the language you intend to use. | 'en' |
remove_stop_words | bool | Indicates whether stop words should be removed. | True |
remove_nonalpha | bool | Indicates whether only tokens containing alphabetical characters should be kept. | True |
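A minimal sketch of how these parameters might be combined; the language code and settings here are illustrative, and SpaCy must ship tokenization rules for the code you pass:
from turftopic.vectorizers.spacy import TokenCountVectorizer
# Danish tokenizer that keeps stop words but discards non-alphabetic tokens
vectorizer = TokenCountVectorizer(
    "da",
    remove_stop_words=False,
    remove_nonalpha=True,
    min_df=5,
)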
turftopic.vectorizers.snowball.StemmingCountVectorizer
Bases: CountVectorizer
Extracts stemmed words from documents using Snowball.
turftopic.vectorizers.chinese.ChineseCountVectorizer
Bases: CountVectorizer
Chinese count vectorizer. Does word segmentation with Jieba and includes a Chinese stop-word list. You have to specify stop_words="chinese" for the stop-word list to take effect.