Vectorizers

One of the most important attributes you will have to choose for a topic model is the vectorizer. A vectorizer is responsible for extracting term features from text; it determines which terms word-importance scores will be calculated for.

By default, Turftopic uses sklearn's CountVectorizer, which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might require you to use a different or more sophisticated approach. This is why we provide a vectorizers module, where a wide range of useful options is available to you.
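
For instance, here is a minimal sketch of passing a customized CountVectorizer to a model (it assumes you already have a corpus as a list of strings; the parameter values are illustrative, not recommendations):

from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

# Keep only reasonably frequent, non-stop-word unigrams in the vocabulary
vectorizer = CountVectorizer(stop_words="english", min_df=5, max_features=5000)

model = KeyNMF(10, vectorizer=vectorizer)
model.fit(corpus)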

How is this different from preprocessing?

You might think that preprocessing the documents would have the same effect as some of these vectorizers, but this is not quite the case. When you remove stop words or lemmatize texts in preprocessing, you remove a lot of valuable information that your topic model then can't use. By defining a custom vectorizer you only limit the vocabulary of your model, so that word-importance scores are learned for certain terms only, while your documents stay fully intact.
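
To illustrate with a hypothetical two-document corpus: the vectorizer decides which terms enter the vocabulary, while the documents themselves are never modified.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cats were chasing the mice all day.",
    "Cats and mice were running around the garden.",
]
vectorizer = CountVectorizer(stop_words="english", min_df=2)
vectorizer.fit(docs)

# Only terms passing the filters make it into the vocabulary
print(vectorizer.get_feature_names_out())
# ['cats' 'mice']

# ...but the documents the topic model sees and embeds stay fully intact
print(docs[0])
# The cats were chasing the mice all day.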

Phrase Vectorizers

You might want to get phrases in your topic descriptions instead of individual words. This can be a very reasonable choice, as it is often not the words themselves but the phrases made up of them that describe a topic most accurately. Turftopic supports multiple ways of using phrases as fundamental terms.

N-gram Features with CountVectorizer

CountVectorizer supports n-gram extraction right out of the box. Just define a custom vectorizer with an ngram_range.

Tip

While this option is naive and will likely yield the lowest-quality results, it is also incredibly fast in comparison to other phrase vectorization techniques. It might, however, be slower if the topic model encodes its vocabulary when fitting.

from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

vectorizer = CountVectorizer(ngram_range=(2,3), stop_words="english")

model = KeyNMF(10, vectorizer=vectorizer)
model.fit(corpus)
model.print_topics()
Topic ID Highest Ranking
0 bronx away sank, blew bronx away, blew bronx, bronx away, sank manhattan, stay blew bronx, manhattan sea, away sank manhattan, said queens stay, queens stay
1 faq alt atheism, alt atheism archive, atheism overview alt, alt atheism resources, atheism faq frequently, archive atheism overview, alt atheism faq, overview alt atheism, titles alt atheism, readers alt atheism
2 theism factor fanatism, theism leads fanatism, fanatism caused theism, theism correlated fanaticism, fanatism point theism, fanatism deletion theism, fanatics tend theism, fanaticism said fanatism, correlated fanaticism belief, strongly correlated fanaticism
3 alt atheism, atheism archive, alt atheism archive, archive atheism, atheism atheism, atheism faq, archive atheism introduction, atheism archive introduction, atheism introduction alt, atheism introduction
...

Noun phrases with NounPhraseCountVectorizer

Turftopic can also use noun phrases by utilizing the SpaCy package. For noun phrase vectorization to work, you will have to install SpaCy.

pip install turftopic[spacy]

You will also need to install a relevant SpaCy pipeline for the language you intend to use. The default pipeline is the English en_core_web_sm, which you should install before attempting to use NounPhraseCountVectorizer.

You can find a pipeline that fits your needs in the SpaCy documentation.

python -m spacy download en_core_web_sm

Using SpaCy pipelines will substantially slow down model fitting, but the results may be more accurate and of higher quality than with naive n-gram extraction.

from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()

Topic ID Highest Ranking
0 atheists, atheism, atheist, belief, beliefs, theists, faith, gods, christians, abortion
1 alt atheism, usenet alt atheism resources, usenet alt atheism introduction, alt atheism faq, newsgroup alt atheism, atheism faq resource txt, alt atheism groups, atheism, atheism faq intro txt, atheist resources
2 religion, christianity, faith, beliefs, religions, christian, belief, science, cult, justification
3 fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism
4 religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index
...

Keyphrases with KeyphraseVectorizers

You can extract candidate keyphrases from text using KeyphraseVectorizers. KeyphraseVectorizers uses POS-tag patterns to identify phrases, rather than the dependency parse that NounPhraseCountVectorizer relies on. It can therefore potentially be faster, as the dependency-parser component is not needed in the SpaCy pipeline. This vectorizer is not part of the Turftopic package, but it can easily be used with it out of the box.

pip install keyphrase-vectorizers

from keyphrase_vectorizers import KeyphraseCountVectorizer
from turftopic import KeyNMF

vectorizer = KeyphraseCountVectorizer()
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)

Lemmatizing and Stemming Vectorizers

Since the same word can appear in multiple forms in a piece of text, one can sometimes obtain higher quality results by stemming or lemmatizing words in a text before processing them.

Warning

You should NEVER lemmatize or stem texts before passing them to a topic model in Turftopic, but rather, use a vectorizer that limits the model's vocabulary to the terms you are interested in.

Extracting lemmata with LemmaCountVectorizer

Similarly to NounPhraseCountVectorizer, LemmaCountVectorizer relies on a SpaCy pipeline for extracting lemmas from a piece of text. This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.

pip install turftopic[spacy]
python -m spacy download en_core_web_sm

from turftopic import KeyNMF
from turftopic.vectorizers.spacy import LemmaCountVectorizer

model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
Topic ID Highest Ranking
0 atheist, theist, belief, christians, agnostic, christian, mythology, asimov, abortion, read
1 morality, moral, immoral, objective, society, animal, natural, societal, murder, morally
2 religion, religious, christianity, belief, christian, faith, cult, church, secular, christians
3 atheism, belief, agnosticism, religious, faq, lack, existence, theism, atheistic, allah
4 islam, muslim, islamic, rushdie, khomeini, bank, imam, bcci, law, secular
...

Stemming words with StemmingCountVectorizer

You might find that lemmatization isn't aggressive enough for your purposes, and that many forms of the same word still make it into topic descriptions. In that case you should try stemming! Stemming is available in Turftopic via the Snowball stemmer, which has to be installed before using stemming vectorization.

Should I choose stemming or lemmatization?

In almost all cases you should prefer lemmatization over stemming, as it provides higher-quality and more accurate results. You should only use a stemmer if

  1. You need something fast (lemmatization is slower due to a more involved pipeline).
  2. You know what you want, and it is definitely stemming.

pip install turftopic[snowball]

Then you can initialize a topic model with this vectorizer:

from turftopic import KeyNMF
from turftopic.vectorizers.snowball import StemmingCountVectorizer

model = KeyNMF(10, vectorizer=StemmingCountVectorizer(language="english"))
model.fit(corpus)
model.print_topics()
Topic ID Highest Ranking
0 atheism, belief, alt, theism, agnostic, stalin, lack, sceptic, exist, faith
1 religion, belief, religi, cult, faith, theism, secular, theist, scientist, dogma
2 bronx, manhattan, sank, queen, sea, away, said, com, bob, blew
3 moral, human, instinct, murder, kill, law, behaviour, action, behavior, ethic
4 atheist, theist, belief, asimov, philosoph, mytholog, strong, faq, agnostic, weak

Non-English Vectorization

You may find that, especially with non-Indo-European languages, CountVectorizer does not perform that well. In these cases we recommend that you use a vectorizer with its own language-specific tokenization rules and stop-word list:

Vectorizing Any Language with TokenCountVectorizer

The SpaCy package includes language-specific tokenization and stop-word rules for just about any language. We provide a vectorizer that you can use with the language of your choice.

pip install turftopic[spacy]

Note

Note that you do not have to install any SpaCy pipelines for this to work. No pipelines or models will be loaded by TokenCountVectorizer, only a language-specific tokenizer.

from turftopic import KeyNMF
from turftopic.vectorizers.spacy import TokenCountVectorizer

# CountVectorizer for Arabic
vectorizer = TokenCountVectorizer("ar", min_df=10)

model = KeyNMF(
    n_components=10,
    vectorizer=vectorizer,
    encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
)
model.fit(corpus)

Extracting Chinese Tokens with ChineseCountVectorizer

The Chinese language does not separate tokens by whitespace, unlike most Indo-European languages. You thus need to use special tokenization rules for Chinese. Turftopic provides tools for Chinese tokenization via the Jieba package.

Note

We recommend that you use Jieba over SpaCy for topic modeling with Chinese.

You will need to install the package in order to be able to use our Chinese vectorizer.

pip install turftopic[jieba]

You can then use the ChineseCountVectorizer object, which comes preloaded with the jieba tokenizer along with a Chinese stop word list.

from turftopic import KeyNMF
from turftopic.vectorizers.chinese import ChineseCountVectorizer

vectorizer = ChineseCountVectorizer(min_df=10, stop_words="chinese")

model = KeyNMF(10, vectorizer=vectorizer, encoder="BAAI/bge-small-zh-v1.5")
model.fit(corpus)

model.print_topics()
Topic ID Highest Ranking
0 消息, 时间, 科技, 媒体报道, 美国, 据, 国外, 讯, 宣布, 称
1 体育讯, 新浪, 球员, 球队, 赛季, 火箭, nba, 已经, 主场, 时间
2 记者, 本报讯, 昨日, 获悉, 新华网, 基金, 通讯员, 采访, 男子, 昨天
3 股, 下跌, 上涨, 震荡, 板块, 大盘, 股指, 涨幅, 沪, 反弹
...

API Reference

turftopic.vectorizers.spacy.NounPhraseCountVectorizer

Bases: CountVectorizer

Extracts Noun phrases from text using SpaCy.

Parameters:

Name Type Description Default
nlp Union[Language, str] A Spacy pipeline or its name. 'en_core_web_sm'
Source code in turftopic/vectorizers/spacy.py
class NounPhraseCountVectorizer(CountVectorizer):
    """Extracts Noun phrases from text using SpaCy.

    Parameters
    ----------
    nlp: spacy.Language or str, default "en_core_web_sm"
        A Spacy pipeline or its name.
    """

    def __init__(
        self,
        nlp: Union[Language, str] = "en_core_web_sm",
        *,
        input="content",
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        lowercase=True,
        preprocessor=None,
        tokenizer=None,
        stop_words=None,
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),
        analyzer="word",
        max_df=1.0,
        min_df=1,
        max_features=None,
        vocabulary=None,
        binary=False,
        dtype=np.int64,
    ):
        self.nlp = nlp
        if isinstance(nlp, str):
            self._nlp = spacy.load(nlp)
        else:
            self._nlp = nlp
        super().__init__(
            input=input,
            encoding=encoding,
            decode_error=decode_error,
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            tokenizer=tokenizer,
            stop_words=stop_words,
            token_pattern=token_pattern,
            ngram_range=ngram_range,
            analyzer=analyzer,
            max_df=max_df,
            min_df=min_df,
            max_features=max_features,
            vocabulary=vocabulary,
            binary=binary,
            dtype=dtype,
        )

    def nounphrase_tokenize(self, text: str) -> list[str]:
        doc = self._nlp(text)
        tokens = []
        for chunk in doc.noun_chunks:
            if chunk[0].is_stop:
                chunk = chunk[1:]
            phrase = chunk.text
            phrase = re.sub(r"[^\w\s]", " ", phrase)
            phrase = " ".join(phrase.split()).strip()
            if phrase:
                tokens.append(phrase)
        return tokens

    def build_tokenizer(self):
        return self.nounphrase_tokenize
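
Since NounPhraseCountVectorizer is a CountVectorizer subclass, it can also be used on its own, outside of a topic model. A minimal sketch (assuming the en_core_web_sm pipeline is installed; the example sentences are made up):

from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

vectorizer = NounPhraseCountVectorizer("en_core_web_sm")
vectorizer.fit([
    "The new telescope captured a distant galaxy.",
    "Astronomers study distant galaxies with large telescopes.",
])
# Features are noun phrases such as 'distant galaxy' or 'large telescopes';
# the exact chunks depend on the SpaCy pipeline used.
print(vectorizer.get_feature_names_out())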

turftopic.vectorizers.spacy.LemmaCountVectorizer

Bases: CountVectorizer

Extracts lemmata from text using SpaCy.

Parameters:

Name Type Description Default
nlp Union[Language, str] A Spacy pipeline or its name. 'en_core_web_sm'
Source code in turftopic/vectorizers/spacy.py
class LemmaCountVectorizer(CountVectorizer):
    """Extracts lemmata from text using SpaCy.

    Parameters
    ----------
    nlp: spacy.Language or str, default "en_core_web_sm"
        A Spacy pipeline or its name.
    """

    def __init__(
        self,
        nlp: Union[Language, str] = "en_core_web_sm",
        *,
        input="content",
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        lowercase=True,
        preprocessor=None,
        tokenizer=None,
        stop_words=None,
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),
        analyzer="word",
        max_df=1.0,
        min_df=1,
        max_features=None,
        vocabulary=None,
        binary=False,
        dtype=np.int64,
    ):
        self.nlp = nlp
        if isinstance(nlp, str):
            self._nlp = spacy.load(nlp)
        else:
            self._nlp = nlp
        super().__init__(
            input=input,
            encoding=encoding,
            decode_error=decode_error,
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            tokenizer=tokenizer,
            stop_words=stop_words,
            token_pattern=token_pattern,
            ngram_range=ngram_range,
            analyzer=analyzer,
            max_df=max_df,
            min_df=min_df,
            max_features=max_features,
            vocabulary=vocabulary,
            binary=binary,
            dtype=dtype,
        )

    def lemma_tokenize(self, text: str) -> list[str]:
        doc = self._nlp(text)
        tokens = []
        for token in doc:
            if token.is_stop or not token.is_alpha:
                continue
            tokens.append(token.lemma_.strip())
        return tokens

    def build_tokenizer(self):
        return self.lemma_tokenize
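
The lemma_tokenize method can also be called directly to see what the vectorizer does to a text. A small sketch (assuming en_core_web_sm is installed; the exact lemmas depend on the pipeline):

from turftopic.vectorizers.spacy import LemmaCountVectorizer

vectorizer = LemmaCountVectorizer("en_core_web_sm")
# Stop words are dropped and the remaining tokens are lemmatized,
# yielding something like ['engineer', 'build', 'fast', 'system']
print(vectorizer.lemma_tokenize("The engineers were building faster systems"))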

turftopic.vectorizers.spacy.TokenCountVectorizer

Bases: CountVectorizer

Tokenizes text with SpaCy using its language-specific tokenization rules and stop-word lists

Parameters:

Name Type Description Default
language_code str Language code for the language you intend to use. 'en'
remove_stop_words bool Indicates whether stop words should be removed. True
remove_nonalpha bool Indicates whether only tokens containing alphabetical characters should be kept. True
Source code in turftopic/vectorizers/spacy.py
class TokenCountVectorizer(CountVectorizer):
    """Tokenizes text with SpaCy using its language-specific tokenization rules and stop-word lists

    Parameters
    ----------
    language_code: str, default "en"
        Language code for the language you intend to use.
    remove_stop_words: bool, default True
        Indicates whether stop words should be removed.
    remove_nonalpha: bool, default True
        Indicates whether only tokens containing alphabetical characters should be kept.
    """

    def __init__(
        self,
        language_code: str = "en",
        remove_stop_words: bool = True,
        remove_nonalpha: bool = True,
        *,
        input="content",
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        lowercase=True,
        preprocessor=None,
        tokenizer=None,
        stop_words=None,
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),
        analyzer="word",
        max_df=1.0,
        min_df=1,
        max_features=None,
        vocabulary=None,
        binary=False,
        dtype=np.int64,
    ):
        self.language_code = language_code
        self.remove_stop_words = remove_stop_words
        self.remove_nonalpha = remove_nonalpha
        super().__init__(
            input=input,
            encoding=encoding,
            decode_error=decode_error,
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            tokenizer=tokenizer,
            stop_words=stop_words,
            token_pattern=token_pattern,
            ngram_range=ngram_range,
            analyzer=analyzer,
            max_df=max_df,
            min_df=min_df,
            max_features=max_features,
            vocabulary=vocabulary,
            binary=binary,
            dtype=dtype,
        )

    def build_tokenizer(self):
        nlp = spacy.blank(self.language_code)

        def tokenize(text: str) -> list[str]:
            doc = nlp(text)
            result = []
            for tok in doc:
                if self.remove_stop_words and tok.is_stop:
                    continue
                if self.remove_nonalpha and not tok.is_alpha:
                    continue
                result.append(tok.orth_)
            return result

        return tokenize
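
Because TokenCountVectorizer only relies on spacy.blank, its tokenizer can be built for any supported language code without downloading a pipeline. A minimal sketch with German as an example language:

from turftopic.vectorizers.spacy import TokenCountVectorizer

vectorizer = TokenCountVectorizer("de")
tokenize = vectorizer.build_tokenizer()
# Language-specific stop words (e.g. 'Die', 'eine', 'im') are filtered out,
# leaving content tokens such as 'Katze', 'jagt', 'Maus', 'Garten'
print(tokenize("Die Katze jagt eine Maus im Garten"))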

turftopic.vectorizers.snowball.StemmingCountVectorizer

Bases: CountVectorizer

Extracts stemmed words from documents using Snowball.

Source code in turftopic/vectorizers/snowball.py
class StemmingCountVectorizer(CountVectorizer):
    """Extractes stemmed words from documents using Snowball."""

    def __init__(
        self,
        language="english",
        *,
        input="content",
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        lowercase=True,
        preprocessor=None,
        tokenizer=None,
        stop_words=None,
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),
        analyzer="word",
        max_df=1.0,
        min_df=1,
        max_features=None,
        vocabulary=None,
        binary=False,
        dtype=np.int64,
    ):
        self.language = language
        self._stemmer = snowballstemmer.stemmer(self.language)
        super().__init__(
            input=input,
            encoding=encoding,
            decode_error=decode_error,
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            tokenizer=tokenizer,
            stop_words=stop_words,
            token_pattern=token_pattern,
            ngram_range=ngram_range,
            analyzer=analyzer,
            max_df=max_df,
            min_df=min_df,
            max_features=max_features,
            vocabulary=vocabulary,
            binary=binary,
            dtype=dtype,
        )

    def build_tokenizer(self):
        super_tokenizer = super().build_tokenizer()

        def tokenizer(text):
            return self._stemmer.stemWords(super_tokenizer(text))

        return tokenizer
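
The stemming tokenizer can be inspected in isolation as well; a small sketch:

from turftopic.vectorizers.snowball import StemmingCountVectorizer

vectorizer = StemmingCountVectorizer(language="english")
tokenize = vectorizer.build_tokenizer()
# The Snowball stemmer collapses inflected forms onto a common stem,
# e.g. ['connect', 'connect', 'connect'] for the words below
print(tokenize("connection connected connecting"))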

turftopic.vectorizers.chinese.ChineseCountVectorizer

Bases: CountVectorizer

Chinese count vectorizer. Does word segmentation with Jieba and includes a Chinese stop word list. You have to specify stop_words="chinese" for the stop word list to take effect.

Source code in turftopic/vectorizers/chinese.py
class ChineseCountVectorizer(CountVectorizer):
    """Chinese count vectorizer. Does word segmentation with Jieba, and includes a chinese stop words list.
    You have to specify stop_words="chinese" for this to kick into effect.
    """

    def __init__(
        self,
        *,
        input="content",
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        lowercase=True,
        preprocessor=None,
        tokenizer=tokenize_zh,
        stop_words=None,
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),
        analyzer="word",
        max_df=1.0,
        min_df=1,
        max_features=None,
        vocabulary=None,
        binary=False,
        dtype=np.int64,
    ):
        if stop_words == "chinese":
            stop_words = chinese_stop_words
        super().__init__(
            input=input,
            encoding=encoding,
            decode_error=decode_error,
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            tokenizer=tokenizer,
            stop_words=stop_words,
            token_pattern=token_pattern,
            ngram_range=ngram_range,
            analyzer=analyzer,
            max_df=max_df,
            min_df=min_df,
            max_features=max_features,
            vocabulary=vocabulary,
            binary=binary,
            dtype=dtype,
        )
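
As with the other vectorizers, ChineseCountVectorizer can be used standalone; a minimal sketch with made-up sentences (assuming Jieba is installed):

from turftopic.vectorizers.chinese import ChineseCountVectorizer

vectorizer = ChineseCountVectorizer(stop_words="chinese")
vectorizer.fit([
    "北京是中国的首都。",
    "上海是中国最大的城市。",
])
# Jieba segments the sentences into words such as 北京, 中国, 首都, 上海, 城市,
# while function words like 是 and 的 are removed by the Chinese stop word list
print(vectorizer.get_feature_names_out())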