Dynamic Topic Modeling

If you want to examine the evolution of topics over time, you will need a dynamic topic model.

Note that regular static models can also be used to study the evolution of topics and information dynamics, but they can't capture changes in the topics themselves.

Models

In Turftopic you can currently use four different topic models for modeling topics over time:

  1. ClusteringTopicModel: an overall model is fitted on the whole corpus, and term importances are then estimated for each time slice.
  2. GMM: as with clustering models, term importances are re-estimated per time slice.
  3. KeyNMF: an overall decomposition is computed, then the topic-term matrices are recalculated with coordinate descent from the document-topic importances in each time slice.
  4. SemanticSignalSeparation: a global model is fitted, then local models are inferred by linear regression from the embeddings and document-topic signals in each time slice.
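
The idea shared by the first two models (fit once globally, then re-estimate term importances per time slice) can be illustrated with a deliberately simplified sketch. Everything in it is an assumption for illustration: the toy documents, the label lists, and the use of plain relative term frequency as the importance score. Turftopic's actual weighting schemes are not shown on this page and are more involved.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: tokenised documents, a topic label from a global
# model, and a time-slice label obtained by binning the timestamps.
docs = [
    ["covid", "vaccine"],
    ["covid", "lockdown"],
    ["tennis", "final"],
    ["covid", "booster"],
]
topic_labels = [0, 0, 1, 0]
time_labels = [0, 0, 0, 1]

# Count terms separately for every (time slice, topic) pair.
counts = defaultdict(Counter)
for tokens, topic, time_slice in zip(docs, topic_labels, time_labels):
    counts[(time_slice, topic)].update(tokens)

# Normalise the counts to relative frequencies within each pair.
importances = {
    key: {term: n / sum(counter.values()) for term, n in counter.items()}
    for key, counter in counts.items()
}
```

The topic assignment stays fixed across slices; only the term importances change, which is what lets the description of a topic drift over time.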

Usage

Dynamic topic models in Turftopic have a unified interface. To fit a dynamic topic model you will need a corpus that has been annotated with timestamps. The timestamps need to be Python datetime objects; pandas Timestamp objects are also supported.
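
If your dates are stored as strings, they have to be converted first. A minimal sketch using only the standard library, assuming the raw values are ISO 8601 strings (the `raw_dates` list is made up for illustration):

```python
from datetime import datetime

# Hypothetical raw date strings, e.g. read from a CSV column.
raw_dates = ["2021-04-10", "2021-07-23T20:00:00", "2022-03-16"]

# datetime.fromisoformat parses ISO 8601 date and datetime strings.
timestamps = [datetime.fromisoformat(raw) for raw in raw_dates]

# Every element is now a datetime object.
assert all(isinstance(ts, datetime) for ts in timestamps)
```

For other layouts, `datetime.strptime` with an explicit format string is the usual alternative.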

Models that have dynamic modeling capabilities (KeyNMF, GMM, SemanticSignalSeparation and ClusteringTopicModel) have a fit_transform_dynamic() method that fits the model on the corpus over time and returns the document-topic matrix.

from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = []
timestamps: list[datetime] = []

model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
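
Instead of an integer, `bins` also accepts a list of explicit bin edges. The sketch below builds yearly edges with the standard library; the timestamps are invented for illustration. Because bins are exclusive at the upper end, the final edge is placed one day after the latest timestamp.

```python
from datetime import datetime, timedelta

# Hypothetical document timestamps spanning three calendar years.
timestamps = [
    datetime(2020, 3, 1),
    datetime(2021, 6, 15),
    datetime(2022, 12, 31),
]

# One edge at the start of every year that contains a document.
first_year = min(timestamps).year
last_year = max(timestamps).year
bin_edges = [datetime(year, 1, 1) for year in range(first_year, last_year + 1)]

# The upper edge is exclusive, so it must lie strictly after the
# latest timestamp; one day later is enough here.
bin_edges.append(max(timestamps) + timedelta(days=1))

# The list can then be passed as bins=bin_edges to fit_transform_dynamic().
```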

You can use the print_topics_over_time() method for producing a table of the topics over the generated time slices.

This example uses CNN news data.

model.print_topics_over_time()

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
|---|---|---|---|---|---|
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

You can also display the topics over time on an interactive HTML figure. The most important words for topics get revealed by hovering over them.

You will need to install Plotly for this to work.

pip install plotly
model.plot_topics_over_time()
Topics over time in a Dynamic KeyNMF model.

API reference

All dynamic topic models have a temporal_components_ attribute, which contains the topic-term matrices for each time slice, along with a temporal_importance_ attribute, which contains the importance of each topic in each time slice.
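
To make the layout of these two attributes concrete, here is a small sketch with invented values. Plain nested lists stand in for the NumPy arrays, and the names `vocab`, `top_terms` and `summary` are hypothetical. The indexing order (time slice, then topic, then term) follows how the source code below slices `temporal_components_`.

```python
# Hypothetical toy values; nested lists stand in for the NumPy arrays.
vocab = ["covid", "vaccine", "tennis", "wimbledon"]

# temporal_components_[slice][topic][term]: term importances per time slice.
temporal_components = [
    [[0.9, 0.4, 0.0, 0.0], [0.0, 0.1, 0.8, 0.3]],  # time slice 0
    [[0.5, 0.7, 0.0, 0.1], [0.0, 0.0, 0.6, 0.9]],  # time slice 1
]
# temporal_importance_[slice][topic]: weight of each topic per time slice.
temporal_importance = [[0.7, 0.3], [0.4, 0.6]]


def top_terms(term_weights, k=2):
    """Return the k highest-weighted vocabulary terms, best first."""
    ranked = sorted(range(len(vocab)), key=lambda i: -term_weights[i])
    return [vocab[i] for i in ranked[:k]]


# For every time slice, pair each topic's top terms with its importance.
summary = [
    [(top_terms(topic), weight) for topic, weight in zip(components, importance)]
    for components, importance in zip(temporal_components, temporal_importance)
]
```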

turftopic.dynamic.DynamicTopicModel

Bases: ABC

Source code in turftopic/dynamic.py
class DynamicTopicModel(ABC):
    @staticmethod
    def bin_timestamps(
        timestamps: list[datetime], bins: Union[int, list[datetime]] = 10
    ) -> tuple[np.ndarray, list[datetime]]:
        """Bins timestamps based on given bins.

        Parameters
        ----------
        timestamps: list[datetime]
            List of timestamps for documents.
        bins: int or list[datetime], default 10
            Time bins to use.
            If the bins are an int (N), N equally sized bins are used.
            Otherwise they should be bin edges, including the last and first edge.
            Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper).

        Returns
        -------
        time_labels: ndarray of int
            Labels for time slice in each document.
        bin_edges: list[datetime]
            List of edges for time bins.
        """
        return bin_timestamps(timestamps, bins)

    @abstractmethod
    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ) -> np.ndarray:
        """Fits a dynamic topic model on the corpus and returns document-topic-importances.

        Parameters
        ----------
        raw_documents
            Documents to fit the model on.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: np.ndarray, default None
            Document embeddings produced by an embedding model.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic importance matrix.
        """
        pass

    def fit_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        """Fits a dynamic topic model on the corpus and returns the fitted model.

        Parameters
        ----------
        raw_documents
            Documents to fit the model on.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: np.ndarray, default None
            Document embeddings produced by an embedding model.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

            Note: The final edge is not included. You might want to add one day to
            the last bin edge if it equals the last timestamp.
        """
        self.fit_transform_dynamic(raw_documents, timestamps, embeddings, bins)
        return self

    def get_time_slices(self) -> list[tuple[datetime, datetime]]:
        """Returns starting and ending datetime of
        each timeslice in the model."""
        bins = self.time_bin_edges
        res = []
        for i_bin, slice_end in enumerate(bins[1:]):
            res.append((bins[i_bin], slice_end))
        return res

    def get_topics_over_time(
        self, top_k: int = 10
    ) -> list[list[tuple[Any, list[tuple[str, float]]]]]:
        """Returns high-level topic representations in form of the top K words
        in each topic.

        Parameters
        ----------
        top_k: int, default 10
            Number of top words to return for each topic.

        Returns
        -------
        list[list[tuple]]
            List of topics over each time slice in the dynamic model.
            Each time slice is a list of topics.
            Each topic is a tuple of topic ID and the top k words.
            Top k words are a list of (word, word_importance) pairs.
        """
        n_topics = self.temporal_components_.shape[1]
        try:
            classes = self.classes_
        except AttributeError:
            classes = list(range(n_topics))
        res = []
        for components in self.temporal_components_:
            highest = np.argpartition(-components, top_k)[:, :top_k]
            vocab = self.get_vocab()
            top = []
            score = []
            for component, high in zip(components, highest):
                importance = component[high]
                high = high[np.argsort(-importance)]
                score.append(component[high])
                top.append(vocab[high])
            topics = []
            for topic, words, scores in zip(classes, top, score):
                topic_data = (topic, list(zip(words, scores)))
                topics.append(topic_data)
            res.append(topics)
        return res

    def _topics_over_time(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        date_format: str = "%Y %m %d",
    ) -> list[list[str]]:
        temporal_components = self.temporal_components_
        slices = self.get_time_slices()
        slice_names = []
        for start_dt, end_dt in slices:
            start_str = start_dt.strftime(date_format)
            end_str = end_dt.strftime(date_format)
            slice_names.append(f"{start_str} - {end_str}")
        n_topics = self.temporal_components_.shape[1]
        try:
            topic_names = self.topic_names
        except AttributeError:
            topic_names = [f"Topic {i}" for i in range(n_topics)]
        columns = []
        rows = []
        columns.append("Time Slice")
        for topic in topic_names:
            columns.append(topic)
        for slice_name, components in zip(slice_names, temporal_components):
            fields = []
            fields.append(slice_name)
            highest = np.argpartition(-components, top_k)[:, :top_k]
            vocab = self.get_vocab()
            for component, high in zip(components, highest):
                if np.all(component == 0) or np.all(np.isnan(component)):
                    fields.append("Topic not present.")
                    continue
                importance = component[high]
                high = high[np.argsort(-importance)]
                # Filter on the re-ordered scores so the mask lines up with `high`.
                high = high[component[high] != 0]
                scores = component[high]
                words = vocab[high]
                if show_scores:
                    concat_words = ", ".join(
                        [
                            f"{word}({importance:.2f})"
                            for word, importance in zip(words, scores)
                        ]
                    )
                else:
                    concat_words = ", ".join([word for word in words])
                fields.append(concat_words)
            rows.append(fields)
        return [columns, *rows]

    def print_topics_over_time(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        date_format: str = "%Y %m %d",
    ):
        """Pretty prints topics in the model in a table.

        Parameters
        ----------
        top_k: int, default 5
            Number of top words to return for each topic.
        show_scores: bool, default False
            Indicates whether to show importance scores for each word.
        date_format: str, default "%Y %m %d"
            `strftime` format used for the time slice boundaries.
        """
        columns, *rows = self._topics_over_time(
            top_k, show_scores, date_format
        )
        table = Table(show_lines=True)
        for column in columns:
            table.add_column(column)
        for row in rows:
            table.add_row(*row)
        console = Console()
        console.print(table)

    def export_topics_over_time(
        self,
        top_k: int = 5,
        show_scores: bool = False,
        date_format: str = "%Y %m %d",
        format="csv",
    ) -> str:
        """Exports topics over time as a table in the given format.

        Parameters
        ----------
        top_k: int, default 5
            Number of top words to return for each topic.
        show_scores: bool, default False
            Indicates whether to show importance scores for each word.
        date_format: str, default "%Y %m %d"
            `strftime` format used for the time slice boundaries.
        format: 'csv', 'latex' or 'markdown', default 'csv'
            Specifies which format should be used.

        Returns
        -------
        str
            The table in the requested format.
        """
        table = self._topics_over_time(top_k, show_scores, date_format)
        return export_table(table, format=format)

    def plot_topics_over_time(
        self,
        top_k: int = 6,
        color_discrete_sequence: Optional[Iterable[str]] = None,
        color_discrete_map: Optional[dict[str, str]] = None,
    ):
        """Displays topics over time in the fitted dynamic model on a dynamic HTML figure.

        > You will need to `pip install plotly` to use this method.

        Parameters
        ----------
        top_k: int, default 6
            Number of top words per topic to display on the figure.
        color_discrete_sequence: Iterable[str], default None
            Color palette to use in the plot.
            Example:

            ```python
            import plotly.express as px
            model.plot_topics_over_time(color_discrete_sequence=px.colors.qualitative.Light24)
            ```

        color_discrete_map: dict[str, str], default None
            Topic names mapped to the colors that should
            be associated with them.

        Returns
        -------
        go.Figure
            Plotly graph objects Figure, that can be displayed or exported as
            HTML or static image.
        """
        try:
            import plotly.express as px
            import plotly.graph_objects as go
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        if color_discrete_sequence is not None:
            topic_colors = itertools.cycle(color_discrete_sequence)
        elif color_discrete_map is not None:
            topic_colors = [
                color_discrete_map[topic_name]
                for topic_name in self.topic_names
            ]
        else:
            topic_colors = px.colors.qualitative.Dark24
        fig = go.Figure()
        vocab = self.get_vocab()
        n_topics = self.temporal_components_.shape[1]
        try:
            topic_names = self.topic_names
        except AttributeError:
            topic_names = [f"Topic {i}" for i in range(n_topics)]
        for trace_color, (i_topic, topic_imp_t) in zip(
            topic_colors, enumerate(self.temporal_importance_.T)
        ):
            component_over_time = self.temporal_components_[:, i_topic, :]
            name_over_time = []
            for component in component_over_time:
                high = np.argpartition(-component, top_k)[:top_k]
                values = component[high]
                if np.all(values == 0) or np.all(np.isnan(values)):
                    name_over_time.append("<not present>")
                    continue
                high = high[np.argsort(-values)]
                name_over_time.append(", ".join(vocab[high]))
            times = self.time_bin_edges[:-1]
            fig.add_trace(
                go.Scatter(
                    x=times,
                    y=topic_imp_t,
                    mode="markers+lines",
                    text=name_over_time,
                    name=topic_names[i_topic],
                    hovertemplate="<b>%{text}</b>",
                    marker=dict(
                        line=dict(width=2, color="black"),
                        size=14,
                        color=trace_color,
                    ),
                    line=dict(width=3),
                )
            )
        fig.update_layout(
            template="plotly_white",
            hoverlabel=dict(font_size=16, bgcolor="white"),
            hovermode="x",
        )
        fig.update_xaxes(title="Time Slice Start")
        fig.update_yaxes(title="Topic Importance")
        return fig

bin_timestamps(timestamps, bins=10) staticmethod

Bins timestamps based on given bins.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| timestamps | list[datetime] | List of timestamps for documents. | required |
| bins | Union[int, list[datetime]] | Time bins to use. If the bins are an int (N), N equally sized bins are used. Otherwise they should be bin edges, including the last and first edge. Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper). | 10 |

Returns:

| Name | Type | Description |
|---|---|---|
| time_labels | ndarray of int | Labels for time slice in each document. |
| bin_edges | list[datetime] | List of edges for time bins. |
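
The half-open binning rule (lower <= timestamp < upper) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Turftopic's implementation: the helper name `bin_into_slices` is hypothetical, and padding the final edge by one second is just one way to keep the latest timestamp inside the last bin.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def bin_into_slices(timestamps, n_bins):
    """Assign each timestamp to one of n_bins equally wide, half-open bins."""
    start = min(timestamps)
    # Push the final edge slightly past the latest timestamp so that it
    # still satisfies lower <= timestamp < upper for the last bin.
    end = max(timestamps) + timedelta(seconds=1)
    width = (end - start) / n_bins
    edges = [start + i * width for i in range(n_bins + 1)]
    # bisect_right counts the edges <= timestamp; subtracting one gives
    # the index of the bin whose lower edge was passed last.
    labels = [bisect_right(edges, ts) - 1 for ts in timestamps]
    return labels, edges


stamps = [datetime(2020, 1, 1), datetime(2020, 6, 1), datetime(2020, 12, 31)]
labels, edges = bin_into_slices(stamps, n_bins=2)
```

With N bins there are always N + 1 edges, which is also the layout expected when passing explicit edges.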

export_topics_over_time(top_k=5, show_scores=False, date_format='%Y %m %d', format='csv')

Exports the topics over time as a table in the given format and returns it as a string.

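Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of top words to return for each topic. | 5 |
| show_scores | bool | Indicates whether to show importance scores for each word. | False |
| date_format | str | strftime format used for the time slice boundaries. | '%Y %m %d' |
| format | | Specifies which format should be used. 'csv', 'latex' and 'markdown' are supported. | 'csv' |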
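
To give a feel for what a CSV export of such a table involves, here is a sketch using the standard `csv` module. The `table` contents and the helper `table_to_csv` are hypothetical; the `[columns, *rows]` layout mirrors what `_topics_over_time()` returns in the source above, but how Turftopic's own `export_table` serialises it is not shown on this page.

```python
import csv
import io

# Hypothetical table in the [columns, *rows] layout.
table = [
    ["Time Slice", "Topic 0", "Topic 1"],
    ["2020 05 07 - 2021 04 10", "covid, vaccine", "olympic, japan"],
    ["2021 04 10 - 2022 03 16", "covid, pandemic", "olympic, athletes"],
]


def table_to_csv(table):
    """Serialise a list of rows to CSV, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


csv_text = table_to_csv(table)
```

Quoting matters here because every topic cell is itself a comma-separated word list.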

fit_dynamic(raw_documents, timestamps, embeddings=None, bins=10)

Fits a dynamic topic model on the corpus and returns the fitted model itself.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_documents | | Documents to fit the model on. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Document embeddings produced by an embedding model. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. | 10 |

Note: The final edge is not included. You might want to add one day to the last bin edge if it equals the last timestamp.

fit_transform_dynamic(raw_documents, timestamps, embeddings=None, bins=10) abstractmethod

Fits a dynamic topic model on the corpus and returns document-topic-importances.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_documents | | Documents to fit the model on. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Document embeddings produced by an embedding model. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. | 10 |

Returns:

| Type | Description |
|---|---|
| ndarray of shape (n_documents, n_topics) | Document-topic importance matrix. |

get_time_slices()

Returns starting and ending datetime of each timeslice in the model.
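
The pairing of consecutive bin edges into slices can be sketched as below. The edge values are invented and stand in for a fitted model's `time_bin_edges`.

```python
from datetime import datetime

# Hypothetical bin edges, standing in for a fitted model's time_bin_edges.
time_bin_edges = [
    datetime(2020, 1, 1),
    datetime(2021, 1, 1),
    datetime(2022, 1, 1),
]

# Pair every edge with its successor: N + 1 edges give N slices.
time_slices = list(zip(time_bin_edges[:-1], time_bin_edges[1:]))
```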

get_topics_over_time(top_k=10)

Returns high-level topic representations in form of the top K words in each topic.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of top words to return for each topic. | 10 |

Returns:

| Type | Description |
|---|---|
| list[list[tuple]] | List of topics over each time slice in the dynamic model. Each time slice is a list of topics. Each topic is a tuple of topic ID and the top k words. Top k words are a list of (word, word_importance) pairs. |
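
The nested return value is easiest to understand by walking through it. The sketch below uses a hand-written stand-in for the result (two time slices, two topics, two words each; all values invented) and flattens it into one record per word.

```python
# Hypothetical return value: 2 time slices, 2 topics, top 2 words each.
topics_over_time = [
    [
        (0, [("covid", 0.9), ("vaccine", 0.4)]),
        (1, [("tennis", 0.8), ("wimbledon", 0.3)]),
    ],
    [
        (0, [("vaccine", 0.7), ("covid", 0.5)]),
        (1, [("wimbledon", 0.9), ("tennis", 0.6)]),
    ],
]

# Flatten into (slice index, topic ID, word, importance) records.
records = [
    (i_slice, topic_id, word, importance)
    for i_slice, topics in enumerate(topics_over_time)
    for topic_id, words in topics
    for word, importance in words
]
```

A flat list like this is convenient for loading into a data frame or writing to a file.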

plot_topics_over_time(top_k=6, color_discrete_sequence=None, color_discrete_map=None)

Displays topics over time in the fitted dynamic model on a dynamic HTML figure.

You will need to pip install plotly to use this method.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of top words per topic to display on the figure. | 6 |
| color_discrete_sequence | Optional[Iterable[str]] | Color palette to use in the plot (see the example below the table). | None |
| color_discrete_map | Optional[dict[str, str]] | Topic names mapped to the colors that should be associated with them. | None |

Example for color_discrete_sequence:

import plotly.express as px
model.plot_topics_over_time(color_discrete_sequence=px.colors.qualitative.Light24)

Returns:

| Type | Description |
|---|---|
| Figure | Plotly graph objects Figure that can be displayed or exported as HTML or a static image. |

print_topics_over_time(top_k=5, show_scores=False, date_format='%Y %m %d')

Pretty prints topics in the model in a table.

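Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of top words to return for each topic. | 5 |
| show_scores | bool | Indicates whether to show importance scores for each word. | False |
| date_format | str | strftime format used for the time slice boundaries. | '%Y %m %d' |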
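
The `date_format` argument is a regular `strftime` pattern applied to both ends of each time slice. A small sketch of the resulting labels, using a hypothetical helper `slice_label` and invented dates:

```python
from datetime import datetime


def slice_label(start, end, date_format="%Y %m %d"):
    """Format a time slice as 'start - end' using a strftime pattern."""
    return f"{start.strftime(date_format)} - {end.strftime(date_format)}"


start, end = datetime(2020, 5, 7), datetime(2021, 4, 10)
default_label = slice_label(start, end)
iso_label = slice_label(start, end, date_format="%Y-%m-%d")
```

The default pattern produces labels in the style of the example table in the usage section; other patterns only change how the slice boundaries are rendered.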