
Dynamic Topic Modeling

If you want to examine the evolution of topics over time, you will need a dynamic topic model.

You will need to install Plotly for plotting to work.

pip install plotly

You can currently use four different topic models for modeling topics over time:

  1. ClusteringTopicModel, where an overall model is fitted on the whole corpus, and term importances are then estimated over time slices.
  2. GMM, where, similarly to clustering models, term importances are re-estimated per time slice.
  3. KeyNMF, where an overall decomposition is done, and topic-term matrices are then recalculated with coordinate descent based on document-topic importances in each time slice.
  4. SemanticSignalSeparation, where a global model is fitted and local models are then inferred using linear regression from embeddings and document-topic signals in each time slice.

Usage

Dynamic topic models in Turftopic have a unified interface. To fit a dynamic topic model you will need a corpus that has been annotated with timestamps. The timestamps need to be Python datetime objects, but pandas Timestamp objects are also supported.
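For example, a minimal sketch of preparing a corpus and timestamps from a pandas DataFrame (the column names here are hypothetical):

import pandas as pd

# Hypothetical DataFrame holding documents and their publication dates
df = pd.DataFrame(
    {
        "text": [
            "Tokyo hosts the delayed Olympic Games.",
            "A new coronavirus vaccine gets approved.",
        ],
        "date": ["2021-07-23", "2020-12-11"],
    }
)

corpus = list(df["text"])
# pd.to_datetime() produces pandas Timestamp objects,
# which are accepted alongside plain datetime objects
timestamps = list(pd.to_datetime(df["date"]))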

Models that have dynamic modeling capabilities (KeyNMF, GMM, SemanticSignalSeparation and ClusteringTopicModel) have a fit_transform_dynamic() method that fits the model on the corpus over time.

from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = []  # your documents
timestamps: list[datetime] = []  # a datetime for each document

model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
# or alternatively:
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)

Interpret Topics over Time

model.plot_topics_over_time()
# or
topic_data.plot_topics_over_time()

Figure: Topics over time in a dynamic KeyNMF model.

model.print_topics_over_time()
# or
topic_data.print_topics_over_time()

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| --- | --- | --- | --- | --- | --- |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

API reference

All dynamic topic models have a temporal_components_ attribute, which contains the topic-term matrices for each time slice, along with a temporal_importance_ attribute, which contains the importance of each topic in each time slice.
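A rough sketch of inspecting these attributes after fitting (the exact array shapes stated in the comments are an assumption, not taken from the reference below):

# One topic-term matrix per time slice
# (assumed shape: n_time_slices x n_topics x n_vocab)
print(model.temporal_components_.shape)

# Importance of each topic in each time slice
# (assumed shape: n_time_slices x n_topics)
print(model.temporal_importance_.shape)

# Edges of the time bins the corpus was sliced into
print(model.time_bin_edges)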

turftopic.dynamic.DynamicTopicModel

Bases: ABC

Source code in turftopic/dynamic.py
class DynamicTopicModel(ABC):
    @staticmethod
    def bin_timestamps(
        timestamps: list[datetime], bins: Union[int, list[datetime]] = 10
    ) -> tuple[np.ndarray, list[datetime]]:
        """Bins timestamps based on given bins.

        Parameters
        ----------
        timestamps: list[datetime]
            List of timestamps for documents.
        bins: int or list[datetime], default 10
            Time bins to use.
            If the bins are an int (N), N equally sized bins are used.
            Otherwise they should be bin edges, including the last and first edge.
            Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper).

        Returns
        -------
        time_labels: ndarray of int
            Labels for time slice in each document.
        bin_edges: list[datetime]
            List of edges for time bins.
        """
        return bin_timestamps(timestamps, bins)

    @abstractmethod
    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ) -> np.ndarray:
        """Fits a dynamic topic model on the corpus and returns document-topic-importances.

        Parameters
        ----------
        raw_documents
            Documents to fit the model on.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: np.ndarray, default None
            Document embeddings produced by an embedding model.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic importance matrix.
        """
        pass

    def fit_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        """Fits a dynamic topic model on the corpus and returns document-topic-importances.

        Parameters
        ----------
        raw_documents
            Documents to fit the model on.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: np.ndarray, default None
            Document embeddings produced by an embedding model.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

            Note: The final edge is not included. You might want to add one day to
            the last bin edge if it equals the last timestamp.
        """
        self.fit_transform_dynamic(raw_documents, timestamps, embeddings, bins)
        return self

    def prepare_dynamic_topic_data(
        self,
        corpus: list[str],
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        """Produces topic inference data for a given corpus, that can be then used and reused.
        Exists to allow visualizations out of the box with topicwizard.

        Parameters
        ----------
        corpus: list of str
            Documents to infer topical content for.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: ndarray of shape (n_documents, n_dimensions)
            Embeddings of documents.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

            Note: The final edge is not included. You might want to add one day to
            the last bin edge if it equals the last timestamp.

        Returns
        -------
        TopicData
            Information about topical inference in a dictionary.
        """
        if embeddings is None:
            embeddings = self.encode_documents(corpus)
        if getattr(self, "temporal_components_", None) is not None:
            try:
                document_topic_matrix = self.transform(
                    corpus, embeddings=embeddings
                )
            except (AttributeError, NotFittedError):
                document_topic_matrix = self.fit_transform_dynamic(
                    corpus,
                    timestamps=timestamps,
                    embeddings=embeddings,
                    bins=bins,
                )
        else:
            document_topic_matrix = self.fit_transform_dynamic(
                corpus, timestamps=timestamps, embeddings=embeddings, bins=bins
            )
        dtm = self.vectorizer.transform(corpus)  # type: ignore
        try:
            classes = self.classes_
        except AttributeError:
            classes = list(range(self.components_.shape[0]))
        res = TopicData(
            corpus=corpus,
            document_term_matrix=dtm,
            vocab=self.get_vocab(),
            document_topic_matrix=document_topic_matrix,
            document_representation=embeddings,
            topic_term_matrix=self.components_,  # type: ignore
            transform=getattr(self, "transform", None),
            topic_names=self.topic_names,
            classes=classes,
            temporal_components=self.temporal_components_,
            temporal_importance=self.temporal_importance_,
            time_bin_edges=self.time_bin_edges,
        )
        return res

bin_timestamps(timestamps, bins=10) staticmethod

Bins timestamps based on given bins.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| timestamps | list[datetime] | List of timestamps for documents. | required |
| bins | Union[int, list[datetime]] | Time bins to use. If the bins are an int (N), N equally sized bins are used. Otherwise they should be bin edges, including the first and last edge. Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper). | 10 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| time_labels | ndarray of int | Labels for time slice in each document. |
| bin_edges | list[datetime] | List of edges for time bins. |

Source code in turftopic/dynamic.py
@staticmethod
def bin_timestamps(
    timestamps: list[datetime], bins: Union[int, list[datetime]] = 10
) -> tuple[np.ndarray, list[datetime]]:
    """Bins timestamps based on given bins.

    Parameters
    ----------
    timestamps: list[datetime]
        List of timestamps for documents.
    bins: int or list[datetime], default 10
        Time bins to use.
        If the bins are an int (N), N equally sized bins are used.
        Otherwise they should be bin edges, including the last and first edge.
        Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper).

    Returns
    -------
    time_labels: ndarray of int
        Labels for time slice in each document.
    bin_edges: list[datetime]
        List of edges for time bins.
    """
    return bin_timestamps(timestamps, bins)
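A minimal sketch of calling the static method directly, both with an integer number of bins and with explicit edges (the timestamps below are made up for illustration):

from datetime import datetime

from turftopic.dynamic import DynamicTopicModel

timestamps = [datetime(2020, 1, 5), datetime(2020, 6, 1), datetime(2021, 3, 10)]

# Cut the observed time range into two equally sized bins...
labels, edges = DynamicTopicModel.bin_timestamps(timestamps, bins=2)

# ...or pass explicit bin edges
# (lower edge inclusive, upper edge exclusive)
explicit_edges = [datetime(2020, 1, 1), datetime(2020, 7, 1), datetime(2021, 4, 1)]
labels, edges = DynamicTopicModel.bin_timestamps(timestamps, bins=explicit_edges)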

fit_dynamic(raw_documents, timestamps, embeddings=None, bins=10)

Fits a dynamic topic model on the corpus and returns document-topic-importances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| raw_documents | | Documents to fit the model on. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Document embeddings produced by an embedding model. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. Note: The final edge is not included. You might want to add one day to the last bin edge if it equals the last timestamp. | 10 |
Source code in turftopic/dynamic.py
def fit_dynamic(
    self,
    raw_documents,
    timestamps: list[datetime],
    embeddings: Optional[np.ndarray] = None,
    bins: Union[int, list[datetime]] = 10,
):
    """Fits a dynamic topic model on the corpus and returns document-topic-importances.

    Parameters
    ----------
    raw_documents
        Documents to fit the model on.
    timestamps: list[datetime]
        Timestamp for each document in `datetime` format.
    embeddings: np.ndarray, default None
        Document embeddings produced by an embedding model.
    bins: int or list[datetime], default 10
        Specifies how to bin timestamps in to time slices.
        When an `int`, the corpus will be divided into N equal time slices.
        When a list, it describes the edges of each time slice including the starting
        and final edges of the slices.

        Note: The final edge is not included. You might want to add one day to
        the last bin edge if it equals the last timestamp.
    """
    self.fit_transform_dynamic(raw_documents, timestamps, embeddings, bins)
    return self
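As the note above suggests, the final edge is exclusive, so when passing explicit edges you may want to push the last one past the last timestamp. A minimal sketch with yearly bins (an assumption; corpus and timestamps are the variables from the Usage section above):

from datetime import datetime, timedelta

from turftopic import KeyNMF

# Yearly bin edges for 2018-2021; the last edge is nudged one day past
# 2022-01-01 so documents dated exactly on that edge are not dropped
edges = [datetime(year, 1, 1) for year in range(2018, 2023)]
edges[-1] = edges[-1] + timedelta(days=1)

model = KeyNMF(5, top_n=5).fit_dynamic(corpus, timestamps=timestamps, bins=edges)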

fit_transform_dynamic(raw_documents, timestamps, embeddings=None, bins=10) abstractmethod

Fits a dynamic topic model on the corpus and returns document-topic-importances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| raw_documents | | Documents to fit the model on. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Document embeddings produced by an embedding model. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. | 10 |

Returns:

| Type | Description |
| --- | --- |
| ndarray of shape (n_documents, n_topics) | Document-topic importance matrix. |

Source code in turftopic/dynamic.py
@abstractmethod
def fit_transform_dynamic(
    self,
    raw_documents,
    timestamps: list[datetime],
    embeddings: Optional[np.ndarray] = None,
    bins: Union[int, list[datetime]] = 10,
) -> np.ndarray:
    """Fits a dynamic topic model on the corpus and returns document-topic-importances.

    Parameters
    ----------
    raw_documents
        Documents to fit the model on.
    timestamps: list[datetime]
        Timestamp for each document in `datetime` format.
    embeddings: np.ndarray, default None
        Document embeddings produced by an embedding model.
    bins: int or list[datetime], default 10
        Specifies how to bin timestamps in to time slices.
        When an `int`, the corpus will be divided into N equal time slices.
        When a list, it describes the edges of each time slice including the starting
        and final edges of the slices.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic importance matrix.
    """
    pass
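If you want to reuse the same document embeddings across several dynamic fits (for instance, to compare different binnings), you can pass them in explicitly. A minimal sketch, assuming the embeddings are produced with encode_documents as in prepare_dynamic_topic_data below, and that corpus and timestamps are the variables from the Usage section:

from turftopic import KeyNMF

model = KeyNMF(5, top_n=5)

# Compute embeddings once so the encoder does not have to re-run
embeddings = model.encode_documents(corpus)

document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, embeddings=embeddings, bins=10
)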

prepare_dynamic_topic_data(corpus, timestamps, embeddings=None, bins=10)

Produces topic inference data for a given corpus that can then be used and reused. Exists to allow out-of-the-box visualizations with topicwizard.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corpus | list[str] | Documents to infer topical content for. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Embeddings of documents. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. Note: The final edge is not included. You might want to add one day to the last bin edge if it equals the last timestamp. | 10 |

Returns:

| Type | Description |
| --- | --- |
| TopicData | Information about topical inference in a dictionary. |

Source code in turftopic/dynamic.py
def prepare_dynamic_topic_data(
    self,
    corpus: list[str],
    timestamps: list[datetime],
    embeddings: Optional[np.ndarray] = None,
    bins: Union[int, list[datetime]] = 10,
):
    """Produces topic inference data for a given corpus, that can be then used and reused.
    Exists to allow visualizations out of the box with topicwizard.

    Parameters
    ----------
    corpus: list of str
        Documents to infer topical content for.
    timestamps: list[datetime]
        Timestamp for each document in `datetime` format.
    embeddings: ndarray of shape (n_documents, n_dimensions)
        Embeddings of documents.
    bins: int or list[datetime], default 10
        Specifies how to bin timestamps in to time slices.
        When an `int`, the corpus will be divided into N equal time slices.
        When a list, it describes the edges of each time slice including the starting
        and final edges of the slices.

        Note: The final edge is not included. You might want to add one day to
        the last bin edge if it equals the last timestamp.

    Returns
    -------
    TopicData
        Information about topical inference in a dictionary.
    """
    if embeddings is None:
        embeddings = self.encode_documents(corpus)
    if getattr(self, "temporal_components_", None) is not None:
        try:
            document_topic_matrix = self.transform(
                corpus, embeddings=embeddings
            )
        except (AttributeError, NotFittedError):
            document_topic_matrix = self.fit_transform_dynamic(
                corpus,
                timestamps=timestamps,
                embeddings=embeddings,
                bins=bins,
            )
    else:
        document_topic_matrix = self.fit_transform_dynamic(
            corpus, timestamps=timestamps, embeddings=embeddings, bins=bins
        )
    dtm = self.vectorizer.transform(corpus)  # type: ignore
    try:
        classes = self.classes_
    except AttributeError:
        classes = list(range(self.components_.shape[0]))
    res = TopicData(
        corpus=corpus,
        document_term_matrix=dtm,
        vocab=self.get_vocab(),
        document_topic_matrix=document_topic_matrix,
        document_representation=embeddings,
        topic_term_matrix=self.components_,  # type: ignore
        transform=getattr(self, "transform", None),
        topic_names=self.topic_names,
        classes=classes,
        temporal_components=self.temporal_components_,
        temporal_importance=self.temporal_importance_,
        time_bin_edges=self.time_bin_edges,
    )
    return res
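Putting it together, a TopicData object prepared this way can be reused for multiple views of the same fit without re-running inference (as shown in the Usage section above):

topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)

# Both calls read from the precomputed inference data
topic_data.plot_topics_over_time()
topic_data.print_topics_over_time()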