
Dynamic Topic Modeling

If you want to examine the evolution of topics over time, you will need a dynamic topic model.

You will need to install Plotly for plotting to work.

pip install plotly

You can currently use four different topic models for modeling topics over time:

  1. ClusteringTopicModel, where an overall model is fitted on the whole corpus, and term importances are then estimated over time slices.
  2. GMM, where, similarly to clustering models, term importances are re-estimated per time slice.
  3. KeyNMF, where an overall decomposition is done, and topic-term matrices are then recalculated with coordinate descent based on document-topic importances in each time slice.
  4. SemanticSignalSeparation, where a global model is fitted and local models are then inferred using linear regression from embeddings and document-topic signals in each time slice.

Usage

Dynamic topic models in Turftopic have a unified interface. To fit a dynamic topic model you will need a corpus that has been annotated with timestamps. The timestamps need to be Python datetime objects, but pandas Timestamp objects are also supported.
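For example, a minimal sketch of preparing a corpus and timestamps from a pandas DataFrame (the column names here are hypothetical):

import pandas as pd

# Hypothetical DataFrame holding documents and their publication dates
df = pd.DataFrame(
    {
        "text": [
            "Tokyo hosts the delayed Olympic Games.",
            "A new coronavirus vaccine gets approved.",
        ],
        "date": ["2021-07-23", "2020-12-11"],
    }
)

corpus = list(df["text"])
# pd.to_datetime() produces pandas Timestamp objects,
# which are accepted alongside plain datetime objects
timestamps = list(pd.to_datetime(df["date"]))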

Models that have dynamic modeling capabilities (KeyNMF, GMM, SemanticSignalSeparation and ClusteringTopicModel) have a fit_transform_dynamic() method that fits the model on the corpus over time.

from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = []  # your documents
timestamps: list[datetime] = []  # a datetime for each document

model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
# or alternatively:
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)

Interpret Topics over Time

model.plot_topics_over_time()
# or
topic_data.plot_topics_over_time()

Figure: Topics over time in a dynamic KeyNMF model.

model.print_topics_over_time()
# or
topic_data.print_topics_over_time()

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| --- | --- | --- | --- | --- | --- |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

API reference

All dynamic topic models have a temporal_components_ attribute, which contains the topic-term matrices for each time slice, along with a temporal_importance_ attribute, which contains the importance of each topic in each time slice.
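A rough sketch of inspecting these attributes after fitting (the exact array shapes stated in the comments are an assumption, not taken from the reference below):

# One topic-term matrix per time slice
# (assumed shape: n_time_slices x n_topics x n_vocab)
print(model.temporal_components_.shape)

# Importance of each topic in each time slice
# (assumed shape: n_time_slices x n_topics)
print(model.temporal_importance_.shape)

# Edges of the time bins the corpus was sliced into
print(model.time_bin_edges)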

turftopic.dynamic.DynamicTopicModel

Bases: ABC

Source code in turftopic/dynamic.py
class DynamicTopicModel(ABC):
    @staticmethod
    def bin_timestamps(
        timestamps: list[datetime], bins: Union[int, list[datetime]] = 10
    ) -> tuple[np.ndarray, list[datetime]]:
        """Bins timestamps based on given bins.

        Parameters
        ----------
        timestamps: list[datetime]
            List of timestamps for documents.
        bins: int or list[datetime], default 10
            Time bins to use.
            If the bins are an int (N), N equally sized bins are used.
            Otherwise they should be bin edges, including the last and first edge.
            Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper).

        Returns
        -------
        time_labels: ndarray of int
            Labels for time slice in each document.
        bin_edges: list[datetime]
            List of edges for time bins.
        """
        return bin_timestamps(timestamps, bins)

    @abstractmethod
    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ) -> np.ndarray:
        """Fits a dynamic topic model on the corpus and returns document-topic-importances.

        Parameters
        ----------
        raw_documents
            Documents to fit the model on.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: np.ndarray, default None
            Document embeddings produced by an embedding model.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

        Returns
        -------
        ndarray of shape (n_documents, n_topics)
            Document-topic importance matrix.
        """
        pass

    def fit_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        """Fits a dynamic topic model on the corpus and returns document-topic-importances.

        Parameters
        ----------
        raw_documents
            Documents to fit the model on.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: np.ndarray, default None
            Document embeddings produced by an embedding model.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

            Note: The final edge is not included. You might want to add one day to
            the last bin edge if it equals the last timestamp.
        """
        self.fit_transform_dynamic(raw_documents, timestamps, embeddings, bins)
        return self

    def prepare_dynamic_topic_data(
        self,
        corpus: list[str],
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ):
        """Produces topic inference data for a given corpus, that can be then used and reused.
        Exists to allow visualizations out of the box with topicwizard.

        Parameters
        ----------
        corpus: list of str
            Documents to infer topical content for.
        timestamps: list[datetime]
            Timestamp for each document in `datetime` format.
        embeddings: ndarray of shape (n_documents, n_dimensions)
            Embeddings of documents.
        bins: int or list[datetime], default 10
            Specifies how to bin timestamps in to time slices.
            When an `int`, the corpus will be divided into N equal time slices.
            When a list, it describes the edges of each time slice including the starting
            and final edges of the slices.

            Note: The final edge is not included. You might want to add one day to
            the last bin edge if it equals the last timestamp.

        Returns
        -------
        TopicData
            Information about topical inference in a dictionary.
        """
        if embeddings is None:
            embeddings = self.encode_documents(corpus)
        if getattr(self, "temporal_components_", None) is not None:
            try:
                document_topic_matrix = self.transform(
                    corpus, embeddings=embeddings
                )
            except (AttributeError, NotFittedError):
                document_topic_matrix = self.fit_transform_dynamic(
                    corpus,
                    timestamps=timestamps,
                    embeddings=embeddings,
                    bins=bins,
                )
        else:
            document_topic_matrix = self.fit_transform_dynamic(
                corpus, timestamps=timestamps, embeddings=embeddings, bins=bins
            )
        dtm = self.vectorizer.transform(corpus)  # type: ignore
        try:
            classes = self.classes_
        except AttributeError:
            classes = list(range(self.components_.shape[0]))
        res = TopicData(
            corpus=corpus,
            document_term_matrix=dtm,
            vocab=self.get_vocab(),
            document_topic_matrix=document_topic_matrix,
            document_representation=embeddings,
            topic_term_matrix=self.components_,  # type: ignore
            transform=getattr(self, "transform", None),
            topic_names=self.topic_names,
            classes=classes,
            temporal_components=self.temporal_components_,
            temporal_importance=self.temporal_importance_,
            time_bin_edges=self.time_bin_edges,
        )
        return res

bin_timestamps(timestamps, bins=10) staticmethod

Bins timestamps based on given bins.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| timestamps | list[datetime] | List of timestamps for documents. | required |
| bins | Union[int, list[datetime]] | Time bins to use. If the bins are an int (N), N equally sized bins are used. Otherwise they should be bin edges, including the first and last edge. Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper). | 10 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| time_labels | ndarray of int | Labels for time slice in each document. |
| bin_edges | list[datetime] | List of edges for time bins. |

Source code in turftopic/dynamic.py
@staticmethod
def bin_timestamps(
    timestamps: list[datetime], bins: Union[int, list[datetime]] = 10
) -> tuple[np.ndarray, list[datetime]]:
    """Bins timestamps based on given bins.

    Parameters
    ----------
    timestamps: list[datetime]
        List of timestamps for documents.
    bins: int or list[datetime], default 10
        Time bins to use.
        If the bins are an int (N), N equally sized bins are used.
        Otherwise they should be bin edges, including the last and first edge.
        Bins are inclusive at the lower end and exclusive at the upper (lower <= timestamp < upper).

    Returns
    -------
    time_labels: ndarray of int
        Labels for time slice in each document.
    bin_edges: list[datetime]
        List of edges for time bins.
    """
    return bin_timestamps(timestamps, bins)
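A minimal sketch of calling the static method directly, both with an integer number of bins and with explicit edges (the timestamps below are made up for illustration):

from datetime import datetime

from turftopic.dynamic import DynamicTopicModel

timestamps = [datetime(2020, 1, 5), datetime(2020, 6, 1), datetime(2021, 3, 10)]

# Cut the observed time range into two equally sized bins...
labels, edges = DynamicTopicModel.bin_timestamps(timestamps, bins=2)

# ...or pass explicit bin edges
# (lower edge inclusive, upper edge exclusive)
explicit_edges = [datetime(2020, 1, 1), datetime(2020, 7, 1), datetime(2021, 4, 1)]
labels, edges = DynamicTopicModel.bin_timestamps(timestamps, bins=explicit_edges)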

fit_dynamic(raw_documents, timestamps, embeddings=None, bins=10)

Fits a dynamic topic model on the corpus and returns document-topic-importances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| raw_documents | | Documents to fit the model on. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Document embeddings produced by an embedding model. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. Note: The final edge is not included. You might want to add one day to the last bin edge if it equals the last timestamp. | 10 |
Source code in turftopic/dynamic.py
def fit_dynamic(
    self,
    raw_documents,
    timestamps: list[datetime],
    embeddings: Optional[np.ndarray] = None,
    bins: Union[int, list[datetime]] = 10,
):
    """Fits a dynamic topic model on the corpus and returns document-topic-importances.

    Parameters
    ----------
    raw_documents
        Documents to fit the model on.
    timestamps: list[datetime]
        Timestamp for each document in `datetime` format.
    embeddings: np.ndarray, default None
        Document embeddings produced by an embedding model.
    bins: int or list[datetime], default 10
        Specifies how to bin timestamps in to time slices.
        When an `int`, the corpus will be divided into N equal time slices.
        When a list, it describes the edges of each time slice including the starting
        and final edges of the slices.

        Note: The final edge is not included. You might want to add one day to
        the last bin edge if it equals the last timestamp.
    """
    self.fit_transform_dynamic(raw_documents, timestamps, embeddings, bins)
    return self
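As the note above suggests, the final edge is exclusive, so when passing explicit edges you may want to push the last one past the last timestamp. A minimal sketch with yearly bins (an assumption; corpus and timestamps are the variables from the Usage section above):

from datetime import datetime, timedelta

from turftopic import KeyNMF

# Yearly bin edges for 2018-2021; the last edge is nudged one day past
# 2022-01-01 so documents dated exactly on that edge are not dropped
edges = [datetime(year, 1, 1) for year in range(2018, 2023)]
edges[-1] = edges[-1] + timedelta(days=1)

model = KeyNMF(5, top_n=5).fit_dynamic(corpus, timestamps=timestamps, bins=edges)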

fit_transform_dynamic(raw_documents, timestamps, embeddings=None, bins=10) abstractmethod

Fits a dynamic topic model on the corpus and returns document-topic-importances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| raw_documents | | Documents to fit the model on. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Document embeddings produced by an embedding model. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. | 10 |

Returns:

| Type | Description |
| --- | --- |
| ndarray of shape (n_documents, n_topics) | Document-topic importance matrix. |

Source code in turftopic/dynamic.py
@abstractmethod
def fit_transform_dynamic(
    self,
    raw_documents,
    timestamps: list[datetime],
    embeddings: Optional[np.ndarray] = None,
    bins: Union[int, list[datetime]] = 10,
) -> np.ndarray:
    """Fits a dynamic topic model on the corpus and returns document-topic-importances.

    Parameters
    ----------
    raw_documents
        Documents to fit the model on.
    timestamps: list[datetime]
        Timestamp for each document in `datetime` format.
    embeddings: np.ndarray, default None
        Document embeddings produced by an embedding model.
    bins: int or list[datetime], default 10
        Specifies how to bin timestamps in to time slices.
        When an `int`, the corpus will be divided into N equal time slices.
        When a list, it describes the edges of each time slice including the starting
        and final edges of the slices.

    Returns
    -------
    ndarray of shape (n_documents, n_topics)
        Document-topic importance matrix.
    """
    pass
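If you want to reuse the same document embeddings across several dynamic fits (for instance, to compare different binnings), you can pass them in explicitly. A minimal sketch, assuming the embeddings are produced with encode_documents as in prepare_dynamic_topic_data below, and that corpus and timestamps are the variables from the Usage section:

from turftopic import KeyNMF

model = KeyNMF(5, top_n=5)

# Compute embeddings once so the encoder does not have to re-run
embeddings = model.encode_documents(corpus)

document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, embeddings=embeddings, bins=10
)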

prepare_dynamic_topic_data(corpus, timestamps, embeddings=None, bins=10)

Produces topic inference data for a given corpus that can then be used and reused. Exists to allow out-of-the-box visualizations with topicwizard.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corpus | list[str] | Documents to infer topical content for. | required |
| timestamps | list[datetime] | Timestamp for each document in datetime format. | required |
| embeddings | Optional[ndarray] | Embeddings of documents. | None |
| bins | Union[int, list[datetime]] | Specifies how to bin timestamps into time slices. When an int, the corpus will be divided into N equal time slices. When a list, it describes the edges of each time slice including the starting and final edges of the slices. Note: The final edge is not included. You might want to add one day to the last bin edge if it equals the last timestamp. | 10 |

Returns:

| Type | Description |
| --- | --- |
| TopicData | Information about topical inference in a dictionary. |

Source code in turftopic/dynamic.py
def prepare_dynamic_topic_data(
    self,
    corpus: list[str],
    timestamps: list[datetime],
    embeddings: Optional[np.ndarray] = None,
    bins: Union[int, list[datetime]] = 10,
):
    """Produces topic inference data for a given corpus, that can be then used and reused.
    Exists to allow visualizations out of the box with topicwizard.

    Parameters
    ----------
    corpus: list of str
        Documents to infer topical content for.
    timestamps: list[datetime]
        Timestamp for each document in `datetime` format.
    embeddings: ndarray of shape (n_documents, n_dimensions)
        Embeddings of documents.
    bins: int or list[datetime], default 10
        Specifies how to bin timestamps in to time slices.
        When an `int`, the corpus will be divided into N equal time slices.
        When a list, it describes the edges of each time slice including the starting
        and final edges of the slices.

        Note: The final edge is not included. You might want to add one day to
        the last bin edge if it equals the last timestamp.

    Returns
    -------
    TopicData
        Information about topical inference in a dictionary.
    """
    if embeddings is None:
        embeddings = self.encode_documents(corpus)
    if getattr(self, "temporal_components_", None) is not None:
        try:
            document_topic_matrix = self.transform(
                corpus, embeddings=embeddings
            )
        except (AttributeError, NotFittedError):
            document_topic_matrix = self.fit_transform_dynamic(
                corpus,
                timestamps=timestamps,
                embeddings=embeddings,
                bins=bins,
            )
    else:
        document_topic_matrix = self.fit_transform_dynamic(
            corpus, timestamps=timestamps, embeddings=embeddings, bins=bins
        )
    dtm = self.vectorizer.transform(corpus)  # type: ignore
    try:
        classes = self.classes_
    except AttributeError:
        classes = list(range(self.components_.shape[0]))
    res = TopicData(
        corpus=corpus,
        document_term_matrix=dtm,
        vocab=self.get_vocab(),
        document_topic_matrix=document_topic_matrix,
        document_representation=embeddings,
        topic_term_matrix=self.components_,  # type: ignore
        transform=getattr(self, "transform", None),
        topic_names=self.topic_names,
        classes=classes,
        temporal_components=self.temporal_components_,
        temporal_importance=self.temporal_importance_,
        time_bin_edges=self.time_bin_edges,
    )
    return res
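Putting it together, a TopicData object prepared this way can be reused for multiple views of the same fit without re-running inference (as shown in the Usage section above):

topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)

# Both calls read from the precomputed inference data
topic_data.plot_topics_over_time()
topic_data.print_topics_over_time()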