Module 2: Clustering

ClusteredNetworkTable class

class ClusteredNetworkTable(NetworkTable):
    '''
    A table showing the network representation of census data and the cluster assignments in the network.
    Each feature present in the data (including the cluster assignment for each year) is a column,
    and each possible path through the network is a row.
    '''

Instance Variables:

  • table (pandas.DataFrame): The table, presented as a pandas DataFrame.

  • years (List[str]): The census years present in the table.

  • id (str): The unique geographical id used to distinguish geographical areas in the table.

  • num_clusters (int): The number of clusters that data can be assigned to. Determined by the user.

  • tsc (Union[OptTSCluster, GreedyTSCluster]): The tscluster clustering object used to fit the data.

  • arr (np.ndarray[np.float64]): The array of data used in clustering.

  • label_dict (dict[str, Any]): The labels of census years, network paths, and variables that correspond to each dimension of arr.

Methods:

  • modify_label_dict: Takes a custom label dictionary as an argument and sets label_dict to the new dictionary.
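
A custom label dictionary has to line up with the dimensions of arr. The sketch below is a hedged illustration of that requirement: it assumes label_dict follows the tscluster convention of keys 'T', 'N', and 'F' (time steps, entities, features) and that arr is ordered the same way. The key names, the helper check_label_dict, and the sample labels are assumptions made for this example, not part of piccard.

```python
def check_label_dict(label_dict, shape):
    """Return True if label_dict has one label per index along each axis of shape.

    Assumes tscluster-style keys 'T', 'N', 'F' and an array shape ordered
    (T, N, F). Both are assumptions for this sketch.
    """
    expected = dict(zip(("T", "N", "F"), shape))
    return all(
        key in label_dict and len(label_dict[key]) == size
        for key, size in expected.items()
    )


# Hypothetical labels for an array of shape (2 years, 3 paths, 2 features).
custom_labels = {
    "T": ["2016", "2021"],
    "N": ["path_0", "path_1", "path_2"],
    "F": ["population", "median_income"],
}
labels_ok = check_label_dict(custom_labels, (2, 3, 2))
```

If the check passes, the dictionary could then be handed to modify_label_dict on a ClusteredNetworkTable instance, for example clustered_table.modify_label_dict(custom_labels), where clustered_table is a hypothetical variable name.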

Module 2 Functions

clustering_prep

Converts a piccard network table into a 3D NumPy array of all possible paths and their corresponding features, for use in clustering with tscluster. The user can optionally pass a list of columns to be considered by the clustering algorithm; the function checks that these columns are valid.

Parameters:

  • network_table (NetworkTable):

    The NetworkTable containing the data to be clustered.

  • cols (list[str] | None):

    A list of the names of network table columns that should be considered in the clustering algorithm. If None, every numerical feature is considered. Leaving it as None is not recommended, because many numerical features, such as network level, carry little meaningful information for clustering.

Returns:

  • tuple[np.ndarray[np.float64], dict[str, Any], NetworkTable]:

    A tuple of a 3D NumPy array, a corresponding dictionary of labels for each dimension of the array, and the network table with any rows containing NaN values removed.
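
The reshaping described above can be illustrated with a small NumPy sketch. It is not piccard's implementation: the "<feature>_<year>" column naming, the (T, N, F) axis order (years, paths, features), the 'T'/'N'/'F' label keys, and all sample values are assumptions made for the example.

```python
import numpy as np

# Hypothetical wide table: one row per network path, one column per
# feature and year. The "<feature>_<year>" naming is an assumption.
columns = {
    "population_2016": [100.0, 250.0, np.nan],
    "population_2021": [120.0, 240.0, 90.0],
    "income_2016": [40.0, 55.0, 38.0],
    "income_2021": [42.0, 60.0, 39.0],
}
years = ["2016", "2021"]
features = ["population", "income"]
paths = ["path_0", "path_1", "path_2"]

# Build (T, F, N) from the nested lists, then swap the last two axes
# to get (T, N, F): years x paths x features.
arr = np.array(
    [[columns[f"{feat}_{year}"] for feat in features] for year in years],
    dtype=np.float64,
).transpose(0, 2, 1)

# Drop every path that has a NaN in any year or any feature.
keep = ~np.isnan(arr).any(axis=(0, 2))
arr = arr[:, keep, :]

# One list of labels per array dimension.
label_dict = {
    "T": years,
    "N": [path for path, kept in zip(paths, keep) if kept],
    "F": features,
}
```

Here path_2 is dropped because its 2016 population is missing, so the array and its labels stay aligned.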

cluster

Runs one of tscluster’s clustering algorithms (the default is fully dynamic clustering, 'z1c1') and adds the resulting cluster assignments to the network table and nodes as an additional feature. We recommend either Sequential Label Analysis ('z1c0') or the default 'z1c1'. Information about the different clustering algorithms is available here: https://tscluster.readthedocs.io/en/latest/introduction.html

Users can pass only the network table, in which case clustering_prep is run for them with the default columns. Alternatively, they can run clustering_prep themselves, which gives them the option of applying one or both of the normalization methods available in tscluster.preprocessing.utils before clustering.

Parameters:

  • network_table (NetworkTable):

    The NetworkTable containing the data to be clustered.

  • G (nx.Graph):

    The result of pc.create_network().

  • num_clusters (int):

    The number of clusters that the algorithm will find.

  • algo (str | None):

    The algorithm that tscluster will use, either 'greedy' (default) or 'opt'. 'greedy' runs GreedyTSCluster, which is faster and easier to run than OptTSCluster but less accurate. Since it doesn’t require a special academic licence, we recommend 'greedy' for any non-academic users. 'opt' runs OptTSCluster, which is guaranteed to find the optimal clustering but requires a Gurobi academic licence. More information about obtaining an academic licence can be found here: https://www.gurobi.com/academia/academic-program-and-licenses/

  • scheme (str | None):

    The clustering scheme. See the description of cluster above for more information. Default is 'z1c1'.

  • arr (np.ndarray[np.float64] | None):

    The array of data to be clustered. If None, arr and label_dict are generated by running clustering_prep with the default columns. See the clustering_prep documentation for why we do not recommend leaving this as None.

  • label_dict (dict[str, Any] | None):

    The label dictionary corresponding to the data array. See arr.

Returns:

  • ClusteredNetworkTable:

    The ClusteredNetworkTable object. Note that cluster also adds the resulting cluster assignments to the network table and nodes as an additional feature.
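
The second workflow described for cluster (running clustering_prep first, then passing its outputs on) might look like the sketch below. The keyword names come from the parameter lists in this section, but the return order of clustering_prep is taken from its description and is therefore an assumption, as is the wrapper cluster_with_prepared_data itself. The piccard module is passed in as pc, so the sketch does not assume a particular import path.

```python
def cluster_with_prepared_data(pc, network_table, G, cols, num_clusters):
    """Prepare the data explicitly, then cluster it.

    pc is the piccard module (hypothetical wrapper; argument names follow
    the parameter lists documented above).
    """
    # Assumed return order: array, label dictionary, cleaned network table.
    arr, label_dict, network_table = pc.clustering_prep(network_table, cols)

    # Optional normalization of arr would go here, before clustering.

    return pc.cluster(
        network_table=network_table,
        G=G,
        num_clusters=num_clusters,
        algo="greedy",   # no Gurobi licence required
        scheme="z1c1",   # the documented default scheme
        arr=arr,
        label_dict=label_dict,
    )
```

Passing arr and label_dict explicitly avoids the default-column behaviour that the clustering_prep documentation advises against.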