Module 2: Clustering

ClusteredNetworkTable class

class ClusteredNetworkTable(NetworkTable):
'''
A table showing the network representation of census data and the cluster assignments in the network.
Each feature present in the data (including the cluster assignment for each year) is a column,
and each possible path through the network is a row.
'''

Instance Variables:

table (pandas.DataFrame): The table, presented as a pandas DataFrame.
years (List[str]): The census years present in the table.
id (str): The unique geographical id used to distinguish geographical areas in the table; used externally when interacting with GeoJSON files.
id_col (str): id.lower(); used internally to find column names in the network.
weighted (bool): Whether weights have been applied to the network. Weighting prevents data points that appear multiple times in the same column (because they belong to multiple temporal paths) from exerting undue influence on clustering and other data analysis.
num_clusters (int): The number of clusters that data can be assigned to. Determined by the user.
tsc (Union[OptTSCluster, GreedyTSCluster]): The tscluster clustering object used to fit the data.
arr (np.ndarray[np.float64]): The array of data used in clustering.
label_dict (dict[str, Any]): The labels of census years, network paths, and variables that correspond to each dimension of arr.
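
To illustrate how label_dict relates to arr, here is a minimal sketch. The axis order (years, paths, variables), the keys "T", "N" and "F", and all example labels are assumptions made for illustration; check the actual label_dict on your table for the real keys.

```python
import numpy as np

# Hypothetical example data: 2 census years, 3 network paths, 2 variables.
years = ["2016", "2021"]
paths = ["path_0", "path_1", "path_2"]
variables = ["population", "median_income"]

# Assumed axis order: (years, paths, variables).
arr = np.arange(12, dtype=np.float64).reshape(len(years), len(paths), len(variables))

# Assumed keys: "T" for census years, "N" for paths, "F" for variables.
label_dict = {"T": years, "N": paths, "F": variables}


def lookup(arr, label_dict, year, path, variable):
    """Return the value stored for a given year, path, and variable label."""
    t = label_dict["T"].index(year)
    n = label_dict["N"].index(path)
    f = label_dict["F"].index(variable)
    return arr[t, n, f]


value = lookup(arr, label_dict, "2021", "path_1", "median_income")
```

Each list in the dictionary has the same length as the matching dimension of arr, so a label's position in the list doubles as its index along that dimension.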

Methods:

modify_label_dict: Takes a custom label dictionary as an argument and sets label_dict to the new dictionary.

Module 2 Functions

`clustering_prep`

Converts a piccard network table into a 3D NumPy array of all possible paths and their corresponding features. The array is the input used for clustering with tscluster. The user can optionally pass a list of columns to be considered by the clustering algorithm, and the function checks that these columns are valid.

Note that weights are applied to the numpy array regardless of whether weighted is true in the network table. The network table is not modified to include weights.
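
The weighting formula is not stated here, so the following is only a sketch of one plausible scheme (inverse frequency), not piccard's implementation. It shows why weighting matters when the same geographic area appears in several paths. The ids and values are made up.

```python
import numpy as np

# Hypothetical: the geographic id each of 4 paths passes through in one census year.
# Area "A" belongs to two paths, so its value appears twice in the column.
ids_in_year = np.array(["A", "A", "B", "C"])
values = np.array([100.0, 100.0, 40.0, 60.0])

# Assumed scheme: weight each row by 1 / (number of paths containing its id).
_, inverse, counts = np.unique(ids_in_year, return_inverse=True, return_counts=True)
weights = 1.0 / counts[inverse]

unweighted_mean = values.mean()                      # counts area "A" twice
weighted_mean = np.average(values, weights=weights)  # counts each area once
```

With these weights, the weighted mean equals the mean over the three distinct areas, so the duplicated area no longer pulls the statistic toward its own value.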

Parameters:

network_table (NetworkTable):
The NetworkTable containing the data to be clustered.
cols (list[str] | None):
A list of the names of variables (network table columns minus years) that should be considered in the clustering algorithm. If None, every numerical feature will be considered. Leaving it as None is not recommended, because many numerical features, such as network level, are not meaningful for clustering.

Returns:

(tuple[np.ndarray[np.float64], dict[str, Any], NetworkTable]):
A tuple of a 3D NumPy array, a corresponding dictionary of labels describing each dimension of the array, and the network table with any rows containing NaN values removed.
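
The conversion described above can be sketched on a plain pandas DataFrame. This is not the piccard implementation: the "<variable>_<year>" column naming, the (years, paths, variables) axis order, the label keys, and the helper name to_3d_array are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one row per path, one column per variable and year.
table = pd.DataFrame(
    {
        "population_2016": [100.0, 250.0, np.nan],
        "population_2021": [120.0, 240.0, 90.0],
        "income_2016": [50.0, 60.0, 55.0],
        "income_2021": [52.0, 61.0, 57.0],
    }
)
years = ["2016", "2021"]
cols = ["population", "income"]


def to_3d_array(table, years, cols):
    """Stack the selected columns into an array of shape (years, paths, variables)."""
    needed = [f"{c}_{y}" for y in years for c in cols]
    # Validate the requested columns before doing any work.
    missing = [name for name in needed if name not in table.columns]
    if missing:
        raise ValueError(f"Columns not found in table: {missing}")
    # Drop paths with a missing value in any selected column.
    clean = table.dropna(subset=needed).reset_index(drop=True)
    arr = np.stack(
        [clean[[f"{c}_{y}" for c in cols]].to_numpy(dtype=np.float64) for y in years]
    )
    label_dict = {"T": years, "N": clean.index.tolist(), "F": cols}
    return arr, label_dict, clean


arr, label_dict, clean = to_3d_array(table, years, cols)
```

Returning the cleaned table alongside the array keeps the rows of the table aligned with the path dimension of the array, which is presumably why the real function also returns the modified network table.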

`cluster`

Runs one of tscluster’s clustering algorithms (the default is fully dynamic clustering, 'z1c1') and adds the resulting cluster assignments to the network table and nodes as an additional feature. See https://tscluster.readthedocs.io/en/latest/introduction.html for information about the different clustering algorithms. We recommend either Sequential Label Analysis ('z1c0') or the default 'z1c1'.
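
As a rough illustration of adding cluster assignments to the table as one column per year, the sketch below uses a hand-written label array instead of a fitted tscluster model. The (paths, years) label shape, the "cluster_assignment_<year>" column names, and the helper add_cluster_columns are assumptions, not piccard's actual names.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one row per path.
table = pd.DataFrame(
    {
        "population_2016": [100.0, 250.0, 90.0],
        "population_2021": [120.0, 240.0, 95.0],
    }
)
years = ["2016", "2021"]

# Hypothetical assignments with assumed shape (paths, years).
# The last path changes cluster between years, as a dynamic scheme would allow.
labels = np.array([[0, 0], [1, 1], [0, 1]])


def add_cluster_columns(table, labels, years):
    """Return a copy of the table with one cluster-assignment column per year."""
    out = table.copy()
    for t, year in enumerate(years):
        out[f"cluster_assignment_{year}"] = labels[:, t]
    return out


clustered = add_cluster_columns(table, labels, years)
```

Under a scheme with static assignments, every row of labels would hold the same value in each year, so the per-year columns would be identical.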

Users can pass only the network table, in which case clustering_prep is run for them with the default columns. Alternatively, they can run clustering_prep themselves, which gives them the option to apply one or both of the normalization methods available in tscluster.preprocessing.utils.
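
To show what normalizing the array before clustering can look like, here is a NumPy-only sketch of two common schemes, z-score and min-max scaling, applied within each time step. It assumes the array has shape (years, paths, variables). It is not tscluster's code, and the two function names are hypothetical; consult tscluster.preprocessing.utils for the actual scalers and their options.

```python
import numpy as np


def standardize_per_time(arr):
    """Z-score each variable within each time step; arr has shape (T, N, F)."""
    mean = arr.mean(axis=1, keepdims=True)
    std = arr.std(axis=1, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # constant variables map to zero
    return (arr - mean) / std


def min_max_per_time(arr):
    """Rescale each variable to [0, 1] within each time step."""
    low = arr.min(axis=1, keepdims=True)
    span = arr.max(axis=1, keepdims=True) - low
    span = np.where(span == 0, 1.0, span)  # constant variables map to zero
    return (arr - low) / span


# Hypothetical array: 2 years, 2 paths, 2 variables on very different scales.
arr = np.array(
    [
        [[1.0, 10.0], [3.0, 30.0]],
        [[2.0, 5.0], [2.0, 15.0]],
    ]
)
z = standardize_per_time(arr)
m = min_max_per_time(arr)
```

Either scheme puts variables with different units on a comparable scale, so that a large-valued variable does not dominate the distances used in clustering.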