piccard

A graph-based approach to census data analysis. Developed by Abdulmohseen AlAli, Fabian Corbin, and Maliha Lodi.

Introduction

Overview

Urban researchers rely on census data to identify and analyze demographic trends over time. Understanding these trends is essential for planning new infrastructure, supporting immigrant communities, and providing local services, among several other goals. However, due to the same demographic trends that census data is meant to illuminate, population changes can lead to new census boundaries being drawn. This makes the seemingly simple task of analyzing changes in a region difficult because a region defined by a specific boundary in any given year may not exist in other years.

piccard combines a novel solution to this problem with data clustering algorithms to streamline the data analysis process. piccard’s solution significantly improves on the traditional one, geographical harmonization, which involves defining a common set of regions across all years and fitting data to these regions. This method always introduces some amount of error, and harmonization methods are not readily available for some types of data, which makes it difficult to analyze and visualize that data.

piccard represents census regions as nodes in a graph network. Each census region in a specific year is a node, and two nodes are connected if they belong to consecutive census years and their geographical overlap meets or exceeds a specified percentage. To identify trends in a specific region over time, piccard analyzes every path through the graph that contains that region.
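The idea can be illustrated with a minimal, standard-library sketch. This is not piccard's implementation: the rectangular regions, the helper names, and the choice to measure overlap as intersection area divided by the smaller of the two areas are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy data: each region is an axis-aligned rectangle
# (xmin, ymin, xmax, ymax), keyed by (census year, region id).
regions = {
    ("2016", "A"): (0, 0, 10, 10),
    ("2016", "B"): (10, 0, 20, 10),
    ("2021", "A1"): (0, 0, 10, 5),     # A was split into A1 and A2
    ("2021", "A2"): (0, 5, 10, 10),
    ("2021", "B"): (9.8, 0, 20, 10),   # B grew slightly into A's old area
}


def area(rect):
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_fraction(a, b):
    """Intersection area divided by the smaller area (an assumed definition)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter) / min(area(a), area(b))


def build_edges(regions, years, threshold=0.05):
    """Connect regions from consecutive years whose overlap meets the threshold."""
    edges = []
    for earlier, later in zip(years, years[1:]):
        old = [node for node in regions if node[0] == earlier]
        new = [node for node in regions if node[0] == later]
        for u, v in product(old, new):
            if overlap_fraction(regions[u], regions[v]) >= threshold:
                edges.append((u, v))
    return edges


edges = build_edges(regions, ["2016", "2021"])
for u, v in edges:
    print(u, "->", v)
```

In this toy data, the sliver of overlap between 2016's A and 2021's B falls below the 5 percent threshold, so no edge is drawn between them, while the split of A into A1 and A2 produces two edges.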

piccard integrates network creation, data clustering, and visualization into one tool, which makes useful analysis possible for data that cannot easily be harmonized. It also creates networks efficiently by using parallel processing for large datasets and by supporting different coordinate reference systems.

For more information about the theory behind piccard, see this research paper.

Modules

The functionality of piccard is broken into four modules.

The first module, network creation, focuses on efficiently processing census data and representing it as a graph network.

The second module, clustering, uses tscluster’s flexible time-series clustering algorithms to cluster census regions in piccard networks.

The third module, visualization and analysis, provides highly customizable and accessible visualizations of piccard networks and clustering results, and also offers probabilistic analysis supported by ppandas.

Finally, the fourth module, VariableLinker, is a tool for understanding the links between census variables over time. Like census regions, census variables change considerably over time, and VariableLinker allows users to match variables based on their semantic meaning over time.

Tests

To run the tests for piccard, clone the repository and run the following commands in the root directory:

pip install -e .
pytest --import-mode=importlib tests/

Do not attempt to run the tests without installing piccard as a package via pip, as the tests rely on relative imports that will not work otherwise.

Example usage

For a real-life example using the first three modules of piccard, see this Colab notebook.

For a real-life example of the fourth module, see this Colab notebook.

Licence

This software is distributed under a CC0-1.0 Licence.

GitHub repository

To report a bug, contribute a fix, or look at the code behind piccard, see this GitHub repository.

Installation

Requirements

  • pandas>=1.3.0

  • numpy>=1.20.0

  • geopandas>=0.10.0

  • shapely>=1.8.0

  • pyproj>=3.0.0

  • networkx>=2.6.0

  • matplotlib>=3.5.0

  • plotly>=5.0.0

  • nltk>=3.6.0

  • sentence-transformers>=2.0.0

  • scikit-learn>=1.0.0

  • graphviz>=0.20.0

  • swifter>=1.3.0

  • typing-extensions>=4.0.0

  • hatchling>=1.0.0

In addition, when using the second and third modules, you will need to install tscluster and ppandas respectively. Keep reading this section for instructions on installing those packages.

Installing piccard

To install the current released version:

pip install piccard==1.1.2

To install the pre-release version via git:

pip install git+https://github.com/fcorbin567/piccard2.git

Then import:

from piccard import piccard as pc

Installing tscluster

tscluster requires the following:

  • Python 3.8+

  • numpy>=1.26

  • scipy>=1.10

  • gurobipy>=11.0

  • tslearn>=0.6.3

  • h5py>=3.10

  • pandas>=2.2

  • matplotlib>=3.8

Note that you will need a Gurobi licence to use OptTSCluster with a large model size. See here for more about Gurobi licences.

To install the current released version:

pip install tscluster

To install the pre-release version via git:

pip install git+https://github.com/tscluster-project/tscluster.git

Installing ppandas

ppandas requires the following:

  • pgmpy==0.1.9

  • networkx==2.4

  • matplotlib

  • python-interval

  • geopandas

  • geovoronoi

To install via git:

pip install git+https://github.com/fcorbin567/ppandas.git

Module 1: Network Creation

NetworkTable class

class NetworkTable():
    '''
    A table showing the network representation of census data.
    Each feature present in the data is a column, and each possible path through the network is a row.
    '''

Instance Variables:

  • table (pandas.DataFrame): The table, presented as a pandas DataFrame.

  • years (List[str]): The census years present in the table.

  • id (str): The unique geographical id used to distinguish geographical areas in the table.

Methods:

  • modify_table: Takes a new pandas DataFrame as an argument and sets table to the new DataFrame.

Module 1 Functions

preprocessing

You do not need to call this function to create a network table, but you can run it yourself, for example if you want the details of the dataframe cleaning without the network creation, or if you want to try out different CRSs. Returns a cleaned GeoDataFrame of the input data. Uses parallel processing for very large (>100,000 rows) datasets. Also adds a column for the given year containing the calculated area of each census tract in that year. Note: the input data is assumed to have been read with gpd.read_file() beforehand.

Parameters:

  • data (GeoDataFrame):

    The census data to be analyzed with piccard.

  • year (str):

    The year that the census data was collected.

  • id (str):

    The name of the unique identifier that will be used to distinguish geographical areas.

  • crs (CRS | None):

    A pythonic Coordinate Reference System manager that will be used to compute areas. Default is EPSG:3347, a consistent, equal-area CRS based on square metres. Can be many formats; see https://pyproj4.github.io/pyproj/stable/api/crs/crs.html for more information.

  • verbose (bool | None):

    Whether to issue print statements about the progress of network creation. Default is true.

Returns:

  • gpd.GeoDataFrame: the cleaned data

create_network

Creates a networkx network representation of the temporal connections present in census_dfs over years. Two geographic areas from consecutive years are connected when their overlap is at least the threshold percentage. Geographical areas are represented as nodes, and temporal connections as edges.

Parameters:

  • census_dfs (List[gpd.GeoDataFrame]):

    A list of GeoDataFrames containing the census data to be turned into a network.

  • years (List[str]):

    A list of years present in census_dfs over which the network representation will be created. Data from years not present in years will be ignored.

  • id (str):

    The name of the unique identifier that will be used to distinguish geographical areas.

  • crs (CRS | None):

    A pythonic Coordinate Reference System manager that will be used to compute areas. Default is EPSG:3347, a consistent, equal-area CRS based on square metres. Can be many formats; see https://pyproj4.github.io/pyproj/stable/api/crs/crs.html for more information.

  • threshold (float | None):

    The percentage of overlap (divided by 100) that geographic areas must meet or exceed in order to have a connection. Default is 0.05, or 5 percent.

  • verbose (bool | None):

    Whether to issue print statements about the progress of network creation. Default is true.

Returns:

  • nx.Graph: The networkx graph containing the nodes (geographical areas) and edges (geographical overlap) created in the new network representation.

create_network_table

Creates a NetworkTable showing the network representation of the census data in census_dfs. Each feature present in the data is a column, and each possible path through the network is a row.

Parameters:

  • census_dfs (List[gpd.GeoDataFrame]):

    A list of GeoDataFrames containing the census data to be turned into a network.

  • years (List[str]):

    A list of years present in census_dfs over which the network representation will be created. Data from years not present in years will be ignored.

  • id (str):

    The name of the unique identifier that will be used to distinguish geographical areas.

  • crs (CRS | None):

    A pythonic Coordinate Reference System manager that will be used to compute areas. Default is EPSG:3347, a consistent, equal-area CRS based on square metres. Can be many formats; see https://pyproj4.github.io/pyproj/stable/api/crs/crs.html for more information.

  • threshold (float | None):

    The percentage of overlap (divided by 100) that geographic areas must meet or exceed in order to have a connection. Default is 0.05, or 5 percent.

  • verbose (bool | None):

    Whether to issue print statements about the progress of network creation. Default is true.

Returns:

  • NetworkTable: the table.
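To make the "one row per path" layout concrete, here is a small standard-library sketch. It is not piccard's code: the edge list, the population figures, the "<column>_<year>" naming, and the decision to drop paths that do not span every year are assumptions chosen for illustration.

```python
def enumerate_paths(edges, years):
    """Return every path that visits one node per year by following the edges.

    Simplification: nodes without a successor are dropped rather than padded.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    first_year_nodes = sorted({u for u, _ in edges if u[0] == years[0]})
    paths = [[node] for node in first_year_nodes]
    for _ in years[1:]:
        paths = [p + [nxt] for p in paths for nxt in successors.get(p[-1], [])]
    return paths


# Hypothetical network: region A splits into A1 and A2, region B is unchanged.
edges = [
    (("2016", "A"), ("2021", "A1")),
    (("2016", "A"), ("2021", "A2")),
    (("2016", "B"), ("2021", "B")),
]
population = {
    ("2016", "A"): 900, ("2016", "B"): 400,
    ("2021", "A1"): 500, ("2021", "A2"): 450, ("2021", "B"): 420,
}

rows = []
for path in enumerate_paths(edges, ["2016", "2021"]):
    row = {}
    for year, region_id in path:
        # One group of columns per year, suffixed with that year.
        row[f"id_{year}"] = region_id
        row[f"population_{year}"] = population[(year, region_id)]
    rows.append(row)

for row in rows:
    print(row)
```

Because region A appears in two paths, its 2016 values are repeated in two rows, which is the trade-off of a path-based table compared with a harmonized one.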

Module 2: Clustering

ClusteredNetworkTable class

class ClusteredNetworkTable(NetworkTable):
    '''
    A table showing the network representation of census data and the cluster assignments in the network.
    Each feature present in the data (including the cluster assignment for each year) is a column,
    and each possible path through the network is a row.
    '''

Instance Variables:

  • table (pandas.DataFrame): The table, presented as a pandas DataFrame.

  • years (List[str]): The census years present in the table.

  • id (str): The unique geographical id used to distinguish geographical areas in the table.

  • num_clusters (int): The number of clusters that data can be assigned to. Determined by the user.

  • tsc (Union[OptTSCluster, GreedyTSCluster]): The tscluster clustering object used to fit the data.

  • arr (np.ndarray[np.float64]): The array of data used in clustering.

  • label_dict (dict[str, Any]): The labels of census years, network paths, and variables that correspond to each dimension of arr.

Methods:

  • modify_label_dict: Takes a custom label dictionary as an argument and sets label_dict to the new dictionary.

Module 2 Functions

clustering_prep

Converts a piccard network table into a 3d numpy array of all possible paths and their corresponding features. This will be used for clustering with tscluster. The user can (optionally) input a list of columns that they want to be considered in the clustering algorithm, and the function will check that these columns are valid.

Parameters:

  • network_table (NetworkTable):

    The NetworkTable containing the data to be clustered.

  • cols (list[str] | None):

    A list of the names of network table columns that should be considered in the clustering algorithm. If none, every numerical feature will be considered. Leaving it none is not recommended as many numerical features, such as network level, have little bearing on the data.

Returns:

  • (tuple[np.ndarray[np.float64], dict[str, Any]], NetworkTable):

    A tuple of a 3d numpy array, a corresponding dictionary of labels showing the shape of the array, and the network table modified so it doesn’t include any of the NaN rows.
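The reshaping step can be sketched with nested lists in place of a numpy array. This is an illustration only: the "<feature>_<year>" column naming, the (years, paths, features) axis order, and the 'T'/'N'/'F' label keys are assumptions, and the function below is not clustering_prep itself.

```python
import math


def to_3d(rows, years, features):
    """Return (arr, label_dict, kept_rows); arr is indexed [year][path][feature]."""
    # Drop any path that has a NaN in one of the selected columns.
    kept = [
        row for row in rows
        if not any(math.isnan(row[f"{f}_{y}"]) for y in years for f in features)
    ]
    arr = [
        [[float(row[f"{f}_{y}"]) for f in features] for row in kept]
        for y in years
    ]
    label_dict = {"T": list(years), "N": list(range(len(kept))), "F": list(features)}
    return arr, label_dict, kept


# Hypothetical network table rows (one row per path).
rows = [
    {"income_2016": 50.0, "income_2021": 55.0, "age_2016": 40.0, "age_2021": 41.0},
    {"income_2016": float("nan"), "income_2021": 70.0, "age_2016": 35.0, "age_2021": 36.0},
    {"income_2016": 80.0, "income_2021": 82.0, "age_2016": 30.0, "age_2021": 33.0},
]

arr, label_dict, kept = to_3d(rows, ["2016", "2021"], ["income", "age"])
print(len(arr), len(arr[0]), len(arr[0][0]))  # years, paths, features
```

The second row is removed because its 2016 income is NaN, so the array and the returned rows stay aligned path by path.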

cluster

Runs one of tscluster’s clustering algorithms (default is fully dynamic clustering or 'z1c1') and adds the resulting cluster assignments to the network table and nodes as an additional feature. Information about the different clustering algorithms is available here: https://tscluster.readthedocs.io/en/latest/introduction.html We recommend either Sequential Label Analysis ('z1c0') or the default 'z1c1'.

Users can choose to only input the network table, in which case clustering_prep will be run for them with the default columns, or they can choose to run clustering_prep on their own and then have the option to apply one or both of the normalization methods available in tscluster.preprocessing.utils.

Parameters:

  • network_table (NetworkTable):

    The NetworkTable containing the data to be clustered.

  • G (nx.Graph):

    The result of pc.create_network().

  • num_clusters (int):

    The number of clusters that the algorithm will find.

  • algo (str | None):

    The algorithm that tscluster will use, either 'greedy' (default) or 'opt'. 'greedy' runs GreedyTSCluster, which is a faster and easier, but less accurate, method than OptTSCluster. Since it doesn’t require a special academic licence, we recommend 'greedy' for any non-academic users. 'opt' runs OptTSCluster, which is guaranteed to find the optimal clustering but requires a Gurobi academic licence to run the clustering algorithm. More information about obtaining an academic licence can be found here: https://www.gurobi.com/academia/academic-program-and-licenses/

  • scheme (str | None):

    The clustering scheme. See the first paragraph for more information. Default is 'z1c1'.

  • arr (np.ndarray[np.float64] | None):

    The array of data to be clustered. If None, arr and label_dict will be generated by running clustering_prep with the default columns. See the clustering_prep documentation for why we DO NOT recommend leaving this blank.

  • label_dict (dict[str, Any] | None):

    The label dictionary corresponding to the data array. See arr.

Returns:

  • ClusteredNetworkTable:

    The ClusteredNetworkTable object. Note that cluster also adds the resulting cluster assignments to the network table and nodes as an additional feature.

Module 3: Visualization and Analysis

Module 3 Functions

Network Visualizations

plot_subnetwork

Draws a subgraph of the network representation. If neither a specific list of ids to show nor a specific list of paths to show are given, picks num_to_sample random nodes from the first census year in the data and plots a subnetwork of their paths.

Hovering over each node shows the paths the node is part of.

Parameters:

  • network_table (NetworkTable):

    The NetworkTable containing the data.

  • G (nx.Graph):

    The network containing the data.

  • years (List[str] | None):

    A list of years to show in the subnetwork. Default is all census years present in the data.

  • paths_to_show (List[int] | None):

    A list of paths (numbered according to their position in network_table) whose points will be plotted in the subnetwork.

  • ids_to_show (List[str] | None):

    A list of ids (use the same type of id you used when creating the graph and network table) that will be plotted in the subnetwork. If both paths_to_show and ids_to_show are given, the function will only consider ids_to_show.

  • num_to_sample (int | None):

    The number of random nodes to plot the paths of in the subnetwork. Default is 4. Note: A large num_to_sample value may result in an unorganized and hard-to-read visualization.

Returns:

  • plotly.graph_objects.Figure:

    The interactive subnetwork plot.
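The default sampling behaviour described above can be approximated as follows. This is a hedged sketch rather than the library's code: paths are plain lists of (year, id) tuples, and the seed argument is a hypothetical addition used only to make the example repeatable.

```python
import random


def sample_subnetwork(paths, num_to_sample=4, seed=None):
    """Pick random first-year nodes and collect the nodes and edges of their paths."""
    rng = random.Random(seed)
    first_year_nodes = sorted({path[0] for path in paths})
    count = min(num_to_sample, len(first_year_nodes))
    chosen = set(rng.sample(first_year_nodes, count))
    kept = [path for path in paths if path[0] in chosen]
    nodes = {node for path in kept for node in path}
    edges = {(u, v) for path in kept for u, v in zip(path, path[1:])}
    return chosen, nodes, edges


# Hypothetical paths through a two-year network.
paths = [
    [("2016", "A"), ("2021", "A1")],
    [("2016", "A"), ("2021", "A2")],
    [("2016", "B"), ("2021", "B")],
    [("2016", "C"), ("2021", "C")],
]

chosen, nodes, edges = sample_subnetwork(paths, num_to_sample=2, seed=0)
print(sorted(chosen))
print(sorted(edges))
```

Sampling first-year nodes rather than individual paths keeps every branch of a split region in the drawing, which is why a large num_to_sample can quickly crowd the figure.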

plot_num_areas

Plots the number of geographical areas across a subset of census years in the data.

Parameters:

  • network_table (NetworkTable):

    The NetworkTable containing the data.

  • years (List[str] | None):

    A list of years to show in the plot. Default is all census years present in the data.

Returns:

  • plotly.graph_objects.Figure:

    The plot of the number of geographical areas.

Clustering Visualizations

plot_clusters_scatter

Creates a plotly scatterplot for each variable used in clustering with each timestep on the x axis and values on the y axis. The colours of data points correspond to their assigned cluster, and there is a legend showing which colour goes with which cluster. (Cluster numbers start at 0.)

Since cluster assignment often changes along the same path (or within the same area) over the years, plotting all the data points in one cluster often involves considering other clusters as well. Therefore, when you select a cluster to plot, you will see every path that contains a point in that cluster, and some of these paths will also contain points in different clusters.

Add any clusters you don’t want to see (e.g. a cluster composed of NaN values) to clusters_to_exclude. This will exclude all paths containing these clusters, even paths that also contain points in clusters listed in clusters_to_show. In addition, you can curate the specific paths you want to see with paths_to_show; just make sure the paths are numbered according to their position in network_table.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • label_dict (dict[str, Any] | None):

    A custom label dictionary.

  • years (List[str] | None):

    A list of years to show in the plot. Default is all census years present in the data.

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • dynamic_paths_only (bool | None):

    A boolean indicating whether to only plot dynamic entities (entities whose cluster assignment has changed over time). Default is true.

  • paths_to_show (List[int] | None):

    A list of paths (numbered according to their position in network_table) whose points will be plotted. Default is every path.

  • ids_to_show (List[str] | None):

    A list of ids (use the same type of id you used when creating the graph and network table) that will be plotted. Default is every id. If both paths_to_show and ids_to_show are given, the function will only consider ids_to_show.

  • clusters_to_show (List[int] | None):

    A list of the clusters whose points will be displayed in the plot. Default is every cluster.

  • clusters_to_exclude (List[int] | None):

    A list of the clusters whose points will NOT be displayed in the plot. Default is an empty list.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

Returns:

  • List[plotly.graph_objects.Figure]:

    A list of figures, one per variable used in clustering. You cannot show the whole list at once; instead, iterate through the list and show each figure.
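The interaction between clusters_to_show, clusters_to_exclude, and dynamic_paths_only can be summarized in a short sketch. It mirrors the behaviour described in this section but is not the function's actual code; representing each path as a list of per-year cluster numbers is an assumption.

```python
def select_paths(assignments, clusters_to_show=None, clusters_to_exclude=(),
                 dynamic_paths_only=True):
    """Return the indices of the paths that would be plotted."""
    selected = []
    for index, sequence in enumerate(assignments):
        seen = set(sequence)
        if dynamic_paths_only and len(seen) == 1:
            continue  # the cluster assignment never changes along this path
        if seen & set(clusters_to_exclude):
            continue  # exclusion wins, even if a requested cluster is present
        if clusters_to_show is not None and not seen & set(clusters_to_show):
            continue  # no point of this path is in a requested cluster
        selected.append(index)
    return selected


# Hypothetical cluster assignments: one list per path, one entry per census year.
assignments = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 2, 1],
    [2, 2, 0],
]

print(select_paths(assignments, clusters_to_show=[1]))
print(select_paths(assignments, clusters_to_show=[1], clusters_to_exclude=[2]))
print(select_paths(assignments, dynamic_paths_only=False))
```

Path 2 contains cluster 1, so it is selected by clusters_to_show=[1], but it disappears as soon as cluster 2 is excluded.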

plot_clusters_parallelcats

Creates an interactive parallel categories (parallel sets) plot to visualize how cluster assignments evolve over time.

Each column in the plot corresponds to a time point (e.g., a census year), and each path across the columns represents a “temporal path” of a tract or unit as it transitions across categories.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • years (List[str] | None):

    A list of years to show. Default is all census years present in the data.

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • colour_index_year (str | None):

    The year that will be used to determine the colours of the parallel plot. For example, if you chose 2011 as the colour index year, every cluster in the 2011 dimension would have a colour assigned to it, and then the paths into and out of these clusters would be shown in those colours. Default is the first year in the network table, and if an invalid input is given, the default will be used.

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

Returns:

  • plotly.graph_objects.Figure:

    The interactive parallel categories plot.
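A parallel categories plot is essentially a picture of how many paths follow each sequence of clusters. The counting behind such a plot can be sketched with the standard library; the list-of-lists input format is an assumption, and this is not how piccard builds the figure internally.

```python
from collections import Counter


def count_trajectories(assignments):
    """Count how many paths follow each full sequence of clusters."""
    return Counter(tuple(sequence) for sequence in assignments)


def count_transitions(assignments, step):
    """Count moves between clusters from year index `step` to `step + 1`."""
    return Counter((sequence[step], sequence[step + 1]) for sequence in assignments)


# Hypothetical cluster assignments for four paths over three census years.
assignments = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]

trajectories = count_trajectories(assignments)
first_step = count_transitions(assignments, 0)
print(trajectories.most_common(1))
print(dict(first_step))
```

Each key of the trajectory counter corresponds to one ribbon across all columns of the plot, and the transition counts correspond to the ribbon widths between two adjacent years.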

plot_clusters_area

Creates an interactive area chart to visualize how cluster assignments evolve over time.

Each column in the plot corresponds to a time point (e.g., a census year), and each path across the columns represents a “temporal path” of a tract or unit as it transitions across categories.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • years (List[str] | None):

    A list of years to show. Default is all census years present in the data.

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

  • stacked (bool | None):

    Whether to show the area plot as a stacked plot, with all the areas on top of each other. If False, shows the area plot as a regular line graph. Default is True.

Returns:

  • plotly.graph_objects.Figure:

    The interactive area plot.

plot_clusters_map

Plots cluster assignments in their associated geographical regions for a specific year using a GeoDataFrame.

Parameters:

  • geofile_path (str):

    Path to geographical data file

  • network_table (ClusteredNetworkTable):

    Network table to be merged with GeoJSON

  • year (str):

    Year to visualize (used in column name)

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • label_dict (dict[str, Any] | None):

    The label dictionary from pc.clustering_prep() that you used in pc.cluster() or a custom label dictionary. Used to determine what data will be shown when you hover over each geographical region. If None, only the index (path number) will be shown.

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

Returns:

  • plotly.graph_objects.Figure:

    The interactive choropleth map

plot_line_means

Creates an interactive line chart with one subplot per feature, showing how cluster mean values evolve over the selected years.

For each year in years, plots the mean value of each feature in selected_features for every cluster.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • years (List[str] | None):

    A list of years to show. Default is all census years present in the data.

  • selected_features (List[str]):

    Which features (column names present in clustering) to plot

  • varnames (List[str] | None):

    The custom variable names to plot

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

Returns:

  • plotly.graph_objects.Figure:

    The line chart with subplots.
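The quantity being plotted is a mean per feature, per year, per cluster. A standard-library sketch of that aggregation is shown below; the "cluster_assignment_<year>" and "<feature>_<year>" column names are hypothetical, and the real function works on a ClusteredNetworkTable rather than a list of dicts.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(rows, years, selected_features):
    """Return {(feature, year, cluster): mean value} over all paths."""
    grouped = defaultdict(list)
    for row in rows:
        for year in years:
            cluster = row[f"cluster_assignment_{year}"]
            for feature in selected_features:
                grouped[(feature, year, cluster)].append(row[f"{feature}_{year}"])
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical clustered rows: one row per path.
rows = [
    {"cluster_assignment_2016": 0, "cluster_assignment_2021": 0,
     "income_2016": 40.0, "income_2021": 44.0},
    {"cluster_assignment_2016": 0, "cluster_assignment_2021": 1,
     "income_2016": 60.0, "income_2021": 90.0},
    {"cluster_assignment_2016": 1, "cluster_assignment_2021": 1,
     "income_2016": 100.0, "income_2021": 110.0},
]

means = cluster_means(rows, ["2016", "2021"], ["income"])
for key in sorted(means):
    print(key, means[key])
```

Note that a path contributes to whichever cluster it belongs to in each year, so the second row counts toward cluster 0 in 2016 and cluster 1 in 2021.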

plot_bar_means

Create grouped bar-chart subplots of cluster means for each year.

For each year in years, plots the mean value of each feature in selected_features for every cluster. Subplots are arranged in a grid with two columns.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • years (List[str] | None):

    A list of years to show. Default is all census years present in the data.

  • selected_features (List[str]):

    Which features (column names present in clustering) to plot

  • varnames (List[str] | None):

    The custom variable names to plot

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

Returns:

  • plotly.graph_objects.Figure:

    The bar chart with subplots.

radar_chart_multiple_years

Creates a radar (polar) chart of selected variables for a given cluster across years.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • years (List[str] | None):

    A list of years to show. Default is all census years present in the data.

  • selected_cluster (int):

    Which cluster to plot

  • selected_features (List[str]):

    Which features (column names present in clustering) to plot

  • varnames (List[str] | None):

    The custom variable names to plot

  • year_colours (dict[int, str] | None):

    A dict mapping indices of years to their corresponding colours. For example, if your data goes from 2006 to 2021, 2006 corresponds to index 0, 2011 to 1, etc. If None, plotly’s default colour map will be used. If a year is not part of the dict, plotly’s default colour map will be used for that year.

  • cluster_label (str | None):

    The custom label of the cluster to show. Default is Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

Returns:

  • plotly.graph_objects.Figure:

    The radar chart.

radar_chart_multiple_clusters

Creates a radar (polar) chart of selected variables for a given year across clusters.

Parameters:

  • network_table (ClusteredNetworkTable):

    The ClusteredNetworkTable containing the data.

  • clusters (List[int] | None):

    A list of clusters to show. Default is all clusters present in the data.

  • selected_year (str):

    Which year to plot

  • selected_features (List[str]):

    Which features (column names present in clustering) to plot

  • varnames (List[str] | None):

    The custom variable names to plot

  • cluster_colours (dict[int, str] | None):

    A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.

  • cluster_labels (List[str] | None):

    A custom list of cluster names. Default is Cluster 0, …, Cluster n.

  • figsize (Tuple[float, float] | None):

    A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).

Returns:

  • plotly.graph_objects.Figure:

    The radar chart.

Probabilistic Analysis

prob_reasoning_networks

Allows probabilistic reasoning over network representations of heterogeneous/unlinked datasets using the ppandas package. For more information about ppandas, visit: https://github.com/D3Mlab/ppandas/tree/master

Takes in two network tables and lists of independent and dependent variables for each, performs and visualizes a join, and returns the resulting PDataFrame (which can be used to obtain information about conditional probabilities). This function is recommended if you have datasets from different sources or datasets that designate geographical regions using different units.

The second list of independent variables must be a subset of the first, so make sure the column names are the same before passing them into this function. However, mismatches in independent variable column data allowed by ppandas are okay.

Parameters:

  • network_table_1 (NetworkTable | pd.DataFrame | gpd.GeoDataFrame):

    The reference network table. Typically the network table associated with the data assumed to be less biased and more reliable.

  • network_table_2 (NetworkTable | pd.DataFrame | gpd.GeoDataFrame):

    The second network table whose independent and dependent variables will be joined into a probabilistic model of network_table_1.

  • independent_vars_1 (List[str]):

    A list of independent variables associated with network_table_1. Must be columns in network_table_1.

  • independent_vars_2 (List[str]):

    A list of independent variables associated with network_table_2. Must be columns in network_table_2 and every column in independent_vars_2 must also appear in independent_vars_1.

  • dependent_vars_1 (List[str]):

    A list of dependent variables associated with network_table_1. Must be columns in network_table_1.

  • dependent_vars_2 (List[str]):

    A list of dependent variables associated with network_table_2. Must be columns in network_table_2. Unlike with independent variables, not every column in dependent_vars_2 also has to appear in dependent_vars_1.

  • mismatches (dict[str, str] | None):

    A dictionary of the mismatches PDataFrame.pjoin will handle. Must be in format {<independent variable name>: <'categorical' | 'numerical' | 'spatial'>}. See the link above for more information.

Returns:

  • PDataFrame:

    The result of joining the two probabilistic models of network tables.
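Before calling the function, it can help to check the argument constraints listed above. The helper below is a hypothetical pre-flight check written for this guide; it is not part of piccard or ppandas, and it only encodes the rules stated in the parameter descriptions.

```python
ALLOWED_MISMATCH_TYPES = {"categorical", "numerical", "spatial"}


def check_join_inputs(columns_1, columns_2,
                      independent_vars_1, independent_vars_2,
                      dependent_vars_1, dependent_vars_2,
                      mismatches=None):
    """Return a list of human-readable problems (empty if the inputs look valid)."""
    problems = []
    for name in list(independent_vars_1) + list(dependent_vars_1):
        if name not in columns_1:
            problems.append(f"{name!r} is not a column of network_table_1")
    for name in list(independent_vars_2) + list(dependent_vars_2):
        if name not in columns_2:
            problems.append(f"{name!r} is not a column of network_table_2")
    extra = set(independent_vars_2) - set(independent_vars_1)
    if extra:
        problems.append(f"independent_vars_2 has names missing from independent_vars_1: {sorted(extra)}")
    for name, kind in (mismatches or {}).items():
        if kind not in ALLOWED_MISMATCH_TYPES:
            problems.append(f"mismatch type {kind!r} for {name!r} is not allowed")
    return problems


# Hypothetical column names for two tables.
ok = check_join_inputs(
    ["age", "region", "income"], ["age", "health"],
    ["age", "region"], ["age"], ["income"], ["health"],
    mismatches={"age": "numerical"},
)
bad = check_join_inputs(
    ["age", "income"], ["age", "gender", "health"],
    ["age"], ["age", "gender"], ["income"], ["health"],
    mismatches={"age": "ordinal"},
)
print(ok)
print(bad)
```

The second call reports two problems: gender is an independent variable of the second table only, and "ordinal" is not one of the three mismatch types.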

prob_reasoning_years

Allows probabilistic reasoning over network representations of heterogeneous/unlinked datasets using the ppandas package. For more information about ppandas, visit: https://github.com/D3Mlab/ppandas/tree/master

Takes in two years from the same network table and lists of independent and dependent variables for each, performs and visualizes a join, and returns the resulting PDataFrame (which can be used to obtain information about conditional probabilities).

The second list of independent variables must be a subset of the first, so make sure the column names are the same before passing them into this function. Mismatches in independent variable column data allowed by ppandas are okay.

Parameters:

  • network_table (NetworkTable):

    The network table containing the data.

  • year_1 (str):

    The first year examined.

  • year_2 (str):

    The second year examined.

  • independent_vars_1 (List[str]):

    A list of independent variables associated with year_1. Must be columns in network_table and end in year_1.

  • independent_vars_2 (List[str]):

    A list of independent variables associated with year_2. Must be columns in network_table and end in year_2. The columns (minus year 2) must be a subset of independent_vars_1 (minus year 1).

  • dependent_vars_1 (List[str]):

    A list of dependent variables associated with year_1. Must be columns in network_table and end in year_1.

  • dependent_vars_2 (List[str]):

    A list of dependent variables associated with year_2. Must be columns in network_table and end in year_2. Unlike with independent variables, not every column in dependent_vars_2 also has to appear in dependent_vars_1.

  • mismatches (dict[str, str] | None):

    A dictionary of the mismatches PDataFrame.pjoin will handle. Must be in the format {<independent variable name>: <'categorical' | 'numerical' | 'spatial'>}. See the link above for more information.

Returns:

  • PDataFrame:

    The result of joining the two probabilistic models.
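The subset requirement on the independent variables can be checked before calling prob_reasoning_years. The following is a minimal sketch, not part of piccard: the helper names strip_year and check_independent_vars are hypothetical, and it assumes each column name ends in the year, optionally preceded by an underscore.

```python
def strip_year(column: str, year: str) -> str:
    """Remove a trailing year (and any underscore before it) from a column name."""
    if not column.endswith(year):
        raise ValueError(f"{column!r} does not end in {year!r}")
    return column[: -len(year)].rstrip("_ ")


def check_independent_vars(vars_1, year_1, vars_2, year_2) -> bool:
    """Return True if vars_2 (minus year_2) is a subset of vars_1 (minus year_1)."""
    base_1 = {strip_year(col, year_1) for col in vars_1}
    base_2 = {strip_year(col, year_2) for col in vars_2}
    return base_2 <= base_1


# Hypothetical column names, for illustration only.
ok = check_independent_vars(
    ["age_2016", "income_2016"], "2016",
    ["age_2021"], "2021",
)
print(ok)
```

Running a check like this first gives a clear error about column naming before any join is attempted.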

Module 4: VariableLinker

Overview

VariableLinker is a Python framework designed for visualizing the links between census variables across multiple years. It provides multiple approaches for matching census variables between different years and creates hierarchical tree visualizations that show how these variables are connected.

Key Features:

  • Multiple Matching Algorithms: Jaccard similarity and sentence transformers

  • Hierarchical Visualization: Creates tree structures showing the parent-child relationships in census data

  • Colour-coded Results: Visual indicators for data consistency across years

Use Cases:

  • Census data harmonization across multiple years

  • Tracking changes in census variables over time

  • Visualizing data consistency and evolution

Installation and Setup

Prerequisites

pip install -r requirements.txt

Importing VariableLinker

import sys
import os


# Add the src/piccard directory to Python path

current_dir = os.getcwd()
src_path = os.path.join(current_dir, '..', 'src', 'piccard')
sys.path.append(src_path)

from variable_linker import VariableLinker

Core Concepts

1. Census Metadata Structure

VariableLinker works with census metadata JSON files that contain:

  • Vector identifiers: Unique codes for census variables

  • Descriptions: Human-readable descriptions of census variables

  • Types: Categories like “Total”, “Male”, “Female”

  • Details: Additional contextual information

2. Matching Process

The framework performs two-pass matching:

  • Exact Match: Find identical descriptions across years

  • Similarity Match: Use similarity algorithms for inexact matches
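The two passes can be sketched as follows. This is an illustration of the idea only, not VariableLinker's implementation: the function names are hypothetical, the inputs are plain lists of description strings, and difflib's SequenceMatcher ratio stands in for whichever similarity measure is actually used.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (not the library's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def two_pass_match(source, compare, similarity=ratio, threshold=0.9):
    """Map each source description to a compare description, or None."""
    compare_set = set(compare)
    matches = {}
    for desc in source:
        # Pass 1: exact match on the description text.
        if desc in compare_set:
            matches[desc] = desc
            continue
        # Pass 2: best-scoring candidate at or above the threshold.
        best, best_score = None, threshold
        for candidate in compare:
            score = similarity(desc, candidate)
            if score >= best_score:
                best, best_score = candidate, score
        matches[desc] = best
    return matches


result = two_pass_match(
    ["Total population", "Population aged 25 to 34 years"],
    ["Total population", "Population aged 25 to 34 year", "Median income"],
)
print(result)
```

Running the exact pass first means the more expensive similarity comparison is only needed for descriptions whose wording changed between years.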

3. Tree Visualization

  • Nodes: Represent census variables

  • Edges: Show parent-child relationships

  • Colours: Indicate consistency across years

    • Grey: Source year only

    • Salmon: Matches in 1 other year

    • Yellow: Matches in 2 other years

    • Light green: Matches in 3+ other years
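The colour scheme above amounts to a lookup on the number of other years in which a variable was matched. A minimal sketch, assuming that count is already known; the function name node_colour and the exact colour strings are illustrative and not taken from VariableLinker.

```python
def node_colour(other_years_matched: int) -> str:
    """Pick a node colour from the number of other years with a match."""
    if other_years_matched <= 0:
        return "grey"        # source year only
    if other_years_matched == 1:
        return "salmon"      # matched in 1 other year
    if other_years_matched == 2:
        return "yellow"      # matched in 2 other years
    return "lightgreen"      # matched in 3 or more other years


for count in range(5):
    print(count, node_colour(count))
```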

VariableLinker Class Reference

Class Overview

class VariableLinker:
    """
    A class for processing census metadata and creating tree visualizations.

    This class provides functionality for:
    - Preprocessing census metadata from JSON files
    - Computing similarity between census descriptions using various methods
    - Matching descriptions across different census years
    - Building hierarchical tree visualizations with colour-coding
    """

Static Methods

preprocess_census_metadata(path, type_filter="Total")

Preprocesses census metadata from JSON files.

Parameters:

  • path (str): Path to the JSON file containing census metadata

  • type_filter (str): Type of records to filter for (default: “Total”)

Returns:

  • pd.DataFrame: Preprocessed DataFrame with columns ['vector', 'type', 'description', …]

Example:

data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")

jaccard_similarity(sentence1, sentence2)

Computes Jaccard similarity between two census descriptions.

Parameters:

  • sentence1 (str): First census description

  • sentence2 (str): Second census description

Returns:

  • float: Jaccard similarity score between 0.0 and 1.0
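For reference, the Jaccard similarity of two token sets A and B is |A ∩ B| / |A ∪ B|. The sketch below shows the computation on whitespace-split, lowercased words. It is not the library's code: the name jaccard_sketch is hypothetical, the tokenization is deliberately simpler than what VariableLinker describes, and returning 0.0 for two empty inputs is an assumed convention.

```python
def jaccard_sketch(sentence1: str, sentence2: str) -> float:
    """Jaccard similarity of the lowercased word sets of two sentences."""
    tokens1 = set(sentence1.lower().split())
    tokens2 = set(sentence2.lower().split())
    union = tokens1 | tokens2
    if not union:
        return 0.0  # assumed convention: two empty inputs never match
    return len(tokens1 & tokens2) / len(union)


# One shared word out of three distinct words gives a score of 1/3.
print(jaccard_sketch("Male population", "Female population"))
```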

process_discription_text(text)

Processes and tokenizes census text for similarity comparison.

Parameters:

  • text (str): Raw census description text

Returns:

  • set: Set of processed tokens (words and numbers, excluding stopwords)
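A rough sketch of this kind of tokenization is shown below. The stopword list and the name tokenize_sketch are assumptions made for illustration; the actual method may use a different stopword list and additional cleaning steps.

```python
import re

# Small illustrative stopword list (an assumption, not the library's list).
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "or", "the", "to", "with"}


def tokenize_sketch(text: str) -> set:
    """Return the set of lowercased words and numbers, minus stopwords."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


print(sorted(tokenize_sketch("Population aged 25 to 34 years")))
```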

normalize_ranges(text)

Normalizes numeric ranges in text for consistent processing.

Parameters:

  • text (str): Text containing potential numeric ranges

Returns:

  • str: Text with normalized numeric ranges
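One plausible form of range normalization is rewriting variants such as "25 to 34" or "25 – 34" into a single "25-34" form, so that equivalent ranges produce the same tokens. The sketch below illustrates that idea only; the name normalize_ranges_sketch and the specific patterns handled are assumptions, and the library's rules may differ.

```python
import re

# Two integers joined by "to", a hyphen, or an en dash (U+2013).
_RANGE = re.compile(r"(\d+)\s*(?:to|-|\u2013)\s*(\d+)")


def normalize_ranges_sketch(text: str) -> str:
    """Rewrite numeric ranges in the text as 'low-high'."""
    return _RANGE.sub(r"\1-\2", text)


print(normalize_ranges_sketch("Population aged 25 to 34 years"))
```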

parse_tree_to_dict(filepath)

Parses a Graphviz tree file into a dictionary structure.

Parameters:

  • filepath (str): Path to the Graphviz tree file

Returns:

  • Dict: Dictionary mapping node IDs to their information including descriptions, year mappings, and colours

Example:

tree_dict = VariableLinker.parse_tree_to_dict("my_tree")

extract_parent_child_relationships(filepath)

Extracts parent-child relationships from tree file edges.

Parameters:

  • filepath (str): Path to the tree file (Graphviz format)

Returns:

  • Dict[str, List[str]]: Dictionary mapping parent nodes to their children

Example:

relationships = VariableLinker.extract_parent_child_relationships("my_tree")
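To make the returned structure concrete, the sketch below collects parent-to-children lists from the edge lines of Graphviz DOT text. It is an illustration under assumptions: the helper name edges_to_children is hypothetical, it expects one "parent -> child" edge per line, and the files written by build_tree may be laid out differently.

```python
import re
from collections import defaultdict

# Matches lines such as:  n1 -> n2   or   "n1" -> "n2";
EDGE = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')


def edges_to_children(dot_source: str) -> dict:
    """Map each parent node ID to the list of its child node IDs."""
    children = defaultdict(list)
    for line in dot_source.splitlines():
        match = EDGE.match(line)
        if match:
            children[match.group(1)].append(match.group(2))
    return dict(children)


sample = """digraph {
    n1 [label="Total population"]
    n1 -> n2
    n1 -> n3
    "n2" -> "n4";
}"""
print(edges_to_children(sample))
```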

predict_parent_nodes(tree_dict, parent_child_relationships, target_years)

Predicts parent nodes in other years using the additive property.

Parameters:

  • tree_dict (Dict): Parsed tree dictionary with node info and year mappings

  • parent_child_relationships (Dict[str, List[str]]): Parent to children mapping

  • target_years (List[str]): Years to predict parents for (default: ['2016', '2011', '2006'])

Returns:

  • Dict[str, List[str]]: Dictionary mapping parent nodes to years in which they can be predicted

Example:

predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships, ['2016', '2011'])

Matching Approaches

1. Jaccard Similarity Matching

Method: match_descriptions_jaccard()

Uses token-based similarity to match descriptions across years.

Advantages:

  • Good for exact and near-exact matches

  • Language-agnostic

Disadvantages:

  • May miss semantic similarities

  • Sensitive to phrasing

Usage:

jaccard_mapping = VariableLinker.match_descriptions_jaccard(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)

2. Sentence Transformer Matching

Method: match_descriptions_transformer()

Uses pre-trained sentence transformers for semantic similarity matching.

Advantages:

  • Captures semantic meaning

  • Better for paraphrased descriptions

  • Robust to word variations

  • Faster than the Jaccard approach because similarities are computed with vectorized operations

Disadvantages:

  • Limited ability to process numeric values and ranges in text descriptions

Usage:

transformer_mapping = VariableLinker.match_descriptions_transformer(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9,
    model_name='all-mpnet-base-v2'
)

3. Advanced Sentence Transformer Matching

Method: match_descriptions_details_sentence_transformer()

An enhanced version of the sentence transformer approach that uses the details field to break ties when multiple exact matches are found.

Advantages:

  • Attempts better disambiguation using details field

  • More sophisticated exact matching strategy

Disadvantages:

  • Performance evaluation indicates higher error rate than basic transformer

  • Higher computational complexity without performance benefit

Usage:

advanced_mapping = VariableLinker.match_descriptions_details_sentence_transformer(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)

4. Multithreaded Matching

Method: match_descriptions_multithreaded() (from multithreaded_mapping.py)

Jaccard similarity approach with multithreaded execution for enhanced performance on large datasets.

Advantages:

  • Parallel processing for similarity matching phase

  • Configurable number of worker threads (default: 4)

  • Thread-safe operations for similarity matching

Usage:

from multithreaded_mapping import match_descriptions_multithreaded

multithreaded_mapping = match_descriptions_multithreaded(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9,
    max_workers=8
)

Workflow Examples

Basic Workflow

# 1. Load and preprocess data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")


# 2. Perform matching
mapping_21_16 = VariableLinker.match_descriptions_jaccard(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)


# 3. Merge mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)


# 4. Build visualization
tree = VariableLinker.build_tree(data_2021, merged_df, "my_tree", "output_path")

Multi-Year Workflow

# Load data for multiple years
data_2006 = VariableLinker.preprocess_census_metadata("census_ca06_full_metadata.json")
data_2011 = VariableLinker.preprocess_census_metadata("census_ca11_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")


# Match against 2021 (latest year)
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
mapping_21_11 = VariableLinker.match_descriptions_jaccard(data_2021, data_2011, 0.9)
mapping_21_06 = VariableLinker.match_descriptions_jaccard(data_2021, data_2006, 0.9)


# Merge all mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16, mapping_21_11, mapping_21_06)


# Build comprehensive tree
tree = VariableLinker.build_tree(data_2021, merged_df, "multi_year_tree", "trees/")

Comparison of Approaches

# Jaccard approach
jaccard_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
jaccard_merged = VariableLinker.merge_mappings(data_2021, jaccard_mapping)
jaccard_tree = VariableLinker.build_tree(data_2021, jaccard_merged, "jaccard_tree", "trees/")


# Transformer approach
transformer_mapping = VariableLinker.match_descriptions_transformer(data_2021, data_2016, 0.9)
transformer_merged = VariableLinker.merge_mappings(data_2021, transformer_mapping)
transformer_tree = VariableLinker.build_tree(data_2021, transformer_merged, "transformer_tree", "trees/")


# Multithreaded approach
from multithreaded_mapping import match_descriptions_multithreaded
multithreaded_mapping = match_descriptions_multithreaded(data_2021, data_2016, 0.9, 8)
multithreaded_merged = VariableLinker.merge_mappings(data_2021, multithreaded_mapping)
multithreaded_tree = VariableLinker.build_tree(data_2021, multithreaded_merged, "multithreaded_tree", "trees/")

Advanced Features

Custom Similarity Thresholds

Different thresholds can be used for different types of data:

# Strict matching for critical variables
critical_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.95)


# Relaxed matching for exploratory analysis
exploratory_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.7)

Model Selection for Transformers

# Use different transformer models
mapping_mini = VariableLinker.match_descriptions_transformer(
    data_2021, data_2016, 0.9, 'all-MiniLM-L6-v2'
)
mapping_mpnet = VariableLinker.match_descriptions_transformer(
    data_2021, data_2016, 0.9, 'all-mpnet-base-v2'
)

Tree Analysis and Prediction

Overview

VariableLinker provides advanced functionality for analyzing existing tree structures and predicting missing parent nodes based on the additive property of census data.

Key Concepts

  • Additive Property

    In census data, parent variables often represent the sum of their child variables:

    Parent_Value = Sum(Child_Values)

    This property allows us to predict parent nodes in years where they don’t exist or did not get matched, as long as all their children are available in those years.

  • Tree Parsing

    The framework can parse existing Graphviz tree files to extract:

    • Node descriptions and metadata

    • Year-specific vector mappings

    • Parent-child relationships

    • Colour-coding information

Workflow for Tree Analysis

# 1. Parse existing tree file
tree_dict = VariableLinker.parse_tree_to_dict("existing_tree.gv")


# 2. Extract parent-child relationships
relationships = VariableLinker.extract_parent_child_relationships("existing_tree.gv")


# 3. Predict missing parent nodes
predictions = VariableLinker.predict_parent_nodes(
    tree_dict=tree_dict,
    parent_child_relationships=relationships,
    target_years=['2016', '2011', '2006']
)


# 4. Analyze predictions
for parent_node, predictable_years in predictions.items():
    print(f"Parent '{parent_node}' can be predicted in years: {predictable_years}")

Complete Analysis Workflow

# Load and process census data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")


# Create initial tree
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
tree = VariableLinker.build_tree(data_2021, merged_df, "analysis_tree", "trees/")


# Analyze the created tree
tree_dict = VariableLinker.parse_tree_to_dict("trees/analysis_tree")
relationships = VariableLinker.extract_parent_child_relationships("trees/analysis_tree")


# Predict missing parents
predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships)


# Generate report
print("=== Tree Analysis Report ===")
print(f"Total nodes in tree: {len(tree_dict)}")
print(f"Parent-child relationships: {len(relationships)}")
print(f"Predictable parent nodes: {len(predictions)}")

for parent, years in predictions.items():
    parent_desc = tree_dict[parent]['description']
    print(f"\nParent: {parent_desc}")
    print(f"  Node ID: {parent}")
    print(f"  Predictable in years: {years}")

Prediction Algorithm Details

The prediction algorithm works as follows:

  • Year Analysis: Identifies the years in which the parent currently exists

  • Child Verification: For each target year, checks if ALL children exist

  • Prediction: If all children exist in a target year, the parent can be predicted

Example Scenario:

Parent: “Total Population”

Children: [“Male Population”, “Female Population”]

If “Male Population” and “Female Population” both exist in 2016, but “Total Population” doesn’t exist in 2016, then “Total Population” can be predicted for 2016.
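The rule in this scenario can be sketched in a few lines. This is a simplified illustration rather than predict_parent_nodes itself: the name predict_parents_sketch is hypothetical, and each node is reduced to the set of years in which it exists, whereas the real tree dictionary carries more information.

```python
def predict_parents_sketch(node_years, children, target_years):
    """Return {parent: [years]} where the parent is missing but all children exist."""
    predictions = {}
    for parent, kids in children.items():
        if not kids:
            continue
        present = node_years.get(parent, set())
        years = [
            year
            for year in target_years
            if year not in present
            and all(year in node_years.get(kid, set()) for kid in kids)
        ]
        if years:
            predictions[parent] = years
    return predictions


node_years = {
    "Total Population": {"2021"},
    "Male Population": {"2021", "2016"},
    "Female Population": {"2021", "2016", "2011"},
}
children = {"Total Population": ["Male Population", "Female Population"]}
print(predict_parents_sketch(node_years, children, ["2016", "2011"]))
```

In this sample, 2011 is not predictable because “Male Population” is missing in that year, so the all-children condition fails.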

Use Cases for Tree Analysis

  • Data Completeness Assessment: Identify missing parent nodes across years

  • Prediction Validation: Verify which parent nodes can be reliably predicted

Performance Considerations

  • Memory Usage

    • Large datasets may require significant RAM

    • Consider processing in chunks for very large datasets

    • Use multithreaded approach for better memory management

Data Structures

Input DataFrame Format

{
    'vector': 'v_CA21_1234',
    'type': 'Total',
    'description': 'Population aged 25-34 years',
    'details': 'Detailed description...'
}

Output Mapping Format

{
    'description': 'Population aged 25-34 years',
    'vector_base': 'v_CA21_1234',
    'vector_cmp': 'v_CA16_1234'
}

Merged Mapping Format

{
    'description': 'Population aged 25-34 years',
    'vector_base': 'v_CA21_1234',
    'vector_cmp_list': ['v_CA16_1234', 'v_CA11_1234', 'v_CA06_1234']
}
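The relationship between the per-year output mappings and the merged mapping can be illustrated with plain dictionaries. This sketch is not merge_mappings, which the examples above call with DataFrames; the helper name merge_records_sketch is hypothetical, and records are grouped on vector_base as an assumption.

```python
def merge_records_sketch(*mappings):
    """Combine per-year mapping records into one record per base vector."""
    merged = {}
    for mapping in mappings:
        for record in mapping:
            entry = merged.setdefault(
                record["vector_base"],
                {
                    "description": record["description"],
                    "vector_base": record["vector_base"],
                    "vector_cmp_list": [],
                },
            )
            entry["vector_cmp_list"].append(record["vector_cmp"])
    return list(merged.values())


def record(vector_cmp):
    """Build one output-mapping record using the sample values shown above."""
    return {
        "description": "Population aged 25-34 years",
        "vector_base": "v_CA21_1234",
        "vector_cmp": vector_cmp,
    }


merged = merge_records_sketch(
    [record("v_CA16_1234")], [record("v_CA11_1234")], [record("v_CA06_1234")]
)
print(merged)
```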

Troubleshooting

Import Errors

# Solution: Add correct path
import sys
sys.path.append('../src/piccard')
from variable_linker import VariableLinker

File Not Found Errors

# Check file paths
import os
print("Current directory:", os.getcwd())
print("Files available:", os.listdir('.'))

Memory Issues

  • Reduce batch size for large datasets

  • Use multithreaded approach

  • Process data in chunks

Poor Matching Results

  • Adjust similarity threshold

  • Try different matching approaches

  • Check data quality and consistency

Configuration Options

Similarity Thresholds

  • Strict: 0.95+ for critical variables

  • Standard: 0.9 for most use cases

  • Relaxed: 0.7-0.8 for exploratory analysis

Transformer Models