piccard
A graph-based approach to census data analysis. Developed by Abdulmohseen AlAli, Fabian Corbin, and Maliha Lodi.
Introduction
Overview
Urban researchers rely on census data to identify and analyze demographic trends over time. Understanding these trends is essential for planning new infrastructure, supporting immigrant communities, and providing local services, among several other goals. However, due to the same demographic trends that census data is meant to illuminate, population changes can lead to new census boundaries being drawn. This makes the seemingly simple task of analyzing changes in a region difficult because a region defined by a specific boundary in any given year may not exist in other years.
piccard combines a novel solution to this problem with data clustering algorithms to streamline the data analysis process. piccard’s solution significantly improves on the traditional one, geographical harmonization, which involves defining a common set of regions across all years and fitting data to these regions. This method always introduces some amount of error, and harmonization methods are not readily available for some types of data, which makes it difficult to analyze and visualize that data.
piccard represents census regions as nodes in a graph network. Each census region in a specific year is represented by a node, and two nodes are connected if they belong to consecutive census years and their geographical overlap meets or exceeds a specified percentage. When identifying trends in a specific region over time, every path through the graph containing that region is analyzed.
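The graph construction described above can be sketched in plain Python. This is a minimal illustration and not piccard's implementation: the region ids, the overlap fractions, and the helper names `build_edges` and `all_paths` are hypothetical, and the overlaps are assumed to be precomputed.

```python
from collections import defaultdict

# Hypothetical precomputed overlaps: (earlier node, later node) -> shared fraction.
# A node is a (region id, census year) pair.
overlaps = {
    (("A", "2011"), ("A1", "2016")): 0.60,
    (("A", "2011"), ("A2", "2016")): 0.40,
    (("B", "2011"), ("A2", "2016")): 0.02,  # below the threshold, so no edge
    (("B", "2011"), ("B", "2016")): 0.98,
}


def build_edges(overlaps, threshold=0.05):
    """Keep an edge only where the overlap meets or exceeds the threshold."""
    edges = defaultdict(list)
    for (earlier, later), fraction in overlaps.items():
        if fraction >= threshold:
            edges[earlier].append(later)
    return edges


def all_paths(edges, start):
    """Enumerate every path that starts at `start` and moves forward in time."""
    successors = edges.get(start, [])
    if not successors:
        return [[start]]
    return [[start] + rest for nxt in successors for rest in all_paths(edges, nxt)]


edges = build_edges(overlaps)
paths_from_a = all_paths(edges, ("A", "2011"))
paths_from_b = all_paths(edges, ("B", "2011"))
```

Region A splits into two regions in 2016, so it yields two paths, while region B yields one. Analyzing a region then means analyzing every path that passes through it.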
piccard integrates network creation, data clustering, and visualization into one tool, an approach that makes useful analysis possible for data that cannot easily be harmonized. piccard also creates networks efficiently by using parallel processing for large datasets and by supporting different coordinate reference systems.
For more information about the theory behind piccard, see this research paper.
Modules
The functionality of piccard is broken into four modules.
The first module, network creation, focuses on efficiently processing census data and representing it as a graph network.
The second module, clustering, uses tscluster’s flexible time-series clustering algorithms to cluster census regions in piccard networks.
The third module, visualization and analysis, provides highly customizable and accessible visualizations of piccard networks and clustering results, and also offers probabilistic analysis supported by ppandas.
Finally, the fourth module, VariableLinker, is a tool for understanding the links between census variables over time. Like census regions, census variables change considerably over time, and VariableLinker allows users to match variables based on their semantic meaning over time.
Tests
To run the tests for piccard, clone the repository and run the following commands in the root directory:
```
pip install -e .
pytest --import-mode=importlib tests/
```
Do not attempt to run the tests without installing piccard as a package via pip, as the tests rely on relative imports that will not work otherwise.
Example usage
For a real-life example using the first three modules of piccard, see this Colab notebook.
For a real-life example of the fourth module, see this Colab notebook.
Licence
This software is distributed under a CC0-1.0 Licence.
GitHub repository
To report a bug, contribute a fix, or look at the code behind piccard, see this GitHub repository.
Installation
Requirements
pandas>=1.3.0
numpy>=1.20.0
geopandas>=0.10.0
shapely>=1.8.0
pyproj>=3.0.0
networkx>=2.6.0
matplotlib>=3.5.0
plotly>=5.0.0
nltk>=3.6.0
sentence-transformers>=2.0.0
scikit-learn>=1.0.0
graphviz>=0.20.0
swifter>=1.3.0
typing-extensions>=4.0.0
hatchling>=1.0.0
In addition, when using the second and third modules, you will need to install tscluster and ppandas respectively.
Keep reading this section for instructions on installing those packages.
Installing piccard
To install the current released version:
pip install piccard==1.1.2
To install the pre-release version via git:
pip install git+https://github.com/fcorbin567/piccard2.git
Then import:
from piccard import piccard as pc
Installing tscluster
tscluster requires the following:
Python 3.8+
numpy>=1.26
scipy>=1.10
gurobipy>=11.0
tslearn>=0.6.3
h5py>=3.10
pandas>=2.2
matplotlib>=3.8
Note that you will need a Gurobi licence when using OptTSCluster with a large model size. See here for more about Gurobi licences.
To install the current released version:
pip install tscluster
To install the pre-release version via git:
pip install git+https://github.com/tscluster-project/tscluster.git
Installing ppandas
ppandas requires the following:
pgmpy==0.1.9
networkx==2.4
matplotlib
python-interval
geopandas
geovoronoi
To install via git:
pip install git+https://github.com/fcorbin567/ppandas.git
Module 1: Network Creation
NetworkTable class
class NetworkTable():
'''
A table showing the network representation of census data.
Each feature present in the data is a column, and each possible path through the network is a row.
'''
Instance Variables:
table(pandas.DataFrame): The table, presented as a pandas DataFrame.
years(List[str]): The census years present in the table.
id(str): The unique geographical id used to distinguish geographical areas in the table.
Methods:
modify_table: Takes a new pandas DataFrame as an argument and sets table to the new DataFrame.
Module 1 Functions
preprocessing
Not necessary for network table creation, but you may optionally run this function yourself, for example
if you want details of the dataframe cleaning but not the network creation, or if you want to try out
different CRSs.
Returns a cleaned geopandas df of the input data. Uses parallel processing for very large (>100,000 rows) datasets.
Also adds a column for each year with calculated areas of each census tract in that year.
Note: Input data is assumed to have been passed through gpd.read_file() beforehand.
Parameters:
data(GeoDataFrame):The census data to be analyzed with piccard.
year(str):The year that the census data was collected.
id(str):The name of the unique identifier that will be used to distinguish geographical areas.
crs(CRS | None):A pythonic Coordinate Reference System manager that will be used to compute areas. Default is EPSG:3347, a consistent, equal-area CRS based on square metres. Can be many formats; see https://pyproj4.github.io/pyproj/stable/api/crs/crs.html for more information.
verbose(bool | None):Whether to issue print statements about the progress of network creation. Default is true.
Returns:
gpd.GeoDataFrame: the cleaned data
create_network
Creates a networkx network representation of the temporal connections present in census_dfs over years.
Two geographic areas in consecutive years are connected when their percentage of overlap meets or exceeds threshold.
Represents geographical areas as nodes, and temporal connections as edges.
Parameters:
census_dfs(List[gpd.GeoDataFrame]):A list of GeoDataFrames containing the census data to be turned into a network.
years(List[str]):A list of years present in census_dfs over which the network representation will be created. Data from years not present in years will be ignored.
id(str):The name of the unique identifier that will be used to distinguish geographical areas.
crs(CRS | None):A pythonic Coordinate Reference System manager that will be used to compute areas. Default is EPSG:3347, a consistent, equal-area CRS based on square metres. Can be many formats; see https://pyproj4.github.io/pyproj/stable/api/crs/crs.html for more information.
threshold(float | None):The percentage of overlap (divided by 100) that geographic areas must meet or exceed in order to have a connection. Default is 0.05, or 5 percent.
verbose(bool | None):Whether to issue print statements about the progress of network creation. Default is true.
Returns:
nx.Graph: The networkx graph containing the nodes (geographical areas) and edges (geographical overlap) created in the new network representation.
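As a rough illustration of the threshold test, the sketch below computes an overlap fraction for axis-aligned rectangles using only the standard library. It is a simplification under stated assumptions: real census areas are arbitrary polygons, the helper names are hypothetical, and dividing the shared area by the earlier area's size is an assumed choice of denominator that this documentation does not confirm.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_fraction(earlier, later):
    """Shared area divided by the earlier area's size (assumed denominator)."""
    shared = rect_area((
        max(earlier[0], later[0]),
        max(earlier[1], later[1]),
        min(earlier[2], later[2]),
        min(earlier[3], later[3]),
    ))
    return shared / rect_area(earlier)


def is_connected(earlier, later, threshold=0.05):
    """Two areas are linked when the overlap meets or exceeds the threshold."""
    return overlap_fraction(earlier, later) >= threshold


# Hypothetical areas in projected coordinates.
tract_2011 = (0.0, 0.0, 10.0, 10.0)
part_2016 = (0.0, 0.0, 10.0, 6.0)      # covers 60% of the 2011 tract
sliver_2016 = (9.8, 0.0, 20.0, 2.0)    # shares only a thin sliver
far_2016 = (20.0, 20.0, 30.0, 30.0)    # does not touch the 2011 tract
```

With the default threshold of 0.05, `part_2016` would be connected to `tract_2011`, while the sliver and the distant area would not. Raising the threshold removes weak connections caused by small boundary adjustments.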
create_network_table
Creates a NetworkTable showing the network representation of the census data in census_dfs.
Each feature present in the data is a column, and each possible path through the network is a row.
Parameters:
census_dfs(List[gpd.GeoDataFrame]):A list of GeoDataFrames containing the census data to be turned into a network.
years(List[str]):A list of years present in census_dfs over which the network representation will be created. Data from years not present in years will be ignored.
id(str):The name of the unique identifier that will be used to distinguish geographical areas.
crs(CRS | None):A pythonic Coordinate Reference System manager that will be used to compute areas. Default is EPSG:3347, a consistent, equal-area CRS based on square metres. Can be many formats; see https://pyproj4.github.io/pyproj/stable/api/crs/crs.html for more information.
threshold(float | None):The percentage of overlap (divided by 100) that geographic areas must meet or exceed in order to have a connection. Default is 0.05, or 5 percent.
verbose(bool | None):Whether to issue print statements about the progress of network creation. Default is true.
Returns:
NetworkTable: the table.
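The "one row per path, one column per feature" layout can be pictured with a small sketch. The column naming (`<feature>_<year>`), the id column name `tract_id`, and the helper `path_to_row` are assumptions made for illustration; the actual column names produced by piccard may differ.

```python
# Hypothetical per-node features keyed by (region id, year).
features = {
    ("A", "2011"): {"population": 4000},
    ("A1", "2016"): {"population": 2500},
    ("A2", "2016"): {"population": 1900},
}

# Each path visits one node per census year.
paths = [
    [("A", "2011"), ("A1", "2016")],
    [("A", "2011"), ("A2", "2016")],
]


def path_to_row(path, features, id_name="tract_id"):
    """Flatten one path into a single row with year-suffixed column names."""
    row = {}
    for region_id, year in path:
        row[f"{id_name}_{year}"] = region_id
        for name, value in features[(region_id, year)].items():
            row[f"{name}_{year}"] = value
    return row


table_rows = [path_to_row(path, features) for path in paths]
```

Because region A appears in two paths, its 2011 values are repeated in both rows. This duplication is what lets each row describe a complete trajectory through time.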
Module 2: Clustering
ClusteredNetworkTable class
class ClusteredNetworkTable(NetworkTable):
'''
A table showing the network representation of census data and the cluster assignments in the network.
Each feature present in the data (including the cluster assignment for each year) is a column,
and each possible path through the network is a row.
'''
Instance Variables:
table(pandas.DataFrame): The table, presented as a pandas DataFrame.
years(List[str]): The census years present in the table.
id(str): The unique geographical id used to distinguish geographical areas in the table.
num_clusters(int): The number of clusters that data can be assigned to. Determined by the user.
tsc(Union[OptTSCluster, GreedyTSCluster]): The tscluster clustering object used to fit the data.
arr(np.ndarray[np.float64]): The array of data used in clustering.
label_dict(dict[str, Any]): The labels of census years, network paths, and variables that correspond to each dimension of arr.
Methods:
modify_label_dict: Takes a custom label dictionary as an argument and sets label_dict to the new dictionary.
Module 2 Functions
clustering_prep
Converts a piccard network table into a 3d numpy array of all possible paths and their corresponding features. This will be used for clustering with tscluster. The user can (optionally) input a list of columns that they want to be considered in the clustering algorithm, and the function will check that these columns are valid.
Parameters:
network_table(NetworkTable):The NetworkTable containing the data to be clustered.
cols(list[str] | None):A list of the names of network table columns that should be considered in the clustering algorithm. If none, every numerical feature will be considered. Leaving it none is not recommended as many numerical features, such as network level, have little bearing on the data.
Returns:
(tuple[np.ndarray[np.float64], dict[str, Any]], NetworkTable):A tuple of a 3d numpy array, a corresponding dictionary of labels showing the shape of the array, and the network table modified so it doesn’t include any of the NaN rows.
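The reshaping step can be sketched with nested lists standing in for the 3d numpy array. Several details here are assumptions rather than documented behaviour: the axis order (year, path, feature), the `T`/`N`/`F` keys of the label dictionary, the `<feature>_<year>` column names, and the feature names themselves.

```python
import math

years = ["2011", "2016"]
cols = ["population", "median_income"]  # hypothetical feature names

# Hypothetical network table rows, one per path.
rows = [
    {"population_2011": 4000, "median_income_2011": 52000.0,
     "population_2016": 2500, "median_income_2016": 58000.0},
    {"population_2011": 4000, "median_income_2011": 52000.0,
     "population_2016": 1900, "median_income_2016": float("nan")},
]


def is_complete(row):
    """True when none of the selected features is NaN in any year."""
    return not any(
        math.isnan(float(row[f"{col}_{year}"])) for year in years for col in cols
    )


# Rows with missing values are dropped before clustering.
kept_rows = [row for row in rows if is_complete(row)]

# arr[t][n][f]: year index t, path index n, feature index f.
arr = [
    [[float(row[f"{col}_{year}"]) for col in cols] for row in kept_rows]
    for year in years
]

label_dict = {"T": years, "N": list(range(len(kept_rows))), "F": cols}
```

Passing an explicit list of columns keeps the array limited to features that are meaningful for clustering, which is why selecting `cols` by hand is recommended over the default.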
cluster
Runs one of tscluster’s clustering algorithms (default is fully dynamic clustering or 'z1c1')
and adds the resulting cluster assignments to the network table and nodes as an additional feature.
Information about the different clustering algorithms is available here: https://tscluster.readthedocs.io/en/latest/introduction.html
We recommend either Sequential Label Analysis ('z1c0') or the default 'z1c1'.
Users can choose to only input the network table, in which case clustering_prep will be run for them with the default columns,
or they can choose to run clustering_prep on their own and then have the option to apply one or both of the
normalization methods available in tscluster.preprocessing.utils.
Parameters:
network_table(NetworkTable):The NetworkTable containing the data to be clustered.
G(nx.Graph):The result of pc.create_network().
num_clusters(int):The number of clusters that the algorithm will find.
algo(str | None):The algorithm that tscluster will use, either 'greedy' (default) or 'opt'. 'greedy' runs GreedyTSCluster, which is a faster and easier, but less accurate, method than OptTSCluster. Since it doesn’t require a special academic licence, we recommend 'greedy' for any non-academic users. 'opt' runs OptTSCluster, which is guaranteed to find the optimal clustering but requires a Gurobi academic licence to run the clustering algorithm. More information about obtaining an academic licence can be found here: https://www.gurobi.com/academia/academic-program-and-licenses/
scheme(str | None):The clustering scheme. See the function description above for more information. Default is 'z1c1'.
arr(np.ndarray[np.float64] | None):The array of data to be clustered. If none, arr and label_dict will be generated by running clustering_prep with the default columns. See the clustering_prep documentation for why we DO NOT recommend leaving this blank.
label_dict(dict[str, Any] | None):The label dictionary corresponding to the data array. See arr.
Returns:
ClusteredNetworkTable:The ClusteredNetworkTable object. Note that cluster also adds the resulting cluster assignments to the network table and nodes as an additional feature.
Module 3: Visualization and Analysis
Module 3 Functions
Network Visualizations
plot_subnetwork
Draws a subgraph of the network representation. If neither a list of ids to show nor a list of paths to show is given, picks num_to_sample random nodes from the first census year in the data and plots a subnetwork of their paths.
Hovering over each node shows the paths the node is part of.
Parameters:
network_table(NetworkTable):The NetworkTable containing the data.
G(nx.Graph):The network containing the data.
years(List[str] | None):A list of years to show in the subnetwork. Default is all census years present in the data.
paths_to_show(List[int] | None):A list of paths (numbered according to their position in network_table) whose points will be plotted in the subnetwork.
ids_to_show(List[str] | None):A list of ids (use the same type of id you used when creating the graph and network table) that will be plotted in the subnetwork. If both paths_to_show and ids_to_show are given, the function will only consider ids_to_show.
num_to_sample(int | None):The number of random nodes to plot the paths of in the subnetwork. Default is 4. Note: A large num_to_sample value may result in an unorganized and hard-to-read visualization.
Returns:
plotly.graph_objects.Figure:The interactive subnetwork plot.
plot_num_areas
Plots the number of geographical areas across a subset of census years in the data.
Parameters:
network_table(NetworkTable):The NetworkTable containing the data.
years(List[str] | None):A list of years to show in the plot. Default is all census years present in the data.
Returns:
plotly.graph_objects.Figure:The plot of the number of geographical areas.
Clustering Visualizations
plot_clusters_scatter
Creates a plotly scatterplot for each variable used in clustering with each timestep on the x axis and values on the y axis. The colours of data points correspond to their assigned cluster, and there is a legend showing which colour goes with which cluster. (Cluster numbers start at 0.)
Since cluster assignment often changes along the same path (or within the same area) over the years, plotting all the data points in one cluster often involves considering other clusters as well. Therefore, when you select a cluster to plot, you will see every path that contains a point in that cluster, and some of these paths will also contain points in different clusters.
Add any clusters you don’t want to see (e.g. a cluster composed of NaN values) to clusters_to_exclude. This will exclude all paths containing these clusters, even paths that also have points in the clusters listed in clusters_to_show. In addition, you can curate the specific paths you want to see with paths_to_show; just make sure the paths are numbered according to their position in network_table.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
label_dict(dict[str, Any] | None):A custom label dictionary.
years(List[str] | None):A list of years to show in the plot. Default is all census years present in the data.
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
dynamic_paths_only(bool | None):A boolean indicating whether to only plot dynamic entities (entities whose cluster assignment has changed over time). Default is true.
paths_to_show(List[int] | None):A list of paths (numbered according to their position in network_table) whose points will be plotted. Default is every path.
ids_to_show(List[str] | None):A list of ids (use the same type of id you used when creating the graph and network table) that will be plotted. Default is every id. If both paths_to_show and ids_to_show are given, the function will only consider ids_to_show.
clusters_to_show(List[int] | None):A list of the clusters whose points will be displayed in the plot. Default is every cluster.
clusters_to_exclude(List[int] | None):A list of the clusters whose points will NOT be displayed in the plot. Default is an empty list.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
Returns:
List[plotly.graph_objects.Figure]:A list of figures, one for each variable used in clustering. You cannot show the whole list at once; iterate through the list and show each figure.
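The path selection rules described for plot_clusters_scatter can be summarised in a short sketch. The function `select_paths` and the sample assignments are hypothetical and only mirror the documented behaviour of clusters_to_show, clusters_to_exclude, and dynamic_paths_only; they are not taken from piccard's source.

```python
def select_paths(assignments, clusters_to_show=None, clusters_to_exclude=(),
                 dynamic_paths_only=True):
    """Return the indices of paths that would be plotted.

    `assignments[i]` holds the cluster label of path i in each census year.
    """
    selected = []
    for path_index, labels in enumerate(assignments):
        seen = set(labels)
        # A path is shown if any of its points falls in a requested cluster.
        if clusters_to_show is not None and not seen & set(clusters_to_show):
            continue
        # Exclusion wins, even if the path also touches a requested cluster.
        if seen & set(clusters_to_exclude):
            continue
        # Dynamic paths are those whose cluster assignment changes over time.
        if dynamic_paths_only and len(seen) == 1:
            continue
        selected.append(path_index)
    return selected


# Hypothetical cluster labels for four paths over three census years.
assignments = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 2, 2],
    [2, 2, 1],
]

shown = select_paths(assignments, clusters_to_show=[1])
shown_without_2 = select_paths(assignments, clusters_to_show=[1], clusters_to_exclude=[2])
everything = select_paths(assignments, dynamic_paths_only=False)
```

Selecting cluster 1 still brings in points from clusters 0 and 2, because whole paths are plotted rather than isolated points.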
plot_clusters_parallelcats
Creates an interactive parallel categories (parallel sets) plot to visualize how cluster assignments evolve over time.
Each column in the plot corresponds to a time point (e.g., a census year), and each path across the columns represents a “temporal path” of a tract or unit as it transitions across categories.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
years(List[str] | None):A list of years to show. Default is all census years present in the data.
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
colour_index_year(str | None):The year that will be used to determine the colours of the parallel plot. For example, if you chose 2011 as the colour index year, every cluster in the 2011 dimension would have a colour assigned to it, and then the paths into and out of these clusters would be shown in those colours. Default is the first year in the network table, and if an invalid input is given, the default will be used.
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
Returns:
plotly.graph_objects.Figure:The interactive parallel categories plot.
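A parallel categories plot is essentially a picture of how many paths follow each sequence of cluster labels. The sketch below counts those sequences with the standard library; the data and the helper `transition_counts` are hypothetical and only illustrate the quantity being visualized.

```python
from collections import Counter

years = ["2011", "2016", "2021"]

# Hypothetical cluster labels for four paths, one label per year.
assignments = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [2, 2, 2],
]


def transition_counts(assignments, from_index, to_index):
    """Count how many paths move from one cluster to another between two years."""
    return Counter((labels[from_index], labels[to_index]) for labels in assignments)


# Flows between the first two census years.
first_step = transition_counts(assignments, 0, 1)

# Full trajectories: each distinct sequence corresponds to one ribbon in the plot.
trajectories = Counter(tuple(labels) for labels in assignments)
```

Here two of the three paths that start in cluster 0 stay there in 2016, and the ribbon for the sequence (0, 0, 1) would be twice as wide as the others.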
plot_clusters_area
Creates an interactive area chart to visualize how cluster assignments evolve over time.
Each column in the plot corresponds to a time point (e.g., a census year), and each path across the columns represents a “temporal path” of a tract or unit as it transitions across categories.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
years(List[str] | None):A list of years to show. Default is all census years present in the data.
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
stacked(bool | None):Whether to show the area plot as a stacked plot, with all the areas on top of each other. If False, shows the area plot as a regular line graph. Default is True.
Returns:
plotly.graph_objects.Figure:The interactive area plot.
plot_clusters_map
Plots cluster assignments in their associated geographical regions for a specific year using a GeoDataFrame.
Parameters:
geofile_path(str):Path to geographical data file
network_table(ClusteredNetworkTable):Network table to be merged with GeoJSON
year(str):Year to visualize (used in column name)
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
label_dict(dict[str, Any] | None):The label dictionary from pc.clustering_prep() that you used in pc.cluster() or a custom label dictionary. Used to determine what data will be shown when you hover over each geographical region. If None, only the index (path number) will be shown.
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
Returns:
plotly.graph_objects.Figure:The interactive choropleth map, created with plotly.express.choropleth.
plot_line_means
Creates an interactive line chart with one subplot per feature, showing how cluster mean values evolve over the selected years.
For each year in years, plots the mean value of each feature in selected_features for every cluster.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
years(List[str] | None):A list of years to show. Default is all census years present in the data.
selected_features(List[str]):Which features (column names present in clustering) to plot
varnames(List[str] | None):The custom variable names to plot
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
Returns:
plotly.graph_objects.Figure:The line chart with subplots.
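The quantity behind the line chart is a mean per cluster and per year for each selected feature. A minimal sketch of that aggregation for a single feature, using hypothetical records and a hypothetical helper `cluster_means`, could look like this:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations of one feature: (year, cluster, value).
records = [
    ("2011", 0, 4000), ("2011", 0, 3000), ("2011", 1, 1000),
    ("2016", 0, 4400), ("2016", 1, 1200), ("2016", 1, 1400),
]


def cluster_means(records):
    """Average the feature within each (year, cluster) group."""
    grouped = defaultdict(list)
    for year, cluster, value in records:
        grouped[(year, cluster)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


means = cluster_means(records)

# One line per cluster: the cluster's mean in each year, in chronological order.
line_for_cluster_0 = [means[(year, 0)] for year in ("2011", "2016")]
```

Each cluster contributes one line per subplot, and each subplot corresponds to one entry of selected_features.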
plot_bar_means
Creates grouped bar-chart subplots of cluster means for each year.
For each year in years, plots the mean value of each feature in selected_features for every cluster. Subplots are arranged in a grid with two columns.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
years(List[str] | None):A list of years to show. Default is all census years present in the data.
selected_features(List[str]):Which features (column names present in clustering) to plot
varnames(List[str] | None):The custom variable names to plot
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
Returns:
plotly.graph_objects.Figure:The bar chart with subplots.
radar_chart_multiple_years
Creates a radar (polar) chart of selected variables for a given cluster across years.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
years(List[str] | None):A list of years to show. Default is all census years present in the data.
selected_cluster(int):Which cluster to plot
selected_features(List[str]):Which features (column names present in clustering) to plot
varnames(List[str] | None):The custom variable names to plot
year_colours(dict[int, str] | None):A dict mapping indices of years to their corresponding colours. For example, if your data goes from 2006 to 2021, 2006 corresponds to index 0, 2011 to 1, etc. If None, plotly’s default colour map will be used. If a year is not part of the dict, plotly’s default colour map will be used for that year.
cluster_label(str | None):The custom label of the cluster to show. Default is Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
Returns:
plotly.graph_objects.Figure:The radar chart.
radar_chart_multiple_clusters
Creates a radar (polar) chart of selected variables for a given year across clusters.
Parameters:
network_table(ClusteredNetworkTable):The ClusteredNetworkTable containing the data.
clusters(List[int] | None):A list of clusters to show. Default is all clusters present in the data.
selected_year(str):Which year to plot
selected_features(List[str]):Which features (column names present in clustering) to plot
varnames(List[str] | None):The custom variable names to plot
cluster_colours(dict[int, str] | None):A dict mapping cluster numbers to their corresponding colours. If None, plotly’s default colour map will be used. If a cluster number is not part of the dict, plotly’s default colour map will be used for that cluster.
cluster_labels(List[str] | None):A custom list of cluster names. Default is Cluster 0, …, Cluster n.
figsize(Tuple[float, float] | None):A tuple indicating the width and height of each figure that will be shown. Default is (700, 500).
Returns:
plotly.graph_objects.Figure:The radar chart.
Probabilistic Analysis
prob_reasoning_networks
Allows probabilistic reasoning over network representations of heterogeneous/unlinked datasets using the ppandas package.
For more information about ppandas, visit: https://github.com/D3Mlab/ppandas/tree/master
Takes in two network tables and lists of independent and dependent variables for each, performs and visualizes a join, and returns the resulting PDataFrame (which can be used to obtain information about conditional probabilities). This function is recommended if you have datasets from different sources or datasets that designate geographical regions using different units.
The second list of independent variables must be a subset of the first, so make sure the column names are the same before passing them into this function. However, mismatches in independent variable column data allowed by ppandas are okay.
Parameters:
network_table_1(NetworkTable | pd.DataFrame | gpd.GeoDataFrame):The reference network table. Typically the network table associated with the data assumed to be more unbiased and reliable.
network_table_2(NetworkTable | pd.DataFrame | gpd.GeoDataFrame):The second network table whose independent and dependent variables will be joined into a probabilistic model of network_table_1.
independent_vars_1(List[str]):A list of independent variables associated with network_table_1. Must be columns in network_table_1.
independent_vars_2(List[str]):A list of independent variables associated with network_table_2. Must be columns in network_table_2 and every column in independent_vars_2 must also appear in independent_vars_1.
dependent_vars_1(List[str]):A list of dependent variables associated with network_table_1. Must be columns in network_table_1.
dependent_vars_2(List[str]):A list of dependent variables associated with network_table_2. Must be columns in network_table_2. Unlike with independent variables, not every column in dependent_vars_2 also has to appear in dependent_vars_1.
mismatches(dict[str, str] | None):A dictionary of the mismatches PDataFrame.pjoin will handle. Must be in format {<independent variable name>: <’categorical’ | ‘numerical’ | ‘spatial’> }. See the link above for more information.
Returns:
PDataFrame:The result of joining the two probabilistic models of network tables.
prob_reasoning_years
Allows probabilistic reasoning over network representations of heterogeneous/unlinked datasets using the ppandas package.
For more information about ppandas, visit: https://github.com/D3Mlab/ppandas/tree/master
Takes in two years from the same network table and lists of independent and dependent variables for each, performs and visualizes a join, and returns the resulting PDataFrame (which can be used to obtain information about conditional probabilities).
The second list of independent variables must be a subset of the first, so make sure the column names are the same before passing them into this function. Mismatches in independent variable column data allowed by ppandas are okay.
Parameters:
network_table(NetworkTable):The network table containing the data.
year_1(str):The first year examined.
year_2(str):The second year examined.
independent_vars_1(List[str]):A list of independent variables associated with year_1. Must be columns in network_table and end in year_1.
independent_vars_2(List[str]):A list of independent variables associated with year_2. Must be columns in network_table and end in year_2. The columns (minus year 2) must be a subset of independent_vars_1 (minus year 1).
dependent_vars_1(List[str]):A list of dependent variables associated with year_1. Must be columns in network_table and end in year_1.
dependent_vars_2(List[str]):A list of dependent variables associated with year_2. Must be columns in network_table and end in year_2. Unlike with independent variables, not every column in dependent_vars_2 also has to appear in dependent_vars_1.
mismatches(dict[str, str] | None):A dictionary of the mismatches PDataFrame.pjoin will handle. Must be in format {<independent variable name>: <’categorical’ | ‘numerical’ | ‘spatial’> }. See the link above for more information.
Returns:
PDataFrame:The result of joining the two probabilistic models.
Module 4: VariableLinker
Overview
VariableLinker is a Python framework designed for visualizing the links between census variables across multiple years. It provides multiple approaches for matching census variables between different years and creates hierarchical tree visualizations that show how these variables are connected.
Key Features:
Multiple Matching Algorithms: Jaccard similarity and sentence transformers
Hierarchical Visualization: Creates tree structures showing the parent-child relationships in census data
Colour-coded Results: Visual indicators for data consistency across years
Use Cases:
Census data harmonization across multiple years
Tracking changes in census variables over time
Visualizing data consistency and evolution
Installation and Setup
Prerequisites
pip install -r requirements.txt
Importing VariableLinker
import sys
import os
# Add the src/piccard directory to Python path
current_dir = os.getcwd()
src_path = os.path.join(current_dir, '..', 'src', 'piccard')
sys.path.append(src_path)
from variable_linker import VariableLinker
Core Concepts
1. Census Metadata Structure
VariableLinker works with census metadata JSON files that contain:
Vector identifiers: Unique codes for census variables
Descriptions: Human-readable descriptions of census variables
Types: Categories like “Total”, “Male”, “Female”
Details: Additional contextual information
2. Matching Process
The framework performs two-pass matching:
Exact Match: Find identical descriptions across years
Similarity Match: Use similarity algorithms for inexact matches
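The two passes can be sketched as follows. This is a minimal illustration under stated assumptions, not VariableLinker's implementation: the names `token_jaccard` and `two_pass_match`, the plain `{vector: description}` dictionaries, the vector IDs, and the whitespace tokenizer are all hypothetical.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (illustrative only)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def two_pass_match(source: dict, compare: dict, threshold: float = 0.9) -> dict:
    """Map source vector IDs to compare vector IDs.

    Both inputs are assumed to be {vector_id: description} dictionaries.
    """
    matches = {}

    # Pass 1: exact description matches.
    by_description = {}
    for vec, desc in compare.items():
        by_description.setdefault(desc, vec)
    for vec, desc in source.items():
        if desc in by_description:
            matches[vec] = by_description[desc]

    # Pass 2: best similarity match for anything still unmatched.
    used = set(matches.values())
    for vec, desc in source.items():
        if vec in matches:
            continue
        best_vec, best_score = None, 0.0
        for cmp_vec, cmp_desc in compare.items():
            if cmp_vec in used:
                continue
            score = token_jaccard(desc, cmp_desc)
            if score > best_score:
                best_vec, best_score = cmp_vec, score
        if best_vec is not None and best_score >= threshold:
            matches[vec] = best_vec
            used.add(best_vec)
    return matches


# Hypothetical vector IDs and descriptions.
source = {"v_CA21_1": "Total population", "v_CA21_2": "Population aged 25 to 34 years"}
compare = {"v_CA16_1": "Total population", "v_CA16_9": "Population aged 25 to 34"}
result = two_pass_match(source, compare, threshold=0.8)
```

Running the exact pass first keeps identical descriptions from competing in the slower similarity pass.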
3. Tree Visualization
Nodes: Represent census variables
Edges: Show parent-child relationships
Colours: Indicate consistency across years
Grey: Source year only
Salmon: Matches in 1 other year
Yellow: Matches in 2 other years
Light green: Matches in 3+ other years
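The colour rule above amounts to a lookup on the number of other years in which a variable was matched. A minimal sketch, assuming a hypothetical helper `node_colour` and assuming the colours are emitted as the Graphviz colour names shown (the strings the library actually writes may differ):

```python
def node_colour(match_count: int) -> str:
    """Return a colour name for a node matched in `match_count` other years."""
    if match_count <= 0:
        return "grey"        # source year only
    if match_count == 1:
        return "salmon"      # matched in 1 other year
    if match_count == 2:
        return "yellow"      # matched in 2 other years
    return "lightgreen"      # matched in 3 or more other years


colours = [node_colour(n) for n in range(5)]
```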
VariableLinker Class Reference
Class Overview
class VariableLinker:
    """
    A class for processing census metadata and creating tree visualizations.
    This class provides functionality for:
    - Preprocessing census metadata from JSON files
    - Computing similarity between census descriptions using various methods
    - Matching descriptions across different census years
    - Building hierarchical tree visualizations with colour-coding
    """
Static Methods
preprocess_census_metadata(path, type_filter="Total")
Preprocesses census metadata from JSON files.
Parameters:
path(str): Path to the JSON file containing census metadata
type_filter(str): Type of records to filter for (default: “Total”)
Returns:
pd.DataFrame: Preprocessed DataFrame with columns [‘vector’, ‘type’, ‘description’, …]
Example:
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
jaccard_similarity(sentence1, sentence2)
Computes Jaccard similarity between two census descriptions.
Parameters:
sentence1(str): First census description
sentence2(str): Second census description
Returns:
float: Jaccard similarity score between 0.0 and 1.0
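For reference, the standard Jaccard similarity of two token sets A and B is the size of their intersection divided by the size of their union. The worked numbers below use made-up token sets and assume the sets are compared directly, which the source does not spell out:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Worked example with hypothetical token sets:
% A = \{\text{population}, \text{aged}, 25, 34, \text{years}\}
% B = \{\text{population}, \text{aged}, 25, 34\}
J(A, B) = \frac{4}{5} = 0.8
```

A pair like this would fall below a 0.9 threshold and would need a lower threshold to count as a match.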
process_discription_text(text)
Processes and tokenizes census text for similarity comparison.
Parameters:
text(str): Raw census description text
Returns:
set: Set of processed tokens (words and numbers, excluding stopwords)
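A tokenizer of this kind can be sketched in a few lines. This is an illustration only: the function name `tokenize_description`, the regular expression, and the short stopword list are assumptions, and the library's own stopword list and rules are not shown in this document.

```python
import re

# Small illustrative stopword list (assumption, not the library's list).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "for", "by", "or"}


def tokenize_description(text: str) -> set:
    """Lowercase the text, keep word and number tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return {tok for tok in tokens if tok not in STOPWORDS}


tokens = tokenize_description("Population aged 25 to 34 years, by sex")
```

Returning a set rather than a list means word order and repetition do not affect the similarity score.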
normalize_ranges(text)
Normalizes numeric ranges in text for consistent processing.
Parameters:
text(str): Text containing potential numeric ranges
Returns:
str: Text with normalized numeric ranges
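The exact normalization rules are not documented here. As one plausible reading, the sketch below rewrites ranges such as "25 to 34" or "25 – 34" into the single form "25-34" so that different spellings of the same range tokenize identically. The name `normalize_ranges_sketch` and the chosen target form are assumptions.

```python
import re

# Two numbers separated by "to", a hyphen, an en dash, or an em dash.
_RANGE = re.compile(r"(\d+)\s*(?:to|-|\u2013|\u2014)\s*(\d+)", flags=re.IGNORECASE)


def normalize_ranges_sketch(text: str) -> str:
    """Rewrite numeric ranges as 'low-high' (hypothetical target form)."""
    return _RANGE.sub(r"\1-\2", text)


normalized = normalize_ranges_sketch("Population aged 25 to 34 years")
```

A pattern this simple also rewrites year spans such as "2016 to 2021", which may or may not be desirable for a given dataset.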
parse_tree_to_dict(filepath)
Parses a Graphviz tree file into a dictionary structure.
Parameters:
filepath(str): Path to the Graphviz tree file
Returns:
Dict: Dictionary mapping node IDs to their information including descriptions, year mappings, and colours
Example:
tree_dict = VariableLinker.parse_tree_to_dict("my_tree")
extract_parent_child_relationships(filepath)
Extracts parent-child relationships from tree file edges.
Parameters:
filepath(str): Path to the tree file (Graphviz format)
Returns:
Dict[str, List[str]]: Dictionary mapping parent nodes to their children
Example:
relationships = VariableLinker.extract_parent_child_relationships("my_tree")
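To show the shape of the returned dictionary, here is a minimal sketch that pulls `parent -> child` edges out of Graphviz DOT source with a regular expression. It is not the library's parser: the function name `parent_child_from_dot`, the node IDs, and the assumption that each edge sits on its own line are all hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:  n1 -> n2   or   "n1" -> "n2" [color=grey];
EDGE = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')


def parent_child_from_dot(dot_source: str) -> dict:
    """Return {parent_id: [child_id, ...]} for every edge line in the source."""
    children = defaultdict(list)
    for line in dot_source.splitlines():
        match = EDGE.match(line)
        if match:
            parent, child = match.groups()
            children[parent].append(child)
    return dict(children)


dot = '''digraph tree {
    n1 [label="Total population"]
    n1 -> n2
    n1 -> n3;
    "n2" -> "n4" [color=grey]
}'''
edges = parent_child_from_dot(dot)
```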
predict_parent_nodes(tree_dict, parent_child_relationships, target_years)
Predicts parent nodes in other years using the additive property.
Parameters:
tree_dict(Dict): Parsed tree dictionary with node info and year mappings
parent_child_relationships(Dict[str, List[str]]): Parent to children mapping
target_years(List[str]): Years to predict parents for (default: [‘2016’, ‘2011’, ‘2006’])
Returns:
Dict[str, List[str]]: Dictionary mapping parent nodes to years in which they can be predicted
Example:
predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships, ['2016', '2011'])
Matching Approaches
1. Jaccard Similarity Matching
Method: match_descriptions_jaccard()
Uses token-based similarity to match descriptions across years.
Advantages:
Good for exact and near-exact matches
Language-agnostic
Disadvantages:
May miss semantic similarities
Sensitive to phrasing
Usage:
jaccard_mapping = VariableLinker.match_descriptions_jaccard(
source_df=data_2021,
compare_df=data_2016,
similarity_threshold=0.9
)
2. Sentence Transformer Matching
Method: match_descriptions_transformer()
Uses pre-trained sentence transformers for semantic similarity matching.
Advantages:
Captures semantic meaning
Better for paraphrased descriptions
Robust to word variations
Faster than Jaccard since it uses vectorization
Disadvantages:
Limited ability to process numeric values and ranges in text descriptions
Usage:
transformer_mapping = VariableLinker.match_descriptions_transformer(
source_df=data_2021,
compare_df=data_2016,
similarity_threshold=0.9,
model_name='all-mpnet-base-v2'
)
3. Advanced Sentence Transformer Matching
Method: match_descriptions_details_sentence_transformer()
Enhanced version of the sentence transformer approach that uses the details field to break ties when multiple exact matches are found.
Advantages:
Attempts better disambiguation using details field
More sophisticated exact matching strategy
Disadvantages:
Performance evaluation indicates higher error rate than basic transformer
Higher computational complexity without performance benefit
Usage:
advanced_mapping = VariableLinker.match_descriptions_details_sentence_transformer(
source_df=data_2021,
compare_df=data_2016,
similarity_threshold=0.9
)
4. Multithreaded Matching
Method: match_descriptions_multithreaded() (from multithreaded_mapping.py)
Jaccard similarity approach with multithreaded execution for enhanced performance on large datasets.
Advantages:
Parallel processing for similarity matching phase
Configurable number of worker threads (default: 4)
Thread-safe operations for similarity matching
Usage:
from multithreaded_mapping import match_descriptions_multithreaded
multithreaded_mapping = match_descriptions_multithreaded(
source_df=data_2021,
compare_df=data_2016,
similarity_threshold=0.9,
max_workers=8
)
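The general pattern behind a multithreaded similarity pass can be sketched with the standard library's `ThreadPoolExecutor`. This is not the code in multithreaded_mapping.py: the helper names, the `(id, token_set)` input format, and the sample data are assumptions. For pure-Python scoring, the speed-up from threads may also be limited by the interpreter's global lock.

```python
from concurrent.futures import ThreadPoolExecutor


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_match(source_item, candidates, threshold):
    """Find the best-scoring candidate for one source item, or None."""
    src_id, src_tokens = source_item
    best_id, best_score = None, 0.0
    for cand_id, cand_tokens in candidates:
        score = jaccard(src_tokens, cand_tokens)
        if score > best_score:
            best_id, best_score = cand_id, score
    return src_id, (best_id if best_score >= threshold else None)


def match_in_threads(source, candidates, threshold=0.9, max_workers=4):
    """Score each source item on a worker thread; candidates are only read."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda item: best_match(item, candidates, threshold), source)
        return dict(results)


# Hypothetical pre-tokenized descriptions.
source = [("s1", {"total", "population"}), ("s2", {"median", "income"})]
candidates = [("c1", {"total", "population"}), ("c2", {"average", "income"})]
matched = match_in_threads(source, candidates, threshold=0.9, max_workers=2)
```

Because each worker only reads the shared candidate list and returns its own result, no locking is needed in this sketch.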
Workflow Examples
Basic Workflow
# 1. Load and preprocess data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
# 2. Perform matching
mapping_21_16 = VariableLinker.match_descriptions_jaccard(
source_df=data_2021,
compare_df=data_2016,
similarity_threshold=0.9
)
# 3. Merge mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
# 4. Build visualization
tree = VariableLinker.build_tree(data_2021, merged_df, "my_tree", "output_path")
Multi-Year Workflow
# Load data for multiple years
data_2006 = VariableLinker.preprocess_census_metadata("census_ca06_full_metadata.json")
data_2011 = VariableLinker.preprocess_census_metadata("census_ca11_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
# Match against 2021 (latest year)
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
mapping_21_11 = VariableLinker.match_descriptions_jaccard(data_2021, data_2011, 0.9)
mapping_21_06 = VariableLinker.match_descriptions_jaccard(data_2021, data_2006, 0.9)
# Merge all mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16, mapping_21_11, mapping_21_06)
# Build comprehensive tree
tree = VariableLinker.build_tree(data_2021, merged_df, "multi_year_tree", "trees/")
Comparison of Approaches
# Jaccard approach
jaccard_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
jaccard_merged = VariableLinker.merge_mappings(data_2021, jaccard_mapping)
jaccard_tree = VariableLinker.build_tree(data_2021, jaccard_merged, "jaccard_tree", "trees/")
# Transformer approach
transformer_mapping = VariableLinker.match_descriptions_transformer(data_2021, data_2016, 0.9)
transformer_merged = VariableLinker.merge_mappings(data_2021, transformer_mapping)
transformer_tree = VariableLinker.build_tree(data_2021, transformer_merged, "transformer_tree", "trees/")
# Multithreaded approach
from multithreaded_mapping import match_descriptions_multithreaded
multithreaded_mapping = match_descriptions_multithreaded(data_2021, data_2016, 0.9, 8)
multithreaded_merged = VariableLinker.merge_mappings(data_2021, multithreaded_mapping)
multithreaded_tree = VariableLinker.build_tree(data_2021, multithreaded_merged, "multithreaded_tree", "trees/")
Advanced Features
Custom Similarity Thresholds
Different thresholds can be used for different types of data:
# Strict matching for critical variables
critical_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.95)
# Relaxed matching for exploratory analysis
exploratory_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.7)
Model Selection for Transformers
# Use different transformer models
mapping_mini = VariableLinker.match_descriptions_transformer(
data_2021, data_2016, 0.9, 'all-MiniLM-L6-v2'
)
mapping_mpnet = VariableLinker.match_descriptions_transformer(
data_2021, data_2016, 0.9, 'all-mpnet-base-v2'
)
Tree Analysis and Prediction
Overview
VariableLinker provides advanced functionality for analyzing existing tree structures and predicting missing parent nodes based on the additive property of census data.
Key Concepts
Additive Property
In census data, parent variables often represent the sum of their child variables:
Parent_Value = Sum(Child_Values)
This property allows us to predict parent nodes in years where they do not exist or were not matched, as long as all of their children are available in those years.
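As a small worked illustration of the additive property, the sketch below reconstructs a parent value from its children and refuses to do so when any child is missing. The function name `predict_parent_value` and the counts are made up for the example.

```python
from typing import Mapping, Optional


def predict_parent_value(child_values: Mapping[str, Optional[float]]) -> Optional[float]:
    """Return the sum of the children, or None if any child value is missing."""
    if not child_values or any(value is None for value in child_values.values()):
        return None
    return sum(child_values.values())


# Made-up counts for a year in which the parent variable was not matched.
children_2016 = {"Male Population": 1200, "Female Population": 1300}
total_2016 = predict_parent_value(children_2016)

# With one child missing, no prediction is made.
incomplete = predict_parent_value({"Male Population": 1200, "Female Population": None})
```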
Tree Parsing
The framework can parse existing Graphviz tree files to extract:
Node descriptions and metadata
Year-specific vector mappings
Parent-child relationships
Colour-coding information
Workflow for Tree Analysis
# 1. Parse existing tree file
tree_dict = VariableLinker.parse_tree_to_dict("existing_tree.gv")
# 2. Extract parent-child relationships
relationships = VariableLinker.extract_parent_child_relationships("existing_tree.gv")
# 3. Predict missing parent nodes
predictions = VariableLinker.predict_parent_nodes(
tree_dict=tree_dict,
parent_child_relationships=relationships,
target_years=['2016', '2011', '2006']
)
# 4. Analyze predictions
for parent_node, predictable_years in predictions.items():
    print(f"Parent '{parent_node}' can be predicted in years: {predictable_years}")
Complete Analysis Workflow
# Load and process census data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
# Create initial tree
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
tree = VariableLinker.build_tree(data_2021, merged_df, "analysis_tree", "trees/")
# Analyze the created tree
tree_dict = VariableLinker.parse_tree_to_dict("trees/analysis_tree")
relationships = VariableLinker.extract_parent_child_relationships("trees/analysis_tree")
# Predict missing parents
predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships)
# Generate report
print("=== Tree Analysis Report ===")
print(f"Total nodes in tree: {len(tree_dict)}")
print(f"Parent-child relationships: {len(relationships)}")
print(f"Predictable parent nodes: {len(predictions)}")
for parent, years in predictions.items():
    parent_desc = tree_dict[parent]['description']
    print(f"\nParent: {parent_desc}")
    print(f" Node ID: {parent}")
    print(f" Predictable in years: {years}")
Prediction Algorithm Details
The prediction algorithm works as follows:
Year Analysis: Identifies the years in which the parent currently exists
Child Verification: For each target year, checks if ALL children exist
Prediction: If all children exist in a target year, the parent can be predicted
Example Scenario:
Parent: “Total Population”
Children: [“Male Population”, “Female Population”]
If “Male Population” and “Female Population” both exist in 2016, but “Total Population” doesn’t exist in 2016, then “Total Population” can be predicted for 2016.
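The three steps can be sketched as below. This is an illustration of the rule as described, not the body of `predict_parent_nodes`: the function name `predictable_years`, the node IDs, and the simplified `{node_id: set_of_years}` input are assumptions, since the real tree dictionary carries more fields.

```python
def predictable_years(node_years: dict, children: dict, target_years: list) -> dict:
    """Return {parent_id: [years in which the parent can be predicted]}.

    node_years: {node_id: set of years in which the node exists}
    children:   {parent_id: [child_id, ...]}
    """
    predictions = {}
    for parent, kids in children.items():
        # Step 1: years in which the parent already exists.
        existing = node_years.get(parent, set())
        # Steps 2 and 3: keep target years where the parent is missing
        # and every child exists.
        years = [
            year
            for year in target_years
            if year not in existing
            and kids
            and all(year in node_years.get(kid, set()) for kid in kids)
        ]
        if years:
            predictions[parent] = years
    return predictions


node_years = {
    "total": {"2021"},
    "male": {"2021", "2016"},
    "female": {"2021", "2016", "2011"},
}
children = {"total": ["male", "female"]}
result = predictable_years(node_years, children, ["2016", "2011", "2006"])
```

Here only 2016 qualifies, because "male" is missing in 2011 and both children are missing in 2006.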
Use Cases for Tree Analysis
Data Completeness Assessment: Identify missing parent nodes across years
Prediction Validation: Verify which parent nodes can be reliably predicted
Performance Considerations
Memory Usage
Large datasets may require significant RAM
Consider processing in chunks for very large datasets
Use multithreaded approach for better memory management
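Chunked processing can be done with a small generator that yields fixed-size batches. This is a generic sketch: the helper name `chunked` is hypothetical, and how each batch would be passed to the matching functions is left open because the document does not specify it.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Example: walk through five records two at a time.
batches = list(chunked(range(5), 2))
```

Only one batch is held in memory at a time when the input is itself a lazy iterator.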
Data Structures
Input DataFrame Format
{
'vector': 'v_CA21_1234',
'type': 'Total',
'description': 'Population aged 25-34 years',
'details': 'Detailed description...'
}
Output Mapping Format
{
'description': 'Population aged 25-34 years',
'vector_base': 'v_CA21_1234',
'vector_cmp': 'v_CA16_1234'
}
Merged Mapping Format
{
'description': 'Population aged 25-34 years',
'vector_base': 'v_CA21_1234',
'vector_cmp_list': ['v_CA16_1234', 'v_CA11_1234', 'v_CA06_1234']
}
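To show how the output mapping format relates to the merged format, the sketch below groups per-year mapping records by `vector_base` and collects their `vector_cmp` values into `vector_cmp_list`. It works on plain dictionaries and is not `VariableLinker.merge_mappings`; the function name `merge_mapping_records` is hypothetical.

```python
def merge_mapping_records(*mappings):
    """Combine per-year mapping records into merged records.

    Each mapping is a list of dicts with the keys
    'description', 'vector_base', and 'vector_cmp'.
    """
    merged = {}
    for mapping in mappings:
        for record in mapping:
            entry = merged.setdefault(
                record["vector_base"],
                {
                    "description": record["description"],
                    "vector_base": record["vector_base"],
                    "vector_cmp_list": [],
                },
            )
            entry["vector_cmp_list"].append(record["vector_cmp"])
    return list(merged.values())


def make_record(vector_cmp):
    """Build one mapping record for the example variable."""
    return {
        "description": "Population aged 25-34 years",
        "vector_base": "v_CA21_1234",
        "vector_cmp": vector_cmp,
    }


merged_records = merge_mapping_records(
    [make_record("v_CA16_1234")],
    [make_record("v_CA11_1234")],
    [make_record("v_CA06_1234")],
)
```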
Troubleshooting
Import Errors
# Solution: Add correct path
import sys
sys.path.append('../src/piccard')
from variable_linker import VariableLinker
File Not Found Errors
# Check file paths
import os
print("Current directory:", os.getcwd())
print("Files available:", os.listdir('.'))
Memory Issues
Reduce batch size for large datasets
Use multithreaded approach
Process data in chunks
Poor Matching Results
Adjust similarity threshold
Try different matching approaches
Check data quality and consistency
Configuration Options
Similarity Thresholds
Strict: 0.95+ for critical variables
Standard: 0.9 for most use cases
Relaxed: 0.7-0.8 for exploratory analysis
Transformer Models
'all-MiniLM-L6-v2': Fast, good accuracy
'all-mpnet-base-v2': Best accuracy, slower
Other transformer models can be found at [SBERT Pretrained Models](https://sbert.net/docs/sentence_transformer/pretrained_models.html)