Module 4: VariableLinker
Overview
VariableLinker is a Python framework designed for visualizing the links between census variables across multiple years. It provides multiple approaches for matching census variables between different years and creates hierarchical tree visualizations that show how these variables are connected.
Key Features:
Multiple Matching Algorithms: Jaccard similarity and sentence transformers
Hierarchical Visualization: Creates tree structures showing the parent-child relationships in census data
Colour-coded Results: Visual indicators for data consistency across years
Use Cases:
Census data harmonization across multiple years
Tracking changes in census variables over time
Visualizing data consistency and evolution
Installation and Setup
Prerequisites
pip install -r requirements.txt
Importing VariableLinker
import sys
import os
# Add the src/piccard directory to Python path
current_dir = os.getcwd()
src_path = os.path.join(current_dir, '..', 'src', 'piccard')
sys.path.append(src_path)
from variable_linker import VariableLinker
Core Concepts
1. Census Metadata Structure
VariableLinker works with census metadata JSON files that contain:
Vector identifiers: Unique codes for census variables
Descriptions: Human-readable descriptions of census variables
Types: Categories like “Total”, “Male”, “Female”
Details: Additional contextual information
2. Matching Process
The framework performs two-pass matching:
Exact Match: Find identical descriptions across years
Similarity Match: Use similarity algorithms for inexact matches
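The two passes can be sketched as follows. This is a minimal illustration, not the framework's implementation: `two_pass_match`, the `{vector_id: description}` dict inputs, and the whitespace-token `jaccard` helper are all hypothetical stand-ins chosen to keep the example small.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercase whitespace tokens (simplified)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def two_pass_match(source: dict, compare: dict, threshold: float = 0.9) -> dict:
    """Map source vector IDs to compare vector IDs (hypothetical dict inputs)."""
    matches = {}

    # Pass 1: exact description match.
    by_description = {}
    for cmp_vec, cmp_desc in compare.items():
        by_description.setdefault(cmp_desc, cmp_vec)  # keep the first on duplicates
    for vec, desc in source.items():
        if desc in by_description:
            matches[vec] = by_description[desc]

    # Pass 2: best similarity match for whatever pass 1 left unmatched.
    for vec, desc in source.items():
        if vec in matches:
            continue
        best_vec, best_score = None, 0.0
        for cmp_vec, cmp_desc in compare.items():
            score = jaccard(desc, cmp_desc)
            if score > best_score:
                best_vec, best_score = cmp_vec, score
        if best_vec is not None and best_score >= threshold:
            matches[vec] = best_vec
    return matches


source = {"v_a1": "Total population", "v_a2": "Population aged 25 to 34 years"}
compare = {"v_b1": "Total population", "v_b2": "Population aged 25 to 34"}
strict = two_pass_match(source, compare, threshold=0.9)   # exact match only
relaxed = two_pass_match(source, compare, threshold=0.8)  # also the near match
```

With these toy inputs the second description scores 5/6 against its closest candidate, so it is matched only when the threshold is lowered to 0.8.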
3. Tree Visualization
Nodes: Represent census variables
Edges: Show parent-child relationships
Colours: Indicate consistency across years
Grey: Source year only
Salmon: Matches in 1 other year
Yellow: Matches in 2 other years
Light green: Matches in 3+ other years
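The colour rule above amounts to a lookup on the number of other years in which a variable was matched. A minimal sketch, assuming a hypothetical helper `node_colour` and Graphviz-style colour names (the names the framework actually passes to Graphviz are not stated here):

```python
def node_colour(matched_years: int) -> str:
    """Return a colour name for the number of other years a variable matched in."""
    if matched_years <= 0:
        return "grey"        # source year only
    if matched_years == 1:
        return "salmon"      # matched in 1 other year
    if matched_years == 2:
        return "yellow"      # matched in 2 other years
    return "lightgreen"      # matched in 3 or more other years


colours = [node_colour(n) for n in range(5)]
```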
VariableLinker Class Reference
Class Overview
class VariableLinker:
    """
    A class for processing census metadata and creating tree visualizations.

    This class provides functionality for:
    - Preprocessing census metadata from JSON files
    - Computing similarity between census descriptions using various methods
    - Matching descriptions across different census years
    - Building hierarchical tree visualizations with colour-coding
    """
Static Methods
preprocess_census_metadata(path, type_filter="Total")
Preprocesses census metadata from JSON files.
Parameters:
path(str): Path to the JSON file containing census metadata
type_filter(str): Type of records to filter for (default: “Total”)
Returns:
pd.DataFrame: Preprocessed DataFrame with columns [‘vector’, ‘type’, ‘description’, …]
Example:
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
jaccard_similarity(sentence1, sentence2)
Computes Jaccard similarity between two census descriptions.
Parameters:
sentence1(str): First census description
sentence2(str): Second census description
Returns:
float: Jaccard similarity score between 0.0 and 1.0
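For two token sets A and B, the Jaccard similarity is |A ∩ B| / |A ∪ B|. The sketch below applies that definition to plain whitespace tokens. The name `jaccard_sketch` is hypothetical, and the class's own tokenization (see process_discription_text) may yield different scores for the same inputs.

```python
def jaccard_sketch(sentence1: str, sentence2: str) -> float:
    """Jaccard similarity of two strings, using lowercase whitespace tokens."""
    tokens1 = set(sentence1.lower().split())
    tokens2 = set(sentence2.lower().split())
    union = tokens1 | tokens2
    if not union:
        return 1.0  # convention chosen here: two empty descriptions count as identical
    return len(tokens1 & tokens2) / len(union)


# 5 shared tokens out of 6 distinct tokens, so the score is 5/6.
score = jaccard_sketch("Population aged 25 to 34 years", "Population aged 25 to 34")
```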
process_discription_text(text)
Processes and tokenizes census text for similarity comparison.
Parameters:
text(str): Raw census description text
Returns:
set: Set of processed tokens (words and numbers, excluding stopwords)
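A rough sketch of this kind of tokenization, under stated assumptions: the function name `tokenize_description` and the short `STOPWORDS` set are hypothetical, and the stopword list the framework actually uses is not given here.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "in", "to", "for", "a", "an", "by", "with"}


def tokenize_description(text: str) -> set:
    """Lowercase the text, keep runs of letters or digits, and drop stopwords."""
    words = re.findall(r"[a-z]+|\d+", text.lower())
    return {word for word in words if word not in STOPWORDS}


tokens = tokenize_description("Population aged 25 to 34 years")
```

Returning a set rather than a list means repeated words count once, which is what a Jaccard comparison expects.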
normalize_ranges(text)
Normalizes numeric ranges in text for consistent processing.
Parameters:
text(str): Text containing potential numeric ranges
Returns:
str: Text with normalized numeric ranges
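What counts as a normalized range is not specified above. The sketch below assumes one plausible convention: rewrite "25 to 34", "25 - 34", and "25–34" as "25-34" so that equivalent ranges produce the same text. The name `normalize_ranges_sketch` and the pattern are illustrative, not the framework's own.

```python
import re

# Two numbers joined by a hyphen, an en dash, or the word "to".
RANGE_PATTERN = re.compile(r"(\d+)\s*(?:-|–|to)\s*(\d+)")


def normalize_ranges_sketch(text: str) -> str:
    """Rewrite numeric ranges in the canonical form 'low-high'."""
    return RANGE_PATTERN.sub(r"\1-\2", text)


normalized = normalize_ranges_sketch("Population aged 25 to 34 years")
```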
parse_tree_to_dict(filepath)
Parses a Graphviz tree file into a dictionary structure.
Parameters:
filepath(str): Path to the Graphviz tree file
Returns:
Dict: Dictionary mapping node IDs to their information including descriptions, year mappings, and colours
Example:
tree_dict = VariableLinker.parse_tree_to_dict("my_tree")
extract_parent_child_relationships(filepath)
Extracts parent-child relationships from tree file edges.
Parameters:
filepath(str): Path to the tree file (Graphviz format)
Returns:
Dict[str, List[str]]: Dictionary mapping parent nodes to their children
Example:
relationships = VariableLinker.extract_parent_child_relationships("my_tree")
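To make the returned structure concrete, here is a minimal sketch of how edges in Graphviz DOT text could be turned into a parent-to-children mapping. It assumes edges are written one per line as `parent -> child`; the function `edges_to_children`, the regular expression, and the sample node IDs are hypothetical and do not reproduce the method's actual parsing.

```python
import re
from collections import defaultdict

# Matches lines such as:  v_1 -> v_2   or   "v_1" -> "v_2";
EDGE_PATTERN = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')


def edges_to_children(dot_source: str) -> dict:
    """Collect {parent: [children]} from the edge lines of DOT text."""
    children = defaultdict(list)
    for line in dot_source.splitlines():
        match = EDGE_PATTERN.match(line)
        if match:
            children[match.group(1)].append(match.group(2))
    return dict(children)


sample = '''digraph {
    v_1 [label="Total"]
    v_1 -> v_2
    v_1 -> v_3
    "v_2" -> "v_4";
}'''
relationships_sketch = edges_to_children(sample)
```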
predict_parent_nodes(tree_dict, parent_child_relationships, target_years)
Predicts parent nodes in other years using the additive property.
Parameters:
tree_dict(Dict): Parsed tree dictionary with node info and year mappings
parent_child_relationships(Dict[str, List[str]]): Parent to children mapping
target_years(List[str]): Years to predict parents for (default: [‘2016’, ‘2011’, ‘2006’])
Returns:
Dict[str, List[str]]: Dictionary mapping parent nodes to years in which they can be predicted
Example:
predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships, ['2016', '2011'])
Matching Approaches
1. Jaccard Similarity Matching
Method: match_descriptions_jaccard()
Uses token-based similarity to match descriptions across years.
Advantages:
Good for exact and near-exact matches
Language-agnostic
Disadvantages:
May miss semantic similarities
Sensitive to phrasing
Usage:
jaccard_mapping = VariableLinker.match_descriptions_jaccard(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)
2. Sentence Transformer Matching
Method: match_descriptions_transformer()
Uses pre-trained sentence transformers for semantic similarity matching.
Advantages:
Captures semantic meaning
Better for paraphrased descriptions
Robust to word variations
Faster than Jaccard since it uses vectorization
Disadvantages:
Limited ability to process numeric values and ranges in text descriptions
Usage:
transformer_mapping = VariableLinker.match_descriptions_transformer(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9,
    model_name='all-mpnet-base-v2'
)
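Embedding-based matching commonly scores two descriptions by the cosine similarity of their vectors and keeps the best candidate above a threshold. The sketch below shows that scoring step only, with toy 3-dimensional vectors standing in for real model output. Whether match_descriptions_transformer scores exactly this way is an assumption, and `cosine_similarity` and `best_semantic_match` are hypothetical names.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_semantic_match(source_vec, candidate_vecs, threshold=0.9):
    """Return the ID of the closest candidate, or None if it scores below threshold."""
    best_id, best_score = None, -1.0
    for cand_id, vec in candidate_vecs.items():
        score = cosine_similarity(source_vec, vec)
        if score > best_score:
            best_id, best_score = cand_id, score
    return best_id if best_score >= threshold else None


# Toy vectors standing in for real sentence embeddings.
candidates = {"v_b1": [1.0, 0.0, 0.0], "v_b2": [0.6, 0.8, 0.0]}
match = best_semantic_match([0.6, 0.79, 0.05], candidates)   # close to v_b2
no_match = best_semantic_match([0.0, 0.0, 1.0], candidates)  # orthogonal to both
```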
3. Advanced Sentence Transformer Matching
Method: match_descriptions_details_sentence_transformer()
An enhanced version of the sentence transformer approach that uses the details field to break ties when multiple exact matches are found.
Advantages:
Attempts better disambiguation using details field
More sophisticated exact matching strategy
Disadvantages:
Performance evaluation indicates a higher error rate than the basic transformer approach
Higher computational complexity without a corresponding performance benefit
Usage:
advanced_mapping = VariableLinker.match_descriptions_details_sentence_transformer(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)
4. Multithreaded Matching
Method: match_descriptions_multithreaded() (from multithreaded_mapping.py)
Jaccard similarity approach with multithreaded execution for enhanced performance on large datasets.
Advantages:
Parallel processing for similarity matching phase
Configurable number of worker threads (default: 4)
Thread-safe operations for similarity matching
Usage:
from multithreaded_mapping import match_descriptions_multithreaded
multithreaded_mapping = match_descriptions_multithreaded(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9,
    max_workers=8
)
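The general pattern behind a threaded similarity pass can be sketched with the standard library's `ThreadPoolExecutor`. This is an illustration of the idea, not the contents of multithreaded_mapping.py: `match_in_parallel`, `best_match`, the list-of-strings inputs, and the simplified `jaccard` helper are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercase whitespace tokens (simplified)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def best_match(description, candidates, threshold):
    """Return the most similar candidate, or None if it scores below threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = jaccard(description, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


def match_in_parallel(sources, candidates, threshold=0.9, max_workers=4):
    """Score each source description on a worker thread; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda d: best_match(d, candidates, threshold), sources))
    return dict(zip(sources, results))


sources = ["Total population", "Population aged 25 to 34 years"]
candidates = ["Total population", "Population aged 25 to 34"]
parallel_result = match_in_parallel(sources, candidates, threshold=0.8, max_workers=2)
```

Each worker only reads the shared candidate list and returns its own result, so no locking is needed. In CPython, threads share the interpreter lock, so pure-Python scoring may speed up less than the worker count suggests.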
Workflow Examples
Basic Workflow
# 1. Load and preprocess data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
# 2. Perform matching
mapping_21_16 = VariableLinker.match_descriptions_jaccard(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)
# 3. Merge mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
# 4. Build visualization
tree = VariableLinker.build_tree(data_2021, merged_df, "my_tree", "output_path")
Multi-Year Workflow
# Load data for multiple years
data_2006 = VariableLinker.preprocess_census_metadata("census_ca06_full_metadata.json")
data_2011 = VariableLinker.preprocess_census_metadata("census_ca11_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
# Match against 2021 (latest year)
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
mapping_21_11 = VariableLinker.match_descriptions_jaccard(data_2021, data_2011, 0.9)
mapping_21_06 = VariableLinker.match_descriptions_jaccard(data_2021, data_2006, 0.9)
# Merge all mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16, mapping_21_11, mapping_21_06)
# Build comprehensive tree
tree = VariableLinker.build_tree(data_2021, merged_df, "multi_year_tree", "trees/")
Comparison of Approaches
# Jaccard approach
jaccard_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
jaccard_merged = VariableLinker.merge_mappings(data_2021, jaccard_mapping)
jaccard_tree = VariableLinker.build_tree(data_2021, jaccard_merged, "jaccard_tree", "trees/")
# Transformer approach
transformer_mapping = VariableLinker.match_descriptions_transformer(data_2021, data_2016, 0.9)
transformer_merged = VariableLinker.merge_mappings(data_2021, transformer_mapping)
transformer_tree = VariableLinker.build_tree(data_2021, transformer_merged, "transformer_tree", "trees/")
# Multithreaded approach
from multithreaded_mapping import match_descriptions_multithreaded
multithreaded_mapping = match_descriptions_multithreaded(data_2021, data_2016, 0.9, 8)
multithreaded_merged = VariableLinker.merge_mappings(data_2021, multithreaded_mapping)
multithreaded_tree = VariableLinker.build_tree(data_2021, multithreaded_merged, "multithreaded_tree", "trees/")
Advanced Features
Custom Similarity Thresholds
Different thresholds can be used for different types of data:
# Strict matching for critical variables
critical_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.95)
# Relaxed matching for exploratory analysis
exploratory_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.7)
Model Selection for Transformers
# Use different transformer models
mapping_mini = VariableLinker.match_descriptions_transformer(
    data_2021, data_2016, 0.9, 'all-MiniLM-L6-v2'
)
mapping_mpnet = VariableLinker.match_descriptions_transformer(
    data_2021, data_2016, 0.9, 'all-mpnet-base-v2'
)
Tree Analysis and Prediction
Overview
VariableLinker provides advanced functionality for analyzing existing tree structures and predicting missing parent nodes based on the additive property of census data.
Key Concepts
Additive Property
In census data, parent variables often represent the sum of their child variables:
Parent_Value = Sum(Child_Values)
This property allows us to predict parent nodes in years where they are missing or were not matched, provided that all of their children are available in those years.
Tree Parsing
The framework can parse existing Graphviz tree files to extract:
Node descriptions and metadata
Year-specific vector mappings
Parent-child relationships
Colour-coding information
Workflow for Tree Analysis
# 1. Parse existing tree file
tree_dict = VariableLinker.parse_tree_to_dict("existing_tree.gv")
# 2. Extract parent-child relationships
relationships = VariableLinker.extract_parent_child_relationships("existing_tree.gv")
# 3. Predict missing parent nodes
predictions = VariableLinker.predict_parent_nodes(
    tree_dict=tree_dict,
    parent_child_relationships=relationships,
    target_years=['2016', '2011', '2006']
)
# 4. Analyze predictions
for parent_node, predictable_years in predictions.items():
    print(f"Parent '{parent_node}' can be predicted in years: {predictable_years}")
Complete Analysis Workflow
# Load and process census data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
# Create initial tree
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
tree = VariableLinker.build_tree(data_2021, merged_df, "analysis_tree", "trees/")
# Analyze the created tree
tree_dict = VariableLinker.parse_tree_to_dict("trees/analysis_tree")
relationships = VariableLinker.extract_parent_child_relationships("trees/analysis_tree")
# Predict missing parents
predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships)
# Generate report
print("=== Tree Analysis Report ===")
print(f"Total nodes in tree: {len(tree_dict)}")
print(f"Parent-child relationships: {len(relationships)}")
print(f"Predictable parent nodes: {len(predictions)}")
for parent, years in predictions.items():
    parent_desc = tree_dict[parent]['description']
    print(f"\nParent: {parent_desc}")
    print(f"  Node ID: {parent}")
    print(f"  Predictable in years: {years}")
Prediction Algorithm Details
The prediction algorithm works as follows:
Year Analysis: Identifies the years in which the parent currently exists
Child Verification: For each target year, checks if ALL children exist
Prediction: If all children exist in a target year, the parent can be predicted
Example Scenario:
Parent: “Total Population”
Children: [“Male Population”, “Female Population”]
If “Male Population” and “Female Population” both exist in 2016, but “Total Population” doesn’t exist in 2016, then “Total Population” can be predicted for 2016.
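The three steps can be sketched in a few lines. This is a simplified stand-in for predict_parent_nodes, assuming a hypothetical input `node_years` that maps each node to the set of years in which it exists; the real method reads this information from the parsed tree dictionary.

```python
def predict_parents_sketch(node_years, children_of, target_years):
    """Return {parent: [years]} where the parent is absent but all children exist."""
    predictions = {}
    for parent, children in children_of.items():
        if not children:
            continue
        # Year analysis: years in which the parent already exists.
        parent_years = node_years.get(parent, set())
        # Child verification: every child must exist in the target year.
        predictable = [
            year
            for year in target_years
            if year not in parent_years
            and all(year in node_years.get(child, set()) for child in children)
        ]
        # Prediction: keep the parent only if at least one year qualifies.
        if predictable:
            predictions[parent] = predictable
    return predictions


node_years = {
    "Total Population": {"2021"},
    "Male Population": {"2021", "2016"},
    "Female Population": {"2021", "2016", "2011"},
}
children_of = {"Total Population": ["Male Population", "Female Population"]}
predicted = predict_parents_sketch(node_years, children_of, ["2016", "2011"])
```

Here 2016 qualifies because both children exist in that year, while 2011 does not because “Male Population” is missing.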
Use Cases for Tree Analysis
Data Completeness Assessment: Identify missing parent nodes across years
Prediction Validation: Verify which parent nodes can be reliably predicted
Performance Considerations
Memory Usage
Large datasets may require significant RAM
Consider processing in chunks for very large datasets
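Use the multithreaded approach to shorten matching time on large datasets; threads share a single copy of the data, so adding workers does not multiply memory use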
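Chunked processing can be as simple as slicing the input and handling one slice at a time. A minimal sketch with a hypothetical helper `iter_chunks` that works on any sliceable sequence:

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


descriptions = ["a", "b", "c", "d", "e"]
chunks = list(iter_chunks(descriptions, 2))
```

Because the helper is a generator, only one chunk needs to be held and processed at a time when it is consumed in a loop.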
Data Structures
Input DataFrame Format
{
    'vector': 'v_CA21_1234',
    'type': 'Total',
    'description': 'Population aged 25-34 years',
    'details': 'Detailed description...'
}
Output Mapping Format
{
    'description': 'Population aged 25-34 years',
    'vector_base': 'v_CA21_1234',
    'vector_cmp': 'v_CA16_1234'
}
Merged Mapping Format
{
    'description': 'Population aged 25-34 years',
    'vector_base': 'v_CA21_1234',
    'vector_cmp_list': ['v_CA16_1234', 'v_CA11_1234', 'v_CA06_1234']
}
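The relationship between the output mapping format and the merged format can be illustrated with plain dictionaries. The helper `merge_mapping_records` below is hypothetical and works on lists of dicts rather than DataFrames; it only shows how per-year `vector_cmp` values could be collected into one `vector_cmp_list` per base vector.

```python
def merge_mapping_records(*mappings):
    """Combine per-year mapping records into one record per vector_base."""
    merged = {}
    for mapping in mappings:
        for record in mapping:
            entry = merged.setdefault(
                record["vector_base"],
                {
                    "description": record["description"],
                    "vector_base": record["vector_base"],
                    "vector_cmp_list": [],
                },
            )
            entry["vector_cmp_list"].append(record["vector_cmp"])
    return list(merged.values())


mapping_16 = [{"description": "Population aged 25-34 years",
               "vector_base": "v_CA21_1234", "vector_cmp": "v_CA16_1234"}]
mapping_11 = [{"description": "Population aged 25-34 years",
               "vector_base": "v_CA21_1234", "vector_cmp": "v_CA11_1234"}]
merged_records = merge_mapping_records(mapping_16, mapping_11)
```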
Troubleshooting
Import Errors
# Solution: Add correct path
import sys
sys.path.append('../src/piccard')
from variable_linker import VariableLinker
File Not Found Errors
# Check file paths
import os
print("Current directory:", os.getcwd())
print("Files available:", os.listdir('.'))
Memory Issues
Reduce batch size for large datasets
Use multithreaded approach
Process data in chunks
Poor Matching Results
Adjust similarity threshold
Try different matching approaches
Check data quality and consistency
Configuration Options
Similarity Thresholds
Strict: 0.95+ for critical variables
Standard: 0.9 for most use cases
Relaxed: 0.7-0.8 for exploratory analysis
Transformer Models
'all-MiniLM-L6-v2': Fast, good accuracy
'all-mpnet-base-v2': Best accuracy, slower
Other transformer models can be found at [SBERT Pretrained Models](https://sbert.net/docs/sentence_transformer/pretrained_models.html)