Module 4: VariableLinker

Overview

VariableLinker is a Python framework designed for visualizing the links between census variables across multiple years. It provides multiple approaches for matching census variables between different years and creates hierarchical tree visualizations that show how these variables are connected.

Key Features:

Multiple Matching Algorithms: Jaccard similarity and sentence transformers
Hierarchical Visualization: Creates tree structures showing the parent-child relationships in census data
Colour-coded Results: Visual indicators for data consistency across years

Use Cases:

Census data harmonization across multiple years
Tracking changes in census variables over time
Visualizing data consistency and evolution

Installation and Setup

Prerequisites

pip install -r requirements.txt

Importing VariableLinker

import sys
import os


# Add the src/piccard directory to Python path

current_dir = os.getcwd()
src_path = os.path.join(current_dir, '..', 'src', 'piccard')
sys.path.append(src_path)

from variable_linker import VariableLinker

Core Concepts

1. Census Metadata Structure

VariableLinker works with census metadata JSON files that contain:

Vector identifiers: Unique codes for census variables
Descriptions: Human-readable descriptions of census variables
Types: Categories like “Total”, “Male”, “Female”
Details: Additional contextual information
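
To make the four fields concrete, here is a minimal sketch of reading such records and filtering them by type. The JSON layout (a flat array of objects), the sample values, and the helper name `load_records` are assumptions for illustration only; the real metadata files may be organised differently.

```python
import json

# Hypothetical metadata layout: a flat JSON array of records.
# The sample values are illustrative, not taken from a real census file.
SAMPLE_METADATA = """
[
  {"vector": "v_CA21_1", "type": "Total", "description": "Population, 2021", "details": "Population and dwellings"},
  {"vector": "v_CA21_2", "type": "Male", "description": "0 to 14 years", "details": "Age; Male"},
  {"vector": "v_CA21_3", "type": "Total", "description": "0 to 14 years", "details": "Age; Total"}
]
"""


def load_records(raw_json, type_filter="Total"):
    """Parse metadata JSON and keep only records of the requested type."""
    records = json.loads(raw_json)
    return [record for record in records if record.get("type") == type_filter]


total_records = load_records(SAMPLE_METADATA)
print([record["vector"] for record in total_records])
```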

2. Matching Process

The framework performs two-pass matching:

Exact Match: Find identical descriptions across years
Similarity Match: Use similarity algorithms for inexact matches
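
The two passes can be sketched roughly as follows. This is not the framework's implementation: the function `two_pass_match`, the plain-dict inputs, and the whitespace tokenizer are assumptions chosen to keep the example self-contained.

```python
def jaccard(a, b):
    """Jaccard similarity on lowercase whitespace tokens (simplified)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def two_pass_match(source, compare, threshold=0.9):
    """Match source vectors to compare vectors.

    Both arguments are hypothetical dicts of vector -> description.
    """
    matches = {}

    # Pass 1: exact description matches.
    by_description = {}
    for vector, description in compare.items():
        by_description.setdefault(description, vector)
    for vector, description in source.items():
        if description in by_description:
            matches[vector] = by_description[description]

    # Pass 2: best similarity match for whatever is still unmatched.
    used = set(matches.values())
    for vector, description in source.items():
        if vector in matches:
            continue
        best_vector, best_score = None, 0.0
        for cmp_vector, cmp_description in compare.items():
            if cmp_vector in used:
                continue
            score = jaccard(description, cmp_description)
            if score > best_score:
                best_vector, best_score = cmp_vector, score
        if best_vector is not None and best_score >= threshold:
            matches[vector] = best_vector
            used.add(best_vector)
    return matches


source = {"v_CA21_1": "Total population", "v_CA21_2": "Median age of the population"}
compare = {"v_CA16_1": "Total population", "v_CA16_2": "Median age of population"}
print(two_pass_match(source, compare, threshold=0.75))
```

In this toy data the second pair scores 0.8, so it is matched at a threshold of 0.75 but left unmatched at 0.9.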

3. Tree Visualization

Nodes: Represent census variables
Edges: Show parent-child relationships
Colours: Indicate consistency across years
- Grey: Source year only
- Salmon: Matches in 1 other year
- Yellow: Matches in 2 other years
- Light green: Matches in 3+ other years
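
The colour rule above amounts to a lookup on the number of other years in which a variable was matched. A minimal sketch, assuming a hypothetical helper `node_colour` and assuming the colours are expressed as Graphviz colour names:

```python
def node_colour(matched_years):
    """Map the number of other years with a match to a node colour."""
    if matched_years <= 0:
        return "grey"        # source year only
    if matched_years == 1:
        return "salmon"      # one other year
    if matched_years == 2:
        return "yellow"      # two other years
    return "lightgreen"      # three or more other years


for count in range(5):
    print(count, node_colour(count))
```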

VariableLinker Class Reference

Class Overview

class VariableLinker:
    """
    A class for processing census metadata and creating tree visualizations.

    This class provides functionality for:
    - Preprocessing census metadata from JSON files
    - Computing similarity between census descriptions using various methods
    - Matching descriptions across different census years
    - Building hierarchical tree visualizations with colour-coding
    """

Static Methods

`preprocess_census_metadata(path, type_filter="Total")`

Preprocesses census metadata from JSON files.

Parameters:

path (str): Path to the JSON file containing census metadata
type_filter (str): Type of records to filter for (default: “Total”)

Returns:

pd.DataFrame: Preprocessed DataFrame with columns [‘vector’, ‘type’, ‘description’, …]

Example:

data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")

`jaccard_similarity(sentence1, sentence2)`

Computes Jaccard similarity between two census descriptions.

Parameters:

sentence1 (str): First census description
sentence2 (str): Second census description

Returns:

float: Jaccard similarity score between 0.0 and 1.0
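
For reference, the Jaccard similarity of two token sets A and B is |A ∩ B| / |A ∪ B|. The sketch below illustrates the measure itself; the whitespace tokenizer and the choice to return 0.0 for two empty inputs are assumptions, and the real method may preprocess text differently.

```python
def jaccard_of_sets(tokens_a, tokens_b):
    """Return |A & B| / |A | B|, or 0.0 when both sets are empty."""
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def simple_jaccard(sentence1, sentence2):
    """Jaccard similarity on lowercase whitespace tokens (simplified)."""
    return jaccard_of_sets(set(sentence1.lower().split()),
                           set(sentence2.lower().split()))


score = simple_jaccard("Population aged 25 to 34 years",
                       "Population aged 25 to 34")
print(round(score, 3))  # 5 shared tokens out of 6 distinct tokens
```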

`process_discription_text(text)`

Processes and tokenizes census text for similarity comparison.

Parameters:

text (str): Raw census description text

Returns:

set: Set of processed tokens (words and numbers, excluding stopwords)
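
A rough sketch of this kind of tokenization is shown below. The stopword list, the regular expression, and the function name `tokenize_description` are assumptions for illustration; the framework's own stopword set and rules are not documented here.

```python
import re

# Hypothetical stopword list; the framework's actual list may differ.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "the", "to", "with"}


def tokenize_description(text):
    """Lowercase the text, keep words and numbers, and drop stopwords."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


print(sorted(tokenize_description("Population aged 25 to 34 years in the labour force")))
```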

`normalize_ranges(text)`

Normalizes numeric ranges in text for consistent processing.

Parameters:

text (str): Text containing potential numeric ranges

Returns:

str: Text with normalized numeric ranges
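
The exact normalization rules are not spelled out here, so the following is only one plausible interpretation: rewriting ranges written as "25 to 34" or with an en dash into a single hyphenated form. The function name `normalize_simple_ranges` and the target format are assumptions; values with currency symbols or thousands separators are not handled.

```python
import re

# Two integers joined by "to", a hyphen, or an en dash (U+2013).
RANGE_PATTERN = re.compile(r"(\d+)\s*(?:\bto\b|[-\u2013])\s*(\d+)")


def normalize_simple_ranges(text):
    """Rewrite simple numeric ranges as 'low-high'."""
    return RANGE_PATTERN.sub(r"\1-\2", text)


print(normalize_simple_ranges("Population aged 25 to 34 years"))
```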

`parse_tree_to_dict(filepath)`

Parses a Graphviz tree file into a dictionary structure.

Parameters:

filepath (str): Path to the Graphviz tree file

Returns:

Dict: Dictionary mapping node IDs to their information including descriptions, year mappings, and colours

Example:

tree_dict = VariableLinker.parse_tree_to_dict("my_tree")

`extract_parent_child_relationships(filepath)`

Extracts parent-child relationships from tree file edges.

Parameters:

filepath (str): Path to the tree file (Graphviz format)

Returns:

Dict[str, List[str]]: Dictionary mapping parent nodes to their children

Example:

relationships = VariableLinker.extract_parent_child_relationships("my_tree")
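
As an illustration of the idea, edges in Graphviz DOT source have the form `parent -> child`, so a parent-to-children mapping can be collected with a regular expression. The sample DOT text, the node IDs, and the helper `edges_to_children` are assumptions; the files written by `build_tree` may use a different layout.

```python
import re
from collections import defaultdict

# Matches lines such as:  n1 -> n2   or   "n1" -> "n2";
EDGE_PATTERN = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')

# Hypothetical DOT source, used only for this example.
SAMPLE_DOT = '''
digraph tree {
    n1 [label="Total population" fillcolor=lightgreen]
    n2 [label="Male" fillcolor=salmon]
    n3 [label="Female" fillcolor=salmon]
    n1 -> n2
    n1 -> n3
}
'''


def edges_to_children(dot_source):
    """Collect a parent -> [children] mapping from DOT edge lines."""
    children = defaultdict(list)
    for line in dot_source.splitlines():
        match = EDGE_PATTERN.match(line)
        if match:
            parent, child = match.groups()
            children[parent].append(child)
    return dict(children)


print(edges_to_children(SAMPLE_DOT))
```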

`predict_parent_nodes(tree_dict, parent_child_relationships, target_years)`

Predicts parent nodes in other years using the additive property.

Parameters:

tree_dict (Dict): Parsed tree dictionary with node info and year mappings
parent_child_relationships (Dict[str, List[str]]): Parent to children mapping
target_years (List[str]): Years to predict parents for (default: [‘2016’, ‘2011’, ‘2006’])

Returns:

Dict[str, List[str]]: Dictionary mapping parent nodes to years in which they can be predicted

Example:

predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships, ['2016', '2011'])

Matching Approaches

1. Jaccard Similarity Matching

Method: match_descriptions_jaccard()

Uses token-based similarity to match descriptions across years.

Advantages:

Good for exact and near-exact matches
Language-agnostic

Disadvantages:

May miss semantic similarities
Sensitive to phrasing

Usage:

jaccard_mapping = VariableLinker.match_descriptions_jaccard(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)

2. Sentence Transformer Matching

Method: match_descriptions_transformer()

Uses pre-trained sentence transformers for semantic similarity matching.

Advantages:

Captures semantic meaning
Better for paraphrased descriptions
Robust to word variations
Faster than Jaccard since it uses vectorization

Disadvantages:

Limited ability to process numeric values and ranges in text descriptions

Usage:

transformer_mapping = VariableLinker.match_descriptions_transformer(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9,
    model_name='all-mpnet-base-v2'
)
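
Conceptually, transformer matching compares embedding vectors rather than tokens, typically by cosine similarity. The sketch below shows only that comparison step on hand-made toy vectors; in practice the vectors would come from the sentence transformer model. The helper names and the toy numbers are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(source_vec, candidates, threshold=0.9):
    """Return (vector_id, score) for the closest candidate, or (None, score)
    when the best score is below the threshold."""
    best_id, best_score = None, -1.0
    for vector_id, embedding in candidates.items():
        score = cosine_similarity(source_vec, embedding)
        if score > best_score:
            best_id, best_score = vector_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Toy embeddings standing in for model output.
source_embedding = [1.0, 0.0, 1.0]
candidate_embeddings = {
    "v_CA16_10": [0.9, 0.1, 1.0],
    "v_CA16_20": [0.0, 1.0, 0.0],
}
print(best_match(source_embedding, candidate_embeddings))
```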

3. Advanced Sentence Transformer Matching

Method: match_descriptions_details_sentence_transformer()

An enhanced version of the sentence transformer approach that uses the details field to break ties when multiple exact matches are found.

Advantages:

Attempts better disambiguation using the details field
More sophisticated exact-matching strategy

Disadvantages:

Performance evaluation indicates a higher error rate than the basic transformer approach
Higher computational complexity without a corresponding performance benefit

Usage:

advanced_mapping = VariableLinker.match_descriptions_details_sentence_transformer(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)

4. Multithreaded Matching

Method: match_descriptions_multithreaded() (from multithreaded_mapping.py)

Jaccard similarity approach with multithreaded execution for enhanced performance on large datasets.

Advantages:

Parallel processing for similarity matching phase
Configurable number of worker threads (default: 4)
Thread-safe operations for similarity matching

Usage:

from multithreaded_mapping import match_descriptions_multithreaded

multithreaded_mapping = match_descriptions_multithreaded(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9,
    max_workers=8
)
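
The general pattern behind this kind of parallel matching can be sketched with the standard library's `ThreadPoolExecutor`. This is an illustration under assumptions, not the code in multithreaded_mapping.py: the helper names and dict inputs are hypothetical. Note also that in CPython the global interpreter lock limits the speed-up threads can give to pure-Python, CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def jaccard(a, b):
    """Jaccard similarity on lowercase whitespace tokens (simplified)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def best_candidate(description, candidates, threshold):
    """Return the best-scoring candidate vector, or None below the threshold."""
    best_vector, best_score = None, 0.0
    for vector, candidate_description in candidates.items():
        score = jaccard(description, candidate_description)
        if score > best_score:
            best_vector, best_score = vector, score
    return best_vector if best_score >= threshold else None


def match_in_parallel(source, candidates, threshold=0.9, max_workers=4):
    """Score each source description on a worker thread.

    The candidates dict is only read, never modified, so sharing it
    between threads is safe.
    """
    items = list(source.items())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with items.
        results = list(pool.map(
            lambda item: best_candidate(item[1], candidates, threshold), items))
    return {vector: match
            for (vector, _), match in zip(items, results)
            if match is not None}


source = {"v_CA21_1": "total population", "v_CA21_2": "median age"}
candidates = {"v_CA16_1": "total population", "v_CA16_2": "average household size"}
print(match_in_parallel(source, candidates, max_workers=2))
```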

Workflow Examples

Basic Workflow

# 1. Load and preprocess data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")


# 2. Perform matching
mapping_21_16 = VariableLinker.match_descriptions_jaccard(
    source_df=data_2021,
    compare_df=data_2016,
    similarity_threshold=0.9
)


# 3. Merge mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)


# 4. Build visualization
tree = VariableLinker.build_tree(data_2021, merged_df, "my_tree", "output_path")

Multi-Year Workflow

# Load data for multiple years
data_2006 = VariableLinker.preprocess_census_metadata("census_ca06_full_metadata.json")
data_2011 = VariableLinker.preprocess_census_metadata("census_ca11_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")


# Match against 2021 (latest year)
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
mapping_21_11 = VariableLinker.match_descriptions_jaccard(data_2021, data_2011, 0.9)
mapping_21_06 = VariableLinker.match_descriptions_jaccard(data_2021, data_2006, 0.9)


# Merge all mappings
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16, mapping_21_11, mapping_21_06)


# Build comprehensive tree
tree = VariableLinker.build_tree(data_2021, merged_df, "multi_year_tree", "trees/")

Comparison of Approaches

# Jaccard approach
jaccard_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
jaccard_merged = VariableLinker.merge_mappings(data_2021, jaccard_mapping)
jaccard_tree = VariableLinker.build_tree(data_2021, jaccard_merged, "jaccard_tree", "trees/")


# Transformer approach
transformer_mapping = VariableLinker.match_descriptions_transformer(data_2021, data_2016, 0.9)
transformer_merged = VariableLinker.merge_mappings(data_2021, transformer_mapping)
transformer_tree = VariableLinker.build_tree(data_2021, transformer_merged, "transformer_tree", "trees/")


# Multithreaded approach
from multithreaded_mapping import match_descriptions_multithreaded
multithreaded_mapping = match_descriptions_multithreaded(data_2021, data_2016, 0.9, 8)
multithreaded_merged = VariableLinker.merge_mappings(data_2021, multithreaded_mapping)
multithreaded_tree = VariableLinker.build_tree(data_2021, multithreaded_merged, "multithreaded_tree", "trees/")

Advanced Features

Custom Similarity Thresholds

Different thresholds can be used for different types of data:

# Strict matching for critical variables
critical_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.95)


# Relaxed matching for exploratory analysis
exploratory_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.7)

Model Selection for Transformers

# Use different transformer models
mapping_mini = VariableLinker.match_descriptions_transformer(
    data_2021, data_2016, 0.9, 'all-MiniLM-L6-v2'
)
mapping_mpnet = VariableLinker.match_descriptions_transformer(
    data_2021, data_2016, 0.9, 'all-mpnet-base-v2'
)

Tree Analysis and Prediction

Overview

VariableLinker provides advanced functionality for analyzing existing tree structures and predicting missing parent nodes based on the additive property of census data.

Key Concepts

Additive Property

In census data, parent variables often represent the sum of their child variables:

Parent_Value = Sum(Child_Values)
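
As a small worked example of this relationship, the snippet below derives a parent value from its children and compares it with a reported value. The counts are made up for illustration and are not real census figures; the helper name `derive_parent_value` is also an assumption.

```python
def derive_parent_value(child_values):
    """Sum the child values to obtain the implied parent value."""
    return sum(child_values.values())


# Made-up counts for illustration only.
child_values = {"Male Population": 120, "Female Population": 130}
reported_parent = 250

derived_parent = derive_parent_value(child_values)
is_additive = derived_parent == reported_parent
print(derived_parent, is_additive)
```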

This property allows us to predict parent nodes in years where they do not exist or were not matched, as long as all of their children are available in those years.

Tree Parsing

The framework can parse existing Graphviz tree files to extract:
- Node descriptions and metadata
- Year-specific vector mappings
- Parent-child relationships
- Colour-coding information

Workflow for Tree Analysis

# 1. Parse existing tree file
tree_dict = VariableLinker.parse_tree_to_dict("existing_tree.gv")


# 2. Extract parent-child relationships
relationships = VariableLinker.extract_parent_child_relationships("existing_tree.gv")


# 3. Predict missing parent nodes
predictions = VariableLinker.predict_parent_nodes(
    tree_dict=tree_dict,
    parent_child_relationships=relationships,
    target_years=['2016', '2011', '2006']
)


# 4. Analyze predictions
for parent_node, predictable_years in predictions.items():
    print(f"Parent '{parent_node}' can be predicted in years: {predictable_years}")

Complete Analysis Workflow

# Load and process census data
data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")


# Create initial tree
mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
tree = VariableLinker.build_tree(data_2021, merged_df, "analysis_tree", "trees/")


# Analyze the created tree
tree_dict = VariableLinker.parse_tree_to_dict("trees/analysis_tree")
relationships = VariableLinker.extract_parent_child_relationships("trees/analysis_tree")


# Predict missing parents
predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships)


# Generate report
print("=== Tree Analysis Report ===")
print(f"Total nodes in tree: {len(tree_dict)}")
print(f"Parent-child relationships: {len(relationships)}")
print(f"Predictable parent nodes: {len(predictions)}")

for parent, years in predictions.items():
    parent_desc = tree_dict[parent]['description']
    print(f"\nParent: {parent_desc}")
    print(f"  Node ID: {parent}")
    print(f"  Predictable in years: {years}")

Prediction Algorithm Details

The prediction algorithm works as follows:

Year Analysis: Identifies the years in which the parent currently exists
Child Verification: For each target year, checks if ALL children exist
Prediction: If all children exist in a target year, the parent can be predicted

Example Scenario:

Parent: “Total Population”
Children: [“Male Population”, “Female Population”]

If “Male Population” and “Female Population” both exist in 2016, but “Total Population” doesn’t exist in 2016, then “Total Population” can be predicted for 2016.
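
The three steps and the scenario above can be sketched as follows. This is a simplified illustration, not the body of `predict_parent_nodes`: the `years` key and the node IDs are assumed, and only the `description` key is taken from the workflow examples in this document.

```python
def predict_parents(tree_dict, relationships, target_years=("2016", "2011", "2006")):
    """Return parent -> [years] where the parent is missing but every child exists."""
    predictions = {}
    for parent, children in relationships.items():
        # Step 1: years in which the parent already exists.
        parent_years = set(tree_dict.get(parent, {}).get("years", {}))
        predictable = []
        for year in target_years:
            if year in parent_years:
                continue
            # Step 2: every child must exist in the target year.
            if children and all(
                year in tree_dict.get(child, {}).get("years", {})
                for child in children
            ):
                # Step 3: the parent can be predicted for this year.
                predictable.append(year)
        if predictable:
            predictions[parent] = predictable
    return predictions


# Hypothetical tree mirroring the scenario above.
tree_dict = {
    "n1": {"description": "Total Population", "years": {"2021": "v_CA21_1"}},
    "n2": {"description": "Male Population",
           "years": {"2021": "v_CA21_2", "2016": "v_CA16_2"}},
    "n3": {"description": "Female Population",
           "years": {"2021": "v_CA21_3", "2016": "v_CA16_3", "2011": "v_CA11_3"}},
}
relationships = {"n1": ["n2", "n3"]}
print(predict_parents(tree_dict, relationships))
```

Here 2011 is not predictable because only one of the two children exists in that year.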

Use Cases for Tree Analysis

Data Completeness Assessment: Identify missing parent nodes across years
Prediction Validation: Verify which parent nodes can be reliably predicted

Performance Considerations

Memory Usage
- Large datasets may require significant RAM
- Consider processing in chunks for very large datasets
- Use multithreaded approach for better memory management

Data Structures

Input DataFrame Format

{
    'vector': 'v_CA21_1234',
    'type': 'Total',
    'description': 'Population aged 25-34 years',
    'details': 'Detailed description...'
}

Output Mapping Format

{
    'description': 'Population aged 25-34 years',
    'vector_base': 'v_CA21_1234',
    'vector_cmp': 'v_CA16_1234'
}

Merged Mapping Format

{
    'description': 'Population aged 25-34 years',
    'vector_base': 'v_CA21_1234',
    'vector_cmp_list': ['v_CA16_1234', 'v_CA11_1234', 'v_CA06_1234']
}
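
To show how the per-year output format relates to the merged format, here is a minimal sketch that groups several mappings by `vector_base`. It works on plain lists of dicts rather than DataFrames, and the function `merge_year_mappings` is a hypothetical stand-in, not the implementation of `merge_mappings`.

```python
def merge_year_mappings(*mappings):
    """Combine per-year mapping rows into one row per source vector."""
    merged = {}
    for mapping in mappings:
        for row in mapping:
            entry = merged.setdefault(row["vector_base"], {
                "description": row["description"],
                "vector_base": row["vector_base"],
                "vector_cmp_list": [],
            })
            entry["vector_cmp_list"].append(row["vector_cmp"])
    return list(merged.values())


base = {"description": "Population aged 25-34 years", "vector_base": "v_CA21_1234"}
mapping_16 = [dict(base, vector_cmp="v_CA16_1234")]
mapping_11 = [dict(base, vector_cmp="v_CA11_1234")]
mapping_06 = [dict(base, vector_cmp="v_CA06_1234")]

print(merge_year_mappings(mapping_16, mapping_11, mapping_06))
```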

Troubleshooting

Import Errors

# Solution: Add correct path
import sys
sys.path.append('../src/piccard')
from variable_linker import VariableLinker

File Not Found Errors

# Check file paths
import os
print("Current directory:", os.getcwd())
print("Files available:", os.listdir('.'))

Memory Issues

Reduce batch size for large datasets
Use multithreaded approach
Process data in chunks

Poor Matching Results

Adjust similarity threshold
Try different matching approaches
Check data quality and consistency

Configuration Options

Similarity Thresholds

Strict: 0.95+ for critical variables
Standard: 0.9 for most use cases
Relaxed: 0.7-0.8 for exploratory analysis

Transformer Models

'all-MiniLM-L6-v2': Fast, good accuracy
'all-mpnet-base-v2': Best accuracy, slower
Other transformer models are listed at [SBERT Pretrained Models](https://sbert.net/docs/sentence_transformer/pretrained_models.html)
