Module 4: VariableLinker
========================

Overview
--------

VariableLinker is a Python framework designed for visualizing the links between census variables across multiple years. It provides multiple approaches for matching census variables between different years and creates hierarchical tree visualizations that show how these variables are connected.

Key Features:

* **Multiple Matching Algorithms**: Jaccard similarity and sentence transformers
* **Hierarchical Visualization**: Creates tree structures showing the parent-child relationships in census data
* **Colour-coded Results**: Visual indicators for data consistency across years

Use Cases:

* Census data harmonization across multiple years
* Tracking changes in census variables over time
* Visualizing data consistency and evolution

Installation and Setup
----------------------

Prerequisites

.. code-block:: bash

   pip install -r requirements.txt

Importing VariableLinker

.. code-block:: python

   import sys
   import os

   # Add the src/piccard directory to Python path
   current_dir = os.getcwd()
   src_path = os.path.join(current_dir, '..', 'src', 'piccard')
   sys.path.append(src_path)

   from variable_linker import VariableLinker

Core Concepts
-------------

1. Census Metadata Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

VariableLinker works with census metadata JSON files that contain:

* **Vector identifiers**: Unique codes for census variables
* **Descriptions**: Human-readable descriptions of census variables
* **Types**: Categories like "Total", "Male", "Female"
* **Details**: Additional contextual information

2. Matching Process
~~~~~~~~~~~~~~~~~~~

The framework performs two-pass matching:

* **Exact Match**: Find identical descriptions across years
* **Similarity Match**: Use similarity algorithms for inexact matches

3. Tree Visualization
~~~~~~~~~~~~~~~~~~~~~

* **Nodes**: Represent census variables
* **Edges**: Show parent-child relationships
* **Colours**: Indicate consistency across years

  * Grey: Source year only
  * Salmon: Matches in 1 other year
  * Yellow: Matches in 2 other years
  * Light green: Matches in 3+ other years

VariableLinker Class Reference
------------------------------

Class Overview
~~~~~~~~~~~~~~

.. code-block:: python

   class VariableLinker:
       """
       A class for processing census metadata and creating tree visualizations.

       This class provides functionality for:
       - Preprocessing census metadata from JSON files
       - Computing similarity between census descriptions using various methods
       - Matching descriptions across different census years
       - Building hierarchical tree visualizations with colour-coding
       """

Static Methods
~~~~~~~~~~~~~~

.. list-table:: VariableLinker Static Methods
   :header-rows: 1
   :widths: 20 20 20 20

   * - Method
     - Parameters
     - Returns
     - Description
   * - ``preprocess_census_metadata``
     - ``path, type_filter``
     - ``pd.DataFrame``
     - Preprocess census metadata
   * - ``jaccard_similarity``
     - ``sentence1, sentence2``
     - ``float``
     - Compute Jaccard similarity
   * - ``process_discription_text``
     - ``text``
     - ``set``
     - Process and tokenize text
   * - ``normalize_ranges``
     - ``text``
     - ``str``
     - Normalize numeric ranges
   * - ``match_descriptions_jaccard``
     - ``source_df, compare_df, threshold``
     - ``pd.DataFrame``
     - Jaccard-based matching
   * - ``match_descriptions_transformer``
     - ``source_df, compare_df, threshold, model``
     - ``pd.DataFrame``
     - Transformer-based matching
   * - ``match_descriptions_details_sentence_transformer``
     - ``source_df, compare_df, threshold, model``
     - ``pd.DataFrame``
     - Advanced transformer matching
   * - ``merge_mappings``
     - ``map_descriptions, *mappings_dfs``
     - ``pd.DataFrame``
     - Merge multiple mappings
   * - ``build_tree``
     - ``source_data, merged_df, tree_name, path``
     - ``Digraph``
     - Build tree visualization
   * - ``parse_tree_to_dict``
     - ``filepath``
     - ``Dict``
     - Parse tree file to dictionary
   * - ``extract_parent_child_relationships``
     - ``filepath``
     - ``Dict[str, List[str]]``
     - Extract parent-child relationships
   * - ``predict_parent_nodes``
     - ``tree_dict, parent_child_relationships, target_years``
     - ``Dict[str, List[str]]``
     - Predict missing parent nodes

``preprocess_census_metadata(path, type_filter="Total")``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Preprocesses census metadata from JSON files.

*Parameters:*

- ``path`` (str): Path to the JSON file containing census metadata
- ``type_filter`` (str): Type of records to filter for (default: "Total")

*Returns:*

- ``pd.DataFrame``: Preprocessed DataFrame with columns ['vector', 'type', 'description', ...]

*Example:*

.. code-block:: python

   data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")

``jaccard_similarity(sentence1, sentence2)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Computes Jaccard similarity between two census descriptions.

*Parameters:*

- ``sentence1`` (str): First census description
- ``sentence2`` (str): Second census description

*Returns:*

- ``float``: Jaccard similarity score between 0.0 and 1.0

``process_discription_text(text)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Processes and tokenizes census text for similarity comparison.

*Parameters:*

- ``text`` (str): Raw census description text

*Returns:*

- ``set``: Set of processed tokens (words and numbers, excluding stopwords)

``normalize_ranges(text)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Normalizes numeric ranges in text for consistent processing.

*Parameters:*

- ``text`` (str): Text containing potential numeric ranges

*Returns:*

- ``str``: Text with normalized numeric ranges

``parse_tree_to_dict(filepath)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parses a Graphviz tree file into a dictionary structure.

*Parameters:*

- ``filepath`` (str): Path to the Graphviz tree file

*Returns:*

- ``Dict``: Dictionary mapping node IDs to their information including descriptions, year mappings, and colours

*Example:*

.. code-block:: python

   tree_dict = VariableLinker.parse_tree_to_dict("my_tree")

``extract_parent_child_relationships(filepath)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Extracts parent-child relationships from tree file edges.

*Parameters:*

- ``filepath`` (str): Path to the tree file (Graphviz format)

*Returns:*

- ``Dict[str, List[str]]``: Dictionary mapping parent nodes to their children

*Example:*

.. code-block:: python

   relationships = VariableLinker.extract_parent_child_relationships("my_tree")

``predict_parent_nodes(tree_dict, parent_child_relationships, target_years)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Predicts parent nodes in other years using the additive property.

*Parameters:*

- ``tree_dict`` (Dict): Parsed tree dictionary with node info and year mappings
- ``parent_child_relationships`` (Dict[str, List[str]]): Parent to children mapping
- ``target_years`` (List[str]): Years to predict parents for (default: ['2016', '2011', '2006'])

*Returns:*

- ``Dict[str, List[str]]``: Dictionary mapping parent nodes to years in which they can be predicted

*Example:*

.. code-block:: python

   predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships, ['2016', '2011'])

Matching Approaches
-------------------

1. Jaccard Similarity Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Method:** ``match_descriptions_jaccard()``

Uses token-based similarity to match descriptions across years.

**Advantages:**

* Good for exact and near-exact matches
* Language-agnostic

**Disadvantages:**

* May miss semantic similarities
* Sensitive to phrasing

**Usage:**

.. code-block:: python

   jaccard_mapping = VariableLinker.match_descriptions_jaccard(
       source_df=data_2021,
       compare_df=data_2016,
       similarity_threshold=0.9
   )

2. Sentence Transformer Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Method:** ``match_descriptions_transformer()``

Uses pre-trained sentence transformers for semantic similarity matching.

**Advantages:**

* Captures semantic meaning
* Better for paraphrased descriptions
* Robust to word variations
* Faster than Jaccard since it uses vectorization

**Disadvantages:**

* Limited ability to process numeric values and ranges in text descriptions

**Usage:**

.. code-block:: python

   transformer_mapping = VariableLinker.match_descriptions_transformer(
       source_df=data_2021,
       compare_df=data_2016,
       similarity_threshold=0.9,
       model_name='all-mpnet-base-v2'
   )

3. Advanced Sentence Transformer Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Method:** ``match_descriptions_details_sentence_transformer()``

Enhanced version of sentence transformer that uses details for breaking ties when multiple exact matches are found.

**Advantages:**

* Attempts better disambiguation using details field
* More sophisticated exact matching strategy

**Disadvantages:**

* Performance evaluation indicates higher error rate than basic transformer
* Higher computational complexity without performance benefit

**Usage:**

.. code-block:: python

   advanced_mapping = VariableLinker.match_descriptions_details_sentence_transformer(
       source_df=data_2021,
       compare_df=data_2016,
       similarity_threshold=0.9
   )

4. Multithreaded Matching
~~~~~~~~~~~~~~~~~~~~~~~~~

**Method:** ``match_descriptions_multithreaded()`` (from multithreaded_mapping.py)

Jaccard similarity approach with multithreaded execution for enhanced performance on large datasets.

**Advantages:**

* Parallel processing for similarity matching phase
* Configurable number of worker threads (default: 4)
* Thread-safe operations for similarity matching

**Usage:**

.. code-block:: python

   from multithreaded_mapping import match_descriptions_multithreaded

   multithreaded_mapping = match_descriptions_multithreaded(
       source_df=data_2021,
       compare_df=data_2016,
       similarity_threshold=0.9,
       max_workers=8
   )

Workflow Examples
-----------------

Basic Workflow
~~~~~~~~~~~~~~

.. code-block:: python

   # 1. Load and preprocess data
   data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
   data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")

   # 2. Perform matching
   mapping_21_16 = VariableLinker.match_descriptions_jaccard(
       source_df=data_2021,
       compare_df=data_2016,
       similarity_threshold=0.9
   )

   # 3. Merge mappings
   merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)

   # 4. Build visualization
   tree = VariableLinker.build_tree(data_2021, merged_df, "my_tree", "output_path")

Multi-Year Workflow
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Load data for multiple years
   data_2006 = VariableLinker.preprocess_census_metadata("census_ca06_full_metadata.json")
   data_2011 = VariableLinker.preprocess_census_metadata("census_ca11_full_metadata.json")
   data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")
   data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")

   # Match against 2021 (latest year)
   mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
   mapping_21_11 = VariableLinker.match_descriptions_jaccard(data_2021, data_2011, 0.9)
   mapping_21_06 = VariableLinker.match_descriptions_jaccard(data_2021, data_2006, 0.9)

   # Merge all mappings
   merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16, mapping_21_11, mapping_21_06)

   # Build comprehensive tree
   tree = VariableLinker.build_tree(data_2021, merged_df, "multi_year_tree", "trees/")

Comparison of Approaches
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Jaccard approach
   jaccard_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
   jaccard_merged = VariableLinker.merge_mappings(data_2021, jaccard_mapping)
   jaccard_tree = VariableLinker.build_tree(data_2021, jaccard_merged, "jaccard_tree", "trees/")

   # Transformer approach
   transformer_mapping = VariableLinker.match_descriptions_transformer(data_2021, data_2016, 0.9)
   transformer_merged = VariableLinker.merge_mappings(data_2021, transformer_mapping)
   transformer_tree = VariableLinker.build_tree(data_2021, transformer_merged, "transformer_tree", "trees/")

   # Multithreaded approach
   from multithreaded_mapping import match_descriptions_multithreaded
   multithreaded_mapping = match_descriptions_multithreaded(data_2021, data_2016, 0.9, 8)
   multithreaded_merged = VariableLinker.merge_mappings(data_2021, multithreaded_mapping)
   multithreaded_tree = VariableLinker.build_tree(data_2021, multithreaded_merged, "multithreaded_tree", "trees/")

Advanced Features
-----------------

Custom Similarity Thresholds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Different thresholds can be used for different types of data:

.. code-block:: python

   # Strict matching for critical variables
   critical_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.95)

   # Relaxed matching for exploratory analysis
   exploratory_mapping = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.7)

Model Selection for Transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Use different transformer models
   mapping_mini = VariableLinker.match_descriptions_transformer(
       data_2021, data_2016, 0.9, 'all-MiniLM-L6-v2'
   )

   mapping_mpnet = VariableLinker.match_descriptions_transformer(
       data_2021, data_2016, 0.9, 'all-mpnet-base-v2'
   )

Tree Analysis and Prediction
----------------------------

Overview
~~~~~~~~

VariableLinker provides advanced functionality for analyzing existing tree structures and predicting missing parent nodes based on the additive property of census data.

Key Concepts
~~~~~~~~~~~~

* Additive Property

  In census data, parent variables often represent the sum of their child variables::

      Parent_Value = Sum(Child_Values)

  This property allows us to predict parent nodes in years where they don't exist or did not get matched, as long as all their children are available in those years.

* Tree Parsing

  The framework can parse existing Graphviz tree files to extract:

  * Node descriptions and metadata
  * Year-specific vector mappings
  * Parent-child relationships
  * Colour-coding information

Workflow for Tree Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # 1. Parse existing tree file
   tree_dict = VariableLinker.parse_tree_to_dict("existing_tree.gv")

   # 2. Extract parent-child relationships
   relationships = VariableLinker.extract_parent_child_relationships("existing_tree.gv")

   # 3. Predict missing parent nodes
   predictions = VariableLinker.predict_parent_nodes(
       tree_dict=tree_dict,
       parent_child_relationships=relationships,
       target_years=['2016', '2011', '2006']
   )

   # 4. Analyze predictions
   for parent_node, predictable_years in predictions.items():
       print(f"Parent '{parent_node}' can be predicted in years: {predictable_years}")

Complete Analysis Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Load and process census data
   data_2021 = VariableLinker.preprocess_census_metadata("census_ca21_full_metadata.json")
   data_2016 = VariableLinker.preprocess_census_metadata("census_ca16_full_metadata.json")

   # Create initial tree
   mapping_21_16 = VariableLinker.match_descriptions_jaccard(data_2021, data_2016, 0.9)
   merged_df = VariableLinker.merge_mappings(data_2021, mapping_21_16)
   tree = VariableLinker.build_tree(data_2021, merged_df, "analysis_tree", "trees/")

   # Analyze the created tree
   tree_dict = VariableLinker.parse_tree_to_dict("trees/analysis_tree")
   relationships = VariableLinker.extract_parent_child_relationships("trees/analysis_tree")

   # Predict missing parents
   predictions = VariableLinker.predict_parent_nodes(tree_dict, relationships)

   # Generate report
   print("=== Tree Analysis Report ===")
   print(f"Total nodes in tree: {len(tree_dict)}")
   print(f"Parent-child relationships: {len(relationships)}")
   print(f"Predictable parent nodes: {len(predictions)}")

   for parent, years in predictions.items():
       parent_desc = tree_dict[parent]['description']
       print(f"\nParent: {parent_desc}")
       print(f"  Node ID: {parent}")
       print(f"  Predictable in years: {years}")

Prediction Algorithm Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The prediction algorithm works as follows:

* **Year Analysis**: Identifies the years in which the parent currently exists
* **Child Verification**: For each target year, checks if ALL children exist
* **Prediction**: If all children exist in a target year, the parent can be predicted

Example Scenario:
~~~~~~~~~~~~~~~~~

* Parent: "Total Population"
* Children: ["Male Population", "Female Population"]

If "Male Population" and "Female Population" both exist in 2016, but "Total Population" doesn't exist in 2016, then "Total Population" can be predicted for 2016.

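The prediction steps and the scenario above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation behind ``predict_parent_nodes``: the helper name ``predictable_years``, the node IDs, and the per-node sets of years are all hypothetical stand-ins for whatever the parsed tree dictionary actually stores.

.. code-block:: python

   from typing import Dict, List, Set


   def predictable_years(
       node_years: Dict[str, Set[str]],
       children_of: Dict[str, List[str]],
       target_years: List[str],
   ) -> Dict[str, List[str]]:
       """Map each parent to the target years where it is missing but all children exist."""
       predictions: Dict[str, List[str]] = {}
       for parent, children in children_of.items():
           if not children:
               continue  # nothing to sum, so nothing to predict
           # Year analysis: years in which the parent already exists
           parent_years = node_years.get(parent, set())
           # Child verification: keep a year only if every child exists in it
           years = [
               year
               for year in target_years
               if year not in parent_years
               and all(year in node_years.get(child, set()) for child in children)
           ]
           if years:
               predictions[parent] = years
       return predictions


   # Hypothetical data mirroring the "Total Population" scenario
   node_years = {
       "total_population": {"2021"},
       "male_population": {"2021", "2016"},
       "female_population": {"2021", "2016", "2011"},
   }
   children_of = {"total_population": ["male_population", "female_population"]}

   result = predictable_years(node_years, children_of, ["2016", "2011", "2006"])
   print(result)

In this sketch only 2016 qualifies, because "male_population" is absent in 2011 and both children are absent in 2006; a single missing child is enough to rule a year out.
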
Use Cases for Tree Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Data Completeness Assessment**: Identify missing parent nodes across years
* **Prediction Validation**: Verify which parent nodes can be reliably predicted

Performance Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~

* Memory Usage

  * Large datasets may require significant RAM
  * Consider processing in chunks for very large datasets
  * Use the multithreaded approach for better memory management

Data Structures
---------------

Input DataFrame Format
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   {
       'vector': 'v_CA21_1234',
       'type': 'Total',
       'description': 'Population aged 25-34 years',
       'details': 'Detailed description...'
   }

Output Mapping Format
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   {
       'description': 'Population aged 25-34 years',
       'vector_base': 'v_CA21_1234',
       'vector_cmp': 'v_CA16_1234'
   }

Merged Mapping Format
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   {
       'description': 'Population aged 25-34 years',
       'vector_base': 'v_CA21_1234',
       'vector_cmp_list': ['v_CA16_1234', 'v_CA11_1234', 'v_CA06_1234']
   }

Troubleshooting
---------------

Import Errors
~~~~~~~~~~~~~

.. code-block:: python

   # Solution: Add correct path
   import sys
   sys.path.append('../src/piccard')
   from variable_linker import VariableLinker

File Not Found Errors
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Check file paths
   import os
   print("Current directory:", os.getcwd())
   print("Files available:", os.listdir('.'))

Memory Issues
~~~~~~~~~~~~~

* Reduce batch size for large datasets
* Use the multithreaded approach
* Process data in chunks

Poor Matching Results
~~~~~~~~~~~~~~~~~~~~~

* Adjust the similarity threshold
* Try different matching approaches
* Check data quality and consistency

Configuration Options
---------------------

Similarity Thresholds

* **Strict**: 0.95+ for critical variables
* **Standard**: 0.9 for most use cases
* **Relaxed**: 0.7-0.8 for exploratory analysis

Transformer Models

* ``'all-MiniLM-L6-v2'``: Fast, good accuracy
* ``'all-mpnet-base-v2'``: Best accuracy, slower
* Other transformer models are listed at `SBERT Pretrained Models <https://sbert.net/docs/sentence_transformer/pretrained_models.html>`_

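To make the similarity threshold presets under Configuration Options more concrete, the sketch below scores two descriptions with a plain token-set Jaccard similarity and checks the score against each preset. It is an illustration only: ``tokenize``, ``jaccard``, and ``THRESHOLDS`` are hypothetical names, the tokenizer is a simple regex split without the stopword removal or range normalization that VariableLinker applies, and 0.75 is an arbitrary point inside the relaxed 0.7-0.8 band.

.. code-block:: python

   import re
   from typing import Set

   # Hypothetical presets taken from the threshold bands described in this module
   THRESHOLDS = {"strict": 0.95, "standard": 0.9, "relaxed": 0.75}


   def tokenize(text: str) -> Set[str]:
       """Lowercase the text and return its set of alphanumeric tokens."""
       return set(re.findall(r"[a-z0-9]+", text.lower()))


   def jaccard(a: str, b: str) -> float:
       """Return |A ∩ B| / |A ∪ B| over the token sets of the two strings."""
       tokens_a, tokens_b = tokenize(a), tokenize(b)
       if not tokens_a and not tokens_b:
           return 1.0  # convention chosen here: two empty descriptions are identical
       return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


   score = jaccard(
       "Population aged 25 to 34 years",
       "Total population aged 25 to 34 years",
   )
   passed = {name: score >= cutoff for name, cutoff in THRESHOLDS.items()}
   print(f"score={score:.3f}", passed)

The two descriptions share six of seven distinct tokens, so the score is 6/7 (about 0.857). That clears the relaxed cutoff but not the standard or strict ones, which shows how a single added word can decide whether a pair is matched at the 0.9 setting.
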