Module 2: Clustering
==========================

ClusteredNetworkTable class
---------------------------

.. code-block:: python 

    class ClusteredNetworkTable(NetworkTable):
        '''
        A table showing the network representation of census data and the cluster assignments in the network.
        Each feature present in the data (including the cluster assignment for each year) is a column,
        and each possible path through the network is a row.
        '''
        
*Instance Variables:*
~~~~~~~~~~~~~~~~~~~~~~~

- ``table`` (``pandas.DataFrame``): The table, presented as a ``pandas`` DataFrame.
- ``years`` (List[str]): The census years present in the table.
- ``id`` (str): The unique geographical ID used to distinguish geographical areas in the table.
- ``num_clusters`` (int): The number of clusters that data can be assigned to. Determined by the user.
- ``tsc`` (Union[OptTSCluster, GreedyTSCluster]): The ``tscluster`` clustering object used to fit the data.
- ``arr`` (np.ndarray[np.float64]): The array of data used in clustering.
- ``label_dict`` (dict[str, Any]): The labels of census years, network paths, and variables that correspond to each dimension of ``arr``.

*Methods:*
~~~~~~~~~~~

- ``modify_label_dict``: Takes a custom label dictionary as an argument and sets ``label_dict`` to the new dictionary.
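
As a rough illustration of what a custom label dictionary might need to satisfy, the sketch below checks that a dictionary has one label per entry along each dimension of ``arr``. The ``'T'``, ``'N'`` and ``'F'`` keys (census years, network paths, variables) and the helper ``check_label_dict`` are assumptions made for this example; they are not part of the documented API.

.. code-block:: python

    import numpy as np

    def check_label_dict(arr, label_dict):
        """Return True if label_dict has one label per entry along each axis of arr."""
        # Assumed layout: axis 0 = census years, axis 1 = network paths, axis 2 = variables.
        for axis, key in enumerate(("T", "N", "F")):
            if key not in label_dict or len(label_dict[key]) != arr.shape[axis]:
                return False
        return True

    # Toy array: 2 years, 3 paths, 1 variable.
    arr = np.zeros((2, 3, 1))
    custom_labels = {"T": ["2016", "2021"], "N": [0, 1, 2], "F": ["population"]}
    print(check_label_dict(arr, custom_labels))

A check of this kind, run before calling ``modify_label_dict``, would catch a dictionary whose labels no longer match the array.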

Module 2 Functions
-------------------

``clustering_prep``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Converts a piccard network table into a 3D NumPy array of all possible paths and their corresponding features.
This array is then used for clustering with ``tscluster``.
The user can optionally pass a list of columns to be considered by the clustering algorithm,
and the function will check that these columns are valid.

*Parameters:*

* ``network_table`` (NetworkTable): 
    The NetworkTable containing the data to be clustered.
 
* ``cols`` (list[str] | None): 
    A list of the names of network table columns that should be considered in
    the clustering algorithm. If ``None``, every numerical feature will be considered. Leaving it as ``None`` is
    not recommended, because many numerical features, such as network level, say little about the census data itself.

*Returns:*

* ``tuple[np.ndarray[np.float64], dict[str, Any], NetworkTable]``:
    A tuple of a 3D NumPy array, a corresponding dictionary of labels describing
    each dimension of the array, and the network table with all rows containing NaN values removed.
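
The reshaping described above can be sketched with plain ``pandas`` and NumPy. This is a minimal illustration, not the actual implementation: the ``<feature>_<year>`` column naming, the toy values, the ``'T'``/``'N'``/``'F'`` label keys, and the (years, paths, features) axis order are all assumptions.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Toy wide table: one row per network path, one column per feature and year.
    # The "<feature>_<year>" naming is hypothetical.
    table = pd.DataFrame({
        "population_2016": [100.0, 250.0, np.nan],
        "population_2021": [120.0, 240.0, 90.0],
        "income_2016": [50.0, 61.0, 47.0],
        "income_2021": [55.0, 60.0, 49.0],
    })
    years = ["2016", "2021"]
    cols = ["population", "income"]

    # Drop any path that has a missing value in one of the selected columns.
    wanted = [f"{c}_{y}" for y in years for c in cols]
    clean = table.dropna(subset=wanted)

    # Stack one (paths x features) slice per year into a 3D array.
    arr = np.stack(
        [clean[[f"{c}_{y}" for c in cols]].to_numpy(dtype=np.float64) for y in years]
    )
    label_dict = {"T": years, "N": list(clean.index), "F": cols}

    print(arr.shape)  # (2, 2, 2): 2 years, 2 remaining paths, 2 features

The third path is dropped because its 2016 population is missing, which is why the filtered table is returned alongside the array.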


``cluster``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Runs one of ``tscluster``'s clustering algorithms (the default is fully dynamic clustering, ``'z1c1'``)
and adds the resulting cluster assignments to the network table and nodes as an additional feature.
Information about the different clustering algorithms is available in the ``tscluster`` documentation: https://tscluster.readthedocs.io/en/latest/introduction.html.
We recommend either Sequential Label Analysis (``'z1c0'``) or the default ``'z1c1'``.

Users can pass only the network table, in which case ``clustering_prep`` is run for them with the default columns.
Alternatively, they can run ``clustering_prep`` themselves, which gives them the option of applying one or both of the
normalization methods available in ``tscluster.preprocessing.utils`` before clustering.
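
For intuition about what normalization does to the array before clustering, here is a plain NumPy sketch of per-feature min-max scaling. It is a stand-in written for this example and does not call ``tscluster``; the helper name ``min_max_scale`` and the (years, paths, features) layout are assumptions.

.. code-block:: python

    import numpy as np

    def min_max_scale(arr):
        """Scale each feature to [0, 1] using its min and max over all years and paths."""
        lo = arr.min(axis=(0, 1), keepdims=True)
        hi = arr.max(axis=(0, 1), keepdims=True)
        # Use a span of 1 for constant features to avoid dividing by zero.
        span = np.where(hi > lo, hi - lo, 1.0)
        return (arr - lo) / span

    # Toy array: 2 years, 2 paths, 2 features (the second feature is constant).
    arr = np.array([[[100.0, 5.0], [300.0, 5.0]],
                    [[200.0, 5.0], [500.0, 5.0]]])
    scaled = min_max_scale(arr)

Scaling of this kind keeps features measured in large units from dominating the distance calculations used in clustering.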

*Parameters:*

* ``network_table`` (NetworkTable): 
    The NetworkTable containing the data to be clustered.
 
* ``G`` (nx.Graph): 
    The result of ``pc.create_network()``.

* ``num_clusters`` (int): 
    The number of clusters that the algorithm will find.

* ``algo`` (str | None): 
    The algorithm that ``tscluster`` will use, either ``'greedy'`` (default) or ``'opt'``.
    ``'greedy'`` runs ``GreedyTSCluster``, which is faster and easier to run than ``OptTSCluster`` but less accurate.
    Since it doesn't require a special academic licence, we recommend ``'greedy'`` for any non-academic users.
    ``'opt'`` runs ``OptTSCluster``, which is guaranteed to find the optimal clustering but requires a Gurobi academic
    licence. More information about obtaining an academic licence can be found
    here: https://www.gurobi.com/academia/academic-program-and-licenses/
        
* ``scheme`` (str | None): 
    The clustering scheme. See the description of ``cluster`` above for more information. Default is ``'z1c1'``.

* ``arr`` (np.ndarray[np.float64] | None): 
    The array of data to be clustered. If ``None``, ``arr`` and ``label_dict`` will be generated by running
    ``clustering_prep`` with the default columns. See the ``clustering_prep`` documentation for why we do not
    recommend leaving this blank.
        
* ``label_dict`` (dict[str, Any] | None): 
    The label dictionary corresponding to the data array. See ``arr``.
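
The defaults for ``algo`` and ``scheme`` can be summarised in a small validation helper. This is a hypothetical sketch: ``resolve_options`` is not part of piccard, and the set of accepted scheme names beyond ``'z1c0'`` and ``'z1c1'`` is an assumption that should be checked against the ``tscluster`` documentation.

.. code-block:: python

    ALGOS = {"greedy", "opt"}
    # Assumed list of scheme names; only 'z1c0' and 'z1c1' are named in this document.
    SCHEMES = {"z0c0", "z0c1", "z1c0", "z1c1"}

    def resolve_options(algo=None, scheme=None):
        """Fill in the documented defaults and reject unknown values."""
        algo = "greedy" if algo is None else algo
        scheme = "z1c1" if scheme is None else scheme
        if algo not in ALGOS:
            raise ValueError(f"algo must be one of {sorted(ALGOS)}, got {algo!r}")
        if scheme not in SCHEMES:
            raise ValueError(f"scheme must be one of {sorted(SCHEMES)}, got {scheme!r}")
        return algo, scheme

    print(resolve_options())                # ('greedy', 'z1c1')
    print(resolve_options("opt", "z1c0"))   # ('opt', 'z1c0')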

*Returns:*

* ``ClusteredNetworkTable``:
    The ClusteredNetworkTable object.
    Note that ``cluster`` also adds the resulting cluster assignments to the network table and nodes as an additional feature.
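
To make the "additional feature" concrete, the following sketch attaches per-year labels to a toy table. The ``cluster_assignment_<year>`` column names, the ``path_id`` column, the label values, and the (paths, years) label layout are illustrative assumptions, not output from ``cluster``.

.. code-block:: python

    import numpy as np
    import pandas as pd

    years = ["2016", "2021"]
    table = pd.DataFrame({"path_id": ["a", "b", "c"]})

    # Hypothetical labels: one row per path, one column per census year.
    labels = np.array([[0, 0], [1, 0], [1, 1]])

    # Add one cluster-assignment column per year.
    for j, year in enumerate(years):
        table[f"cluster_assignment_{year}"] = labels[:, j]

    # Paths whose cluster changes between the two years.
    changed = table[table["cluster_assignment_2016"] != table["cluster_assignment_2021"]]
    print(changed["path_id"].tolist())  # ['b']

With the assignments stored as ordinary columns, paths that move between clusters over time can be found with standard ``pandas`` filtering.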