Cluster Analysis

Runs unsupervised clustering on paths, using computed behavioral metrics as features, and displays a segment overview heatmap for the discovered clusters. Supports k-means and HDBSCAN, optional NMF decomposition, and a silhouette-based grid search that picks the best-scoring number of clusters.

Usage

es.cluster_analysis()

With options:

es.cluster_analysis(
    features=[
        {"metric": "event_count", "metric_args": {"event": "purchase"}},
        {"metric": "duration"},
        {"metric": "length"},
    ],
    method="kmeans",
    n_clusters="3-6",
    scaler="minmax",
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `features` | `list \| None` | auto | Metric configs used as clustering features. Defaults to event counts for all events. Can be configured interactively via Configure Features. See Path Metrics. |
| `method` | `str` | `"kmeans"` | Clustering algorithm. Options: `"kmeans"`, `"hdbscan"`. |
| `scaler` | `str` | `"minmax"` | Feature scaling. Options: `"minmax"`, `"std"`, or `""` for none. |
| `n_clusters` | `str \| int` | `"3-8"` | Number of clusters. Accepts a single int (`3`), a range string (`"3-8"`), or comma-separated values (`"3,5,7"`). Ranges and lists trigger silhouette grid search to find the best value. |
| `metrics_config` | `list \| None` | auto | Metrics shown in the overview heatmap after clustering. Can be configured interactively via Configure Metrics. See Path Metrics. |
| `path_id_col` | `str \| None` | `None` | Override the path ID column. |
| `height` | `int` | `520` | Widget height in pixels. |
| `sidebar_open` | `bool` | `True` | Whether the settings sidebar starts open. |
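
The three accepted forms of `n_clusters` can be read as a list of candidate cluster counts. The sketch below is only an illustration of that notation, not hopscotch's own parser: the helper names `parse_n_clusters` and `needs_grid_search` are hypothetical, and it assumes that a range such as `"3-8"` includes both endpoints.

```python
def parse_n_clusters(spec):
    """Expand an n_clusters spec into a sorted list of candidate cluster counts.

    Hypothetical helper; assumes range strings are inclusive on both ends.
    """
    if isinstance(spec, int):
        return [spec]
    text = spec.strip()
    if "-" in text:
        # Range string such as "3-8"
        low, high = (int(part) for part in text.split("-", 1))
        if low > high:
            raise ValueError(f"invalid range: {spec!r}")
        return list(range(low, high + 1))
    # Comma-separated values such as "3,5,7" (a bare "4" also lands here)
    return sorted({int(part) for part in text.split(",") if part.strip()})


def needs_grid_search(spec):
    """More than one candidate means the best value has to be searched for."""
    return len(parse_n_clusters(spec)) > 1


print(parse_n_clusters(3))
print(parse_n_clusters("3-8"))
print(parse_n_clusters("3,5,7"))
```

Under this reading, a single int yields one candidate and skips the search, while a range or list yields several candidates to compare.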

NMF decomposition

When NMF Decomposition is enabled in the sidebar, hopscotch applies Non-negative Matrix Factorization to reduce the feature space before clustering. This can improve cluster quality when features are highly correlated. The NMF K parameter controls the number of components and supports the same range notation as n_clusters.
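
To make the idea concrete, here is a minimal sketch of NMF followed by k-means using scikit-learn and NumPy directly. It is not hopscotch's implementation: the feature matrix is synthetic, the component and cluster counts are arbitrary, and the variable names `H` and `W_cluster_means` only echo the keys of the headless result as an assumption about what they hold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: 60 paths x 6 correlated,
# non-negative features generated from 2 underlying factors.
base = rng.random((60, 2))
mixing = rng.random((2, 6))
X = base @ mixing + 0.01 * rng.random((60, 6))

# NMF needs non-negative input; min-max scaling keeps values in [0, 1].
# The clip is a guard against floating-point round-off.
X_scaled = np.clip(MinMaxScaler().fit_transform(X), 0.0, None)

# Factorize X_scaled ~ W @ H with 2 components.
nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X_scaled)  # shape (60, 2): component weights per path
H = nmf.components_              # shape (2, 6): feature loadings per component

# Cluster the paths in the reduced component space instead of the raw features.
n_clusters = 3
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(W)

# Average component weight within each cluster.
W_cluster_means = np.vstack(
    [W[labels == c].mean(axis=0) for c in range(n_clusters)]
)
print(W_cluster_means.shape)
```

Clustering on `W` rather than on the scaled features means correlated features are collapsed into shared components before distances are computed.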

Headless mode

es.cluster_analysis_data() returns clustering results as a dict without rendering a widget.

| Key | Present when | Description |
| --- | --- | --- |
| `overview_df` | always | DataFrame with metrics per cluster (metrics × clusters) |
| `silhouette` | `n_clusters` is a range or list | Dict with `params` and `silhouette` scores for each candidate |
| `nmf` | `nmf_k` is set | Dict with `H_matrix`, `features`, `W_cluster_means` |

result = es.cluster_analysis_data(
    features=[{"metric": "length"}, {"metric": "duration"}],
    method="kmeans",
    n_clusters="3-6",
)
print(result["overview_df"])
print(result["silhouette"])
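
For intuition about what the `silhouette` entry reports, the following sketch runs a silhouette grid search over k-means with scikit-learn on synthetic data. It is independent of hopscotch, and the shape of the resulting dict (parallel `params` and `silhouette` lists) is an assumption about the documented structure, not a confirmed format.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for path features: three well-separated groups of 40 paths.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Same idea as scaler="minmax": bring every feature into [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Candidates corresponding to n_clusters="3-6" (assuming an inclusive range).
candidates = [3, 4, 5, 6]
scores = []
for k in candidates:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    # Mean silhouette coefficient over all samples, in [-1, 1]; higher is better.
    scores.append(float(silhouette_score(X_scaled, labels)))

# Assumed layout: one score per candidate, in the same order.
silhouette = {"params": candidates, "silhouette": scores}
best_k = candidates[int(np.argmax(scores))]
print(silhouette)
print("best k:", best_k)
```

The candidate with the highest mean silhouette score is taken as the best number of clusters, which is the selection rule a silhouette grid search implies.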