Cluster Analysis
Runs unsupervised clustering on paths based on computed behavioral metrics and displays a segment overview heatmap for the discovered clusters. Supports k-means and HDBSCAN, optional NMF decomposition, and silhouette-based grid search to find the optimal number of clusters.
Usage
es.cluster_analysis()

With options:

es.cluster_analysis(
    features=[
        {"metric": "event_count", "metric_args": {"event": "purchase"}},
        {"metric": "duration"},
        {"metric": "length"},
    ],
    method="kmeans",
    n_clusters="3-6",
    scaler="minmax",
)

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| features | list \| None | auto | Metric configs used as clustering features. Defaults to event counts for all events. Can be configured interactively via Configure Features. See Path Metrics. |
| method | str | "kmeans" | Clustering algorithm. Options: "kmeans", "hdbscan". |
| scaler | str | "minmax" | Feature scaling. Options: "minmax", "std", or "" for none. |
| n_clusters | str \| int | "3-8" | Number of clusters. Accepts a single int (3), a range string ("3-8"), or comma-separated values ("3,5,7"). Ranges and lists trigger silhouette grid search to find the best value. |
| metrics_config | list \| None | auto | Metrics shown in the overview heatmap after clustering. Can be configured interactively via Configure Metrics. See Path Metrics. |
| path_id_col | str \| None | None | Override the path ID column. |
| height | int | 520 | Widget height in pixels. |
| sidebar_open | bool | True | Whether the settings sidebar starts open. |
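The three n_clusters notations can be illustrated with a small parser. This is a minimal sketch, not the library's own code: parse_n_clusters is a hypothetical helper name, and treating the range as inclusive on both ends is an assumption based on the "3-8" default.

```python
def parse_n_clusters(spec):
    """Expand an n_clusters spec into a list of candidate cluster counts.

    Hypothetical helper for illustration; the range is assumed to be inclusive.
    """
    if isinstance(spec, int):
        return [spec]                      # single value: no grid search needed
    spec = spec.strip()
    if "-" in spec:
        low, high = (int(part) for part in spec.split("-", 1))
        return list(range(low, high + 1))  # "3-6" -> [3, 4, 5, 6]
    return [int(part) for part in spec.split(",")]  # "3,5,7" -> [3, 5, 7]


for spec in (3, "3-6", "3,5,7"):
    candidates = parse_n_clusters(spec)
    # More than one candidate is the case that would trigger a silhouette grid search.
    print(spec, candidates, len(candidates) > 1)
```

Under these assumptions, only a spec that expands to more than one candidate needs the grid search; a single int is used directly.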
NMF decomposition
When NMF Decomposition is enabled in the sidebar, hopscotch applies Non-negative Matrix Factorization to reduce the feature space before clustering. This can improve cluster quality when features are highly correlated. The NMF K parameter controls the number of components and supports the same range notation as n_clusters.
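To make the decomposition step concrete, the following pure-Python sketch factors a non-negative feature matrix X (paths × features) into W (paths × components) and H (components × features) using the classic multiplicative update rules. It is an illustration under stated assumptions rather than the library's implementation: the function names, the toy matrix, and the choice of update rule are all made up for this example. The rows of W would then serve as the reduced features passed to the clustering step.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(x, w, h):
    """Frobenius norm of X - WH, i.e. the reconstruction error."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar)) ** 0.5


def nmf(x, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative x (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


# Toy matrix: 4 paths x 3 features (made-up values).
x = [[5.0, 0.0, 1.0], [4.0, 0.0, 1.0], [0.0, 3.0, 2.0], [0.0, 4.0, 2.0]]
w, h = nmf(x, k=2)
print(round(frobenius_error(x, w, h), 3))
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is what lets each component be read as an additive mix of the original features.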
Headless mode
es.cluster_analysis_data() returns clustering results as a dict without rendering a widget. The dict contains the following keys:
| Key | Present when | Description |
|---|---|---|
| overview_df | always | DataFrame with metrics per cluster (metrics × clusters) |
| silhouette | n_clusters is a range or list | Dict with params and silhouette scores for each candidate |
| nmf | nmf_k is set | Dict with H_matrix, features, W_cluster_means |
result = es.cluster_analysis_data(
    features=[{"metric": "length"}, {"metric": "duration"}],
    method="kmeans",
    n_clusters="3-6",
)
print(result["overview_df"])
print(result["silhouette"])
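For readers unfamiliar with silhouette scores, the sketch below shows how a mean silhouette coefficient can rank candidate cluster counts. It is a self-contained illustration, not the library's code: the points, the candidate labelings, and the helper name are invented for the example, and scoring singleton clusters as 0 is a common convention assumed here.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (pure-Python sketch)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(dist(p, q))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or no other cluster to compare with
            continue
        a = sum(own) / len(own)                                 # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Two well-separated groups of made-up 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Hypothetical labelings for two candidate cluster counts.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}
best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k]))
print(best_k)
```

Here the two-cluster labeling scores higher because splitting one tight group lowers the silhouette of its members, which is the kind of comparison a grid search over n_clusters relies on.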