Cluster Analysis

Runs unsupervised clustering on paths based on computed behavioral metrics and displays a segment overview heatmap for the discovered clusters. Supports k-means and HDBSCAN, optional NMF decomposition, and silhouette-based grid search to find the optimal number of clusters.

Usage

es.cluster_analysis()

With options:

es.cluster_analysis(
    features=[
        {"metric": "event_count", "metric_args": {"event": "purchase"}},
        {"metric": "duration"},
        {"metric": "length"},
    ],
    method="kmeans",
    n_clusters="3-6",
    scaler="minmax",
)

Parameters

Parameter       Type          Default   Description
features        list | None   auto      Metric configs used as clustering features. Defaults to event counts for all events. Can be configured interactively via Configure Features. See Path Metrics.
method          str           "kmeans"  Clustering algorithm. Options: "kmeans", "hdbscan".
scaler          str           "minmax"  Feature scaling. Options: "minmax", "std", or "" for none.
n_clusters      str | int     "3-8"     Number of clusters. Accepts a single int (3), a range string ("3-8"), or comma-separated values ("3,5,7"). Ranges and lists trigger silhouette grid search to find the best value.
metrics_config  list | None   auto      Metrics shown in the overview heatmap after clustering. Can be configured interactively via Configure Metrics. See Path Metrics.
path_id_col     str | None    None      Override the path ID column.
height          int           520       Widget height in pixels.
sidebar_open    bool          True      Whether the settings sidebar starts open.
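
The three n_clusters notations can be illustrated with a small parser. This is a minimal sketch, not the library's own code: parse_candidates and needs_grid_search are hypothetical names, and treating the range as inclusive on both ends is an assumption.

```python
def parse_candidates(spec):
    """Expand an int, a "3-8" range, or a "3,5,7" list into candidate values.

    Hypothetical helper; the range is assumed to include both endpoints.
    """
    if isinstance(spec, int):
        return [spec]
    text = spec.strip()
    if "-" in text:
        low, high = (int(part) for part in text.split("-", 1))
        return list(range(low, high + 1))
    return [int(part) for part in text.split(",")]


def needs_grid_search(spec):
    """More than one candidate means a grid search has to pick among them."""
    return len(parse_candidates(spec)) > 1


for spec in (3, "3-6", "3,5,7"):
    print(spec, parse_candidates(spec), needs_grid_search(spec))
```

Under these assumptions a single int yields one candidate and skips the search, while a range or list yields several candidates to score.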

NMF decomposition

When NMF Decomposition is enabled in the sidebar, hopscotch applies Non-negative Matrix Factorization to reduce the feature space before clustering. This can improve cluster quality when features are highly correlated. The NMF K parameter controls the number of components and supports the same range notation as n_clusters.
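
As a rough illustration of the decomposition step, the sketch below factors a small non-negative paths × features matrix V into W (paths × components) and H (components × features) using Lee and Seung's multiplicative updates. It is a pure-Python stand-in, not hopscotch's implementation: the function names, the toy matrix, and the choice of update rule are all assumptions. Clustering would then presumably run on the rows of W rather than on the original features.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the updates never divide by zero.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Frobenius distance between V and W @ H."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb))


# Toy data: 4 paths x 4 features, where features 0-1 and 2-3 move together.
v = [[5, 4, 0, 0], [10, 8, 0, 0], [0, 0, 3, 6], [0, 0, 1, 2]]
w, h = nmf(v, k=2)
print("error:", reconstruction_error(v, w, h))
print("reduced features (W):", [[round(x, 2) for x in row] for row in w])
```

Here the number of components k plays the role of the NMF K setting: correlated feature pairs collapse into a single component each, so the clustering step sees two columns instead of four.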

Headless mode

es.cluster_analysis_data() returns the clustering results as a dict without rendering a widget.

Key          Present when                   Description
overview_df  always                         DataFrame with metrics per cluster (metrics × clusters)
silhouette   n_clusters is a range or list  Dict with params and silhouette scores for each candidate
nmf          nmf_k is set                   Dict with H_matrix, features, W_cluster_means

result = es.cluster_analysis_data(
    features=[{"metric": "length"}, {"metric": "duration"}],
    method="kmeans",
    n_clusters="3-6",
)
print(result["overview_df"])
print(result["silhouette"])
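
To make the silhouette scores easier to interpret, here is a minimal pure-Python sketch of how a silhouette score can rank candidate cluster counts. The silhouette function, the toy points, and the hand-written labelings (stand-ins for what a clustering algorithm might return for each candidate) are all illustrative assumptions; the library's own scoring may differ in detail.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Three well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]

# Hand-written labelings for each candidate number of clusters.
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The candidate with the highest mean silhouette is the one whose clusters are tight relative to their distance from neighbouring clusters, which is the criterion a silhouette grid search optimizes.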