Unsupervised Projects (Clustering)
Use clustering when data is not labelled and the problem can be interpreted as grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). It is a common task in data exploration when finding groups and similarities is needed.
Creating Unsupervised Projects
To create an unsupervised project, set unsupervised_mode to True when setting the target.
To specify clustering, set unsupervised_type to CLUSTERING. When a modeling mode is required, clustering supports either AUTOPILOT_MODE.COMPREHENSIVE for DataRobot-run Autopilot or AUTOPILOT_MODE.MANUAL for user control of which models and parameters to use.
Example:
from datarobot import Project
from datarobot.enums import UnsupervisedTypeEnum
from datarobot.enums import AUTOPILOT_MODE
project = Project.create("dataset.csv", project_name="unsupervised clustering")
project.analyze_and_model(
    unsupervised_mode=True,
    mode=AUTOPILOT_MODE.COMPREHENSIVE,
    unsupervised_type=UnsupervisedTypeEnum.CLUSTERING,
)
You can optionally specify an explicit list of cluster counts. To do this, pass a list of integer values to the optional autopilot_cluster_list parameter of the analyze_and_model() method.
project.analyze_and_model(
    unsupervised_mode=True,
    mode=AUTOPILOT_MODE.COMPREHENSIVE,
    unsupervised_type=UnsupervisedTypeEnum.CLUSTERING,
    autopilot_cluster_list=[7, 9, 11, 15, 19],
)
You can also create the project and start modeling in one step using the Project.start() method. By default, this method uses AUTOPILOT_MODE.COMPREHENSIVE.
from datarobot import Project
from datarobot.enums import UnsupervisedTypeEnum
project = Project.start(
    "dataset.csv",
    unsupervised_mode=True,
    project_name="unsupervised clustering project",
    unsupervised_type=UnsupervisedTypeEnum.CLUSTERING,
)
Unsupervised Clustering Project Metric
Unsupervised clustering projects use the Silhouette Score metric for model ranking (rather than for model optimization). It measures the average similarity of objects within a cluster and their distance to the objects in other clusters.
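To make the metric concrete, here is a minimal sketch of the textbook silhouette coefficient in plain Python. It assumes Euclidean distance and the common convention that a point in a single-member cluster scores 0; the function name silhouette_score and the sample points are illustrative, and this is not necessarily how DataRobot computes the metric internally.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Assumed convention: a single-member cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        # The distance from the point to itself is 0, so it adds nothing.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups of two points each.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 3))
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. The sample above scores about 0.9.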
Retrieving information about Clusters
In a trained model, you can retrieve information about clusters along with standard model information. To do this, when training completes, retrieve a model and view basic clustering information:
- n_clusters: the number of clusters in the model
- is_n_clusters_dynamically_determined: whether the clustering model determines the number of clusters dynamically
Here is a code snippet that retrieves the number of clusters for a model:
from datarobot import ClusteringModel
model = ClusteringModel.get(project_id, model_id)
print("{} clusters found".format(model.n_clusters))
You can retrieve more details about clusters and their data using cluster insights.
Working with Clusters Insights
You can compute cluster insights to better understand clusters and their characteristics. This process performs calculations and returns detailed information about each feature and its importance, as well as a detailed per-cluster breakdown.
To compute and retrieve cluster insights, use the compute_insights method of ClusteringModel. The method starts the cluster insights compute job, waits up to the number of seconds specified in the optional max_wait parameter (default: 600) for it to complete, and returns the results when the insights are ready.
If insights are already computed, access them using the insights property of the ClusteringModel class.
from datarobot import ClusteringModel
model = ClusteringModel.get(project_id, model_id)
insights = model.compute_insights()
This call, with max_wait specified, runs the computation and waits up to the given number of seconds:
from datarobot import ClusteringModel
model = ClusteringModel.get(project_id, model_id)
insights = model.compute_insights(max_wait=60)
If computation fails to finish before max_wait expires, the method raises an AsyncTimeoutError. You can retrieve the cluster insights once the job finishes.
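The timeout behavior described above can be handled with an ordinary try/except. The following is a minimal sketch: try_compute_insights is a hypothetical helper name, and the fallback AsyncTimeoutError class exists only so the sketch runs where the DataRobot client is not installed.

```python
try:
    from datarobot.errors import AsyncTimeoutError
except ImportError:
    # Stand-in so the sketch can run without the DataRobot client installed.
    class AsyncTimeoutError(Exception):
        pass


def try_compute_insights(model, max_wait=60):
    """Return computed insights, or None if max_wait expires first."""
    try:
        return model.compute_insights(max_wait=max_wait)
    except AsyncTimeoutError:
        print("Insights not ready after {} seconds.".format(max_wait))
        return None
```

With a real model, call try_compute_insights(ClusteringModel.get(project_id, model_id), max_wait=60). A None result means the wait expired, so the insights have to be read later, once the job has finished.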
To retrieve cluster insights already computed:
from datarobot import ClusteringModel
model = ClusteringModel.get(project_id, model_id)
for insight in model.insights:
    print(insight)
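Once insights are available, a common next step is to see which features matter most. The sketch below ranks insights by the feature_impact attribute and returns the matching feature_name values, both of which are documented on ClusterInsight. The helper name top_features and the sample objects are hypothetical stand-ins, and the None check is a defensive assumption.

```python
from types import SimpleNamespace


def top_features(insights, n=3):
    """Return the n feature names with the highest feature_impact."""
    scored = [i for i in insights if i.feature_impact is not None]
    ranked = sorted(scored, key=lambda i: i.feature_impact, reverse=True)
    return [i.feature_name for i in ranked[:n]]


# Hypothetical stand-ins for ClusterInsight objects; with a real model,
# pass model.insights instead.
sample_insights = [
    SimpleNamespace(feature_name="age", feature_impact=0.42),
    SimpleNamespace(feature_name="income", feature_impact=1.0),
    SimpleNamespace(feature_name="region", feature_impact=0.07),
]
print(top_features(sample_insights, n=2))
```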
Working with Clusters
By default, DataRobot names clusters “Cluster 1”, “Cluster 2”, …, “Cluster N”. You can retrieve these names and change them as you prefer. When clusters are retrieved before insights are computed, they contain only names. After insight computation completes, each cluster also holds the percentage of data that it represents.
For example:
from datarobot import ClusteringModel
model = ClusteringModel.get(project_id, model_id)
# helper function
def print_summary(name, percent):
    if not percent:
        percent = "?"
    print("'{}' holds {} % of data".format(name, percent))

for cluster in model.clusters:
    print_summary(cluster.name, cluster.percent)

model.compute_insights()
print("-- Cluster insights computation finished --")

for cluster in model.clusters:
    print_summary(cluster.name, cluster.percent)
For a model with three clusters, the code snippet will output:
'Cluster 1' holds ? % of data
'Cluster 2' holds ? % of data
'Cluster 3' holds ? % of data
-- Cluster insights computation finished --
'Cluster 1' holds 27.1704180064 % of data
'Cluster 2' holds 36.9131832797 % of data
'Cluster 3' holds 35.9163987138 % of data
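The name and percent attributes can also be used programmatically, for example to find the dominant cluster. This is a minimal sketch using the values from the output above. The helper name largest_cluster is hypothetical, and it assumes percent is None until insights have been computed.

```python
from types import SimpleNamespace


def largest_cluster(clusters):
    """Return the cluster with the highest percent, or None if none is known."""
    known = [c for c in clusters if c.percent is not None]
    if not known:
        return None
    return max(known, key=lambda c: c.percent)


# Hypothetical stand-ins for Cluster objects; with a real model,
# pass model.clusters instead.
sample_clusters = [
    SimpleNamespace(name="Cluster 1", percent=27.1704180064),
    SimpleNamespace(name="Cluster 2", percent=36.9131832797),
    SimpleNamespace(name="Cluster 3", percent=35.9163987138),
]
biggest = largest_cluster(sample_clusters)
print(biggest.name)
```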
Use the following methods of the ClusteringModel class to alter cluster names:
- update_cluster_names: changes multiple cluster names using a list of (current name, new name) tuples
- update_cluster_name: changes one cluster name
After update, each method will return a list of clusters with changed names.
For example:
from datarobot import ClusteringModel
model = ClusteringModel.get(project_id, model_id)
# update multiple
cluster_name_mappings = [
    ("Cluster 1", "AAA"),
    ("Cluster 2", "BBB"),
    ("Cluster 3", "CCC"),
]
clusters = model.update_cluster_names(cluster_name_mappings)
# update single
clusters = model.update_cluster_name("CCC", "DDD")
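Because the server rejects mappings that introduce duplicate names, it can help to check the new names locally before sending them. The sketch below builds the list of tuples from a dictionary and rejects repeated new names. The helper name build_name_mappings is hypothetical, and the check is only partial: it cannot see existing cluster names that are not being renamed, so the server-side validation still applies.

```python
def build_name_mappings(renames):
    """Turn {current_name: new_name} into a list of (current, new) tuples."""
    new_names = list(renames.values())
    if len(set(new_names)) != len(new_names):
        raise ValueError("new cluster names must be unique")
    return [(current, new) for current, new in renames.items()]


mappings = build_name_mappings({"Cluster 1": "AAA", "Cluster 2": "BBB"})
print(mappings)
```

The resulting list can be passed directly to model.update_cluster_names(mappings).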
Clustering Classes Reference
ClusteringModel
- class datarobot.models.model.ClusteringModel(id=None, processes=None, featurelist_name=None, featurelist_id=None, project_id=None, sample_pct=None, model_type=None, model_category=None, is_frozen=None, is_n_clusters_dynamically_determined=None, blueprint_id=None, metrics=None, monotonic_increasing_featurelist_id=None, monotonic_decreasing_featurelist_id=None, n_clusters=None, has_empty_clusters=None, supports_monotonic_constraints=None, is_starred=None, prediction_threshold=None, prediction_threshold_read_only=None, model_number=None, parent_model_id=None, supports_composable_ml=None, training_row_count=None, training_duration=None, training_start_date=None, training_end_date=None, data_selection_method=None, time_window_sample_pct=None, sampling_method=None, model_family_full_name=None, is_trained_into_validation=None, is_trained_into_holdout=None)
ClusteringModel extends the Model class. It provides properties and methods specific to clustering projects.
- compute_insights(max_wait=600)
Compute and retrieve cluster insights for the model. This method waits for the cluster insights job to complete and returns the results once it finishes. If computation takes longer than the specified max_wait, an exception is raised.
- Parameters:
- project_id: str
Project to start creation in.
- model_id: str
Project’s model to start creation in.
- max_wait: int
Maximum number of seconds to wait before giving up
- Returns:
- List of ClusterInsight
- Raises:
- ClientError
Server rejected creation due to client error. The most likely cause is a bad project_id or model_id.
- AsyncFailureError
If any of the responses from the server are unexpected.
- AsyncProcessUnsuccessfulError
If the cluster insights computation has failed or was cancelled.
- AsyncTimeoutError
If the cluster insights computation did not resolve in time
- Return type:
List[ClusterInsight]
- property insights: List[ClusterInsight]
Return actual list of cluster insights if already computed.
- Returns:
- List of ClusterInsight
- update_cluster_names(cluster_name_mappings)
Change many cluster names at once based on list of name mappings.
- Parameters:
- cluster_name_mappings: List of tuples
Cluster name mappings, each consisting of the current cluster name and the new cluster name. Example:
cluster_name_mappings = [ ("current cluster name 1", "new cluster name 1"), ("current cluster name 2", "new cluster name 2")]
- Returns:
- List of Cluster
- Raises:
- datarobot.errors.ClientError
Server rejected update of cluster names. Possible reasons include: incorrect format of mapping, mapping introduces duplicates.
- Return type:
List[Cluster]
- update_cluster_name(current_name, new_name)
Change cluster name from current_name to new_name.
- Parameters:
- current_name: str
Current cluster name.
- new_name: str
New cluster name.
- Returns:
- List of Cluster
- Raises:
- datarobot.errors.ClientError
Server rejected update of cluster names.
- Return type:
List[Cluster]
Cluster
- class datarobot.models.model.Cluster(**kwargs)
Representation of a single cluster.
- Attributes:
- name: str
Current cluster name
- percent: float
Percent of data contained in the cluster. This value is reported after cluster insights are computed for the model.
- classmethod list(project_id, model_id)
Retrieve a list of clusters in the model.
- Parameters:
- project_id: str
ID of the project that the model is part of.
- model_id: str
ID of the model.
- Returns:
- List of clusters
- Return type:
List[Cluster]
- classmethod update_multiple_names(project_id, model_id, cluster_name_mappings)
Update many clusters at once based on list of name mappings.
- Parameters:
- project_id: str
ID of the project that the model is part of.
- model_id: str
ID of the model.
- cluster_name_mappings: List of tuples
Cluster name mappings, each consisting of the current and new names for a cluster. Example:
cluster_name_mappings = [ ("current cluster name 1", "new cluster name 1"), ("current cluster name 2", "new cluster name 2")]
- Returns:
- List of clusters
- Raises:
- datarobot.errors.ClientError
Server rejected update of cluster names.
- ValueError
Invalid cluster name mapping provided.
- Return type:
List[Cluster]
- classmethod update_name(project_id, model_id, current_name, new_name)
Change cluster name from current_name to new_name
- Parameters:
- project_id: str
ID of the project that the model is part of.
- model_id: str
ID of the model.
- current_name: str
Current cluster name
- new_name: str
New cluster name
- Returns:
- List of Cluster
- Return type:
List[Cluster]
ClusterInsight
- class datarobot.models.model.ClusterInsight(**kwargs)
Holds all insight data related to a feature, as well as a per-cluster breakdown.
- Parameters:
- feature_name: str
Name of a feature from the dataset.
- feature_type: str
Type of feature.
- insights: List of classes (ClusterInsight)
List provides information regarding the importance of a specific feature in relation to each cluster. Results help understand how the model is grouping data and what each cluster represents.
- feature_impact: float
Impact of a feature ranging from 0 to 1.
- classmethod compute(project_id, model_id, max_wait=600)
Starts creation of cluster insights for the model and if successful, returns computed ClusterInsights. This method allows calculation to continue for a specified time and if not complete, cancels the request.
- Parameters:
- project_id: str
ID of the project to begin creation of cluster insights for.
- model_id: str
ID of the project model to begin creation of cluster insights for.
- max_wait: int
Maximum number of seconds to wait before canceling the request.
- Returns:
- List[ClusterInsight]
- Raises:
- ClientError
Server rejected creation due to client error. The most likely cause is a bad project_id or model_id.
- AsyncFailureError
Indicates whether any of the responses from the server are unexpected.
- AsyncProcessUnsuccessfulError
Indicates whether the cluster insights computation failed or was cancelled.
- AsyncTimeoutError
Indicates whether the cluster insights computation did not resolve within the specified time limit (max_wait).
- Return type:
List[ClusterInsight]