-rw-r--r-- .gitignore 1
-rw-r--r-- python-clustergram.spec 1185
-rw-r--r-- sources 1
3 files changed, 1187 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..6a47078 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/clustergram-0.7.0.tar.gz
diff --git a/python-clustergram.spec b/python-clustergram.spec
new file mode 100644
index 0000000..201a24a
--- /dev/null
+++ b/python-clustergram.spec
@@ -0,0 +1,1185 @@
+%global _empty_manifest_terminate_build 0
+Name: python-clustergram
+Version: 0.7.0
+Release: 1
+Summary: Clustergram - visualization and diagnostics for cluster analysis
+License: MIT
+URL: https://pypi.org/project/clustergram/
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/3b/03/2bf3032fd8ae1f0201579d8d020099e62c30e62519a8c5f7ae73a1166b8e/clustergram-0.7.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-pandas
+Requires: python3-numpy
+Requires: python3-matplotlib
+
+%description
+# Clustergram
+
+![logo
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/main/doc/_static/logo.svg)
+
+## Visualization and diagnostics for cluster analysis
+
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4750483.svg)](https://doi.org/10.5281/zenodo.4750483)
+
+Clustergram is a diagram proposed by Matthias Schonlau in his paper *[The clustergram: A
+graph for visualizing hierarchical and nonhierarchical cluster
+analyses](https://journals.sagepub.com/doi/10.1177/1536867X0200200405)*:
+
+> In hierarchical cluster analysis, dendrograms are used to visualize how clusters are
+> formed. I propose an alternative graph called a “clustergram” to examine how cluster
+> members are assigned to clusters as the number of clusters increases. This graph is
+> useful in exploratory analysis for nonhierarchical clustering algorithms such as
+> k-means and for hierarchical cluster algorithms when the number of observations is
+> large enough to make dendrograms impractical.
+
+The clustergram was later implemented in R by [Tal
+Galili](https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/),
+who also gives a thorough explanation of the concept.
+
+This is a Python implementation, originally based on Tal's script, written for
+`scikit-learn` and RAPIDS `cuML` implementations of K-Means, Mini Batch K-Means and
+Gaussian Mixture Model (scikit-learn only) clustering, plus hierarchical/agglomerative
+clustering using `SciPy`. Alternatively, you can create a clustergram using the `from_*`
+constructors based on alternative clustering algorithms.
+
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/martinfleis/clustergram/main?urlpath=tree/doc/notebooks/)
+
+## Getting started
+
+You can install clustergram from `conda` or `pip`:
+
+```shell
+conda install clustergram -c conda-forge
+```
+
+```shell
+pip install clustergram
+```
+
+In any case, you still need to install your selected backend (`scikit-learn` and `scipy`
+or `cuML`).
+
+An example of a clustergram on the Palmer penguins dataset:
+
+```python
+import seaborn
+df = seaborn.load_dataset('penguins')
+```
+
+First we have to select numerical data and scale them.
+
+```python
+from sklearn.preprocessing import scale
+data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
+```
+
+And then we can simply pass the data to `clustergram`.
+
+```python
+from clustergram import Clustergram
+
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot()
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/default.png)
+
+## Styling
+
+`Clustergram.plot()` returns a matplotlib axis and can be customised like any other
+matplotlib plot.
+
+```python
+seaborn.set(style='whitegrid')
+
+cgram.plot(
+ ax=None,  # or pass an existing matplotlib Axes to draw on
+ size=0.5,
+ linewidth=0.5,
+ cluster_style={"color": "lightblue", "edgecolor": "black"},
+ line_style={"color": "red", "linestyle": "-."},
+ figsize=(12, 8)
+)
+```
+
+![Colored
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/colors.png)
+
+## Mean options
+
+On the `y` axis, a clustergram can use mean values as in the original paper by Matthias
+Schonlau or PCA weighted mean values as in the implementation by Tal Galili.
+
+```python
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8), pca_weighted=True)
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/pca_true.png)
+
+```python
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8), pca_weighted=False)
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/pca_false.png)
+
+## Scikit-learn, SciPy and RAPIDS cuML backends
+
+Clustergram offers three backends for the computation: `scikit-learn` and `scipy`, which
+use the CPU, and RAPIDS.AI `cuML`, which uses the GPU. All are optional dependencies,
+but you need at least one of them to generate a clustergram.
+
+Using `scikit-learn` (default):
+
+```python
+cgram = Clustergram(range(1, 8), backend='sklearn')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using `cuML`:
+
+```python
+cgram = Clustergram(range(1, 8), backend='cuML')
+cgram.fit(data)
+cgram.plot()
+```
+
+`data` can be any data type supported by the selected backend (including
+`cudf.DataFrame` with the `cuML` backend).
+
+## Supported methods
+
+Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and
+SciPy's hierarchical clustering methods. Note that GMM and Mini Batch K-Means are
+supported only for the `scikit-learn` backend and hierarchical methods are supported only
+for the `scipy` backend.
+
+Using K-Means (default):
+
+```python
+cgram = Clustergram(range(1, 8), method='kmeans')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Mini Batch K-Means, which can provide significant speedup over K-Means:
+
+```python
+cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Gaussian Mixture Model:
+
+```python
+cgram = Clustergram(range(1, 8), method='gmm')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Ward's hierarchical clustering:
+
+```python
+cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
+cgram.fit(data)
+cgram.plot()
+```
+
+## Manual input
+
+Alternatively, you can create a clustergram using the `from_data` or `from_centers` methods
+based on alternative clustering algorithms.
+
+Using `Clustergram.from_data` which creates cluster centers as mean or median values:
+
+```python
+data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
+labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
+
+cgram = Clustergram.from_data(data, labels)
+cgram.plot()
+```
+
+Using `Clustergram.from_centers` based on explicit cluster centers:
+
+```python
+labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
+centers = {
+ 1: numpy.array([[0, 0]]),
+ 2: numpy.array([[-1, -1], [1, 1]]),
+ 3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
+ }
+cgram = Clustergram.from_centers(centers, labels)
+cgram.plot(pca_weighted=False)
+```
+
+To support PCA weighted plots you also need to pass data:
+
+```python
+cgram = Clustergram.from_centers(centers, labels, data=data)
+cgram.plot()
+```
+
+## Partial plot
+
+`Clustergram.plot()` can also plot only a part of the diagram, if you want to focus on a
+limited range of `k`.
+
+```python
+cgram = Clustergram(range(1, 20))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8))
+```
+
+![Long
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/20_clusters.png)
+
+```python
+cgram.plot(k_range=range(3, 10), figsize=(12, 8))
+```
+
+![Limited
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/limited_plot.png)
+
+## Additional clustering performance evaluation
+
+Clustergram includes handy wrappers around a selection of clustering performance metrics
+offered by `scikit-learn`. Data which were originally computed on GPU are converted to
+numpy on the fly.
+
+### Silhouette score
+
+Compute the mean Silhouette Coefficient of all samples. See [`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)
+for details.
+
+```python
+>>> cgram.silhouette_score()
+2 0.531540
+3 0.447219
+4 0.400154
+5 0.377720
+6 0.372128
+7 0.331575
+Name: silhouette_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.silhouette`. Calling the original
+method will recompute the score.
+
+### Calinski and Harabasz score
+
+Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See
+[`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score)
+for details.
+
+```python
+>>> cgram.calinski_harabasz_score()
+2 482.191469
+3 441.677075
+4 400.392131
+5 411.175066
+6 382.731416
+7 352.447569
+Name: calinski_harabasz_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.calinski_harabasz`. Calling the
+original method will recompute the score.
+
+### Davies-Bouldin score
+
+Compute the Davies-Bouldin score. See [`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score)
+for details.
+
+```python
+>>> cgram.davies_bouldin_score()
+2 0.714064
+3 0.943553
+4 0.943320
+5 0.973248
+6 0.950910
+7 1.074937
+Name: davies_bouldin_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.davies_bouldin`. Calling the
+original method will recompute the score.
+
+## Accessing labels
+
+`Clustergram` stores resulting labels for each of the tested options, which can be
+accessed as:
+
+```python
+>>> cgram.labels
+ 1 2 3 4 5 6 7
+0 0 0 2 2 3 2 1
+1 0 0 2 2 3 2 1
+2 0 0 2 2 3 2 1
+3 0 0 2 2 3 2 1
+4 0 0 2 2 0 0 3
+.. .. .. .. .. .. .. ..
+337 0 1 1 3 2 5 0
+338 0 1 1 3 2 5 0
+339 0 1 1 1 1 1 4
+340 0 1 1 3 2 5 5
+341 0 1 1 1 1 1 5
+```
+
+## Saving clustergram
+
+You can save both the plot and the `clustergram.Clustergram` object to disk.
+
+### Saving plot
+
+`Clustergram.plot()` returns matplotlib axis object and as such can be saved as any
+other plot:
+
+```python
+import matplotlib.pyplot as plt
+
+cgram.plot()
+plt.savefig('clustergram.svg')
+```
+
+### Saving object
+
+If you want to save your computed `clustergram.Clustergram` object to disk, you can
+use the `pickle` library:
+
+```python
+import pickle
+
+with open('clustergram.pickle','wb') as f:
+ pickle.dump(cgram, f)
+```
+
+Then loading is equally simple:
+
+```python
+with open('clustergram.pickle','rb') as f:
+ loaded = pickle.load(f)
+```
+
+## References
+
+Schonlau M. The clustergram: a graph for visualizing hierarchical and non-hierarchical
+cluster analyses. The Stata Journal, 2002; 2 (4):391-402.
+
+Schonlau M. Visualizing Hierarchical and Non-Hierarchical Cluster Analyses with
+Clustergrams. Computational Statistics, 2004; 19(1):95-111.
+
+[https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/](https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/)
+
+
+%package -n python3-clustergram
+Summary: Clustergram - visualization and diagnostics for cluster analysis
+Provides: python-clustergram
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-clustergram
+# Clustergram
+
+![logo
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/main/doc/_static/logo.svg)
+
+## Visualization and diagnostics for cluster analysis
+
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4750483.svg)](https://doi.org/10.5281/zenodo.4750483)
+
+Clustergram is a diagram proposed by Matthias Schonlau in his paper *[The clustergram: A
+graph for visualizing hierarchical and nonhierarchical cluster
+analyses](https://journals.sagepub.com/doi/10.1177/1536867X0200200405)*:
+
+> In hierarchical cluster analysis, dendrograms are used to visualize how clusters are
+> formed. I propose an alternative graph called a “clustergram” to examine how cluster
+> members are assigned to clusters as the number of clusters increases. This graph is
+> useful in exploratory analysis for nonhierarchical clustering algorithms such as
+> k-means and for hierarchical cluster algorithms when the number of observations is
+> large enough to make dendrograms impractical.
+
+The clustergram was later implemented in R by [Tal
+Galili](https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/),
+who also gives a thorough explanation of the concept.
+
+This is a Python implementation, originally based on Tal's script, written for
+`scikit-learn` and RAPIDS `cuML` implementations of K-Means, Mini Batch K-Means and
+Gaussian Mixture Model (scikit-learn only) clustering, plus hierarchical/agglomerative
+clustering using `SciPy`. Alternatively, you can create a clustergram using the `from_*`
+constructors based on alternative clustering algorithms.
+
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/martinfleis/clustergram/main?urlpath=tree/doc/notebooks/)
+
+## Getting started
+
+You can install clustergram from `conda` or `pip`:
+
+```shell
+conda install clustergram -c conda-forge
+```
+
+```shell
+pip install clustergram
+```
+
+In any case, you still need to install your selected backend (`scikit-learn` and `scipy`
+or `cuML`).
+
+An example of a clustergram on the Palmer penguins dataset:
+
+```python
+import seaborn
+df = seaborn.load_dataset('penguins')
+```
+
+First we have to select numerical data and scale them.
+
+```python
+from sklearn.preprocessing import scale
+data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
+```
+
+And then we can simply pass the data to `clustergram`.
+
+```python
+from clustergram import Clustergram
+
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot()
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/default.png)
+
+## Styling
+
+`Clustergram.plot()` returns a matplotlib axis and can be customised like any other
+matplotlib plot.
+
+```python
+seaborn.set(style='whitegrid')
+
+cgram.plot(
+ ax=None,  # or pass an existing matplotlib Axes to draw on
+ size=0.5,
+ linewidth=0.5,
+ cluster_style={"color": "lightblue", "edgecolor": "black"},
+ line_style={"color": "red", "linestyle": "-."},
+ figsize=(12, 8)
+)
+```
+
+![Colored
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/colors.png)
+
+## Mean options
+
+On the `y` axis, a clustergram can use mean values as in the original paper by Matthias
+Schonlau or PCA weighted mean values as in the implementation by Tal Galili.
+
+```python
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8), pca_weighted=True)
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/pca_true.png)
+
+```python
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8), pca_weighted=False)
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/pca_false.png)
+
+## Scikit-learn, SciPy and RAPIDS cuML backends
+
+Clustergram offers three backends for the computation: `scikit-learn` and `scipy`, which
+use the CPU, and RAPIDS.AI `cuML`, which uses the GPU. All are optional dependencies,
+but you need at least one of them to generate a clustergram.
+
+Using `scikit-learn` (default):
+
+```python
+cgram = Clustergram(range(1, 8), backend='sklearn')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using `cuML`:
+
+```python
+cgram = Clustergram(range(1, 8), backend='cuML')
+cgram.fit(data)
+cgram.plot()
+```
+
+`data` can be any data type supported by the selected backend (including
+`cudf.DataFrame` with the `cuML` backend).
+
+## Supported methods
+
+Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and
+SciPy's hierarchical clustering methods. Note that GMM and Mini Batch K-Means are
+supported only for the `scikit-learn` backend and hierarchical methods are supported only
+for the `scipy` backend.
+
+Using K-Means (default):
+
+```python
+cgram = Clustergram(range(1, 8), method='kmeans')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Mini Batch K-Means, which can provide significant speedup over K-Means:
+
+```python
+cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Gaussian Mixture Model:
+
+```python
+cgram = Clustergram(range(1, 8), method='gmm')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Ward's hierarchical clustering:
+
+```python
+cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
+cgram.fit(data)
+cgram.plot()
+```
+
+## Manual input
+
+Alternatively, you can create a clustergram using the `from_data` or `from_centers` methods
+based on alternative clustering algorithms.
+
+Using `Clustergram.from_data` which creates cluster centers as mean or median values:
+
+```python
+data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
+labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
+
+cgram = Clustergram.from_data(data, labels)
+cgram.plot()
+```
+
+Using `Clustergram.from_centers` based on explicit cluster centers:
+
+```python
+labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
+centers = {
+ 1: numpy.array([[0, 0]]),
+ 2: numpy.array([[-1, -1], [1, 1]]),
+ 3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
+ }
+cgram = Clustergram.from_centers(centers, labels)
+cgram.plot(pca_weighted=False)
+```
+
+To support PCA weighted plots you also need to pass data:
+
+```python
+cgram = Clustergram.from_centers(centers, labels, data=data)
+cgram.plot()
+```
+
+## Partial plot
+
+`Clustergram.plot()` can also plot only a part of the diagram, if you want to focus on a
+limited range of `k`.
+
+```python
+cgram = Clustergram(range(1, 20))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8))
+```
+
+![Long
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/20_clusters.png)
+
+```python
+cgram.plot(k_range=range(3, 10), figsize=(12, 8))
+```
+
+![Limited
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/limited_plot.png)
+
+## Additional clustering performance evaluation
+
+Clustergram includes handy wrappers around a selection of clustering performance metrics
+offered by `scikit-learn`. Data which were originally computed on GPU are converted to
+numpy on the fly.
+
+### Silhouette score
+
+Compute the mean Silhouette Coefficient of all samples. See [`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)
+for details.
+
+```python
+>>> cgram.silhouette_score()
+2 0.531540
+3 0.447219
+4 0.400154
+5 0.377720
+6 0.372128
+7 0.331575
+Name: silhouette_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.silhouette`. Calling the original
+method will recompute the score.
+
+### Calinski and Harabasz score
+
+Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See
+[`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score)
+for details.
+
+```python
+>>> cgram.calinski_harabasz_score()
+2 482.191469
+3 441.677075
+4 400.392131
+5 411.175066
+6 382.731416
+7 352.447569
+Name: calinski_harabasz_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.calinski_harabasz`. Calling the
+original method will recompute the score.
+
+### Davies-Bouldin score
+
+Compute the Davies-Bouldin score. See [`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score)
+for details.
+
+```python
+>>> cgram.davies_bouldin_score()
+2 0.714064
+3 0.943553
+4 0.943320
+5 0.973248
+6 0.950910
+7 1.074937
+Name: davies_bouldin_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.davies_bouldin`. Calling the
+original method will recompute the score.
+
+## Accessing labels
+
+`Clustergram` stores resulting labels for each of the tested options, which can be
+accessed as:
+
+```python
+>>> cgram.labels
+ 1 2 3 4 5 6 7
+0 0 0 2 2 3 2 1
+1 0 0 2 2 3 2 1
+2 0 0 2 2 3 2 1
+3 0 0 2 2 3 2 1
+4 0 0 2 2 0 0 3
+.. .. .. .. .. .. .. ..
+337 0 1 1 3 2 5 0
+338 0 1 1 3 2 5 0
+339 0 1 1 1 1 1 4
+340 0 1 1 3 2 5 5
+341 0 1 1 1 1 1 5
+```
+
+## Saving clustergram
+
+You can save both the plot and the `clustergram.Clustergram` object to disk.
+
+### Saving plot
+
+`Clustergram.plot()` returns matplotlib axis object and as such can be saved as any
+other plot:
+
+```python
+import matplotlib.pyplot as plt
+
+cgram.plot()
+plt.savefig('clustergram.svg')
+```
+
+### Saving object
+
+If you want to save your computed `clustergram.Clustergram` object to disk, you can
+use the `pickle` library:
+
+```python
+import pickle
+
+with open('clustergram.pickle','wb') as f:
+ pickle.dump(cgram, f)
+```
+
+Then loading is equally simple:
+
+```python
+with open('clustergram.pickle','rb') as f:
+ loaded = pickle.load(f)
+```
+
+## References
+
+Schonlau M. The clustergram: a graph for visualizing hierarchical and non-hierarchical
+cluster analyses. The Stata Journal, 2002; 2 (4):391-402.
+
+Schonlau M. Visualizing Hierarchical and Non-Hierarchical Cluster Analyses with
+Clustergrams. Computational Statistics, 2004; 19(1):95-111.
+
+[https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/](https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/)
+
+
+%package help
+Summary: Development documents and examples for clustergram
+Provides: python3-clustergram-doc
+%description help
+# Clustergram
+
+![logo
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/main/doc/_static/logo.svg)
+
+## Visualization and diagnostics for cluster analysis
+
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4750483.svg)](https://doi.org/10.5281/zenodo.4750483)
+
+Clustergram is a diagram proposed by Matthias Schonlau in his paper *[The clustergram: A
+graph for visualizing hierarchical and nonhierarchical cluster
+analyses](https://journals.sagepub.com/doi/10.1177/1536867X0200200405)*:
+
+> In hierarchical cluster analysis, dendrograms are used to visualize how clusters are
+> formed. I propose an alternative graph called a “clustergram” to examine how cluster
+> members are assigned to clusters as the number of clusters increases. This graph is
+> useful in exploratory analysis for nonhierarchical clustering algorithms such as
+> k-means and for hierarchical cluster algorithms when the number of observations is
+> large enough to make dendrograms impractical.
+
+The clustergram was later implemented in R by [Tal
+Galili](https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/),
+who also gives a thorough explanation of the concept.
+
+This is a Python implementation, originally based on Tal's script, written for
+`scikit-learn` and RAPIDS `cuML` implementations of K-Means, Mini Batch K-Means and
+Gaussian Mixture Model (scikit-learn only) clustering, plus hierarchical/agglomerative
+clustering using `SciPy`. Alternatively, you can create a clustergram using the `from_*`
+constructors based on alternative clustering algorithms.
+
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/martinfleis/clustergram/main?urlpath=tree/doc/notebooks/)
+
+## Getting started
+
+You can install clustergram from `conda` or `pip`:
+
+```shell
+conda install clustergram -c conda-forge
+```
+
+```shell
+pip install clustergram
+```
+
+In any case, you still need to install your selected backend (`scikit-learn` and `scipy`
+or `cuML`).
+
+An example of a clustergram on the Palmer penguins dataset:
+
+```python
+import seaborn
+df = seaborn.load_dataset('penguins')
+```
+
+First we have to select numerical data and scale them.
+
+```python
+from sklearn.preprocessing import scale
+data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
+```
+
+And then we can simply pass the data to `clustergram`.
+
+```python
+from clustergram import Clustergram
+
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot()
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/default.png)
+
+## Styling
+
+`Clustergram.plot()` returns a matplotlib axis and can be customised like any other
+matplotlib plot.
+
+```python
+seaborn.set(style='whitegrid')
+
+cgram.plot(
+ ax=None,  # or pass an existing matplotlib Axes to draw on
+ size=0.5,
+ linewidth=0.5,
+ cluster_style={"color": "lightblue", "edgecolor": "black"},
+ line_style={"color": "red", "linestyle": "-."},
+ figsize=(12, 8)
+)
+```
+
+![Colored
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/colors.png)
+
+## Mean options
+
+On the `y` axis, a clustergram can use mean values as in the original paper by Matthias
+Schonlau or PCA weighted mean values as in the implementation by Tal Galili.
+
+```python
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8), pca_weighted=True)
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/pca_true.png)
+
+```python
+cgram = Clustergram(range(1, 8))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8), pca_weighted=False)
+```
+
+![Default
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/pca_false.png)
+
+## Scikit-learn, SciPy and RAPIDS cuML backends
+
+Clustergram offers three backends for the computation: `scikit-learn` and `scipy`, which
+use the CPU, and RAPIDS.AI `cuML`, which uses the GPU. All are optional dependencies,
+but you need at least one of them to generate a clustergram.
+
+Using `scikit-learn` (default):
+
+```python
+cgram = Clustergram(range(1, 8), backend='sklearn')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using `cuML`:
+
+```python
+cgram = Clustergram(range(1, 8), backend='cuML')
+cgram.fit(data)
+cgram.plot()
+```
+
+`data` can be any data type supported by the selected backend (including
+`cudf.DataFrame` with the `cuML` backend).
+
+## Supported methods
+
+Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and
+SciPy's hierarchical clustering methods. Note that GMM and Mini Batch K-Means are
+supported only for the `scikit-learn` backend and hierarchical methods are supported only
+for the `scipy` backend.
+
+Using K-Means (default):
+
+```python
+cgram = Clustergram(range(1, 8), method='kmeans')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Mini Batch K-Means, which can provide significant speedup over K-Means:
+
+```python
+cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Gaussian Mixture Model:
+
+```python
+cgram = Clustergram(range(1, 8), method='gmm')
+cgram.fit(data)
+cgram.plot()
+```
+
+Using Ward's hierarchical clustering:
+
+```python
+cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
+cgram.fit(data)
+cgram.plot()
+```
+
+## Manual input
+
+Alternatively, you can create a clustergram using the `from_data` or `from_centers` methods
+based on alternative clustering algorithms.
+
+Using `Clustergram.from_data` which creates cluster centers as mean or median values:
+
+```python
+data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
+labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
+
+cgram = Clustergram.from_data(data, labels)
+cgram.plot()
+```
+
+Using `Clustergram.from_centers` based on explicit cluster centers:
+
+```python
+labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
+centers = {
+ 1: numpy.array([[0, 0]]),
+ 2: numpy.array([[-1, -1], [1, 1]]),
+ 3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
+ }
+cgram = Clustergram.from_centers(centers, labels)
+cgram.plot(pca_weighted=False)
+```
+
+To support PCA weighted plots you also need to pass data:
+
+```python
+cgram = Clustergram.from_centers(centers, labels, data=data)
+cgram.plot()
+```
+
+## Partial plot
+
+`Clustergram.plot()` can also plot only a part of the diagram, if you want to focus on a
+limited range of `k`.
+
+```python
+cgram = Clustergram(range(1, 20))
+cgram.fit(data)
+cgram.plot(figsize=(12, 8))
+```
+
+![Long
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/20_clusters.png)
+
+```python
+cgram.plot(k_range=range(3, 10), figsize=(12, 8))
+```
+
+![Limited
+clustergram](https://raw.githubusercontent.com/martinfleis/clustergram/master/doc/_static/limited_plot.png)
+
+## Additional clustering performance evaluation
+
+Clustergram includes handy wrappers around a selection of clustering performance metrics
+offered by `scikit-learn`. Data which were originally computed on GPU are converted to
+numpy on the fly.
+
+### Silhouette score
+
+Compute the mean Silhouette Coefficient of all samples. See [`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)
+for details.
+
+```python
+>>> cgram.silhouette_score()
+2 0.531540
+3 0.447219
+4 0.400154
+5 0.377720
+6 0.372128
+7 0.331575
+Name: silhouette_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.silhouette`. Calling the original
+method will recompute the score.
+
+### Calinski and Harabasz score
+
+Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See
+[`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score)
+for details.
+
+```python
+>>> cgram.calinski_harabasz_score()
+2 482.191469
+3 441.677075
+4 400.392131
+5 411.175066
+6 382.731416
+7 352.447569
+Name: calinski_harabasz_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.calinski_harabasz`. Calling the
+original method will recompute the score.
+
+### Davies-Bouldin score
+
+Compute the Davies-Bouldin score. See [`scikit-learn`
+documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score)
+for details.
+
+```python
+>>> cgram.davies_bouldin_score()
+2 0.714064
+3 0.943553
+4 0.943320
+5 0.973248
+6 0.950910
+7 1.074937
+Name: davies_bouldin_score, dtype: float64
+```
+
+Once computed, resulting Series is available as `cgram.davies_bouldin`. Calling the
+original method will recompute the score.
+
+## Accessing labels
+
+`Clustergram` stores resulting labels for each of the tested options, which can be
+accessed as:
+
+```python
+>>> cgram.labels
+ 1 2 3 4 5 6 7
+0 0 0 2 2 3 2 1
+1 0 0 2 2 3 2 1
+2 0 0 2 2 3 2 1
+3 0 0 2 2 3 2 1
+4 0 0 2 2 0 0 3
+.. .. .. .. .. .. .. ..
+337 0 1 1 3 2 5 0
+338 0 1 1 3 2 5 0
+339 0 1 1 1 1 1 4
+340 0 1 1 3 2 5 5
+341 0 1 1 1 1 1 5
+```
+
+## Saving clustergram
+
+You can save both the plot and the `clustergram.Clustergram` object to disk.
+
+### Saving plot
+
+`Clustergram.plot()` returns matplotlib axis object and as such can be saved as any
+other plot:
+
+```python
+import matplotlib.pyplot as plt
+
+cgram.plot()
+plt.savefig('clustergram.svg')
+```
+
+### Saving object
+
+If you want to save your computed `clustergram.Clustergram` object to disk, you can
+use the `pickle` library:
+
+```python
+import pickle
+
+with open('clustergram.pickle','wb') as f:
+ pickle.dump(cgram, f)
+```
+
+Then loading is equally simple:
+
+```python
+with open('clustergram.pickle','rb') as f:
+ loaded = pickle.load(f)
+```
+
+## References
+
+Schonlau M. The clustergram: a graph for visualizing hierarchical and non-hierarchical
+cluster analyses. The Stata Journal, 2002; 2 (4):391-402.
+
+Schonlau M. Visualizing Hierarchical and Non-Hierarchical Cluster Analyses with
+Clustergrams. Computational Statistics, 2004; 19(1):95-111.
+
+[https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/](https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/)
+
+
+%prep
+%autosetup -n clustergram-0.7.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-clustergram -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 0.7.0-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..f8d79c1
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+d612ba4563e6aeffbdac039290428b77 clustergram-0.7.0.tar.gz
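
The `sources` entry pairs an MD5 digest with the tarball name. The following is a minimal sketch of how that pairing could be checked locally before a build, assuming the two-column `<md5> <filename>` layout shown in the diff. The helper names `parse_sources_line`, `md5_of_file` and `md5_matches` are hypothetical and not part of any packaging tool.

```python
import hashlib
from pathlib import Path


def parse_sources_line(line: str) -> tuple[str, str]:
    """Split a '<md5> <filename>' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large tarball is never read in one go."""
    md5 = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def md5_matches(sources_line: str, directory: Path = Path(".")) -> bool:
    """Return True when the file named in the line hashes to the listed digest."""
    digest, filename = parse_sources_line(sources_line)
    return md5_of_file(directory / filename) == digest


if __name__ == "__main__":
    # Line copied from the `sources` file in this commit.
    line = "d612ba4563e6aeffbdac039290428b77 clustergram-0.7.0.tar.gz"
    tarball = Path(parse_sources_line(line)[1])
    if tarball.exists():
        print("checksum OK" if md5_matches(line) else "checksum MISMATCH")
    else:
        print(f"{tarball} not found; download it from Source0 first")
```

MD5 is used here only because that is the digest recorded in `sources`; it detects a corrupted download but is not a strong integrity guarantee.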