Diffstat (limited to 'python-kpplus.spec')
| -rw-r--r-- | python-kpplus.spec | 289 |
1 files changed, 289 insertions, 0 deletions
diff --git a/python-kpplus.spec b/python-kpplus.spec
new file mode 100644
index 0000000..571bcd2
--- /dev/null
+++ b/python-kpplus.spec
@@ -0,0 +1,289 @@
+%global _empty_manifest_terminate_build 0
+Name: python-kpplus
+Version: 0.0.3
+Release: 1
+Summary: A JIT optimized K-Prototype algorithm
+License: MIT License
+URL: https://github.com/youbao88/KPrototypes_plus
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/a2/5c/df60622dab8168d875947c28cee33c63e72f47c6559af6baccdabac5c97f/kpplus-0.0.3.tar.gz
+BuildArch: noarch
+
+Requires: python3-pandas
+Requires: python3-numpy
+Requires: python3-numba
+Requires: python3-joblib
+
+%description
+# KPrototype plus (kpplus)
+[](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity) [](https://www.python.org/) [](https://pypi.org/project/kpplus/)
+
+## Description
+
+K-prototype is a clustering method invented to support both categorical and numerical variables[1].
+
+**KPrototype plus (kpplus)** is a Python 3 package that is designed to increase the performance of [nicodv's KPrototypes function](https://github.com/nicodv/kmodes) by using [Numba](http://numba.pydata.org/).
+
+This code is part of [Stockholms diabetespreventiva program](https://www.folkhalsoguiden.se/amnesomraden1/analys-och-kartlaggning/sdpp/).
+
+### Performance improvement
+As an [example](example/example.ipynb), I used one of the [Heart Disease Data Sets](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) from [UCI](https://archive.ics.uci.edu/ml/index.php) to test the performance.
+This data set contains 4455 rows, 7 categorical variables, and 5 numerical variables.
+We compare the performance between nicodv's kprototype function and k_prototype_plus.
+
+~~~~
+< nicodv's kprototype >
+CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
+Wall time: 1min 41s
+~~~~
+~~~~
+< k_prototype_plus >
+CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
+Wall time: 13.4 s
+~~~~
+
+**Notice:** Only Cao initialization is supported as the initialization method[2].
+
+## System requirement
+[](https://www.python.org/) [](https://pandas.pydata.org/) [](https://numpy.org/) [](https://joblib.readthedocs.io/en/latest/) [](http://numba.pydata.org/)
+
+## Installation
+
+```
+pip install kpplus
+```
+
+## Usage
+```python
+from kpplus import KPrototypes_plus
+model = KPrototypes_plus(n_clusters = 3, n_init = 4, gamma = None, n_jobs = -1) #initialize the model
+model.fit_predict(X=df, categorical = [0,1]) #fit the data and categorical into the model
+
+model.labels_ #return the cluster_labels
+model.cluster_centroids_ #return the cluster centroid points(prototypes)
+model.n_iter_ #return the number of iterations
+model.cost_ #return the costs
+```
+**n_clusters:** the number of clusters
+
+**n_init:** the number of parallel operations using different initializations
+
+**gamma (optional):** A value that controls how strongly the algorithm favours categorical variables. (By default, it is the mean standard deviation of all numeric variables.)
+
+**n_jobs (optional, default=-1):** The number of parallel processors. ('-1' means using all processors)
+
+**X:** 2-D numpy array (dataset)
+
+**types:** A numpy array that indicates if the variable is categorical or numerical.
+
+For example: ```types = [1,1,0,0,0,0]``` means the first two variables are categorical and the last four variables are numerical.
+
+## Acknowledgement
+I'm extremely grateful to [Dr. Diego Yacaman Mendez](https://staff.ki.se/people/dieyac?_ga=2.70810192.1199119869.1588953123-1873461028.1579027503) and [Dr. David Ebbevi](https://www.linkedin.com/in/debbevi/?originalSubdomain=se) for their support. They are two brilliant researchers who started this project with excellent knowledge of medical science, epidemiology, statistics and programming.
+
+## Reference
+[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.
+[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.
+
+
+
+
+%package -n python3-kpplus
+Summary: A JIT optimized K-Prototype algorithm
+Provides: python-kpplus
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-kpplus
+# KPrototype plus (kpplus)
+[](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity) [](https://www.python.org/) [](https://pypi.org/project/kpplus/)
+
+## Description
+
+K-prototype is a clustering method invented to support both categorical and numerical variables[1].
+
+**KPrototype plus (kpplus)** is a Python 3 package that is designed to increase the performance of [nicodv's KPrototypes function](https://github.com/nicodv/kmodes) by using [Numba](http://numba.pydata.org/).
+
+This code is part of [Stockholms diabetespreventiva program](https://www.folkhalsoguiden.se/amnesomraden1/analys-och-kartlaggning/sdpp/).
+
+### Performance improvement
+As an [example](example/example.ipynb), I used one of the [Heart Disease Data Sets](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) from [UCI](https://archive.ics.uci.edu/ml/index.php) to test the performance.
+This data set contains 4455 rows, 7 categorical variables, and 5 numerical variables.
+We compare the performance between nicodv's kprototype function and k_prototype_plus.
+
+~~~~
+< nicodv's kprototype >
+CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
+Wall time: 1min 41s
+~~~~
+~~~~
+< k_prototype_plus >
+CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
+Wall time: 13.4 s
+~~~~
+
+**Notice:** Only Cao initialization is supported as the initialization method[2].
+
+## System requirement
+[](https://www.python.org/) [](https://pandas.pydata.org/) [](https://numpy.org/) [](https://joblib.readthedocs.io/en/latest/) [](http://numba.pydata.org/)
+
+## Installation
+
+```
+pip install kpplus
+```
+
+## Usage
+```python
+from kpplus import KPrototypes_plus
+model = KPrototypes_plus(n_clusters = 3, n_init = 4, gamma = None, n_jobs = -1) #initialize the model
+model.fit_predict(X=df, categorical = [0,1]) #fit the data and categorical into the model
+
+model.labels_ #return the cluster_labels
+model.cluster_centroids_ #return the cluster centroid points(prototypes)
+model.n_iter_ #return the number of iterations
+model.cost_ #return the costs
+```
+**n_clusters:** the number of clusters
+
+**n_init:** the number of parallel operations using different initializations
+
+**gamma (optional):** A value that controls how strongly the algorithm favours categorical variables. (By default, it is the mean standard deviation of all numeric variables.)
+
+**n_jobs (optional, default=-1):** The number of parallel processors. ('-1' means using all processors)
+
+**X:** 2-D numpy array (dataset)
+
+**types:** A numpy array that indicates if the variable is categorical or numerical.
+
+For example: ```types = [1,1,0,0,0,0]``` means the first two variables are categorical and the last four variables are numerical.
+
+## Acknowledgement
+I'm extremely grateful to [Dr. Diego Yacaman Mendez](https://staff.ki.se/people/dieyac?_ga=2.70810192.1199119869.1588953123-1873461028.1579027503) and [Dr. David Ebbevi](https://www.linkedin.com/in/debbevi/?originalSubdomain=se) for their support. They are two brilliant researchers who started this project with excellent knowledge of medical science, epidemiology, statistics and programming.
+
+## Reference
+[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.
+[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.
+
+
+
+
+%package help
+Summary: Development documents and examples for kpplus
+Provides: python3-kpplus-doc
+%description help
+# KPrototype plus (kpplus)
+[](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity) [](https://www.python.org/) [](https://pypi.org/project/kpplus/)
+
+## Description
+
+K-prototype is a clustering method invented to support both categorical and numerical variables[1].
+
+**KPrototype plus (kpplus)** is a Python 3 package that is designed to increase the performance of [nicodv's KPrototypes function](https://github.com/nicodv/kmodes) by using [Numba](http://numba.pydata.org/).
+
+This code is part of [Stockholms diabetespreventiva program](https://www.folkhalsoguiden.se/amnesomraden1/analys-och-kartlaggning/sdpp/).
+
+### Performance improvement
+As an [example](example/example.ipynb), I used one of the [Heart Disease Data Sets](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) from [UCI](https://archive.ics.uci.edu/ml/index.php) to test the performance.
+This data set contains 4455 rows, 7 categorical variables, and 5 numerical variables.
+We compare the performance between nicodv's kprototype function and k_prototype_plus.
+
+~~~~
+< nicodv's kprototype >
+CPU times: user 2.14 s, sys: 18.2 ms, total: 2.16 s
+Wall time: 1min 41s
+~~~~
+~~~~
+< k_prototype_plus >
+CPU times: user 298 ms, sys: 9.24 ms, total: 308 ms
+Wall time: 13.4 s
+~~~~
+
+**Notice:** Only Cao initialization is supported as the initialization method[2].
+
+## System requirement
+[](https://www.python.org/) [](https://pandas.pydata.org/) [](https://numpy.org/) [](https://joblib.readthedocs.io/en/latest/) [](http://numba.pydata.org/)
+
+## Installation
+
+```
+pip install kpplus
+```
+
+## Usage
+```python
+from kpplus import KPrototypes_plus
+model = KPrototypes_plus(n_clusters = 3, n_init = 4, gamma = None, n_jobs = -1) #initialize the model
+model.fit_predict(X=df, categorical = [0,1]) #fit the data and categorical into the model
+
+model.labels_ #return the cluster_labels
+model.cluster_centroids_ #return the cluster centroid points(prototypes)
+model.n_iter_ #return the number of iterations
+model.cost_ #return the costs
+```
+**n_clusters:** the number of clusters
+
+**n_init:** the number of parallel operations using different initializations
+
+**gamma (optional):** A value that controls how strongly the algorithm favours categorical variables. (By default, it is the mean standard deviation of all numeric variables.)
+
+**n_jobs (optional, default=-1):** The number of parallel processors. ('-1' means using all processors)
+
+**X:** 2-D numpy array (dataset)
+
+**types:** A numpy array that indicates if the variable is categorical or numerical.
+
+For example: ```types = [1,1,0,0,0,0]``` means the first two variables are categorical and the last four variables are numerical.
+
+## Acknowledgement
+I'm extremely grateful to [Dr. Diego Yacaman Mendez](https://staff.ki.se/people/dieyac?_ga=2.70810192.1199119869.1588953123-1873461028.1579027503) and [Dr. David Ebbevi](https://www.linkedin.com/in/debbevi/?originalSubdomain=se) for their support. They are two brilliant researchers who started this project with excellent knowledge of medical science, epidemiology, statistics and programming.
+
+## Reference
+[1] Huang Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery. 1998;2(3):283-304.
+[2] Cao F, Liang J, Bai L. A new initialization method for categorical data clustering. Expert Systems with Applications. 2009;36(7):10223-8.
+
+
+
+
+%prep
+%autosetup -n kpplus-0.0.3
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-kpplus -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 17 2023 Python_Bot <Python_Bot@openeuler.org> - 0.0.3-1
+- Package Spec generated
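
Note on the packaged README: the `types` mask and the default `gamma` it describes can be illustrated with a short NumPy sketch. This is not code from kpplus. The helper names `split_by_types` and `default_gamma` are hypothetical, and using the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the README only says "the mean std of all numeric variables".

```python
import numpy as np


def split_by_types(X, types):
    """Split the columns of X into (categorical, numerical) blocks.

    `types` is a 0/1 mask as in the README: 1 marks a categorical column,
    0 marks a numerical one. Hypothetical helper, not part of kpplus.
    """
    X = np.asarray(X, dtype=object)
    mask = np.asarray(types, dtype=bool)
    if X.ndim != 2 or mask.shape != (X.shape[1],):
        raise ValueError("types must have one entry per column of a 2-D X")
    return X[:, mask], X[:, ~mask].astype(float)


def default_gamma(X, types):
    """Mean of the per-column standard deviations of the numerical columns.

    Assumption: population standard deviation (ddof=0).
    """
    _, numerical = split_by_types(X, types)
    return float(numerical.std(axis=0).mean())


# Toy data: one categorical column followed by two numerical columns.
X = np.array([["a", 1.0, 10.0],
              ["b", 3.0, 10.0]], dtype=object)
types = [1, 0, 0]
categorical, numerical = split_by_types(X, types)
gamma = default_gamma(X, types)
```

A larger `gamma` gives categorical mismatches more weight relative to numerical distance, which is why the default is tied to the spread of the numerical columns.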
