author    CoprDistGit <infra@openeuler.org>  2023-05-29 11:16:32 +0000
committer CoprDistGit <infra@openeuler.org>  2023-05-29 11:16:32 +0000
commit    e47696966f22b9acbb1bb84dcf34cc5adcf45625 (patch)
tree      a2203445f1761f7445194818b951d5e6c1712233
parent    03a97ed1b3520105859bfc843cba5a8d9facb0bf (diff)
automatic import of python-pydatasci
-rw-r--r--  .gitignore               1
-rw-r--r--  python-pydatasci.spec  190
-rw-r--r--  sources                  1
3 files changed, 192 insertions(+), 0 deletions(-)
@@ -0,0 +1 @@
+/pydatasci-0.0.61.tar.gz
diff --git a/python-pydatasci.spec b/python-pydatasci.spec
new file mode 100644
index 0000000..f09ded7
--- /dev/null
+++ b/python-pydatasci.spec
@@ -0,0 +1,190 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pydatasci
+Version: 0.0.61
+Release: 1
+Summary: End-to-end machine learning on your desktop or server.
+License: GNU Affero General Public License v3
+URL: https://github.com/pydatasci-repo
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/e0/9c/f7e887ef64331d8fe7973e8bc00ed7db8cc8b8bbd56b47ddca152e332966/pydatasci-0.0.61.tar.gz
+BuildArch: noarch
+
+Requires: python3-appdirs
+Requires: python3-keras
+Requires: python3-numpy
+Requires: python3-pandas
+Requires: python3-peewee
+Requires: python3-plotly
+Requires: python3-pyarrow
+Requires: python3-scikit-learn
+Requires: python3-tensorflow
+Requires: python3-tqdm
+
+%description
+*pre-alpha; in active development*
+# Value Proposition
+*PyDataSci* is an open source, automated machine learning (AutoML) tool for data scientists that reduces the amount of code needed to perform best-practice machine learning by 95%: more science with less code.
+It is a Python package that records experiments in a lightweight, file-based database that works on Mac/ Linux/ Windows without any configuration required by the user. By tracking the input (samples and settings) as well as the output (models and metrics) of each experiment, it makes machine learning reproducible and less of a black box.
+Users can either (a) queue many experiments on their desktop/ server, or (b) delegate them to run in the *PyDataSci* cloud if they outgrow their local resources. From there, model performance metrics can be visually compared in interactive charts. It is designed for use within Jupyter notebooks, but runs in any Python shell.
+## TLDR
+```shell
+$ pip install pydatasci
+```
+```python
+>>> import pydatasci as pds
+>>> from pydatasci import aidb
+```
+
+<div align="center"><i>Examples of built-in charts. 
Seen above is the new "boomerang chart" for comparing performance across models.</i></div><br/>
+
+## Mission
+* **Accelerating Research at Universities & Institutes Everywhere.**<br />We empower non-cloud users - the academic/ institute HPCers, the private clouders, the remote server SSH'ers, and everyday desktop hackers - with the same quality of ML tooling as is present in public clouds (e.g. AWS SageMaker). This toolset provides research teams a standardized method for ML-based evidence, rather than each researcher spending time cobbling together their own approach.<br /><br />
+* **Reproducible Experiments.**<br />No more black boxes. No more screenshotting loss-accuracy graphs and hyperparameter combinations. A record of every dataset, feature, label, sample, split, fold, parameter, model, training job, and result is automatically persisted in a lightweight, file-based database that is configured automatically when you import the package. Submit your *aidb* database file alongside your publications/ papers and model zoo entries as proof.<br /><br />
+* **Queue Hypertuning Jobs.**<br />Design a batch of runs to test many hypotheses at once. Queue many hypertuning jobs locally, or delegate big jobs to the cloud to run in parallel by setting `cloud_queue = True`.<br /><br />
+* **Visually Compare Performance Metrics.**<br />Compare models using pre-defined plots for assessing performance, including: quantitative metrics (e.g. accuracy, loss, variance, precision, recall, etc.), training histories, and confusion/ contingency matrices.<br /><br />
+* **Code-Integrated & Agnostic.**<br />We don’t disrupt the natural workflow of data scientists by forcing them into the confines of a GUI app or specific IDE. 
Instead, we weave automated tracking into their existing scripts so that *PyDataSci* is compatible with any data science toolset.<br /><br />
+
+## Functionality
+*Initially focusing on tabular data before expanding to multi-file use cases.*
+- [Done] Compress an immutable dataset (csv, tsv, parquet, pandas dataframe, numpy ndarray) to be analyzed.
+- [Done] Split stratified samples by index while treating validation sets (3rd split) and cross-folds (k-fold) as first-class citizens.
+- [Done] Generate hyperparameter combinations for model building, training, and evaluation.
+- [Done] Preprocess samples to encode them for specific algorithms.
+- [Done] Queue hypertuning jobs and batches based on hyperparameter combinations.
+- [Done] Evaluate and save the performance metrics of each model.
+- [Done] Visually compare model metrics to find the best one.
+- [ToDo] Talk to users to find out which they want most: time series and image data, PyTorch support, or unsupervised learning.
+- [Future] Derive informative featuresets from that dataset using supervised and unsupervised methods.
+- [Future] Behind the scenes, stream rows from your datasets with generators to keep a low memory footprint.
+- [Future] Scale out to run cloud jobs in parallel by toggling `cloud_queue = True`.
+
+## Community
+*Much to automate there is. Simple it must be.* ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks.
+* **Data types to support:** tabular, time series, image, graph, audio, video, gaming.
+
+%package -n python3-pydatasci
+Summary: End-to-end machine learning on your desktop or server. 
+Provides: python-pydatasci
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pydatasci
+*pre-alpha; in active development*
+# Value Proposition
+*PyDataSci* is an open source, automated machine learning (AutoML) tool for data scientists that reduces the amount of code needed to perform best-practice machine learning by 95%: more science with less code.
+It is a Python package that records experiments in a lightweight, file-based database that works on Mac/ Linux/ Windows without any configuration required by the user. By tracking the input (samples and settings) as well as the output (models and metrics) of each experiment, it makes machine learning reproducible and less of a black box.
+Users can either (a) queue many experiments on their desktop/ server, or (b) delegate them to run in the *PyDataSci* cloud if they outgrow their local resources. From there, model performance metrics can be visually compared in interactive charts. It is designed for use within Jupyter notebooks, but runs in any Python shell.
+## TLDR
+```shell
+$ pip install pydatasci
+```
+```python
+>>> import pydatasci as pds
+>>> from pydatasci import aidb
+```
+
+<div align="center"><i>Examples of built-in charts. Seen above is the new "boomerang chart" for comparing performance across models.</i></div><br/>
+
+## Mission
+* **Accelerating Research at Universities & Institutes Everywhere.**<br />We empower non-cloud users - the academic/ institute HPCers, the private clouders, the remote server SSH'ers, and everyday desktop hackers - with the same quality of ML tooling as is present in public clouds (e.g. AWS SageMaker). This toolset provides research teams a standardized method for ML-based evidence, rather than each researcher spending time cobbling together their own approach.<br /><br />
+* **Reproducible Experiments.**<br />No more black boxes. No more screenshotting loss-accuracy graphs and hyperparameter combinations. 
A record of every dataset, feature, label, sample, split, fold, parameter, model, training job, and result is automatically persisted in a lightweight, file-based database that is configured automatically when you import the package. Submit your *aidb* database file alongside your publications/ papers and model zoo entries as proof.<br /><br />
+* **Queue Hypertuning Jobs.**<br />Design a batch of runs to test many hypotheses at once. Queue many hypertuning jobs locally, or delegate big jobs to the cloud to run in parallel by setting `cloud_queue = True`.<br /><br />
+* **Visually Compare Performance Metrics.**<br />Compare models using pre-defined plots for assessing performance, including: quantitative metrics (e.g. accuracy, loss, variance, precision, recall, etc.), training histories, and confusion/ contingency matrices.<br /><br />
+* **Code-Integrated & Agnostic.**<br />We don’t disrupt the natural workflow of data scientists by forcing them into the confines of a GUI app or specific IDE. Instead, we weave automated tracking into their existing scripts so that *PyDataSci* is compatible with any data science toolset.<br /><br />
+
+## Functionality
+*Initially focusing on tabular data before expanding to multi-file use cases.*
+- [Done] Compress an immutable dataset (csv, tsv, parquet, pandas dataframe, numpy ndarray) to be analyzed.
+- [Done] Split stratified samples by index while treating validation sets (3rd split) and cross-folds (k-fold) as first-class citizens.
+- [Done] Generate hyperparameter combinations for model building, training, and evaluation.
+- [Done] Preprocess samples to encode them for specific algorithms.
+- [Done] Queue hypertuning jobs and batches based on hyperparameter combinations.
+- [Done] Evaluate and save the performance metrics of each model.
+- [Done] Visually compare model metrics to find the best one. 
+- [ToDo] Talk to users to find out which they want most: time series and image data, PyTorch support, or unsupervised learning.
+- [Future] Derive informative featuresets from that dataset using supervised and unsupervised methods.
+- [Future] Behind the scenes, stream rows from your datasets with generators to keep a low memory footprint.
+- [Future] Scale out to run cloud jobs in parallel by toggling `cloud_queue = True`.
+
+## Community
+*Much to automate there is. Simple it must be.* ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks.
+* **Data types to support:** tabular, time series, image, graph, audio, video, gaming.
+
+%package help
+Summary: Development documents and examples for pydatasci
+Provides: python3-pydatasci-doc
+%description help
+*pre-alpha; in active development*
+# Value Proposition
+*PyDataSci* is an open source, automated machine learning (AutoML) tool for data scientists that reduces the amount of code needed to perform best-practice machine learning by 95%: more science with less code.
+It is a Python package that records experiments in a lightweight, file-based database that works on Mac/ Linux/ Windows without any configuration required by the user. By tracking the input (samples and settings) as well as the output (models and metrics) of each experiment, it makes machine learning reproducible and less of a black box.
+Users can either (a) queue many experiments on their desktop/ server, or (b) delegate them to run in the *PyDataSci* cloud if they outgrow their local resources. From there, model performance metrics can be visually compared in interactive charts. It is designed for use within Jupyter notebooks, but runs in any Python shell.
+## TLDR
+```shell
+$ pip install pydatasci
+```
+```python
+>>> import pydatasci as pds
+>>> from pydatasci import aidb
+```
+
+<div align="center"><i>Examples of built-in charts. 
Seen above is the new "boomerang chart" for comparing performance across models.</i></div><br/>
+
+## Mission
+* **Accelerating Research at Universities & Institutes Everywhere.**<br />We empower non-cloud users - the academic/ institute HPCers, the private clouders, the remote server SSH'ers, and everyday desktop hackers - with the same quality of ML tooling as is present in public clouds (e.g. AWS SageMaker). This toolset provides research teams a standardized method for ML-based evidence, rather than each researcher spending time cobbling together their own approach.<br /><br />
+* **Reproducible Experiments.**<br />No more black boxes. No more screenshotting loss-accuracy graphs and hyperparameter combinations. A record of every dataset, feature, label, sample, split, fold, parameter, model, training job, and result is automatically persisted in a lightweight, file-based database that is configured automatically when you import the package. Submit your *aidb* database file alongside your publications/ papers and model zoo entries as proof.<br /><br />
+* **Queue Hypertuning Jobs.**<br />Design a batch of runs to test many hypotheses at once. Queue many hypertuning jobs locally, or delegate big jobs to the cloud to run in parallel by setting `cloud_queue = True`.<br /><br />
+* **Visually Compare Performance Metrics.**<br />Compare models using pre-defined plots for assessing performance, including: quantitative metrics (e.g. accuracy, loss, variance, precision, recall, etc.), training histories, and confusion/ contingency matrices.<br /><br />
+* **Code-Integrated & Agnostic.**<br />We don’t disrupt the natural workflow of data scientists by forcing them into the confines of a GUI app or specific IDE. 
Instead, we weave automated tracking into their existing scripts so that *PyDataSci* is compatible with any data science toolset.<br /><br />
+
+## Functionality
+*Initially focusing on tabular data before expanding to multi-file use cases.*
+- [Done] Compress an immutable dataset (csv, tsv, parquet, pandas dataframe, numpy ndarray) to be analyzed.
+- [Done] Split stratified samples by index while treating validation sets (3rd split) and cross-folds (k-fold) as first-class citizens.
+- [Done] Generate hyperparameter combinations for model building, training, and evaluation.
+- [Done] Preprocess samples to encode them for specific algorithms.
+- [Done] Queue hypertuning jobs and batches based on hyperparameter combinations.
+- [Done] Evaluate and save the performance metrics of each model.
+- [Done] Visually compare model metrics to find the best one.
+- [ToDo] Talk to users to find out which they want most: time series and image data, PyTorch support, or unsupervised learning.
+- [Future] Derive informative featuresets from that dataset using supervised and unsupervised methods.
+- [Future] Behind the scenes, stream rows from your datasets with generators to keep a low memory footprint.
+- [Future] Scale out to run cloud jobs in parallel by toggling `cloud_queue = True`.
+
+## Community
+*Much to automate there is. Simple it must be.* ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks.
+* **Data types to support:** tabular, time series, image, graph, audio, video, gaming. 
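The "[Done] Generate hyperparameter combinations" step listed under Functionality amounts to a Cartesian product over a parameter grid. A minimal sketch in plain Python — the grid keys and function name here are illustrative, not pydatasci's actual API:

```python
from itertools import product

def param_combinations(grid):
    """Expand a dict of parameter lists into one settings dict per combination."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

# Hypothetical hyperparameter grid -- parameter names are illustrative only.
grid = {"learning_rate": [0.01, 0.001], "batch_size": [32, 64], "epochs": [10]}
combos = param_combinations(grid)
print(len(combos))  # 2 * 2 * 1 = 4 queued training jobs
```

Each resulting dict is a complete set of settings for one training job, which is what a hypertuning queue iterates over.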
+
+%prep
+%autosetup -n pydatasci-0.0.61
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pydatasci -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon May 29 2023 Python_Bot <Python_Bot@openeuler.org> - 0.0.61-1
+- Package Spec generated
@@ -0,0 +1 @@
+cef17b37b5f04a724c1b6e5009845189 pydatasci-0.0.61.tar.gz
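The `%install` scriptlet above builds its `%files` manifest by walking the buildroot with GNU `find -printf`, emitting one absolute install path per file. A standalone sketch of that technique, using a throwaway directory in place of `%{buildroot}` (the site-packages path is illustrative):

```shell
# Simulate a buildroot containing one installed module (illustrative path).
buildroot=$(mktemp -d)
mkdir -p "$buildroot/usr/lib/python3.9/site-packages/pydatasci"
touch "$buildroot/usr/lib/python3.9/site-packages/pydatasci/__init__.py"

# Same pattern as the spec: relative walk, absolute paths in the output.
# %h is the file's directory, %f its basename, so "/%h/%f" yields "/usr/lib/...".
cd "$buildroot"
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
cat filelist.lst  # /usr/lib/python3.9/site-packages/pydatasci/__init__.py
```

Note that `-printf` is a GNU findutils extension, which is safe to assume on an openEuler build host.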