%global _empty_manifest_terminate_build 0
Name: python-pydatasci
Version: 0.0.61
Release: 1
Summary: End-to-end machine learning on your desktop or server.
License: GNU Affero General Public License v3
URL: https://github.com/pydatasci-repo
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/e0/9c/f7e887ef64331d8fe7973e8bc00ed7db8cc8b8bbd56b47ddca152e332966/pydatasci-0.0.61.tar.gz
BuildArch: noarch
Requires: python3-appdirs
Requires: python3-keras
Requires: python3-numpy
Requires: python3-pandas
Requires: python3-peewee
Requires: python3-plotly
Requires: python3-pyarrow
Requires: python3-scikit-learn
Requires: python3-tensorflow
Requires: python3-tqdm
%description
*pre-alpha; in active development*
# Value Proposition
*PyDataSci* is an open source, automated machine learning (AutoML) tool for data scientists that reduces the amount of code needed to perform best-practice machine learning by 95%: more science with less code.
It is a Python package that records experiments in a lightweight, file-based database that works on Mac, Linux, and Windows without any configuration required by the user. By tracking the input (samples and settings) as well as the output (models and metrics) of each experiment, it makes machine learning reproducible and less of a black box.
Users can either (a) queue many experiments on their desktop/server, or (b) delegate them to run in the *PyDataSci* cloud if they outgrow their local resources. From there, model performance metrics can be visually compared in interactive charts. It is designed for use within Jupyter notebooks, but runs in any Python shell.
## TLDR
```shell
$ pip install pydatasci
```
```python
>>> import pydatasci as pds
>>> from pydatasci import aidb
```
![Model Metrics](/images/chart_boomerang.png)
Examples of built-in charts; shown above is the new "boomerang chart" for comparing performance across models.
![Model Metrics](/images/chart_history.png)
## Mission
* **Accelerating Research at Universities & Institutes Everywhere.**
We empower non-cloud users - academic and institute HPC users, private-cloud teams, remote-server SSH'ers, and everyday desktop hackers - with the same quality of ML tooling found in public clouds (e.g., AWS SageMaker). This toolset gives research teams a standardized method for producing ML-based evidence, rather than each researcher spending time cobbling together their own approach.
* **Reproducible Experiments.**
No more black boxes. No more screenshotting loss-accuracy graphs and hyperparameter combinations. A record of every dataset, feature, label, sample, split, fold, parameter, model, training job, and result is automatically persisted in a lightweight, file-based database that is configured when you import the package. Submit your *aidb* database file alongside your publications and model zoo entries as proof.
* **Queue Hypertuning Jobs.**
Design a batch of runs to test many hypotheses at once. Queue many hypertuning jobs locally, or delegate big jobs to the cloud to run in parallel by setting `cloud_queue = True`.
* **Visually Compare Performance Metrics.**
Compare models using pre-defined plots for assessing performance, including quantitative metrics (e.g., accuracy, loss, variance, precision, and recall), training histories, and confusion/contingency matrices.
* **Code-Integrated & Agnostic.**
We don’t disrupt the natural workflow of data scientists by forcing them into the confines of a GUI app or specific IDE. Instead, we weave automated tracking into their existing scripts so that *PyDataSci* is compatible with any data science toolset.
![Ecosystem Banner (wide)](/images/ecosystem_banner.png)
## Functionality
*Initially focusing on tabular data before expanding to multi-file use cases.*
- [Done] Compress an immutable dataset (CSV, TSV, Parquet, Pandas DataFrame, NumPy ndarray) to be analyzed.
- [Done] Split stratified samples by index while treating validation sets (3rd split) and cross-folds (k-fold) as first-class citizens (see the sketch after this list).
- [Done] Generate hyperparameter combinations for model building, training, and evaluation.
- [Done] Preprocess samples to encode them for specific algorithms.
- [Done] Queue hypertuning jobs and batches based on hyperparameter combinations.
- [Done] Evaluate and save the performance metrics of each model.
- [Done] Visually compare model metrics to find the best one.
- [ToDo] Talk to users to find out which they want most: time series and image data, PyTorch support, or unsupervised learning.
- [Future] Derive informative featuresets from the dataset using supervised and unsupervised methods.
- [Future] Behind the scenes, stream rows from your datasets with generators to keep a low memory footprint.
- [Future] Scale out to run cloud jobs in parallel by toggling `cloud_queue = True`.
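The *PyDataSci*/aidb API is still pre-alpha, so as a rough illustration of the stratified-splitting and hyperparameter-combination steps above, here is a minimal sketch using scikit-learn and itertools; the dataset, names, and numbers below are illustrative stand-ins, not the *PyDataSci* API:
```python
# Minimal sketch (not the PyDataSci API): stratified train/validation/test
# splits, stratified k-folds, and a grid of hyperparameter combinations.
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_iris(return_X_y=True)

# Train/validation/test split (70/15/15), stratified on the label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Stratified k-folds over the training set for cross-validation.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cartesian product of hyperparameters; each combination becomes one queued job.
grid = {"learning_rate": [1e-3, 1e-4], "batch_size": [16, 32], "epochs": [25, 50]}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos), "hyperparameter combinations")  # 8
```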
## Community
*Much to automate there is. Simple it must be.* ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks.
* **Data types to support:** tabular, time series, image, graph, audio, video, gaming.
%package -n python3-pydatasci
Summary: End-to-end machine learning on your desktop or server.
Provides: python-pydatasci
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-pydatasci
*pre-alpha; in active development*
# Value Proposition
*PyDataSci* is an open source, automated machine learning (AutoML) tool for data scientists that reduces the amount of code needed to perform best-practice machine learning by 95%: more science with less code.
It is a Python package that records experiments in a lightweight, file-based database that works on Mac, Linux, and Windows without any configuration required by the user. By tracking the input (samples and settings) as well as the output (models and metrics) of each experiment, it makes machine learning reproducible and less of a black box.
Users can either (a) queue many experiments on their desktop/server, or (b) delegate them to run in the *PyDataSci* cloud if they outgrow their local resources. From there, model performance metrics can be visually compared in interactive charts. It is designed for use within Jupyter notebooks, but runs in any Python shell.
## TLDR
```shell
$ pip install pydatasci
```
```python
>>> import pydatasci as pds
>>> from pydatasci import aidb
```
![Model Metrics](/images/chart_boomerang.png)
Examples of built-in charts; shown above is the new "boomerang chart" for comparing performance across models.
![Model Metrics](/images/chart_history.png)
## Mission
* **Accelerating Research at Universities & Institutes Everywhere.**
We empower non-cloud users - academic and institute HPC users, private-cloud teams, remote-server SSH'ers, and everyday desktop hackers - with the same quality of ML tooling found in public clouds (e.g., AWS SageMaker). This toolset gives research teams a standardized method for producing ML-based evidence, rather than each researcher spending time cobbling together their own approach.
* **Reproducible Experiments.**
No more black boxes. No more screenshotting loss-accuracy graphs and hyperparameter combinations. A record of every dataset, feature, label, sample, split, fold, parameter, model, training job, and result is automatically persisted in a lightweight, file-based database that is configured when you import the package. Submit your *aidb* database file alongside your publications and model zoo entries as proof.
* **Queue Hypertuning Jobs.**
Design a batch of runs to test many hypotheses at once. Queue many hypertuning jobs locally, or delegate big jobs to the cloud to run in parallel by setting `cloud_queue = True`.
* **Visually Compare Performance Metrics.**
Compare models using pre-defined plots for assessing performance, including quantitative metrics (e.g., accuracy, loss, variance, precision, and recall), training histories, and confusion/contingency matrices.
* **Code-Integrated & Agnostic.**
We don’t disrupt the natural workflow of data scientists by forcing them into the confines of a GUI app or specific IDE. Instead, we weave automated tracking into their existing scripts so that *PyDataSci* is compatible with any data science toolset.
![Ecosystem Banner (wide)](/images/ecosystem_banner.png)
## Functionality
*Initially focusing on tabular data before expanding to multi-file use cases.*
- [Done] Compress an immutable dataset (CSV, TSV, Parquet, Pandas DataFrame, NumPy ndarray) to be analyzed.
- [Done] Split stratified samples by index while treating validation sets (3rd split) and cross-folds (k-fold) as first-class citizens.
- [Done] Generate hyperparameter combinations for model building, training, and evaluation.
- [Done] Preprocess samples to encode them for specific algorithms.
- [Done] Queue hypertuning jobs and batches based on hyperparameter combinations.
- [Done] Evaluate and save the performance metrics of each model.
- [Done] Visually compare model metrics to find the best one.
- [ToDo] Talk to users to find out which they want most: time series and image data, PyTorch support, or unsupervised learning.
- [Future] Derive informative featuresets from the dataset using supervised and unsupervised methods.
- [Future] Behind the scenes, stream rows from your datasets with generators to keep a low memory footprint.
- [Future] Scale out to run cloud jobs in parallel by toggling `cloud_queue = True`.
## Community
*Much to automate there is. Simple it must be.* ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks.
* **Data types to support:** tabular, time series, image, graph, audio, video, gaming.
%package help
Summary: Development documents and examples for pydatasci
Provides: python3-pydatasci-doc
%description help
*pre-alpha; in active development*
# Value Proposition
*PyDataSci* is an open source, automated machine learning (AutoML) tool for data scientists that reduces the amount of code needed to perform best-practice machine learning by 95%: more science with less code.
It is a Python package that records experiments in a lightweight, file-based database that works on Mac, Linux, and Windows without any configuration required by the user. By tracking the input (samples and settings) as well as the output (models and metrics) of each experiment, it makes machine learning reproducible and less of a black box.
Users can either (a) queue many experiments on their desktop/server, or (b) delegate them to run in the *PyDataSci* cloud if they outgrow their local resources. From there, model performance metrics can be visually compared in interactive charts. It is designed for use within Jupyter notebooks, but runs in any Python shell.
## TLDR
```shell
$ pip install pydatasci
```
```python
>>> import pydatasci as pds
>>> from pydatasci import aidb
```
![Model Metrics](/images/chart_boomerang.png)
Examples of built-in charts; shown above is the new "boomerang chart" for comparing performance across models.
![Model Metrics](/images/chart_history.png)
## Mission
* **Accelerating Research at Universities & Institutes Everywhere.**
We empower non-cloud users - academic and institute HPC users, private-cloud teams, remote-server SSH'ers, and everyday desktop hackers - with the same quality of ML tooling found in public clouds (e.g., AWS SageMaker). This toolset gives research teams a standardized method for producing ML-based evidence, rather than each researcher spending time cobbling together their own approach.
* **Reproducible Experiments.**
No more black boxes. No more screenshotting loss-accuracy graphs and hyperparameter combinations. A record of every dataset, feature, label, sample, split, fold, parameter, model, training job, and result is automatically persisted in a lightweight, file-based database that is configured when you import the package. Submit your *aidb* database file alongside your publications and model zoo entries as proof.
* **Queue Hypertuning Jobs.**
Design a batch of runs to test many hypotheses at once. Queue many hypertuning jobs locally, or delegate big jobs to the cloud to run in parallel by setting `cloud_queue = True`.
* **Visually Compare Performance Metrics.**
Compare models using pre-defined plots for assessing performance, including quantitative metrics (e.g., accuracy, loss, variance, precision, and recall), training histories, and confusion/contingency matrices.
* **Code-Integrated & Agnostic.**
We don’t disrupt the natural workflow of data scientists by forcing them into the confines of a GUI app or specific IDE. Instead, we weave automated tracking into their existing scripts so that *PyDataSci* is compatible with any data science toolset.
![Ecosystem Banner (wide)](/images/ecosystem_banner.png)
## Functionality
*Initially focusing on tabular data before expanding to multi-file use cases.*
- [Done] Compress an immutable dataset (CSV, TSV, Parquet, Pandas DataFrame, NumPy ndarray) to be analyzed.
- [Done] Split stratified samples by index while treating validation sets (3rd split) and cross-folds (k-fold) as first-class citizens.
- [Done] Generate hyperparameter combinations for model building, training, and evaluation.
- [Done] Preprocess samples to encode them for specific algorithms.
- [Done] Queue hypertuning jobs and batches based on hyperparameter combinations.
- [Done] Evaluate and save the performance metrics of each model.
- [Done] Visually compare model metrics to find the best one.
- [ToDo] Talk to users to find out which they want most: time series and image data, PyTorch support, or unsupervised learning.
- [Future] Derive informative featuresets from the dataset using supervised and unsupervised methods.
- [Future] Behind the scenes, stream rows from your datasets with generators to keep a low memory footprint.
- [Future] Scale out to run cloud jobs in parallel by toggling `cloud_queue = True`.
## Community
*Much to automate there is. Simple it must be.* ML is a broad space with a lot of challenges to solve. Let us know if you want to get involved. We plan to host monthly dev jam sessions and data science lightning talks.
* **Data types to support:** tabular, time series, image, graph, audio, video, gaming.
%prep
%autosetup -n pydatasci-0.0.61
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-pydatasci -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Wed May 31 2023 Python_Bot - 0.0.61-1
- Package Spec generated