%global _empty_manifest_terminate_build 0
Name: python-nyaggle
Version: 0.1.5
Release: 1
Summary: Code for Kaggle and Offline Competitions.
License: MIT
URL: https://github.com/nyanp/nyaggle
Source0: https://mirrors.aliyun.com/pypi/web/packages/78/b9/817787f062c68abe065f0b59cd1b55afe9409ed0be6ebc5c9a000e2051eb/nyaggle-0.1.5.tar.gz
BuildArch: noarch

Requires: python3-category-encoders
Requires: python3-matplotlib
Requires: python3-more-itertools
Requires: python3-numpy
Requires: python3-optuna
Requires: python3-pandas
Requires: python3-pyarrow
Requires: python3-seaborn
Requires: python3-sklearn
Requires: python3-tqdm
Requires: python3-transformers
Requires: python3-catboost
Requires: python3-lightgbm
Requires: python3-xgboost
Requires: python3-torch
Requires: python3-mlflow

%description
# nyaggle

![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/weekly_test/badge.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)
![Documentation Status](https://readthedocs.org/projects/nyaggle/badge/?version=latest)

[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html) | [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)

**nyaggle** is a utility library for Kaggle and offline competitions, particularly focused on experiment tracking, feature engineering and validation.

- **nyaggle.ensemble** - Averaging & stacking
- **nyaggle.experiment** - Experiment tracking
- **nyaggle.feature_store** - Lightweight feature storage using feather-format
- **nyaggle.features** - sklearn-compatible features
- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters

## Installation

You can install nyaggle via pip:

```Shell
$ pip install nyaggle
```

## Examples

### Experiment Tracking

`run_experiment()` is a high-level API for running an experiment with cross validation. It outputs parameters, metrics, out-of-fold predictions, test predictions, feature importance and submission.csv under the specified directory. It can be combined with mlflow tracking.

```python
from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params, X_train, y_train, X_test)

# You can get all the outputs needed in data science competitions with a single API call
print(result.test_prediction)  # Test prediction as a numpy array
print(result.oof_prediction)   # Out-of-fold prediction as a numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).
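
# Hedged aside (not in the upstream README): run_experiment() is assumed to
# accept a logging_directory argument that overrides the default path above.
result = run_experiment(params, X_train, y_train, X_test,
                        logging_directory='output/my_experiment')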

# You can use it with mlflow and track your experiments through mlflow-ui
result = run_experiment(params, X_train, y_train, X_test, with_mlflow=True)
```

nyaggle also has a low-level API whose interface is similar to [mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).

```python
from nyaggle.experiment import Experiment

with Experiment(logging_directory='./output/') as exp:
    # log a key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log a metric
    exp.log_metric('CV', 0.85)

    # log a numpy ndarray, a pandas dataframe and arbitrary artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
```

### Feature Engineering

#### Target Encoding with K-Fold

```python
import pandas as pd

from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# use fit/fit_transform on the train data, then apply transform to the test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ... or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
```

#### Text Vectorization using BERT

You need to install PyTorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use the Japanese BERT model.
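
For reference, a minimal sketch of the extra installs (the exact package names and MeCab system setup are assumptions, not from the upstream README):

```Shell
$ pip install torch transformers
$ pip install mecab-python3  # only if you use the Japanese BERT model
```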

```python
import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'

# extract BERT-based sentence vectors
bv = BertSentenceVectorizer(text_columns=text_cols)
text_vector = bv.fit_transform(train)

# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)
text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')
japanese_text_vector = bv.fit_transform(train)
```

### Adversarial Validation

```python
import pandas as pd
from nyaggle.validation import adversarial_validate

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

auc, importance = adversarial_validate(train, test, importance_type='gain')
```

### Validation Splitters

nyaggle provides a set of validation splitters that are compatible with the sklearn interface.

```python
import pandas as pd
from sklearn.model_selection import cross_validate, KFold
from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))
cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds, and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate the 1st fold
cross_validate(..., cv=Nth(1, ts))
```

### Other Awesome Repositories

Here is a list of awesome repositories that provide general utility functions for data science competitions. Please let me know if you have another one :)

- [jeongyoonlee/Kaggler](https://github.com/jeongyoonlee/Kaggler)
- [mxbi/mlcrate](https://github.com/mxbi/mlcrate)
- [analokmaus/kuma_utils](https://github.com/analokmaus/kuma_utils)
- [Far0n/kaggletils](https://github.com/Far0n/kaggletils)
- [MLWave/Kaggle-Ensemble-Guide](https://github.com/MLWave/Kaggle-Ensemble-Guide)
- [rushter/heamy](https://github.com/rushter/heamy)

%package -n python3-nyaggle
Summary: Code for Kaggle and Offline Competitions.
Provides: python-nyaggle
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip

%description -n python3-nyaggle
# nyaggle

![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/weekly_test/badge.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)
![Documentation Status](https://readthedocs.org/projects/nyaggle/badge/?version=latest)

[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html) | [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)

**nyaggle** is a utility library for Kaggle and offline competitions, particularly focused on experiment tracking, feature engineering and validation.

- **nyaggle.ensemble** - Averaging & stacking
- **nyaggle.experiment** - Experiment tracking
- **nyaggle.feature_store** - Lightweight feature storage using feather-format
- **nyaggle.features** - sklearn-compatible features
- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters

## Installation

You can install nyaggle via pip:

```Shell
$ pip install nyaggle
```

## Examples

### Experiment Tracking

`run_experiment()` is a high-level API for running an experiment with cross validation. It outputs parameters, metrics, out-of-fold predictions, test predictions, feature importance and submission.csv under the specified directory. It can be combined with mlflow tracking.

```python
from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params, X_train, y_train, X_test)

# You can get all the outputs needed in data science competitions with a single API call
print(result.test_prediction)  # Test prediction as a numpy array
print(result.oof_prediction)   # Out-of-fold prediction as a numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).

# You can use it with mlflow and track your experiments through mlflow-ui
result = run_experiment(params, X_train, y_train, X_test, with_mlflow=True)
```

nyaggle also has a low-level API whose interface is similar to [mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).

```python
from nyaggle.experiment import Experiment

with Experiment(logging_directory='./output/') as exp:
    # log a key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log a metric
    exp.log_metric('CV', 0.85)

    # log a numpy ndarray, a pandas dataframe and arbitrary artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
```

### Feature Engineering

#### Target Encoding with K-Fold

```python
import pandas as pd

from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# use fit/fit_transform on the train data, then apply transform to the test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ... or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
```

#### Text Vectorization using BERT

You need to install PyTorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use the Japanese BERT model.
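
For reference, a minimal sketch of the extra installs (the exact package names and MeCab system setup are assumptions, not from the upstream README):

```Shell
$ pip install torch transformers
$ pip install mecab-python3  # only if you use the Japanese BERT model
```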

```python
import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'

# extract BERT-based sentence vectors
bv = BertSentenceVectorizer(text_columns=text_cols)
text_vector = bv.fit_transform(train)

# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)
text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')
japanese_text_vector = bv.fit_transform(train)
```

### Adversarial Validation

```python
import pandas as pd
from nyaggle.validation import adversarial_validate

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

auc, importance = adversarial_validate(train, test, importance_type='gain')
```

### Validation Splitters

nyaggle provides a set of validation splitters that are compatible with the sklearn interface.

```python
import pandas as pd
from sklearn.model_selection import cross_validate, KFold
from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))
cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds, and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate the 1st fold
cross_validate(..., cv=Nth(1, ts))
```

### Other Awesome Repositories

Here is a list of awesome repositories that provide general utility functions for data science competitions. Please let me know if you have another one :)

- [jeongyoonlee/Kaggler](https://github.com/jeongyoonlee/Kaggler)
- [mxbi/mlcrate](https://github.com/mxbi/mlcrate)
- [analokmaus/kuma_utils](https://github.com/analokmaus/kuma_utils)
- [Far0n/kaggletils](https://github.com/Far0n/kaggletils)
- [MLWave/Kaggle-Ensemble-Guide](https://github.com/MLWave/Kaggle-Ensemble-Guide)
- [rushter/heamy](https://github.com/rushter/heamy)

%package help
Summary: Development documents and examples for nyaggle
Provides: python3-nyaggle-doc

%description help
# nyaggle

![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/weekly_test/badge.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)
![Documentation Status](https://readthedocs.org/projects/nyaggle/badge/?version=latest)

[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html) | [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)

**nyaggle** is a utility library for Kaggle and offline competitions, particularly focused on experiment tracking, feature engineering and validation.

- **nyaggle.ensemble** - Averaging & stacking
- **nyaggle.experiment** - Experiment tracking
- **nyaggle.feature_store** - Lightweight feature storage using feather-format
- **nyaggle.features** - sklearn-compatible features
- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters

## Installation

You can install nyaggle via pip:

```Shell
$ pip install nyaggle
```

## Examples

### Experiment Tracking

`run_experiment()` is a high-level API for running an experiment with cross validation. It outputs parameters, metrics, out-of-fold predictions, test predictions, feature importance and submission.csv under the specified directory. It can be combined with mlflow tracking.

```python
from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params, X_train, y_train, X_test)

# You can get all the outputs needed in data science competitions with a single API call
print(result.test_prediction)  # Test prediction as a numpy array
print(result.oof_prediction)   # Out-of-fold prediction as a numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).

# You can use it with mlflow and track your experiments through mlflow-ui
result = run_experiment(params, X_train, y_train, X_test, with_mlflow=True)
```

nyaggle also has a low-level API whose interface is similar to [mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).

```python
from nyaggle.experiment import Experiment

with Experiment(logging_directory='./output/') as exp:
    # log a key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log a metric
    exp.log_metric('CV', 0.85)

    # log a numpy ndarray, a pandas dataframe and arbitrary artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
```

### Feature Engineering

#### Target Encoding with K-Fold

```python
import pandas as pd

from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# use fit/fit_transform on the train data, then apply transform to the test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ... or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
```

#### Text Vectorization using BERT

You need to install PyTorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use the Japanese BERT model.
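
For reference, a minimal sketch of the extra installs (the exact package names and MeCab system setup are assumptions, not from the upstream README):

```Shell
$ pip install torch transformers
$ pip install mecab-python3  # only if you use the Japanese BERT model
```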

```python
import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'

# extract BERT-based sentence vectors
bv = BertSentenceVectorizer(text_columns=text_cols)
text_vector = bv.fit_transform(train)

# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)
text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')
japanese_text_vector = bv.fit_transform(train)
```

### Adversarial Validation

```python
import pandas as pd
from nyaggle.validation import adversarial_validate

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

auc, importance = adversarial_validate(train, test, importance_type='gain')
```

### Validation Splitters

nyaggle provides a set of validation splitters that are compatible with the sklearn interface.

```python
import pandas as pd
from sklearn.model_selection import cross_validate, KFold
from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))
cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds, and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate the 1st fold
cross_validate(..., cv=Nth(1, ts))
```

### Other Awesome Repositories

Here is a list of awesome repositories that provide general utility functions for data science competitions. Please let me know if you have another one :)

- [jeongyoonlee/Kaggler](https://github.com/jeongyoonlee/Kaggler)
- [mxbi/mlcrate](https://github.com/mxbi/mlcrate)
- [analokmaus/kuma_utils](https://github.com/analokmaus/kuma_utils)
- [Far0n/kaggletils](https://github.com/Far0n/kaggletils)
- [MLWave/Kaggle-Ensemble-Guide](https://github.com/MLWave/Kaggle-Ensemble-Guide)
- [rushter/heamy](https://github.com/rushter/heamy)

%prep
%autosetup -n nyaggle-0.1.5

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-nyaggle -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Jun 20 2023 Python_Bot - 0.1.5-1
- Package Spec generated