-rw-r--r-- | .gitignore          |   1
-rw-r--r-- | python-kaggler.spec | 849
-rw-r--r-- | sources             |   1
3 files changed, 851 insertions, 0 deletions
@@ -0,0 +1 @@
+/Kaggler-0.9.15.tar.gz
diff --git a/python-kaggler.spec b/python-kaggler.spec
new file mode 100644
index 0000000..ec8107e
--- /dev/null
+++ b/python-kaggler.spec
@@ -0,0 +1,849 @@
+%global _empty_manifest_terminate_build 0
+Name: python-Kaggler
+Version: 0.9.15
+Release: 1
+Summary: Code for Kaggle Data Science Competitions.
+License: MIT
+URL: https://github.com/jeongyoonlee/Kaggler
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/30/0c/ac3fc0136360f5ebf0e538ac09dd07e00905a9a59a94c28758e5dc174c27/Kaggler-0.9.15.tar.gz
+BuildArch: noarch
+
+
+%description
+[](https://badge.fury.io/py/Kaggler)
+[](https://github.com/jeongyoonlee/Kaggler/actions/workflows/test.yml)
+[](https://pepy.tech/project/kaggler)
+[](https://codecov.io/gh/jeongyoonlee/Kaggler)
+
+
+# Kaggler
+Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.
+
+Its online learning algorithms are inspired by Kaggle user [tinrtgu's code](http://goo.gl/K8hQBx). They use a sparse input format that handles large sparse data efficiently. The core code is optimized for speed with Cython.
+
+
+## Installation
+
+### Dependencies
+The required Python packages are listed in `requirements.txt`:
+* cython
+* h5py
+* hyperopt
+* lightgbm
+* ml_metrics
+* numpy/scipy
+* pandas
+* scikit-learn
+
+### Using pip
+The package is available on PyPI:
+```
+pip install -U Kaggler
+```
+If installation fails because it cannot find `MurmurHash3.h`, add `.` to
+`LD_LIBRARY_PATH` as described [here](https://github.com/jeongyoonlee/Kaggler/issues/32).
+
+### From source code
+To install from source:
+```
+python setup.py build_ext --inplace
+python setup.py install
+```
+
+
+## Feature Engineering
+
+### One-Hot, Label, Target, Frequency, and Embedding Encoders for Categorical Features
+```python
+import pandas as pd
+from kaggler.preprocessing import OneHotEncoder, LabelEncoder, TargetEncoder, FrequencyEncoder, EmbeddingEncoder
+
+trn = pd.read_csv('train.csv')
+target_col = trn.columns[-1]
+cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
+
+ohe = OneHotEncoder(min_obs=100) # grouping all categories with fewer than 100 occurrences
+lbe = LabelEncoder(min_obs=100)  # grouping all categories with fewer than 100 occurrences
+te = TargetEncoder()             # replacing each category with the average target value of the category
+fe = FrequencyEncoder()          # replacing each category with the frequency of the category
+ee = EmbeddingEncoder()          # mapping each category to a vector of real numbers
+
+X_ohe = ohe.fit_transform(trn[cat_cols]) # X_ohe is a scipy sparse matrix
+trn[cat_cols] = lbe.fit_transform(trn[cat_cols])
+trn[cat_cols] = te.fit_transform(trn[cat_cols])
+trn[cat_cols] = fe.fit_transform(trn[cat_cols])
+X_ee = ee.fit_transform(trn[cat_cols], trn[target_col]) # X_ee is a numpy matrix
+
+tst = pd.read_csv('test.csv')
+X_ohe = ohe.transform(tst[cat_cols])
+tst[cat_cols] = lbe.transform(tst[cat_cols])
+tst[cat_cols] = te.transform(tst[cat_cols])
+tst[cat_cols] = fe.transform(tst[cat_cols])
+X_ee = ee.transform(tst[cat_cols])
+```
+
+### Denoising AutoEncoder (DAE)
+For background on DAEs, see [Vincent et al. (2010), "Stacked Denoising Autoencoders"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf).
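As a side note, the "swapping noise" used in the example below corrupts inputs by replacing a random fraction of each column's values with values taken from other rows of the same column, rather than adding Gaussian noise or zeroing entries. Here is a minimal, library-independent sketch of that corruption for a plain numeric matrix; the function name and defaults are illustrative and are not Kaggler's API.

```python
import numpy as np

def swap_noise_sketch(X, swap_prob=0.2, seed=None):
    """Return a copy of X where roughly swap_prob of each column's entries
    are replaced by values drawn from randomly chosen rows of the same column."""
    rng = np.random.default_rng(seed)
    X_orig = np.asarray(X)
    X_noisy = X_orig.copy()
    n_rows, n_cols = X_orig.shape
    for j in range(n_cols):
        mask = rng.random(n_rows) < swap_prob               # entries to corrupt
        donors = rng.integers(0, n_rows, size=mask.sum())   # rows to copy values from
        X_noisy[mask, j] = X_orig[donors, j]
    return X_noisy
```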
+```python
+import pandas as pd
+from kaggler.preprocessing import DAE, SDAE
+
+trn = pd.read_csv('train.csv')
+tst = pd.read_csv('test.csv')
+target_col = trn.columns[-1]
+cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
+num_cols = [col for col in trn.columns if col not in cat_cols + [target_col]]
+
+# Default DAE with only the swapping noise and a single encoder/decoder pair.
+dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128)
+X = dae.fit_transform(pd.concat([trn, tst], axis=0)) # encode the input features into 128-dimensional encoding vectors
+
+# Stacked DAE with Gaussian noise, swapping noise, and zero masking in 3 encoder/decoder pairs.
+sdae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_layer=3,
+           noise_std=.05, swap_prob=.2, mask_prob=.1)
+X = sdae.fit_transform(pd.concat([trn, tst], axis=0))
+
+# Supervised DAE with Gaussian noise, swapping noise, and zero masking, using 3 encoders in the encoder/decoder pair.
+sdae = SDAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_encoder=3,
+            noise_std=.05, swap_prob=.2, mask_prob=.1)
+X = sdae.fit_transform(trn, trn[target_col])
+
+```
+
+## AutoML
+
+### Feature Selection & Hyperparameter Tuning
+```python
+import pandas as pd
+from sklearn.datasets import make_classification
+from sklearn.model_selection import train_test_split
+from kaggler.metrics import auc
+from kaggler.model import AutoLGB
+
+
+RANDOM_SEED = 42
+N_OBS = 10000
+N_FEATURE = 100
+N_IMP_FEATURE = 20
+
+X, y = make_classification(n_samples=N_OBS,
+                           n_features=N_FEATURE,
+                           n_informative=N_IMP_FEATURE,
+                           random_state=RANDOM_SEED)
+X = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
+y = pd.Series(y)
+
+X_trn, X_tst, y_trn, y_tst = train_test_split(X, y,
+                                              test_size=.2,
+                                              random_state=RANDOM_SEED)
+
+model = AutoLGB(objective='binary', metric='auc')
+model.tune(X_trn, y_trn)
+model.fit(X_trn, y_trn)
+p = model.predict(X_tst)
+print('AUC: {:.4f}'.format(auc(y_tst, p)))
+
+```
+
+## Ensemble
+
+### Netflix Blending
+```python
+import numpy as np
+from kaggler.ensemble import netflix
+from kaggler.metrics import rmse
+
+# Load the predictions of the input models for the ensemble
+p1 = np.loadtxt('model1_prediction.txt')
+p2 = np.loadtxt('model2_prediction.txt')
+p3 = np.loadtxt('model3_prediction.txt')
+
+# Calculate the RMSEs of the model predictions and of the all-zero prediction.
+# In a competition, the RMSEs (or RMSLEs) of submissions can be used.
+y = np.loadtxt('target.txt')
+e0 = rmse(y, np.zeros_like(y))
+e1 = rmse(y, p1)
+e2 = rmse(y, p2)
+e3 = rmse(y, p3)
+
+p, w = netflix([e1, e2, e3], [p1, p2, p3], e0, l=0.0001) # l is an optional regularization parameter.
+```
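For context, the blending trick above works because the RMSE of each model, together with the RMSE of the all-zero prediction `e0`, is enough to recover the dot products between the hidden target and each prediction vector, which is all that a least-squares blend needs. The sketch below illustrates that idea; it is not necessarily how `kaggler.ensemble.netflix` is implemented internally, and the helper name and the exact way the regularizer `l` enters the normal equations are assumptions.

```python
import numpy as np

def netflix_blend_sketch(errors, preds, e0, l=0.0001):
    """Recover least-squares blending weights from RMSEs alone (a sketch).

    errors : RMSE of each model's predictions against the hidden target y
    preds  : list of prediction arrays, one per model
    e0     : RMSE of the all-zero prediction, i.e. sqrt(mean(y ** 2))
    l      : ridge term added to the normal equations (assumed usage)
    """
    P = np.column_stack(preds)   # (n_samples, n_models)
    n, m = P.shape
    # From ||y - p_i||^2 = ||y||^2 - 2 y.p_i + ||p_i||^2 and ||y||^2 = n * e0^2:
    # y.p_i = (||p_i||^2 + n * e0^2 - n * e_i^2) / 2, so y itself is never needed.
    Pty = ((P ** 2).sum(axis=0) + n * e0 ** 2 - n * np.asarray(errors) ** 2) / 2
    w = np.linalg.solve(P.T @ P + l * np.eye(m), Pty)
    return P @ w, w
```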
+
+
+## Algorithms
+The following algorithms are currently available:
+
+### Online learning algorithms
+* Stochastic Gradient Descent (SGD)
+* Follow-the-Regularized-Leader (FTRL)
+* Factorization Machine (FM)
+* Neural Networks (NN) - with a single (NN) or two (NN_H2) ReLU hidden layers
+* Decision Tree
+
+### Batch learning algorithm
+* Neural Networks (NN) - with a single hidden layer and L-BFGS optimization
+
+### Examples
+```python
+from kaggler.online_model import SGD, FTRL, FM, NN
+
+# SGD
+clf = SGD(a=.01,            # learning rate
+          l1=1e-6,          # L1 regularization parameter
+          l2=1e-6,          # L2 regularization parameter
+          n=2**20,          # number of hashed features
+          epoch=10,         # number of epochs
+          interaction=True) # whether to use feature interactions
+
+# FTRL
+clf = FTRL(a=.1,             # alpha in the per-coordinate learning rate
+           b=1,              # beta in the per-coordinate learning rate
+           l1=1.,            # L1 regularization parameter
+           l2=1.,            # L2 regularization parameter
+           n=2**20,          # number of hashed features
+           epoch=1,          # number of epochs
+           interaction=True) # whether to use feature interactions
+
+# FM
+clf = FM(n=1e5,     # number of features
+         epoch=100, # number of epochs
+         dim=4,     # size of factors for interactions
+         a=.01)     # learning rate
+
+# NN
+clf = NN(n=1e5,    # number of features
+         epoch=10, # number of epochs
+         h=16,     # number of hidden units
+         a=.1,     # learning rate
+         l2=1e-6)  # L2 regularization parameter
+
+# online training and prediction directly from a LibSVM file
+for x, y in clf.read_sparse('train.sparse'):
+    p = clf.predict_one(x)   # predict for an input
+    clf.update_one(x, p - y) # update the model with the prediction error
+
+for x, _ in clf.read_sparse('test.sparse'):
+    p = clf.predict_one(x)
+
+# online training and prediction with a scipy sparse matrix
+from kaggler import load_data
+
+X, y = load_data('train.sps')
+
+clf.fit(X, y)
+p = clf.predict(X)
+```
+
+## Data I/O
+Kaggler supports the CSV (`.csv`), LibSVM (`.sps`), and HDF5 (`.h5`) file formats:
+```
+# CSV format: target,feature1,feature2,...
+1,1,0,0,1,0.5
+0,0,1,0,0,5
+
+# LibSVM format: target feature-index1:feature-value1 feature-index2:feature-value2
+1 1:1 4:1 5:0.5
+0 2:1 5:1
+
+# HDF5
+- issparse: binary flag indicating whether the file stores sparse data or not
+- target: the target variable stored as a numpy.array
+- shape: available only if issparse == 1; shape of the scipy.sparse.csr_matrix
+- indices: available only if issparse == 1; indices of the scipy.sparse.csr_matrix
+- indptr: available only if issparse == 1; indptr of the scipy.sparse.csr_matrix
+- data: the dense feature matrix if issparse == 0, else the data of the scipy.sparse.csr_matrix
+```
+
+```python
+from kaggler.data_io import load_data, save_data
+
+X, y = load_data('train.csv') # use the first column as the target variable
+X, y = load_data('train.h5')  # load the feature matrix and target vector from an HDF5 file
+X, y = load_data('train.sps') # load the feature matrix and target vector from a LibSVM file
+
+save_data(X, y, 'train.csv')
+save_data(X, y, 'train.h5')
+save_data(X, y, 'train.sps')
+```
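The HDF5 layout listed above also makes it easy to inspect a saved file directly. Below is a small sketch of reading such a file with `h5py`, assuming the dataset names shown in the layout (`issparse`, `target`, `shape`, `indices`, `indptr`, `data`); the helper name is hypothetical and Kaggler's own `load_data` may handle the details differently.

```python
import h5py
import numpy as np
from scipy import sparse

def read_kaggler_h5_sketch(path):
    """Reconstruct (X, y) from the HDF5 layout described above (assumed dataset names)."""
    with h5py.File(path, 'r') as f:
        y = np.asarray(f['target'])
        if int(np.asarray(f['issparse'])) == 1:
            # Sparse case: rebuild a CSR matrix from its three component arrays.
            X = sparse.csr_matrix(
                (np.asarray(f['data']), np.asarray(f['indices']), np.asarray(f['indptr'])),
                shape=tuple(np.asarray(f['shape'])))
        else:
            # Dense case: 'data' holds the full feature matrix.
            X = np.asarray(f['data'])
    return X, y
```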
+
+## Documentation
+Package documentation is available [here](https://kaggler.readthedocs.io/en/latest/).
+
+
+
+%package -n python3-Kaggler
+Summary: Code for Kaggle Data Science Competitions.
+Provides: python-Kaggler
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-Kaggler
+[](https://badge.fury.io/py/Kaggler)
+[](https://github.com/jeongyoonlee/Kaggler/actions/workflows/test.yml)
+[](https://pepy.tech/project/kaggler)
+[](https://codecov.io/gh/jeongyoonlee/Kaggler)
+
+
+# Kaggler
+Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.
+
+Its online learning algorithms are inspired by Kaggle user [tinrtgu's code](http://goo.gl/K8hQBx). They use a sparse input format that handles large sparse data efficiently. The core code is optimized for speed with Cython.
+
+
+## Installation
+
+### Dependencies
+The required Python packages are listed in `requirements.txt`:
+* cython
+* h5py
+* hyperopt
+* lightgbm
+* ml_metrics
+* numpy/scipy
+* pandas
+* scikit-learn
+
+### Using pip
+The package is available on PyPI:
+```
+pip install -U Kaggler
+```
+If installation fails because it cannot find `MurmurHash3.h`, add `.` to
+`LD_LIBRARY_PATH` as described [here](https://github.com/jeongyoonlee/Kaggler/issues/32).
+
+### From source code
+To install from source:
+```
+python setup.py build_ext --inplace
+python setup.py install
+```
+
+
+## Feature Engineering
+
+### One-Hot, Label, Target, Frequency, and Embedding Encoders for Categorical Features
+```python
+import pandas as pd
+from kaggler.preprocessing import OneHotEncoder, LabelEncoder, TargetEncoder, FrequencyEncoder, EmbeddingEncoder
+
+trn = pd.read_csv('train.csv')
+target_col = trn.columns[-1]
+cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
+
+ohe = OneHotEncoder(min_obs=100) # grouping all categories with fewer than 100 occurrences
+lbe = LabelEncoder(min_obs=100)  # grouping all categories with fewer than 100 occurrences
+te = TargetEncoder()             # replacing each category with the average target value of the category
+fe = FrequencyEncoder()          # replacing each category with the frequency of the category
+ee = EmbeddingEncoder()          # mapping each category to a vector of real numbers
+
+X_ohe = ohe.fit_transform(trn[cat_cols]) # X_ohe is a scipy sparse matrix
+trn[cat_cols] = lbe.fit_transform(trn[cat_cols])
+trn[cat_cols] = te.fit_transform(trn[cat_cols])
+trn[cat_cols] = fe.fit_transform(trn[cat_cols])
+X_ee = ee.fit_transform(trn[cat_cols], trn[target_col]) # X_ee is a numpy matrix
+
+tst = pd.read_csv('test.csv')
+X_ohe = ohe.transform(tst[cat_cols])
+tst[cat_cols] = lbe.transform(tst[cat_cols])
+tst[cat_cols] = te.transform(tst[cat_cols])
+tst[cat_cols] = fe.transform(tst[cat_cols])
+X_ee = ee.transform(tst[cat_cols])
+```
+
+### Denoising AutoEncoder (DAE)
+For background on DAEs, see [Vincent et al. (2010), "Stacked Denoising Autoencoders"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf).
+```python
+import pandas as pd
+from kaggler.preprocessing import DAE, SDAE
+
+trn = pd.read_csv('train.csv')
+tst = pd.read_csv('test.csv')
+target_col = trn.columns[-1]
+cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
+num_cols = [col for col in trn.columns if col not in cat_cols + [target_col]]
+
+# Default DAE with only the swapping noise and a single encoder/decoder pair.
+dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128)
+X = dae.fit_transform(pd.concat([trn, tst], axis=0)) # encode the input features into 128-dimensional encoding vectors
+
+# Stacked DAE with Gaussian noise, swapping noise, and zero masking in 3 encoder/decoder pairs.
+sdae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_layer=3,
+           noise_std=.05, swap_prob=.2, mask_prob=.1)
+X = sdae.fit_transform(pd.concat([trn, tst], axis=0))
+
+# Supervised DAE with Gaussian noise, swapping noise, and zero masking, using 3 encoders in the encoder/decoder pair.
+sdae = SDAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_encoder=3,
+            noise_std=.05, swap_prob=.2, mask_prob=.1)
+X = sdae.fit_transform(trn, trn[target_col])
+
+```
+
+## AutoML
+
+### Feature Selection & Hyperparameter Tuning
+```python
+import pandas as pd
+from sklearn.datasets import make_classification
+from sklearn.model_selection import train_test_split
+from kaggler.metrics import auc
+from kaggler.model import AutoLGB
+
+
+RANDOM_SEED = 42
+N_OBS = 10000
+N_FEATURE = 100
+N_IMP_FEATURE = 20
+
+X, y = make_classification(n_samples=N_OBS,
+                           n_features=N_FEATURE,
+                           n_informative=N_IMP_FEATURE,
+                           random_state=RANDOM_SEED)
+X = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
+y = pd.Series(y)
+
+X_trn, X_tst, y_trn, y_tst = train_test_split(X, y,
+                                              test_size=.2,
+                                              random_state=RANDOM_SEED)
+
+model = AutoLGB(objective='binary', metric='auc')
+model.tune(X_trn, y_trn)
+model.fit(X_trn, y_trn)
+p = model.predict(X_tst)
+print('AUC: {:.4f}'.format(auc(y_tst, p)))
+
+```
+
+## Ensemble
+
+### Netflix Blending
+```python
+import numpy as np
+from kaggler.ensemble import netflix
+from kaggler.metrics import rmse
+
+# Load the predictions of the input models for the ensemble
+p1 = np.loadtxt('model1_prediction.txt')
+p2 = np.loadtxt('model2_prediction.txt')
+p3 = np.loadtxt('model3_prediction.txt')
+
+# Calculate the RMSEs of the model predictions and of the all-zero prediction.
+# In a competition, the RMSEs (or RMSLEs) of submissions can be used.
+y = np.loadtxt('target.txt')
+e0 = rmse(y, np.zeros_like(y))
+e1 = rmse(y, p1)
+e2 = rmse(y, p2)
+e3 = rmse(y, p3)
+
+p, w = netflix([e1, e2, e3], [p1, p2, p3], e0, l=0.0001) # l is an optional regularization parameter.
+```
+
+
+## Algorithms
+The following algorithms are currently available:
+
+### Online learning algorithms
+* Stochastic Gradient Descent (SGD)
+* Follow-the-Regularized-Leader (FTRL)
+* Factorization Machine (FM)
+* Neural Networks (NN) - with a single (NN) or two (NN_H2) ReLU hidden layers
+* Decision Tree
+
+### Batch learning algorithm
+* Neural Networks (NN) - with a single hidden layer and L-BFGS optimization
+
+### Examples
+```python
+from kaggler.online_model import SGD, FTRL, FM, NN
+
+# SGD
+clf = SGD(a=.01,            # learning rate
+          l1=1e-6,          # L1 regularization parameter
+          l2=1e-6,          # L2 regularization parameter
+          n=2**20,          # number of hashed features
+          epoch=10,         # number of epochs
+          interaction=True) # whether to use feature interactions
+
+# FTRL
+clf = FTRL(a=.1,             # alpha in the per-coordinate learning rate
+           b=1,              # beta in the per-coordinate learning rate
+           l1=1.,            # L1 regularization parameter
+           l2=1.,            # L2 regularization parameter
+           n=2**20,          # number of hashed features
+           epoch=1,          # number of epochs
+           interaction=True) # whether to use feature interactions
+
+# FM
+clf = FM(n=1e5,     # number of features
+         epoch=100, # number of epochs
+         dim=4,     # size of factors for interactions
+         a=.01)     # learning rate
+
+# NN
+clf = NN(n=1e5,    # number of features
+         epoch=10, # number of epochs
+         h=16,     # number of hidden units
+         a=.1,     # learning rate
+         l2=1e-6)  # L2 regularization parameter
+
+# online training and prediction directly from a LibSVM file
+for x, y in clf.read_sparse('train.sparse'):
+    p = clf.predict_one(x)   # predict for an input
+    clf.update_one(x, p - y) # update the model with the prediction error
+
+for x, _ in clf.read_sparse('test.sparse'):
+    p = clf.predict_one(x)
+
+# online training and prediction with a scipy sparse matrix
+from kaggler import load_data
+
+X, y = load_data('train.sps')
+
+clf.fit(X, y)
+p = clf.predict(X)
+```
+
+## Data I/O
+Kaggler supports the CSV (`.csv`), LibSVM (`.sps`), and HDF5 (`.h5`) file formats:
+```
+# CSV format: target,feature1,feature2,...
+1,1,0,0,1,0.5
+0,0,1,0,0,5
+
+# LibSVM format: target feature-index1:feature-value1 feature-index2:feature-value2
+1 1:1 4:1 5:0.5
+0 2:1 5:1
+
+# HDF5
+- issparse: binary flag indicating whether the file stores sparse data or not
+- target: the target variable stored as a numpy.array
+- shape: available only if issparse == 1; shape of the scipy.sparse.csr_matrix
+- indices: available only if issparse == 1; indices of the scipy.sparse.csr_matrix
+- indptr: available only if issparse == 1; indptr of the scipy.sparse.csr_matrix
+- data: the dense feature matrix if issparse == 0, else the data of the scipy.sparse.csr_matrix
+```
+
+```python
+from kaggler.data_io import load_data, save_data
+
+X, y = load_data('train.csv') # use the first column as the target variable
+X, y = load_data('train.h5')  # load the feature matrix and target vector from an HDF5 file
+X, y = load_data('train.sps') # load the feature matrix and target vector from a LibSVM file
+
+save_data(X, y, 'train.csv')
+save_data(X, y, 'train.h5')
+save_data(X, y, 'train.sps')
+```
+
+## Documentation
+Package documentation is available [here](https://kaggler.readthedocs.io/en/latest/).
+
+
+
+%package help
+Summary: Development documents and examples for Kaggler
+Provides: python3-Kaggler-doc
+%description help
+[](https://badge.fury.io/py/Kaggler)
+[](https://github.com/jeongyoonlee/Kaggler/actions/workflows/test.yml)
+[](https://pepy.tech/project/kaggler)
+[](https://codecov.io/gh/jeongyoonlee/Kaggler)
+
+
+# Kaggler
+Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.
+
+Its online learning algorithms are inspired by Kaggle user [tinrtgu's code](http://goo.gl/K8hQBx). They use a sparse input format that handles large sparse data efficiently. The core code is optimized for speed with Cython.
+
+
+## Installation
+
+### Dependencies
+The required Python packages are listed in `requirements.txt`:
+* cython
+* h5py
+* hyperopt
+* lightgbm
+* ml_metrics
+* numpy/scipy
+* pandas
+* scikit-learn
+
+### Using pip
+The package is available on PyPI:
+```
+pip install -U Kaggler
+```
+If installation fails because it cannot find `MurmurHash3.h`, add `.` to
+`LD_LIBRARY_PATH` as described [here](https://github.com/jeongyoonlee/Kaggler/issues/32).
+
+### From source code
+To install from source:
+```
+python setup.py build_ext --inplace
+python setup.py install
+```
+
+
+## Feature Engineering
+
+### One-Hot, Label, Target, Frequency, and Embedding Encoders for Categorical Features
+```python
+import pandas as pd
+from kaggler.preprocessing import OneHotEncoder, LabelEncoder, TargetEncoder, FrequencyEncoder, EmbeddingEncoder
+
+trn = pd.read_csv('train.csv')
+target_col = trn.columns[-1]
+cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
+
+ohe = OneHotEncoder(min_obs=100) # grouping all categories with fewer than 100 occurrences
+lbe = LabelEncoder(min_obs=100)  # grouping all categories with fewer than 100 occurrences
+te = TargetEncoder()             # replacing each category with the average target value of the category
+fe = FrequencyEncoder()          # replacing each category with the frequency of the category
+ee = EmbeddingEncoder()          # mapping each category to a vector of real numbers
+
+X_ohe = ohe.fit_transform(trn[cat_cols]) # X_ohe is a scipy sparse matrix
+trn[cat_cols] = lbe.fit_transform(trn[cat_cols])
+trn[cat_cols] = te.fit_transform(trn[cat_cols])
+trn[cat_cols] = fe.fit_transform(trn[cat_cols])
+X_ee = ee.fit_transform(trn[cat_cols], trn[target_col]) # X_ee is a numpy matrix
+
+tst = pd.read_csv('test.csv')
+X_ohe = ohe.transform(tst[cat_cols])
+tst[cat_cols] = lbe.transform(tst[cat_cols])
+tst[cat_cols] = te.transform(tst[cat_cols])
+tst[cat_cols] = fe.transform(tst[cat_cols])
+X_ee = ee.transform(tst[cat_cols])
+```
+
+### Denoising AutoEncoder (DAE)
+For background on DAEs, see [Vincent et al. (2010), "Stacked Denoising Autoencoders"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf).
+```python
+import pandas as pd
+from kaggler.preprocessing import DAE, SDAE
+
+trn = pd.read_csv('train.csv')
+tst = pd.read_csv('test.csv')
+target_col = trn.columns[-1]
+cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
+num_cols = [col for col in trn.columns if col not in cat_cols + [target_col]]
+
+# Default DAE with only the swapping noise and a single encoder/decoder pair.
+dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128)
+X = dae.fit_transform(pd.concat([trn, tst], axis=0)) # encode the input features into 128-dimensional encoding vectors
+
+# Stacked DAE with Gaussian noise, swapping noise, and zero masking in 3 encoder/decoder pairs.
+sdae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_layer=3,
+           noise_std=.05, swap_prob=.2, mask_prob=.1)
+X = sdae.fit_transform(pd.concat([trn, tst], axis=0))
+
+# Supervised DAE with Gaussian noise, swapping noise, and zero masking, using 3 encoders in the encoder/decoder pair.
+sdae = SDAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_encoder=3,
+            noise_std=.05, swap_prob=.2, mask_prob=.1)
+X = sdae.fit_transform(trn, trn[target_col])
+
+```
+
+## AutoML
+
+### Feature Selection & Hyperparameter Tuning
+```python
+import pandas as pd
+from sklearn.datasets import make_classification
+from sklearn.model_selection import train_test_split
+from kaggler.metrics import auc
+from kaggler.model import AutoLGB
+
+
+RANDOM_SEED = 42
+N_OBS = 10000
+N_FEATURE = 100
+N_IMP_FEATURE = 20
+
+X, y = make_classification(n_samples=N_OBS,
+                           n_features=N_FEATURE,
+                           n_informative=N_IMP_FEATURE,
+                           random_state=RANDOM_SEED)
+X = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
+y = pd.Series(y)
+
+X_trn, X_tst, y_trn, y_tst = train_test_split(X, y,
+                                              test_size=.2,
+                                              random_state=RANDOM_SEED)
+
+model = AutoLGB(objective='binary', metric='auc')
+model.tune(X_trn, y_trn)
+model.fit(X_trn, y_trn)
+p = model.predict(X_tst)
+print('AUC: {:.4f}'.format(auc(y_tst, p)))
+
+```
+
+## Ensemble
+
+### Netflix Blending
+```python
+import numpy as np
+from kaggler.ensemble import netflix
+from kaggler.metrics import rmse
+
+# Load the predictions of the input models for the ensemble
+p1 = np.loadtxt('model1_prediction.txt')
+p2 = np.loadtxt('model2_prediction.txt')
+p3 = np.loadtxt('model3_prediction.txt')
+
+# Calculate the RMSEs of the model predictions and of the all-zero prediction.
+# In a competition, the RMSEs (or RMSLEs) of submissions can be used.
+y = np.loadtxt('target.txt')
+e0 = rmse(y, np.zeros_like(y))
+e1 = rmse(y, p1)
+e2 = rmse(y, p2)
+e3 = rmse(y, p3)
+
+p, w = netflix([e1, e2, e3], [p1, p2, p3], e0, l=0.0001) # l is an optional regularization parameter.
+```
+
+
+## Algorithms
+The following algorithms are currently available:
+
+### Online learning algorithms
+* Stochastic Gradient Descent (SGD)
+* Follow-the-Regularized-Leader (FTRL)
+* Factorization Machine (FM)
+* Neural Networks (NN) - with a single (NN) or two (NN_H2) ReLU hidden layers
+* Decision Tree
+
+### Batch learning algorithm
+* Neural Networks (NN) - with a single hidden layer and L-BFGS optimization
+
+### Examples
+```python
+from kaggler.online_model import SGD, FTRL, FM, NN
+
+# SGD
+clf = SGD(a=.01,            # learning rate
+          l1=1e-6,          # L1 regularization parameter
+          l2=1e-6,          # L2 regularization parameter
+          n=2**20,          # number of hashed features
+          epoch=10,         # number of epochs
+          interaction=True) # whether to use feature interactions
+
+# FTRL
+clf = FTRL(a=.1,             # alpha in the per-coordinate learning rate
+           b=1,              # beta in the per-coordinate learning rate
+           l1=1.,            # L1 regularization parameter
+           l2=1.,            # L2 regularization parameter
+           n=2**20,          # number of hashed features
+           epoch=1,          # number of epochs
+           interaction=True) # whether to use feature interactions
+
+# FM
+clf = FM(n=1e5,     # number of features
+         epoch=100, # number of epochs
+         dim=4,     # size of factors for interactions
+         a=.01)     # learning rate
+
+# NN
+clf = NN(n=1e5,    # number of features
+         epoch=10, # number of epochs
+         h=16,     # number of hidden units
+         a=.1,     # learning rate
+         l2=1e-6)  # L2 regularization parameter
+
+# online training and prediction directly from a LibSVM file
+for x, y in clf.read_sparse('train.sparse'):
+    p = clf.predict_one(x)   # predict for an input
+    clf.update_one(x, p - y) # update the model with the prediction error
+
+for x, _ in clf.read_sparse('test.sparse'):
+    p = clf.predict_one(x)
+
+# online training and prediction with a scipy sparse matrix
+from kaggler import load_data
+
+X, y = load_data('train.sps')
+
+clf.fit(X, y)
+p = clf.predict(X)
+```
+
+## Data I/O
+Kaggler supports the CSV (`.csv`), LibSVM (`.sps`), and HDF5 (`.h5`) file formats:
+```
+# CSV format: target,feature1,feature2,...
+1,1,0,0,1,0.5
+0,0,1,0,0,5
+
+# LibSVM format: target feature-index1:feature-value1 feature-index2:feature-value2
+1 1:1 4:1 5:0.5
+0 2:1 5:1
+
+# HDF5
+- issparse: binary flag indicating whether the file stores sparse data or not
+- target: the target variable stored as a numpy.array
+- shape: available only if issparse == 1; shape of the scipy.sparse.csr_matrix
+- indices: available only if issparse == 1; indices of the scipy.sparse.csr_matrix
+- indptr: available only if issparse == 1; indptr of the scipy.sparse.csr_matrix
+- data: the dense feature matrix if issparse == 0, else the data of the scipy.sparse.csr_matrix
+```
+
+```python
+from kaggler.data_io import load_data, save_data
+
+X, y = load_data('train.csv') # use the first column as the target variable
+X, y = load_data('train.h5')  # load the feature matrix and target vector from an HDF5 file
+X, y = load_data('train.sps') # load the feature matrix and target vector from a LibSVM file
+
+save_data(X, y, 'train.csv')
+save_data(X, y, 'train.h5')
+save_data(X, y, 'train.sps')
+```
+
+## Documentation
+Package documentation is available [here](https://kaggler.readthedocs.io/en/latest/).
+
+
+
+%prep
+%autosetup -n Kaggler-0.9.15
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-Kaggler -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 0.9.15-1
+- Package Spec generated
@@ -0,0 +1 @@
+a9df59905152b6d62d32d2d63d080e8c Kaggler-0.9.15.tar.gz