author     CoprDistGit <infra@openeuler.org>    2023-04-10 12:29:49 +0000
committer  CoprDistGit <infra@openeuler.org>    2023-04-10 12:29:49 +0000
commit     f2b6f966945eb5b68e5a56814bdb1c80fa12c6d9 (patch)
tree       3b93075bda2cbf389f177243d8f26be77f38b6b9
parent     47260cde72f712fe369382a74ab55a27cee5fef9 (diff)
automatic import of python-boruta
-rw-r--r--   .gitignore             1
-rw-r--r--   python-boruta.spec   642
-rw-r--r--   sources                1
3 files changed, 644 insertions, 0 deletions

diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+/Boruta-0.3.tar.gz

diff --git a/python-boruta.spec b/python-boruta.spec
new file mode 100644
index 0000000..5b96bd0
--- /dev/null
+++ b/python-boruta.spec
@@ -0,0 +1,642 @@
%global _empty_manifest_terminate_build 0
Name:           python-Boruta
Version:        0.3
Release:        1
Summary:        Python Implementation of Boruta Feature Selection
License:        BSD 3 clause
URL:            https://github.com/danielhomola/boruta_py
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/d5/ab/800c93706b1919dbdcb48fcab3d5251dbd135fa2ca7cd345f7a4dcb0864b/Boruta-0.3.tar.gz
BuildArch:      noarch

Requires:       python3-numpy
Requires:       python3-scikit-learn
Requires:       python3-scipy

%description
# boruta_py #

This project hosts Python implementations of the [Boruta all-relevant feature selection method](https://m2.icm.edu.pl/boruta/).

[Related blog post](http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/)

## Dependencies ##

* numpy
* scipy
* scikit-learn

## How to use ##

Download, import and use it as you would any other scikit-learn method:

* fit(X, y)
* transform(X)
* fit_transform(X, y)

## Description ##

Python implementations of the Boruta R package.

This implementation tries to mimic the scikit-learn interface, so use fit,
transform, or fit_transform to run the feature selection.

For more, see the docs of these functions and the examples below.

Original code and method by: Miron B. Kursa, https://m2.icm.edu.pl/boruta/

Boruta is an all-relevant feature selection method, while most others are
minimal-optimal; this means it tries to find all features carrying
information usable for prediction, rather than finding a possibly compact
subset of features on which some classifier has a minimal error.

Why bother with all-relevant feature selection? When you try to understand
the phenomenon that generated your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in the context of
your methodology (yes, the minimal-optimal set of features by definition
depends on your classifier choice).

## What's different in BorutaPy? ##

It is the original R package recoded in Python with a few extra features added.
Some improvements include:

* Faster run times, thanks to scikit-learn

* Scikit-learn-like interface

* Compatible with any ensemble method from scikit-learn

* Automatic n_estimators selection

* Ranking of features

For more details, please check the top of the docstring.

We highly recommend using pruned trees with a depth between 3 and 7.

Also, after experimenting a lot with the original code, I identified a few
areas where the core algorithm could be improved or altered to make it less
strict and more applicable to biological data, where the Bonferroni correction
might be overly harsh.

__Percentile as threshold__

The original method uses the maximum of the shadow features as the threshold
for deciding whether a real feature is doing better than the shadow ones. This
can be overly harsh.

To control this, I added the perc parameter, which sets the percentile of the
shadow features' importances that the algorithm uses as the threshold. The
default of 100 is equivalent to taking the maximum, as the R version of Boruta
does, but it can be relaxed. Note that since this is a percentile, its effect
changes with the number of features: with several thousand features it isn't
as stringent as with the few dozen left at the end of a Boruta run.
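
As an illustration of the perc parameter (a minimal sketch, not part of the
original README; X and y are placeholders for your own numpy arrays):

    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    # relax the threshold from the maximum (perc=100) to the 90th percentile
    # of the shadow importances, admitting more features as relevant
    rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
    feat_selector = BorutaPy(rf, n_estimators='auto', perc=90, random_state=1)
    feat_selector.fit(X, y)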

__Two-step correction for multiple testing__

The correction for multiple testing was relaxed by making it a two-step
process, rather than a harsh one-step Bonferroni correction.

We need to correct firstly because in each iteration we test a number of
features against the null hypothesis (does a feature perform better than
expected by random?). The original code uses the Bonferroni correction for
this, which is known to be too stringent in such scenarios (at least for
biological data); it also corrects for n features even if we are in the 50th
iteration, where only k << n features are left. For this reason the first
correction step is the widely used Benjamini-Hochberg FDR.

Following that, however, we also need to account for the fact that we have
been testing the same features over and over again in each iteration with the
same test. For this scenario the Bonferroni is perfect, so it is applied by
dividing the p-value threshold by the current iteration index.

If this two-step correction is not required, set the two_step parameter to
False; then (with perc=100) BorutaPy behaves exactly like the R version.

## Parameters ##

__estimator__ : object
 > A supervised learning estimator with a 'fit' method that provides the
 > feature_importances_ attribute. Important features must correspond to
 > high absolute values in feature_importances_.

__n_estimators__ : int or string, default = 1000
 > If int, sets the number of estimators in the chosen ensemble method.
 > If 'auto', this is determined automatically based on the size of the
 > dataset. The other parameters of the used estimator need to be set at
 > initialisation.

__perc__ : int, default = 100
 > Instead of the max, we use the user-defined percentile to pick the
 > threshold for the comparison between shadow and real features. The max
 > tends to be too stringent; this parameter provides finer control. The
 > lower perc is, the more false positives will be picked as relevant, but
 > also the fewer relevant features will be left out: the usual trade-off.
 > The default is essentially the vanilla Boruta, corresponding to the max.

__alpha__ : float, default = 0.05
 > Level at which the corrected p-values are rejected in both correction
 > steps.

__two_step__ : Boolean, default = True
 > If you want to use the original implementation of Boruta with the
 > Bonferroni correction only, set this to False.

__max_iter__ : int, default = 100
 > The maximum number of iterations to perform.

__verbose__ : int, default = 0
 > Controls verbosity of output.

## Attributes ##

**n_features_** : int
 > The number of selected features.

**support_** : array of shape [n_features]
 > The mask of selected features; only confirmed ones are True.

**support_weak_** : array of shape [n_features]
 > The mask of selected tentative features, which haven't gained enough
 > support during the max_iter number of iterations.

**ranking_** : array of shape [n_features]
 > The feature ranking, such that ``ranking_[i]`` corresponds to the ranking
 > position of the i-th feature. Selected (i.e., estimated best) features are
 > assigned rank 1 and tentative features are assigned rank 2.
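
As a small follow-up to the attributes above (a hedged sketch, not from the
original README; df is a placeholder for the pandas DataFrame that X was taken
from), the boolean masks can be mapped back to column names:

    # after feat_selector.fit(X, y), where X == df.values
    confirmed = df.columns[feat_selector.support_]
    tentative = df.columns[feat_selector.support_weak_]
    print('confirmed:', list(confirmed))
    print('tentative:', list(tentative))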

## Examples ##

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    # load X and y
    # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
    X = pd.read_csv('examples/test_X.csv', index_col=0).values
    y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
    y = y.ravel()

    # define random forest classifier, utilising all cores and
    # sampling in proportion to y labels
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

    # define Boruta feature selection method
    feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

    # find all relevant features - 5 features should be selected
    feat_selector.fit(X, y)

    # check selected features - first 5 features are selected
    feat_selector.support_

    # check ranking of features
    feat_selector.ranking_

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X)

## References ##

1. Kursa M., Rudnicki W., "Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

%package -n python3-Boruta
Summary:        Python Implementation of Boruta Feature Selection
Provides:       python-Boruta
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip
%description -n python3-Boruta
# boruta_py #

This project hosts Python implementations of the [Boruta all-relevant feature selection method](https://m2.icm.edu.pl/boruta/).

[Related blog post](http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/)

## Dependencies ##

* numpy
* scipy
* scikit-learn

## How to use ##

Download, import and use it as you would any other scikit-learn method:

* fit(X, y)
* transform(X)
* fit_transform(X, y)

## Description ##

Python implementations of the Boruta R package.

This implementation tries to mimic the scikit-learn interface, so use fit,
transform, or fit_transform to run the feature selection.

For more, see the docs of these functions and the examples below.

Original code and method by: Miron B. Kursa, https://m2.icm.edu.pl/boruta/

Boruta is an all-relevant feature selection method, while most others are
minimal-optimal; this means it tries to find all features carrying
information usable for prediction, rather than finding a possibly compact
subset of features on which some classifier has a minimal error.

Why bother with all-relevant feature selection? When you try to understand
the phenomenon that generated your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in the context of
your methodology (yes, the minimal-optimal set of features by definition
depends on your classifier choice).

## What's different in BorutaPy? ##

It is the original R package recoded in Python with a few extra features added.
Some improvements include:

* Faster run times, thanks to scikit-learn

* Scikit-learn-like interface

* Compatible with any ensemble method from scikit-learn

* Automatic n_estimators selection

* Ranking of features

For more details, please check the top of the docstring.

We highly recommend using pruned trees with a depth between 3 and 7.
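
For instance (a minimal sketch assuming a classification task, not part of the
original README), a depth-limited forest in the recommended range looks like
this:

    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    # shallow, "pruned" trees: max_depth kept in the recommended 3-7 range
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
    feat_selector = BorutaPy(rf, n_estimators='auto', random_state=1)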

Also, after experimenting a lot with the original code, I identified a few
areas where the core algorithm could be improved or altered to make it less
strict and more applicable to biological data, where the Bonferroni correction
might be overly harsh.

__Percentile as threshold__

The original method uses the maximum of the shadow features as the threshold
for deciding whether a real feature is doing better than the shadow ones. This
can be overly harsh.

To control this, I added the perc parameter, which sets the percentile of the
shadow features' importances that the algorithm uses as the threshold. The
default of 100 is equivalent to taking the maximum, as the R version of Boruta
does, but it can be relaxed. Note that since this is a percentile, its effect
changes with the number of features: with several thousand features it isn't
as stringent as with the few dozen left at the end of a Boruta run.

__Two-step correction for multiple testing__

The correction for multiple testing was relaxed by making it a two-step
process, rather than a harsh one-step Bonferroni correction.

We need to correct firstly because in each iteration we test a number of
features against the null hypothesis (does a feature perform better than
expected by random?). The original code uses the Bonferroni correction for
this, which is known to be too stringent in such scenarios (at least for
biological data); it also corrects for n features even if we are in the 50th
iteration, where only k << n features are left. For this reason the first
correction step is the widely used Benjamini-Hochberg FDR.

Following that, however, we also need to account for the fact that we have
been testing the same features over and over again in each iteration with the
same test. For this scenario the Bonferroni is perfect, so it is applied by
dividing the p-value threshold by the current iteration index.

If this two-step correction is not required, set the two_step parameter to
False; then (with perc=100) BorutaPy behaves exactly like the R version.

## Parameters ##

__estimator__ : object
 > A supervised learning estimator with a 'fit' method that provides the
 > feature_importances_ attribute. Important features must correspond to
 > high absolute values in feature_importances_.

__n_estimators__ : int or string, default = 1000
 > If int, sets the number of estimators in the chosen ensemble method.
 > If 'auto', this is determined automatically based on the size of the
 > dataset. The other parameters of the used estimator need to be set at
 > initialisation.

__perc__ : int, default = 100
 > Instead of the max, we use the user-defined percentile to pick the
 > threshold for the comparison between shadow and real features. The max
 > tends to be too stringent; this parameter provides finer control. The
 > lower perc is, the more false positives will be picked as relevant, but
 > also the fewer relevant features will be left out: the usual trade-off.
 > The default is essentially the vanilla Boruta, corresponding to the max.

__alpha__ : float, default = 0.05
 > Level at which the corrected p-values are rejected in both correction
 > steps.

__two_step__ : Boolean, default = True
 > If you want to use the original implementation of Boruta with the
 > Bonferroni correction only, set this to False.

__max_iter__ : int, default = 100
 > The maximum number of iterations to perform.

__verbose__ : int, default = 0
 > Controls verbosity of output.

## Attributes ##

**n_features_** : int
 > The number of selected features.

**support_** : array of shape [n_features]
 > The mask of selected features; only confirmed ones are True.

**support_weak_** : array of shape [n_features]
 > The mask of selected tentative features, which haven't gained enough
 > support during the max_iter number of iterations.

**ranking_** : array of shape [n_features]
 > The feature ranking, such that ``ranking_[i]`` corresponds to the ranking
 > position of the i-th feature. Selected (i.e., estimated best) features are
 > assigned rank 1 and tentative features are assigned rank 2.

## Examples ##

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    # load X and y
    # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
    X = pd.read_csv('examples/test_X.csv', index_col=0).values
    y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
    y = y.ravel()

    # define random forest classifier, utilising all cores and
    # sampling in proportion to y labels
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

    # define Boruta feature selection method
    feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

    # find all relevant features - 5 features should be selected
    feat_selector.fit(X, y)

    # check selected features - first 5 features are selected
    feat_selector.support_

    # check ranking of features
    feat_selector.ranking_

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X)

## References ##

1. Kursa M., Rudnicki W., "Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

%package help
Summary:        Development documents and examples for Boruta
Provides:       python3-Boruta-doc
%description help
# boruta_py #

This project hosts Python implementations of the [Boruta all-relevant feature selection method](https://m2.icm.edu.pl/boruta/).

[Related blog post](http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/)

## Dependencies ##

* numpy
* scipy
* scikit-learn

## How to use ##

Download, import and use it as you would any other scikit-learn method:

* fit(X, y)
* transform(X)
* fit_transform(X, y)

## Description ##

Python implementations of the Boruta R package.

This implementation tries to mimic the scikit-learn interface, so use fit,
transform, or fit_transform to run the feature selection.

For more, see the docs of these functions and the examples below.

Original code and method by: Miron B. Kursa, https://m2.icm.edu.pl/boruta/

Boruta is an all-relevant feature selection method, while most others are
minimal-optimal; this means it tries to find all features carrying
information usable for prediction, rather than finding a possibly compact
subset of features on which some classifier has a minimal error.

Why bother with all-relevant feature selection? When you try to understand
the phenomenon that generated your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in the context of
your methodology (yes, the minimal-optimal set of features by definition
depends on your classifier choice).

## What's different in BorutaPy? ##

It is the original R package recoded in Python with a few extra features added.
Some improvements include:

* Faster run times, thanks to scikit-learn

* Scikit-learn-like interface

* Compatible with any ensemble method from scikit-learn

* Automatic n_estimators selection

* Ranking of features

For more details, please check the top of the docstring.

We highly recommend using pruned trees with a depth between 3 and 7.

Also, after experimenting a lot with the original code, I identified a few
areas where the core algorithm could be improved or altered to make it less
strict and more applicable to biological data, where the Bonferroni correction
might be overly harsh.

__Percentile as threshold__

The original method uses the maximum of the shadow features as the threshold
for deciding whether a real feature is doing better than the shadow ones. This
can be overly harsh.

To control this, I added the perc parameter, which sets the percentile of the
shadow features' importances that the algorithm uses as the threshold. The
default of 100 is equivalent to taking the maximum, as the R version of Boruta
does, but it can be relaxed. Note that since this is a percentile, its effect
changes with the number of features: with several thousand features it isn't
as stringent as with the few dozen left at the end of a Boruta run.

__Two-step correction for multiple testing__

The correction for multiple testing was relaxed by making it a two-step
process, rather than a harsh one-step Bonferroni correction.

We need to correct firstly because in each iteration we test a number of
features against the null hypothesis (does a feature perform better than
expected by random?). The original code uses the Bonferroni correction for
this, which is known to be too stringent in such scenarios (at least for
biological data); it also corrects for n features even if we are in the 50th
iteration, where only k << n features are left. For this reason the first
correction step is the widely used Benjamini-Hochberg FDR.

Following that, however, we also need to account for the fact that we have
been testing the same features over and over again in each iteration with the
same test. For this scenario the Bonferroni is perfect, so it is applied by
dividing the p-value threshold by the current iteration index.

If this two-step correction is not required, set the two_step parameter to
False; then (with perc=100) BorutaPy behaves exactly like the R version.

## Parameters ##

__estimator__ : object
 > A supervised learning estimator with a 'fit' method that provides the
 > feature_importances_ attribute. Important features must correspond to
 > high absolute values in feature_importances_.

__n_estimators__ : int or string, default = 1000
 > If int, sets the number of estimators in the chosen ensemble method.
 > If 'auto', this is determined automatically based on the size of the
 > dataset. The other parameters of the used estimator need to be set at
 > initialisation.

__perc__ : int, default = 100
 > Instead of the max, we use the user-defined percentile to pick the
 > threshold for the comparison between shadow and real features. The max
 > tends to be too stringent; this parameter provides finer control. The
 > lower perc is, the more false positives will be picked as relevant, but
 > also the fewer relevant features will be left out: the usual trade-off.
 > The default is essentially the vanilla Boruta, corresponding to the max.

__alpha__ : float, default = 0.05
 > Level at which the corrected p-values are rejected in both correction
 > steps.

__two_step__ : Boolean, default = True
 > If you want to use the original implementation of Boruta with the
 > Bonferroni correction only, set this to False.

__max_iter__ : int, default = 100
 > The maximum number of iterations to perform.

__verbose__ : int, default = 0
 > Controls verbosity of output.

## Attributes ##

**n_features_** : int
 > The number of selected features.

**support_** : array of shape [n_features]
 > The mask of selected features; only confirmed ones are True.

**support_weak_** : array of shape [n_features]
 > The mask of selected tentative features, which haven't gained enough
 > support during the max_iter number of iterations.

**ranking_** : array of shape [n_features]
 > The feature ranking, such that ``ranking_[i]`` corresponds to the ranking
 > position of the i-th feature. Selected (i.e., estimated best) features are
 > assigned rank 1 and tentative features are assigned rank 2.

## Examples ##

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    # load X and y
    # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
    X = pd.read_csv('examples/test_X.csv', index_col=0).values
    y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
    y = y.ravel()

    # define random forest classifier, utilising all cores and
    # sampling in proportion to y labels
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

    # define Boruta feature selection method
    feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

    # find all relevant features - 5 features should be selected
    feat_selector.fit(X, y)

    # check selected features - first 5 features are selected
    feat_selector.support_

    # check ranking of features
    feat_selector.ranking_

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X)

## References ##

1. Kursa M., Rudnicki W., "Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

%prep
%autosetup -n Boruta-0.3

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-Boruta -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3-1
- Package Spec generated

diff --git a/sources b/sources
@@ -0,0 +1 @@
+1d804dccc34427afd007bc0f2fcc630e Boruta-0.3.tar.gz