author    CoprDistGit <infra@openeuler.org>    2023-05-31 03:44:28 +0000
committer CoprDistGit <infra@openeuler.org>    2023-05-31 03:44:28 +0000
commit    452434f6c3e00811d2ee474b26dfb13e05138f9a (patch)
tree      16ace5ca40f342b02d84ab4517bca0f0e6d31d26 /python-documentfeatureselection.spec
parent    f9045cc845a006d421f0e04b28aeb8acf303fb79 (diff)
automatic import of python-documentfeatureselection
Diffstat (limited to 'python-documentfeatureselection.spec')
-rw-r--r--  python-documentfeatureselection.spec  408
1 file changed, 408 insertions(+), 0 deletions(-)
diff --git a/python-documentfeatureselection.spec b/python-documentfeatureselection.spec
new file mode 100644
index 0000000..b32d77a
--- /dev/null
+++ b/python-documentfeatureselection.spec
@@ -0,0 +1,408 @@
+%global _empty_manifest_terminate_build 0
+Name: python-DocumentFeatureSelection
+Version: 1.5
+Release: 1
+Summary: Various methods of feature selection from Text Data
+License: CeCILL-B
+URL: https://github.com/Kensuke-Mitsuzawa/DocumentFeatureSelection
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/b2/d8/7af550d2c17096b15619b1832bdc97cecc3ad2af86a2351b85d19df664a9/DocumentFeatureSelection-1.5.tar.gz
+BuildArch: noarch
+
+Requires: python3-six
+Requires: python3-setuptools
+Requires: python3-joblib
+Requires: python3-numpy
+Requires: python3-scipy
+Requires: python3-nltk
+Requires: python3-scikit-learn
+Requires: python3-pypandoc
+Requires: python3-cython
+Requires: python3-sqlitedict
+Requires: python3-nose
+Requires: python3-typing
+
+%description
+# what's this?
+This is a set of feature selection methods for text data.
+(About feature selection, see [here](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html) or [here](http://stackoverflow.com/questions/13603882/feature-selection-and-reduction-for-text-classification).)
+Feature selection is important when you apply machine learning to natural language data.
+Natural language data usually contains a lot of noisy features, so machine learning models perform poorly unless you apply some feature selection.
+(There are exceptions, such as _Decision Tree_ or _Random Forest_, which have a feature selection metric built into the algorithm itself.)
+Feature selection is also useful when you explore your text data:
+it shows you which features really contribute to specific labels.
+Please visit the [project page on GitHub](https://github.com/Kensuke-Mitsuzawa/DocumentFeatureSelection).
+If you find any bugs, please report them as a GitHub issue.
+Pull requests are welcome.
+## Supporting methods
+This package provides several feature selection metrics.
+Currently, it supports the following methods:
+* TF-IDF
+* Pointwise mutual information (PMI)
+* Strength of Association (SOA)
+* Bi-Normal Separation (BNS)
+## Contribution of this package
+* Easy interface for pre-processing
+* Easy interface for accessing feature selection methods
+* Fast computation thanks to sparse matrices and multi-processing
+# Overview of methods
+## TF-IDF
+This method simply calls scikit-learn's `TfidfTransformer`.
+See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for details.
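+As a rough illustration, the call being wrapped looks like this (a minimal sketch with a made-up count matrix; the package builds the matrix from your labeled input for you):
+```python
+import numpy as np
+from sklearn.feature_extraction.text import TfidfTransformer
+
+# Hypothetical term-count matrix: rows are documents, columns are tokens.
+counts = np.array([[3, 0, 1],
+                   [2, 0, 0],
+                   [0, 1, 4]])
+
+# TfidfTransformer rescales the raw counts into TF-IDF weights.
+tfidf = TfidfTransformer().fit_transform(counts)
+print(tfidf.toarray())
+```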
+## PMI
+PMI measures the association between a _feature_ (i.e. token) and a _category_ (i.e. label).
+Concretely, it builds a _cross table_ (also called a _contingency table_) and computes the joint and marginal probabilities on it.
+To learn more, see this [reference](https://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/philip-pmi.pdf).
+In the Python world, [NLTK](http://www.nltk.org/howto/collocations.html) and [another package](https://github.com/Bollegala/svdmi) also provide PMI.
+Check them and choose based on your preference and usage.
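+For intuition, a PMI score can be computed from contingency-table counts like this (an illustrative sketch with hypothetical counts, not this package's internal code):
+```python
+import math
+
+def pmi(n_we: int, n_w: int, n_e: int, n_total: int) -> float:
+    """log2 of the joint probability over the product of the marginals."""
+    p_we = n_we / n_total  # joint: token w inside category e
+    p_w = n_w / n_total    # marginal: token w anywhere
+    p_e = n_e / n_total    # marginal: any token in category e
+    return math.log2(p_we / (p_w * p_e))
+
+# e.g. a token seen 2 times under label e and 3 times overall,
+# where label e covers 13 of the 34 tokens in the corpus:
+print(pmi(n_we=2, n_w=3, n_e=13, n_total=34))
+```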
+## SOA
+SOA is a feature selection method that improves on PMI.
+PMI is unreliable when a feature has a low word frequency.
+SOA is based on the PMI computation, but it remains usable for such low-frequency features.
+Moreover, it can capture anti-correlation between features and categories.
+In this package, the SOA formula comes from the following paper:
+`Saif Mohammad and Svetlana Kiritchenko, "Using Hashtags to Capture Fine Emotion Categories from Tweets", Computational Intelligence, 01/2014; 31(2).`
+```
+SOA(w, e) = \log_2 \frac{freq(w, e) \cdot freq(\neg e)}{freq(e) \cdot freq(w, \neg e)}
+```
+Where
+* freq(w, e) is the number of times _w_ occurs in a unit (sentence or document) with label _e_
+* freq(w, ¬e) is the number of times _w_ occurs in units that do not have the label _e_
+* freq(e) is the number of units having the label _e_
+* freq(¬e) is the number of units not having the label _e_
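+As a sanity check, the formula can be transcribed directly into Python (an illustrative sketch, not the package's optimized implementation):
+```python
+import math
+
+def soa(freq_w_e: int, freq_w_not_e: int, freq_e: int, freq_not_e: int) -> float:
+    """SOA(w, e) = log2((freq(w,e) * freq(~e)) / (freq(e) * freq(w,~e)))."""
+    return math.log2((freq_w_e * freq_not_e) / (freq_e * freq_w_not_e))
+
+# A feature seen 5 times in the 20 units labeled e and once in the
+# 80 other units scores strongly positive; swapping the counts flips
+# the sign, which is the anti-correlation mentioned above.
+print(soa(freq_w_e=5, freq_w_not_e=1, freq_e=20, freq_not_e=80))  # ~ 4.32
+print(soa(freq_w_e=1, freq_w_not_e=5, freq_e=80, freq_not_e=20))  # ~ -4.32
+```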
+## BNS
+BNS is a feature selection method for binary-class data.
+There are several methods available for binary-class data, such as _information gain (IG)_, _chi-squared (CHI)_, and _odds ratio (Odds)_.
+The problem arises when you run feature selection on skewed data:
+these methods perform poorly on such data, whereas _BNS_ remains effective even when the classes are heavily skewed.
+The following papers show how BNS handles skewed data:
+```Lei Tang and Huan Liu, "Bias Analysis in Text Classification for Highly Skewed Data", 2005```
+or
+```George Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research 3 (2003) 1289-1305```
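+For reference, Forman (2003) defines BNS as |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard normal CDF, tpr = tp/pos, and fpr = fp/neg. A minimal sketch of that definition, assuming `scipy` is available (not necessarily this package's exact implementation):
+```python
+from scipy.stats import norm
+
+def bns(tp: int, fp: int, pos: int, neg: int, eps: float = 0.0005) -> float:
+    """|F^-1(tpr) - F^-1(fpr)|, with the rates clamped away from
+    0 and 1 so the inverse normal CDF stays finite."""
+    tpr = min(max(tp / pos, eps), 1 - eps)
+    fpr = min(max(fp / neg, eps), 1 - eps)
+    return abs(norm.ppf(tpr) - norm.ppf(fpr))
+
+# A feature present in 40 of 100 positive units but only 5 of 400
+# negative units still gets a high score despite the class skew.
+print(bns(tp=40, fp=5, pos=100, neg=400))
+```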
+# Requirement
+* Python 3.x (checked under Python 3.5)
+# Setting up
+## install
+`python setup.py install`
+### Note
+You might see an error message while running this command, such as:
+```
+We failed to install numpy automatically. Try installing numpy manually or Try anaconda distribution.
+```
+This happens because `setup.py` tries to install numpy and scipy with `pip`, but the installation fails.
+numpy and scipy must be installed before `scikit-learn`.
+In that case, take one of the following approaches:
+* Install `numpy` and `scipy` manually
+* Use the `anaconda` Python distribution. Please visit [their site](https://www.continuum.io/downloads).
+# Example
+```python
+input_dict = {
+ "label_a": [
+ ["I", "aa", "aa", "aa", "aa", "aa"],
+ ["bb", "aa", "aa", "aa", "aa", "aa"],
+ ["I", "aa", "hero", "some", "ok", "aa"]
+ ],
+ "label_b": [
+ ["bb", "bb", "bb"],
+ ["bb", "bb", "bb"],
+ ["hero", "ok", "bb"],
+ ["hero", "cc", "bb"],
+ ],
+ "label_c": [
+ ["cc", "cc", "cc"],
+ ["cc", "cc", "bb"],
+ ["xx", "xx", "cc"],
+ ["aa", "xx", "cc"],
+ ]
+}
+from DocumentFeatureSelection import interface
+interface.run_feature_selection(input_dict, method='pmi', use_cython=True).convert_score_matrix2score_record()
+```
+Then you get a result like this:
+```python
+[{'score': 0.14976146817207336, 'label': 'label_c', 'feature': 'bb', 'frequency': 1.0}, ...]
+```
+See the scripts in `examples/` for more examples.
+# For developers
+You can set up a development environment with docker-compose.
+The following commands run the tests inside a Docker container.
+```bash
+$ cd tests/
+$ docker-compose build
+$ docker-compose up
+```
+
+%package -n python3-DocumentFeatureSelection
+Summary: Various methods of feature selection from Text Data
+Provides: python-DocumentFeatureSelection
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-DocumentFeatureSelection
+# what's this?
+This is a set of feature selection methods for text data.
+(About feature selection, see [here](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html) or [here](http://stackoverflow.com/questions/13603882/feature-selection-and-reduction-for-text-classification).)
+Feature selection is important when you apply machine learning to natural language data.
+Natural language data usually contains a lot of noisy features, so machine learning models perform poorly unless you apply some feature selection.
+(There are exceptions, such as _Decision Tree_ or _Random Forest_, which have a feature selection metric built into the algorithm itself.)
+Feature selection is also useful when you explore your text data:
+it shows you which features really contribute to specific labels.
+Please visit the [project page on GitHub](https://github.com/Kensuke-Mitsuzawa/DocumentFeatureSelection).
+If you find any bugs, please report them as a GitHub issue.
+Pull requests are welcome.
+## Supporting methods
+This package provides several feature selection metrics.
+Currently, it supports the following methods:
+* TF-IDF
+* Pointwise mutual information (PMI)
+* Strength of Association (SOA)
+* Bi-Normal Separation (BNS)
+## Contribution of this package
+* Easy interface for pre-processing
+* Easy interface for accessing feature selection methods
+* Fast computation thanks to sparse matrices and multi-processing
+# Overview of methods
+## TF-IDF
+This method simply calls scikit-learn's `TfidfTransformer`.
+See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for details.
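+As a rough illustration, the call being wrapped looks like this (a minimal sketch with a made-up count matrix; the package builds the matrix from your labeled input for you):
+```python
+import numpy as np
+from sklearn.feature_extraction.text import TfidfTransformer
+
+# Hypothetical term-count matrix: rows are documents, columns are tokens.
+counts = np.array([[3, 0, 1],
+                   [2, 0, 0],
+                   [0, 1, 4]])
+
+# TfidfTransformer rescales the raw counts into TF-IDF weights.
+tfidf = TfidfTransformer().fit_transform(counts)
+print(tfidf.toarray())
+```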
+## PMI
+PMI measures the association between a _feature_ (i.e. token) and a _category_ (i.e. label).
+Concretely, it builds a _cross table_ (also called a _contingency table_) and computes the joint and marginal probabilities on it.
+To learn more, see this [reference](https://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/philip-pmi.pdf).
+In the Python world, [NLTK](http://www.nltk.org/howto/collocations.html) and [another package](https://github.com/Bollegala/svdmi) also provide PMI.
+Check them and choose based on your preference and usage.
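+For intuition, a PMI score can be computed from contingency-table counts like this (an illustrative sketch with hypothetical counts, not this package's internal code):
+```python
+import math
+
+def pmi(n_we: int, n_w: int, n_e: int, n_total: int) -> float:
+    """log2 of the joint probability over the product of the marginals."""
+    p_we = n_we / n_total  # joint: token w inside category e
+    p_w = n_w / n_total    # marginal: token w anywhere
+    p_e = n_e / n_total    # marginal: any token in category e
+    return math.log2(p_we / (p_w * p_e))
+
+# e.g. a token seen 2 times under label e and 3 times overall,
+# where label e covers 13 of the 34 tokens in the corpus:
+print(pmi(n_we=2, n_w=3, n_e=13, n_total=34))
+```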
+## SOA
+SOA is a feature selection method that improves on PMI.
+PMI is unreliable when a feature has a low word frequency.
+SOA is based on the PMI computation, but it remains usable for such low-frequency features.
+Moreover, it can capture anti-correlation between features and categories.
+In this package, the SOA formula comes from the following paper:
+`Saif Mohammad and Svetlana Kiritchenko, "Using Hashtags to Capture Fine Emotion Categories from Tweets", Computational Intelligence, 01/2014; 31(2).`
+```
+SOA(w, e) = \log_2 \frac{freq(w, e) \cdot freq(\neg e)}{freq(e) \cdot freq(w, \neg e)}
+```
+Where
+* freq(w, e) is the number of times _w_ occurs in a unit (sentence or document) with label _e_
+* freq(w, ¬e) is the number of times _w_ occurs in units that do not have the label _e_
+* freq(e) is the number of units having the label _e_
+* freq(¬e) is the number of units not having the label _e_
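+As a sanity check, the formula can be transcribed directly into Python (an illustrative sketch, not the package's optimized implementation):
+```python
+import math
+
+def soa(freq_w_e: int, freq_w_not_e: int, freq_e: int, freq_not_e: int) -> float:
+    """SOA(w, e) = log2((freq(w,e) * freq(~e)) / (freq(e) * freq(w,~e)))."""
+    return math.log2((freq_w_e * freq_not_e) / (freq_e * freq_w_not_e))
+
+# A feature seen 5 times in the 20 units labeled e and once in the
+# 80 other units scores strongly positive; swapping the counts flips
+# the sign, which is the anti-correlation mentioned above.
+print(soa(freq_w_e=5, freq_w_not_e=1, freq_e=20, freq_not_e=80))  # ~ 4.32
+print(soa(freq_w_e=1, freq_w_not_e=5, freq_e=80, freq_not_e=20))  # ~ -4.32
+```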
+## BNS
+BNS is a feature selection method for binary-class data.
+There are several methods available for binary-class data, such as _information gain (IG)_, _chi-squared (CHI)_, and _odds ratio (Odds)_.
+The problem arises when you run feature selection on skewed data:
+these methods perform poorly on such data, whereas _BNS_ remains effective even when the classes are heavily skewed.
+The following papers show how BNS handles skewed data:
+```Lei Tang and Huan Liu, "Bias Analysis in Text Classification for Highly Skewed Data", 2005```
+or
+```George Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research 3 (2003) 1289-1305```
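+For reference, Forman (2003) defines BNS as |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard normal CDF, tpr = tp/pos, and fpr = fp/neg. A minimal sketch of that definition, assuming `scipy` is available (not necessarily this package's exact implementation):
+```python
+from scipy.stats import norm
+
+def bns(tp: int, fp: int, pos: int, neg: int, eps: float = 0.0005) -> float:
+    """|F^-1(tpr) - F^-1(fpr)|, with the rates clamped away from
+    0 and 1 so the inverse normal CDF stays finite."""
+    tpr = min(max(tp / pos, eps), 1 - eps)
+    fpr = min(max(fp / neg, eps), 1 - eps)
+    return abs(norm.ppf(tpr) - norm.ppf(fpr))
+
+# A feature present in 40 of 100 positive units but only 5 of 400
+# negative units still gets a high score despite the class skew.
+print(bns(tp=40, fp=5, pos=100, neg=400))
+```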
+# Requirement
+* Python 3.x (checked under Python 3.5)
+# Setting up
+## install
+`python setup.py install`
+### Note
+You might see an error message while running this command, such as:
+```
+We failed to install numpy automatically. Try installing numpy manually or Try anaconda distribution.
+```
+This happens because `setup.py` tries to install numpy and scipy with `pip`, but the installation fails.
+numpy and scipy must be installed before `scikit-learn`.
+In that case, take one of the following approaches:
+* Install `numpy` and `scipy` manually
+* Use the `anaconda` Python distribution. Please visit [their site](https://www.continuum.io/downloads).
+# Example
+```python
+input_dict = {
+ "label_a": [
+ ["I", "aa", "aa", "aa", "aa", "aa"],
+ ["bb", "aa", "aa", "aa", "aa", "aa"],
+ ["I", "aa", "hero", "some", "ok", "aa"]
+ ],
+ "label_b": [
+ ["bb", "bb", "bb"],
+ ["bb", "bb", "bb"],
+ ["hero", "ok", "bb"],
+ ["hero", "cc", "bb"],
+ ],
+ "label_c": [
+ ["cc", "cc", "cc"],
+ ["cc", "cc", "bb"],
+ ["xx", "xx", "cc"],
+ ["aa", "xx", "cc"],
+ ]
+}
+from DocumentFeatureSelection import interface
+interface.run_feature_selection(input_dict, method='pmi', use_cython=True).convert_score_matrix2score_record()
+```
+Then you get a result like this:
+```python
+[{'score': 0.14976146817207336, 'label': 'label_c', 'feature': 'bb', 'frequency': 1.0}, ...]
+```
+See the scripts in `examples/` for more examples.
+# For developers
+You can set up a development environment with docker-compose.
+The following commands run the tests inside a Docker container.
+```bash
+$ cd tests/
+$ docker-compose build
+$ docker-compose up
+```
+
+%package help
+Summary: Development documents and examples for DocumentFeatureSelection
+Provides: python3-DocumentFeatureSelection-doc
+%description help
+# what's this?
+This is a set of feature selection methods for text data.
+(About feature selection, see [here](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html) or [here](http://stackoverflow.com/questions/13603882/feature-selection-and-reduction-for-text-classification).)
+Feature selection is important when you apply machine learning to natural language data.
+Natural language data usually contains a lot of noisy features, so machine learning models perform poorly unless you apply some feature selection.
+(There are exceptions, such as _Decision Tree_ or _Random Forest_, which have a feature selection metric built into the algorithm itself.)
+Feature selection is also useful when you explore your text data:
+it shows you which features really contribute to specific labels.
+Please visit the [project page on GitHub](https://github.com/Kensuke-Mitsuzawa/DocumentFeatureSelection).
+If you find any bugs, please report them as a GitHub issue.
+Pull requests are welcome.
+## Supporting methods
+This package provides several feature selection metrics.
+Currently, it supports the following methods:
+* TF-IDF
+* Pointwise mutual information (PMI)
+* Strength of Association (SOA)
+* Bi-Normal Separation (BNS)
+## Contribution of this package
+* Easy interface for pre-processing
+* Easy interface for accessing feature selection methods
+* Fast computation thanks to sparse matrices and multi-processing
+# Overview of methods
+## TF-IDF
+This method simply calls scikit-learn's `TfidfTransformer`.
+See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) for details.
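+As a rough illustration, the call being wrapped looks like this (a minimal sketch with a made-up count matrix; the package builds the matrix from your labeled input for you):
+```python
+import numpy as np
+from sklearn.feature_extraction.text import TfidfTransformer
+
+# Hypothetical term-count matrix: rows are documents, columns are tokens.
+counts = np.array([[3, 0, 1],
+                   [2, 0, 0],
+                   [0, 1, 4]])
+
+# TfidfTransformer rescales the raw counts into TF-IDF weights.
+tfidf = TfidfTransformer().fit_transform(counts)
+print(tfidf.toarray())
+```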
+## PMI
+PMI measures the association between a _feature_ (i.e. token) and a _category_ (i.e. label).
+Concretely, it builds a _cross table_ (also called a _contingency table_) and computes the joint and marginal probabilities on it.
+To learn more, see this [reference](https://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/philip-pmi.pdf).
+In the Python world, [NLTK](http://www.nltk.org/howto/collocations.html) and [another package](https://github.com/Bollegala/svdmi) also provide PMI.
+Check them and choose based on your preference and usage.
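+For intuition, a PMI score can be computed from contingency-table counts like this (an illustrative sketch with hypothetical counts, not this package's internal code):
+```python
+import math
+
+def pmi(n_we: int, n_w: int, n_e: int, n_total: int) -> float:
+    """log2 of the joint probability over the product of the marginals."""
+    p_we = n_we / n_total  # joint: token w inside category e
+    p_w = n_w / n_total    # marginal: token w anywhere
+    p_e = n_e / n_total    # marginal: any token in category e
+    return math.log2(p_we / (p_w * p_e))
+
+# e.g. a token seen 2 times under label e and 3 times overall,
+# where label e covers 13 of the 34 tokens in the corpus:
+print(pmi(n_we=2, n_w=3, n_e=13, n_total=34))
+```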
+## SOA
+SOA is a feature selection method that improves on PMI.
+PMI is unreliable when a feature has a low word frequency.
+SOA is based on the PMI computation, but it remains usable for such low-frequency features.
+Moreover, it can capture anti-correlation between features and categories.
+In this package, the SOA formula comes from the following paper:
+`Saif Mohammad and Svetlana Kiritchenko, "Using Hashtags to Capture Fine Emotion Categories from Tweets", Computational Intelligence, 01/2014; 31(2).`
+```
+SOA(w, e) = \log_2 \frac{freq(w, e) \cdot freq(\neg e)}{freq(e) \cdot freq(w, \neg e)}
+```
+Where
+* freq(w, e) is the number of times _w_ occurs in a unit (sentence or document) with label _e_
+* freq(w, ¬e) is the number of times _w_ occurs in units that do not have the label _e_
+* freq(e) is the number of units having the label _e_
+* freq(¬e) is the number of units not having the label _e_
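+As a sanity check, the formula can be transcribed directly into Python (an illustrative sketch, not the package's optimized implementation):
+```python
+import math
+
+def soa(freq_w_e: int, freq_w_not_e: int, freq_e: int, freq_not_e: int) -> float:
+    """SOA(w, e) = log2((freq(w,e) * freq(~e)) / (freq(e) * freq(w,~e)))."""
+    return math.log2((freq_w_e * freq_not_e) / (freq_e * freq_w_not_e))
+
+# A feature seen 5 times in the 20 units labeled e and once in the
+# 80 other units scores strongly positive; swapping the counts flips
+# the sign, which is the anti-correlation mentioned above.
+print(soa(freq_w_e=5, freq_w_not_e=1, freq_e=20, freq_not_e=80))  # ~ 4.32
+print(soa(freq_w_e=1, freq_w_not_e=5, freq_e=80, freq_not_e=20))  # ~ -4.32
+```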
+## BNS
+BNS is a feature selection method for binary-class data.
+There are several methods available for binary-class data, such as _information gain (IG)_, _chi-squared (CHI)_, and _odds ratio (Odds)_.
+The problem arises when you run feature selection on skewed data:
+these methods perform poorly on such data, whereas _BNS_ remains effective even when the classes are heavily skewed.
+The following papers show how BNS handles skewed data:
+```Lei Tang and Huan Liu, "Bias Analysis in Text Classification for Highly Skewed Data", 2005```
+or
+```George Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research 3 (2003) 1289-1305```
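+For reference, Forman (2003) defines BNS as |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard normal CDF, tpr = tp/pos, and fpr = fp/neg. A minimal sketch of that definition, assuming `scipy` is available (not necessarily this package's exact implementation):
+```python
+from scipy.stats import norm
+
+def bns(tp: int, fp: int, pos: int, neg: int, eps: float = 0.0005) -> float:
+    """|F^-1(tpr) - F^-1(fpr)|, with the rates clamped away from
+    0 and 1 so the inverse normal CDF stays finite."""
+    tpr = min(max(tp / pos, eps), 1 - eps)
+    fpr = min(max(fp / neg, eps), 1 - eps)
+    return abs(norm.ppf(tpr) - norm.ppf(fpr))
+
+# A feature present in 40 of 100 positive units but only 5 of 400
+# negative units still gets a high score despite the class skew.
+print(bns(tp=40, fp=5, pos=100, neg=400))
+```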
+# Requirement
+* Python 3.x (checked under Python 3.5)
+# Setting up
+## install
+`python setup.py install`
+### Note
+You might see an error message while running this command, such as:
+```
+We failed to install numpy automatically. Try installing numpy manually or Try anaconda distribution.
+```
+This happens because `setup.py` tries to install numpy and scipy with `pip`, but the installation fails.
+numpy and scipy must be installed before `scikit-learn`.
+In that case, take one of the following approaches:
+* Install `numpy` and `scipy` manually
+* Use the `anaconda` Python distribution. Please visit [their site](https://www.continuum.io/downloads).
+# Example
+```python
+input_dict = {
+ "label_a": [
+ ["I", "aa", "aa", "aa", "aa", "aa"],
+ ["bb", "aa", "aa", "aa", "aa", "aa"],
+ ["I", "aa", "hero", "some", "ok", "aa"]
+ ],
+ "label_b": [
+ ["bb", "bb", "bb"],
+ ["bb", "bb", "bb"],
+ ["hero", "ok", "bb"],
+ ["hero", "cc", "bb"],
+ ],
+ "label_c": [
+ ["cc", "cc", "cc"],
+ ["cc", "cc", "bb"],
+ ["xx", "xx", "cc"],
+ ["aa", "xx", "cc"],
+ ]
+}
+from DocumentFeatureSelection import interface
+interface.run_feature_selection(input_dict, method='pmi', use_cython=True).convert_score_matrix2score_record()
+```
+Then you get a result like this:
+```python
+[{'score': 0.14976146817207336, 'label': 'label_c', 'feature': 'bb', 'frequency': 1.0}, ...]
+```
+See the scripts in `examples/` for more examples.
+# For developers
+You can set up a development environment with docker-compose.
+The following commands run the tests inside a Docker container.
+```bash
+$ cd tests/
+$ docker-compose build
+$ docker-compose up
+```
+
+%prep
+%autosetup -n DocumentFeatureSelection-1.5
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-DocumentFeatureSelection -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 1.5-1
+- Package Spec generated