%global _empty_manifest_terminate_build 0
Name:           python-whichlang
Version:        0.0.4
Release:        1
Summary:        Does language identification for Indian languages
License:        MIT License
URL:            https://github.com/xtraspeed/whichlang
Source0:        https://mirrors.aliyun.com/pypi/web/packages/63/d5/dbd25ab5fdf4a0eaea0601158872129d53fd28e117ffd7a9b2b7f0782d84/whichlang-0.0.4.tar.gz
BuildArch:      noarch

%description
# whichlang

whichlang is a Python library for identifying the language of a given text.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install whichlang.

```bash
pip install whichlang
```

## Usage

```python
from whichlang import whichlang as wl

# Read the sample text; UTF-8 is assumed for the Hindi sample file
with open('sample-test-files\\sample-hindi.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# returns tuple of top 3 probable languages, first one being most probable language
print(wl.which_lang(data))
# Output: ('Hindi', 'Marathi', 'Punjabi'), so Hindi is most probable.
```

```
# For training a language model
# assamese.txt is train data
# Assamese is the language model created
python train_lang_models.py -f train-data\as\assamese.txt -l Assamese
```

## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## License

[MIT](https://choosealicense.com/licenses/mit/)

## Available Languages

Hindi, Telugu, Tamil, Kannada, Malayalam, Punjabi, Marathi, Gujarati, Oriya, Assamese.

## Acknowledgements

1. We would like to thank the [Leipzig Corpora Collection](https://corpora.uni-leipzig.de/en), where we collected data for training models.
   Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.
2. whichlang is based on N-gram-based text categorization:
   Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175. 1994.
   The same approach was used in the library [langdetect](https://github.com/fedelopez77/langdetect). We found this approach quite effective and wanted to explore it for Indian languages. In whichlang, we train, optimize and make models readily available for Indian languages, since these languages have been less explored.

%package -n python3-whichlang
Summary:        Does language identification for Indian languages
Provides:       python-whichlang
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-whichlang
# whichlang

whichlang is a Python library for identifying the language of a given text.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install whichlang.

```bash
pip install whichlang
```

## Usage

```python
from whichlang import whichlang as wl

# Read the sample text; UTF-8 is assumed for the Hindi sample file
with open('sample-test-files\\sample-hindi.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# returns tuple of top 3 probable languages, first one being most probable language
print(wl.which_lang(data))
# Output: ('Hindi', 'Marathi', 'Punjabi'), so Hindi is most probable.
```

```
# For training a language model
# assamese.txt is train data
# Assamese is the language model created
python train_lang_models.py -f train-data\as\assamese.txt -l Assamese
```

## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## License

[MIT](https://choosealicense.com/licenses/mit/)

## Available Languages

Hindi, Telugu, Tamil, Kannada, Malayalam, Punjabi, Marathi, Gujarati, Oriya, Assamese.

## Acknowledgements

1. We would like to thank the [Leipzig Corpora Collection](https://corpora.uni-leipzig.de/en), where we collected data for training models.
   Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.
2. whichlang is based on N-gram-based text categorization:
   Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175. 1994.
   The same approach was used in the library [langdetect](https://github.com/fedelopez77/langdetect). We found this approach quite effective and wanted to explore it for Indian languages. In whichlang, we train, optimize and make models readily available for Indian languages, since these languages have been less explored.

%package help
Summary:        Development documents and examples for whichlang
Provides:       python3-whichlang-doc

%description help
# whichlang

whichlang is a Python library for identifying the language of a given text.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install whichlang.

```bash
pip install whichlang
```

## Usage

```python
from whichlang import whichlang as wl

# Read the sample text; UTF-8 is assumed for the Hindi sample file
with open('sample-test-files\\sample-hindi.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# returns tuple of top 3 probable languages, first one being most probable language
print(wl.which_lang(data))
# Output: ('Hindi', 'Marathi', 'Punjabi'), so Hindi is most probable.
```

```
# For training a language model
# assamese.txt is train data
# Assamese is the language model created
python train_lang_models.py -f train-data\as\assamese.txt -l Assamese
```

## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## License

[MIT](https://choosealicense.com/licenses/mit/)

## Available Languages

Hindi, Telugu, Tamil, Kannada, Malayalam, Punjabi, Marathi, Gujarati, Oriya, Assamese.

## Acknowledgements

1. We would like to thank the [Leipzig Corpora Collection](https://corpora.uni-leipzig.de/en), where we collected data for training models.
   Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.
2. whichlang is based on N-gram-based text categorization:
   Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175. 1994.
   The same approach was used in the library [langdetect](https://github.com/fedelopez77/langdetect). We found this approach quite effective and wanted to explore it for Indian languages. In whichlang, we train, optimize and make models readily available for Indian languages, since these languages have been less explored.

%prep
%autosetup -n whichlang-0.0.4

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-whichlang -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Jun 20 2023 Python_Bot - 0.0.4-1
- Package Spec generated