-rw-r--r--  .gitignore                          1
-rw-r--r--  python-wiktionary-de-parser.spec  335
-rw-r--r--  sources                             1
3 files changed, 337 insertions, 0 deletions
@@ -0,0 +1 @@
+/wiktionary-de-parser-0.9.5.tar.gz
diff --git a/python-wiktionary-de-parser.spec b/python-wiktionary-de-parser.spec
new file mode 100644
index 0000000..83ab00c
--- /dev/null
+++ b/python-wiktionary-de-parser.spec
@@ -0,0 +1,335 @@
+%global _empty_manifest_terminate_build 0
+Name: python-wiktionary-de-parser
+Version: 0.9.5
+Release: 1
+Summary: Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods
+License: MIT
+URL: https://github.com/gambolputty/wiktionary-de-parser
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d6/e6/d91d18aff8de3b01402413043ea9a53c83ea83d36bed6ba6c47f37be6ab8/wiktionary-de-parser-0.9.5.tar.gz
+BuildArch: noarch
+
+Requires: python3-lxml
+Requires: python3-mwparserfromhell
+
+%description
+# wiktionary-de-parser
+
+This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
+
+## Installation
+
+`pip install wiktionary-de-parser`
+
+## Features
+
+- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
+- Allows you to add your own extraction methods (pass them as argument)
+- Yields per section, not per page (a word can have multiple meanings --> multiple sections of a Wiktionary page)
+
+## Usage
+
+```python
+from bz2 import BZ2File
+from wiktionary_de_parser import Parser
+
+bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
+bz_file = BZ2File(bzfile_path)
+
+for record in Parser(bz_file):
+    if 'lang_code' not in record or record['lang_code'] != 'de':
+        continue
+    # do stuff with 'record'
+```
+
+Note: In this example we load a compressed Wiktionary dump file that was [obtained from here](https://dumps.wikimedia.org/dewiktionary/latest).
+
+### Adding new extraction methods
+
+An extraction method takes the following arguments:
+
+- `title` (_string_): The title of the current Wiktionary page
+- `text` (_string_): The [Wikitext](https://en.wikipedia.org/wiki/Wiki#Editing) of the current word entry/section
+- `current_record` (_Dict_): A dictionary with all values of the current iteration (e.g. `current_record['lang_code']`)
+
+It must return a `Dict` with the results or `False` if the record was processed unsuccessfully.
+
+```python
+# Create a new extraction method
+def my_method(title, text, current_record):
+    # do stuff
+    return {'my_field': my_data} if my_data else False
+
+# Pass a list with all extraction methods to the class constructor:
+for record in Parser(bz_file, custom_methods=[my_method]):
+    print(record['my_field'])
+```
+
+## Output
+Example output for the word "Abend":
+```python
+{'flexion': {'Akkusativ Plural': 'Abende',
+             'Akkusativ Singular': 'Abend',
+             'Dativ Plural': 'Abenden',
+             'Dativ Singular': 'Abend',
+             'Genitiv Plural': 'Abende',
+             'Genitiv Singular': 'Abends',
+             'Genus': 'm',
+             'Nominativ Plural': 'Abende',
+             'Nominativ Singular': 'Abend'},
+ 'inflected': False,
+ 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
+ 'lang': 'Deutsch',
+ 'lang_code': 'de',
+ 'lemma': 'Abend',
+ 'pos': {'Substantiv': []},
+ 'rhymes': ['aːbn̩t'],
+ 'syllables': ['Abend'],
+ 'title': 'Abend'}
+```
+
+## Development
+This project uses [Poetry](https://python-poetry.org/).
+
+1. Install [Poetry](https://python-poetry.org/).
+2. Clone this repository
+3. Run `poetry install` inside of the project folder to install dependencies.
+4. Change `wiktionary_de_parser/run.py` to your needs.
+5. Run `poetry run python wiktionary_de_parser/run.py` to run the parser. Or `poetry run pytest` to run tests.
+
+## License
+
+[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt
+
+
+%package -n python3-wiktionary-de-parser
+Summary: Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods
+Provides: python-wiktionary-de-parser
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-wiktionary-de-parser
+# wiktionary-de-parser
+
+This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
+
+## Installation
+
+`pip install wiktionary-de-parser`
+
+## Features
+
+- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
+- Allows you to add your own extraction methods (pass them as argument)
+- Yields per section, not per page (a word can have multiple meanings --> multiple sections of a Wiktionary page)
+
+## Usage
+
+```python
+from bz2 import BZ2File
+from wiktionary_de_parser import Parser
+
+bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
+bz_file = BZ2File(bzfile_path)
+
+for record in Parser(bz_file):
+    if 'lang_code' not in record or record['lang_code'] != 'de':
+        continue
+    # do stuff with 'record'
+```
+
+Note: In this example we load a compressed Wiktionary dump file that was [obtained from here](https://dumps.wikimedia.org/dewiktionary/latest).
+
+### Adding new extraction methods
+
+An extraction method takes the following arguments:
+
+- `title` (_string_): The title of the current Wiktionary page
+- `text` (_string_): The [Wikitext](https://en.wikipedia.org/wiki/Wiki#Editing) of the current word entry/section
+- `current_record` (_Dict_): A dictionary with all values of the current iteration (e.g. `current_record['lang_code']`)
+
+It must return a `Dict` with the results or `False` if the record was processed unsuccessfully.
+
+```python
+# Create a new extraction method
+def my_method(title, text, current_record):
+    # do stuff
+    return {'my_field': my_data} if my_data else False
+
+# Pass a list with all extraction methods to the class constructor:
+for record in Parser(bz_file, custom_methods=[my_method]):
+    print(record['my_field'])
+```
+
+## Output
+Example output for the word "Abend":
+```python
+{'flexion': {'Akkusativ Plural': 'Abende',
+             'Akkusativ Singular': 'Abend',
+             'Dativ Plural': 'Abenden',
+             'Dativ Singular': 'Abend',
+             'Genitiv Plural': 'Abende',
+             'Genitiv Singular': 'Abends',
+             'Genus': 'm',
+             'Nominativ Plural': 'Abende',
+             'Nominativ Singular': 'Abend'},
+ 'inflected': False,
+ 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
+ 'lang': 'Deutsch',
+ 'lang_code': 'de',
+ 'lemma': 'Abend',
+ 'pos': {'Substantiv': []},
+ 'rhymes': ['aːbn̩t'],
+ 'syllables': ['Abend'],
+ 'title': 'Abend'}
+```
+
+## Development
+This project uses [Poetry](https://python-poetry.org/).
+
+1. Install [Poetry](https://python-poetry.org/).
+2. Clone this repository
+3. Run `poetry install` inside of the project folder to install dependencies.
+4. Change `wiktionary_de_parser/run.py` to your needs.
+5. Run `poetry run python wiktionary_de_parser/run.py` to run the parser. Or `poetry run pytest` to run tests.
+
+## License
+
+[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt
+
+
+%package help
+Summary: Development documents and examples for wiktionary-de-parser
+Provides: python3-wiktionary-de-parser-doc
+%description help
+# wiktionary-de-parser
+
+This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.
+
+## Installation
+
+`pip install wiktionary-de-parser`
+
+## Features
+
+- Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
+- Allows you to add your own extraction methods (pass them as argument)
+- Yields per section, not per page (a word can have multiple meanings --> multiple sections of a Wiktionary page)
+
+## Usage
+
+```python
+from bz2 import BZ2File
+from wiktionary_de_parser import Parser
+
+bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
+bz_file = BZ2File(bzfile_path)
+
+for record in Parser(bz_file):
+    if 'lang_code' not in record or record['lang_code'] != 'de':
+        continue
+    # do stuff with 'record'
+```
+
+Note: In this example we load a compressed Wiktionary dump file that was [obtained from here](https://dumps.wikimedia.org/dewiktionary/latest).
+
+### Adding new extraction methods
+
+An extraction method takes the following arguments:
+
+- `title` (_string_): The title of the current Wiktionary page
+- `text` (_string_): The [Wikitext](https://en.wikipedia.org/wiki/Wiki#Editing) of the current word entry/section
+- `current_record` (_Dict_): A dictionary with all values of the current iteration (e.g. `current_record['lang_code']`)
+
+It must return a `Dict` with the results or `False` if the record was processed unsuccessfully.
+
+```python
+# Create a new extraction method
+def my_method(title, text, current_record):
+    # do stuff
+    return {'my_field': my_data} if my_data else False
+
+# Pass a list with all extraction methods to the class constructor:
+for record in Parser(bz_file, custom_methods=[my_method]):
+    print(record['my_field'])
+```
+
+## Output
+Example output for the word "Abend":
+```python
+{'flexion': {'Akkusativ Plural': 'Abende',
+             'Akkusativ Singular': 'Abend',
+             'Dativ Plural': 'Abenden',
+             'Dativ Singular': 'Abend',
+             'Genitiv Plural': 'Abende',
+             'Genitiv Singular': 'Abends',
+             'Genus': 'm',
+             'Nominativ Plural': 'Abende',
+             'Nominativ Singular': 'Abend'},
+ 'inflected': False,
+ 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
+ 'lang': 'Deutsch',
+ 'lang_code': 'de',
+ 'lemma': 'Abend',
+ 'pos': {'Substantiv': []},
+ 'rhymes': ['aːbn̩t'],
+ 'syllables': ['Abend'],
+ 'title': 'Abend'}
+```
+
+## Development
+This project uses [Poetry](https://python-poetry.org/).
+
+1. Install [Poetry](https://python-poetry.org/).
+2. Clone this repository
+3. Run `poetry install` inside of the project folder to install dependencies.
+4. Change `wiktionary_de_parser/run.py` to your needs.
+5. Run `poetry run python wiktionary_de_parser/run.py` to run the parser. Or `poetry run pytest` to run tests.
+
+## License
+
+[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt
+
+
+%prep
+%autosetup -n wiktionary-de-parser-0.9.5
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-wiktionary-de-parser -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 17 2023 Python_Bot <Python_Bot@openeuler.org> - 0.9.5-1
+- Package Spec generated
@@ -0,0 +1 @@
+cab9a30d254e65ef861ca91ef2a08a93 wiktionary-de-parser-0.9.5.tar.gz
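
For reference, the custom extraction method interface described in the packaged `%description` text (a callable taking `title`, `text` and `current_record`, returning a dict or `False`) can be illustrated with a small sketch. The function name `extract_link_targets`, the `link_targets` field and the sample text are hypothetical and are not part of wiktionary-de-parser; the function is exercised on its own here so that no dump file is needed.

```python
import re

# Hypothetical extraction method, not part of wiktionary-de-parser itself.
# It follows the (title, text, current_record) signature from the README
# and returns a dict on success or False otherwise.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")


def extract_link_targets(title, text, current_record):
    """Collect targets of internal wikilinks such as [[Tag]] or [[Tag|Tage]]."""
    targets = []
    for match in WIKILINK_RE.finditer(text):
        target = match.group(1).strip()
        # Keep the first occurrence of each target, preserving order.
        if target and target not in targets:
            targets.append(target)
    return {"link_targets": targets} if targets else False


# Made-up sample text, only to exercise the function without a dump file.
sample_text = "Der [[Abend]] folgt auf den [[Tag|Tage]]; siehe auch [[Abend]]."
result = extract_link_targets("Abend", sample_text, {"lang_code": "de"})
print(result)
```

Following the README, such a function would presumably be passed to the parser as `Parser(bz_file, custom_methods=[extract_link_targets])`, after which each yielded record would be expected to carry the extra field when links were found.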
