python-pyinflect.spec



%global _empty_manifest_terminate_build 0
Name:		python-pyinflect
Version:	0.5.1
Release:	1
Summary:	A Python module for word inflections designed for use with spaCy
License:	MIT License
URL:		https://github.com/bjascob/pyinflect
Source0:	https://mirrors.aliyun.com/pypi/web/packages/4e/6b/2b4857746fe3362258b2842184ae0ad11bc1259ae4bc0ed49d0ea6b22137/pyinflect-0.5.1.tar.gz
BuildArch:	noarch


%description
# pyinflect
**A python module for word inflections that works as a spaCy extension**.

--> Note that a more sophisticated system now exists in **[LemmInflect](https://github.com/bjascob/LemmInflect)** which includes both lemmatization and inflection, along with more advanced methods for word form disambiguation.  You might want to try that module first if you're looking for top performance.

This module is designed as an extension for **[spaCy](https://github.com/explosion/spaCy)** and returns the inflected form of a word based on a supplied Penn Treebank part-of-speech tag.  It can also be used as a standalone module outside of spaCy.  It is based on the **[Automatically Generated Inflection Database (AGID)](http://wordlist.aspell.net/other)**, which provides a list of inflections for various word lemmas.  See the `scripts` directory for utilities that make good examples, or the `tests` directory for unit tests and examples.

## Installation
```
pip3 install pyinflect
```

## Usage as an Extension to spaCy
To use with spaCy, you need spaCy version 2.0 or later.  Versions 1.9 and earlier do not support the extension methods used here.

To use as an extension to spaCy, first import the module.  This creates a new `inflect` method for each spaCy `Token` that takes a Penn Treebank tag as its parameter.  The method returns the inflected form of the token's lemma based on the supplied Treebank tag.
```
> import spacy
> import pyinflect
> nlp = spacy.load('en_core_web_sm')
> tokens = nlp('This is an example of xxtest.')
> tokens[3]._.inflect('NNS')
examples
```
When more than one spelling/form exists for the given tag, an optional form number can be supplied; otherwise the first one is returned.
```
> tokens[1]._.inflect('VBD', form_num=0)
was
> tokens[1]._.inflect('VBD', form_num=1)
were
```
When the lemma you wish to inflect is not in the lookup dictionary, the method returns `None`.  The optional parameter `inflect_oov` can be used to inflect the word using regular inflection rules.  In this case `form_num=0` selects the "regular" inflection and `form_num=1` selects the "doubled" version for verbs and adj/adv or the "Greco-Latin" for nouns.
```
> tokens[5]._.inflect('VBG', inflect_oov=True)
xxtesting
> tokens[5]._.inflect('VBG', inflect_oov=True, form_num=1)
xxtestting
```
You will need to decide for yourself which `form_num` to use.  Basic helper functions in `pyinflect.InflectionRules` can guess whether the lemma uses "doubling" or "Greco-Latin" style rules.
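
As a rough illustration of the "regular" and "doubled" candidates described above, here is a minimal standalone sketch.  It is not pyinflect's implementation: the function name `oov_vbg_forms` is hypothetical, the rules are deliberately simplified, and the lemma is assumed to be a non-empty string.

```python
def oov_vbg_forms(lemma):
    """Return (regular, doubled) candidate gerunds for an unknown lemma.

    Simplified illustration only; real inflection rules have more cases.
    """
    # Regular form: drop a single silent final 'e' (make -> making),
    # but keep 'ee' endings intact (see -> seeing).
    if lemma.endswith('e') and not lemma.endswith('ee'):
        regular = lemma[:-1] + 'ing'
    else:
        regular = lemma + 'ing'
    # Doubled form: repeat the final letter before the suffix (stop -> stopping).
    doubled = lemma + lemma[-1] + 'ing'
    return (regular, doubled)


print(oov_vbg_forms('xxtest'))  # ('xxtesting', 'xxtestting')
print(oov_vbg_forms('stop'))    # ('stoping', 'stopping')
```

Index 0 of the returned tuple plays the role of `form_num=0` and index 1 the role of `form_num=1`.  The `stop` case shows why the caller has to choose: only the doubled form is a real word there.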


## Usage Standalone
To use standalone, import the methods `getAllInflections` and/or `getInflection` and call them directly.  `getAllInflections` returns all entries for a lemma in the `infl.csv` file as a dictionary keyed by Treebank tag, where each value is a tuple with one or more spellings/forms for that tag.  The optional parameter `pos_type` (which is V, A or N) can be used to limit the returned data to specific parts of speech.  The method `getInflection` takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with it.
```
> from pyinflect import getAllInflections, getInflection
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches',), 'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}

> getAllInflections('watch', pos_type='V')
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}

> getInflection('watch', tag='VBD')
('watched',)
```
The method `getInflection` takes the parameter `inflect_oov` and uses it similarly to what is described above with spaCy.
```
> getInflection('xxtest', 'VBG', inflect_oov=True)
('xxtesting', 'xxtestting')
```

## Issues:
If you find a bug, please report it on the **[GitHub issues list](https://github.com/bjascob/pyInflect/issues)**.  However, be aware that when it comes to returning the correct inflection, a number of different issues can arise, and some of these are not readily fixable.  Issues with inflected forms include...
* Multiple spellings for an inflection (e.g. arthroplasties, arthroplastyes or arthroplastys)
* Mass form and plural types (e.g. people vs peoples)
* Forms that depend on context (e.g. further vs farther)
* Inflections that are not fully specified by the tag (e.g. be/VBD can be "was" or "were")
* Incorrect lemmatization from spaCy (e.g. hating -> hat)
* Incorrect tagging (e.g. VBN vs. VBD)
* Errors in the AGID database

To ensure that pyInflect returns the most commonly used inflected form/spelling for a given tag, a corpus technique is used.  In `scripts/12_CreateOverridesList.py`, words are lemmatized and tagged with spaCy, then re-inflected with pyInflect.  When the original corpus word differs from pyInflect's output, the most commonly seen form is written to the `overrides.csv` file.  This technique can also help overcome lemmatization and tagging issues from spaCy and errors in the AGID database.  The file `CorpMultiInfls.txt` is a list of inflections/tags that came from multiple words in the corpus and thus may be problematic.
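
The corpus technique can be sketched roughly as follows.  This is an illustrative approximation, not the code in `scripts/12_CreateOverridesList.py`: the names `build_overrides` and `toy_inflect` and the hand-written observations are all hypothetical, and the real script may differ in its details.

```python
from collections import Counter, defaultdict


def build_overrides(observations, inflect):
    """Map (lemma, tag) to the most common corpus form when it differs
    from what the base inflector returns."""
    seen = defaultdict(Counter)
    for lemma, tag, word in observations:
        seen[(lemma, tag)][word] += 1

    overrides = {}
    for (lemma, tag), counts in seen.items():
        most_common_word = counts.most_common(1)[0][0]
        if most_common_word != inflect(lemma, tag):
            overrides[(lemma, tag)] = most_common_word
    return overrides


def toy_inflect(lemma, tag):
    """Stand-in for a dictionary lookup (hypothetical data)."""
    table = {('people', 'NNS'): 'peoples', ('watch', 'VBD'): 'watched'}
    return table.get((lemma, tag))


# (lemma, tag, word as seen in the corpus); hand-written for illustration
observations = [
    ('people', 'NNS', 'people'),
    ('people', 'NNS', 'people'),
    ('people', 'NNS', 'peoples'),
    ('watch', 'VBD', 'watched'),
]
print(build_overrides(observations, toy_inflect))
# {('people', 'NNS'): 'people'}
```

Only entries where the corpus disagrees with the base lookup end up in the result, which keeps the override list small.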

One common issue is that some forms of the verb "be" are not completely specified by the Treebank tag.  For instance, be/VBD inflects to either "was" or "were" and be/VBP inflects to either "am" or "are".  When the inflected form is ambiguous, the first form is returned by default.  Setting `form_num` in the spaCy inflection method allows returning the other form(s).
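
The selection logic for ambiguous forms can be pictured with a small standalone sketch.  The `BE_FORMS` table and the `select_form` helper are hypothetical and written by hand for illustration; in particular, returning `None` for an out-of-range `form_num` is an assumption of this sketch, not documented pyinflect behavior.

```python
# Hand-written table in the same shape as the tuples described above.
BE_FORMS = {'VBD': ('was', 'were'), 'VBP': ('am', 'are')}


def select_form(forms_by_tag, tag, form_num=0):
    """Return the form at index form_num for the tag, or None if unavailable."""
    forms = forms_by_tag.get(tag)
    if not forms or form_num >= len(forms):
        return None  # assumption of this sketch
    return forms[form_num]


print(select_form(BE_FORMS, 'VBD'))     # was
print(select_form(BE_FORMS, 'VBD', 1))  # were
```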

Note that the AGID data is created by a third party and not maintained here.  Some lemmas are not in that data file, `infl.csv`, and thus cannot be inflected using the dictionary methods.  In some cases the AGID may not contain the best inflection of a word.  For instance, the lemma "people" with tag "NNS" will return "peoples" (pre-overrides) where you may want the word "people", which is also plural.


## Tags:
The module determines the inflection(s) returned by either a `pos_type` or a Penn Treebank `tag`.  The `pos_type` is either 'V', 'A' or 'N' for 'Verb', 'Adjective'/'Adverb' or 'Noun' respectively.  A list of Treebank tags can be found **[here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)**.  Not all of these are used by pyinflect.  The following is a list of the various types and tags used...

    pos_type = 'A'
    * JJ      Adjective
    * JJR     Adjective, comparative
    * JJS     Adjective, superlative
    * RB      Adverb
    * RBR     Adverb, comparative
    * RBS     Adverb, superlative

    pos_type = 'N'
    * NN      Noun, singular or mass
    * NNS     Noun, plural

    pos_type = 'V'
    * VB      Verb, base form
    * VBD     Verb, past tense
    * VBG     Verb, gerund or present participle
    * VBN     Verb, past participle
    * VBP     Verb, non-3rd person singular present
    * VBZ     Verb, 3rd person singular present
    * MD      Modal
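
The grouping above can be written down as a plain lookup table.  The sketch below is illustrative only: `POS_TYPE_TAGS` and `filter_by_pos_type` are hypothetical names, not part of pyinflect, and the `watch` dictionary is copied from the standalone example output.

```python
# pos_type -> Treebank tags, transcribed from the list above.
POS_TYPE_TAGS = {
    'A': ('JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS'),
    'N': ('NN', 'NNS'),
    'V': ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'MD'),
}


def filter_by_pos_type(inflections, pos_type):
    """Keep only the entries whose tag belongs to the given pos_type."""
    allowed = POS_TYPE_TAGS[pos_type]
    return {tag: forms for tag, forms in inflections.items() if tag in allowed}


watch = {'NN': ('watch',), 'NNS': ('watches',), 'VB': ('watch',),
         'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',),
         'VBG': ('watching',), 'VBZ': ('watches',)}

print(filter_by_pos_type(watch, 'N'))
# {'NN': ('watch',), 'NNS': ('watches',)}
```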


%package -n python3-pyinflect
Summary:	A Python module for word inflections designed for use with spaCy
Provides:	python-pyinflect
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-pyinflect
# pyinflect
**A python module for word inflections that works as a spaCy extension**.

--> Note that a more sophisticated system now exists in **[LemmInflect](https://github.com/bjascob/LemmInflect)** which includes both lemmatization and inflection, along with more advanced methods for word form disambiguation.  You might want to try that module first if you're looking for top performance.

This module is designed as an extension for **[spaCy](https://github.com/explosion/spaCy)** and returns the inflected form of a word based on a supplied Penn Treebank part-of-speech tag.  It can also be used as a standalone module outside of spaCy.  It is based on the **[Automatically Generated Inflection Database (AGID)](http://wordlist.aspell.net/other)**, which provides a list of inflections for various word lemmas.  See the `scripts` directory for utilities that make good examples, or the `tests` directory for unit tests and examples.

## Installation
```
pip3 install pyinflect
```

## Usage as an Extension to spaCy
To use with spaCy, you need spaCy version 2.0 or later.  Versions 1.9 and earlier do not support the extension methods used here.

To use as an extension to spaCy, first import the module.  This creates a new `inflect` method for each spaCy `Token` that takes a Penn Treebank tag as its parameter.  The method returns the inflected form of the token's lemma based on the supplied Treebank tag.
```
> import spacy
> import pyinflect
> nlp = spacy.load('en_core_web_sm')
> tokens = nlp('This is an example of xxtest.')
> tokens[3]._.inflect('NNS')
examples
```
When more than one spelling/form exists for the given tag, an optional form number can be supplied; otherwise the first one is returned.
```
> tokens[1]._.inflect('VBD', form_num=0)
was
> tokens[1]._.inflect('VBD', form_num=1)
were
```
When the lemma you wish to inflect is not in the lookup dictionary, the method returns `None`.  The optional parameter `inflect_oov` can be used to inflect the word using regular inflection rules.  In this case `form_num=0` selects the "regular" inflection and `form_num=1` selects the "doubled" version for verbs and adj/adv or the "Greco-Latin" for nouns.
```
> tokens[5]._.inflect('VBG', inflect_oov=True)
xxtesting
> tokens[5]._.inflect('VBG', inflect_oov=True, form_num=1)
xxtestting
```
You will need to decide for yourself which `form_num` to use.  Basic helper functions in `pyinflect.InflectionRules` can guess whether the lemma uses "doubling" or "Greco-Latin" style rules.


## Usage Standalone
To use standalone, import the methods `getAllInflections` and/or `getInflection` and call them directly.  `getAllInflections` returns all entries for a lemma in the `infl.csv` file as a dictionary keyed by Treebank tag, where each value is a tuple with one or more spellings/forms for that tag.  The optional parameter `pos_type` (which is V, A or N) can be used to limit the returned data to specific parts of speech.  The method `getInflection` takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with it.
```
> from pyinflect import getAllInflections, getInflection
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches',), 'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}

> getAllInflections('watch', pos_type='V')
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}

> getInflection('watch', tag='VBD')
('watched',)
```
The method `getInflection` takes the parameter `inflect_oov` and uses it similarly to what is described above with spaCy.
```
> getInflection('xxtest', 'VBG', inflect_oov=True)
('xxtesting', 'xxtestting')
```

## Issues:
If you find a bug, please report it on the **[GitHub issues list](https://github.com/bjascob/pyInflect/issues)**.  However, be aware that when it comes to returning the correct inflection, a number of different issues can arise, and some of these are not readily fixable.  Issues with inflected forms include...
* Multiple spellings for an inflection (e.g. arthroplasties, arthroplastyes or arthroplastys)
* Mass form and plural types (e.g. people vs peoples)
* Forms that depend on context (e.g. further vs farther)
* Inflections that are not fully specified by the tag (e.g. be/VBD can be "was" or "were")
* Incorrect lemmatization from spaCy (e.g. hating -> hat)
* Incorrect tagging (e.g. VBN vs. VBD)
* Errors in the AGID database

To ensure that pyInflect returns the most commonly used inflected form/spelling for a given tag, a corpus technique is used.  In `scripts/12_CreateOverridesList.py`, words are lemmatized and tagged with spaCy, then re-inflected with pyInflect.  When the original corpus word differs from pyInflect's output, the most commonly seen form is written to the `overrides.csv` file.  This technique can also help overcome lemmatization and tagging issues from spaCy and errors in the AGID database.  The file `CorpMultiInfls.txt` is a list of inflections/tags that came from multiple words in the corpus and thus may be problematic.

One common issue is that some forms of the verb "be" are not completely specified by the Treebank tag.  For instance, be/VBD inflects to either "was" or "were" and be/VBP inflects to either "am" or "are".  When the inflected form is ambiguous, the first form is returned by default.  Setting `form_num` in the spaCy inflection method allows returning the other form(s).

Note that the AGID data is created by a third party and not maintained here.  Some lemmas are not in that data file, `infl.csv`, and thus cannot be inflected using the dictionary methods.  In some cases the AGID may not contain the best inflection of a word.  For instance, the lemma "people" with tag "NNS" will return "peoples" (pre-overrides) where you may want the word "people", which is also plural.


## Tags:
The module determines the inflection(s) returned by either a `pos_type` or a Penn Treebank `tag`.  The `pos_type` is either 'V', 'A' or 'N' for 'Verb', 'Adjective'/'Adverb' or 'Noun' respectively.  A list of Treebank tags can be found **[here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)**.  Not all of these are used by pyinflect.  The following is a list of the various types and tags used...

    pos_type = 'A'
    * JJ      Adjective
    * JJR     Adjective, comparative
    * JJS     Adjective, superlative
    * RB      Adverb
    * RBR     Adverb, comparative
    * RBS     Adverb, superlative

    pos_type = 'N'
    * NN      Noun, singular or mass
    * NNS     Noun, plural

    pos_type = 'V'
    * VB      Verb, base form
    * VBD     Verb, past tense
    * VBG     Verb, gerund or present participle
    * VBN     Verb, past participle
    * VBP     Verb, non-3rd person singular present
    * VBZ     Verb, 3rd person singular present
    * MD      Modal


%package help
Summary:	Development documents and examples for pyinflect
Provides:	python3-pyinflect-doc
%description help
# pyinflect
**A python module for word inflections that works as a spaCy extension**.

--> Note that a more sophisticated system now exists in **[LemmInflect](https://github.com/bjascob/LemmInflect)** which includes both lemmatization and inflection, along with more advanced methods for word form disambiguation.  You might want to try that module first if you're looking for top performance.

This module is designed as an extension for **[spaCy](https://github.com/explosion/spaCy)** and returns the inflected form of a word based on a supplied Penn Treebank part-of-speech tag.  It can also be used as a standalone module outside of spaCy.  It is based on the **[Automatically Generated Inflection Database (AGID)](http://wordlist.aspell.net/other)**, which provides a list of inflections for various word lemmas.  See the `scripts` directory for utilities that make good examples, or the `tests` directory for unit tests and examples.

## Installation
```
pip3 install pyinflect
```

## Usage as an Extension to spaCy
To use with spaCy, you need spaCy version 2.0 or later.  Versions 1.9 and earlier do not support the extension methods used here.

To use as an extension to spaCy, first import the module.  This creates a new `inflect` method for each spaCy `Token` that takes a Penn Treebank tag as its parameter.  The method returns the inflected form of the token's lemma based on the supplied Treebank tag.
```
> import spacy
> import pyinflect
> nlp = spacy.load('en_core_web_sm')
> tokens = nlp('This is an example of xxtest.')
> tokens[3]._.inflect('NNS')
examples
```
When more than one spelling/form exists for the given tag, an optional form number can be supplied; otherwise the first one is returned.
```
> tokens[1]._.inflect('VBD', form_num=0)
was
> tokens[1]._.inflect('VBD', form_num=1)
were
```
When the lemma you wish to inflect is not in the lookup dictionary, the method returns `None`.  The optional parameter `inflect_oov` can be used to inflect the word using regular inflection rules.  In this case `form_num=0` selects the "regular" inflection and `form_num=1` selects the "doubled" version for verbs and adj/adv or the "Greco-Latin" for nouns.
```
> tokens[5]._.inflect('VBG', inflect_oov=True)
xxtesting
> tokens[5]._.inflect('VBG', inflect_oov=True, form_num=1)
xxtestting
```
You will need to decide for yourself which `form_num` to use.  Basic helper functions in `pyinflect.InflectionRules` can guess whether the lemma uses "doubling" or "Greco-Latin" style rules.


## Usage Standalone
To use standalone, import the methods `getAllInflections` and/or `getInflection` and call them directly.  `getAllInflections` returns all entries for a lemma in the `infl.csv` file as a dictionary keyed by Treebank tag, where each value is a tuple with one or more spellings/forms for that tag.  The optional parameter `pos_type` (which is V, A or N) can be used to limit the returned data to specific parts of speech.  The method `getInflection` takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with it.
```
> from pyinflect import getAllInflections, getInflection
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches',), 'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}

> getAllInflections('watch', pos_type='V')
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBN': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}

> getInflection('watch', tag='VBD')
('watched',)
```
The method `getInflection` takes the parameter `inflect_oov` and uses it similarly to what is described above with spaCy.
```
> getInflection('xxtest', 'VBG', inflect_oov=True)
('xxtesting', 'xxtestting')
```

## Issues:
If you find a bug, please report it on the **[GitHub issues list](https://github.com/bjascob/pyInflect/issues)**.  However, be aware that when it comes to returning the correct inflection, a number of different issues can arise, and some of these are not readily fixable.  Issues with inflected forms include...
* Multiple spellings for an inflection (e.g. arthroplasties, arthroplastyes or arthroplastys)
* Mass form and plural types (e.g. people vs peoples)
* Forms that depend on context (e.g. further vs farther)
* Inflections that are not fully specified by the tag (e.g. be/VBD can be "was" or "were")
* Incorrect lemmatization from spaCy (e.g. hating -> hat)
* Incorrect tagging (e.g. VBN vs. VBD)
* Errors in the AGID database

To ensure that pyInflect returns the most commonly used inflected form/spelling for a given tag, a corpus technique is used.  In `scripts/12_CreateOverridesList.py`, words are lemmatized and tagged with spaCy, then re-inflected with pyInflect.  When the original corpus word differs from pyInflect's output, the most commonly seen form is written to the `overrides.csv` file.  This technique can also help overcome lemmatization and tagging issues from spaCy and errors in the AGID database.  The file `CorpMultiInfls.txt` is a list of inflections/tags that came from multiple words in the corpus and thus may be problematic.

One common issue is that some forms of the verb "be" are not completely specified by the Treebank tag.  For instance, be/VBD inflects to either "was" or "were" and be/VBP inflects to either "am" or "are".  When the inflected form is ambiguous, the first form is returned by default.  Setting `form_num` in the spaCy inflection method allows returning the other form(s).

Note that the AGID data is created by a third party and not maintained here.  Some lemmas are not in that data file, `infl.csv`, and thus cannot be inflected using the dictionary methods.  In some cases the AGID may not contain the best inflection of a word.  For instance, the lemma "people" with tag "NNS" will return "peoples" (pre-overrides) where you may want the word "people", which is also plural.


## Tags:
The module determines the inflection(s) returned by either a `pos_type` or a Penn Treebank `tag`.  The `pos_type` is either 'V', 'A' or 'N' for 'Verb', 'Adjective'/'Adverb' or 'Noun' respectively.  A list of Treebank tags can be found **[here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)**.  Not all of these are used by pyinflect.  The following is a list of the various types and tags used...

    pos_type = 'A'
    * JJ      Adjective
    * JJR     Adjective, comparative
    * JJS     Adjective, superlative
    * RB      Adverb
    * RBR     Adverb, comparative
    * RBS     Adverb, superlative

    pos_type = 'N'
    * NN      Noun, singular or mass
    * NNS     Noun, plural

    pos_type = 'V'
    * VB      Verb, base form
    * VBD     Verb, past tense
    * VBG     Verb, gerund or present participle
    * VBN     Verb, past participle
    * VBP     Verb, non-3rd person singular present
    * VBZ     Verb, 3rd person singular present
    * MD      Modal


%prep
%autosetup -n pyinflect-0.5.1

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-pyinflect -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri Jun 09 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.1-1
- Package Spec generated