%global _empty_manifest_terminate_build 0
Name:		python-unidic
Version:	1.1.0
Release:	1
Summary:	UniDic packaged for Python
License:	MIT
URL:		https://github.com/polm/unidic-py
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/5a/09/271dfbf8d5b56adddc70e30fa94249f5d3ab35f615bf278d65258045564a/unidic-1.1.0.tar.gz
BuildArch:	noarch


%description
# unidic-py

This is a version of [UniDic](https://unidic.ninjal.ac.jp/) packaged for use
with pip. 

Currently it supports 2.3.0, the latest version of UniDic. **Note this will
take up 1GB on disk after install.** If you want a small package, try
[unidic-lite](https://github.com/polm/unidic-lite).

The data for this dictionary is hosted as part of the AWS Open Data
Sponsorship Program. You can read the announcement
[here](https://aws.amazon.com/jp/blogs/news/published-unidic-mecab-on-aws-open-data/).

After installing via pip, you need to download the dictionary using the
following command:

    python -m unidic download

With [fugashi](https://github.com/polm/fugashi) or
[mecab-python3](https://github.com/samurait/mecab-python3) unidic will be used
automatically when installed, though if you want you can manually pass the
MeCab arguments:

    import fugashi
    import unidic
    tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
    # that's it!

## Differences from the Official UniDic Release

This has a few changes from the official UniDic release to make it easier to use.

- entries for 令和 have been added
- single-character numeric and alphabetic words have been deleted
- `unk.def` has been modified so unknown punctuation won't be marked as a noun

See the `extras` directory for details on how to replicate the build process.

## Fields

Here is a list of fields included in this edition of UniDic. For more information see the [UniDic FAQ](https://unidic.ninjal.ac.jp/faq#col_name), though not all fields are included. For fields in the UniDic FAQ the name given there is included.

Fields which are not applicable are usually marked with an asterisk (`*`).

- **pos1, pos2, pos3, pos4**: Part of speech fields. The earlier fields are more general, the later fields are more specific.
- **cType:** 活用型, conjugation type. Will have a value like `五段-ラ行`. 
- **cForm:** 活用形, conjugation shape. Will have a value like `連用形-促音便`.
- **lForm:** 語彙素読み, lemma reading. The reading of the lemma in katakana, this uses the same format as the `kana` field, not `pron`.
- **lemma:** 語彙素(+語彙素細分類). The lemma is a non-inflected "dictionary form" of a word. UniDic lemmas sometimes include extra info or have unusual forms, like using katakana for some place names. 
- **orth:** 書字形出現形, the word as it appears in text, this appears to be identical to the surface.
- **pron:** 発音形出現形, pronunciation. This is similar to kana except that long vowels are indicated with a ー, so 講師 is こーし. 
- **orthBase:** 書字形基本形, the uninflected form of the word using its current written form. For example, for 彷徨った the lemma is さ迷う but the orthBase is 彷徨う. 
- **pronBase:** 発音形基本形, the pronunciation of the base form. Like `pron` for the `lemma` or `orthBase`.
- **goshu:** 語種, word type. Etymological category. In order of frequency, 和, 固, 漢, 外, 混, 記号, 不明. Defined for all dictionary words, blank for unks.
- **iType:** 語頭変化型, "i" is for "initial". This is the type of initial transformation the word undergoes when combining, for example 兵 is へ半濁 because it can be read as ぺい in combination. This is available for <2% of entries.
- **iForm:** 語頭変化形, this is the initial form of the word in context, such as 基本形 or 半濁音形. 
- **fType:** 語末変化型, "f" is for "final", but otherwise as iType. For example 医学 is ク促 because it can change to いがっ (apparently). This is available for <0.1% of entries.
- **fForm:** 語末変化形, as iForm but for final transformations.
- **iConType:** 語頭変化結合型, initial change fusion type. Describes phonetic change at the start of the word in counting expressions. Only available for a few hundred entries, mostly numbers. Values are N followed by a letter or number; most entries with this value are numeric.
- **fConType:** 語末変化結合型, final change fusion type. This is also used for counting expressions, and like iConType it is only available for a few hundred entries. Unlike iConType the values are very complicated, like `B1S6SjShS,B1S6S8SjShS`. 
- **type:** Not entirely clear what this is, seems to have some overlap with POS. 
- **kana:** 読みがな, this is the typical representation of a word in kana, unlike pron. 講師 is こうし.
- **kanaBase:** 仮名形基本形, this is the typical kana representation of the lemma.
- **form:** 語形出現形, seems to be the same as `pron`.
- **formBase:** 語形基本形 seems to be the same as `pronBase`.
- **aType:** Accent type. This is a (potentially) comma-separated field which has the number of the mora taking the accent in 標準語 (standard language). When there are multiple values, more common accent patterns come first.
- **aConType:** This describes how the accent shifts when the word is used in a counter expression. It uses complicated notation.
- **aModType:** Presumably accent related but unclear use. Available for <25% of entries and only has 6 non-default values.
- **lid:** 語彙表ID. A long lemma ID. This seems to be a kind of GUID. There is usually one entry per line in the CSV, except that half-width and full-width variations can be combined.
- **lemma_id:** 語彙素ID. A shorter lemma id, starting from 1. This seems to be as unique as the `lemma` field, so many CSV lines can share this value.
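
As a rough illustration of the field list above, here is a minimal sketch that splits one comma-separated feature string into named fields. It assumes the columns arrive in the same order as the list above; the names `FIELD_NAMES`, `UnidicFeature`, `parse_feature` and `accent_positions` are hypothetical helpers, not part of unidic, and the sample line is hand-written for illustration rather than real dictionary output.

```python
import csv
from collections import namedtuple

# Field names in the order they are listed above. That this matches the
# column order of the feature string is an assumption of this sketch.
FIELD_NAMES = (
    "pos1 pos2 pos3 pos4 cType cForm lForm lemma orth pron orthBase "
    "pronBase goshu iType iForm fType fForm iConType fConType type kana "
    "kanaBase form formBase aType aConType aModType lid lemma_id"
).split()

UnidicFeature = namedtuple("UnidicFeature", FIELD_NAMES)


def parse_feature(feature):
    """Split one comma-separated feature string into named fields."""
    # csv.reader keeps quoted values that contain commas (for example a
    # multi-valued aType or fConType) together as a single field.
    row = next(csv.reader([feature]))
    if len(row) != len(FIELD_NAMES):
        raise ValueError(
            "expected {} fields, got {}".format(len(FIELD_NAMES), len(row))
        )
    return UnidicFeature(*row)


def accent_positions(a_type):
    """Turn an aType value into a list of mora numbers.

    An asterisk (or an empty value) means the field is not applicable.
    """
    if a_type in ("", "*"):
        return []
    return [int(part) for part in a_type.split(",")]


# Hand-written sample for illustration only, NOT real dictionary output.
sample = ",".join([
    "名詞", "普通名詞", "一般", "*",      # pos1 .. pos4
    "*", "*",                             # cType, cForm
    "コウシ", "講師", "講師", "コーシ",   # lForm, lemma, orth, pron
    "講師", "コーシ", "漢",               # orthBase, pronBase, goshu
    "*", "*", "*", "*", "*", "*",         # iType .. fConType
    "体", "コウシ", "コウシ",             # type, kana, kanaBase
    "コーシ", "コーシ",                   # form, formBase
    '"0,1"', "*", "*",                    # aType, aConType, aModType
    "1", "1",                             # lid, lemma_id
])

feature = parse_feature(sample)
print(feature.lemma, feature.pron, accent_positions(feature.aType))
```

Using a CSV parser instead of a plain `split(",")` matters here because some fields, such as `aType` and `fConType`, can themselves contain commas.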

# License

The modern Japanese UniDic is available under the GPL, LGPL, or BSD license,
[see here](https://unidic.ninjal.ac.jp/download#unidic_bccwj). UniDic is
developed by [NINJAL](https://www.ninjal.ac.jp/), the National Institute for
Japanese Language and Linguistics. UniDic is copyrighted by the UniDic
Consortium and is distributed here under the terms of the [BSD
License](./LICENSE.unidic).

The code in this repository is not written or maintained by NINJAL. The code is
available under the MIT or WTFPL License, as you prefer.

%package -n python3-unidic
Summary:	UniDic packaged for Python
Provides:	python-unidic
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-unidic
# unidic-py

This is a version of [UniDic](https://unidic.ninjal.ac.jp/) packaged for use
with pip. 

Currently it supports 2.3.0, the latest version of UniDic. **Note this will
take up 1GB on disk after install.** If you want a small package, try
[unidic-lite](https://github.com/polm/unidic-lite).

The data for this dictionary is hosted as part of the AWS Open Data
Sponsorship Program. You can read the announcement
[here](https://aws.amazon.com/jp/blogs/news/published-unidic-mecab-on-aws-open-data/).

After installing via pip, you need to download the dictionary using the
following command:

    python -m unidic download

With [fugashi](https://github.com/polm/fugashi) or
[mecab-python3](https://github.com/samurait/mecab-python3) unidic will be used
automatically when installed, though if you want you can manually pass the
MeCab arguments:

    import fugashi
    import unidic
    tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
    # that's it!

## Differences from the Official UniDic Release

This has a few changes from the official UniDic release to make it easier to use.

- entries for 令和 have been added
- single-character numeric and alphabetic words have been deleted
- `unk.def` has been modified so unknown punctuation won't be marked as a noun

See the `extras` directory for details on how to replicate the build process.

## Fields

Here is a list of fields included in this edition of UniDic. For more information see the [UniDic FAQ](https://unidic.ninjal.ac.jp/faq#col_name), though not all fields are included. For fields in the UniDic FAQ the name given there is included.

Fields which are not applicable are usually marked with an asterisk (`*`).

- **pos1, pos2, pos3, pos4**: Part of speech fields. The earlier fields are more general, the later fields are more specific.
- **cType:** 活用型, conjugation type. Will have a value like `五段-ラ行`. 
- **cForm:** 活用形, conjugation shape. Will have a value like `連用形-促音便`.
- **lForm:** 語彙素読み, lemma reading. The reading of the lemma in katakana, this uses the same format as the `kana` field, not `pron`.
- **lemma:** 語彙素(+語彙素細分類). The lemma is a non-inflected "dictionary form" of a word. UniDic lemmas sometimes include extra info or have unusual forms, like using katakana for some place names. 
- **orth:** 書字形出現形, the word as it appears in text, this appears to be identical to the surface.
- **pron:** 発音形出現形, pronunciation. This is similar to kana except that long vowels are indicated with a ー, so 講師 is こーし. 
- **orthBase:** 書字形基本形, the uninflected form of the word using its current written form. For example, for 彷徨った the lemma is さ迷う but the orthBase is 彷徨う. 
- **pronBase:** 発音形基本形, the pronunciation of the base form. Like `pron` for the `lemma` or `orthBase`.
- **goshu:** 語種, word type. Etymological category. In order of frequency, 和, 固, 漢, 外, 混, 記号, 不明. Defined for all dictionary words, blank for unks.
- **iType:** 語頭変化型, "i" is for "initial". This is the type of initial transformation the word undergoes when combining, for example 兵 is へ半濁 because it can be read as ぺい in combination. This is available for <2% of entries.
- **iForm:** 語頭変化形, this is the initial form of the word in context, such as 基本形 or 半濁音形. 
- **fType:** 語末変化型, "f" is for "final", but otherwise as iType. For example 医学 is ク促 because it can change to いがっ (apparently). This is available for <0.1% of entries.
- **fForm:** 語末変化形, as iForm but for final transformations.
- **iConType:** 語頭変化結合型, initial change fusion type. Describes phonetic change at the start of the word in counting expressions. Only available for a few hundred entries, mostly numbers. Values are N followed by a letter or number; most entries with this value are numeric.
- **fConType:** 語末変化結合型, final change fusion type. This is also used for counting expressions, and like iConType it is only available for a few hundred entries. Unlike iConType the values are very complicated, like `B1S6SjShS,B1S6S8SjShS`. 
- **type:** Not entirely clear what this is, seems to have some overlap with POS. 
- **kana:** 読みがな, this is the typical representation of a word in kana, unlike pron. 講師 is こうし.
- **kanaBase:** 仮名形基本形, this is the typical kana representation of the lemma.
- **form:** 語形出現形, seems to be the same as `pron`.
- **formBase:** 語形基本形 seems to be the same as `pronBase`.
- **aType:** Accent type. This is a (potentially) comma-separated field which has the number of the mora taking the accent in 標準語 (standard language). When there are multiple values, more common accent patterns come first.
- **aConType:** This describes how the accent shifts when the word is used in a counter expression. It uses complicated notation.
- **aModType:** Presumably accent related but unclear use. Available for <25% of entries and only has 6 non-default values.
- **lid:** 語彙表ID. A long lemma ID. This seems to be a kind of GUID. There is usually one entry per line in the CSV, except that half-width and full-width variations can be combined.
- **lemma_id:** 語彙素ID. A shorter lemma id, starting from 1. This seems to be as unique as the `lemma` field, so many CSV lines can share this value.

# License

The modern Japanese UniDic is available under the GPL, LGPL, or BSD license,
[see here](https://unidic.ninjal.ac.jp/download#unidic_bccwj). UniDic is
developed by [NINJAL](https://www.ninjal.ac.jp/), the National Institute for
Japanese Language and Linguistics. UniDic is copyrighted by the UniDic
Consortium and is distributed here under the terms of the [BSD
License](./LICENSE.unidic).

The code in this repository is not written or maintained by NINJAL. The code is
available under the MIT or WTFPL License, as you prefer.

%package help
Summary:	Development documents and examples for unidic
Provides:	python3-unidic-doc
%description help
# unidic-py

This is a version of [UniDic](https://unidic.ninjal.ac.jp/) packaged for use
with pip. 

Currently it supports 2.3.0, the latest version of UniDic. **Note this will
take up 1GB on disk after install.** If you want a small package, try
[unidic-lite](https://github.com/polm/unidic-lite).

The data for this dictionary is hosted as part of the AWS Open Data
Sponsorship Program. You can read the announcement
[here](https://aws.amazon.com/jp/blogs/news/published-unidic-mecab-on-aws-open-data/).

After installing via pip, you need to download the dictionary using the
following command:

    python -m unidic download

With [fugashi](https://github.com/polm/fugashi) or
[mecab-python3](https://github.com/samurait/mecab-python3) unidic will be used
automatically when installed, though if you want you can manually pass the
MeCab arguments:

    import fugashi
    import unidic
    tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
    # that's it!

## Differences from the Official UniDic Release

This has a few changes from the official UniDic release to make it easier to use.

- entries for 令和 have been added
- single-character numeric and alphabetic words have been deleted
- `unk.def` has been modified so unknown punctuation won't be marked as a noun

See the `extras` directory for details on how to replicate the build process.

## Fields

Here is a list of fields included in this edition of UniDic. For more information see the [UniDic FAQ](https://unidic.ninjal.ac.jp/faq#col_name), though not all fields are included. For fields in the UniDic FAQ the name given there is included.

Fields which are not applicable are usually marked with an asterisk (`*`).

- **pos1, pos2, pos3, pos4**: Part of speech fields. The earlier fields are more general, the later fields are more specific.
- **cType:** 活用型, conjugation type. Will have a value like `五段-ラ行`. 
- **cForm:** 活用形, conjugation shape. Will have a value like `連用形-促音便`.
- **lForm:** 語彙素読み, lemma reading. The reading of the lemma in katakana, this uses the same format as the `kana` field, not `pron`.
- **lemma:** 語彙素(+語彙素細分類). The lemma is a non-inflected "dictionary form" of a word. UniDic lemmas sometimes include extra info or have unusual forms, like using katakana for some place names. 
- **orth:** 書字形出現形, the word as it appears in text, this appears to be identical to the surface.
- **pron:** 発音形出現形, pronunciation. This is similar to kana except that long vowels are indicated with a ー, so 講師 is こーし. 
- **orthBase:** 書字形基本形, the uninflected form of the word using its current written form. For example, for 彷徨った the lemma is さ迷う but the orthBase is 彷徨う. 
- **pronBase:** 発音形基本形, the pronunciation of the base form. Like `pron` for the `lemma` or `orthBase`.
- **goshu:** 語種, word type. Etymological category. In order of frequency, 和, 固, 漢, 外, 混, 記号, 不明. Defined for all dictionary words, blank for unks.
- **iType:** 語頭変化型, "i" is for "initial". This is the type of initial transformation the word undergoes when combining, for example 兵 is へ半濁 because it can be read as ぺい in combination. This is available for <2% of entries.
- **iForm:** 語頭変化形, this is the initial form of the word in context, such as 基本形 or 半濁音形. 
- **fType:** 語末変化型, "f" is for "final", but otherwise as iType. For example 医学 is ク促 because it can change to いがっ (apparently). This is available for <0.1% of entries.
- **fForm:** 語末変化形, as iForm but for final transformations.
- **iConType:** 語頭変化結合型, initial change fusion type. Describes phonetic change at the start of the word in counting expressions. Only available for a few hundred entries, mostly numbers. Values are N followed by a letter or number; most entries with this value are numeric.
- **fConType:** 語末変化結合型, final change fusion type. This is also used for counting expressions, and like iConType it is only available for a few hundred entries. Unlike iConType the values are very complicated, like `B1S6SjShS,B1S6S8SjShS`. 
- **type:** Not entirely clear what this is, seems to have some overlap with POS. 
- **kana:** 読みがな, this is the typical representation of a word in kana, unlike pron. 講師 is こうし.
- **kanaBase:** 仮名形基本形, this is the typical kana representation of the lemma.
- **form:** 語形出現形, seems to be the same as `pron`.
- **formBase:** 語形基本形 seems to be the same as `pronBase`.
- **aType:** Accent type. This is a (potentially) comma-separated field which has the number of the mora taking the accent in 標準語 (standard language). When there are multiple values, more common accent patterns come first.
- **aConType:** This describes how the accent shifts when the word is used in a counter expression. It uses complicated notation.
- **aModType:** Presumably accent related but unclear use. Available for <25% of entries and only has 6 non-default values.
- **lid:** 語彙表ID. A long lemma ID. This seems to be a kind of GUID. There is usually one entry per line in the CSV, except that half-width and full-width variations can be combined.
- **lemma_id:** 語彙素ID. A shorter lemma id, starting from 1. This seems to be as unique as the `lemma` field, so many CSV lines can share this value.

# License

The modern Japanese UniDic is available under the GPL, LGPL, or BSD license,
[see here](https://unidic.ninjal.ac.jp/download#unidic_bccwj). UniDic is
developed by [NINJAL](https://www.ninjal.ac.jp/), the National Institute for
Japanese Language and Linguistics. UniDic is copyrighted by the UniDic
Consortium and is distributed here under the terms of the [BSD
License](./LICENSE.unidic).

The code in this repository is not written or maintained by NINJAL. The code is
available under the MIT or WTFPL License, as you prefer.

%prep
%autosetup -n unidic-1.1.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-unidic -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Sun Apr 23 2023 Python_Bot <Python_Bot@openeuler.org> - 1.1.0-1
- Package Spec generated