automatic import of python-gruut (openeuler20.03)

author: CoprDistGit <infra@openeuler.org> 2023-05-05 09:59:27 +0000
committer: CoprDistGit <infra@openeuler.org> 2023-05-05 09:59:27 +0000
commit: 43bc1851cae674a6d6b426789d6997eccbca89c1
tree: 791da4732e2f2007cb76f3c1a9d7e464db408207 /python-gruut.spec
parent: f75ccdc371218187b790ffe6d18481ede5d1880c
1 files changed, 1395 insertions, 0 deletions
diff --git a/python-gruut.spec b/python-gruut.spec
new file mode 100644
index 0000000..c172cdd
--- /dev/null
+++ b/python-gruut.spec
@@ -0,0 +1,1395 @@
+%global _empty_manifest_terminate_build 0
+Name:		python-gruut
+Version:	2.3.4
+Release:	1
+Summary:	A tokenizer, text cleaner, and phonemizer for many human languages.
+License:	MIT License
+URL:		https://github.com/rhasspy/gruut
+Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/4c/74/40e0bff02cf4daa3908c440e2111b20490c82080259f0114d0cfe07ce126/gruut-2.3.4.tar.gz
+BuildArch:	noarch
+
+
+%description
+# Gruut
+
+A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages that supports [SSML](#ssml).
+
+```python
+from gruut import sentences
+
+text = 'He wound it around the wound, saying "I read it was $10 to read."'
+
+for sent in sentences(text, lang="en-us"):
+    for word in sent:
+        if word.phonemes:
+            print(word.text, *word.phonemes)
+```
+
+which outputs:
+
+```
+He h ˈi
+wound w ˈaʊ n d
+it ˈɪ t
+around ɚ ˈaʊ n d
+the ð ə
+wound w ˈu n d
+, |
+saying s ˈeɪ ɪ ŋ
+I ˈaɪ
+read ɹ ˈɛ d
+it ˈɪ t
+was w ə z
+ten t ˈɛ n
+dollars d ˈɑ l ɚ z
+to t ə
+read ɹ ˈi d
+. ‖
+```
+
+Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
+
+A [subset of SSML](#ssml) is also supported:
+
+```python
+from gruut import sentences
+
+ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
+<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
+    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+    xml:lang="en-US">
+<s>Today at 4pm, 2/1/2000.</s>
+<s xml:lang="it">Un mese fà, 2/1/2000.</s>
+</speak>"""
+
+for sent in sentences(ssml_text, ssml=True):
+    for word in sent:
+        if word.phonemes:
+            print(sent.idx, word.lang, word.text, *word.phonemes)
+```
+
+with the output:
+
+```
+0 en-US Today t ə d ˈeɪ
+0 en-US at ˈæ t
+0 en-US four f ˈɔ ɹ
+0 en-US P p ˈi
+0 en-US M ˈɛ m
+0 en-US , |
+0 en-US February f ˈɛ b j u ˌɛ ɹ i
+0 en-US first f ˈɚ s t
+0 en-US , |
+0 en-US two t ˈu
+0 en-US thousand θ ˈaʊ z ə n d
+0 en-US . ‖
+1 it Un u n
+1 it mese ˈm e s e
+1 it fà f a
+1 it , |
+1 it due d j u
+1 it gennaio d͡ʒ e n n ˈa j o
+1 it duemila d u e ˈm i l a
+1 it . ‖
+```
+
+See [the documentation](https://rhasspy.github.io/gruut/) for more details.
+
+## Installation
+
+```sh
+pip install gruut
+```
+
+Languages besides English can be added during installation. For example, with French and Italian support:
+
+```sh
+pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
+```
+
+The extra pip repo is needed for an updated [num2words fork](https://github.com/rhasspy/num2words) that includes support for more languages.
+
+You may also [manually download language files](https://github.com/rhasspy/gruut/releases/latest) and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).
+
+gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed. Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`. 
+
+## Supported Languages
+
+gruut currently supports:
+
+* Arabic (`ar`)
+* Czech (`cs` or `cs-cz`)
+* German (`de` or `de-de`)
+* English (`en` or `en-us`)
+* Spanish (`es` or `es-es`)
+* Farsi/Persian (`fa`)
+* French (`fr` or `fr-fr`)
+* Italian (`it` or `it-it`)
+* Luxembourgish (`lb`)
+* Dutch (`nl`)
+* Russian (`ru` or `ru-ru`)
+* Swedish (`sv` or `sv-se`)
+* Swahili (`sw`)
+
+The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).
+
+## Dependencies
+
+* Python 3.7 or higher
+* Linux
+    * Tested on Debian Bullseye
+* [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
+    * Currency/number handling
+    * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
+* gruut-ipa
+    * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
+* [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
+    * Part of speech tagging and grapheme to phoneme models
+* [pydateparser](https://github.com/GLibAi/pydateparser)
+    * Date parsing for multiple languages
+
+## Numbers, Dates, and More
+
+`gruut` can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., `<s lang="...">`).
+
+The following types of expressions can be automatically expanded into words by `gruut`:
+
+* Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
+    * Relies on `Babel` for parsing and `num2words` for verbalization
+* Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
+    * Relies on `pydateparser` for parsing and both `Babel` and `num2words` for verbalization
+* Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
+    * Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
+* Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
+    * English only
+    * Relies on `num2words` for verbalization
+
+## Command-Line Usage
+
+The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE> <TEXT>` or with the `gruut` command (from `setup.py`).
+
+The `gruut` command is line-oriented, consuming text and producing [JSONL](https://jsonlines.org/).
+You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.
+
+### Plain Text
+
+Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.
+
+```sh
+echo 'This, right here, is some "RAW" text!' \
+   | gruut --language en-us \
+   | jq --raw-output '.words[].text'
+This
+,
+right
+here
+,
+is
+some
+"
+RAW
+"
+text
+!
+```
+
+More information is available in the full JSON output:
+
+```sh
+gruut --language en-us 'More  text.' | jq .
+```
+
+Output:
+
+```json
+{
+  "idx": 0,
+  "text": "More text.",
+  "text_with_ws": "More text.",
+  "text_spoken": "More text",
+  "par_idx": 0,
+  "lang": "en-us",
+  "voice": "",
+  "words": [
+    {
+      "idx": 0,
+      "text": "More",
+      "text_with_ws": "More ",
+      "leading_ws": "",
+      "training_ws": " ",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": "JJR",
+      "phonemes": [
+        "m",
+        "ˈɔ",
+        "ɹ"
+      ],
+      "is_major_break": false,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": false,
+      "is_spoken": true,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    },
+    {
+      "idx": 1,
+      "text": "text",
+      "text_with_ws": "text",
+      "leading_ws": "",
+      "training_ws": "",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": "NN",
+      "phonemes": [
+        "t",
+        "ˈɛ",
+        "k",
+        "s",
+        "t"
+      ],
+      "is_major_break": false,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": false,
+      "is_spoken": true,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    },
+    {
+      "idx": 2,
+      "text": ".",
+      "text_with_ws": ".",
+      "leading_ws": "",
+      "training_ws": "",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": null,
+      "phonemes": [
+        "‖"
+      ],
+      "is_major_break": true,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": true,
+      "is_spoken": false,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    }
+  ],
+  "pause_before_ms": 0,
+  "pause_after_ms": 0
+}
+```
+
+For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.
+
+Within each word, there is:
+
+* `idx` - zero-based index of the word in the sentence
+* `sent_idx` - zero-based index of the sentence in the input text
+* `pos` - part of speech tag (if available)
+* `phonemes` - list of [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemes for the word (if available)
+* `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
+* `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
+* `is_break` - `true` if "word" is a major or minor break
+* `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
+* `is_spoken` - `true` if not a break or punctuation
+
+See `python3 -m gruut --help` for more options.
+
+### SSML
+
+A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:
+
+* `<speak>` - wrap around SSML text
+    * `lang` - set language for document
+* `<p>` - paragraph
+    * `lang` - set language for paragraph
+* `<s>` - sentence (disables automatic sentence breaking)
+    * `lang` - set language for sentence
+* `<w>` / `<token>` - word (disables automatic tokenization)
+    * `lang` - set language for word
+    * `role` - set word role (see [word roles](#word-roles))
+* `<lang lang="...">` - set language of inner text
+* `<voice name="...">` - set voice of inner text
+* `<say-as interpret-as="">` - force interpretation of inner text
+    * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
+    * `format` - way to format text depending on `interpret-as`
+        * number - one of "cardinal", "ordinal", "digits", "year"
+        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
+* `<break time="">` - Pause for given amount of time
+    * time - seconds ("123s") or milliseconds ("123ms")
+* `<mark name="">` - User-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
+    * name - name of mark
+* `<sub alias="">` - substitute `alias` for inner text
+* `<phoneme ph="...">` - supply phonemes for inner text
+    * `ph` - phonemes for each word of inner text, separated by whitespace
+* `<lexicon id="...">` - inline or external pronunciation lexicon
+    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
+    * `uri` - if empty or missing, lexicon is inline
+    * One or more `<lexeme>` child elements with:
+        * Optional `role="..."` ([word roles](#word-roles) separated by whitespace)
+        * `<grapheme>WORD</grapheme>` - word text
+        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
+* `<lookup ref="...">` - use pronunciation lexicon for child elements
+    * `ref` - id from a `<lexicon id="...">`
+
+#### Word Roles
+
+During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
+
+For `en-us`, the following additional roles are available from the part-of-speech tagger:
+
+* `gruut:CD` - number
+* `gruut:DT` - determiner
+* `gruut:IN` - preposition or subordinating conjunction 
+* `gruut:JJ` - adjective
+* `gruut:NN` - noun
+* `gruut:PRP` - personal pronoun
+* `gruut:RB` - adverb
+* `gruut:VB` - verb
+* `gruut:VBD` - verb (past tense)
+
+#### Inline Lexicons
+
+Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
+
+For example, the following document will yield three different pronunciations for the word "tomato":
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <lexicon xml:id="test" alphabet="ipa">
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Individual phonemes are separated by whitespace -->
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+    <lexeme>
+      <grapheme role="fake-role">
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Made up pronunciation for fake word role -->
+        t ə m ˈi t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+  <lookup ref="test">
+    <w>tomato</w>
+    <w role="fake-role">tomato</w>
+  </lookup>
+</speak>
+```
+
+The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made-up pronunciation in this case).
+
+Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document: 
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <!-- No id means change all words without a lookup -->
+  <lexicon>
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+</speak>
+```
+
+This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).
+
+## Intended Audience
+
+gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).
+
+For each supported language, gruut includes:
+
+* A word pronunciation lexicon built from open source data
+    * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
+* A pre-trained grapheme-to-phoneme model for guessing word pronunciations
+
+Some languages also include:
+
+* A pre-trained part of speech tagger built from open source data:
+    * See [universal dependencies](https://universaldependencies.org/)
+
+
+
+
+%package -n python3-gruut
+Summary:	A tokenizer, text cleaner, and phonemizer for many human languages.
+Provides:	python-gruut
+BuildRequires:	python3-devel
+BuildRequires:	python3-setuptools
+BuildRequires:	python3-pip
+%description -n python3-gruut
+# Gruut
+
+A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages that supports [SSML](#ssml).
+
+```python
+from gruut import sentences
+
+text = 'He wound it around the wound, saying "I read it was $10 to read."'
+
+for sent in sentences(text, lang="en-us"):
+    for word in sent:
+        if word.phonemes:
+            print(word.text, *word.phonemes)
+```
+
+which outputs:
+
+```
+He h ˈi
+wound w ˈaʊ n d
+it ˈɪ t
+around ɚ ˈaʊ n d
+the ð ə
+wound w ˈu n d
+, |
+saying s ˈeɪ ɪ ŋ
+I ˈaɪ
+read ɹ ˈɛ d
+it ˈɪ t
+was w ə z
+ten t ˈɛ n
+dollars d ˈɑ l ɚ z
+to t ə
+read ɹ ˈi d
+. ‖
+```
+
+Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
+
+A [subset of SSML](#ssml) is also supported:
+
+```python
+from gruut import sentences
+
+ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
+<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
+    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+    xml:lang="en-US">
+<s>Today at 4pm, 2/1/2000.</s>
+<s xml:lang="it">Un mese fà, 2/1/2000.</s>
+</speak>"""
+
+for sent in sentences(ssml_text, ssml=True):
+    for word in sent:
+        if word.phonemes:
+            print(sent.idx, word.lang, word.text, *word.phonemes)
+```
+
+with the output:
+
+```
+0 en-US Today t ə d ˈeɪ
+0 en-US at ˈæ t
+0 en-US four f ˈɔ ɹ
+0 en-US P p ˈi
+0 en-US M ˈɛ m
+0 en-US , |
+0 en-US February f ˈɛ b j u ˌɛ ɹ i
+0 en-US first f ˈɚ s t
+0 en-US , |
+0 en-US two t ˈu
+0 en-US thousand θ ˈaʊ z ə n d
+0 en-US . ‖
+1 it Un u n
+1 it mese ˈm e s e
+1 it fà f a
+1 it , |
+1 it due d j u
+1 it gennaio d͡ʒ e n n ˈa j o
+1 it duemila d u e ˈm i l a
+1 it . ‖
+```
+
+See [the documentation](https://rhasspy.github.io/gruut/) for more details.
+
+## Installation
+
+```sh
+pip install gruut
+```
+
+Languages besides English can be added during installation. For example, with French and Italian support:
+
+```sh
+pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
+```
+
+The extra pip repo is needed for an updated [num2words fork](https://github.com/rhasspy/num2words) that includes support for more languages.
+
+You may also [manually download language files](https://github.com/rhasspy/gruut/releases/latest) and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).
+
+gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed. Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`. 
+
+## Supported Languages
+
+gruut currently supports:
+
+* Arabic (`ar`)
+* Czech (`cs` or `cs-cz`)
+* German (`de` or `de-de`)
+* English (`en` or `en-us`)
+* Spanish (`es` or `es-es`)
+* Farsi/Persian (`fa`)
+* French (`fr` or `fr-fr`)
+* Italian (`it` or `it-it`)
+* Luxembourgish (`lb`)
+* Dutch (`nl`)
+* Russian (`ru` or `ru-ru`)
+* Swedish (`sv` or `sv-se`)
+* Swahili (`sw`)
+
+The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).
+
+## Dependencies
+
+* Python 3.7 or higher
+* Linux
+    * Tested on Debian Bullseye
+* [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
+    * Currency/number handling
+    * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
+* gruut-ipa
+    * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
+* [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
+    * Part of speech tagging and grapheme to phoneme models
+* [pydateparser](https://github.com/GLibAi/pydateparser)
+    * Date parsing for multiple languages
+
+## Numbers, Dates, and More
+
+`gruut` can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., `<s lang="...">`).
+
+The following types of expressions can be automatically expanded into words by `gruut`:
+
+* Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
+    * Relies on `Babel` for parsing and `num2words` for verbalization
+* Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
+    * Relies on `pydateparser` for parsing and both `Babel` and `num2words` for verbalization
+* Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
+    * Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
+* Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
+    * English only
+    * Relies on `num2words` for verbalization
+
+## Command-Line Usage
+
+The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE> <TEXT>` or with the `gruut` command (from `setup.py`).
+
+The `gruut` command is line-oriented, consuming text and producing [JSONL](https://jsonlines.org/).
+You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.
+
+### Plain Text
+
+Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.
+
+```sh
+echo 'This, right here, is some "RAW" text!' \
+   | gruut --language en-us \
+   | jq --raw-output '.words[].text'
+This
+,
+right
+here
+,
+is
+some
+"
+RAW
+"
+text
+!
+```
+
+More information is available in the full JSON output:
+
+```sh
+gruut --language en-us 'More  text.' | jq .
+```
+
+Output:
+
+```json
+{
+  "idx": 0,
+  "text": "More text.",
+  "text_with_ws": "More text.",
+  "text_spoken": "More text",
+  "par_idx": 0,
+  "lang": "en-us",
+  "voice": "",
+  "words": [
+    {
+      "idx": 0,
+      "text": "More",
+      "text_with_ws": "More ",
+      "leading_ws": "",
+      "training_ws": " ",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": "JJR",
+      "phonemes": [
+        "m",
+        "ˈɔ",
+        "ɹ"
+      ],
+      "is_major_break": false,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": false,
+      "is_spoken": true,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    },
+    {
+      "idx": 1,
+      "text": "text",
+      "text_with_ws": "text",
+      "leading_ws": "",
+      "training_ws": "",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": "NN",
+      "phonemes": [
+        "t",
+        "ˈɛ",
+        "k",
+        "s",
+        "t"
+      ],
+      "is_major_break": false,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": false,
+      "is_spoken": true,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    },
+    {
+      "idx": 2,
+      "text": ".",
+      "text_with_ws": ".",
+      "leading_ws": "",
+      "training_ws": "",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": null,
+      "phonemes": [
+        "‖"
+      ],
+      "is_major_break": true,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": true,
+      "is_spoken": false,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    }
+  ],
+  "pause_before_ms": 0,
+  "pause_after_ms": 0
+}
+```
+
+For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.
+
+Within each word, there is:
+
+* `idx` - zero-based index of the word in the sentence
+* `sent_idx` - zero-based index of the sentence in the input text
+* `pos` - part of speech tag (if available)
+* `phonemes` - list of [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemes for the word (if available)
+* `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
+* `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
+* `is_break` - `true` if "word" is a major or minor break
+* `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
+* `is_spoken` - `true` if not a break or punctuation
+
+See `python3 -m gruut --help` for more options.
+
+### SSML
+
+A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:
+
+* `<speak>` - wrap around SSML text
+    * `lang` - set language for document
+* `<p>` - paragraph
+    * `lang` - set language for paragraph
+* `<s>` - sentence (disables automatic sentence breaking)
+    * `lang` - set language for sentence
+* `<w>` / `<token>` - word (disables automatic tokenization)
+    * `lang` - set language for word
+    * `role` - set word role (see [word roles](#word-roles))
+* `<lang lang="...">` - set language of inner text
+* `<voice name="...">` - set voice of inner text
+* `<say-as interpret-as="">` - force interpretation of inner text
+    * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
+    * `format` - way to format text depending on `interpret-as`
+        * number - one of "cardinal", "ordinal", "digits", "year"
+        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
+* `<break time="">` - Pause for given amount of time
+    * time - seconds ("123s") or milliseconds ("123ms")
+* `<mark name="">` - User-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
+    * name - name of mark
+* `<sub alias="">` - substitute `alias` for inner text
+* `<phoneme ph="...">` - supply phonemes for inner text
+    * `ph` - phonemes for each word of inner text, separated by whitespace
+* `<lexicon id="...">` - inline or external pronunciation lexicon
+    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
+    * `uri` - if empty or missing, lexicon is inline
+    * One or more `<lexeme>` child elements with:
+        * Optional `role="..."` ([word roles](#word-roles) separated by whitespace)
+        * `<grapheme>WORD</grapheme>` - word text
+        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
+* `<lookup ref="...">` - use pronunciation lexicon for child elements
+    * `ref` - id from a `<lexicon id="...">`
+
+#### Word Roles
+
+During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
+
+For `en-us`, the following additional roles are available from the part-of-speech tagger:
+
+* `gruut:CD` - number
+* `gruut:DT` - determiner
+* `gruut:IN` - preposition or subordinating conjunction 
+* `gruut:JJ` - adjective
+* `gruut:NN` - noun
+* `gruut:PRP` - personal pronoun
+* `gruut:RB` - adverb
+* `gruut:VB` - verb
+* `gruut:VBD` - verb (past tense)
+
+#### Inline Lexicons
+
+Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
+
+For example, the following document will yield three different pronunciations for the word "tomato":
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <lexicon xml:id="test" alphabet="ipa">
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Individual phonemes are separated by whitespace -->
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+    <lexeme>
+      <grapheme role="fake-role">
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Made up pronunciation for fake word role -->
+        t ə m ˈi t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+  <lookup ref="test">
+    <w>tomato</w>
+    <w role="fake-role">tomato</w>
+  </lookup>
+</speak>
+```
+
+The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made-up pronunciation in this case).
+
+Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document: 
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <!-- No id means change all words without a lookup -->
+  <lexicon>
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+</speak>
+```
+
+This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).
+
+## Intended Audience
+
+gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).
+
+For each supported language, gruut includes:
+
+* A word pronunciation lexicon built from open source data
+    * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
+* A pre-trained grapheme-to-phoneme model for guessing word pronunciations
+
+Some languages also include:
+
+* A pre-trained part of speech tagger built from open source data:
+    * See [universal dependencies](https://universaldependencies.org/)
+
+
+
+
+%package help
+Summary:	Development documents and examples for gruut
+Provides:	python3-gruut-doc
+%description help
+# Gruut
+
+A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages that supports [SSML](#ssml).
+
+```python
+from gruut import sentences
+
+text = 'He wound it around the wound, saying "I read it was $10 to read."'
+
+for sent in sentences(text, lang="en-us"):
+    for word in sent:
+        if word.phonemes:
+            print(word.text, *word.phonemes)
+```
+
+which outputs:
+
+```
+He h ˈi
+wound w ˈaʊ n d
+it ˈɪ t
+around ɚ ˈaʊ n d
+the ð ə
+wound w ˈu n d
+, |
+saying s ˈeɪ ɪ ŋ
+I ˈaɪ
+read ɹ ˈɛ d
+it ˈɪ t
+was w ə z
+ten t ˈɛ n
+dollars d ˈɑ l ɚ z
+to t ə
+read ɹ ˈi d
+. ‖
+```
+
+Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
+
+A [subset of SSML](#ssml) is also supported:
+
+```python
+from gruut import sentences
+
+ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
+<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
+    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+    xml:lang="en-US">
+<s>Today at 4pm, 2/1/2000.</s>
+<s xml:lang="it">Un mese fà, 2/1/2000.</s>
+</speak>"""
+
+for sent in sentences(ssml_text, ssml=True):
+    for word in sent:
+        if word.phonemes:
+            print(sent.idx, word.lang, word.text, *word.phonemes)
+```
+
+with the output:
+
+```
+0 en-US Today t ə d ˈeɪ
+0 en-US at ˈæ t
+0 en-US four f ˈɔ ɹ
+0 en-US P p ˈi
+0 en-US M ˈɛ m
+0 en-US , |
+0 en-US February f ˈɛ b j u ˌɛ ɹ i
+0 en-US first f ˈɚ s t
+0 en-US , |
+0 en-US two t ˈu
+0 en-US thousand θ ˈaʊ z ə n d
+0 en-US . ‖
+1 it Un u n
+1 it mese ˈm e s e
+1 it fà f a
+1 it , |
+1 it due d j u
+1 it gennaio d͡ʒ e n n ˈa j o
+1 it duemila d u e ˈm i l a
+1 it . ‖
+```
+
+See [the documentation](https://rhasspy.github.io/gruut/) for more details.
+
+## Installation
+
+```sh
+pip install gruut
+```
+
+Languages besides English can be added during installation. For example, with French and Italian support:
+
+```sh
+pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
+```
+
+The extra pip repo is needed for an updated [num2words fork](https://github.com/rhasspy/num2words) that includes support for more languages.
+
+You may also [manually download language files](https://github.com/rhasspy/gruut/releases/latest) and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).
+
+gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed. Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`. 
+
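The directory lookup described above can be sketched with the standard library alone. This is a minimal illustration of how the path is composed, not gruut's own code: `language_dir` is a hypothetical helper, and the fallback to `$HOME/.config` follows the default stated above.

```python
import os
from pathlib import Path


def language_dir(lang, environ=None):
    """Return the directory where language files for `lang` would be searched.

    Hypothetical helper for illustration; not part of gruut's API.
    """
    env = os.environ if environ is None else environ
    config_home = env.get("XDG_CONFIG_HOME")
    if not config_home:
        # Fall back to $HOME/.config when XDG_CONFIG_HOME is unset or empty
        home = env.get("HOME") or str(Path.home())
        config_home = os.path.join(home, ".config")
    return Path(config_home) / "gruut" / lang


# Use the full language name, e.g. "de-de" rather than "de"
print(language_dir("de-de", {"HOME": "/home/user"}))
```

Passing an explicit `environ` mapping keeps the sketch easy to try without changing the real environment.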
+## Supported Languages
+
+gruut currently supports:
+
+* Arabic (`ar`)
+* Czech (`cs` or `cs-cz`)
+* German (`de` or `de-de`)
+* English (`en` or `en-us`)
+* Spanish (`es` or `es-es`)
+* Farsi/Persian (`fa`)
+* French (`fr` or `fr-fr`)
+* Italian (`it` or `it-it`)
+* Luxembourgish (`lb`)
+* Dutch (`nl`)
+* Russian (`ru` or `ru-ru`)
+* Swedish (`sv` or `sv-se`)
+* Swahili (`sw`)
+
+The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).
+
+## Dependencies
+
+* Python 3.7 or higher
+* Linux
+    * Tested on Debian Bullseye
+* [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
+    * Currency/number handling
+    * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
+* gruut-ipa
+    * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
+* [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
+    * Part of speech tagging and grapheme to phoneme models
+* [pydateparser](https://github.com/GLibAi/pydateparser)
+    * Date parsing for multiple languages
+
+## Numbers, Dates, and More
+
+`gruut` can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., `<s lang="...">`).
+
+The following types of expressions can be automatically expanded into words by `gruut`:
+
+* Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
+    * Relies on `Babel` for parsing and `num2words` for verbalization
+* Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
+    * Relies on `pydateparser` for parsing and both `Babel` and `num2words` for verbalization
+* Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
+    * Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
+* Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
+    * English only
+    * Relies on `num2words` for verbalization
+
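To make the "M/D/Y" versus "D/M/Y" ambiguity concrete, here is a small standard-library sketch. The `DATE_ORDER` table and `parse_numeric_date` are assumptions made for illustration; gruut itself relies on `pydateparser` and `Babel`, whose behavior is not reproduced here.

```python
from datetime import datetime

# Assumed numeric date orders per language (illustrative only,
# not gruut's internal table)
DATE_ORDER = {
    "en-us": "%m/%d/%Y",  # month/day/year
    "it-it": "%d/%m/%Y",  # day/month/year
}


def parse_numeric_date(text, lang):
    """Parse a numeric date using the order assumed for `lang`."""
    return datetime.strptime(text, DATE_ORDER[lang]).date()


print(parse_numeric_date("2/1/2000", "en-us"))  # 2000-02-01
print(parse_numeric_date("2/1/2000", "it-it"))  # 2000-01-02
```

The same string yields February 1st under the U.S. English order and January 2nd under the Italian order, which matches the SSML example output near the top of this document.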
+## Command-Line Usage
+
+The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE> <TEXT>` or with the `gruut` command (from `setup.py`).
+
+The `gruut` command is line-oriented, consuming text and producing [JSONL](https://jsonlines.org/).
+You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.
+
+### Plain Text
+
+Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.
+
+```sh
+echo 'This, right here, is some "RAW" text!' \
+   | gruut --language en-us \
+   | jq --raw-output '.words[].text'
+This
+,
+right
+here
+,
+is
+some
+"
+RAW
+"
+text
+!
+```
+
+More information is available in the full JSON output:
+
+```sh
+gruut --language en-us 'More  text.' | jq .
+```
+
+Output:
+
+```json
+{
+  "idx": 0,
+  "text": "More text.",
+  "text_with_ws": "More text.",
+  "text_spoken": "More text",
+  "par_idx": 0,
+  "lang": "en-us",
+  "voice": "",
+  "words": [
+    {
+      "idx": 0,
+      "text": "More",
+      "text_with_ws": "More ",
+      "leading_ws": "",
+      "trailing_ws": " ",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": "JJR",
+      "phonemes": [
+        "m",
+        "ˈɔ",
+        "ɹ"
+      ],
+      "is_major_break": false,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": false,
+      "is_spoken": true,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    },
+    {
+      "idx": 1,
+      "text": "text",
+      "text_with_ws": "text",
+      "leading_ws": "",
+      "trailing_ws": "",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": "NN",
+      "phonemes": [
+        "t",
+        "ˈɛ",
+        "k",
+        "s",
+        "t"
+      ],
+      "is_major_break": false,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": false,
+      "is_spoken": true,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    },
+    {
+      "idx": 2,
+      "text": ".",
+      "text_with_ws": ".",
+      "leading_ws": "",
+      "trailing_ws": "",
+      "sent_idx": 0,
+      "par_idx": 0,
+      "lang": "en-us",
+      "voice": "",
+      "pos": null,
+      "phonemes": [
+        "‖"
+      ],
+      "is_major_break": true,
+      "is_minor_break": false,
+      "is_punctuation": false,
+      "is_break": true,
+      "is_spoken": false,
+      "pause_before_ms": 0,
+      "pause_after_ms": 0
+    }
+  ],
+  "pause_before_ms": 0,
+  "pause_after_ms": 0
+}
+```
+
+For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.
+
+Within each word, there is:
+
+* `idx` - zero-based index of the word in the sentence
+* `sent_idx` - zero-based index of the sentence in the input text
+* `pos` - part of speech tag (if available)
+* `phonemes` - list of [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemes for the word (if available)
+* `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
+* `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
+* `is_break` - `true` if "word" is a major or minor break
+* `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
+* `is_spoken` - `true` if not a break or punctuation
+
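As an alternative to `jq`, the word fields listed above can be consumed with Python's `json` module. The sketch below is illustrative: `LINE` is a hand-trimmed version of the sample output shown earlier (only the fields used here are kept), and `spoken_words` is a hypothetical helper.

```python
import json

# One JSONL line, trimmed to the fields used in this sketch
LINE = (
    '{"idx": 0, "text": "More text.", "words": ['
    '{"idx": 0, "text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": true}, '
    '{"idx": 1, "text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": true}, '
    '{"idx": 2, "text": ".", "phonemes": ["‖"], "is_spoken": false}'
    ']}'
)


def spoken_words(jsonl_line):
    """Return (text, phonemes) pairs for the spoken words of one sentence."""
    sentence = json.loads(jsonl_line)
    return [
        (word["text"], word["phonemes"])
        for word in sentence["words"]
        if word["is_spoken"]
    ]


for text, phonemes in spoken_words(LINE):
    print(text, *phonemes)
```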
+See `python3 -m gruut <LANGUAGE> --help` for more options.
+
+### SSML
+
+A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:
+
+* `<speak>` - wrap around SSML text
+    * `lang` - set language for document
+* `<p>` - paragraph
+    * `lang` - set language for paragraph
+* `<s>` - sentence (disables automatic sentence breaking)
+    * `lang` - set language for sentence
+* `<w>` / `<token>` - word (disables automatic tokenization)
+    * `lang` - set language for word
+    * `role` - set word role (see [word roles](#word-roles))
+* `<lang lang="...">` - set language of inner text
+* `<voice name="...">` - set voice of inner text
+* `<say-as interpret-as="">` - force interpretation of inner text
+    * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
+    * `format` - way to format text depending on `interpret-as`
+        * number - one of "cardinal", "ordinal", "digits", "year"
+        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
+* `<break time="">` - Pause for given amount of time
+    * time - seconds ("123s") or milliseconds ("123ms")
+* `<mark name="">` - User-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
+    * name - name of mark
+* `<sub alias="">` - substitute `alias` for inner text
+* `<phoneme ph="...">` - supply phonemes for inner text
+    * `ph` - phonemes for each word of inner text, separated by whitespace
+* `<lexicon id="...">` - inline or external pronunciation lexicon
+    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
+    * `uri` - if empty or missing, lexicon is inline
+    * One or more `<lexeme>` child elements with:
+        * Optional `role="..."` ([word roles](#word-roles) separated by whitespace)
+        * `<grapheme>WORD</grapheme>` - word text
+        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
+* `<lookup ref="...">` - use pronunciation lexicon for child elements
+    * `ref` - id from a `<lexicon id="...">`
+
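The two `<break time="">` forms listed above ("123s" and "123ms") can be normalized to milliseconds, the unit used by the `pause_before_ms` and `pause_after_ms` fields. The following is a minimal sketch, not gruut's parser: `break_time_ms` is a hypothetical helper, and accepting decimal values such as "1.5s" is an assumption.

```python
import re

# Accepts forms like "123s" or "123ms"; decimal values are an assumption
_BREAK_TIME = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*$")


def break_time_ms(value):
    """Convert a break time string to a whole number of milliseconds."""
    match = _BREAK_TIME.match(value)
    if match is None:
        raise ValueError(f"Unsupported break time: {value!r}")
    amount, unit = match.groups()
    factor = 1000 if unit == "s" else 1
    return round(float(amount) * factor)


print(break_time_ms("123s"))   # 123000
print(break_time_ms("123ms"))  # 123
```

Raising `ValueError` on anything else keeps malformed attributes from silently turning into a zero-length pause.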
+#### Word Roles
+
+During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
+
+For `en-us`, the following additional roles are available from the part-of-speech tagger:
+
+* `gruut:CD` - number
+* `gruut:DT` - determiner
+* `gruut:IN` - preposition or subordinating conjunction 
+* `gruut:JJ` - adjective
+* `gruut:NN` - noun
+* `gruut:PRP` - personal pronoun
+* `gruut:RB` - adverb
+* `gruut:VB` - verb
+* `gruut:VBD` - verb (past tense)
+
+#### Inline Lexicons
+
+Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
+
+For example, the following document will yield three different pronunciations for the word "tomato":
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <lexicon xml:id="test" alphabet="ipa">
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Individual phonemes are separated by whitespace -->
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+    <lexeme>
+      <grapheme role="fake-role">
+        tomato
+      </grapheme>
+      <phoneme>
+        <!-- Made up pronunciation for fake word role -->
+        t ə m ˈi t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+  <lookup ref="test">
+    <w>tomato</w>
+    <w role="fake-role">tomato</w>
+  </lookup>
+</speak>
+```
+
+The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made-up pronunciation in this case).
+
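The three lookups just described can be modeled with plain dictionaries. This is only a sketch of the selection order under stated assumptions, not gruut's implementation: `pronounce` and both tables are hypothetical, and falling back to the role-less entry and then to the built-in lexicon is assumed rather than documented.

```python
# Built-in pronunciation quoted from the text above
BUILTIN_LEXICON = {"tomato": "t ə m ˈeɪ t oʊ".split()}

# Inline lexicons keyed by id, then by (word, role); None means "no role"
INLINE_LEXICONS = {
    "test": {
        ("tomato", None): "t ə m ˈɑ t oʊ".split(),
        ("tomato", "fake-role"): "t ə m ˈi t oʊ".split(),
    }
}


def pronounce(word, role=None, lookup_ref=None):
    """Pick phonemes for `word`, preferring the referenced inline lexicon."""
    if lookup_ref is not None:
        lexicon = INLINE_LEXICONS.get(lookup_ref, {})
        # Prefer a role-specific entry, then the entry without a role
        for key in ((word, role), (word, None)):
            if key in lexicon:
                return lexicon[key]
    # Assumed fallback: the language's built-in lexicon
    return BUILTIN_LEXICON.get(word)


print(*pronounce("tomato"))
print(*pronounce("tomato", lookup_ref="test"))
print(*pronounce("tomato", role="fake-role", lookup_ref="test"))
```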
+Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document: 
+
+``` xml
+<?xml version="1.0"?>
+<speak version="1.1"
+       xmlns="http://www.w3.org/2001/10/synthesis"
+       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
+                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
+       xml:lang="en-US">
+
+  <!-- No id means change all words without a lookup -->
+  <lexicon>
+    <lexeme>
+      <grapheme>
+        tomato
+      </grapheme>
+      <phoneme>
+        t ə m ˈɑ t oʊ
+      </phoneme>
+    </lexeme>
+  </lexicon>
+
+  <w>tomato</w>
+</speak>
+```
+
+This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).
+
+## Intended Audience
+
+gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).
+
+For each supported language, gruut includes:
+
+* A word pronunciation lexicon built from open source data
+    * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
+* A pre-trained grapheme-to-phoneme model for guessing word pronunciations
+
+Some languages also include:
+
+* A pre-trained part of speech tagger built from open source data:
+    * See [universal dependencies](https://universaldependencies.org/)
+
+
+
+
+%prep
+%autosetup -n gruut-2.3.4
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-gruut -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 2.3.4-1
+- Package Spec generated