%global _empty_manifest_terminate_build 0
Name: python-spacy-stanza
Version: 1.0.3
Release: 1
Summary: Use the latest Stanza (StanfordNLP) research models directly in spaCy
License: MIT
URL: https://explosion.ai
Source0: https://mirrors.aliyun.com/pypi/web/packages/95/54/b30f48236ab4701a52e4304a69e01b561f8ac5778f3defa14dc720644461/spacy-stanza-1.0.3.tar.gz
BuildArch: noarch
Requires: python3-spacy
Requires: python3-stanza
%description
# spaCy + Stanza (formerly StanfordNLP)
This package wraps the [Stanza](https://github.com/stanfordnlp/stanza)
(formerly StanfordNLP) library, so you can use Stanford's models in a
[spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy
in the CoNLL 2017 and 2018 shared tasks, which involve tokenization,
part-of-speech tagging, morphological analysis, lemmatization and labeled
dependency parsing in 68 languages. As of v1.0, Stanza also supports named
entity recognition for selected languages.
> ⚠️ Previous versions of this package were available as
> [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp).
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/17/master.svg?logo=azure-pipelines&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=17)
[![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza)
[![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
Using this wrapper, you'll be able to use the following annotations, computed by
your pretrained `stanza` model:
- Statistical tokenization (reflected in the `Doc` and its tokens)
- Lemmatization (`token.lemma` and `token.lemma_`)
- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)
- Morphological analysis (`token.morph`)
- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)
- Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`, `token.ent_iob`, `token.ent_iob_`)
- Sentence segmentation (`doc.sents`)
## ⌛️ Installation
As of v1.0.0 `spacy-stanza` is only compatible with **spaCy v3.x**. To install
the most recent version:
```bash
pip install spacy-stanza
```
For spaCy v2, install v0.2.x and refer to the [v0.2.x usage
documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples):
```bash
pip install "spacy-stanza<0.3.0"
```
Make sure to also
[download](https://stanfordnlp.github.io/stanza/download_models.html) one of
the [pre-trained Stanza
models](https://stanfordnlp.github.io/stanza/models.html).
## 📖 Usage & Examples
> ⚠️ **Important note:** This package has been refactored to take advantage of
> [spaCy v3.0](https://spacy.io). Previous versions that were built for [spaCy
> v2.x](https://v2.spacy.io) worked considerably differently. Please see
> previous tagged versions of this README for documentation on prior versions.
Use `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to
process a text with a Stanza pipeline and create a spaCy [`Doc`
object](https://spacy.io/api/doc). By default, both the spaCy pipeline and the
Stanza pipeline will be initialized with the same `lang`, e.g. "en":
```python
import stanza
import spacy_stanza
# Download the stanza model if necessary
stanza.download("en")
# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
```
If language data for the given language is available in spaCy, the respective
language class can be used as the base for the `nlp` object – for example,
`English()`. This lets you use spaCy's lexical attributes like `is_stop` or
`like_num`. The `nlp` object follows the same API as any other spaCy `Language`
class – so you can visualize the `Doc` objects with displaCy, add custom
components to the pipeline, use the rule-based matcher and do pretty much
anything else you'd normally do in spaCy.
```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])
# Visualize dependencies
from spacy import displacy
displacy.serve(doc) # or displacy.render if you're in a Jupyter notebook
# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)
# Combine with your own custom pipeline components
from spacy import Language
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc
nlp.add_pipe("custom_component")
doc = nlp("Some text")
# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```
### Stanza Pipeline options
Additional options for the Stanza
[`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be
provided as keyword arguments following the `Pipeline` API:
- Provide the Stanza language as `lang`. For Stanza languages without spaCy
support, use "xx" for the spaCy language setting:
```python
# Initialize a pipeline for Coptic
nlp = spacy_stanza.load_pipeline("xx", lang="cop")
```
- Provide Stanza pipeline settings following the `Pipeline` API:
```python
# Initialize a German pipeline with the `hdt` package
nlp = spacy_stanza.load_pipeline("de", package="hdt")
```
- Tokenize with spaCy rather than the statistical tokenizer (only for English):
```python
nlp = spacy_stanza.load_pipeline("en", processors= {"tokenize": "spacy"})
```
- Provide any additional processor settings as additional keyword arguments:
```python
# Provide pretokenized texts (whitespace tokenization)
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
```
The spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]`
block. For example, the config for the last example above, a German pipeline
with pretokenized texts:
```ini
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true
[nlp.tokenizer.kwargs]
tokenize_pretokenized = true
[nlp.tokenizer.processors]
```
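To verify these settings at runtime, you can inspect the generated config on the `nlp` object. A minimal sketch, assuming the pretokenized German pipeline from the example above and that the German Stanza model has already been downloaded:
```python
import spacy_stanza
# Build the pipeline as in the example above
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
# The resolved settings live in the [nlp.tokenizer] block of nlp.config
tokenizer_cfg = nlp.config["nlp"]["tokenizer"]
print(tokenizer_cfg["@tokenizers"])  # "spacy_stanza.PipelineAsTokenizer.v1"
print(tokenizer_cfg["kwargs"])       # {'tokenize_pretokenized': True}
```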
### Serialization
The full Stanza pipeline configuration is stored in the spaCy pipeline
[config](https://spacy.io/usage/training#config), so you can save and load the
pipeline just like any other `nlp` pipeline:
```python
# Save to a local directory
nlp.to_disk("./stanza-spacy-model")
# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")
```
Note that this **does not save any Stanza model data by default**. The Stanza
models are very large, so for now, this package expects you to download the
models separately with `stanza.download()` and have them available either in
the default model directory or in the path specified under
`[nlp.tokenizer.dir]` in the config.
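As a minimal sketch of the second option, where the directory name `./stanza_models` is an arbitrary path chosen for illustration:
```python
import stanza
import spacy_stanza
# Download the Stanza model into a project-local directory
stanza.download("en", model_dir="./stanza_models")
# Point the wrapper at that directory via `dir`; the value is stored
# under [nlp.tokenizer.dir] in the config and reused on reload
nlp = spacy_stanza.load_pipeline("en", dir="./stanza_models")
nlp.to_disk("./stanza-spacy-model")
```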
### Adding additional spaCy pipeline components
By default, the spaCy pipeline in the `nlp` object returned by
`spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes
are computed and set within the custom tokenizer,
[`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp`
object, you can add your own components to the pipeline. For example, you could
add [your own custom text classification
component](https://spacy.io/usage/training) with `nlp.add_pipe("textcat",
source=source_nlp)`, or augment the named entities with your own rule-based
patterns using the [`EntityRuler`
component](https://spacy.io/usage/rule-based-matching#entityruler).
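For instance, here is a brief sketch of augmenting the Stanza entities with an `EntityRuler`; the pattern below is purely illustrative:
```python
import stanza
import spacy_stanza
stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")
# Add a rule-based EntityRuler; with default settings it only adds
# entities that don't overlap those already predicted by Stanza
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion AI"}])
doc = nlp("Explosion AI is based in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```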
%package -n python3-spacy-stanza
Summary: Use the latest Stanza (StanfordNLP) research models directly in spaCy
Provides: python-spacy-stanza
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-spacy-stanza
# spaCy + Stanza (formerly StanfordNLP)
This package wraps the [Stanza](https://github.com/stanfordnlp/stanza)
(formerly StanfordNLP) library, so you can use Stanford's models in a
[spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy
in the CoNLL 2017 and 2018 shared tasks, which involve tokenization,
part-of-speech tagging, morphological analysis, lemmatization and labeled
dependency parsing in 68 languages. As of v1.0, Stanza also supports named
entity recognition for selected languages.
> ⚠️ Previous versions of this package were available as
> [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp).
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/17/master.svg?logo=azure-pipelines&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=17)
[![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza)
[![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
Using this wrapper, you'll be able to use the following annotations, computed by
your pretrained `stanza` model:
- Statistical tokenization (reflected in the `Doc` and its tokens)
- Lemmatization (`token.lemma` and `token.lemma_`)
- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)
- Morphological analysis (`token.morph`)
- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)
- Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`, `token.ent_iob`, `token.ent_iob_`)
- Sentence segmentation (`doc.sents`)
## ⌛️ Installation
As of v1.0.0 `spacy-stanza` is only compatible with **spaCy v3.x**. To install
the most recent version:
```bash
pip install spacy-stanza
```
For spaCy v2, install v0.2.x and refer to the [v0.2.x usage
documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples):
```bash
pip install "spacy-stanza<0.3.0"
```
Make sure to also
[download](https://stanfordnlp.github.io/stanza/download_models.html) one of
the [pre-trained Stanza
models](https://stanfordnlp.github.io/stanza/models.html).
## 📖 Usage & Examples
> ⚠️ **Important note:** This package has been refactored to take advantage of
> [spaCy v3.0](https://spacy.io). Previous versions that were built for [spaCy
> v2.x](https://v2.spacy.io) worked considerably differently. Please see
> previous tagged versions of this README for documentation on prior versions.
Use `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to
process a text with a Stanza pipeline and create a spaCy [`Doc`
object](https://spacy.io/api/doc). By default, both the spaCy pipeline and the
Stanza pipeline will be initialized with the same `lang`, e.g. "en":
```python
import stanza
import spacy_stanza
# Download the stanza model if necessary
stanza.download("en")
# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
```
If language data for the given language is available in spaCy, the respective
language class can be used as the base for the `nlp` object – for example,
`English()`. This lets you use spaCy's lexical attributes like `is_stop` or
`like_num`. The `nlp` object follows the same API as any other spaCy `Language`
class – so you can visualize the `Doc` objects with displaCy, add custom
components to the pipeline, use the rule-based matcher and do pretty much
anything else you'd normally do in spaCy.
```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])
# Visualize dependencies
from spacy import displacy
displacy.serve(doc) # or displacy.render if you're in a Jupyter notebook
# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)
# Combine with your own custom pipeline components
from spacy import Language
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc
nlp.add_pipe("custom_component")
doc = nlp("Some text")
# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```
### Stanza Pipeline options
Additional options for the Stanza
[`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be
provided as keyword arguments following the `Pipeline` API:
- Provide the Stanza language as `lang`. For Stanza languages without spaCy
support, use "xx" for the spaCy language setting:
```python
# Initialize a pipeline for Coptic
nlp = spacy_stanza.load_pipeline("xx", lang="cop")
```
- Provide Stanza pipeline settings following the `Pipeline` API:
```python
# Initialize a German pipeline with the `hdt` package
nlp = spacy_stanza.load_pipeline("de", package="hdt")
```
- Tokenize with spaCy rather than the statistical tokenizer (only for English):
```python
nlp = spacy_stanza.load_pipeline("en", processors= {"tokenize": "spacy"})
```
- Provide any additional processor settings as additional keyword arguments:
```python
# Provide pretokenized texts (whitespace tokenization)
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
```
The spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]`
block. For example, the config for the last example above, a German pipeline
with pretokenized texts:
```ini
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true
[nlp.tokenizer.kwargs]
tokenize_pretokenized = true
[nlp.tokenizer.processors]
```
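To verify these settings at runtime, you can inspect the generated config on the `nlp` object. A minimal sketch, assuming the pretokenized German pipeline from the example above and that the German Stanza model has already been downloaded:
```python
import spacy_stanza
# Build the pipeline as in the example above
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
# The resolved settings live in the [nlp.tokenizer] block of nlp.config
tokenizer_cfg = nlp.config["nlp"]["tokenizer"]
print(tokenizer_cfg["@tokenizers"])  # "spacy_stanza.PipelineAsTokenizer.v1"
print(tokenizer_cfg["kwargs"])       # {'tokenize_pretokenized': True}
```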
### Serialization
The full Stanza pipeline configuration is stored in the spaCy pipeline
[config](https://spacy.io/usage/training#config), so you can save and load the
pipeline just like any other `nlp` pipeline:
```python
# Save to a local directory
nlp.to_disk("./stanza-spacy-model")
# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")
```
Note that this **does not save any Stanza model data by default**. The Stanza
models are very large, so for now, this package expects you to download the
models separately with `stanza.download()` and have them available either in
the default model directory or in the path specified under
`[nlp.tokenizer.dir]` in the config.
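As a minimal sketch of the second option, where the directory name `./stanza_models` is an arbitrary path chosen for illustration:
```python
import stanza
import spacy_stanza
# Download the Stanza model into a project-local directory
stanza.download("en", model_dir="./stanza_models")
# Point the wrapper at that directory via `dir`; the value is stored
# under [nlp.tokenizer.dir] in the config and reused on reload
nlp = spacy_stanza.load_pipeline("en", dir="./stanza_models")
nlp.to_disk("./stanza-spacy-model")
```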
### Adding additional spaCy pipeline components
By default, the spaCy pipeline in the `nlp` object returned by
`spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes
are computed and set within the custom tokenizer,
[`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp`
object, you can add your own components to the pipeline. For example, you could
add [your own custom text classification
component](https://spacy.io/usage/training) with `nlp.add_pipe("textcat",
source=source_nlp)`, or augment the named entities with your own rule-based
patterns using the [`EntityRuler`
component](https://spacy.io/usage/rule-based-matching#entityruler).
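For instance, here is a brief sketch of augmenting the Stanza entities with an `EntityRuler`; the pattern below is purely illustrative:
```python
import stanza
import spacy_stanza
stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")
# Add a rule-based EntityRuler; with default settings it only adds
# entities that don't overlap those already predicted by Stanza
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion AI"}])
doc = nlp("Explosion AI is based in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```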
%package help
Summary: Development documents and examples for spacy-stanza
Provides: python3-spacy-stanza-doc
%description help
# spaCy + Stanza (formerly StanfordNLP)
This package wraps the [Stanza](https://github.com/stanfordnlp/stanza)
(formerly StanfordNLP) library, so you can use Stanford's models in a
[spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy
in the CoNLL 2017 and 2018 shared tasks, which involve tokenization,
part-of-speech tagging, morphological analysis, lemmatization and labeled
dependency parsing in 68 languages. As of v1.0, Stanza also supports named
entity recognition for selected languages.
> ⚠️ Previous versions of this package were available as
> [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp).
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/17/master.svg?logo=azure-pipelines&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=17)
[![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza)
[![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
Using this wrapper, you'll be able to use the following annotations, computed by
your pretrained `stanza` model:
- Statistical tokenization (reflected in the `Doc` and its tokens)
- Lemmatization (`token.lemma` and `token.lemma_`)
- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)
- Morphological analysis (`token.morph`)
- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)
- Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`, `token.ent_iob`, `token.ent_iob_`)
- Sentence segmentation (`doc.sents`)
## ⌛️ Installation
As of v1.0.0 `spacy-stanza` is only compatible with **spaCy v3.x**. To install
the most recent version:
```bash
pip install spacy-stanza
```
For spaCy v2, install v0.2.x and refer to the [v0.2.x usage
documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples):
```bash
pip install "spacy-stanza<0.3.0"
```
Make sure to also
[download](https://stanfordnlp.github.io/stanza/download_models.html) one of
the [pre-trained Stanza
models](https://stanfordnlp.github.io/stanza/models.html).
## 📖 Usage & Examples
> ⚠️ **Important note:** This package has been refactored to take advantage of
> [spaCy v3.0](https://spacy.io). Previous versions that were built for [spaCy
> v2.x](https://v2.spacy.io) worked considerably differently. Please see
> previous tagged versions of this README for documentation on prior versions.
Use `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to
process a text with a Stanza pipeline and create a spaCy [`Doc`
object](https://spacy.io/api/doc). By default, both the spaCy pipeline and the
Stanza pipeline will be initialized with the same `lang`, e.g. "en":
```python
import stanza
import spacy_stanza
# Download the stanza model if necessary
stanza.download("en")
# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
```
If language data for the given language is available in spaCy, the respective
language class can be used as the base for the `nlp` object – for example,
`English()`. This lets you use spaCy's lexical attributes like `is_stop` or
`like_num`. The `nlp` object follows the same API as any other spaCy `Language`
class – so you can visualize the `Doc` objects with displaCy, add custom
components to the pipeline, use the rule-based matcher and do pretty much
anything else you'd normally do in spaCy.
```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])
# Visualize dependencies
from spacy import displacy
displacy.serve(doc) # or displacy.render if you're in a Jupyter notebook
# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)
# Combine with your own custom pipeline components
from spacy import Language
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc
nlp.add_pipe("custom_component")
doc = nlp("Some text")
# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```
### Stanza Pipeline options
Additional options for the Stanza
[`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be
provided as keyword arguments following the `Pipeline` API:
- Provide the Stanza language as `lang`. For Stanza languages without spaCy
support, use "xx" for the spaCy language setting:
```python
# Initialize a pipeline for Coptic
nlp = spacy_stanza.load_pipeline("xx", lang="cop")
```
- Provide Stanza pipeline settings following the `Pipeline` API:
```python
# Initialize a German pipeline with the `hdt` package
nlp = spacy_stanza.load_pipeline("de", package="hdt")
```
- Tokenize with spaCy rather than the statistical tokenizer (only for English):
```python
nlp = spacy_stanza.load_pipeline("en", processors= {"tokenize": "spacy"})
```
- Provide any additional processor settings as additional keyword arguments:
```python
# Provide pretokenized texts (whitespace tokenization)
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
```
The spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]`
block. For example, the config for the last example above, a German pipeline
with pretokenized texts:
```ini
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true
[nlp.tokenizer.kwargs]
tokenize_pretokenized = true
[nlp.tokenizer.processors]
```
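To verify these settings at runtime, you can inspect the generated config on the `nlp` object. A minimal sketch, assuming the pretokenized German pipeline from the example above and that the German Stanza model has already been downloaded:
```python
import spacy_stanza
# Build the pipeline as in the example above
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
# The resolved settings live in the [nlp.tokenizer] block of nlp.config
tokenizer_cfg = nlp.config["nlp"]["tokenizer"]
print(tokenizer_cfg["@tokenizers"])  # "spacy_stanza.PipelineAsTokenizer.v1"
print(tokenizer_cfg["kwargs"])       # {'tokenize_pretokenized': True}
```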
### Serialization
The full Stanza pipeline configuration is stored in the spaCy pipeline
[config](https://spacy.io/usage/training#config), so you can save and load the
pipeline just like any other `nlp` pipeline:
```python
# Save to a local directory
nlp.to_disk("./stanza-spacy-model")
# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")
```
Note that this **does not save any Stanza model data by default**. The Stanza
models are very large, so for now, this package expects you to download the
models separately with `stanza.download()` and have them available either in
the default model directory or in the path specified under
`[nlp.tokenizer.dir]` in the config.
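As a minimal sketch of the second option, where the directory name `./stanza_models` is an arbitrary path chosen for illustration:
```python
import stanza
import spacy_stanza
# Download the Stanza model into a project-local directory
stanza.download("en", model_dir="./stanza_models")
# Point the wrapper at that directory via `dir`; the value is stored
# under [nlp.tokenizer.dir] in the config and reused on reload
nlp = spacy_stanza.load_pipeline("en", dir="./stanza_models")
nlp.to_disk("./stanza-spacy-model")
```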
### Adding additional spaCy pipeline components
By default, the spaCy pipeline in the `nlp` object returned by
`spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes
are computed and set within the custom tokenizer,
[`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp`
object, you can add your own components to the pipeline. For example, you could
add [your own custom text classification
component](https://spacy.io/usage/training) with `nlp.add_pipe("textcat",
source=source_nlp)`, or augment the named entities with your own rule-based
patterns using the [`EntityRuler`
component](https://spacy.io/usage/rule-based-matching#entityruler).
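For instance, here is a brief sketch of augmenting the Stanza entities with an `EntityRuler`; the pattern below is purely illustrative:
```python
import stanza
import spacy_stanza
stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")
# Add a rule-based EntityRuler; with default settings it only adds
# entities that don't overlap those already predicted by Stanza
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion AI"}])
doc = nlp("Explosion AI is based in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```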
%prep
%autosetup -n spacy-stanza-1.0.3
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-spacy-stanza -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Thu Jun 08 2023 Python_Bot - 1.0.3-1
- Package Spec generated