From 1dc049a8158cf96f17b37582360735dbb7eb280f Mon Sep 17 00:00:00 2001 From: CoprDistGit Date: Wed, 10 May 2023 09:05:29 +0000 Subject: automatic import of python-spacy-conll --- .gitignore | 1 + python-spacy-conll.spec | 1084 +++++++++++++++++++++++++++++++++++++++++++++++ sources | 1 + 3 files changed, 1086 insertions(+) create mode 100644 python-spacy-conll.spec create mode 100644 sources diff --git a/.gitignore b/.gitignore index e69de29..de9960a 100644 --- a/.gitignore +++ b/.gitignore @@ -0,0 +1 @@ +/spacy_conll-3.4.0.tar.gz diff --git a/python-spacy-conll.spec b/python-spacy-conll.spec new file mode 100644 index 0000000..b82a6f2 --- /dev/null +++ b/python-spacy-conll.spec @@ -0,0 +1,1084 @@ +%global _empty_manifest_terminate_build 0 +Name: python-spacy-conll +Version: 3.4.0 +Release: 1 +Summary: A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point. +License: BSD 2 +URL: https://github.com/BramVanroy/spacy_conll +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/47/d7/d44f97e01ab22c2c21b0c7c6d581c5888a4049dfd2c249f3ee5f3b44426b/spacy_conll-3.4.0.tar.gz +BuildArch: noarch + +Requires: python3-spacy +Requires: python3-dataclasses +Requires: python3-pandas +Requires: python3-spacy-udpipe +Requires: python3-spacy-stanza +Requires: python3-pandas +Requires: python3-spacy-udpipe +Requires: python3-spacy-stanza +Requires: python3-pytest +Requires: python3-flake8 +Requires: python3-isort +Requires: python3-black +Requires: python3-pygments +Requires: python3-spacy-udpipe +Requires: python3-spacy-stanza +Requires: python3-pandas + +%description +# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe + +This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your + own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It + also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in + functionality to parse files or text. + +Note that the module simply takes a parser's output and puts it in a formatted string adhering to the linked ConLL-U + format. The output tags depend on the spaCy model used. If you want Universal Depencies tags as output, I advise you + to use this library in combination with [spacy-stanza](https://github.com/explosion/spacy-stanza), which is a spaCy + interface using `stanza` and its models behind the scenes. Those models use the Universal Dependencies formalism and + yield state-of-the-art performance. `stanza` is a new and improved version of `stanfordnlp`. As an alternative to the + Stanford models, you can use the spaCy wrapper for `UDPipe`, [spacy-udpipe](https://github.com/TakeLab/spacy-udpipe), + which is slightly less accurate than `stanza` but much faster. + + +## Installation + +By default, this package automatically installs only [spaCy](https://spacy.io/usage/models#section-quickstart) as + dependency. Because [spaCy's models](https://spacy.io/usage/models) are not necessarily trained on Universal + Dependencies conventions, their output labels are not UD either. By using `spacy-stanza` or `spacy-udpipe`, we get + the easy-to-use interface of spaCy as a wrapper around `stanza` and `UDPipe` respectively, including their models that + *are* trained on UD data. 
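+
+For instance, with the `init_parser` convenience function described under Usage below, you can request the
+ `stanza` backend to get UD-style CoNLL output. The snippet below is only a minimal sketch: it assumes that
+ `spacy_conll[parsers]` is installed and that the relevant models are available.
+
+```python
+from spacy_conll import init_parser
+
+# Plain spaCy backend: labels follow the loaded spaCy model's own tagset.
+nlp_spacy = init_parser("en_core_web_sm", "spacy")
+
+# stanza backend via spacy-stanza: labels follow Universal Dependencies
+# (assumes the English stanza model is available).
+nlp_ud = init_parser("en", "stanza")
+
+# Print the CoNLL-U string for a parsed Doc.
+print(nlp_ud("I like cookies.")._.conll_str)
+```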
+
+**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because
+ it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install
+them manually or use one of the available options as described below.
+
+If you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects
+ that `pandas` is installed. See the Usage section for more.
+
+To install the library, simply use pip.
+
+```shell
+# only includes spacy by default
+pip install spacy_conll
+```
+
+A number of options are available to make installation of additional dependencies easier:
+
+```shell
+# include spacy-stanza and spacy-udpipe
+pip install spacy_conll[parsers]
+# include pandas
+pip install spacy_conll[pd]
+# include pandas, spacy-stanza and spacy-udpipe
+pip install spacy_conll[all]
+# include pandas, spacy-stanza and spacy-udpipe and additional libraries for testing and formatting
+pip install spacy_conll[dev]
+```
+
+
+## Usage
+
+When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for `Token`, sentence `Span` and `Doc`.
+ Note that arbitrary `Span`s are not included and do not receive these properties.
+
+On all three of these levels, two custom properties are exposed by default, `._.conll` and its string
+ representation `._.conll_str`. However, if you have `pandas` installed, then `._.conll_pd` will
+ be added automatically, too!
+
+- `._.conll`: raw CoNLL format
+    - in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
+    - in sentence Span: a list of its tokens' `._.conll` dictionaries (list of dictionaries).
+    - in a Doc: a list of its sentences' `._.conll` lists (list of lists of dictionaries).
+
+- `._.conll_str`: string representation of the CoNLL format
+    - in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
+    - in sentence Span: the expected CoNLL format where each row represents a token. When
+      `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the
+      [CoNLL format](https://universaldependencies.org/format.html#sentence-boundaries-and-comments).
+    - in Doc: all its sentences' `._.conll_str` combined and separated by new lines.
+
+- `._.conll_pd`: `pandas` representation of the CoNLL format
+    - in Token: a Series representation of this token's CoNLL properties.
+    - in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers.
+    - in Doc: a concatenation of its sentences' DataFrames, leading to a new DataFrame whose index is reset.
+
+You can use `spacy_conll` in your own Python code as a custom pipeline component, or you can use the built-in
+ command-line script which offers the most typically needed functionality. See the following section for more.
+
+
+### In Python
+
+This library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated
+ as follows. It is important that you import `spacy_conll` before adding the pipe!
+
+```python
+import spacy
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("conll_formatter", last=True)
+```
+
+Because this library supports different spaCy wrappers (`spacy`, `stanza`, and `udpipe`), a convenience function is
+ available as well. With `utils.init_parser` you can easily instantiate a parser with a single line. You can
+ find the function's signature below. 
Have a look at the [source code](spacy_conll/utils.py) to read more about all the + possible arguments or try out the [examples](examples/). + +**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence + segmentation, effectively *only* splitting on new lines. With `spacy`, `is_tokenized` disables sentence splitting completely. + +```python +def init_parser( + model_or_lang: str, + parser: str, + *, + is_tokenized: bool = False, + disable_sbd: bool = False, + exclude_spacy_components: Optional[List[str]] = None, + parser_opts: Optional[Dict] = None, + **kwargs, +) +``` + +For instance, if you want to load a Dutch `stanza` model in silent mode with the CoNLL formatter already attached, you + can simply use the following snippet. `parser_opts` is passed to the `stanza` pipeline initialisation automatically. + Any other keyword arguments (`kwargs`), on the other hand, are passed to the `ConllFormatter` initialisation. + +```python +from spacy_conll import init_parser + +nlp = init_parser("nl", "stanza", parser_opts={"verbose": False}) +``` + +The `ConllFormatter` allows you to customize the extension names, and you can also specify conversion maps for the +output properties. + +To illustrate, here is an advanced example, showing the more complex options: + +- `ext_names`: changes the attribute names to a custom key by using a dictionary. +- `conversion_maps`: a two-level dictionary that looks like `{field_name: {tag_name: replacement}}`. In + other words, you can specify in which field a certain value should be replaced by another. This is especially useful + when you are not satisfied with the tagset of a model and wish to change some tags to an alternative0. +- `field_names`: allows you to change the default CoNLL-U field names to your own custom names. Similar to the + conversion map above, you should use any of the default field names as keys and add your own key as value. + Possible keys are : "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC". + +The example below + +- shows how to manually add the component; +- changes the custom attribute `conll_pd` to pandas (`conll_pd` only availabe if `pandas` is installed); +- converts any `nsubj` deprel tag to `subj`. + +```python +import spacy + + +nlp = spacy.load("en_core_web_sm") +config = {"ext_names": {"conll_pd": "pandas"}, + "conversion_maps": {"deprel": {"nsubj": "subj"}}} +nlp.add_pipe("conll_formatter", config=config, last=True) +doc = nlp("I like cookies.") +print(doc._.pandas) +``` + +This is the same as: + +```python +from spacy_conll import init_parser + +nlp = init_parser("en_core_web_sm", + "spacy", + ext_names={"conll_pd": "pandas"}, + conversion_maps={"deprel": {"nsubj": "subj"}}) +doc = nlp("I like cookies.") +print(doc._.pandas) +``` + + +The snippets above will output a pandas DataFrame by using `._.pandas` rather than the standard +`._.conll_pd`, and all occurrences of `nsubj` in the deprel field are replaced by `subj`. + +``` + ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC +0 1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 subj _ _ +1 2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _ +2 3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No +3 4 . . PUNCT . 
PunctType=Peri 2 punct _ SpaceAfter=No +``` + +Another initialization example that would replace the column names "UPOS" with "upostag" amd "XPOS" with "xpostag": + +```python +import spacy + + +nlp = spacy.load("en_core_web_sm") +config = {"field_names": {"UPOS": "upostag", "XPOS": "xpostag"}} +nlp.add_pipe("conll_formatter", config=config, last=True) +``` + +#### Reading CoNLL into a spaCy object + +It is possible to read a CoNLL string or text file and parse it as a spaCy object. This can be useful if you have raw +CoNLL data that you wish to process in different ways. The process is straightforward. + +```python +from spacy_conll import init_parser +from spacy_conll.parser import ConllParser + + +nlp = ConllParser(init_parser("en_core_web_sm", "spacy")) + +doc = nlp.parse_conll_file_as_spacy("path/to/your/conll-sample.txt") +''' +or straight from raw text: +conllstr = """ +# text = From the AP comes this story : +1 From from ADP IN _ 3 case 3:case _ +2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _ +3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _ +4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _ +5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _ +6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _ +""" +doc = nlp.parse_conll_text_as_spacy(conllstr) +''' + +# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc +for sent in doc.sents: + for token in sent: + print(token.text, token.dep_, token.pos_) +``` + +### Command line + +Upon installation, a command-line script is added under tha alias `parse-as-conll`. You can use it to parse a +string or file into CoNLL format given a number of options. + +```shell +parse-as-conll -h +usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE] + [-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v] + [--ignore_pipe_errors] [--no_split_on_newline] + model_or_lang {spacy,stanza,udpipe} + +Parse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output +can be written to stdout or a file, or both. + +positional arguments: + model_or_lang Model or language to use. SpaCy models must be pre-installed, stanza + and udpipe models will be downloaded automatically + {spacy,stanza,udpipe} + Which parser to use. Parsers other than 'spacy' need to be installed + separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the + 'spacy-udpipe' library is required. + +optional arguments: + -h, --help show this help message and exit + -f INPUT_FILE, --input_file INPUT_FILE + Path to file with sentences to parse. Has precedence over 'input_str'. + (default: None) + -a INPUT_ENCODING, --input_encoding INPUT_ENCODING + Encoding of the input file. Default value is system default. (default: + cp1252) + -b INPUT_STR, --input_str INPUT_STR + Input string to parse. (default: None) + -o OUTPUT_FILE, --output_file OUTPUT_FILE + Path to output file. If not specified, the output will be printed on + standard output. (default: None) + -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING + Encoding of the output file. Default value is system default. (default: + cp1252) + -s, --disable_sbd Whether to disable spaCy automatic sentence boundary detection. In + practice, disabling means that every line will be parsed as one + sentence, regardless of its actual content. When 'is_tokenized' is + enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized'). 
+ Only works when using 'spacy' as 'parser'. (default: False) + -t, --is_tokenized Whether your text has already been tokenized (space-seperated). Setting + this option has as an important consequence that no sentence splitting + at all will be done except splitting on new lines. So if your input is + a file, and you want to use pretokenised text, make sure that each line + contains exactly one sentence. (default: False) + -d, --include_headers + Whether to include headers before the output of every sentence. These + headers include the sentence text and the sentence ID as per the CoNLL + format. (default: False) + -e, --no_force_counting + Whether to disable force counting the 'sent_id', starting from 1 and + increasing for each sentence. Instead, 'sent_id' will depend on how + spaCy returns the sentences. Must have 'include_headers' enabled. + (default: False) + -j N_PROCESS, --n_process N_PROCESS + Number of processes to use in nlp.pipe(). -1 will use as many cores as + available. Might not work for a 'parser' other than 'spacy' depending + on your environment. (default: 1) + -v, --verbose Whether to always print the output to stdout, regardless of + 'output_file'. (default: False) + --ignore_pipe_errors Whether to ignore a priori errors concerning 'n_process' By default we + try to determine whether processing works on your system and stop + execution if we think it doesn't. If you know what you are doing, you + can ignore such pre-emptive errors, though, and run the code as-is, + which will then throw the default Python errors when applicable. + (default: False) + --no_split_on_newline + By default, the input file or string is split on newlines for faster + processing of the split up parts. If you want to disable that behavior, + you can use this flag. (default: False) +``` + + +For example, parsing a single line, multi-sentence string: + +```shell +parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers + +# sent_id = 1 +# text = I like cookies. +1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _ +2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _ +3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No +4 . . PUNCT . PunctType=Peri 2 punct _ _ + +# sent_id = 2 +# text = What about you? +1 What what PRON WP _ 2 dep _ _ +2 about about ADP IN _ 0 ROOT _ _ +3 you you PRON PRP Case=Acc|Person=2|PronType=Prs 2 pobj _ SpaceAfter=No +4 ? ? PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No +``` + +For example, parsing a large input file and writing output to a given output file, using four processes: + +```shell +parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4 +``` + + +## Credits + +The first version of this library was inspired by initial work by [rgalhama](https://github.com/rgalhama/spaCy2CoNLLU) + and has evolved a lot since then. + + +%package -n python3-spacy-conll +Summary: A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point. +Provides: python-spacy-conll +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-spacy-conll +# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe + +This module allows you to parse text into CoNLL-U format. 
You can use it as a command line tool, or embed it in your + own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It + also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in + functionality to parse files or text. + +Note that the module simply takes a parser's output and puts it in a formatted string adhering to the linked ConLL-U + format. The output tags depend on the spaCy model used. If you want Universal Depencies tags as output, I advise you + to use this library in combination with [spacy-stanza](https://github.com/explosion/spacy-stanza), which is a spaCy + interface using `stanza` and its models behind the scenes. Those models use the Universal Dependencies formalism and + yield state-of-the-art performance. `stanza` is a new and improved version of `stanfordnlp`. As an alternative to the + Stanford models, you can use the spaCy wrapper for `UDPipe`, [spacy-udpipe](https://github.com/TakeLab/spacy-udpipe), + which is slightly less accurate than `stanza` but much faster. + + +## Installation + +By default, this package automatically installs only [spaCy](https://spacy.io/usage/models#section-quickstart) as + dependency. Because [spaCy's models](https://spacy.io/usage/models) are not necessarily trained on Universal + Dependencies conventions, their output labels are not UD either. By using `spacy-stanza` or `spacy-udpipe`, we get + the easy-to-use interface of spaCy as a wrapper around `stanza` and `UDPipe` respectively, including their models that + *are* trained on UD data. + +**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because + it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install +them manually or use one of the available options as described below. + +If you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects + that `pandas` is installed. See the Usage section for more. + +To install the library, simply use pip. + +```shell +# only includes spacy by default +pip install spacy_conll +``` + +A number of options are available to make installation of additional dependencies easier: + +```shell +# include spacy-stanza and spacy-udpipe +pip install spacy_conll[parsers] +# include pandas +pip install spacy_conll[pd] +# include pandas, spacy-stanza and spacy-udpipe +pip install spacy_conll[all] +# include pandas, spacy-stanza and spacy-udpipe and additional libaries for testing and formatting +pip install spacy_conll[dev] +``` + + +## Usage + +When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for `Token`, sentence `Span` and `Doc`. + Note that arbitrary Span's are not included and do not receive these properties. + +On all three of these levels, two custom properties are exposed by default, `._.conll` and its string + representation `._.conll_str`. However, if you have `pandas` installed, then `._.conll_pd` will + be added automatically, too! + +- `._.conll`: raw CoNLL format + - in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values. + - in sentence Span: a list of its tokens' `._.conll` dictionaries (list of dictionaries). + - in a Doc: a list of its sentences' `._.conll` lists (list of list of dictionaries). 
+ +- `._.conll_str`: string representation of the CoNLL format + - in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline. + - in sentence Span: the expected CoNLL format where each row represents a token. When + `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the + [CoNLL format](https://universaldependencies.org/format.html#sentence-boundaries-and-comments). + - in Doc: all its sentences' `._.conll_str` combined and separated by new lines. + +- `._.conll_pd`: `pandas` representation of the CoNLL format + - in Token: a Series representation of this token's CoNLL properties. + - in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers. + - in Doc: a concatenation of its sentences' DataFrame's, leading to a new a DataFrame whose index is reset. + +You can use `spacy_conll` in your own Python code as a custom pipeline component, or you can use the built-in + command-line script which offers typically needed functionality. See the following section for more. + + +### In Python + +This library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated + as follows. It is important that you import `spacy_conll` before adding the pipe! + +```python +import spacy +nlp = spacy.load("en_core_web_sm") +nlp.add_pipe("conll_formatter", last=True) +``` + +Because this library supports different spaCy wrappers (`spacy`, `stanza`, and `udpipe`), a convenience function is + available as well. With `utils.init_parser` you can easily instantiate a parser with a single line. You can + find the function's signature below. Have a look at the [source code](spacy_conll/utils.py) to read more about all the + possible arguments or try out the [examples](examples/). + +**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence + segmentation, effectively *only* splitting on new lines. With `spacy`, `is_tokenized` disables sentence splitting completely. + +```python +def init_parser( + model_or_lang: str, + parser: str, + *, + is_tokenized: bool = False, + disable_sbd: bool = False, + exclude_spacy_components: Optional[List[str]] = None, + parser_opts: Optional[Dict] = None, + **kwargs, +) +``` + +For instance, if you want to load a Dutch `stanza` model in silent mode with the CoNLL formatter already attached, you + can simply use the following snippet. `parser_opts` is passed to the `stanza` pipeline initialisation automatically. + Any other keyword arguments (`kwargs`), on the other hand, are passed to the `ConllFormatter` initialisation. + +```python +from spacy_conll import init_parser + +nlp = init_parser("nl", "stanza", parser_opts={"verbose": False}) +``` + +The `ConllFormatter` allows you to customize the extension names, and you can also specify conversion maps for the +output properties. + +To illustrate, here is an advanced example, showing the more complex options: + +- `ext_names`: changes the attribute names to a custom key by using a dictionary. +- `conversion_maps`: a two-level dictionary that looks like `{field_name: {tag_name: replacement}}`. In + other words, you can specify in which field a certain value should be replaced by another. This is especially useful + when you are not satisfied with the tagset of a model and wish to change some tags to an alternative0. +- `field_names`: allows you to change the default CoNLL-U field names to your own custom names. 
Similar to the + conversion map above, you should use any of the default field names as keys and add your own key as value. + Possible keys are : "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC". + +The example below + +- shows how to manually add the component; +- changes the custom attribute `conll_pd` to pandas (`conll_pd` only availabe if `pandas` is installed); +- converts any `nsubj` deprel tag to `subj`. + +```python +import spacy + + +nlp = spacy.load("en_core_web_sm") +config = {"ext_names": {"conll_pd": "pandas"}, + "conversion_maps": {"deprel": {"nsubj": "subj"}}} +nlp.add_pipe("conll_formatter", config=config, last=True) +doc = nlp("I like cookies.") +print(doc._.pandas) +``` + +This is the same as: + +```python +from spacy_conll import init_parser + +nlp = init_parser("en_core_web_sm", + "spacy", + ext_names={"conll_pd": "pandas"}, + conversion_maps={"deprel": {"nsubj": "subj"}}) +doc = nlp("I like cookies.") +print(doc._.pandas) +``` + + +The snippets above will output a pandas DataFrame by using `._.pandas` rather than the standard +`._.conll_pd`, and all occurrences of `nsubj` in the deprel field are replaced by `subj`. + +``` + ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC +0 1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 subj _ _ +1 2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _ +2 3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No +3 4 . . PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No +``` + +Another initialization example that would replace the column names "UPOS" with "upostag" amd "XPOS" with "xpostag": + +```python +import spacy + + +nlp = spacy.load("en_core_web_sm") +config = {"field_names": {"UPOS": "upostag", "XPOS": "xpostag"}} +nlp.add_pipe("conll_formatter", config=config, last=True) +``` + +#### Reading CoNLL into a spaCy object + +It is possible to read a CoNLL string or text file and parse it as a spaCy object. This can be useful if you have raw +CoNLL data that you wish to process in different ways. The process is straightforward. + +```python +from spacy_conll import init_parser +from spacy_conll.parser import ConllParser + + +nlp = ConllParser(init_parser("en_core_web_sm", "spacy")) + +doc = nlp.parse_conll_file_as_spacy("path/to/your/conll-sample.txt") +''' +or straight from raw text: +conllstr = """ +# text = From the AP comes this story : +1 From from ADP IN _ 3 case 3:case _ +2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _ +3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _ +4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _ +5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _ +6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _ +""" +doc = nlp.parse_conll_text_as_spacy(conllstr) +''' + +# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc +for sent in doc.sents: + for token in sent: + print(token.text, token.dep_, token.pos_) +``` + +### Command line + +Upon installation, a command-line script is added under tha alias `parse-as-conll`. You can use it to parse a +string or file into CoNLL format given a number of options. 
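+
+As a quick illustration, a minimal invocation parses a short string and prints CoNLL-U to stdout; this sketch
+ assumes the `en_core_web_sm` model is installed, and the full option reference follows below:
+
+```shell
+# Parse a string with the plain spaCy backend and print CoNLL-U, including sentence headers
+parse-as-conll en_core_web_sm spacy --input_str "I like cookies." --include_headers
+```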
+ +```shell +parse-as-conll -h +usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE] + [-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v] + [--ignore_pipe_errors] [--no_split_on_newline] + model_or_lang {spacy,stanza,udpipe} + +Parse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output +can be written to stdout or a file, or both. + +positional arguments: + model_or_lang Model or language to use. SpaCy models must be pre-installed, stanza + and udpipe models will be downloaded automatically + {spacy,stanza,udpipe} + Which parser to use. Parsers other than 'spacy' need to be installed + separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the + 'spacy-udpipe' library is required. + +optional arguments: + -h, --help show this help message and exit + -f INPUT_FILE, --input_file INPUT_FILE + Path to file with sentences to parse. Has precedence over 'input_str'. + (default: None) + -a INPUT_ENCODING, --input_encoding INPUT_ENCODING + Encoding of the input file. Default value is system default. (default: + cp1252) + -b INPUT_STR, --input_str INPUT_STR + Input string to parse. (default: None) + -o OUTPUT_FILE, --output_file OUTPUT_FILE + Path to output file. If not specified, the output will be printed on + standard output. (default: None) + -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING + Encoding of the output file. Default value is system default. (default: + cp1252) + -s, --disable_sbd Whether to disable spaCy automatic sentence boundary detection. In + practice, disabling means that every line will be parsed as one + sentence, regardless of its actual content. When 'is_tokenized' is + enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized'). + Only works when using 'spacy' as 'parser'. (default: False) + -t, --is_tokenized Whether your text has already been tokenized (space-seperated). Setting + this option has as an important consequence that no sentence splitting + at all will be done except splitting on new lines. So if your input is + a file, and you want to use pretokenised text, make sure that each line + contains exactly one sentence. (default: False) + -d, --include_headers + Whether to include headers before the output of every sentence. These + headers include the sentence text and the sentence ID as per the CoNLL + format. (default: False) + -e, --no_force_counting + Whether to disable force counting the 'sent_id', starting from 1 and + increasing for each sentence. Instead, 'sent_id' will depend on how + spaCy returns the sentences. Must have 'include_headers' enabled. + (default: False) + -j N_PROCESS, --n_process N_PROCESS + Number of processes to use in nlp.pipe(). -1 will use as many cores as + available. Might not work for a 'parser' other than 'spacy' depending + on your environment. (default: 1) + -v, --verbose Whether to always print the output to stdout, regardless of + 'output_file'. (default: False) + --ignore_pipe_errors Whether to ignore a priori errors concerning 'n_process' By default we + try to determine whether processing works on your system and stop + execution if we think it doesn't. If you know what you are doing, you + can ignore such pre-emptive errors, though, and run the code as-is, + which will then throw the default Python errors when applicable. + (default: False) + --no_split_on_newline + By default, the input file or string is split on newlines for faster + processing of the split up parts. 
If you want to disable that behavior, + you can use this flag. (default: False) +``` + + +For example, parsing a single line, multi-sentence string: + +```shell +parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers + +# sent_id = 1 +# text = I like cookies. +1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _ +2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _ +3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No +4 . . PUNCT . PunctType=Peri 2 punct _ _ + +# sent_id = 2 +# text = What about you? +1 What what PRON WP _ 2 dep _ _ +2 about about ADP IN _ 0 ROOT _ _ +3 you you PRON PRP Case=Acc|Person=2|PronType=Prs 2 pobj _ SpaceAfter=No +4 ? ? PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No +``` + +For example, parsing a large input file and writing output to a given output file, using four processes: + +```shell +parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4 +``` + + +## Credits + +The first version of this library was inspired by initial work by [rgalhama](https://github.com/rgalhama/spaCy2CoNLLU) + and has evolved a lot since then. + + +%package help +Summary: Development documents and examples for spacy-conll +Provides: python3-spacy-conll-doc +%description help +# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe + +This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your + own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It + also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in + functionality to parse files or text. + +Note that the module simply takes a parser's output and puts it in a formatted string adhering to the linked ConLL-U + format. The output tags depend on the spaCy model used. If you want Universal Depencies tags as output, I advise you + to use this library in combination with [spacy-stanza](https://github.com/explosion/spacy-stanza), which is a spaCy + interface using `stanza` and its models behind the scenes. Those models use the Universal Dependencies formalism and + yield state-of-the-art performance. `stanza` is a new and improved version of `stanfordnlp`. As an alternative to the + Stanford models, you can use the spaCy wrapper for `UDPipe`, [spacy-udpipe](https://github.com/TakeLab/spacy-udpipe), + which is slightly less accurate than `stanza` but much faster. + + +## Installation + +By default, this package automatically installs only [spaCy](https://spacy.io/usage/models#section-quickstart) as + dependency. Because [spaCy's models](https://spacy.io/usage/models) are not necessarily trained on Universal + Dependencies conventions, their output labels are not UD either. By using `spacy-stanza` or `spacy-udpipe`, we get + the easy-to-use interface of spaCy as a wrapper around `stanza` and `UDPipe` respectively, including their models that + *are* trained on UD data. + +**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because + it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install +them manually or use one of the available options as described below. 
+ +If you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects + that `pandas` is installed. See the Usage section for more. + +To install the library, simply use pip. + +```shell +# only includes spacy by default +pip install spacy_conll +``` + +A number of options are available to make installation of additional dependencies easier: + +```shell +# include spacy-stanza and spacy-udpipe +pip install spacy_conll[parsers] +# include pandas +pip install spacy_conll[pd] +# include pandas, spacy-stanza and spacy-udpipe +pip install spacy_conll[all] +# include pandas, spacy-stanza and spacy-udpipe and additional libaries for testing and formatting +pip install spacy_conll[dev] +``` + + +## Usage + +When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for `Token`, sentence `Span` and `Doc`. + Note that arbitrary Span's are not included and do not receive these properties. + +On all three of these levels, two custom properties are exposed by default, `._.conll` and its string + representation `._.conll_str`. However, if you have `pandas` installed, then `._.conll_pd` will + be added automatically, too! + +- `._.conll`: raw CoNLL format + - in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values. + - in sentence Span: a list of its tokens' `._.conll` dictionaries (list of dictionaries). + - in a Doc: a list of its sentences' `._.conll` lists (list of list of dictionaries). + +- `._.conll_str`: string representation of the CoNLL format + - in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline. + - in sentence Span: the expected CoNLL format where each row represents a token. When + `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the + [CoNLL format](https://universaldependencies.org/format.html#sentence-boundaries-and-comments). + - in Doc: all its sentences' `._.conll_str` combined and separated by new lines. + +- `._.conll_pd`: `pandas` representation of the CoNLL format + - in Token: a Series representation of this token's CoNLL properties. + - in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers. + - in Doc: a concatenation of its sentences' DataFrame's, leading to a new a DataFrame whose index is reset. + +You can use `spacy_conll` in your own Python code as a custom pipeline component, or you can use the built-in + command-line script which offers typically needed functionality. See the following section for more. + + +### In Python + +This library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated + as follows. It is important that you import `spacy_conll` before adding the pipe! + +```python +import spacy +nlp = spacy.load("en_core_web_sm") +nlp.add_pipe("conll_formatter", last=True) +``` + +Because this library supports different spaCy wrappers (`spacy`, `stanza`, and `udpipe`), a convenience function is + available as well. With `utils.init_parser` you can easily instantiate a parser with a single line. You can + find the function's signature below. Have a look at the [source code](spacy_conll/utils.py) to read more about all the + possible arguments or try out the [examples](examples/). + +**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence + segmentation, effectively *only* splitting on new lines. 
With `spacy`, `is_tokenized` disables sentence splitting completely. + +```python +def init_parser( + model_or_lang: str, + parser: str, + *, + is_tokenized: bool = False, + disable_sbd: bool = False, + exclude_spacy_components: Optional[List[str]] = None, + parser_opts: Optional[Dict] = None, + **kwargs, +) +``` + +For instance, if you want to load a Dutch `stanza` model in silent mode with the CoNLL formatter already attached, you + can simply use the following snippet. `parser_opts` is passed to the `stanza` pipeline initialisation automatically. + Any other keyword arguments (`kwargs`), on the other hand, are passed to the `ConllFormatter` initialisation. + +```python +from spacy_conll import init_parser + +nlp = init_parser("nl", "stanza", parser_opts={"verbose": False}) +``` + +The `ConllFormatter` allows you to customize the extension names, and you can also specify conversion maps for the +output properties. + +To illustrate, here is an advanced example, showing the more complex options: + +- `ext_names`: changes the attribute names to a custom key by using a dictionary. +- `conversion_maps`: a two-level dictionary that looks like `{field_name: {tag_name: replacement}}`. In + other words, you can specify in which field a certain value should be replaced by another. This is especially useful + when you are not satisfied with the tagset of a model and wish to change some tags to an alternative0. +- `field_names`: allows you to change the default CoNLL-U field names to your own custom names. Similar to the + conversion map above, you should use any of the default field names as keys and add your own key as value. + Possible keys are : "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC". + +The example below + +- shows how to manually add the component; +- changes the custom attribute `conll_pd` to pandas (`conll_pd` only availabe if `pandas` is installed); +- converts any `nsubj` deprel tag to `subj`. + +```python +import spacy + + +nlp = spacy.load("en_core_web_sm") +config = {"ext_names": {"conll_pd": "pandas"}, + "conversion_maps": {"deprel": {"nsubj": "subj"}}} +nlp.add_pipe("conll_formatter", config=config, last=True) +doc = nlp("I like cookies.") +print(doc._.pandas) +``` + +This is the same as: + +```python +from spacy_conll import init_parser + +nlp = init_parser("en_core_web_sm", + "spacy", + ext_names={"conll_pd": "pandas"}, + conversion_maps={"deprel": {"nsubj": "subj"}}) +doc = nlp("I like cookies.") +print(doc._.pandas) +``` + + +The snippets above will output a pandas DataFrame by using `._.pandas` rather than the standard +`._.conll_pd`, and all occurrences of `nsubj` in the deprel field are replaced by `subj`. + +``` + ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC +0 1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 subj _ _ +1 2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _ +2 3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No +3 4 . . PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No +``` + +Another initialization example that would replace the column names "UPOS" with "upostag" amd "XPOS" with "xpostag": + +```python +import spacy + + +nlp = spacy.load("en_core_web_sm") +config = {"field_names": {"UPOS": "upostag", "XPOS": "xpostag"}} +nlp.add_pipe("conll_formatter", config=config, last=True) +``` + +#### Reading CoNLL into a spaCy object + +It is possible to read a CoNLL string or text file and parse it as a spaCy object. 
This can be useful if you have raw +CoNLL data that you wish to process in different ways. The process is straightforward. + +```python +from spacy_conll import init_parser +from spacy_conll.parser import ConllParser + + +nlp = ConllParser(init_parser("en_core_web_sm", "spacy")) + +doc = nlp.parse_conll_file_as_spacy("path/to/your/conll-sample.txt") +''' +or straight from raw text: +conllstr = """ +# text = From the AP comes this story : +1 From from ADP IN _ 3 case 3:case _ +2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _ +3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _ +4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _ +5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _ +6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _ +""" +doc = nlp.parse_conll_text_as_spacy(conllstr) +''' + +# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc +for sent in doc.sents: + for token in sent: + print(token.text, token.dep_, token.pos_) +``` + +### Command line + +Upon installation, a command-line script is added under tha alias `parse-as-conll`. You can use it to parse a +string or file into CoNLL format given a number of options. + +```shell +parse-as-conll -h +usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE] + [-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v] + [--ignore_pipe_errors] [--no_split_on_newline] + model_or_lang {spacy,stanza,udpipe} + +Parse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output +can be written to stdout or a file, or both. + +positional arguments: + model_or_lang Model or language to use. SpaCy models must be pre-installed, stanza + and udpipe models will be downloaded automatically + {spacy,stanza,udpipe} + Which parser to use. Parsers other than 'spacy' need to be installed + separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the + 'spacy-udpipe' library is required. + +optional arguments: + -h, --help show this help message and exit + -f INPUT_FILE, --input_file INPUT_FILE + Path to file with sentences to parse. Has precedence over 'input_str'. + (default: None) + -a INPUT_ENCODING, --input_encoding INPUT_ENCODING + Encoding of the input file. Default value is system default. (default: + cp1252) + -b INPUT_STR, --input_str INPUT_STR + Input string to parse. (default: None) + -o OUTPUT_FILE, --output_file OUTPUT_FILE + Path to output file. If not specified, the output will be printed on + standard output. (default: None) + -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING + Encoding of the output file. Default value is system default. (default: + cp1252) + -s, --disable_sbd Whether to disable spaCy automatic sentence boundary detection. In + practice, disabling means that every line will be parsed as one + sentence, regardless of its actual content. When 'is_tokenized' is + enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized'). + Only works when using 'spacy' as 'parser'. (default: False) + -t, --is_tokenized Whether your text has already been tokenized (space-seperated). Setting + this option has as an important consequence that no sentence splitting + at all will be done except splitting on new lines. So if your input is + a file, and you want to use pretokenised text, make sure that each line + contains exactly one sentence. 
(default: False) + -d, --include_headers + Whether to include headers before the output of every sentence. These + headers include the sentence text and the sentence ID as per the CoNLL + format. (default: False) + -e, --no_force_counting + Whether to disable force counting the 'sent_id', starting from 1 and + increasing for each sentence. Instead, 'sent_id' will depend on how + spaCy returns the sentences. Must have 'include_headers' enabled. + (default: False) + -j N_PROCESS, --n_process N_PROCESS + Number of processes to use in nlp.pipe(). -1 will use as many cores as + available. Might not work for a 'parser' other than 'spacy' depending + on your environment. (default: 1) + -v, --verbose Whether to always print the output to stdout, regardless of + 'output_file'. (default: False) + --ignore_pipe_errors Whether to ignore a priori errors concerning 'n_process' By default we + try to determine whether processing works on your system and stop + execution if we think it doesn't. If you know what you are doing, you + can ignore such pre-emptive errors, though, and run the code as-is, + which will then throw the default Python errors when applicable. + (default: False) + --no_split_on_newline + By default, the input file or string is split on newlines for faster + processing of the split up parts. If you want to disable that behavior, + you can use this flag. (default: False) +``` + + +For example, parsing a single line, multi-sentence string: + +```shell +parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers + +# sent_id = 1 +# text = I like cookies. +1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _ +2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _ +3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No +4 . . PUNCT . PunctType=Peri 2 punct _ _ + +# sent_id = 2 +# text = What about you? +1 What what PRON WP _ 2 dep _ _ +2 about about ADP IN _ 0 ROOT _ _ +3 you you PRON PRP Case=Acc|Person=2|PronType=Prs 2 pobj _ SpaceAfter=No +4 ? ? PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No +``` + +For example, parsing a large input file and writing output to a given output file, using four processes: + +```shell +parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4 +``` + + +## Credits + +The first version of this library was inspired by initial work by [rgalhama](https://github.com/rgalhama/spaCy2CoNLLU) + and has evolved a lot since then. + + +%prep +%autosetup -n spacy-conll-3.4.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . 
+ +%files -n python3-spacy-conll -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 10 2023 Python_Bot - 3.4.0-1 +- Package Spec generated diff --git a/sources b/sources new file mode 100644 index 0000000..b1850ae --- /dev/null +++ b/sources @@ -0,0 +1 @@ +9cabacc829a31a9f3d0b39bfa2f64d39 spacy_conll-3.4.0.tar.gz -- cgit v1.2.3