%global _empty_manifest_terminate_build 0
Name:		python-benepar
Version:	0.2.0
Release:	1
Summary:	Berkeley Neural Parser
License:	Apache Software License
URL:		https://github.com/nikitakit/self-attentive-parser
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/9e/17/c398a35d0f303a534de8ec6949aa2ee68cc6bdbf0930685d92719b97aa1e/benepar-0.2.0.tar.gz
BuildArch:	noarch


%description
`benepar_en3` | English | 95.40 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-small.
`benepar_en3_large` | English | 96.29 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-large.
`benepar_zh2` | Chinese | 92.56 F1 on CTB 5.1 test set. Usage with spaCy supports parsing from raw text, but the NLTK API only supports parsing previously tokenized sentences. Based on Chinese ELECTRA-180G-large.
`benepar_ar` | Arabic | 90.52 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.
`benepar_de` | German | 92.10 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_eu` | Basque | 93.36 F1 on SPMRL2013/2014 test set. Usage with spaCy first requires implementing Basque support in spaCy. Based on XLM-R.
`benepar_fr` | French | 88.43 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_he` | Hebrew | 93.98 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.
`benepar_hu` | Hungarian | 96.19 F1 on SPMRL2013/2014 test set. Usage with spaCy requires a [Hungarian model for spaCy](https://github.com/oroszgy/spacy-hungarian-models). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
`benepar_ko` | Korean | 91.72 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
`benepar_pl` | Polish | 97.15 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_sv` | Swedish | 92.21 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). Based on XLM-R.
`benepar_en3_wsj` | English | **Consider using `benepar_en3` or `benepar_en3_large` instead**. 95.55 F1 on [canonical](https://catalog.ldc.upenn.edu/LDC99T42) WSJ test set used for decades of English constituency parsing publications. Based on BERT-large-uncased. We believe that the revised annotation guidelines used for training `benepar_en3`/`benepar_en3_large` are more suitable for downstream use because they better handle language usage in web text, and are more consistent with modern practices in dependency parsing and libraries like spaCy. Nevertheless, we provide the `benepar_en3_wsj` model for cases where using the revised treebanking conventions is not appropriate, such as benchmarking different models on the same dataset.
## Training
Training requires cloning this repository from GitHub. While the model code in `src/benepar` is distributed in the `benepar` package on PyPI, the training and evaluation scripts directly under `src/` are not.
#### Software Requirements for Training
* Python 3.7 or higher.
* [PyTorch](http://pytorch.org/) 1.6.0, or any compatible version.
* All dependencies required by the `benepar` package, including: [NLTK](https://www.nltk.org/) 3.2, [torch-struct](https://github.com/harvardnlp/pytorch-struct) 0.4, [transformers](https://github.com/huggingface/transformers) 4.3.0, or compatible.
* [pytokenizations](https://github.com/tamuhey/tokenizations/) 0.7.2 or compatible.
* [EVALB](http://nlp.cs.nyu.edu/evalb/). Before starting, run `make` inside the `EVALB/` directory to compile an `evalb` executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run `make` inside the `EVALB_SPMRL/` directory instead.
### Training Instructions
A new model can be trained using the command `python src/main.py train ...`.

%package -n python3-benepar
Summary:	Berkeley Neural Parser
Provides:	python-benepar
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-benepar
`benepar_en3` | English | 95.40 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-small.
`benepar_en3_large` | English | 96.29 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-large.
`benepar_zh2` | Chinese | 92.56 F1 on CTB 5.1 test set. Usage with spaCy supports parsing from raw text, but the NLTK API only supports parsing previously tokenized sentences. Based on Chinese ELECTRA-180G-large.
`benepar_ar` | Arabic | 90.52 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.
`benepar_de` | German | 92.10 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_eu` | Basque | 93.36 F1 on SPMRL2013/2014 test set. Usage with spaCy first requires implementing Basque support in spaCy. Based on XLM-R.
`benepar_fr` | French | 88.43 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_he` | Hebrew | 93.98 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.
`benepar_hu` | Hungarian | 96.19 F1 on SPMRL2013/2014 test set. Usage with spaCy requires a [Hungarian model for spaCy](https://github.com/oroszgy/spacy-hungarian-models). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
`benepar_ko` | Korean | 91.72 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
`benepar_pl` | Polish | 97.15 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_sv` | Swedish | 92.21 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). Based on XLM-R.
`benepar_en3_wsj` | English | **Consider using `benepar_en3` or `benepar_en3_large` instead**. 95.55 F1 on [canonical](https://catalog.ldc.upenn.edu/LDC99T42) WSJ test set used for decades of English constituency parsing publications. Based on BERT-large-uncased. We believe that the revised annotation guidelines used for training `benepar_en3`/`benepar_en3_large` are more suitable for downstream use because they better handle language usage in web text, and are more consistent with modern practices in dependency parsing and libraries like spaCy. Nevertheless, we provide the `benepar_en3_wsj` model for cases where using the revised treebanking conventions is not appropriate, such as benchmarking different models on the same dataset.
## Training
Training requires cloning this repository from GitHub. While the model code in `src/benepar` is distributed in the `benepar` package on PyPI, the training and evaluation scripts directly under `src/` are not.
#### Software Requirements for Training
* Python 3.7 or higher.
* [PyTorch](http://pytorch.org/) 1.6.0, or any compatible version.
* All dependencies required by the `benepar` package, including: [NLTK](https://www.nltk.org/) 3.2, [torch-struct](https://github.com/harvardnlp/pytorch-struct) 0.4, [transformers](https://github.com/huggingface/transformers) 4.3.0, or compatible.
* [pytokenizations](https://github.com/tamuhey/tokenizations/) 0.7.2 or compatible.
* [EVALB](http://nlp.cs.nyu.edu/evalb/). Before starting, run `make` inside the `EVALB/` directory to compile an `evalb` executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run `make` inside the `EVALB_SPMRL/` directory instead.
### Training Instructions
A new model can be trained using the command `python src/main.py train ...`.

%package help
Summary:	Development documents and examples for benepar
Provides:	python3-benepar-doc
%description help
`benepar_en3` | English | 95.40 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-small.
`benepar_en3_large` | English | 96.29 F1 on [revised](https://catalog.ldc.upenn.edu/LDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-large.
`benepar_zh2` | Chinese | 92.56 F1 on CTB 5.1 test set. Usage with spaCy supports parsing from raw text, but the NLTK API only supports parsing previously tokenized sentences. Based on Chinese ELECTRA-180G-large.
`benepar_ar` | Arabic | 90.52 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.
`benepar_de` | German | 92.10 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_eu` | Basque | 93.36 F1 on SPMRL2013/2014 test set. Usage with spaCy first requires implementing Basque support in spaCy. Based on XLM-R.
`benepar_fr` | French | 88.43 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_he` | Hebrew | 93.98 F1 on SPMRL2013/2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.
`benepar_hu` | Hungarian | 96.19 F1 on SPMRL2013/2014 test set. Usage with spaCy requires a [Hungarian model for spaCy](https://github.com/oroszgy/spacy-hungarian-models). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
`benepar_ko` | Korean | 91.72 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.
`benepar_pl` | Polish | 97.15 F1 on SPMRL2013/2014 test set. Based on XLM-R.
`benepar_sv` | Swedish | 92.21 F1 on SPMRL2013/2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https://spacy.io/models/xx#xx_sent_ud_sm) (requires spaCy v3.0). Based on XLM-R.
`benepar_en3_wsj` | English | **Consider using `benepar_en3` or `benepar_en3_large` instead**. 95.55 F1 on [canonical](https://catalog.ldc.upenn.edu/LDC99T42) WSJ test set used for decades of English constituency parsing publications. Based on BERT-large-uncased. We believe that the revised annotation guidelines used for training `benepar_en3`/`benepar_en3_large` are more suitable for downstream use because they better handle language usage in web text, and are more consistent with modern practices in dependency parsing and libraries like spaCy. Nevertheless, we provide the `benepar_en3_wsj` model for cases where using the revised treebanking conventions is not appropriate, such as benchmarking different models on the same dataset.
## Training
Training requires cloning this repository from GitHub. While the model code in `src/benepar` is distributed in the `benepar` package on PyPI, the training and evaluation scripts directly under `src/` are not.
#### Software Requirements for Training
* Python 3.7 or higher.
* [PyTorch](http://pytorch.org/) 1.6.0, or any compatible version.
* All dependencies required by the `benepar` package, including: [NLTK](https://www.nltk.org/) 3.2, [torch-struct](https://github.com/harvardnlp/pytorch-struct) 0.4, [transformers](https://github.com/huggingface/transformers) 4.3.0, or compatible.
* [pytokenizations](https://github.com/tamuhey/tokenizations/) 0.7.2 or compatible.
* [EVALB](http://nlp.cs.nyu.edu/evalb/). Before starting, run `make` inside the `EVALB/` directory to compile an `evalb` executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run `make` inside the `EVALB_SPMRL/` directory instead.
### Training Instructions
A new model can be trained using the command `python src/main.py train ...`.

%prep
%autosetup -n benepar-0.2.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-benepar -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 0.2.0-1
- Package Spec generated