%global _empty_manifest_terminate_build 0
Name: python-errant
Version: 2.3.3
Release: 1
Summary: The ERRor ANnotation Toolkit (ERRANT). Automatically extract and classify edits in parallel sentences.
License: MIT
URL: https://github.com/chrisjbryant/errant
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/64/2f/712b8c24aa36a7d52ccd6ba354bd330d10c5b1dd1a0f8eee36a46028dd50/errant-2.3.3.tar.gz
BuildArch: noarch
Requires: python3-spacy
Requires: python3-rapidfuzz
%description
# ERRANT v2.3.3
This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:
> Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. [**Automatic annotation and evaluation of error types for grammatical error correction**](https://www.aclweb.org/anthology/P17-1074/). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.
> Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. [**Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments**](https://www.aclweb.org/anthology/C16-1079/). In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html). In particular, see Chapter 5 for definitions of error types.
# Overview
The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.
### Example:
**Original**: This are gramamtical sentence .
**Corrected**: This is a grammatical sentence .
**Output M2**:
S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
In M2 format, a line preceded by S denotes an original sentence while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons (see the CoNLL-2014 shared task) while the last field is the annotator id.
A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional; otherwise, a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. Keep this in mind when combining individual M2 files, as missing noops can affect evaluation.
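The field layout described above can be read back with a few lines of plain Python. This is only a sketch based on the example output shown here, not code from ERRANT itself; the `M2Edit` tuple and the `parse_m2_edit` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class M2Edit(NamedTuple):
    """Hypothetical container for the fields of one M2 edit line."""
    start: int
    end: int
    type: str
    correction: str
    annotator: int


def parse_m2_edit(line: str) -> M2Edit:
    """Split one 'A ...' line of an M2 file into its fields."""
    if not line.startswith("A "):
        raise ValueError("not an M2 edit line")
    fields = line[2:].split("|||")
    start, end = (int(x) for x in fields[0].split())
    # fields[3] and fields[4] are the two legacy CoNLL-2014 fields.
    return M2Edit(start, end, fields[1], fields[2], int(fields[5]))


edit = parse_m2_edit("A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0")
noop = parse_m2_edit("A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1")
print(edit)
print(noop)
```

A noop line parses the same way; it can be recognised by its `noop` type and its `-1 -1` offsets.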
# Installation
## Pip Install
The easiest way to install ERRANT and its dependencies is using `pip`. We also recommend installing it in a clean virtual environment (e.g. with `venv`). The latest version of ERRANT only supports Python >= 3.6.
```
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install errant
python3 -m spacy download en
```
This will create and activate a new Python 3 environment called `errant_env` in the current directory. `pip` will then update some setup tools and install ERRANT, [spaCy](https://spacy.io/), [rapidfuzz](https://pypi.org/project/rapidfuzz/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but you must activate it again whenever you want to use ERRANT.
#### ERRANT and spaCy
ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. However, spaCy v1.9.0 does not work with Python >= 3.7, so we had to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system that trades speed for accuracy, ERRANT v2.2 is **~4x slower** than ERRANT v2.1. We have not yet extended ERRANT to work with spaCy 3, but preliminary tests suggest ERRANT will become even slower.
Consequently, we recommend ERRANT v2.1.0 if speed is a priority and you can use Python < 3.7.
```
pip3 install errant==2.1.0
```
#### BEA-2019 Shared Task
ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores. You can also use [Codalab](https://competitions.codalab.org/competitions/20228) to evaluate anonymously on the shared task datasets. ERRANT v2.0.0 is not compatible with Python >= 3.7.
```
pip3 install errant==2.0.0
```
## Source Install
If you prefer to install ERRANT from source, you can instead run the following commands:
```
git clone https://github.com/chrisjbryant/errant.git
cd errant
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install -e .
python3 -m spacy download en
```
This will clone the ERRANT source from GitHub into the current directory, build and activate a Python environment inside it, and then install ERRANT and all its dependencies. If you wish to modify ERRANT code, this is the recommended way to install it.
# Usage
## CLI
Three main commands are provided with ERRANT: `errant_parallel`, `errant_m2` and `errant_compare`. You can run them from anywhere on the command line without having to invoke a specific python script.
1. `errant_parallel`
This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line.
Example:
```
errant_parallel -orig <orig_file> -cor <cor_file1> [<cor_file2> ...] -out <out_m2>
```
2. `errant_m2`
This is a variant of `errant_parallel` that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. `-gold` will only classify the existing edits, while `-auto` will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved.
Example:
```
errant_m2 {-auto|-gold} <m2_file> -out <out_m2>
```
3. `errant_compare`
This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The `-cat {1,2,3}` flag can be used to evaluate error types at increasing levels of granularity, while the `-ds` or `-dt` flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown.
Examples:
```
errant_compare -hyp <hyp_m2> -ref <ref_m2>
errant_compare -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds
errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}
```
All these scripts also have additional advanced command line options which can be displayed using the `-h` flag.
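The Precision, Recall and F-score that `errant_compare` reports follow the usual definitions over TP, FP and FN counts. The sketch below shows that arithmetic for a generic F-beta score with beta = 0.5 as the default. The function name `prf` and the handling of empty denominators are assumptions made for illustration; they are not taken from ERRANT's source.

```python
def prf(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Return (precision, recall, F-beta) from raw counts."""
    # Assumed convention: an empty denominator yields a score of 1.0.
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f


# Hypothetical counts: 8 correct edits, 2 spurious edits, 8 missed edits.
p, r, f = prf(tp=8, fp=2, fn=8)
print(round(p, 4), round(r, 4), round(f, 4))
```

With beta = 0.5, precision is weighted twice as much as recall, so a system that proposes few but accurate edits scores higher than one that proposes many noisy ones.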
## API
As of v2.0.0, ERRANT also comes with an API.
### Quick Start
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)
```
### Loading
`errant`.**load**(lang, nlp=None)
Create an ERRANT Annotator object. The `lang` parameter currently only accepts `'en'` for English, but we hope to extend it for other languages in the future. The optional `nlp` parameter can be used if you have already preloaded spacy and do not want ERRANT to load it again.
```
import errant
import spacy
nlp = spacy.load('en')
annotator = errant.load('en', nlp)
```
### Annotator Objects
An Annotator object is the main interface for ERRANT.
#### Methods
`annotator`.**parse**(string, tokenise=False)
Lemmatise, POS tag, and parse a text string with spacy. Set `tokenise` to True to also word tokenise with spacy. Returns a spacy Doc object.
`annotator`.**align**(orig, cor, lev=False)
Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the `lev` flag can be used for a standard Levenshtein alignment. Returns an Alignment object.
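For reference, the standard Levenshtein alignment that the `lev` flag refers to is built on a token-level cost matrix like the one sketched below. This is a textbook version with unit costs, written for illustration only; ERRANT's default alignment uses linguistically informed costs and also handles transpositions, neither of which is modelled here. The function name `levenshtein_matrix` is hypothetical.

```python
def levenshtein_matrix(orig, cor):
    """Build the unit-cost Levenshtein matrix between two token lists."""
    rows, cols = len(orig) + 1, len(cor) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i  # delete every original token
    for j in range(1, cols):
        cost[0][j] = j  # insert every corrected token
    for i in range(1, rows):
        for j in range(1, cols):
            # A substitution is free when the two tokens match.
            sub = cost[i - 1][j - 1] + (orig[i - 1] != cor[j - 1])
            cost[i][j] = min(cost[i - 1][j] + 1, cost[i][j - 1] + 1, sub)
    return cost


orig_toks = "This are gramamtical sentence .".split()
cor_toks = "This is a grammatical sentence .".split()
matrix = levenshtein_matrix(orig_toks, cor_toks)
print(matrix[-1][-1])  # total edit distance between the two token lists
```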
`annotator`.**merge**(alignment, merging='rules')
Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:
1. rules: Use a rule-based merging strategy (default)
2. all-split: Merge nothing: MSSDI -> M, S, S, D, I
3. all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
4. all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I
Returns a list of Edit objects.
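The three non-rule strategies can be illustrated on the operation string used in the list above, where M is a match, S a substitution, D a deletion and I an insertion. The sketch below only groups operation labels and is not ERRANT's merger; the `group_ops` helper is a hypothetical name, and the rule-based strategy is deliberately left out.

```python
from itertools import groupby
from typing import List


def group_ops(ops: str, merging: str) -> List[str]:
    """Group an alignment operation string under a merging strategy."""
    if merging == "all-split":
        return list(ops)
    if merging == "all-merge":
        key = lambda op: op == "M"  # match vs. non-match
    elif merging == "all-equal":
        key = lambda op: op  # identical operation types
    else:
        raise ValueError(f"unsupported strategy: {merging}")
    groups = []
    for _, run in groupby(ops, key=key):
        run = "".join(run)
        # Matches are never merged with each other; keep them one by one.
        groups.extend(run if run[0] == "M" else [run])
    return groups


for strategy in ("all-split", "all-merge", "all-equal"):
    print(strategy, group_ops("MSSDI", strategy))
```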
`annotator`.**classify**(edit)
Classify an edit. Sets the `edit.type` attribute in an Edit object and returns the same Edit object.
`annotator`.**annotate**(orig, cor, lev=False, merging='rules')
Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running `annotator.align`, `annotator.merge` and `annotator.classify` in sequence. Returns a list of Edit objects.
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
alignment = annotator.align(orig, cor)
edits = annotator.merge(alignment)
for e in edits:
    e = annotator.classify(e)
```
`annotator`.**import_edit**(orig, cor, edit, min=True, old_cat=False)
Load an Edit object from a list. `orig` and `cor` must be spacy-parsed Doc objects and the edit must be of the form: `[o_start, o_end, c_start, c_end(, type)]`. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The `type` value is an optional string that denotes the error type of the edit (if known). Set `min` to True to minimise the edit (e.g. [a b -> a c] = [b -> c]) and `old_cat` to True to preserve the old error type category (i.e. turn off the classifier).
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edit = [1, 2, 1, 2, 'SVA'] # are -> is
edit = annotator.import_edit(orig, cor, edit)
print(edit.to_m2())
```
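The effect of `min=True` can be pictured as stripping tokens that the original and corrected sides share at either end of the edit. The following sketch works on plain token lists rather than spaCy Doc objects and is an assumption about the general idea, not a copy of ERRANT's implementation; `minimise` is a hypothetical helper.

```python
def minimise(o_toks, c_toks, o_start=0, c_start=0):
    """Trim tokens shared at the start and end of both sides of an edit."""
    o_toks, c_toks = list(o_toks), list(c_toks)
    # Drop the common prefix and advance both start offsets.
    while o_toks and c_toks and o_toks[0] == c_toks[0]:
        o_toks, c_toks = o_toks[1:], c_toks[1:]
        o_start += 1
        c_start += 1
    # Drop the common suffix; the end offsets shrink with the token lists.
    while o_toks and c_toks and o_toks[-1] == c_toks[-1]:
        o_toks, c_toks = o_toks[:-1], c_toks[:-1]
    return (o_start, o_start + len(o_toks), o_toks,
            c_start, c_start + len(c_toks), c_toks)


# [a b -> a c] becomes [b -> c]
result = minimise(["a", "b"], ["a", "c"])
print(result)
```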
### Alignment Objects
An Alignment object is created from two spacy-parsed text sequences.
#### Attributes
`alignment`.**orig**
`alignment`.**cor**
The spacy-parsed original and corrected text sequences.
`alignment`.**cost_matrix**
`alignment`.**op_matrix**
The cost matrix and operation matrix produced by the alignment.
`alignment`.**align_seq**
The first cheapest alignment between the two sequences.
### Edit Objects
An Edit object represents a transformation between two text sequences.
#### Attributes
`edit`.**o_start**
`edit`.**o_end**
`edit`.**o_toks**
`edit`.**o_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *original* text.
`edit`.**c_start**
`edit`.**c_end**
`edit`.**c_toks**
`edit`.**c_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *corrected* text.
`edit`.**type**
The error type string.
#### Methods
`edit`.**to_m2**(id=0)
Format the edit for an output M2 file. `id` is the annotator id.
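The kind of line that `to_m2` produces can be approximated from the M2 example in the Overview. The sketch below builds such a line from plain values; the `format_m2_edit` helper is hypothetical, and the fixed `REQUIRED` and `-NONE-` fields are copied from the example output rather than from ERRANT's code.

```python
def format_m2_edit(start, end, etype, correction, annotator_id=0):
    """Join edit fields into one M2 'A' line."""
    fields = [
        f"A {start} {end}",
        etype,
        correction,
        "REQUIRED",  # legacy field kept for CoNLL-2014 compatibility
        "-NONE-",    # legacy field kept for CoNLL-2014 compatibility
        str(annotator_id),
    ]
    return "|||".join(fields)


line = format_m2_edit(1, 2, "R:VERB:SVA", "is")
print(line)
```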
## Development for Other Languages
If you want to develop ERRANT for other languages, you should mimic the `errant/en` directory structure. For example, ERRANT for French should import a merger from `errant.fr.merger` and a classifier from `errant.fr.classifier` that respectively have equivalent `get_rule_edits` and `classify` methods. You will also need to add `'fr'` to the list of supported languages in `errant/__init__.py`.
# Contact
If you have any questions, suggestions or bug reports, you can contact the authors at:
christopher d0t bryant at cl.cam.ac.uk
mariano d0t felice at cl.cam.ac.uk
%package -n python3-errant
Summary: The ERRor ANnotation Toolkit (ERRANT). Automatically extract and classify edits in parallel sentences.
Provides: python-errant
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-errant
# ERRANT v2.3.3
This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:
> Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. [**Automatic annotation and evaluation of error types for grammatical error correction**](https://www.aclweb.org/anthology/P17-1074/). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.
> Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. [**Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments**](https://www.aclweb.org/anthology/C16-1079/). In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html). In particular, see Chapter 5 for definitions of error types.
# Overview
The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.
### Example:
**Original**: This are gramamtical sentence .
**Corrected**: This is a grammatical sentence .
**Output M2**:
S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
In M2 format, a line preceded by S denotes an original sentence while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons (see the CoNLL-2014 shared task) while the last field is the annotator id.
A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional; otherwise, a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. Keep this in mind when combining individual M2 files, as missing noops can affect evaluation.
# Installation
## Pip Install
The easiest way to install ERRANT and its dependencies is using `pip`. We also recommend installing it in a clean virtual environment (e.g. with `venv`). The latest version of ERRANT only supports Python >= 3.6.
```
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install errant
python3 -m spacy download en
```
This will create and activate a new Python 3 environment called `errant_env` in the current directory. `pip` will then update some setup tools and install ERRANT, [spaCy](https://spacy.io/), [rapidfuzz](https://pypi.org/project/rapidfuzz/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but you must activate it again whenever you want to use ERRANT.
#### ERRANT and spaCy
ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. However, spaCy v1.9.0 does not work with Python >= 3.7, so we had to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system that trades speed for accuracy, ERRANT v2.2 is **~4x slower** than ERRANT v2.1. We have not yet extended ERRANT to work with spaCy 3, but preliminary tests suggest ERRANT will become even slower.
Consequently, we recommend ERRANT v2.1.0 if speed is a priority and you can use Python < 3.7.
```
pip3 install errant==2.1.0
```
#### BEA-2019 Shared Task
ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores. You can also use [Codalab](https://competitions.codalab.org/competitions/20228) to evaluate anonymously on the shared task datasets. ERRANT v2.0.0 is not compatible with Python >= 3.7.
```
pip3 install errant==2.0.0
```
## Source Install
If you prefer to install ERRANT from source, you can instead run the following commands:
```
git clone https://github.com/chrisjbryant/errant.git
cd errant
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install -e .
python3 -m spacy download en
```
This will clone the ERRANT source from GitHub into the current directory, build and activate a Python environment inside it, and then install ERRANT and all its dependencies. If you wish to modify ERRANT code, this is the recommended way to install it.
# Usage
## CLI
Three main commands are provided with ERRANT: `errant_parallel`, `errant_m2` and `errant_compare`. You can run them from anywhere on the command line without having to invoke a specific python script.
1. `errant_parallel`
This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line.
Example:
```
errant_parallel -orig <orig_file> -cor <cor_file1> [<cor_file2> ...] -out <out_m2>
```
2. `errant_m2`
This is a variant of `errant_parallel` that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. `-gold` will only classify the existing edits, while `-auto` will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved.
Example:
```
errant_m2 {-auto|-gold} <m2_file> -out <out_m2>
```
3. `errant_compare`
This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The `-cat {1,2,3}` flag can be used to evaluate error types at increasing levels of granularity, while the `-ds` or `-dt` flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown.
Examples:
```
errant_compare -hyp <hyp_m2> -ref <ref_m2>
errant_compare -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds
errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}
```
All these scripts also have additional advanced command line options which can be displayed using the `-h` flag.
## API
As of v2.0.0, ERRANT also comes with an API.
### Quick Start
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)
```
### Loading
`errant`.**load**(lang, nlp=None)
Create an ERRANT Annotator object. The `lang` parameter currently only accepts `'en'` for English, but we hope to extend it for other languages in the future. The optional `nlp` parameter can be used if you have already preloaded spacy and do not want ERRANT to load it again.
```
import errant
import spacy
nlp = spacy.load('en')
annotator = errant.load('en', nlp)
```
### Annotator Objects
An Annotator object is the main interface for ERRANT.
#### Methods
`annotator`.**parse**(string, tokenise=False)
Lemmatise, POS tag, and parse a text string with spacy. Set `tokenise` to True to also word tokenise with spacy. Returns a spacy Doc object.
`annotator`.**align**(orig, cor, lev=False)
Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the `lev` flag can be used for a standard Levenshtein alignment. Returns an Alignment object.
`annotator`.**merge**(alignment, merging='rules')
Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:
1. rules: Use a rule-based merging strategy (default)
2. all-split: Merge nothing: MSSDI -> M, S, S, D, I
3. all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
4. all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I
Returns a list of Edit objects.
`annotator`.**classify**(edit)
Classify an edit. Sets the `edit.type` attribute in an Edit object and returns the same Edit object.
`annotator`.**annotate**(orig, cor, lev=False, merging='rules')
Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running `annotator.align`, `annotator.merge` and `annotator.classify` in sequence. Returns a list of Edit objects.
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
alignment = annotator.align(orig, cor)
edits = annotator.merge(alignment)
for e in edits:
    e = annotator.classify(e)
```
`annotator`.**import_edit**(orig, cor, edit, min=True, old_cat=False)
Load an Edit object from a list. `orig` and `cor` must be spacy-parsed Doc objects and the edit must be of the form: `[o_start, o_end, c_start, c_end(, type)]`. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The `type` value is an optional string that denotes the error type of the edit (if known). Set `min` to True to minimise the edit (e.g. [a b -> a c] = [b -> c]) and `old_cat` to True to preserve the old error type category (i.e. turn off the classifier).
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edit = [1, 2, 1, 2, 'SVA'] # are -> is
edit = annotator.import_edit(orig, cor, edit)
print(edit.to_m2())
```
### Alignment Objects
An Alignment object is created from two spacy-parsed text sequences.
#### Attributes
`alignment`.**orig**
`alignment`.**cor**
The spacy-parsed original and corrected text sequences.
`alignment`.**cost_matrix**
`alignment`.**op_matrix**
The cost matrix and operation matrix produced by the alignment.
`alignment`.**align_seq**
The first cheapest alignment between the two sequences.
### Edit Objects
An Edit object represents a transformation between two text sequences.
#### Attributes
`edit`.**o_start**
`edit`.**o_end**
`edit`.**o_toks**
`edit`.**o_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *original* text.
`edit`.**c_start**
`edit`.**c_end**
`edit`.**c_toks**
`edit`.**c_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *corrected* text.
`edit`.**type**
The error type string.
#### Methods
`edit`.**to_m2**(id=0)
Format the edit for an output M2 file. `id` is the annotator id.
## Development for Other Languages
If you want to develop ERRANT for other languages, you should mimic the `errant/en` directory structure. For example, ERRANT for French should import a merger from `errant.fr.merger` and a classifier from `errant.fr.classifier` that respectively have equivalent `get_rule_edits` and `classify` methods. You will also need to add `'fr'` to the list of supported languages in `errant/__init__.py`.
# Contact
If you have any questions, suggestions or bug reports, you can contact the authors at:
christopher d0t bryant at cl.cam.ac.uk
mariano d0t felice at cl.cam.ac.uk
%package help
Summary: Development documents and examples for errant
Provides: python3-errant-doc
%description help
# ERRANT v2.3.3
This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:
> Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. [**Automatic annotation and evaluation of error types for grammatical error correction**](https://www.aclweb.org/anthology/P17-1074/). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.
> Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. [**Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments**](https://www.aclweb.org/anthology/C16-1079/). In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html). In particular, see Chapter 5 for definitions of error types.
# Overview
The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.
### Example:
**Original**: This are gramamtical sentence .
**Corrected**: This is a grammatical sentence .
**Output M2**:
S This are gramamtical sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
In M2 format, a line preceded by S denotes an original sentence while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons (see the CoNLL-2014 shared task) while the last field is the annotator id.
A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional; otherwise, a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. Keep this in mind when combining individual M2 files, as missing noops can affect evaluation.
# Installation
## Pip Install
The easiest way to install ERRANT and its dependencies is using `pip`. We also recommend installing it in a clean virtual environment (e.g. with `venv`). The latest version of ERRANT only supports Python >= 3.6.
```
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install errant
python3 -m spacy download en
```
This will create and activate a new Python 3 environment called `errant_env` in the current directory. `pip` will then update some setup tools and install ERRANT, [spaCy](https://spacy.io/), [rapidfuzz](https://pypi.org/project/rapidfuzz/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but you must activate it again whenever you want to use ERRANT.
#### ERRANT and spaCy
ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. However, spaCy v1.9.0 does not work with Python >= 3.7, so we had to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system that trades speed for accuracy, ERRANT v2.2 is **~4x slower** than ERRANT v2.1. We have not yet extended ERRANT to work with spaCy 3, but preliminary tests suggest ERRANT will become even slower.
Consequently, we recommend ERRANT v2.1.0 if speed is a priority and you can use Python < 3.7.
```
pip3 install errant==2.1.0
```
#### BEA-2019 Shared Task
ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores. You can also use [Codalab](https://competitions.codalab.org/competitions/20228) to evaluate anonymously on the shared task datasets. ERRANT v2.0.0 is not compatible with Python >= 3.7.
```
pip3 install errant==2.0.0
```
## Source Install
If you prefer to install ERRANT from source, you can instead run the following commands:
```
git clone https://github.com/chrisjbryant/errant.git
cd errant
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install -e .
python3 -m spacy download en
```
This will clone the ERRANT source from GitHub into the current directory, create and activate a Python environment inside it, and then install ERRANT and all its dependencies. If you wish to modify ERRANT code, this is the recommended way to install it.
# Usage
## CLI
Three main commands are provided with ERRANT: `errant_parallel`, `errant_m2` and `errant_compare`. You can run them from anywhere on the command line without having to invoke a specific python script.
1. `errant_parallel`
This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line.
Example:
```
errant_parallel -orig <orig_file> -cor <cor_file1> [<cor_file2> ...] -out <out_m2>
```
2. `errant_m2`
This is a variant of `errant_parallel` that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. `-gold` will only classify the existing edits, while `-auto` will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved.
Example:
```
errant_m2 {-auto|-gold} <m2_file> -out <out_m2>
```
3. `errant_compare`
This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The `-cat {1,2,3}` flag can be used to evaluate error types at increasing levels of granularity, while the `-ds` or `-dt` flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown.
Examples:
```
errant_compare -hyp <hyp_m2> -ref <ref_m2>
errant_compare -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds
errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}
```
All these scripts also have additional advanced command line options which can be displayed using the `-h` flag.
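The Precision, Recall and F-score reported by `errant_compare` follow the standard definitions in terms of TP, FP and FN. The sketch below shows the arithmetic only. The function name `prf` and the handling of zero denominators are assumptions and may not match ERRANT's own evaluation code.

```python
def prf(tp, fp, fn, beta=0.5):
    """Precision, recall and F-beta from raw counts (illustrative, not ERRANT's code)."""
    # Assumption: with nothing proposed (or nothing to find) the ratio defaults to 1.0
    p = tp / (tp + fp) if (tp + fp) else 1.0
    r = tp / (tp + fn) if (tp + fn) else 1.0
    # F-beta weights precision more heavily than recall when beta < 1
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if (p + r) else 0.0
    return p, r, f

p, r, f = prf(tp=8, fp=2, fn=8)
print(round(p, 4), round(r, 4), round(f, 4))
```

With beta = 0.5, precision counts twice as much as recall, which is why a system that proposes few but accurate edits can outscore one that proposes many.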
## API
As of v2.0.0, ERRANT also comes with a Python API.
### Quick Start
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)
```
### Loading
`errant`.**load**(lang, nlp=None)
Create an ERRANT Annotator object. The `lang` parameter currently only accepts `'en'` for English, but we hope to extend it for other languages in the future. The optional `nlp` parameter can be used if you have already preloaded spacy and do not want ERRANT to load it again.
```
import errant
import spacy
nlp = spacy.load('en')
annotator = errant.load('en', nlp)
```
### Annotator Objects
An Annotator object is the main interface for ERRANT.
#### Methods
`annotator`.**parse**(string, tokenise=False)
Lemmatise, POS tag, and parse a text string with spacy. Set `tokenise` to True to also word tokenise with spacy. Returns a spacy Doc object.
`annotator`.**align**(orig, cor, lev=False)
Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the `lev` flag can be used for a standard Levenshtein alignment. Returns an Alignment object.
`annotator`.**merge**(alignment, merging='rules')
Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:
1. rules: Use a rule-based merging strategy (default)
2. all-split: Merge nothing: MSSDI -> M, S, S, D, I
3. all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
4. all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I
Returns a list of Edit objects.
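The three non-rule strategies can be illustrated on a plain string of alignment operations. This is only a sketch of the grouping behaviour shown in the list above; the function `merge_ops` is hypothetical, works on characters rather than Alignment objects, and does not cover the default `rules` strategy.

```python
from itertools import groupby

def merge_ops(ops, strategy):
    """Group a string of operations (M, S, D, I) under one simple merging strategy."""
    if strategy == "all-split":
        # Every operation stays on its own
        return list(ops)
    if strategy == "all-merge":
        # Each run of adjacent non-matches becomes one group; matches stay single
        key = lambda op: op == "M"
    elif strategy == "all-equal":
        # Each run of adjacent identical non-matches becomes one group
        key = lambda op: op
    else:
        raise ValueError("unknown strategy: " + strategy)
    groups = []
    for _, run in groupby(ops, key=key):
        run = "".join(run)
        if run[0] == "M":
            groups.extend(run)   # keep each match separate
        else:
            groups.append(run)   # merge the whole run
    return groups

for name in ("all-split", "all-merge", "all-equal"):
    print(name, merge_ops("MSSDI", name))
```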
`annotator`.**classify**(edit)
Classify an edit. Sets the `edit.type` attribute in an Edit object and returns the same Edit object.
`annotator`.**annotate**(orig, cor, lev=False, merging='rules')
Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running `annotator.align`, `annotator.merge` and `annotator.classify` in sequence. Returns a list of Edit objects.
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
alignment = annotator.align(orig, cor)
edits = annotator.merge(alignment)
for e in edits:
    e = annotator.classify(e)
```
`annotator`.**import_edit**(orig, cor, edit, min=True, old_cat=False)
Load an Edit object from a list. `orig` and `cor` must be spacy-parsed Doc objects and the edit must be of the form: `[o_start, o_end, c_start, c_end(, type)]`. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The `type` value is an optional string that denotes the error type of the edit (if known). Set `min` to True to minimise the edit (e.g. [a b -> a c] = [b -> c]) and `old_cat` to True to preserve the old error type category (i.e. turn off the classifier).
```
import errant
annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edit = [1, 2, 1, 2, 'SVA'] # are -> is
edit = annotator.import_edit(orig, cor, edit)
print(edit.to_m2())
```
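The effect of `min` can be sketched on plain token lists: shared tokens at the start and end of the edit are stripped and the offsets adjusted accordingly. The function `minimise` below is a hypothetical illustration that compares plain strings; it is not ERRANT's implementation, which operates on spacy tokens.

```python
def minimise(o_toks, c_toks, o_start=0, c_start=0):
    """Strip tokens shared at both ends of an edit and return the adjusted spans."""
    o_end = o_start + len(o_toks)
    c_end = c_start + len(c_toks)
    # Remove the common prefix and move both start offsets forward
    while o_toks and c_toks and o_toks[0] == c_toks[0]:
        o_toks, c_toks = o_toks[1:], c_toks[1:]
        o_start += 1
        c_start += 1
    # Remove the common suffix and move both end offsets backward
    while o_toks and c_toks and o_toks[-1] == c_toks[-1]:
        o_toks, c_toks = o_toks[:-1], c_toks[:-1]
        o_end -= 1
        c_end -= 1
    return (o_start, o_end, o_toks), (c_start, c_end, c_toks)

# [a b -> a c] becomes [b -> c]
print(minimise(["a", "b"], ["a", "c"]))
```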
### Alignment Objects
An Alignment object is created from two spacy-parsed text sequences.
#### Attributes
`alignment`.**orig**
`alignment`.**cor**
The spacy-parsed original and corrected text sequences.
`alignment`.**cost_matrix**
`alignment`.**op_matrix**
The cost matrix and operation matrix produced by the alignment.
`alignment`.**align_seq**
The first cheapest alignment between the two sequences.
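To make the three attributes concrete, here is a sketch of a standard Levenshtein alignment over token lists that builds a cost matrix, an operation matrix and one cheapest operation sequence. It is an assumption-laden illustration: the function name `lev_align`, the unit costs and the tie-breaking order are choices made for this example, and it does not reproduce ERRANT's default linguistically-enhanced Damerau-Levenshtein alignment.

```python
def lev_align(orig, cor):
    """Return (cost_matrix, op_matrix, align_seq) for two token lists."""
    n, m = len(orig), len(cor)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    op = [["O"] * (m + 1) for _ in range(n + 1)]
    # First column: delete every original token; first row: insert every corrected token
    for i in range(1, n + 1):
        cost[i][0], op[i][0] = i, "D"
    for j in range(1, m + 1):
        cost[0][j], op[0][j] = j, "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if orig[i - 1] == cor[j - 1]:
                # Identical tokens match at no extra cost
                cost[i][j], op[i][j] = cost[i - 1][j - 1], "M"
            else:
                candidates = [
                    (cost[i - 1][j - 1] + 1, "S"),  # substitution
                    (cost[i - 1][j] + 1, "D"),      # deletion
                    (cost[i][j - 1] + 1, "I"),      # insertion
                ]
                # min keeps the first candidate on ties
                cost[i][j], op[i][j] = min(candidates, key=lambda c: c[0])
    # Walk back from the bottom-right cell to recover one cheapest sequence
    seq, i, j = [], n, m
    while i > 0 or j > 0:
        o = op[i][j]
        seq.append(o)
        if o in ("M", "S"):
            i, j = i - 1, j - 1
        elif o == "D":
            i -= 1
        else:
            j -= 1
    seq.reverse()
    return cost, op, seq

cost, op, seq = lev_align(["a", "b"], ["a", "c"])
print(cost[-1][-1], seq)  # 1 ['M', 'S']
```

When several alignments share the lowest cost, the sequence recovered depends on the tie-breaking order, which is why only one of them is stored.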
### Edit Objects
An Edit object represents a transformation between two text sequences.
#### Attributes
`edit`.**o_start**
`edit`.**o_end**
`edit`.**o_toks**
`edit`.**o_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *original* text.
`edit`.**c_start**
`edit`.**c_end**
`edit`.**c_toks**
`edit`.**c_str**
The start and end offsets, the spacy tokens, and the string for the edit in the *corrected* text.
`edit`.**type**
The error type string.
#### Methods
`edit`.**to_m2**(id=0)
Format the edit for an output M2 file. `id` is the annotator id.
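As a rough picture of what an output line looks like, the sketch below joins the edit fields in the order used by the M2 format. `SimpleEdit` and `format_m2_edit` are hypothetical stand-ins rather than ERRANT classes, and the `REQUIRED` and `-NONE-` values for the two historical fields are assumptions based on the CoNLL-2014 convention.

```python
from collections import namedtuple

# Hypothetical stand-in for an Edit object, holding only what the output line needs
SimpleEdit = namedtuple("SimpleEdit", "o_start o_end type c_str")

def format_m2_edit(edit, annotator_id=0):
    """Build one M2 edit line from an edit-like object (illustrative only)."""
    span = "A {} {}".format(edit.o_start, edit.o_end)
    fields = [span, edit.type, edit.c_str, "REQUIRED", "-NONE-", str(annotator_id)]
    return "|||".join(fields)

print(format_m2_edit(SimpleEdit(1, 2, "R:VERB:SVA", "is")))
# A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
```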
## Development for Other Languages
If you want to develop ERRANT for other languages, you should mimic the `errant/en` directory structure. For example, ERRANT for French should import a merger from `errant.fr.merger` and a classifier from `errant.fr.classifier` that respectively have equivalent `get_rule_edits` and `classify` methods. You will also need to add `'fr'` to the list of supported languages in `errant/__init__.py`.
# Contact
If you have any questions, suggestions or bug reports, you can contact the authors at:
christopher d0t bryant at cl.cam.ac.uk
mariano d0t felice at cl.cam.ac.uk
%prep
%autosetup -n errant-2.3.3
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-errant -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 2.3.3-1
- Package Spec generated