%global _empty_manifest_terminate_build 0
Name: python-jury
Version: 2.2.3
Release: 1
Summary: Evaluation toolkit for neural language generation.
License: MIT
URL: https://github.com/obss/jury
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/82/cb/2cc74d1c798d175573becbcf91eebb0e4bc797ea48b6a183360cdf53783f/jury-2.2.3.tar.gz
BuildArch: noarch
Requires: python3-click
Requires: python3-evaluate
Requires: python3-fire
Requires: python3-nltk
Requires: python3-rouge-score
Requires: python3-scikit-learn
Requires: python3-tqdm
Requires: python3-validators
Requires: python3-black
Requires: python3-deepdiff
Requires: python3-flake8
Requires: python3-isort
Requires: python3-pytest
Requires: python3-pytest-cov
Requires: python3-pytest-timeout
Requires: python3-sacrebleu
Requires: python3-bert-score
Requires: python3-jiwer
Requires: python3-seqeval
Requires: python3-sentencepiece
Requires: python3-unbabel-comet
Requires: python3-fairseq
Requires: python3-importlib-metadata
Requires: python3-numpy
%description
Jury
A comprehensive toolkit for evaluating NLP experiments, offering various automated metrics. Jury provides a smooth and easy-to-use interface. It uses a more advanced version of the [evaluate](https://github.com/huggingface/evaluate/) design for underlying metric computation, so adding a custom metric is as easy as extending the proper class.
Main advantages that Jury offers are:
- Easy to use for any NLP project.
- Unified structure for computation input across all metrics.
- Calculate many metrics at once.
- Metrics calculations can be handled concurrently to save processing time.
- It seamlessly supports evaluation for multiple predictions/multiple references.
To see more, check the [official Jury blog post](https://medium.com/codable/jury-evaluating-performance-of-nlg-models-730eb9c9999f).
# Available Metrics
The table below shows the current support status for available metrics.
| Metric | Jury Support | HF/evaluate Support |
|-------------------------------------------------------------------------------|--------------------|---------------------|
| Accuracy-Numeric | :heavy_check_mark: | :white_check_mark: |
| Accuracy-Text | :heavy_check_mark: | :x: |
| Bartscore | :heavy_check_mark: | :x: |
| Bertscore | :heavy_check_mark: | :white_check_mark: |
| Bleu | :heavy_check_mark: | :white_check_mark: |
| Bleurt | :heavy_check_mark: | :white_check_mark: |
| CER | :heavy_check_mark: | :white_check_mark: |
| CHRF | :heavy_check_mark: | :white_check_mark: |
| COMET | :heavy_check_mark: | :white_check_mark: |
| F1-Numeric | :heavy_check_mark: | :white_check_mark: |
| F1-Text | :heavy_check_mark: | :x: |
| METEOR | :heavy_check_mark: | :white_check_mark: |
| Precision-Numeric | :heavy_check_mark: | :white_check_mark: |
| Precision-Text | :heavy_check_mark: | :x: |
| Prism | :heavy_check_mark: | :x: |
| Recall-Numeric | :heavy_check_mark: | :white_check_mark: |
| Recall-Text | :heavy_check_mark: | :x: |
| ROUGE | :heavy_check_mark: | :white_check_mark: |
| SacreBleu | :heavy_check_mark: | :white_check_mark: |
| Seqeval | :heavy_check_mark: | :white_check_mark: |
| Squad | :heavy_check_mark: | :white_check_mark: |
| TER | :heavy_check_mark: | :white_check_mark: |
| WER | :heavy_check_mark: | :white_check_mark: |
| [Other metrics](https://github.com/huggingface/evaluate/tree/master/metrics)* | :white_check_mark: | :white_check_mark: |
_*_ Placeholder for the remaining metrics available in the `evaluate` package that are not present in the
table.
**Notes**
* The entry :heavy_check_mark: means that full Jury support is available, i.e. all combinations of input
types (single prediction & single reference, single prediction & multiple references, multiple predictions & multiple
references) are supported.
* The entry :white_check_mark: means that the metric is supported (in Jury through `evaluate`) and can (and should)
be used exactly like the `evaluate` metric, as instructed in the `evaluate` implementation; full Jury support
for these metrics is not yet available.
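The three input combinations above can be normalized into one list-of-lists shape, which is what "unified structure for computation input" refers to. A minimal, illustrative sketch of such a normalization (this is not Jury's actual collation code; `normalize` is a hypothetical helper):

```python
def normalize(items):
    """Wrap bare strings so every instance becomes a list of strings."""
    return [[x] if isinstance(x, str) else list(x) for x in items]

# single prediction & single reference
p1, r1 = normalize(["hello there"]), normalize(["hello world"])
# single prediction & multiple references
p2, r2 = normalize(["hello there"]), normalize([["hello world", "hi world"]])
# multiple predictions & multiple references
p3, r3 = normalize([["hello there", "hi there"]]), normalize([["hello world", "hi world"]])
```

After normalization every case has the same shape, so a metric only ever has to handle lists of candidate strings per instance.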
## Request for a New Metric
To request a new metric, please [open an issue](https://github.com/obss/jury/issues/new?assignees=&labels=&template=new-metric.md&title=) providing the minimum information. Also, PRs adding support for new
metrics are welcome :).
## Installation
Through pip,
pip install jury
or build from source,
git clone https://github.com/obss/jury.git
cd jury
python setup.py install
**NOTE:** Some metrics that depend on the `sacrebleu` package may malfunction on Windows machines, mainly
due to the `pywin32` package. For this reason, we pinned the pywin32 version in our setup config for Windows platforms.
However, if pywin32 causes trouble in your environment, we strongly recommend installing it through the `conda`
manager with `conda install pywin32`.
## Usage
### API Usage
Evaluating generated outputs takes only two lines of code.
```python
from jury import Jury
scorer = Jury()
predictions = [
["the cat is on the mat", "There is cat playing on the mat"],
["Look! a wonderful day."]
]
references = [
["the cat is playing on the mat.", "The cat plays on the mat."],
["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
```
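Conceptually, the multiple-predictions/multiple-references layout above is scored pairwise per instance and then collapsed with a reduce function. A stdlib-only sketch of that idea (this is not Jury's actual internals; `token_overlap` is a hypothetical toy score used only for illustration):

```python
def token_overlap(prediction, reference):
    """Toy score: fraction of prediction tokens that appear in the reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = set(reference.lower().split())
    return sum(t in ref_tokens for t in pred_tokens) / len(pred_tokens)

def score_instance(preds, refs, reduce_fn=max):
    # Score every prediction against every reference, then reduce to one value.
    return reduce_fn(token_overlap(p, r) for p in preds for r in refs)

predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"],
    ["Look! a wonderful day."],
]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."],
]

# Average the per-instance reduced scores over the corpus.
corpus_score = sum(
    score_instance(p, r) for p, r in zip(predictions, references)
) / len(predictions)
```

A real metric replaces `token_overlap` with BLEU, METEOR, etc.; the pairwise-then-reduce pattern is what `reduce_fn` controls.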
Specify metrics you want to use on instantiation.
```python
scorer = Jury(metrics=["bleu", "meteor"])
scores = scorer(predictions, references)
```
#### Use of Metrics standalone
You can directly import metrics from `jury.metrics` as classes, and then instantiate and use as desired.
```python
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references)
```
Additional parameters can be specified either on `compute()`
```python
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references, max_order=4)
```
or, alternatively, on instantiation
```python
from jury.metrics import Bleu
bleu = Bleu.construct(compute_kwargs={"max_order": 1})
score = bleu.compute(predictions=predictions, references=references)
```
Note that you can seamlessly access both `jury` and `evaluate` metrics through `jury.load_metric`.
```python
import jury
bleu = jury.load_metric("bleu")
bleu_1 = jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})
# metrics not available in `jury` but available in `evaluate`
competition_math = jury.load_metric("competition_math")  # falls back to the `evaluate` package with a warning
```
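The fallback behaviour can be pictured as a registry lookup that warns on a miss. A hypothetical sketch of the pattern (the registry contents and this `load_metric` helper are illustrative, not Jury's real resolver):

```python
import warnings

# Hypothetical registry of metrics with full Jury support.
JURY_METRICS = {"bleu", "meteor", "rouge", "sacrebleu"}

def load_metric(name):
    """Return (implementation, name); warn and fall back when unsupported."""
    if name in JURY_METRICS:
        return ("jury", name)
    warnings.warn(
        f"Metric {name!r} is not fully supported by jury; "
        f"falling back to the `evaluate` implementation."
    )
    return ("evaluate", name)

source, _ = load_metric("bleu")                # resolved from the jury registry
source2, _ = load_metric("competition_math")   # warns, then falls back
```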
### CLI Usage
You can specify paths to a predictions file and a references file and get the resulting scores. Lines should be paired across the two files. You can optionally provide a reduce function and an export path for the results to be written.
jury eval --predictions /path/to/predictions.txt --references /path/to/references.txt --reduce_fn max --export /path/to/export.txt
You can also provide prediction and reference folders to evaluate multiple experiments. In this setup, however, each prediction file and the reference file it is paired with must have the same file name; files with a common name are paired as prediction and reference.
jury eval --predictions /path/to/predictions_folder --references /path/to/references_folder --reduce_fn max --export /path/to/export.txt
If you want to specify metrics instead of using the defaults, list them under the `metrics` key in a JSON config file.
```json
{
"predictions": "/path/to/predictions.txt",
"references": "/path/to/references.txt",
"reduce_fn": "max",
"metrics": [
"bleu",
"meteor"
]
}
```
Then, you can call `jury eval` with the `config` argument.
jury eval --config path/to/config.json
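Since the config file is plain JSON, it is easy to validate before a run. A small stdlib-only sketch (the required-keys check here is an assumption for illustration, not part of Jury):

```python
import json
import tempfile

config = {
    "predictions": "/path/to/predictions.txt",
    "references": "/path/to/references.txt",
    "reduce_fn": "max",
    "metrics": ["bleu", "meteor"],
}

# Write the config in the shape `jury eval --config` consumes.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f, indent=2)
    path = f.name

# Re-read and sanity-check the keys before invoking the CLI.
with open(path) as f:
    loaded = json.load(f)

missing = {"predictions", "references"} - loaded.keys()
assert not missing, f"config is missing keys: {missing}"
```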
### Custom Metrics
You can implement custom metrics by inheriting from `jury.metrics.Metric`; the metrics currently implemented in Jury can be seen under [jury/metrics](https://github.com/obss/jury/tree/master/jury/metrics). Jury falls back to the `evaluate` implementation for metrics it does not yet support; the metrics available in `evaluate` are listed under [evaluate/metrics](https://github.com/huggingface/evaluate/tree/master/metrics).
Jury uses `evaluate.Metric` as a base class to derive its own base class, `jury.metrics.Metric`. The interface is similar; however, Jury unifies the input type by handling the inputs for each metric, and supports several input types:
- single prediction & single reference
- single prediction & multiple reference
- multiple prediction & multiple reference
Either base class can be used for a custom metric; however, we strongly recommend `jury.metrics.Metric`, as it has several advantages, such as supporting computation for all of the input types above and unifying the input type.
```python
from jury.metrics import MetricForTask


class CustomMetric(MetricForTask):
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError
```
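The dispatch that selects one of the three `_compute_*` methods can be illustrated without Jury at all: inspect whether predictions and references are nested, then route accordingly. A toy exact-match metric under that pattern (hypothetical and stdlib-only; not Jury's base-class code):

```python
class ToyExactMatch:
    """Routes to one of three compute paths by input shape."""

    def compute(self, predictions, references, reduce_fn=max):
        multi_pred = isinstance(predictions[0], (list, tuple))
        multi_ref = isinstance(references[0], (list, tuple))
        if not multi_pred and not multi_ref:
            return self._single_pred_single_ref(predictions, references)
        if not multi_pred and multi_ref:
            return self._single_pred_multi_ref(predictions, references, reduce_fn)
        return self._multi_pred_multi_ref(predictions, references, reduce_fn)

    def _single_pred_single_ref(self, predictions, references):
        matches = [float(p == r) for p, r in zip(predictions, references)]
        return sum(matches) / len(matches)

    def _single_pred_multi_ref(self, predictions, references, reduce_fn):
        matches = [reduce_fn(float(p == r) for r in refs)
                   for p, refs in zip(predictions, references)]
        return sum(matches) / len(matches)

    def _multi_pred_multi_ref(self, predictions, references, reduce_fn):
        matches = [reduce_fn(float(p == r) for p in preds for r in refs)
                   for preds, refs in zip(predictions, references)]
        return sum(matches) / len(matches)

metric = ToyExactMatch()
score = metric.compute(["a cat"], [["a dog", "a cat"]])
```

Here `score` is 1.0 because the single prediction exactly matches one of its references and the reduce function is `max`.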
For more details, have a look at the base metric implementation, [jury.metrics.Metric](./jury/metrics/_base.py).
## Contributing
PRs are welcome, as always :)
### Installation
git clone https://github.com/obss/jury.git
cd jury
pip install -e .[dev]
You also need to separately install, with the command below, the packages that are only available through a git source.
For those curious about why: in short, PyPI does not allow indexing a package that directly depends on
non-PyPI packages, for security reasons. The file `requirements-dev.txt` lists packages that are currently
only available through a git source, or PyPI packages with no recent release or whose release is
incompatible with Jury, so they are added as git sources or pinned to specific commits.
pip install -r requirements-dev.txt
### Tests
To run the tests:
python tests/run_tests.py
### Code Style
To check code style,
python tests/run_code_style.py check
To format codebase,
python tests/run_code_style.py format
## Citation
If you use this package in your work, please cite it as:
@software{obss2021jury,
author = {Cavusoglu, Devrim and Akyon, Fatih Cagatay and Sert, Ulas and Cengiz, Cemil},
title = {{Jury: Comprehensive NLP Evaluation toolkit}},
month = {feb},
year = {2022},
publisher = {Zenodo},
doi = {10.5281/zenodo.6108229},
url = {https://doi.org/10.5281/zenodo.6108229}
}
## License
Licensed under the [MIT](LICENSE) License.
%package -n python3-jury
Summary: Evaluation toolkit for neural language generation.
Provides: python-jury
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-jury
Jury
A comprehensive toolkit for evaluating NLP experiments, offering various automated metrics. Jury provides a smooth and easy-to-use interface. It uses a more advanced version of the [evaluate](https://github.com/huggingface/evaluate/) design for underlying metric computation, so adding a custom metric is as easy as extending the proper class.
Main advantages that Jury offers are:
- Easy to use for any NLP project.
- Unified structure for computation input across all metrics.
- Calculate many metrics at once.
- Metrics calculations can be handled concurrently to save processing time.
- It seamlessly supports evaluation for multiple predictions/multiple references.
To see more, check the [official Jury blog post](https://medium.com/codable/jury-evaluating-performance-of-nlg-models-730eb9c9999f).
# Available Metrics
The table below shows the current support status for available metrics.
| Metric | Jury Support | HF/evaluate Support |
|-------------------------------------------------------------------------------|--------------------|---------------------|
| Accuracy-Numeric | :heavy_check_mark: | :white_check_mark: |
| Accuracy-Text | :heavy_check_mark: | :x: |
| Bartscore | :heavy_check_mark: | :x: |
| Bertscore | :heavy_check_mark: | :white_check_mark: |
| Bleu | :heavy_check_mark: | :white_check_mark: |
| Bleurt | :heavy_check_mark: | :white_check_mark: |
| CER | :heavy_check_mark: | :white_check_mark: |
| CHRF | :heavy_check_mark: | :white_check_mark: |
| COMET | :heavy_check_mark: | :white_check_mark: |
| F1-Numeric | :heavy_check_mark: | :white_check_mark: |
| F1-Text | :heavy_check_mark: | :x: |
| METEOR | :heavy_check_mark: | :white_check_mark: |
| Precision-Numeric | :heavy_check_mark: | :white_check_mark: |
| Precision-Text | :heavy_check_mark: | :x: |
| Prism | :heavy_check_mark: | :x: |
| Recall-Numeric | :heavy_check_mark: | :white_check_mark: |
| Recall-Text | :heavy_check_mark: | :x: |
| ROUGE | :heavy_check_mark: | :white_check_mark: |
| SacreBleu | :heavy_check_mark: | :white_check_mark: |
| Seqeval | :heavy_check_mark: | :white_check_mark: |
| Squad | :heavy_check_mark: | :white_check_mark: |
| TER | :heavy_check_mark: | :white_check_mark: |
| WER | :heavy_check_mark: | :white_check_mark: |
| [Other metrics](https://github.com/huggingface/evaluate/tree/master/metrics)* | :white_check_mark: | :white_check_mark: |
_*_ Placeholder for the remaining metrics available in the `evaluate` package that are not present in the
table.
**Notes**
* The entry :heavy_check_mark: means that full Jury support is available, i.e. all combinations of input
types (single prediction & single reference, single prediction & multiple references, multiple predictions & multiple
references) are supported.
* The entry :white_check_mark: means that the metric is supported (in Jury through `evaluate`) and can (and should)
be used exactly like the `evaluate` metric, as instructed in the `evaluate` implementation; full Jury support
for these metrics is not yet available.
## Request for a New Metric
To request a new metric, please [open an issue](https://github.com/obss/jury/issues/new?assignees=&labels=&template=new-metric.md&title=) providing the minimum information. Also, PRs adding support for new
metrics are welcome :).
## Installation
Through pip,
pip install jury
or build from source,
git clone https://github.com/obss/jury.git
cd jury
python setup.py install
**NOTE:** Some metrics that depend on the `sacrebleu` package may malfunction on Windows machines, mainly
due to the `pywin32` package. For this reason, we pinned the pywin32 version in our setup config for Windows platforms.
However, if pywin32 causes trouble in your environment, we strongly recommend installing it through the `conda`
manager with `conda install pywin32`.
## Usage
### API Usage
Evaluating generated outputs takes only two lines of code.
```python
from jury import Jury
scorer = Jury()
predictions = [
["the cat is on the mat", "There is cat playing on the mat"],
["Look! a wonderful day."]
]
references = [
["the cat is playing on the mat.", "The cat plays on the mat."],
["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
```
Specify metrics you want to use on instantiation.
```python
scorer = Jury(metrics=["bleu", "meteor"])
scores = scorer(predictions, references)
```
#### Use of Metrics standalone
You can directly import metrics from `jury.metrics` as classes, and then instantiate and use as desired.
```python
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references)
```
Additional parameters can be specified either on `compute()`
```python
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references, max_order=4)
```
or, alternatively, on instantiation
```python
from jury.metrics import Bleu
bleu = Bleu.construct(compute_kwargs={"max_order": 1})
score = bleu.compute(predictions=predictions, references=references)
```
Note that you can seamlessly access both `jury` and `evaluate` metrics through `jury.load_metric`.
```python
import jury
bleu = jury.load_metric("bleu")
bleu_1 = jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})
# metrics not available in `jury` but available in `evaluate`
competition_math = jury.load_metric("competition_math")  # falls back to the `evaluate` package with a warning
```
### CLI Usage
You can specify paths to a predictions file and a references file and get the resulting scores. Lines should be paired across the two files. You can optionally provide a reduce function and an export path for the results to be written.
jury eval --predictions /path/to/predictions.txt --references /path/to/references.txt --reduce_fn max --export /path/to/export.txt
You can also provide prediction and reference folders to evaluate multiple experiments. In this setup, however, each prediction file and the reference file it is paired with must have the same file name; files with a common name are paired as prediction and reference.
jury eval --predictions /path/to/predictions_folder --references /path/to/references_folder --reduce_fn max --export /path/to/export.txt
If you want to specify metrics instead of using the defaults, list them under the `metrics` key in a JSON config file.
```json
{
"predictions": "/path/to/predictions.txt",
"references": "/path/to/references.txt",
"reduce_fn": "max",
"metrics": [
"bleu",
"meteor"
]
}
```
Then, you can call `jury eval` with the `config` argument.
jury eval --config path/to/config.json
### Custom Metrics
You can implement custom metrics by inheriting from `jury.metrics.Metric`; the metrics currently implemented in Jury can be seen under [jury/metrics](https://github.com/obss/jury/tree/master/jury/metrics). Jury falls back to the `evaluate` implementation for metrics it does not yet support; the metrics available in `evaluate` are listed under [evaluate/metrics](https://github.com/huggingface/evaluate/tree/master/metrics).
Jury uses `evaluate.Metric` as a base class to derive its own base class, `jury.metrics.Metric`. The interface is similar; however, Jury unifies the input type by handling the inputs for each metric, and supports several input types:
- single prediction & single reference
- single prediction & multiple reference
- multiple prediction & multiple reference
Either base class can be used for a custom metric; however, we strongly recommend `jury.metrics.Metric`, as it has several advantages, such as supporting computation for all of the input types above and unifying the input type.
```python
from jury.metrics import MetricForTask


class CustomMetric(MetricForTask):
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError
```
For more details, have a look at the base metric implementation, [jury.metrics.Metric](./jury/metrics/_base.py).
## Contributing
PRs are welcome, as always :)
### Installation
git clone https://github.com/obss/jury.git
cd jury
pip install -e .[dev]
You also need to separately install, with the command below, the packages that are only available through a git source.
For those curious about why: in short, PyPI does not allow indexing a package that directly depends on
non-PyPI packages, for security reasons. The file `requirements-dev.txt` lists packages that are currently
only available through a git source, or PyPI packages with no recent release or whose release is
incompatible with Jury, so they are added as git sources or pinned to specific commits.
pip install -r requirements-dev.txt
### Tests
To run the tests:
python tests/run_tests.py
### Code Style
To check code style,
python tests/run_code_style.py check
To format codebase,
python tests/run_code_style.py format
## Citation
If you use this package in your work, please cite it as:
@software{obss2021jury,
author = {Cavusoglu, Devrim and Akyon, Fatih Cagatay and Sert, Ulas and Cengiz, Cemil},
title = {{Jury: Comprehensive NLP Evaluation toolkit}},
month = {feb},
year = {2022},
publisher = {Zenodo},
doi = {10.5281/zenodo.6108229},
url = {https://doi.org/10.5281/zenodo.6108229}
}
## License
Licensed under the [MIT](LICENSE) License.
%package help
Summary: Development documents and examples for jury
Provides: python3-jury-doc
%description help
Jury
A comprehensive toolkit for evaluating NLP experiments, offering various automated metrics. Jury provides a smooth and easy-to-use interface. It uses a more advanced version of the [evaluate](https://github.com/huggingface/evaluate/) design for underlying metric computation, so adding a custom metric is as easy as extending the proper class.
Main advantages that Jury offers are:
- Easy to use for any NLP project.
- Unified structure for computation input across all metrics.
- Calculate many metrics at once.
- Metrics calculations can be handled concurrently to save processing time.
- It seamlessly supports evaluation for multiple predictions/multiple references.
To see more, check the [official Jury blog post](https://medium.com/codable/jury-evaluating-performance-of-nlg-models-730eb9c9999f).
# Available Metrics
The table below shows the current support status for available metrics.
| Metric | Jury Support | HF/evaluate Support |
|-------------------------------------------------------------------------------|--------------------|---------------------|
| Accuracy-Numeric | :heavy_check_mark: | :white_check_mark: |
| Accuracy-Text | :heavy_check_mark: | :x: |
| Bartscore | :heavy_check_mark: | :x: |
| Bertscore | :heavy_check_mark: | :white_check_mark: |
| Bleu | :heavy_check_mark: | :white_check_mark: |
| Bleurt | :heavy_check_mark: | :white_check_mark: |
| CER | :heavy_check_mark: | :white_check_mark: |
| CHRF | :heavy_check_mark: | :white_check_mark: |
| COMET | :heavy_check_mark: | :white_check_mark: |
| F1-Numeric | :heavy_check_mark: | :white_check_mark: |
| F1-Text | :heavy_check_mark: | :x: |
| METEOR | :heavy_check_mark: | :white_check_mark: |
| Precision-Numeric | :heavy_check_mark: | :white_check_mark: |
| Precision-Text | :heavy_check_mark: | :x: |
| Prism | :heavy_check_mark: | :x: |
| Recall-Numeric | :heavy_check_mark: | :white_check_mark: |
| Recall-Text | :heavy_check_mark: | :x: |
| ROUGE | :heavy_check_mark: | :white_check_mark: |
| SacreBleu | :heavy_check_mark: | :white_check_mark: |
| Seqeval | :heavy_check_mark: | :white_check_mark: |
| Squad | :heavy_check_mark: | :white_check_mark: |
| TER | :heavy_check_mark: | :white_check_mark: |
| WER | :heavy_check_mark: | :white_check_mark: |
| [Other metrics](https://github.com/huggingface/evaluate/tree/master/metrics)* | :white_check_mark: | :white_check_mark: |
_*_ Placeholder for the remaining metrics available in the `evaluate` package that are not present in the
table.
**Notes**
* The entry :heavy_check_mark: means that full Jury support is available, i.e. all combinations of input
types (single prediction & single reference, single prediction & multiple references, multiple predictions & multiple
references) are supported.
* The entry :white_check_mark: means that the metric is supported (in Jury through `evaluate`) and can (and should)
be used exactly like the `evaluate` metric, as instructed in the `evaluate` implementation; full Jury support
for these metrics is not yet available.
## Request for a New Metric
To request a new metric, please [open an issue](https://github.com/obss/jury/issues/new?assignees=&labels=&template=new-metric.md&title=) providing the minimum information. Also, PRs adding support for new
metrics are welcome :).
## Installation
Through pip,
pip install jury
or build from source,
git clone https://github.com/obss/jury.git
cd jury
python setup.py install
**NOTE:** Some metrics that depend on the `sacrebleu` package may malfunction on Windows machines, mainly
due to the `pywin32` package. For this reason, we pinned the pywin32 version in our setup config for Windows platforms.
However, if pywin32 causes trouble in your environment, we strongly recommend installing it through the `conda`
manager with `conda install pywin32`.
## Usage
### API Usage
Evaluating generated outputs takes only two lines of code.
```python
from jury import Jury
scorer = Jury()
predictions = [
["the cat is on the mat", "There is cat playing on the mat"],
["Look! a wonderful day."]
]
references = [
["the cat is playing on the mat.", "The cat plays on the mat."],
["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
```
Specify metrics you want to use on instantiation.
```python
scorer = Jury(metrics=["bleu", "meteor"])
scores = scorer(predictions, references)
```
#### Use of Metrics standalone
You can directly import metrics from `jury.metrics` as classes, and then instantiate and use as desired.
```python
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references)
```
Additional parameters can be specified either on `compute()`
```python
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references, max_order=4)
```
or, alternatively, on instantiation
```python
from jury.metrics import Bleu
bleu = Bleu.construct(compute_kwargs={"max_order": 1})
score = bleu.compute(predictions=predictions, references=references)
```
Note that you can seamlessly access both `jury` and `evaluate` metrics through `jury.load_metric`.
```python
import jury
bleu = jury.load_metric("bleu")
bleu_1 = jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})
# metrics not available in `jury` but available in `evaluate`
competition_math = jury.load_metric("competition_math")  # falls back to the `evaluate` package with a warning
```
### CLI Usage
You can specify paths to a predictions file and a references file and get the resulting scores. Lines should be paired across the two files. You can optionally provide a reduce function and an export path for the results to be written.
jury eval --predictions /path/to/predictions.txt --references /path/to/references.txt --reduce_fn max --export /path/to/export.txt
You can also provide prediction and reference folders to evaluate multiple experiments. In this setup, however, each prediction file and the reference file it is paired with must have the same file name; files with a common name are paired as prediction and reference.
jury eval --predictions /path/to/predictions_folder --references /path/to/references_folder --reduce_fn max --export /path/to/export.txt
If you want to specify metrics instead of using the defaults, list them under the `metrics` key in a JSON config file.
```json
{
"predictions": "/path/to/predictions.txt",
"references": "/path/to/references.txt",
"reduce_fn": "max",
"metrics": [
"bleu",
"meteor"
]
}
```
Then, you can call `jury eval` with the `config` argument.
jury eval --config path/to/config.json
### Custom Metrics
You can implement custom metrics by inheriting from `jury.metrics.Metric`; the metrics currently implemented in Jury can be seen under [jury/metrics](https://github.com/obss/jury/tree/master/jury/metrics). Jury falls back to the `evaluate` implementation for metrics it does not yet support; the metrics available in `evaluate` are listed under [evaluate/metrics](https://github.com/huggingface/evaluate/tree/master/metrics).
Jury uses `evaluate.Metric` as a base class to derive its own base class, `jury.metrics.Metric`. The interface is similar; however, Jury unifies the input type by handling the inputs for each metric, and supports several input types:
- single prediction & single reference
- single prediction & multiple reference
- multiple prediction & multiple reference
Either base class can be used for a custom metric; however, we strongly recommend `jury.metrics.Metric`, as it has several advantages, such as supporting computation for all of the input types above and unifying the input type.
```python
from jury.metrics import MetricForTask


class CustomMetric(MetricForTask):
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError
```
For more details, have a look at the base metric implementation, [jury.metrics.Metric](./jury/metrics/_base.py).
## Contributing
PRs are welcome, as always :)
### Installation
git clone https://github.com/obss/jury.git
cd jury
pip install -e .[dev]
You also need to separately install, with the command below, the packages that are only available through a git source.
For those curious about why: in short, PyPI does not allow indexing a package that directly depends on
non-PyPI packages, for security reasons. The file `requirements-dev.txt` lists packages that are currently
only available through a git source, or PyPI packages with no recent release or whose release is
incompatible with Jury, so they are added as git sources or pinned to specific commits.
pip install -r requirements-dev.txt
### Tests
To run the tests:
python tests/run_tests.py
### Code Style
To check code style,
python tests/run_code_style.py check
To format codebase,
python tests/run_code_style.py format
## Citation
If you use this package in your work, please cite it as:
@software{obss2021jury,
author = {Cavusoglu, Devrim and Akyon, Fatih Cagatay and Sert, Ulas and Cengiz, Cemil},
title = {{Jury: Comprehensive NLP Evaluation toolkit}},
month = {feb},
year = {2022},
publisher = {Zenodo},
doi = {10.5281/zenodo.6108229},
url = {https://doi.org/10.5281/zenodo.6108229}
}
## License
Licensed under the [MIT](LICENSE) License.
%prep
%autosetup -n jury-2.2.3
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-jury -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Wed May 31 2023 Python_Bot - 2.2.3-1
- Package Spec generated