From 227aa85f40d3fd1002cd2ed0a0cfc05a66c823ae Mon Sep 17 00:00:00 2001 From: CoprDistGit Date: Wed, 31 May 2023 05:14:48 +0000 Subject: automatic import of python-cldfbench --- .gitignore | 1 + python-cldfbench.spec | 857 ++++++++++++++++++++++++++++++++++++++++++++++++++ sources | 1 + 3 files changed, 859 insertions(+) create mode 100644 python-cldfbench.spec create mode 100644 sources diff --git a/.gitignore b/.gitignore index e69de29..192c420 100644 --- a/.gitignore +++ b/.gitignore @@ -0,0 +1 @@ +/cldfbench-1.13.0.tar.gz diff --git a/python-cldfbench.spec b/python-cldfbench.spec new file mode 100644 index 0000000..039fa58 --- /dev/null +++ b/python-cldfbench.spec @@ -0,0 +1,857 @@ +%global _empty_manifest_terminate_build 0 +Name: python-cldfbench +Version: 1.13.0 +Release: 1 +Summary: Python library implementing a CLDF workbench +License: Apache 2.0 +URL: https://github.com/cldf/cldfbench +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/98/1c/d6c3f474712c65e0834b729df285ce0b8cd813a7e78c37c201e305ef3817/cldfbench-1.13.0.tar.gz +BuildArch: noarch + +Requires: python3-appdirs +Requires: python3-cldfcatalog +Requires: python3-clldutils +Requires: python3-csvw +Requires: python3-pycldf +Requires: python3-pytest +Requires: python3-requests +Requires: python3-rfc3986 +Requires: python3-termcolor +Requires: python3-tqdm +Requires: python3-zenodoclient +Requires: python3-importlib-metadata +Requires: python3-pyclts +Requires: python3-pyconcepticon +Requires: python3-build +Requires: python3-flake8 +Requires: python3-twine +Requires: python3-wheel +Requires: python3-sphinx +Requires: python3-sphinx-autodoc-typehints +Requires: python3-sphinx-rtd-theme +Requires: python3-openpyxl +Requires: python3-xlrd +Requires: python3-pyglottolog +Requires: python3-odfpy +Requires: python3-odfpy +Requires: python3-openpyxl +Requires: python3-packaging +Requires: python3-pyconcepticon +Requires: python3-pyglottolog +Requires: python3-pytest-cov +Requires: python3-pytest-mock 
+Requires: python3-pytest +Requires: python3-tox +Requires: python3-xlrd + +%description +# cldfbench +Tooling to create [CLDF](https://cldf.clld.org) datasets from existing data. + +[![Build Status](https://github.com/cldf/cldfbench/workflows/tests/badge.svg)](https://github.com/cldf/cldfbench/actions?query=workflow%3Atests) +[![Documentation Status](https://readthedocs.org/projects/cldfbench/badge/?version=latest)](https://cldfbench.readthedocs.io/en/latest/?badge=latest) +[![PyPI](https://img.shields.io/pypi/v/cldfbench.svg)](https://pypi.org/project/cldfbench) + + +## Overview + +This package provides tools to curate cross-linguistic data, with the goal of +packaging it as [CLDF](https://cldf.clld.org) datasets. + +In particular, it supports a workflow where: +- "raw" source data is downloaded to a `raw/` subdirectory, +- and subsequently converted to one or more CLDF datasets in a `cldf/` subdirectory, with the help of: + - configuration data in a `etc/` directory and + - custom Python code (a subclass of [`cldfbench.Dataset`](src/cldfbench/dataset.py) which implements the workflow actions). + +This workflow is supported via: +- a commandline interface `cldfbench` which calls the workflow actions as [subcommands](src/cldfbench/commands), +- a `cldfbench.Dataset` base class, which must be overwritten in a custom module + to hook custom code into the workflow. + +With this workflow and the separation of the data into three directories we want +to provide a workbench for transparently deriving CLDF data from data that has been +published before. In particular we want to delineate clearly: +- what forms part of the original or source data (`raw`), +- what kind of information is added by the curators of the CLDF dataset (`etc`) +- and what data was derived using the workbench (`cldf`). + + +### Further reading + +This paper introduces `cldfbench` and uses an extended, real-world example: + +> Forkel, R., & List, J.-M. (2020). 
CLDFBench: Give your cross-linguistic data a lift. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 6995-7002). Paris: European Language Resources Association (ELRA). [[PDF]](https://pure.mpg.de/pubman/item/item_3231858_1/component/file_3231859/shh2600.pdf) + + +## Installation + +`cldfbench` can be installed via `pip` - preferably in a +[virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) - by running: +```shell script +pip install cldfbench +``` + +`cldfbench` provides some functionality that relies on python +packages which are not needed for the core functionality. These are specified as [extras](https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies) and can be installed using syntax like: +```shell +pip install cldfbench[<extras>] +``` +where `<extras>` is a comma-separated list of names from the following list: +- `excel`: support for reading spreadsheet data. +- `glottolog`: support to access [Glottolog data](https://github.com/glottolog/glottolog). +- `concepticon`: support to access [Concepticon data](https://github.com/concepticon/concepticon-data). +- `clts`: support to access [CLTS data](https://github.com/cldf-clts/clts). + + +## The command line interface `cldfbench` + +Installing the python package will also install a command `cldfbench` available on +the command line: +```shell script +$ cldfbench -h +usage: cldfbench [-h] [--log-level LOG_LEVEL] COMMAND ... + +optional arguments: + -h, --help show this help message and exit + --log-level LOG_LEVEL + log level [ERROR|WARN|INFO|DEBUG] (default: 20) + +available commands: + Run "COMAMND -h" to get help for a specific command. + + COMMAND + check Run generic CLDF checks + ... 
+``` + +As shown above, run `cldfbench -h` to get help, and `cldfbench COMMAND -h` to get +help on individual subcommands, e.g. `cldfbench new -h` to read about the usage +of the `new` subcommand. + + +### Dataset discovery + +Most `cldfbench` commands operate on an existing dataset (unlike `new`, which +creates a new one). Datasets can be discovered in two ways: + +1. Via the python module (i.e. the `*.py` file, containing the `Dataset` subclass). + To use this mode of discovery, pass the path to the python module + as `DATASET` argument, when required by a command. + +2. Via [entry point](https://packaging.python.org/specifications/entry-points/) and + dataset ID. To use this mode, specify the name of the entry point as value of + the `--entry-point` option (or use the default name `cldfbench.dataset`) and + the `Dataset.id` as `DATASET` argument. + +Discovery via entry point is particularly useful for commands that can operate +on multiple datasets. To select **all** datasets advertising a given entry point, +pass `"_"` (i.e. an underscore) as `DATASET` argument. + + +## Workflow + +For a full example of the `cldfbench` curation workflow, see [the tutorial](doc/tutorial.md). 
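To make the entry-point mode from "Dataset discovery" more concrete, here is a minimal sketch of how a dataset package might advertise itself. Only the group name `cldfbench.dataset` comes from the text above; the package name `cldfbench_mydataset`, the dataset ID `mydataset` and the helper `parse_entry_point` are hypothetical and serve purely as illustration of how a `name=module:attr` string breaks down.

```python
# Hypothetical `entry_points` mapping, as it might be passed to setuptools'
# setup() in a dataset's setup.py. Only the group name 'cldfbench.dataset'
# is taken from the documentation; all other names are placeholders.
entry_points = {
    "cldfbench.dataset": [
        "mydataset=cldfbench_mydataset:Dataset",
    ],
}


def parse_entry_point(spec):
    """Split a 'name=module:attr' entry point string into its three parts."""
    name, _, target = spec.partition("=")
    module, _, attr = target.partition(":")
    return name.strip(), module.strip(), attr.strip()


for spec in entry_points["cldfbench.dataset"]:
    # Show which module and attribute the advertised name points to.
    print(parse_entry_point(spec))
```

Under these assumptions, once such a package is installed, a dataset whose `Dataset.id` is `mydataset` could be passed as the `DATASET` argument, and `"_"` would select every dataset registered under the same entry point.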
+ + +### Creating a skeleton for a new dataset directory + +A directory containing stub entries for a dataset can be created by running + +```bash +cldfbench new +``` + +This will create the following layout (where `<ID>` stands for the chosen dataset ID): +``` +<ID>/ +├── cldf # A stub directory for the CLDF data +│   └── README.md +├── cldfbench_<ID>.py # The python module, providing the Dataset subclass +├── etc # A stub directory for the configuration data +│   └── README.md +├── metadata.json # The metadata provided to the subcommand serialized as JSON +├── raw # A stub directory for the raw data +│   └── README.md +├── setup.cfg # Python setup config, providing defaults for test integration +├── setup.py # Python setup file, making the dataset "installable" +├── test.py # The python code to run for dataset validation +└── .travis.yml # Integrate the validation with Travis-CI +``` + + +### Implementing CLDF creation + +`cldfbench` provides tools to make CLDF creation simple. Still, each dataset is +different, and so each dataset will have to provide its own custom code to do so. +This custom code goes into the `cmd_makecldf` method of the `Dataset` subclass in +the dataset's python module. +(See also the [API documentation of `cldfbench.Dataset`](https://cldfbench.readthedocs.io/en/latest/dataset.html).) + +Typically, this code will make use of one or more +[`cldfbench.CLDFSpec`](src/cldfbench/cldf.py) instances, which describe what kind of CLDF to create. A `CLDFSpec` also gives access to a +[`cldfbench.CLDFWriter`](src/cldfbench/cldf.py) instance, which wraps a `pycldf.Dataset`. + +The main interfaces to these objects are: +- `cldfbench.Dataset.cldf_specs`: a method returning specifications of all CLDF datasets + that are created by the dataset, +- `cldfbench.Dataset.cldf_writer`: a method returning an initialized `CLDFWriter` + associated with a particular `CLDFSpec`. 
+ +`cldfbench` supports several scenarios of CLDF creation: +- The typical use case is turning raw data into a single CLDF dataset. This would + require instantiating one `CLDFWriter` in the `cmd_makecldf` method, and + the defaults of `CLDFSpec` will probably be ok. Since this is the most common and + simplest case, it is supported with some extra "sugar": The initialized `CLDFWriter` + is available as `args.writer` when `cmd_makecldf` is called. +- But it is also possible to create multiple CLDF datasets: + - For a dataset containing both lexical and typological data, it may be appropriate + to create a `Wordlist` and a `StructureDataset`. To do so, one would have to + call `cldf_writer` twice, passing in an appropriate `CLDFSpec`. Note that if + both CLDF datasets are created in the same directory, they can share the + `LanguageTable` - but would have to specify distinct file names for the + `ParameterTable`, passing distinct values to `CLDFSpec.data_fnames`. + - When creating multiple datasets of the same CLDF module, e.g. to split a large dataset into smaller chunks, care must be taken to also disambiguate the name + of the metadata file, passing distinct values to `CLDFSpec.metadata_fname`. + +When creating CLDF, it is also often useful to have standard reference catalogs +accessible, in particular Glottolog. See the section on [Catalogs](#catalogs) for +a description of how this is supported by `cldfbench`. + + +### Catalogs + +Linking data to reference catalogs is a major goal of CLDF, thus `cldfbench` +provides tools to make catalog access and maintenance easier. Catalog data must be +accessible in local clones of the data repository. `cldfbench` provides commands: +- `catconfig` to create the clones and make them known through a configuration file, +- `catinfo` to get an overview of the installed catalogs and their versions, +- `catupdate` to update local clones from the upstream repositories. 
+ +See: + +- https://cldfbench.readthedocs.io/en/latest//catalogs.html + +for a list of reference catalogs which are currently supported in `cldfbench`. + + +### Curating a dataset on GitHub + +One of the design goals of CLDF was to specify a data format that plays well with +version control. Thus, it's natural - and actually recommended - to curate a CLDF +dataset in a version controlled repository. The most popular way to do this in a +collaborative fashion is by using a [git](https://git-scm.com/) repository hosted on +[GitHub](https://github.com). + +The directory layout supported by `cldfbench` caters to this use case in several ways: +- Each directory contains a file `README.md`, which will be rendered as human readable + description when browsing the repository at GitHub. +- The file `.travis.yml` contains the configuration for hooking up a repository with + [Travis CI](https://www.travis-ci.org/), to provide continuous consistency checking + of the data. + + +### Archiving a dataset with Zenodo + +Curating a dataset on GitHub also provides a simple way of archiving and publishing +released versions of the data. You can hook up your repository with [Zenodo](https://zenodo.org) (following [this guide](https://guides.github.com/activities/citable-code/)). Then, Zenodo will pick up any released package, assign a DOI to it, archive it and +make it accessible in the long term. + +Some notes: +- Hook-up with Zenodo requires the repository to be public (not private). +- You should consider using an institutional account on GitHub and Zenodo to associate the repository with. Currently, only the user account registering a repository on Zenodo can change any metadata of releases later on. +- Once released and archived with Zenodo, it's a good idea to add the DOI assigned by Zenodo to the release description on GitHub. +- To make sure a release is picked up by Zenodo, the version number must start with a letter, e.g. "v1.0" - **not** "1.0". 
+ +Thus, with a setup as described here, you can make sure you create [FAIR data](https://en.wikipedia.org/wiki/FAIR_data). + + +## Extending `cldfbench` + +`cldfbench` can be extended or built upon in various ways - typically by customizing core functionality in new python packages. To support particular types of raw data, you might want a custom `Dataset` class, or to support a particular type of CLDF data, you would customize `CLDFWriter`. + +In addition to extending `cldfbench` using the standard methods of object-oriented programming, there are two more ways of extending `cldfbench`: + + +### Commands + +A python package (or a dataset) can provide additional subcommands to be run from `cldfbench`. +For more info see the [`commands.README`](src/cldfbench/commands/README.md). + + +### Custom dataset templates + +A python package can provide alternative dataset templates to be run with `cldfbench new`. +Such templates are implemented by: +- a subclass of `cldfbench.Template`, +- which is advertised using an entry point `cldfbench.scaffold`: + +```python + entry_points={ + 'cldfbench.scaffold': [ + 'template_name=mypackage.scaffold:DerivedTemplate', + ], + }, +``` + + + + +%package -n python3-cldfbench +Summary: Python library implementing a CLDF workbench +Provides: python-cldfbench +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-cldfbench +# cldfbench +Tooling to create [CLDF](https://cldf.clld.org) datasets from existing data. 
+ +[![Build Status](https://github.com/cldf/cldfbench/workflows/tests/badge.svg)](https://github.com/cldf/cldfbench/actions?query=workflow%3Atests) +[![Documentation Status](https://readthedocs.org/projects/cldfbench/badge/?version=latest)](https://cldfbench.readthedocs.io/en/latest/?badge=latest) +[![PyPI](https://img.shields.io/pypi/v/cldfbench.svg)](https://pypi.org/project/cldfbench) + + +## Overview + +This package provides tools to curate cross-linguistic data, with the goal of +packaging it as [CLDF](https://cldf.clld.org) datasets. + +In particular, it supports a workflow where: +- "raw" source data is downloaded to a `raw/` subdirectory, +- and subsequently converted to one or more CLDF datasets in a `cldf/` subdirectory, with the help of: + - configuration data in a `etc/` directory and + - custom Python code (a subclass of [`cldfbench.Dataset`](src/cldfbench/dataset.py) which implements the workflow actions). + +This workflow is supported via: +- a commandline interface `cldfbench` which calls the workflow actions as [subcommands](src/cldfbench/commands), +- a `cldfbench.Dataset` base class, which must be overwritten in a custom module + to hook custom code into the workflow. + +With this workflow and the separation of the data into three directories we want +to provide a workbench for transparently deriving CLDF data from data that has been +published before. In particular we want to delineate clearly: +- what forms part of the original or source data (`raw`), +- what kind of information is added by the curators of the CLDF dataset (`etc`) +- and what data was derived using the workbench (`cldf`). + + +### Further reading + +This paper introduces `cldfbench` and uses an extended, real-world example: + +> Forkel, R., & List, J.-M. (2020). CLDFBench: Give your cross-linguistic data a lift. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. 
(Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 6995-7002). Paris: European Language Resources Association (ELRA). [[PDF]](https://pure.mpg.de/pubman/item/item_3231858_1/component/file_3231859/shh2600.pdf) + + +## Installation + +`cldfbench` can be installed via `pip` - preferably in a +[virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) - by running: +```shell script +pip install cldfbench +``` + +`cldfbench` provides some functionality that relies on python +packages which are not needed for the core functionality. These are specified as [extras](https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies) and can be installed using syntax like: +```shell +pip install cldfbench[<extras>] +``` +where `<extras>` is a comma-separated list of names from the following list: +- `excel`: support for reading spreadsheet data. +- `glottolog`: support to access [Glottolog data](https://github.com/glottolog/glottolog). +- `concepticon`: support to access [Concepticon data](https://github.com/concepticon/concepticon-data). +- `clts`: support to access [CLTS data](https://github.com/cldf-clts/clts). + + +## The command line interface `cldfbench` + +Installing the python package will also install a command `cldfbench` available on +the command line: +```shell script +$ cldfbench -h +usage: cldfbench [-h] [--log-level LOG_LEVEL] COMMAND ... + +optional arguments: + -h, --help show this help message and exit + --log-level LOG_LEVEL + log level [ERROR|WARN|INFO|DEBUG] (default: 20) + +available commands: + Run "COMAMND -h" to get help for a specific command. + + COMMAND + check Run generic CLDF checks + ... +``` + +As shown above, run `cldfbench -h` to get help, and `cldfbench COMMAND -h` to get +help on individual subcommands, e.g. `cldfbench new -h` to read about the usage +of the `new` subcommand. 
+ + +### Dataset discovery + +Most `cldfbench` commands operate on an existing dataset (unlike `new`, which +creates a new one). Datasets can be discovered in two ways: + +1. Via the python module (i.e. the `*.py` file, containing the `Dataset` subclass). + To use this mode of discovery, pass the path to the python module + as `DATASET` argument, when required by a command. + +2. Via [entry point](https://packaging.python.org/specifications/entry-points/) and + dataset ID. To use this mode, specify the name of the entry point as value of + the `--entry-point` option (or use the default name `cldfbench.dataset`) and + the `Dataset.id` as `DATASET` argument. + +Discovery via entry point is particularly useful for commands that can operate +on multiple datasets. To select **all** datasets advertising a given entry point, +pass `"_"` (i.e. an underscore) as `DATASET` argument. + + +## Workflow + +For a full example of the `cldfbench` curation workflow, see [the tutorial](doc/tutorial.md). + + +### Creating a skeleton for a new dataset directory + +A directory containing stub entries for a dataset can be created by running + +```bash +cldfbench new +``` + +This will create the following layout (where `<ID>` stands for the chosen dataset ID): +``` +<ID>/ +├── cldf # A stub directory for the CLDF data +│   └── README.md +├── cldfbench_<ID>.py # The python module, providing the Dataset subclass +├── etc # A stub directory for the configuration data +│   └── README.md +├── metadata.json # The metadata provided to the subcommand serialized as JSON +├── raw # A stub directory for the raw data +│   └── README.md +├── setup.cfg # Python setup config, providing defaults for test integration +├── setup.py # Python setup file, making the dataset "installable" +├── test.py # The python code to run for dataset validation +└── .travis.yml # Integrate the validation with Travis-CI +``` + + +### Implementing CLDF creation + +`cldfbench` provides tools to make CLDF creation simple. 
Still, each dataset is +different, and so each dataset will have to provide its own custom code to do so. +This custom code goes into the `cmd_makecldf` method of the `Dataset` subclass in +the dataset's python module. +(See also the [API documentation of `cldfbench.Dataset`](https://cldfbench.readthedocs.io/en/latest/dataset.html).) + +Typically, this code will make use of one or more +[`cldfbench.CLDFSpec`](src/cldfbench/cldf.py) instances, which describe what kind of CLDF to create. A `CLDFSpec` also gives access to a +[`cldfbench.CLDFWriter`](src/cldfbench/cldf.py) instance, which wraps a `pycldf.Dataset`. + +The main interfaces to these objects are: +- `cldfbench.Dataset.cldf_specs`: a method returning specifications of all CLDF datasets + that are created by the dataset, +- `cldfbench.Dataset.cldf_writer`: a method returning an initialized `CLDFWriter` + associated with a particular `CLDFSpec`. + +`cldfbench` supports several scenarios of CLDF creation: +- The typical use case is turning raw data into a single CLDF dataset. This would + require instantiating one `CLDFWriter` in the `cmd_makecldf` method, and + the defaults of `CLDFSpec` will probably be ok. Since this is the most common and + simplest case, it is supported with some extra "sugar": The initialized `CLDFWriter` + is available as `args.writer` when `cmd_makecldf` is called. +- But it is also possible to create multiple CLDF datasets: + - For a dataset containing both lexical and typological data, it may be appropriate + to create a `Wordlist` and a `StructureDataset`. To do so, one would have to + call `cldf_writer` twice, passing in an appropriate `CLDFSpec`. Note that if + both CLDF datasets are created in the same directory, they can share the + `LanguageTable` - but would have to specify distinct file names for the + `ParameterTable`, passing distinct values to `CLDFSpec.data_fnames`. + - When creating multiple datasets of the same CLDF module, e.g. 
to split a large dataset into smaller chunks, care must be taken to also disambiguate the name + of the metadata file, passing distinct values to `CLDFSpec.metadata_fname`. + +When creating CLDF, it is also often useful to have standard reference catalogs +accessible, in particular Glottolog. See the section on [Catalogs](#catalogs) for +a description of how this is supported by `cldfbench`. + + +### Catalogs + +Linking data to reference catalogs is a major goal of CLDF, thus `cldfbench` +provides tools to make catalog access and maintenance easier. Catalog data must be +accessible in local clones of the data repository. `cldfbench` provides commands: +- `catconfig` to create the clones and make them known through a configuration file, +- `catinfo` to get an overview of the installed catalogs and their versions, +- `catupdate` to update local clones from the upstream repositories. + +See: + +- https://cldfbench.readthedocs.io/en/latest//catalogs.html + +for a list of reference catalogs which are currently supported in `cldfbench`. + + +### Curating a dataset on GitHub + +One of the design goals of CLDF was to specify a data format that plays well with +version control. Thus, it's natural - and actually recommended - to curate a CLDF +dataset in a version controlled repository. The most popular way to do this in a +collaborative fashion is by using a [git](https://git-scm.com/) repository hosted on +[GitHub](https://github.com). + +The directory layout supported by `cldfbench` caters to this use case in several ways: +- Each directory contains a file `README.md`, which will be rendered as human readable + description when browsing the repository at GitHub. +- The file `.travis.yml` contains the configuration for hooking up a repository with + [Travis CI](https://www.travis-ci.org/), to provide continuous consistency checking + of the data. 
+ + +### Archiving a dataset with Zenodo + +Curating a dataset on GitHub also provides a simple way of archiving and publishing +released versions of the data. You can hook up your repository with [Zenodo](https://zenodo.org) (following [this guide](https://guides.github.com/activities/citable-code/)). Then, Zenodo will pick up any released package, assign a DOI to it, archive it and +make it accessible in the long term. + +Some notes: +- Hook-up with Zenodo requires the repository to be public (not private). +- You should consider using an institutional account on GitHub and Zenodo to associate the repository with. Currently, only the user account registering a repository on Zenodo can change any metadata of releases later on. +- Once released and archived with Zenodo, it's a good idea to add the DOI assigned by Zenodo to the release description on GitHub. +- To make sure a release is picked up by Zenodo, the version number must start with a letter, e.g. "v1.0" - **not** "1.0". + +Thus, with a setup as described here, you can make sure you create [FAIR data](https://en.wikipedia.org/wiki/FAIR_data). + + +## Extending `cldfbench` + +`cldfbench` can be extended or built upon in various ways - typically by customizing core functionality in new python packages. To support particular types of raw data, you might want a custom `Dataset` class, or to support a particular type of CLDF data, you would customize `CLDFWriter`. + +In addition to extending `cldfbench` using the standard methods of object-oriented programming, there are two more ways of extending `cldfbench`: + + +### Commands + +A python package (or a dataset) can provide additional subcommands to be run from `cldfbench`. +For more info see the [`commands.README`](src/cldfbench/commands/README.md). + + +### Custom dataset templates + +A python package can provide alternative dataset templates to be run with `cldfbench new`. 
+Such templates are implemented by: +- a subclass of `cldfbench.Template`, +- which is advertised using an entry point `cldfbench.scaffold`: + +```python + entry_points={ + 'cldfbench.scaffold': [ + 'template_name=mypackage.scaffold:DerivedTemplate', + ], + }, +``` + + + + +%package help +Summary: Development documents and examples for cldfbench +Provides: python3-cldfbench-doc +%description help +# cldfbench +Tooling to create [CLDF](https://cldf.clld.org) datasets from existing data. + +[![Build Status](https://github.com/cldf/cldfbench/workflows/tests/badge.svg)](https://github.com/cldf/cldfbench/actions?query=workflow%3Atests) +[![Documentation Status](https://readthedocs.org/projects/cldfbench/badge/?version=latest)](https://cldfbench.readthedocs.io/en/latest/?badge=latest) +[![PyPI](https://img.shields.io/pypi/v/cldfbench.svg)](https://pypi.org/project/cldfbench) + + +## Overview + +This package provides tools to curate cross-linguistic data, with the goal of +packaging it as [CLDF](https://cldf.clld.org) datasets. + +In particular, it supports a workflow where: +- "raw" source data is downloaded to a `raw/` subdirectory, +- and subsequently converted to one or more CLDF datasets in a `cldf/` subdirectory, with the help of: + - configuration data in a `etc/` directory and + - custom Python code (a subclass of [`cldfbench.Dataset`](src/cldfbench/dataset.py) which implements the workflow actions). + +This workflow is supported via: +- a commandline interface `cldfbench` which calls the workflow actions as [subcommands](src/cldfbench/commands), +- a `cldfbench.Dataset` base class, which must be overwritten in a custom module + to hook custom code into the workflow. + +With this workflow and the separation of the data into three directories we want +to provide a workbench for transparently deriving CLDF data from data that has been +published before. 
In particular we want to delineate clearly: +- what forms part of the original or source data (`raw`), +- what kind of information is added by the curators of the CLDF dataset (`etc`) +- and what data was derived using the workbench (`cldf`). + + +### Further reading + +This paper introduces `cldfbench` and uses an extended, real-world example: + +> Forkel, R., & List, J.-M. (2020). CLDFBench: Give your cross-linguistic data a lift. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 6995-7002). Paris: European Language Resources Association (ELRA). [[PDF]](https://pure.mpg.de/pubman/item/item_3231858_1/component/file_3231859/shh2600.pdf) + + +## Installation + +`cldfbench` can be installed via `pip` - preferably in a +[virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) - by running: +```shell script +pip install cldfbench +``` + +`cldfbench` provides some functionality that relies on python +packages which are not needed for the core functionality. These are specified as [extras](https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies) and can be installed using syntax like: +```shell +pip install cldfbench[<extras>] +``` +where `<extras>` is a comma-separated list of names from the following list: +- `excel`: support for reading spreadsheet data. +- `glottolog`: support to access [Glottolog data](https://github.com/glottolog/glottolog). +- `concepticon`: support to access [Concepticon data](https://github.com/concepticon/concepticon-data). +- `clts`: support to access [CLTS data](https://github.com/cldf-clts/clts). 
+ + +## The command line interface `cldfbench` + +Installing the python package will also install a command `cldfbench` available on +the command line: +```shell script +$ cldfbench -h +usage: cldfbench [-h] [--log-level LOG_LEVEL] COMMAND ... + +optional arguments: + -h, --help show this help message and exit + --log-level LOG_LEVEL + log level [ERROR|WARN|INFO|DEBUG] (default: 20) + +available commands: + Run "COMAMND -h" to get help for a specific command. + + COMMAND + check Run generic CLDF checks + ... +``` + +As shown above, run `cldfbench -h` to get help, and `cldfbench COMMAND -h` to get +help on individual subcommands, e.g. `cldfbench new -h` to read about the usage +of the `new` subcommand. + + +### Dataset discovery + +Most `cldfbench` commands operate on an existing dataset (unlike `new`, which +creates a new one). Datasets can be discovered in two ways: + +1. Via the python module (i.e. the `*.py` file, containing the `Dataset` subclass). + To use this mode of discovery, pass the path to the python module + as `DATASET` argument, when required by a command. + +2. Via [entry point](https://packaging.python.org/specifications/entry-points/) and + dataset ID. To use this mode, specify the name of the entry point as value of + the `--entry-point` option (or use the default name `cldfbench.dataset`) and + the `Dataset.id` as `DATASET` argument. + +Discovery via entry point is particularly useful for commands that can operate +on multiple datasets. To select **all** datasets advertising a given entry point, +pass `"_"` (i.e. an underscore) as `DATASET` argument. + + +## Workflow + +For a full example of the `cldfbench` curation workflow, see [the tutorial](doc/tutorial.md). 
+ + +### Creating a skeleton for a new dataset directory + +A directory containing stub entries for a dataset can be created by running + +```bash +cldfbench new +``` + +This will create the following layout (where `<ID>` stands for the chosen dataset ID): +``` +<ID>/ +├── cldf # A stub directory for the CLDF data +│   └── README.md +├── cldfbench_<ID>.py # The python module, providing the Dataset subclass +├── etc # A stub directory for the configuration data +│   └── README.md +├── metadata.json # The metadata provided to the subcommand serialized as JSON +├── raw # A stub directory for the raw data +│   └── README.md +├── setup.cfg # Python setup config, providing defaults for test integration +├── setup.py # Python setup file, making the dataset "installable" +├── test.py # The python code to run for dataset validation +└── .travis.yml # Integrate the validation with Travis-CI +``` + + +### Implementing CLDF creation + +`cldfbench` provides tools to make CLDF creation simple. Still, each dataset is +different, and so each dataset will have to provide its own custom code to do so. +This custom code goes into the `cmd_makecldf` method of the `Dataset` subclass in +the dataset's python module. +(See also the [API documentation of `cldfbench.Dataset`](https://cldfbench.readthedocs.io/en/latest/dataset.html).) + +Typically, this code will make use of one or more +[`cldfbench.CLDFSpec`](src/cldfbench/cldf.py) instances, which describe what kind of CLDF to create. A `CLDFSpec` also gives access to a +[`cldfbench.CLDFWriter`](src/cldfbench/cldf.py) instance, which wraps a `pycldf.Dataset`. + +The main interfaces to these objects are: +- `cldfbench.Dataset.cldf_specs`: a method returning specifications of all CLDF datasets + that are created by the dataset, +- `cldfbench.Dataset.cldf_writer`: a method returning an initialized `CLDFWriter` + associated with a particular `CLDFSpec`. 
+ +`cldfbench` supports several scenarios of CLDF creation: +- The typical use case is turning raw data into a single CLDF dataset. This would + require instantiating one `CLDFWriter` in the `cmd_makecldf` method, and + the defaults of `CLDFSpec` will probably be ok. Since this is the most common and + simplest case, it is supported with some extra "sugar": The initialized `CLDFWriter` + is available as `args.writer` when `cmd_makecldf` is called. +- But it is also possible to create multiple CLDF datasets: + - For a dataset containing both lexical and typological data, it may be appropriate + to create a `Wordlist` and a `StructureDataset`. To do so, one would have to + call `cldf_writer` twice, passing in an appropriate `CLDFSpec` each time. Note that if + both CLDF datasets are created in the same directory, they can share the + `LanguageTable` - but would have to specify distinct file names for the + `ParameterTable`, passing distinct values to `CLDFSpec.data_fnames`. + - When creating multiple datasets of the same CLDF module, e.g. to split a large dataset into smaller chunks, care must be taken to also disambiguate the name + of the metadata file, passing distinct values to `CLDFSpec.metadata_fname`. + +When creating CLDF, it is also often useful to have standard reference catalogs +accessible, in particular Glottolog. See the section on [Catalogs](#catalogs) for +a description of how this is supported by `cldfbench`. + + +### Catalogs + +Linking data to reference catalogs is a major goal of CLDF, thus `cldfbench` +provides tools to make catalog access and maintenance easier. Catalog data must be +accessible in local clones of the data repository. `cldfbench` provides commands: +- `catconfig` to create the clones and make them known through a configuration file, +- `catinfo` to get an overview of the installed catalogs and their versions, +- `catupdate` to update local clones from the upstream repositories.
+ +See: + +- https://cldfbench.readthedocs.io/en/latest//catalogs.html + +for a list of reference catalogs which are currently supported in `cldfbench`. + + +### Curating a dataset on GitHub + +One of the design goals of CLDF was to specify a data format that plays well with +version control. Thus, it's natural - and actually recommended - to curate a CLDF +dataset in a version controlled repository. The most popular way to do this in a +collaborative fashion is by using a [git](https://git-scm.com/) repository hosted on +[GitHub](https://github.com). + +The directory layout supported by `cldfbench` caters to this use case in several ways: +- Each directory contains a file `README.md`, which will be rendered as human readable + description when browsing the repository on GitHub. +- The file `.travis.yml` contains the configuration for hooking up a repository with + [Travis CI](https://www.travis-ci.org/), to provide continuous consistency checking + of the data. + + +### Archiving a dataset with Zenodo + +Curating a dataset on GitHub also provides a simple way to archive and publish +released versions of the data. You can hook up your repository with [Zenodo](https://zenodo.org) (following [this guide](https://guides.github.com/activities/citable-code/)). Then, Zenodo will pick up any released package, assign a DOI to it, archive it and +make it accessible in the long-term. + +Some notes: +- Hook-up with Zenodo requires the repository to be public (not private). +- You should consider using an institutional account on GitHub and Zenodo to associate the repository with. Currently, only the user account registering a repository on Zenodo can change any metadata of releases later on. +- Once released and archived with Zenodo, it's a good idea to add the DOI assigned by Zenodo to the release description on GitHub. +- To make sure a release is picked up by Zenodo, the version number must start with a letter, e.g. "v1.0" - **not** "1.0".
+ +Thus, with a setup as described here, you can make sure you create [FAIR data](https://en.wikipedia.org/wiki/FAIR_data). + + +## Extending `cldfbench` + +`cldfbench` can be extended or built-upon in various ways - typically by customizing core functionality in new python packages. To support particular types of raw data, you might want a custom `Dataset` class, or to support a particular type of CLDF data, you would customize `CLDFWriter`. + +In addition to extending `cldfbench` using the standard methods of object-oriented programming, there are two more ways of extending `cldfbench`: + + +### Commands + +A python package (or a dataset) can provide additional subcommands to be run from `cldfbench`. +For more info see the [`commands.README`](src/cldfbench/commands/README.md). + + +### Custom dataset templates + +A python package can provide alternative dataset templates to be run with `cldfbench new`. +Such templates are implemented by: +- a subclass of `cldfbench.Template`, +- which is advertised using an entry point `cldfbench.scaffold`: + +```python + entry_points={ + 'cldfbench.scaffold': [ + 'template_name=mypackage.scaffold:DerivedTemplate', + ], + }, +``` + + + + +%prep +%autosetup -n cldfbench-1.13.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ 
-d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-cldfbench -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 31 2023 Python_Bot - 1.13.0-1 +- Package Spec generated diff --git a/sources b/sources new file mode 100644 index 0000000..fad3a47 --- /dev/null +++ b/sources @@ -0,0 +1 @@ +e82b71ffa3ed6b2fac06acc795feaed8 cldfbench-1.13.0.tar.gz -- cgit v1.2.3