-rw-r--r-- | .gitignore | 1
-rw-r--r-- | python-nomenklatura.spec | 337
-rw-r--r-- | sources | 1
3 files changed, 339 insertions, 0 deletions
@@ -0,0 +1 @@
+/nomenklatura-2.11.0.tar.gz
diff --git a/python-nomenklatura.spec b/python-nomenklatura.spec
new file mode 100644
index 0000000..b5bf5f0
--- /dev/null
+++ b/python-nomenklatura.spec
@@ -0,0 +1,337 @@
+%global _empty_manifest_terminate_build 0
+Name: python-nomenklatura
+Version: 2.11.0
+Release: 1
+Summary: Make record linkages in followthemoney data.
+License: MIT
+URL: https://github.com/opensanctions/nomenklatura
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/03/af/11d98f613c587017491ccaa4a28059e7162ee6262262b767515d0d5df81c/nomenklatura-2.11.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-followthemoney
+Requires: python3-shortuuid
+Requires: python3-jellyfish
+Requires: python3-rich
+Requires: python3-textual
+Requires: python3-scikit-learn
+Requires: python3-click
+Requires: python3-wheel
+Requires: python3-twine
+Requires: python3-mypy
+Requires: python3-flake8
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-coverage
+Requires: python3-types-setuptools
+Requires: python3-types-requests
+
+%description
+# nomenklatura
+
+Nomenklatura de-duplicates and integrates different [Follow the Money](https://followthemoney.rtfd.org/) entities. It serves to clean up messy data and to find links between different datasets.
+
+
+
+## Usage
+
+You can install `nomenklatura` via PyPI:
+
+```bash
+$ pip install nomenklatura
+```
+
+### Command-line usage
+
+Much of the functionality of `nomenklatura` can be used as a command-line tool. In the following example, we'll assume that you have a file containing [Follow the Money](https://followthemoney.rtfd.org/) entities in your local directory, named `entities.ijson`. If you just want to try it out, you can use the file `tests/fixtures/donations.ijson` in this repository for testing (it contains German campaign finance data).
+
+With the file in place, you will cross-reference the entities to generate de-duplication candidates, then run the interactive de-duplication UI in your console, and eventually apply the judgements to generate a new file with merged entities:
+
+```bash
+# generate merge candidates using an in-memory index:
+$ nomenklatura xref -r resolver.json entities.ijson
+# note there is now a new file, `resolver.json` that contains de-duplication info.
+$ nomenklatura dedupe -r resolver.json entities.ijson
+# will pop up a user interface.
+$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
+# de-duplicated data goes into `merged.ijson`:
+$ cat entities.ijson | wc -l
+474
+$ cat merged.ijson | wc -l
+468
+```
+
+### Programmatic usage
+
+The command-line use of `nomenklatura` is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.
+
+* `nomenklatura.Dataset` - implements a basic dataset for describing a set of entities.
+* `nomenklatura.Loader` - a general purpose access mechanism for entities. By default, a `nomenklatura.FileLoader` is used to access entity data stored in files, but the loader can be subclassed to work with entities from a database system.
+* `nomenklatura.Index` - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
+* `nomenklatura.Resolver` - the core of the de-duplication process, the resolver is essentially a graph with edges made out of entity judgements. The resolver can be used to store judgements or get the canonical ID for a given entity.
+
+All of the API classes have extensive type annotations, which should make their integration in any modern Python API simpler.
+
+## Design
+
+This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:
+
+* Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
+* Build an in-memory inverted index of the entities for dedupe blocking
+* Generate merge candidates using the blocking index and FtM compare
+* Provide a file-based storage format for merge challenges and decisions
+* Provide a text-based user interface to let users make merge decisions
+
+Later on, the following might be added:
+
+* A web application to let users make merge decisions on the web
+
+### Resolver graph
+
+The key implementation detail of nomenklatura is the `Resolver`, a graph structure that
+manages user decisions regarding entity identity. Edges are `Judgements` of whether
+two entity IDs are the same, not the same, or undecided. The resolver implements an
+algorithm for computing connected components, which can then be used to find the best
+available ID for a cluster of entities. It can also be used to evaluate transitive
+judgements, e.g. if A <> B, and B = C, then we don't need to ask if A = C.
+
+## Reading
+
+* https://dedupe.readthedocs.org/en/latest/
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
+* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
+
+
+## Contact, contributions etc.
+
+This codebase is licensed under the terms of an MIT license (see LICENSE).
+
+We're keen for any contributions, bug fixes and feature suggestions, please use the GitHub issue tracker for this repository.
+
+Nomenklatura is currently developed thanks to a Prototypefund grant for [OpenSanctions](https://opensanctions.org). Previous iterations of the package were developed with support from [Knight-Mozilla OpenNews](http://opennews.org) and the [Open Knowledge Foundation Labs](http://okfnlabs.org).
+
+
+%package -n python3-nomenklatura
+Summary: Make record linkages in followthemoney data.
+Provides: python-nomenklatura
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-nomenklatura
+# nomenklatura
+
+Nomenklatura de-duplicates and integrates different [Follow the Money](https://followthemoney.rtfd.org/) entities. It serves to clean up messy data and to find links between different datasets.
+
+
+
+## Usage
+
+You can install `nomenklatura` via PyPI:
+
+```bash
+$ pip install nomenklatura
+```
+
+### Command-line usage
+
+Much of the functionality of `nomenklatura` can be used as a command-line tool. In the following example, we'll assume that you have a file containing [Follow the Money](https://followthemoney.rtfd.org/) entities in your local directory, named `entities.ijson`. If you just want to try it out, you can use the file `tests/fixtures/donations.ijson` in this repository for testing (it contains German campaign finance data).
+
+With the file in place, you will cross-reference the entities to generate de-duplication candidates, then run the interactive de-duplication UI in your console, and eventually apply the judgements to generate a new file with merged entities:
+
+```bash
+# generate merge candidates using an in-memory index:
+$ nomenklatura xref -r resolver.json entities.ijson
+# note there is now a new file, `resolver.json` that contains de-duplication info.
+$ nomenklatura dedupe -r resolver.json entities.ijson
+# will pop up a user interface.
+$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
+# de-duplicated data goes into `merged.ijson`:
+$ cat entities.ijson | wc -l
+474
+$ cat merged.ijson | wc -l
+468
+```
+
+### Programmatic usage
+
+The command-line use of `nomenklatura` is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.
+
+* `nomenklatura.Dataset` - implements a basic dataset for describing a set of entities.
+* `nomenklatura.Loader` - a general purpose access mechanism for entities. By default, a `nomenklatura.FileLoader` is used to access entity data stored in files, but the loader can be subclassed to work with entities from a database system.
+* `nomenklatura.Index` - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
+* `nomenklatura.Resolver` - the core of the de-duplication process, the resolver is essentially a graph with edges made out of entity judgements. The resolver can be used to store judgements or get the canonical ID for a given entity.
+
+All of the API classes have extensive type annotations, which should make their integration in any modern Python API simpler.
+
+## Design
+
+This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:
+
+* Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
+* Build an in-memory inverted index of the entities for dedupe blocking
+* Generate merge candidates using the blocking index and FtM compare
+* Provide a file-based storage format for merge challenges and decisions
+* Provide a text-based user interface to let users make merge decisions
+
+Later on, the following might be added:
+
+* A web application to let users make merge decisions on the web
+
+### Resolver graph
+
+The key implementation detail of nomenklatura is the `Resolver`, a graph structure that
+manages user decisions regarding entity identity. Edges are `Judgements` of whether
+two entity IDs are the same, not the same, or undecided. The resolver implements an
+algorithm for computing connected components, which can then be used to find the best
+available ID for a cluster of entities. It can also be used to evaluate transitive
+judgements, e.g. if A <> B, and B = C, then we don't need to ask if A = C.
+
+## Reading
+
+* https://dedupe.readthedocs.org/en/latest/
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
+* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
+
+
+## Contact, contributions etc.
+
+This codebase is licensed under the terms of an MIT license (see LICENSE).
+
+We're keen for any contributions, bug fixes and feature suggestions, please use the GitHub issue tracker for this repository.
+
+Nomenklatura is currently developed thanks to a Prototypefund grant for [OpenSanctions](https://opensanctions.org). Previous iterations of the package were developed with support from [Knight-Mozilla OpenNews](http://opennews.org) and the [Open Knowledge Foundation Labs](http://okfnlabs.org).
+
+
+%package help
+Summary: Development documents and examples for nomenklatura
+Provides: python3-nomenklatura-doc
+%description help
+# nomenklatura
+
+Nomenklatura de-duplicates and integrates different [Follow the Money](https://followthemoney.rtfd.org/) entities. It serves to clean up messy data and to find links between different datasets.
+
+
+
+## Usage
+
+You can install `nomenklatura` via PyPI:
+
+```bash
+$ pip install nomenklatura
+```
+
+### Command-line usage
+
+Much of the functionality of `nomenklatura` can be used as a command-line tool. In the following example, we'll assume that you have a file containing [Follow the Money](https://followthemoney.rtfd.org/) entities in your local directory, named `entities.ijson`. If you just want to try it out, you can use the file `tests/fixtures/donations.ijson` in this repository for testing (it contains German campaign finance data).
+
+With the file in place, you will cross-reference the entities to generate de-duplication candidates, then run the interactive de-duplication UI in your console, and eventually apply the judgements to generate a new file with merged entities:
+
+```bash
+# generate merge candidates using an in-memory index:
+$ nomenklatura xref -r resolver.json entities.ijson
+# note there is now a new file, `resolver.json` that contains de-duplication info.
+$ nomenklatura dedupe -r resolver.json entities.ijson
+# will pop up a user interface.
+$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
+# de-duplicated data goes into `merged.ijson`:
+$ cat entities.ijson | wc -l
+474
+$ cat merged.ijson | wc -l
+468
+```
+
+### Programmatic usage
+
+The command-line use of `nomenklatura` is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.
+
+* `nomenklatura.Dataset` - implements a basic dataset for describing a set of entities.
+* `nomenklatura.Loader` - a general purpose access mechanism for entities. By default, a `nomenklatura.FileLoader` is used to access entity data stored in files, but the loader can be subclassed to work with entities from a database system.
+* `nomenklatura.Index` - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
+* `nomenklatura.Resolver` - the core of the de-duplication process, the resolver is essentially a graph with edges made out of entity judgements. The resolver can be used to store judgements or get the canonical ID for a given entity.
+
+All of the API classes have extensive type annotations, which should make their integration in any modern Python API simpler.
+
+## Design
+
+This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:
+
+* Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
+* Build an in-memory inverted index of the entities for dedupe blocking
+* Generate merge candidates using the blocking index and FtM compare
+* Provide a file-based storage format for merge challenges and decisions
+* Provide a text-based user interface to let users make merge decisions
+
+Later on, the following might be added:
+
+* A web application to let users make merge decisions on the web
+
+### Resolver graph
+
+The key implementation detail of nomenklatura is the `Resolver`, a graph structure that
+manages user decisions regarding entity identity. Edges are `Judgements` of whether
+two entity IDs are the same, not the same, or undecided. The resolver implements an
+algorithm for computing connected components, which can then be used to find the best
+available ID for a cluster of entities. It can also be used to evaluate transitive
+judgements, e.g. if A <> B, and B = C, then we don't need to ask if A = C.
+
+## Reading
+
+* https://dedupe.readthedocs.org/en/latest/
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
+* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
+
+
+## Contact, contributions etc.
+
+This codebase is licensed under the terms of an MIT license (see LICENSE).
+
+We're keen for any contributions, bug fixes and feature suggestions, please use the GitHub issue tracker for this repository.
+
+Nomenklatura is currently developed thanks to a Prototypefund grant for [OpenSanctions](https://opensanctions.org). Previous iterations of the package were developed with support from [Knight-Mozilla OpenNews](http://opennews.org) and the [Open Knowledge Foundation Labs](http://okfnlabs.org).
+
+
+%prep
+%autosetup -n nomenklatura-2.11.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-nomenklatura -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon May 15 2023 Python_Bot <Python_Bot@openeuler.org> - 2.11.0-1
+- Package Spec generated
@@ -0,0 +1 @@
+17cf0c1f8d00078c51fdd3dae01c9b33 nomenklatura-2.11.0.tar.gz
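
Note on the "Resolver graph" passage in the package description above: the transitive-judgement idea (if A <> B and B = C, there is no need to ask about A and C) can be illustrated with a small union-find structure. This is only a rough sketch under stated assumptions. `ToyResolver`, `Judgement.POSITIVE`/`NEGATIVE`, `decide`, `canonical` and `implied` are hypothetical names invented for illustration, they are not nomenklatura's actual API, and picking the lexicographically smallest ID as the canonical one is an arbitrary rule chosen here.

```python
from enum import Enum


class Judgement(Enum):
    # Hypothetical labels for this sketch, not nomenklatura's own enum.
    POSITIVE = "positive"
    NEGATIVE = "negative"


class ToyResolver:
    """Illustrative stand-in for a resolver graph (assumption, not the real class)."""

    def __init__(self):
        self.parent = {}     # union-find forest over entity IDs
        self.negative = []   # (left, right) pairs judged "not the same"

    def _find(self, entity_id):
        # Locate the root of the connected component containing entity_id.
        self.parent.setdefault(entity_id, entity_id)
        root = entity_id
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[entity_id] != root:
            self.parent[entity_id], entity_id = root, self.parent[entity_id]
        return root

    def decide(self, left, right, judgement):
        # Record a judgement as an edge between two entity IDs.
        a, b = self._find(left), self._find(right)
        if judgement is Judgement.POSITIVE:
            # Merge the two components; the smaller ID becomes the root.
            root, child = sorted((a, b))
            self.parent[child] = root
        else:
            self.negative.append((left, right))

    def canonical(self, entity_id):
        # The component root doubles as the cluster's canonical ID.
        return self._find(entity_id)

    def implied(self, left, right):
        # Return a judgement that already follows from stored edges, or None.
        a, b = self._find(left), self._find(right)
        if a == b:
            return Judgement.POSITIVE
        for x, y in self.negative:
            if {self._find(x), self._find(y)} == {a, b}:
                return Judgement.NEGATIVE
        return None


if __name__ == "__main__":
    resolver = ToyResolver()
    resolver.decide("A", "B", Judgement.NEGATIVE)
    resolver.decide("B", "C", Judgement.POSITIVE)
    print(resolver.implied("A", "C"))  # A <> B and B = C, so A <> C is implied
    print(resolver.canonical("C"))
```

A real resolver would also have to handle undecided edges, conflicting judgements and persistence to a file such as `resolver.json`; none of that is modelled here.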