-rw-r--r-- .gitignore 1
-rw-r--r-- python-nomenklatura.spec 337
-rw-r--r-- sources 1
3 files changed, 339 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..bb6ae56 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/nomenklatura-2.11.0.tar.gz
diff --git a/python-nomenklatura.spec b/python-nomenklatura.spec
new file mode 100644
index 0000000..b5bf5f0
--- /dev/null
+++ b/python-nomenklatura.spec
@@ -0,0 +1,337 @@
+%global _empty_manifest_terminate_build 0
+Name: python-nomenklatura
+Version: 2.11.0
+Release: 1
+Summary: Make record linkages in followthemoney data.
+License: MIT
+URL: https://github.com/opensanctions/nomenklatura
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/03/af/11d98f613c587017491ccaa4a28059e7162ee6262262b767515d0d5df81c/nomenklatura-2.11.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-followthemoney
+Requires: python3-shortuuid
+Requires: python3-jellyfish
+Requires: python3-rich
+Requires: python3-textual
+Requires: python3-scikit-learn
+Requires: python3-click
+Requires: python3-wheel
+Requires: python3-twine
+Requires: python3-mypy
+Requires: python3-flake8
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-coverage
+Requires: python3-types-setuptools
+Requires: python3-types-requests
+
+%description
+# nomenklatura
+
+Nomenklatura de-duplicates and integrates different [Follow the Money](https://followthemoney.rtfd.org/) entities. It serves to clean up messy data and to find links between different datasets.
+
+![screenshot](./docs/screenshot.png)
+
+## Usage
+
+You can install `nomenklatura` via PyPI:
+
+```bash
+$ pip install nomenklatura
+```
+
+### Command-line usage
+
+Much of the functionality of `nomenklatura` can be used as a command-line tool. In the following example, we'll assume that you have a file containing [Follow the Money](https://followthemoney.rtfd.org/) entities in your local directory, named `entities.ijson`. If you just want to try it out, you can use the file `tests/fixtures/donations.ijson` in this repository for testing (it contains German campaign finance data).
+
+With the file in place, you will cross-reference the entities to generate de-duplication candidates, then run the interactive de-duplication UI in your console, and finally apply the judgements to generate a new file with merged entities:
+
+```bash
+# generate merge candidates using an in-memory index:
+$ nomenklatura xref -r resolver.json entities.ijson
+# note there is now a new file, `resolver.json`, that contains de-duplication info.
+$ nomenklatura dedupe -r resolver.json entities.ijson
+# will pop up a user interface.
+$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
+# de-duplicated data goes into `merged.ijson`:
+$ cat entities.ijson | wc -l
+474
+$ cat merged.ijson | wc -l
+468
+```
+
+### Programmatic usage
+
+The command-line use of `nomenklatura` is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.
+
+* `nomenklatura.Dataset` - implements a basic dataset for describing a set of entities.
+* `nomenklatura.Loader` - a general purpose access mechanism for entities. By default, a `nomenklatura.FileLoader` is used to access entity data stored in files, but the loader can be subclassed to work with entities from a database system.
+* `nomenklatura.Index` - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
+* `nomenklatura.Resolver` - the core of the de-duplication process. The resolver is essentially a graph whose edges are entity judgements. It can be used to store judgements or get the canonical ID for a given entity.
+
+All of the API classes have extensive type annotations, which should make their integration in any modern Python API simpler.
+
+## Design
+
+This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:
+
+* Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
+* Build an in-memory inverted index of the entities for dedupe blocking
+* Generate merge candidates using the blocking index and FtM compare
+* Provide a file-based storage format for merge challenges and decisions
+* Provide a text-based user interface to let users make merge decisions
+
+Later on, the following might be added:
+
+* A web application to let users make merge decisions on the web
+
+### Resolver graph
+
+The key implementation detail of nomenklatura is the `Resolver`, a graph structure that
+manages user decisions regarding entity identity. Edges are `Judgements` of whether
+two entity IDs are the same, not the same, or undecided. The resolver implements an
+algorithm for computing connected components, which can then be used to find the best
+available ID for a cluster of entities. It can also be used to evaluate transitive
+judgements, e.g. if A <> B, and B = C, then we don't need to ask if A = C.
+
+## Reading
+
+* https://dedupe.readthedocs.org/en/latest/
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
+* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
+
+
+## Contact, contributions etc.
+
+This codebase is licensed under the terms of an MIT license (see LICENSE).
+
+We're keen for any contributions, bug fixes and feature suggestions. Please use the GitHub issue tracker for this repository.
+
+Nomenklatura is currently developed thanks to a Prototypefund grant for [OpenSanctions](https://opensanctions.org). Previous iterations of the package were developed with support from [Knight-Mozilla OpenNews](http://opennews.org) and the [Open Knowledge Foundation Labs](http://okfnlabs.org).
+
+
+%package -n python3-nomenklatura
+Summary: Make record linkages in followthemoney data.
+Provides: python-nomenklatura
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-nomenklatura
+# nomenklatura
+
+Nomenklatura de-duplicates and integrates different [Follow the Money](https://followthemoney.rtfd.org/) entities. It serves to clean up messy data and to find links between different datasets.
+
+![screenshot](./docs/screenshot.png)
+
+## Usage
+
+You can install `nomenklatura` via PyPI:
+
+```bash
+$ pip install nomenklatura
+```
+
+### Command-line usage
+
+Much of the functionality of `nomenklatura` can be used as a command-line tool. In the following example, we'll assume that you have a file containing [Follow the Money](https://followthemoney.rtfd.org/) entities in your local directory, named `entities.ijson`. If you just want to try it out, you can use the file `tests/fixtures/donations.ijson` in this repository for testing (it contains German campaign finance data).
+
+With the file in place, you will cross-reference the entities to generate de-duplication candidates, then run the interactive de-duplication UI in your console, and finally apply the judgements to generate a new file with merged entities:
+
+```bash
+# generate merge candidates using an in-memory index:
+$ nomenklatura xref -r resolver.json entities.ijson
+# note there is now a new file, `resolver.json`, that contains de-duplication info.
+$ nomenklatura dedupe -r resolver.json entities.ijson
+# will pop up a user interface.
+$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
+# de-duplicated data goes into `merged.ijson`:
+$ cat entities.ijson | wc -l
+474
+$ cat merged.ijson | wc -l
+468
+```
+
+### Programmatic usage
+
+The command-line use of `nomenklatura` is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.
+
+* `nomenklatura.Dataset` - implements a basic dataset for describing a set of entities.
+* `nomenklatura.Loader` - a general purpose access mechanism for entities. By default, a `nomenklatura.FileLoader` is used to access entity data stored in files, but the loader can be subclassed to work with entities from a database system.
+* `nomenklatura.Index` - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
+* `nomenklatura.Resolver` - the core of the de-duplication process. The resolver is essentially a graph whose edges are entity judgements. It can be used to store judgements or get the canonical ID for a given entity.
+
+All of the API classes have extensive type annotations, which should make their integration in any modern Python API simpler.
+
+## Design
+
+This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:
+
+* Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
+* Build an in-memory inverted index of the entities for dedupe blocking
+* Generate merge candidates using the blocking index and FtM compare
+* Provide a file-based storage format for merge challenges and decisions
+* Provide a text-based user interface to let users make merge decisions
+
+Later on, the following might be added:
+
+* A web application to let users make merge decisions on the web
+
+### Resolver graph
+
+The key implementation detail of nomenklatura is the `Resolver`, a graph structure that
+manages user decisions regarding entity identity. Edges are `Judgements` of whether
+two entity IDs are the same, not the same, or undecided. The resolver implements an
+algorithm for computing connected components, which can then be used to find the best
+available ID for a cluster of entities. It can also be used to evaluate transitive
+judgements, e.g. if A <> B, and B = C, then we don't need to ask if A = C.
+
+## Reading
+
+* https://dedupe.readthedocs.org/en/latest/
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
+* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
+
+
+## Contact, contributions etc.
+
+This codebase is licensed under the terms of an MIT license (see LICENSE).
+
+We're keen for any contributions, bug fixes and feature suggestions. Please use the GitHub issue tracker for this repository.
+
+Nomenklatura is currently developed thanks to a Prototypefund grant for [OpenSanctions](https://opensanctions.org). Previous iterations of the package were developed with support from [Knight-Mozilla OpenNews](http://opennews.org) and the [Open Knowledge Foundation Labs](http://okfnlabs.org).
+
+
+%package help
+Summary: Development documents and examples for nomenklatura
+Provides: python3-nomenklatura-doc
+%description help
+# nomenklatura
+
+Nomenklatura de-duplicates and integrates different [Follow the Money](https://followthemoney.rtfd.org/) entities. It serves to clean up messy data and to find links between different datasets.
+
+![screenshot](./docs/screenshot.png)
+
+## Usage
+
+You can install `nomenklatura` via PyPI:
+
+```bash
+$ pip install nomenklatura
+```
+
+### Command-line usage
+
+Much of the functionality of `nomenklatura` can be used as a command-line tool. In the following example, we'll assume that you have a file containing [Follow the Money](https://followthemoney.rtfd.org/) entities in your local directory, named `entities.ijson`. If you just want to try it out, you can use the file `tests/fixtures/donations.ijson` in this repository for testing (it contains German campaign finance data).
+
+With the file in place, you will cross-reference the entities to generate de-duplication candidates, then run the interactive de-duplication UI in your console, and finally apply the judgements to generate a new file with merged entities:
+
+```bash
+# generate merge candidates using an in-memory index:
+$ nomenklatura xref -r resolver.json entities.ijson
+# note there is now a new file, `resolver.json`, that contains de-duplication info.
+$ nomenklatura dedupe -r resolver.json entities.ijson
+# will pop up a user interface.
+$ nomenklatura apply entities.ijson -o merged.ijson -r resolver.json
+# de-duplicated data goes into `merged.ijson`:
+$ cat entities.ijson | wc -l
+474
+$ cat merged.ijson | wc -l
+468
+```
+
+### Programmatic usage
+
+The command-line use of `nomenklatura` is targeted at small datasets which need to be de-duplicated. For more involved scenarios, the package also offers a Python API which can be used to control the semantics of de-duplication.
+
+* `nomenklatura.Dataset` - implements a basic dataset for describing a set of entities.
+* `nomenklatura.Loader` - a general purpose access mechanism for entities. By default, a `nomenklatura.FileLoader` is used to access entity data stored in files, but the loader can be subclassed to work with entities from a database system.
+* `nomenklatura.Index` - a full-text in-memory search index for FtM entities. In the application, this is used to block de-duplication candidates, but the index can also be used to drive an API etc.
+* `nomenklatura.Resolver` - the core of the de-duplication process. The resolver is essentially a graph whose edges are entity judgements. It can be used to store judgements or get the canonical ID for a given entity.
+
+All of the API classes have extensive type annotations, which should make their integration in any modern Python API simpler.
+
+## Design
+
+This package offers an implementation of an in-memory data deduplication framework centered around the FtM data model. The idea is the following workflow:
+
+* Accept FtM-shaped entities from a given loader (e.g. a JSON file, or a database)
+* Build an in-memory inverted index of the entities for dedupe blocking
+* Generate merge candidates using the blocking index and FtM compare
+* Provide a file-based storage format for merge challenges and decisions
+* Provide a text-based user interface to let users make merge decisions
+
+Later on, the following might be added:
+
+* A web application to let users make merge decisions on the web
+
+### Resolver graph
+
+The key implementation detail of nomenklatura is the `Resolver`, a graph structure that
+manages user decisions regarding entity identity. Edges are `Judgements` of whether
+two entity IDs are the same, not the same, or undecided. The resolver implements an
+algorithm for computing connected components, which can then be used to find the best
+available ID for a cluster of entities. It can also be used to evaluate transitive
+judgements, e.g. if A <> B, and B = C, then we don't need to ask if A = C.
+
+## Reading
+
+* https://dedupe.readthedocs.org/en/latest/
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
+* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
+* https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API
+
+
+## Contact, contributions etc.
+
+This codebase is licensed under the terms of an MIT license (see LICENSE).
+
+We're keen for any contributions, bug fixes and feature suggestions. Please use the GitHub issue tracker for this repository.
+
+Nomenklatura is currently developed thanks to a Prototypefund grant for [OpenSanctions](https://opensanctions.org). Previous iterations of the package were developed with support from [Knight-Mozilla OpenNews](http://opennews.org) and the [Open Knowledge Foundation Labs](http://okfnlabs.org).
+
+
+%prep
+%autosetup -n nomenklatura-2.11.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-nomenklatura -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon May 15 2023 Python_Bot <Python_Bot@openeuler.org> - 2.11.0-1
+- Package Spec generated
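
Note on the "Resolver graph" section of the `%description` in the spec above: the idea of a judgement graph with connected components and transitive judgements can be illustrated with a small toy model. This is only a sketch under stated assumptions. `TinyResolver` and its methods `decide`, `cluster`, `canonical` and `implied` are hypothetical names made up for this note and are not nomenklatura's actual API. Picking the smallest ID as the canonical one is likewise an arbitrary choice for illustration.

```python
from itertools import product


class TinyResolver:
    """Toy stand-in for a resolver graph; NOT nomenklatura's real API."""

    def __init__(self):
        self.positive = {}     # entity ID -> set of IDs judged "same"
        self.negative = set()  # frozenset pairs judged "not the same"

    def decide(self, left, right, same):
        """Record a judgement between two entity IDs."""
        self.positive.setdefault(left, set())
        self.positive.setdefault(right, set())
        if same:
            self.positive[left].add(right)
            self.positive[right].add(left)
        else:
            self.negative.add(frozenset((left, right)))

    def cluster(self, node):
        """Connected component of `node` over positive edges (iterative DFS)."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.positive.get(current, ()))
        return seen

    def canonical(self, node):
        """Pick one ID for the whole cluster (assumption: the smallest)."""
        return min(self.cluster(node))

    def implied(self, left, right):
        """True/False if a judgement follows transitively, None if undecided."""
        left_cluster, right_cluster = self.cluster(left), self.cluster(right)
        if right in left_cluster:
            return True
        for a, b in product(left_cluster, right_cluster):
            if frozenset((a, b)) in self.negative:
                return False
        return None


resolver = TinyResolver()
resolver.decide("A", "B", same=False)  # A <> B
resolver.decide("B", "C", same=True)   # B = C
print(resolver.implied("A", "C"))      # False: no need to ask the user
print(resolver.canonical("C"))         # B
print(resolver.implied("A", "D"))      # None: still undecided
```

With the two judgements from the description's example recorded, the pair (A, C) is answered by the graph itself, so only genuinely undecided pairs would need to be shown to a user.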
diff --git a/sources b/sources
new file mode 100644
index 0000000..17dca34
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+17cf0c1f8d00078c51fdd3dae01c9b33 nomenklatura-2.11.0.tar.gz
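
Note on the `sources` file: the single line added above pairs a 32-character hex digest with the tarball name, which looks like an MD5 checksum. A reviewer who wants to confirm a downloaded tarball matches could use something like the sketch below. It assumes the digest is MD5 and that each line of `sources` has the form `<digest> <filename>`. The helper names `md5_of`, `parse_sources` and `verify` are made up for this note.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sources(text):
    """Map file name -> expected digest from '<digest> <filename>' lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or unexpected lines
            digest, name = parts
            expected[name] = digest.lower()
    return expected


def verify(sources_path, directory="."):
    """Return {file name: True/False} for every entry in the sources file."""
    expected = parse_sources(Path(sources_path).read_text())
    return {
        name: md5_of(Path(directory) / name) == digest
        for name, digest in expected.items()
    }
```

Calling `verify("sources")` from a directory that holds both `sources` and `nomenklatura-2.11.0.tar.gz` would return a dictionary with one boolean per listed file.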