 .gitignore           |   1 +
 python-fuzzycat.spec | 287 +++++++++++++++++++++++++++++++++++++++++++++++++++
 sources              |   1 +
 3 files changed, 289 insertions(+), 0 deletions(-)
diff --git a/.gitignore b/.gitignore
index e69de29..b14167a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/fuzzycat-0.1.23.tar.gz
diff --git a/python-fuzzycat.spec b/python-fuzzycat.spec
new file mode 100644
index 0000000..7566f9b
--- /dev/null
+++ b/python-fuzzycat.spec
@@ -0,0 +1,287 @@
+%global _empty_manifest_terminate_build 0
+Name: python-fuzzycat
+Version: 0.1.23
+Release: 1
+Summary: Fuzzy matching utilities for scholarly metadata
+License: MIT License
+URL: https://github.com/miku/fuzzycat
+Source0: https://mirrors.aliyun.com/pypi/web/packages/ff/91/19a75e56b496384ca5635de7640cc50ef0de75853b1ce3c758ea6e85cdb0/fuzzycat-0.1.23.tar.gz
+BuildArch: noarch
+
+Requires: python3-dynaconf
+Requires: python3-elasticsearch
+Requires: python3-elasticsearch-dsl
+Requires: python3-fatcat-openapi-client
+Requires: python3-ftfy
+Requires: python3-glom
+Requires: python3-grobid-tei-xml
+Requires: python3-jellyfish
+Requires: python3-pyyaml
+Requires: python3-regex
+Requires: python3-requests
+Requires: python3-thefuzz
+Requires: python3-toml
+Requires: python3-unidecode
+Requires: python3-zstandard
+Requires: python3-ipython
+Requires: python3-isort
+Requires: python3-mypy
+Requires: python3-pylint
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-twine
+Requires: python3-yapf
+
+%description
+![https://pypi.org/project/fuzzycat/](https://img.shields.io/pypi/v/fuzzycat?style=flat-square)
+This Python library contains routines for finding near-duplicate bibliographic
+entities (primarily research papers), and estimating whether two metadata
+records describe the same work (or variations of the same work). Some routines
+are designed to work "offline" with batches of billions of sorted metadata
+records, and others are designed to work "online" making queries against hosted
+web services and catalogs.
+`fuzzycat` was originally developed by Martin Czygan at the Internet Archive,
+and is used in the construction of a [citation
+graph](https://gitlab.com/internetarchive/refcat) and to identify duplicate
+records in the [fatcat.wiki](https://fatcat.wiki) catalog and
+[scholar.archive.org](https://scholar.archive.org) search index.
+**DISCLAIMER:** this tool is still under development, as indicated by the "0"
+major version. The interface, semantics, and behavior are likely to be tweaked.
+## Quickstart
+Inside a `virtualenv` (or similar), install with [pip](https://pypi.org/project/pip/):
+```
+pip install fuzzycat
+```
+The `fuzzycat.simple` module contains high-level helpers which query Internet
+Archive hosted services:
+ import elasticsearch
+ from fuzzycat.simple import *
+ es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")
+ # parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
+ # then queries Elasticsearch (at https://search.fatcat.wiki),
+ # then scores candidates against latest catalog record fetched from
+ # https://api.fatcat.wiki
+ best_match = closest_fuzzy_unstructured_match(
+ """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
+ es_client=es_client)
+ print(best_match)
+ # FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})
+ # same as above, but without the GROBID parsing, and returns multiple results
+ matches = close_fuzzy_biblio_matches(
+ dict(
+ title="Mesh migration following abdominal hernia repair: a comprehensive review",
+ first_author="Cunningham",
+ year=2019,
+ journal="Hernia",
+ ),
+ es_client=es_client,
+ )
+A CLI tool is included for processing records in UNIX stdin/stdout pipelines:
+ # print usage
+ python -m fuzzycat
+## Features and Use-Cases
+The [refcat project](https://gitlab.com/internetarchive/refcat) builds on top
+of this library to build a citation graph by processing billions of structured
+and unstructured reference records extracted from scholarly papers (note: for
+performance-critical parts, some code has been ported to Go, though the test
+suite is shared between the Python and Go implementations).
+Automated imports of metadata records into the fatcat catalog use fuzzycat to
+filter new metadata which look like duplicates of existing records from other
+sources.
+In conjunction with standard command-line tools (like `sort`), fatcat bulk
+metadata snapshots can be clustered and reduced into groups to flag duplicate
+records for merging.
+Extracted reference strings from any source (webpages, books, papers, wikis,
+databases, etc.) can be resolved against the fatcat catalog of scholarly papers.
+## Support and Acknowledgements
+Work on this software received support from the Andrew W. Mellon Foundation
+through multiple phases of the ["Ensuring the Persistent Access of Open Access
+Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
+
+%package -n python3-fuzzycat
+Summary: Fuzzy matching utilities for scholarly metadata
+Provides: python-fuzzycat
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-fuzzycat
+![https://pypi.org/project/fuzzycat/](https://img.shields.io/pypi/v/fuzzycat?style=flat-square)
+This Python library contains routines for finding near-duplicate bibliographic
+entities (primarily research papers), and estimating whether two metadata
+records describe the same work (or variations of the same work). Some routines
+are designed to work "offline" with batches of billions of sorted metadata
+records, and others are designed to work "online" making queries against hosted
+web services and catalogs.
+`fuzzycat` was originally developed by Martin Czygan at the Internet Archive,
+and is used in the construction of a [citation
+graph](https://gitlab.com/internetarchive/refcat) and to identify duplicate
+records in the [fatcat.wiki](https://fatcat.wiki) catalog and
+[scholar.archive.org](https://scholar.archive.org) search index.
+**DISCLAIMER:** this tool is still under development, as indicated by the "0"
+major version. The interface, semantics, and behavior are likely to be tweaked.
+## Quickstart
+Inside a `virtualenv` (or similar), install with [pip](https://pypi.org/project/pip/):
+```
+pip install fuzzycat
+```
+The `fuzzycat.simple` module contains high-level helpers which query Internet
+Archive hosted services:
+ import elasticsearch
+ from fuzzycat.simple import *
+ es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")
+ # parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
+ # then queries Elasticsearch (at https://search.fatcat.wiki),
+ # then scores candidates against latest catalog record fetched from
+ # https://api.fatcat.wiki
+ best_match = closest_fuzzy_unstructured_match(
+ """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
+ es_client=es_client)
+ print(best_match)
+ # FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})
+ # same as above, but without the GROBID parsing, and returns multiple results
+ matches = close_fuzzy_biblio_matches(
+ dict(
+ title="Mesh migration following abdominal hernia repair: a comprehensive review",
+ first_author="Cunningham",
+ year=2019,
+ journal="Hernia",
+ ),
+ es_client=es_client,
+ )
+A CLI tool is included for processing records in UNIX stdin/stdout pipelines:
+ # print usage
+ python -m fuzzycat
+## Features and Use-Cases
+The [refcat project](https://gitlab.com/internetarchive/refcat) builds on top
+of this library to build a citation graph by processing billions of structured
+and unstructured reference records extracted from scholarly papers (note: for
+performance-critical parts, some code has been ported to Go, though the test
+suite is shared between the Python and Go implementations).
+Automated imports of metadata records into the fatcat catalog use fuzzycat to
+filter new metadata which look like duplicates of existing records from other
+sources.
+In conjunction with standard command-line tools (like `sort`), fatcat bulk
+metadata snapshots can be clustered and reduced into groups to flag duplicate
+records for merging.
+Extracted reference strings from any source (webpages, books, papers, wikis,
+databases, etc.) can be resolved against the fatcat catalog of scholarly papers.
+## Support and Acknowledgements
+Work on this software received support from the Andrew W. Mellon Foundation
+through multiple phases of the ["Ensuring the Persistent Access of Open Access
+Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
+
+%package help
+Summary: Development documents and examples for fuzzycat
+Provides: python3-fuzzycat-doc
+%description help
+![https://pypi.org/project/fuzzycat/](https://img.shields.io/pypi/v/fuzzycat?style=flat-square)
+This Python library contains routines for finding near-duplicate bibliographic
+entities (primarily research papers), and estimating whether two metadata
+records describe the same work (or variations of the same work). Some routines
+are designed to work "offline" with batches of billions of sorted metadata
+records, and others are designed to work "online" making queries against hosted
+web services and catalogs.
+`fuzzycat` was originally developed by Martin Czygan at the Internet Archive,
+and is used in the construction of a [citation
+graph](https://gitlab.com/internetarchive/refcat) and to identify duplicate
+records in the [fatcat.wiki](https://fatcat.wiki) catalog and
+[scholar.archive.org](https://scholar.archive.org) search index.
+**DISCLAIMER:** this tool is still under development, as indicated by the "0"
+major version. The interface, semantics, and behavior are likely to be tweaked.
+## Quickstart
+Inside a `virtualenv` (or similar), install with [pip](https://pypi.org/project/pip/):
+```
+pip install fuzzycat
+```
+The `fuzzycat.simple` module contains high-level helpers which query Internet
+Archive hosted services:
+ import elasticsearch
+ from fuzzycat.simple import *
+ es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")
+ # parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
+ # then queries Elasticsearch (at https://search.fatcat.wiki),
+ # then scores candidates against latest catalog record fetched from
+ # https://api.fatcat.wiki
+ best_match = closest_fuzzy_unstructured_match(
+ """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
+ es_client=es_client)
+ print(best_match)
+ # FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})
+ # same as above, but without the GROBID parsing, and returns multiple results
+ matches = close_fuzzy_biblio_matches(
+ dict(
+ title="Mesh migration following abdominal hernia repair: a comprehensive review",
+ first_author="Cunningham",
+ year=2019,
+ journal="Hernia",
+ ),
+ es_client=es_client,
+ )
+A CLI tool is included for processing records in UNIX stdin/stdout pipelines:
+ # print usage
+ python -m fuzzycat
+## Features and Use-Cases
+The [refcat project](https://gitlab.com/internetarchive/refcat) builds on top
+of this library to build a citation graph by processing billions of structured
+and unstructured reference records extracted from scholarly papers (note: for
+performance-critical parts, some code has been ported to Go, though the test
+suite is shared between the Python and Go implementations).
+Automated imports of metadata records into the fatcat catalog use fuzzycat to
+filter new metadata which look like duplicates of existing records from other
+sources.
+In conjunction with standard command-line tools (like `sort`), fatcat bulk
+metadata snapshots can be clustered and reduced into groups to flag duplicate
+records for merging.
+Extracted reference strings from any source (webpages, books, papers, wikis,
+databases, etc.) can be resolved against the fatcat catalog of scholarly papers.
+## Support and Acknowledgements
+Work on this software received support from the Andrew W. Mellon Foundation
+through multiple phases of the ["Ensuring the Persistent Access of Open Access
+Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
+
+%prep
+%autosetup -n fuzzycat-0.1.23
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-fuzzycat -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 0.1.23-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..a61e891
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+c71c31be6f7a156c320c878de60d5214 fuzzycat-0.1.23.tar.gz