author | CoprDistGit <infra@openeuler.org> | 2023-06-20 04:25:48 +0000 |
---|---|---|
committer | CoprDistGit <infra@openeuler.org> | 2023-06-20 04:25:48 +0000 |
commit | bcdc35bfb0cdf844ca91f7b32fe35f5913d8889c (patch) | |
tree | 5c2b062f29984db66a337de1ce1f47dc99a24ac5 | |
parent | 9d0668f5d7e2c437c1688f495c5cd6e6b8884d8c (diff) |
automatic import of python-fuzzycat (branch: openeuler20.03)
-rw-r--r-- | .gitignore | 1 |
-rw-r--r-- | python-fuzzycat.spec | 287 |
-rw-r--r-- | sources | 1 |
3 files changed, 289 insertions, 0 deletions
@@ -0,0 +1 @@
+/fuzzycat-0.1.23.tar.gz
diff --git a/python-fuzzycat.spec b/python-fuzzycat.spec
new file mode 100644
index 0000000..7566f9b
--- /dev/null
+++ b/python-fuzzycat.spec
@@ -0,0 +1,287 @@
+%global _empty_manifest_terminate_build 0
+Name: python-fuzzycat
+Version: 0.1.23
+Release: 1
+Summary: Fuzzy matching utilities for scholarly metadata
+License: MIT License
+URL: https://github.com/miku/fuzzycat
+Source0: https://mirrors.aliyun.com/pypi/web/packages/ff/91/19a75e56b496384ca5635de7640cc50ef0de75853b1ce3c758ea6e85cdb0/fuzzycat-0.1.23.tar.gz
+BuildArch: noarch
+
+Requires: python3-dynaconf
+Requires: python3-elasticsearch
+Requires: python3-elasticsearch-dsl
+Requires: python3-fatcat-openapi-client
+Requires: python3-ftfy
+Requires: python3-glom
+Requires: python3-grobid-tei-xml
+Requires: python3-jellyfish
+Requires: python3-pyyaml
+Requires: python3-regex
+Requires: python3-requests
+Requires: python3-thefuzz
+Requires: python3-toml
+Requires: python3-unidecode
+Requires: python3-zstandard
+Requires: python3-ipython
+Requires: python3-isort
+Requires: python3-mypy
+Requires: python3-pylint
+Requires: python3-pytest
+Requires: python3-pytest-cov
+Requires: python3-twine
+Requires: python3-yapf
+
+%description
+
+This Python library contains routines for finding near-duplicate bibliographic
+entities (primarily research papers), and estimating whether two metadata
+records describe the same work (or variations of the same work). Some routines
+are designed to work "offline" with batches of billions of sorted metadata
+records, and others are designed to work "online" making queries against hosted
+web services and catalogs.
+`fuzzycat` was originally developed by Martin Czygan at the Internet Archive,
+and is used in the construction of a [citation
+graph](https://gitlab.com/internetarchive/refcat) and to identify duplicate
+records in the [fatcat.wiki](https://fatcat.wiki) catalog and
+[scholar.archive.org](https://scholar.archive.org) search index.
+**DISCLAIMER:** this tool is still under development, as indicated by the "0"
+major version. The interface, semantics, and behavior are likely to be tweaked.
+## Quickstart
+Inside a `virtualenv` (or similar), install with [pip](https://pypi.org/project/pip/):
+```
+pip install fuzzycat
+```
+The `fuzzycat.simple` module contains high-level helpers which query Internet
+Archive hosted services:
+    import elasticsearch
+    from fuzzycat.simple import *
+    es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")
+    # parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
+    # then queries Elasticsearch (at https://search.fatcat.wiki),
+    # then scores candidates against latest catalog record fetched from
+    # https://api.fatcat.wiki
+    best_match = closest_fuzzy_unstructured_match(
+        """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
+        es_client=es_client)
+    print(best_match)
+    # FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})
+    # same as above, but without the GROBID parsing, and returns multiple results
+    matches = close_fuzzy_biblio_matches(
+        dict(
+            title="Mesh migration following abdominal hernia repair: a comprehensive review",
+            first_author="Cunningham",
+            year=2019,
+            journal="Hernia",
+        ),
+        es_client=es_client,
+    )
+A CLI tool is included for processing records in UNIX stdin/stdout pipelines:
+    # print usage
+    python -m fuzzycat
+## Features and Use-Cases
+The [refcat project](https://gitlab.com/internetarchive/refcat) builds on top
+of this library to build a citation graph by processing billions of structured
+and unstructured reference records extracted from scholarly papers (note: jfor
+performance critical parts, some code has been ported to Go, albeit the test
+suite is shared between the Python and Go implementations).
+Automated imports of metadata records into the fatcat catalog use fuzzycat to
+filter new metadata which look like duplicates of existing records from other
+sources.
+In conjunction with standard command-line tools (like `sort`), fatcat bulk
+metadata snapshots can be clustered and reduced into groups to flag duplicate
+records for merging.
+Extracted reference strings from any source (webpages, books, papers, wikis,
+databases, etc) can be resolved against the fatcat catalog of scholarly papers.
+## Support and Acknowledgements
+Work on this software received support from the Andrew W. Mellon Foundation
+through multiple phases of the ["Ensuring the Persistent Access of Open Access
+Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
+
+%package -n python3-fuzzycat
+Summary: Fuzzy matching utilities for scholarly metadata
+Provides: python-fuzzycat
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-fuzzycat
+
+This Python library contains routines for finding near-duplicate bibliographic
+entities (primarily research papers), and estimating whether two metadata
+records describe the same work (or variations of the same work). Some routines
+are designed to work "offline" with batches of billions of sorted metadata
+records, and others are designed to work "online" making queries against hosted
+web services and catalogs.
+`fuzzycat` was originally developed by Martin Czygan at the Internet Archive,
+and is used in the construction of a [citation
+graph](https://gitlab.com/internetarchive/refcat) and to identify duplicate
+records in the [fatcat.wiki](https://fatcat.wiki) catalog and
+[scholar.archive.org](https://scholar.archive.org) search index.
+**DISCLAIMER:** this tool is still under development, as indicated by the "0"
+major version. The interface, semantics, and behavior are likely to be tweaked.
+## Quickstart
+Inside a `virtualenv` (or similar), install with [pip](https://pypi.org/project/pip/):
+```
+pip install fuzzycat
+```
+The `fuzzycat.simple` module contains high-level helpers which query Internet
+Archive hosted services:
+    import elasticsearch
+    from fuzzycat.simple import *
+    es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")
+    # parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
+    # then queries Elasticsearch (at https://search.fatcat.wiki),
+    # then scores candidates against latest catalog record fetched from
+    # https://api.fatcat.wiki
+    best_match = closest_fuzzy_unstructured_match(
+        """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
+        es_client=es_client)
+    print(best_match)
+    # FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})
+    # same as above, but without the GROBID parsing, and returns multiple results
+    matches = close_fuzzy_biblio_matches(
+        dict(
+            title="Mesh migration following abdominal hernia repair: a comprehensive review",
+            first_author="Cunningham",
+            year=2019,
+            journal="Hernia",
+        ),
+        es_client=es_client,
+    )
+A CLI tool is included for processing records in UNIX stdin/stdout pipelines:
+    # print usage
+    python -m fuzzycat
+## Features and Use-Cases
+The [refcat project](https://gitlab.com/internetarchive/refcat) builds on top
+of this library to build a citation graph by processing billions of structured
+and unstructured reference records extracted from scholarly papers (note: jfor
+performance critical parts, some code has been ported to Go, albeit the test
+suite is shared between the Python and Go implementations).
+Automated imports of metadata records into the fatcat catalog use fuzzycat to
+filter new metadata which look like duplicates of existing records from other
+sources.
+In conjunction with standard command-line tools (like `sort`), fatcat bulk
+metadata snapshots can be clustered and reduced into groups to flag duplicate
+records for merging.
+Extracted reference strings from any source (webpages, books, papers, wikis,
+databases, etc) can be resolved against the fatcat catalog of scholarly papers.
+## Support and Acknowledgements
+Work on this software received support from the Andrew W. Mellon Foundation
+through multiple phases of the ["Ensuring the Persistent Access of Open Access
+Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
+
+%package help
+Summary: Development documents and examples for fuzzycat
+Provides: python3-fuzzycat-doc
+%description help
+
+This Python library contains routines for finding near-duplicate bibliographic
+entities (primarily research papers), and estimating whether two metadata
+records describe the same work (or variations of the same work). Some routines
+are designed to work "offline" with batches of billions of sorted metadata
+records, and others are designed to work "online" making queries against hosted
+web services and catalogs.
+`fuzzycat` was originally developed by Martin Czygan at the Internet Archive,
+and is used in the construction of a [citation
+graph](https://gitlab.com/internetarchive/refcat) and to identify duplicate
+records in the [fatcat.wiki](https://fatcat.wiki) catalog and
+[scholar.archive.org](https://scholar.archive.org) search index.
+**DISCLAIMER:** this tool is still under development, as indicated by the "0"
+major version. The interface, semantics, and behavior are likely to be tweaked.
+## Quickstart
+Inside a `virtualenv` (or similar), install with [pip](https://pypi.org/project/pip/):
+```
+pip install fuzzycat
+```
+The `fuzzycat.simple` module contains high-level helpers which query Internet
+Archive hosted services:
+    import elasticsearch
+    from fuzzycat.simple import *
+    es_client = elasticsearch.Elasticsearch("https://search.fatcat.wiki:443")
+    # parses reference using GROBID (at https://grobid.qa.fatcat.wiki),
+    # then queries Elasticsearch (at https://search.fatcat.wiki),
+    # then scores candidates against latest catalog record fetched from
+    # https://api.fatcat.wiki
+    best_match = closest_fuzzy_unstructured_match(
+        """Cunningham HB, Weis JJ, Taveras LR, Huerta S. Mesh migration following abdominal hernia repair: a comprehensive review. Hernia. 2019 Apr;23(2):235-243. doi: 10.1007/s10029-019-01898-9. Epub 2019 Jan 30. PMID: 30701369.""",
+        es_client=es_client)
+    print(best_match)
+    # FuzzyReleaseMatchResult(status=<Status.EXACT: 'exact'>, reason=<Reason.DOI: 'doi'>, release={...})
+    # same as above, but without the GROBID parsing, and returns multiple results
+    matches = close_fuzzy_biblio_matches(
+        dict(
+            title="Mesh migration following abdominal hernia repair: a comprehensive review",
+            first_author="Cunningham",
+            year=2019,
+            journal="Hernia",
+        ),
+        es_client=es_client,
+    )
+A CLI tool is included for processing records in UNIX stdin/stdout pipelines:
+    # print usage
+    python -m fuzzycat
+## Features and Use-Cases
+The [refcat project](https://gitlab.com/internetarchive/refcat) builds on top
+of this library to build a citation graph by processing billions of structured
+and unstructured reference records extracted from scholarly papers (note: jfor
+performance critical parts, some code has been ported to Go, albeit the test
+suite is shared between the Python and Go implementations).
+Automated imports of metadata records into the fatcat catalog use fuzzycat to
+filter new metadata which look like duplicates of existing records from other
+sources.
+In conjunction with standard command-line tools (like `sort`), fatcat bulk
+metadata snapshots can be clustered and reduced into groups to flag duplicate
+records for merging.
+Extracted reference strings from any source (webpages, books, papers, wikis,
+databases, etc) can be resolved against the fatcat catalog of scholarly papers.
+## Support and Acknowledgements
+Work on this software received support from the Andrew W. Mellon Foundation
+through multiple phases of the ["Ensuring the Persistent Access of Open Access
+Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).
+Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).
+
+%prep
+%autosetup -n fuzzycat-0.1.23
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-fuzzycat -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 0.1.23-1
+- Package Spec generated
@@ -0,0 +1 @@
+c71c31be6f7a156c320c878de60d5214 fuzzycat-0.1.23.tar.gz