diff options
author | CoprDistGit <infra@openeuler.org> | 2023-04-10 16:49:05 +0000 |
---|---|---|
committer | CoprDistGit <infra@openeuler.org> | 2023-04-10 16:49:05 +0000 |
commit | 8f200d869f45b8372f61583d96312f672641713d (patch) | |
tree | 3c79c1ebe3934f1ed0ec7514c3fd4272017a390f | |
parent | f6b5a7804756967d2cd4a33e98cbbfdb0dd09673 (diff) |
automatic import of python-recordlinkage
-rw-r--r-- | .gitignore | 1 | ||||
-rw-r--r-- | python-recordlinkage.spec | 622 | ||||
-rw-r--r-- | sources | 1 |
3 files changed, 624 insertions, 0 deletions
@@ -0,0 +1 @@ +/recordlinkage-0.15.tar.gz diff --git a/python-recordlinkage.spec b/python-recordlinkage.spec new file mode 100644 index 0000000..b6914b5 --- /dev/null +++ b/python-recordlinkage.spec @@ -0,0 +1,622 @@ +%global _empty_manifest_terminate_build 0 +Name: python-recordlinkage +Version: 0.15 +Release: 1 +Summary: A record linkage toolkit for linking and deduplication +License: BSD-3-Clause +URL: https://github.com/J535D165/recordlinkage +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/75/7c/8deed2c61e0b77f856d785f022385871c6e25777119186071b6648f864d0/recordlinkage-0.15.tar.gz +BuildArch: noarch + +Requires: python3-jellyfish +Requires: python3-numpy +Requires: python3-pandas +Requires: python3-scipy +Requires: python3-scikit-learn +Requires: python3-joblib +Requires: python3-networkx +Requires: python3-bottleneck +Requires: python3-numexpr +Requires: python3-pytest +Requires: python3-networkx +Requires: python3-bottleneck +Requires: python3-numexpr + +%description +<div align="center"> + <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br> +</div> + +# RecordLinkage: powerful and modular Python record linkage toolkit + +[](https://pypi.python.org/pypi/recordlinkage/) +[](https://github.com/J535D165/recordlinkage/actions) +[](https://codecov.io/gh/J535D165/recordlinkage) +[](https://recordlinkage.readthedocs.io/en/latest/?badge=latest) +[](https://doi.org/10.5281/zenodo.3559042) + +**RecordLinkage** is a powerful and modular record linkage toolkit to +link records in or between data sources. The toolkit provides most of +the tools needed for record linkage and deduplication. The package +contains indexing methods, functions to compare records and classifiers. +The package is developed for research and the linking of small or medium +sized files. + +This project is inspired by the [Freely Extensible Biomedical Record +Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which +is a great project. In contrast with FEBRL, the recordlinkage project +uses [pandas](http://pandas.pydata.org/) and +[numpy](http://www.numpy.org/) for data handling and computations. The +use of *pandas*, a flexible and powerful data analysis and manipulation +library for Python, makes the record linkage process much easier and +faster. The extensive *pandas* library can be used to integrate your +record linkage directly into existing data manipulation projects. + +One of the aims of this project is to make an easily extensible record +linkage framework. It is easy to include your own indexing algorithms, +comparison/similarity measures and classifiers. + +## Basic linking example + +Import the `recordlinkage` module with all important tools for record +linkage and import the data manipulation framework **pandas**. + +``` python +import recordlinkage +import pandas +``` + +Load your data into pandas DataFrames. + +``` python +df_a = pandas.DataFrame(YOUR_FIRST_DATASET) +df_b = pandas.DataFrame(YOUR_SECOND_DATASET) +``` + +Comparing all record can be computationally intensive. Therefore, we +make set of candidate links with one of the built-in indexing techniques +like **blocking**. In this example, only pairs of records that agree on +the surname are returned. + +``` python +indexer = recordlinkage.Index() +indexer.block('surname') +candidate_links = indexer.index(df_a, df_b) +``` + +For each candidate link, compare the records with one of the comparison +or similarity algorithms in the Compare class. + +``` python +c = recordlinkage.Compare() + +c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85) +c.exact('sex', 'gender') +c.date('dob', 'date_of_birth') +c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7) +c.exact('place', 'placename') +c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5) + +# The comparison vectors +feature_vectors = c.compute(candidate_links, df_a, df_b) +``` + +Classify the candidate links into matching or distinct pairs based on +their comparison result with one of the [classification +algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html). +The following code classifies candidate pairs with a Logistic Regression +classifier. This (supervised machine learning) algorithm requires +training data. + +``` python +logrg = recordlinkage.LogisticRegressionClassifier() +logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS) + +logrg.predict(feature_vectors) +``` + +The following code shows the classification of candidate pairs with the +Expectation-Conditional Maximisation (ECM) algorithm. This variant of +the Expectation-Maximisation algorithm doesn't require training data +(unsupervised machine learning). + +``` python +ecm = recordlinkage.ECMClassifier() +ecm.fit_predict(feature_vectors) +``` + +## Main Features + +The main features of this Python record linkage toolkit are: + +- Clean and standardise data with easy to use tools +- Make pairs of records with smart indexing methods such as + **blocking** and **sorted neighbourhood indexing** +- Compare records with a large number of comparison and similarity + measures for different types of variables such as strings, numbers + and dates. +- Several classifications algorithms, both supervised and unsupervised + algorithms. +- Common record linkage evaluation tools +- Several built-in datasets. + +## Documentation + +The most recent documentation and API reference can be found at +[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/). +The documentation provides some basic usage examples like +[deduplication](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html) +and +[linking](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html) +census data. More examples are coming soon. If you do have interesting +examples to share, let us know. + +## Installation + +The Python Record linkage Toolkit requires Python 3.6 or higher. Install the +package easily with pip + +``` sh +pip install recordlinkage +``` + +Python 2.7 users can use version \<= 0.13, but it is advised to use +Python \>= 3.5. + +The toolkit depends on popular packages like +[Pandas](https://github.com/pydata/pandas), +[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and, +[Scikit-learn](http://scikit-learn.org/). A complete list of +dependencies can be found in the [installation +manual](https://recordlinkage.readthedocs.io/en/latest/installation.html) +as well as recommended and optional dependencies. + +## License + +The license for this record linkage tool is BSD-3-Clause. + +## Citation + +Please cite this package when being used in an academic context. Ensure +that the DOI and version match the installed version. Citatation styles +can be found on the publishers website +[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042). + +``` text +@software{de_bruin_j_2019_3559043, + author = {De Bruin, J}, + title = {{Python Record Linkage Toolkit: A toolkit for + record linkage and duplicate detection in Python}}, + month = dec, + year = 2019, + publisher = {Zenodo}, + version = {v0.14}, + doi = {10.5281/zenodo.3559043}, + url = {https://doi.org/10.5281/zenodo.3559043} +} +``` + +## Need help? + +Stuck on your record linkage code or problem? Any other questions? Don't +hestitate to send me an email (<jonathandebruinos@gmail.com>). + + + + +%package -n python3-recordlinkage +Summary: A record linkage toolkit for linking and deduplication +Provides: python-recordlinkage +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-recordlinkage +<div align="center"> + <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br> +</div> + +# RecordLinkage: powerful and modular Python record linkage toolkit + +[](https://pypi.python.org/pypi/recordlinkage/) +[](https://github.com/J535D165/recordlinkage/actions) +[](https://codecov.io/gh/J535D165/recordlinkage) +[](https://recordlinkage.readthedocs.io/en/latest/?badge=latest) +[](https://doi.org/10.5281/zenodo.3559042) + +**RecordLinkage** is a powerful and modular record linkage toolkit to +link records in or between data sources. The toolkit provides most of +the tools needed for record linkage and deduplication. The package +contains indexing methods, functions to compare records and classifiers. +The package is developed for research and the linking of small or medium +sized files. + +This project is inspired by the [Freely Extensible Biomedical Record +Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which +is a great project. In contrast with FEBRL, the recordlinkage project +uses [pandas](http://pandas.pydata.org/) and +[numpy](http://www.numpy.org/) for data handling and computations. The +use of *pandas*, a flexible and powerful data analysis and manipulation +library for Python, makes the record linkage process much easier and +faster. The extensive *pandas* library can be used to integrate your +record linkage directly into existing data manipulation projects. + +One of the aims of this project is to make an easily extensible record +linkage framework. It is easy to include your own indexing algorithms, +comparison/similarity measures and classifiers. + +## Basic linking example + +Import the `recordlinkage` module with all important tools for record +linkage and import the data manipulation framework **pandas**. + +``` python +import recordlinkage +import pandas +``` + +Load your data into pandas DataFrames. + +``` python +df_a = pandas.DataFrame(YOUR_FIRST_DATASET) +df_b = pandas.DataFrame(YOUR_SECOND_DATASET) +``` + +Comparing all record can be computationally intensive. Therefore, we +make set of candidate links with one of the built-in indexing techniques +like **blocking**. In this example, only pairs of records that agree on +the surname are returned. + +``` python +indexer = recordlinkage.Index() +indexer.block('surname') +candidate_links = indexer.index(df_a, df_b) +``` + +For each candidate link, compare the records with one of the comparison +or similarity algorithms in the Compare class. + +``` python +c = recordlinkage.Compare() + +c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85) +c.exact('sex', 'gender') +c.date('dob', 'date_of_birth') +c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7) +c.exact('place', 'placename') +c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5) + +# The comparison vectors +feature_vectors = c.compute(candidate_links, df_a, df_b) +``` + +Classify the candidate links into matching or distinct pairs based on +their comparison result with one of the [classification +algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html). +The following code classifies candidate pairs with a Logistic Regression +classifier. This (supervised machine learning) algorithm requires +training data. + +``` python +logrg = recordlinkage.LogisticRegressionClassifier() +logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS) + +logrg.predict(feature_vectors) +``` + +The following code shows the classification of candidate pairs with the +Expectation-Conditional Maximisation (ECM) algorithm. This variant of +the Expectation-Maximisation algorithm doesn't require training data +(unsupervised machine learning). + +``` python +ecm = recordlinkage.ECMClassifier() +ecm.fit_predict(feature_vectors) +``` + +## Main Features + +The main features of this Python record linkage toolkit are: + +- Clean and standardise data with easy to use tools +- Make pairs of records with smart indexing methods such as + **blocking** and **sorted neighbourhood indexing** +- Compare records with a large number of comparison and similarity + measures for different types of variables such as strings, numbers + and dates. +- Several classifications algorithms, both supervised and unsupervised + algorithms. +- Common record linkage evaluation tools +- Several built-in datasets. + +## Documentation + +The most recent documentation and API reference can be found at +[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/). +The documentation provides some basic usage examples like +[deduplication](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html) +and +[linking](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html) +census data. More examples are coming soon. If you do have interesting +examples to share, let us know. + +## Installation + +The Python Record linkage Toolkit requires Python 3.6 or higher. Install the +package easily with pip + +``` sh +pip install recordlinkage +``` + +Python 2.7 users can use version \<= 0.13, but it is advised to use +Python \>= 3.5. + +The toolkit depends on popular packages like +[Pandas](https://github.com/pydata/pandas), +[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and, +[Scikit-learn](http://scikit-learn.org/). A complete list of +dependencies can be found in the [installation +manual](https://recordlinkage.readthedocs.io/en/latest/installation.html) +as well as recommended and optional dependencies. + +## License + +The license for this record linkage tool is BSD-3-Clause. + +## Citation + +Please cite this package when being used in an academic context. Ensure +that the DOI and version match the installed version. Citatation styles +can be found on the publishers website +[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042). + +``` text +@software{de_bruin_j_2019_3559043, + author = {De Bruin, J}, + title = {{Python Record Linkage Toolkit: A toolkit for + record linkage and duplicate detection in Python}}, + month = dec, + year = 2019, + publisher = {Zenodo}, + version = {v0.14}, + doi = {10.5281/zenodo.3559043}, + url = {https://doi.org/10.5281/zenodo.3559043} +} +``` + +## Need help? + +Stuck on your record linkage code or problem? Any other questions? Don't +hestitate to send me an email (<jonathandebruinos@gmail.com>). + + + + +%package help +Summary: Development documents and examples for recordlinkage +Provides: python3-recordlinkage-doc +%description help +<div align="center"> + <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br> +</div> + +# RecordLinkage: powerful and modular Python record linkage toolkit + +[](https://pypi.python.org/pypi/recordlinkage/) +[](https://github.com/J535D165/recordlinkage/actions) +[](https://codecov.io/gh/J535D165/recordlinkage) +[](https://recordlinkage.readthedocs.io/en/latest/?badge=latest) +[](https://doi.org/10.5281/zenodo.3559042) + +**RecordLinkage** is a powerful and modular record linkage toolkit to +link records in or between data sources. The toolkit provides most of +the tools needed for record linkage and deduplication. The package +contains indexing methods, functions to compare records and classifiers. +The package is developed for research and the linking of small or medium +sized files. + +This project is inspired by the [Freely Extensible Biomedical Record +Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which +is a great project. In contrast with FEBRL, the recordlinkage project +uses [pandas](http://pandas.pydata.org/) and +[numpy](http://www.numpy.org/) for data handling and computations. The +use of *pandas*, a flexible and powerful data analysis and manipulation +library for Python, makes the record linkage process much easier and +faster. The extensive *pandas* library can be used to integrate your +record linkage directly into existing data manipulation projects. + +One of the aims of this project is to make an easily extensible record +linkage framework. It is easy to include your own indexing algorithms, +comparison/similarity measures and classifiers. + +## Basic linking example + +Import the `recordlinkage` module with all important tools for record +linkage and import the data manipulation framework **pandas**. + +``` python +import recordlinkage +import pandas +``` + +Load your data into pandas DataFrames. + +``` python +df_a = pandas.DataFrame(YOUR_FIRST_DATASET) +df_b = pandas.DataFrame(YOUR_SECOND_DATASET) +``` + +Comparing all record can be computationally intensive. Therefore, we +make set of candidate links with one of the built-in indexing techniques +like **blocking**. In this example, only pairs of records that agree on +the surname are returned. + +``` python +indexer = recordlinkage.Index() +indexer.block('surname') +candidate_links = indexer.index(df_a, df_b) +``` + +For each candidate link, compare the records with one of the comparison +or similarity algorithms in the Compare class. + +``` python +c = recordlinkage.Compare() + +c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85) +c.exact('sex', 'gender') +c.date('dob', 'date_of_birth') +c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7) +c.exact('place', 'placename') +c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5) + +# The comparison vectors +feature_vectors = c.compute(candidate_links, df_a, df_b) +``` + +Classify the candidate links into matching or distinct pairs based on +their comparison result with one of the [classification +algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html). +The following code classifies candidate pairs with a Logistic Regression +classifier. This (supervised machine learning) algorithm requires +training data. + +``` python +logrg = recordlinkage.LogisticRegressionClassifier() +logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS) + +logrg.predict(feature_vectors) +``` + +The following code shows the classification of candidate pairs with the +Expectation-Conditional Maximisation (ECM) algorithm. This variant of +the Expectation-Maximisation algorithm doesn't require training data +(unsupervised machine learning). + +``` python +ecm = recordlinkage.ECMClassifier() +ecm.fit_predict(feature_vectors) +``` + +## Main Features + +The main features of this Python record linkage toolkit are: + +- Clean and standardise data with easy to use tools +- Make pairs of records with smart indexing methods such as + **blocking** and **sorted neighbourhood indexing** +- Compare records with a large number of comparison and similarity + measures for different types of variables such as strings, numbers + and dates. +- Several classifications algorithms, both supervised and unsupervised + algorithms. +- Common record linkage evaluation tools +- Several built-in datasets. + +## Documentation + +The most recent documentation and API reference can be found at +[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/). +The documentation provides some basic usage examples like +[deduplication](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html) +and +[linking](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html) +census data. More examples are coming soon. If you do have interesting +examples to share, let us know. + +## Installation + +The Python Record linkage Toolkit requires Python 3.6 or higher. Install the +package easily with pip + +``` sh +pip install recordlinkage +``` + +Python 2.7 users can use version \<= 0.13, but it is advised to use +Python \>= 3.5. + +The toolkit depends on popular packages like +[Pandas](https://github.com/pydata/pandas), +[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and, +[Scikit-learn](http://scikit-learn.org/). A complete list of +dependencies can be found in the [installation +manual](https://recordlinkage.readthedocs.io/en/latest/installation.html) +as well as recommended and optional dependencies. + +## License + +The license for this record linkage tool is BSD-3-Clause. + +## Citation + +Please cite this package when being used in an academic context. Ensure +that the DOI and version match the installed version. Citatation styles +can be found on the publishers website +[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042). + +``` text +@software{de_bruin_j_2019_3559043, + author = {De Bruin, J}, + title = {{Python Record Linkage Toolkit: A toolkit for + record linkage and duplicate detection in Python}}, + month = dec, + year = 2019, + publisher = {Zenodo}, + version = {v0.14}, + doi = {10.5281/zenodo.3559043}, + url = {https://doi.org/10.5281/zenodo.3559043} +} +``` + +## Need help? + +Stuck on your record linkage code or problem? Any other questions? Don't +hestitate to send me an email (<jonathandebruinos@gmail.com>). + + + + +%prep +%autosetup -n recordlinkage-0.15 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-recordlinkage -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.15-1 +- Package Spec generated @@ -0,0 +1 @@ +6bea35f04181ee758717b5ea8328e604 recordlinkage-0.15.tar.gz |