-rw-r--r--  .gitignore                 1
-rw-r--r--  python-recordlinkage.spec  622
-rw-r--r--  sources                    1
3 files changed, 624 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..9684095 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/recordlinkage-0.15.tar.gz
diff --git a/python-recordlinkage.spec b/python-recordlinkage.spec
new file mode 100644
index 0000000..b6914b5
--- /dev/null
+++ b/python-recordlinkage.spec
@@ -0,0 +1,622 @@
+%global _empty_manifest_terminate_build 0
+Name: python-recordlinkage
+Version: 0.15
+Release: 1
+Summary: A record linkage toolkit for linking and deduplication
+License: BSD-3-Clause
+URL: https://github.com/J535D165/recordlinkage
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/75/7c/8deed2c61e0b77f856d785f022385871c6e25777119186071b6648f864d0/recordlinkage-0.15.tar.gz
+BuildArch: noarch
+
+Requires: python3-jellyfish
+Requires: python3-numpy
+Requires: python3-pandas
+Requires: python3-scipy
+Requires: python3-scikit-learn
+Requires: python3-joblib
+Requires: python3-networkx
+Requires: python3-bottleneck
+Requires: python3-numexpr
+Requires: python3-pytest
+Requires: python3-networkx
+Requires: python3-bottleneck
+Requires: python3-numexpr
+
+%description
+<div align="center">
+ <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br>
+</div>
+
+# RecordLinkage: powerful and modular Python record linkage toolkit
+
+[![Pypi Version](https://badge.fury.io/py/recordlinkage.svg)](https://pypi.python.org/pypi/recordlinkage/)
+[![Github Actions CI Status](https://github.com/J535D165/recordlinkage/workflows/tests/badge.svg?branch=master)](https://github.com/J535D165/recordlinkage/actions)
+[![Code Coverage](https://codecov.io/gh/J535D165/recordlinkage/branch/master/graph/badge.svg)](https://codecov.io/gh/J535D165/recordlinkage)
+[![Documentation Status](https://readthedocs.org/projects/recordlinkage/badge/?version=latest)](https://recordlinkage.readthedocs.io/en/latest/?badge=latest)
+[![Zenodo DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3559042.svg)](https://doi.org/10.5281/zenodo.3559042)
+
+**RecordLinkage** is a powerful and modular record linkage toolkit to
+link records in or between data sources. The toolkit provides most of
+the tools needed for record linkage and deduplication. The package
+contains indexing methods, functions to compare records and classifiers.
+The package is developed for research and the linking of small or medium
+sized files.
+
+This project is inspired by the [Freely Extensible Biomedical Record
+Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which
+is a great project. In contrast with FEBRL, the recordlinkage project
+uses [pandas](http://pandas.pydata.org/) and
+[numpy](http://www.numpy.org/) for data handling and computations. The
+use of *pandas*, a flexible and powerful data analysis and manipulation
+library for Python, makes the record linkage process much easier and
+faster. The extensive *pandas* library can be used to integrate your
+record linkage directly into existing data manipulation projects.
+
+One of the aims of this project is to make an easily extensible record
+linkage framework. It is easy to include your own indexing algorithms,
+comparison/similarity measures and classifiers.
+
+## Basic linking example
+
+Import the `recordlinkage` module with all important tools for record
+linkage and import the data manipulation framework **pandas**.
+
+``` python
+import recordlinkage
+import pandas
+```
+
+Load your data into pandas DataFrames.
+
+``` python
+df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
+df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
+```
+
+Comparing all records can be computationally intensive. Therefore, we
+make a set of candidate links with one of the built-in indexing techniques
+like **blocking**. In this example, only pairs of records that agree on
+the surname are returned.
+
+``` python
+indexer = recordlinkage.Index()
+indexer.block('surname')
+candidate_links = indexer.index(df_a, df_b)
+```
+
+For each candidate link, compare the records with one of the comparison
+or similarity algorithms in the Compare class.
+
+``` python
+c = recordlinkage.Compare()
+
+c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
+c.exact('sex', 'gender')
+c.date('dob', 'date_of_birth')
+c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
+c.exact('place', 'placename')
+c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)
+
+# The comparison vectors
+feature_vectors = c.compute(candidate_links, df_a, df_b)
+```
+
+Classify the candidate links into matching or distinct pairs based on
+their comparison result with one of the [classification
+algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html).
+The following code classifies candidate pairs with a Logistic Regression
+classifier. This (supervised machine learning) algorithm requires
+training data.
+
+``` python
+logrg = recordlinkage.LogisticRegressionClassifier()
+logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)
+
+logrg.predict(feature_vectors)
+```
+
+The following code shows the classification of candidate pairs with the
+Expectation-Conditional Maximisation (ECM) algorithm. This variant of
+the Expectation-Maximisation algorithm doesn't require training data
+(unsupervised machine learning).
+
+``` python
+ecm = recordlinkage.ECMClassifier()
+ecm.fit_predict(feature_vectors)
+```
+
+## Main Features
+
+The main features of this Python record linkage toolkit are:
+
+- Clean and standardise data with easy to use tools
+- Make pairs of records with smart indexing methods such as
+ **blocking** and **sorted neighbourhood indexing**
+- Compare records with a large number of comparison and similarity
+ measures for different types of variables such as strings, numbers
+ and dates.
+- Several classification algorithms, both supervised and
+ unsupervised.
+- Common record linkage evaluation tools
+- Several built-in datasets.
+
+## Documentation
+
+The most recent documentation and API reference can be found at
+[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/).
+The documentation provides some basic usage examples like
+[deduplication](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html)
+and
+[linking](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html)
+census data. More examples are coming soon. If you do have interesting
+examples to share, let us know.
+
+## Installation
+
+The Python Record linkage Toolkit requires Python 3.6 or higher. Install the
+package easily with pip
+
+``` sh
+pip install recordlinkage
+```
+
+Python 2.7 users can use version \<= 0.13, but it is advised to use
+Python \>= 3.5.
+
+The toolkit depends on popular packages like
+[Pandas](https://github.com/pydata/pandas),
+[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and,
+[Scikit-learn](http://scikit-learn.org/). A complete list of
+dependencies can be found in the [installation
+manual](https://recordlinkage.readthedocs.io/en/latest/installation.html)
+as well as recommended and optional dependencies.
+
+## License
+
+The license for this record linkage tool is BSD-3-Clause.
+
+## Citation
+
+Please cite this package when it is used in an academic context. Ensure
+that the DOI and version match the installed version. Citation styles
+can be found on the publisher's website
+[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042).
+
+``` text
+@software{de_bruin_j_2019_3559043,
+ author = {De Bruin, J},
+ title = {{Python Record Linkage Toolkit: A toolkit for
+ record linkage and duplicate detection in Python}},
+ month = dec,
+ year = 2019,
+ publisher = {Zenodo},
+ version = {v0.14},
+ doi = {10.5281/zenodo.3559043},
+ url = {https://doi.org/10.5281/zenodo.3559043}
+}
+```
+
+## Need help?
+
+Stuck on your record linkage code or problem? Any other questions? Don't
+hesitate to send me an email (<jonathandebruinos@gmail.com>).
+
+
+
+
+%package -n python3-recordlinkage
+Summary: A record linkage toolkit for linking and deduplication
+Provides: python-recordlinkage
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-recordlinkage
+<div align="center">
+ <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br>
+</div>
+
+# RecordLinkage: powerful and modular Python record linkage toolkit
+
+[![Pypi Version](https://badge.fury.io/py/recordlinkage.svg)](https://pypi.python.org/pypi/recordlinkage/)
+[![Github Actions CI Status](https://github.com/J535D165/recordlinkage/workflows/tests/badge.svg?branch=master)](https://github.com/J535D165/recordlinkage/actions)
+[![Code Coverage](https://codecov.io/gh/J535D165/recordlinkage/branch/master/graph/badge.svg)](https://codecov.io/gh/J535D165/recordlinkage)
+[![Documentation Status](https://readthedocs.org/projects/recordlinkage/badge/?version=latest)](https://recordlinkage.readthedocs.io/en/latest/?badge=latest)
+[![Zenodo DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3559042.svg)](https://doi.org/10.5281/zenodo.3559042)
+
+**RecordLinkage** is a powerful and modular record linkage toolkit to
+link records in or between data sources. The toolkit provides most of
+the tools needed for record linkage and deduplication. The package
+contains indexing methods, functions to compare records and classifiers.
+The package is developed for research and the linking of small or medium
+sized files.
+
+This project is inspired by the [Freely Extensible Biomedical Record
+Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which
+is a great project. In contrast with FEBRL, the recordlinkage project
+uses [pandas](http://pandas.pydata.org/) and
+[numpy](http://www.numpy.org/) for data handling and computations. The
+use of *pandas*, a flexible and powerful data analysis and manipulation
+library for Python, makes the record linkage process much easier and
+faster. The extensive *pandas* library can be used to integrate your
+record linkage directly into existing data manipulation projects.
+
+One of the aims of this project is to make an easily extensible record
+linkage framework. It is easy to include your own indexing algorithms,
+comparison/similarity measures and classifiers.
+
+## Basic linking example
+
+Import the `recordlinkage` module with all important tools for record
+linkage and import the data manipulation framework **pandas**.
+
+``` python
+import recordlinkage
+import pandas
+```
+
+Load your data into pandas DataFrames.
+
+``` python
+df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
+df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
+```
+
+Comparing all records can be computationally intensive. Therefore, we
+make a set of candidate links with one of the built-in indexing techniques
+like **blocking**. In this example, only pairs of records that agree on
+the surname are returned.
+
+``` python
+indexer = recordlinkage.Index()
+indexer.block('surname')
+candidate_links = indexer.index(df_a, df_b)
+```
+
+For each candidate link, compare the records with one of the comparison
+or similarity algorithms in the Compare class.
+
+``` python
+c = recordlinkage.Compare()
+
+c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
+c.exact('sex', 'gender')
+c.date('dob', 'date_of_birth')
+c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
+c.exact('place', 'placename')
+c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)
+
+# The comparison vectors
+feature_vectors = c.compute(candidate_links, df_a, df_b)
+```
+
+Classify the candidate links into matching or distinct pairs based on
+their comparison result with one of the [classification
+algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html).
+The following code classifies candidate pairs with a Logistic Regression
+classifier. This (supervised machine learning) algorithm requires
+training data.
+
+``` python
+logrg = recordlinkage.LogisticRegressionClassifier()
+logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)
+
+logrg.predict(feature_vectors)
+```
+
+The following code shows the classification of candidate pairs with the
+Expectation-Conditional Maximisation (ECM) algorithm. This variant of
+the Expectation-Maximisation algorithm doesn't require training data
+(unsupervised machine learning).
+
+``` python
+ecm = recordlinkage.ECMClassifier()
+ecm.fit_predict(feature_vectors)
+```
+
+## Main Features
+
+The main features of this Python record linkage toolkit are:
+
+- Clean and standardise data with easy to use tools
+- Make pairs of records with smart indexing methods such as
+ **blocking** and **sorted neighbourhood indexing**
+- Compare records with a large number of comparison and similarity
+ measures for different types of variables such as strings, numbers
+ and dates.
+- Several classification algorithms, both supervised and
+ unsupervised.
+- Common record linkage evaluation tools
+- Several built-in datasets.
+
+## Documentation
+
+The most recent documentation and API reference can be found at
+[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/).
+The documentation provides some basic usage examples like
+[deduplication](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html)
+and
+[linking](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html)
+census data. More examples are coming soon. If you do have interesting
+examples to share, let us know.
+
+## Installation
+
+The Python Record linkage Toolkit requires Python 3.6 or higher. Install the
+package easily with pip
+
+``` sh
+pip install recordlinkage
+```
+
+Python 2.7 users can use version \<= 0.13, but it is advised to use
+Python \>= 3.5.
+
+The toolkit depends on popular packages like
+[Pandas](https://github.com/pydata/pandas),
+[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and,
+[Scikit-learn](http://scikit-learn.org/). A complete list of
+dependencies can be found in the [installation
+manual](https://recordlinkage.readthedocs.io/en/latest/installation.html)
+as well as recommended and optional dependencies.
+
+## License
+
+The license for this record linkage tool is BSD-3-Clause.
+
+## Citation
+
+Please cite this package when it is used in an academic context. Ensure
+that the DOI and version match the installed version. Citation styles
+can be found on the publisher's website
+[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042).
+
+``` text
+@software{de_bruin_j_2019_3559043,
+ author = {De Bruin, J},
+ title = {{Python Record Linkage Toolkit: A toolkit for
+ record linkage and duplicate detection in Python}},
+ month = dec,
+ year = 2019,
+ publisher = {Zenodo},
+ version = {v0.14},
+ doi = {10.5281/zenodo.3559043},
+ url = {https://doi.org/10.5281/zenodo.3559043}
+}
+```
+
+## Need help?
+
+Stuck on your record linkage code or problem? Any other questions? Don't
+hesitate to send me an email (<jonathandebruinos@gmail.com>).
+
+
+
+
+%package help
+Summary: Development documents and examples for recordlinkage
+Provides: python3-recordlinkage-doc
+%description help
+<div align="center">
+ <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br>
+</div>
+
+# RecordLinkage: powerful and modular Python record linkage toolkit
+
+[![Pypi Version](https://badge.fury.io/py/recordlinkage.svg)](https://pypi.python.org/pypi/recordlinkage/)
+[![Github Actions CI Status](https://github.com/J535D165/recordlinkage/workflows/tests/badge.svg?branch=master)](https://github.com/J535D165/recordlinkage/actions)
+[![Code Coverage](https://codecov.io/gh/J535D165/recordlinkage/branch/master/graph/badge.svg)](https://codecov.io/gh/J535D165/recordlinkage)
+[![Documentation Status](https://readthedocs.org/projects/recordlinkage/badge/?version=latest)](https://recordlinkage.readthedocs.io/en/latest/?badge=latest)
+[![Zenodo DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3559042.svg)](https://doi.org/10.5281/zenodo.3559042)
+
+**RecordLinkage** is a powerful and modular record linkage toolkit to
+link records in or between data sources. The toolkit provides most of
+the tools needed for record linkage and deduplication. The package
+contains indexing methods, functions to compare records and classifiers.
+The package is developed for research and the linking of small or medium
+sized files.
+
+This project is inspired by the [Freely Extensible Biomedical Record
+Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which
+is a great project. In contrast with FEBRL, the recordlinkage project
+uses [pandas](http://pandas.pydata.org/) and
+[numpy](http://www.numpy.org/) for data handling and computations. The
+use of *pandas*, a flexible and powerful data analysis and manipulation
+library for Python, makes the record linkage process much easier and
+faster. The extensive *pandas* library can be used to integrate your
+record linkage directly into existing data manipulation projects.
+
+One of the aims of this project is to make an easily extensible record
+linkage framework. It is easy to include your own indexing algorithms,
+comparison/similarity measures and classifiers.
+
+## Basic linking example
+
+Import the `recordlinkage` module with all important tools for record
+linkage and import the data manipulation framework **pandas**.
+
+``` python
+import recordlinkage
+import pandas
+```
+
+Load your data into pandas DataFrames.
+
+``` python
+df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
+df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
+```
+
+Comparing all records can be computationally intensive. Therefore, we
+make a set of candidate links with one of the built-in indexing techniques
+like **blocking**. In this example, only pairs of records that agree on
+the surname are returned.
+
+``` python
+indexer = recordlinkage.Index()
+indexer.block('surname')
+candidate_links = indexer.index(df_a, df_b)
+```
+
+For each candidate link, compare the records with one of the comparison
+or similarity algorithms in the Compare class.
+
+``` python
+c = recordlinkage.Compare()
+
+c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
+c.exact('sex', 'gender')
+c.date('dob', 'date_of_birth')
+c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
+c.exact('place', 'placename')
+c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)
+
+# The comparison vectors
+feature_vectors = c.compute(candidate_links, df_a, df_b)
+```
+
+Classify the candidate links into matching or distinct pairs based on
+their comparison result with one of the [classification
+algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html).
+The following code classifies candidate pairs with a Logistic Regression
+classifier. This (supervised machine learning) algorithm requires
+training data.
+
+``` python
+logrg = recordlinkage.LogisticRegressionClassifier()
+logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)
+
+logrg.predict(feature_vectors)
+```
+
+The following code shows the classification of candidate pairs with the
+Expectation-Conditional Maximisation (ECM) algorithm. This variant of
+the Expectation-Maximisation algorithm doesn't require training data
+(unsupervised machine learning).
+
+``` python
+ecm = recordlinkage.ECMClassifier()
+ecm.fit_predict(feature_vectors)
+```
+
+## Main Features
+
+The main features of this Python record linkage toolkit are:
+
+- Clean and standardise data with easy to use tools
+- Make pairs of records with smart indexing methods such as
+ **blocking** and **sorted neighbourhood indexing**
+- Compare records with a large number of comparison and similarity
+ measures for different types of variables such as strings, numbers
+ and dates.
+- Several classification algorithms, both supervised and
+ unsupervised.
+- Common record linkage evaluation tools
+- Several built-in datasets.
+
+## Documentation
+
+The most recent documentation and API reference can be found at
+[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/).
+The documentation provides some basic usage examples like
+[deduplication](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html)
+and
+[linking](http://recordlinkage.readthedocs.io/en/latest/notebooks/link_two_dataframes.html)
+census data. More examples are coming soon. If you do have interesting
+examples to share, let us know.
+
+## Installation
+
+The Python Record linkage Toolkit requires Python 3.6 or higher. Install the
+package easily with pip
+
+``` sh
+pip install recordlinkage
+```
+
+Python 2.7 users can use version \<= 0.13, but it is advised to use
+Python \>= 3.5.
+
+The toolkit depends on popular packages like
+[Pandas](https://github.com/pydata/pandas),
+[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and,
+[Scikit-learn](http://scikit-learn.org/). A complete list of
+dependencies can be found in the [installation
+manual](https://recordlinkage.readthedocs.io/en/latest/installation.html)
+as well as recommended and optional dependencies.
+
+## License
+
+The license for this record linkage tool is BSD-3-Clause.
+
+## Citation
+
+Please cite this package when it is used in an academic context. Ensure
+that the DOI and version match the installed version. Citation styles
+can be found on the publisher's website
+[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042).
+
+``` text
+@software{de_bruin_j_2019_3559043,
+ author = {De Bruin, J},
+ title = {{Python Record Linkage Toolkit: A toolkit for
+ record linkage and duplicate detection in Python}},
+ month = dec,
+ year = 2019,
+ publisher = {Zenodo},
+ version = {v0.14},
+ doi = {10.5281/zenodo.3559043},
+ url = {https://doi.org/10.5281/zenodo.3559043}
+}
+```
+
+## Need help?
+
+Stuck on your record linkage code or problem? Any other questions? Don't
+hesitate to send me an email (<jonathandebruinos@gmail.com>).
+
+
+
+
+%prep
+%autosetup -n recordlinkage-0.15
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-recordlinkage -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.15-1
+- Package Spec generated
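
The blocking step described in the packaged README (only record pairs that agree on `surname` become candidate links) can be illustrated without the library itself. What follows is a minimal, stdlib-only sketch of the idea. The `block_pairs` helper and the toy records are hypothetical and are not part of the recordlinkage API.

```python
from collections import defaultdict

# Hypothetical toy datasets: record id -> record fields.
records_a = {
    "a1": {"surname": "smith", "first_name": "anna"},
    "a2": {"surname": "jones", "first_name": "bob"},
    "a3": {"surname": "brown", "first_name": "carl"},
}
records_b = {
    "b1": {"surname": "smith", "first_name": "ann"},
    "b2": {"surname": "smith", "first_name": "john"},
    "b3": {"surname": "brown", "first_name": "karl"},
}


def block_pairs(left, right, key):
    """Return (left_id, right_id) pairs whose `key` values are exactly equal."""
    # Group the right-hand records by the blocking key once.
    buckets = defaultdict(list)
    for right_id, record in right.items():
        buckets[record[key]].append(right_id)
    # Pair each left-hand record only with records from its own bucket.
    return [
        (left_id, right_id)
        for left_id, record in left.items()
        for right_id in buckets.get(record[key], [])
    ]


candidate_links = block_pairs(records_a, records_b, "surname")
full_comparison = len(records_a) * len(records_b)
print(f"{len(candidate_links)} candidate pairs instead of {full_comparison}")
```

Grouping one side by the key first avoids looping over the full cross product, which is why blocking reduces the cost of the later comparison step.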
diff --git a/sources b/sources
new file mode 100644
index 0000000..0904370
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+6bea35f04181ee758717b5ea8328e604 recordlinkage-0.15.tar.gz
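
The `sources` entry above pairs a 32-character hexadecimal digest with the tarball name. Assuming that digest is an MD5 checksum (the length suggests it, but the diff does not say so), a downloaded tarball could be checked along these lines. The helper names are hypothetical, and the demonstration uses a throwaway file because the real tarball is not assumed to be present.

```python
import hashlib
import tempfile
from pathlib import Path


def parse_sources_line(line):
    """Split a '<digest> <filename>' line into its two parts."""
    digest, name = line.split(maxsplit=1)
    return digest.lower(), name.strip()


def md5_matches(path, expected_hex, chunk_size=1 << 20):
    """Hash the file in chunks and compare against the expected hex digest."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_hex.lower()


expected, filename = parse_sources_line(
    "6bea35f04181ee758717b5ea8328e604 recordlinkage-0.15.tar.gz"
)

# Demonstrate on a temporary file rather than the real tarball.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    sample_digest = hashlib.md5(sample.read_bytes()).hexdigest()
    ok = md5_matches(sample, sample_digest)  # digest of the same bytes
    bad = md5_matches(sample, expected)      # digest of a different file
```

To check the actual download, one would call `md5_matches(filename, expected)` from the directory containing the tarball.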