Diffstat (limited to 'python-pandas-dedupe.spec')
-rw-r--r-- python-pandas-dedupe.spec 594
1 files changed, 594 insertions, 0 deletions
diff --git a/python-pandas-dedupe.spec b/python-pandas-dedupe.spec
new file mode 100644
index 0000000..058e756
--- /dev/null
+++ b/python-pandas-dedupe.spec
@@ -0,0 +1,594 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pandas-dedupe
+Version: 1.5.0
+Release: 1
+Summary: The Dedupe library made easy with Pandas.
+License: MIT
+URL: https://github.com/Lyonk71/pandas-dedupe
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/45/1f/f24ba1dbb5ff59f07dc8829c9d80c3ff9d1d4367f21d7d482243a92f3f4e/pandas_dedupe-1.5.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-dedupe
+Requires: python3-unidecode
+Requires: python3-pandas
+
+%description
+# pandas-dedupe
+
+The Dedupe library made easy with Pandas.
+
+# Installation
+
+```
+pip install pandas-dedupe
+```
+
+# Video Tutorials
+
+[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)
+
+# Basic Usage
+
+A training file and a settings file will be created while running Dedupe.
+Keeping these files will eliminate the need to retrain your model in the future.
+
+If you would like to retrain your model from scratch, just delete the settings and training files.
+
+### Deduplication (dedupe_dataframe)
+`dedupe_dataframe` deduplicates a single dataset in which multiple records may refer to the same entity.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
+
+#send output to csv
+df_final.to_csv('deduplication_output.csv')
+```
+
+### Gazetteer deduplication (gazetteer_dataframe)
+`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette)
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df_clean = pd.read_csv('gazette.csv')
+df_messy = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)
+
+#send output to csv
+df_final.to_csv('gazetteer_deduplication_output.csv')
+```
+
+
+### Matching / Record Linkage
+
+Use identical field names when linking dataframes.
+Record linkage should only be used on dataframes that have been deduplicated.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframes
+dfa = pd.read_csv('file_a.csv')
+dfb = pd.read_csv('file_b.csv')
+
+#initiate matching
+df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
+
+#send output to csv
+df_final.to_csv('linkage_output.csv')
+```
+
+# Advanced Usage
+
+### Canonicalize Fields
+
+The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.
+
+```python
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
+```
+
+### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)
+
+Group records into clusters only if the cophenetic similarity of the cluster is greater than
+the threshold.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
+```
+
+### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)
+
+Setting `update_model=True` lets you update the existing model.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
+```
+
+### Recall Weight & Sample Size
+
+The `dedupe_dataframe()` function has two optional parameters specifying `recall_weight` and `sample_size`:
+
+- **recall_weight** - Ranges from 0 to 2. A value of 2 means recall matters twice as much
+ as precision.
+- **sample_size** - Specifies the sample size used for training as a float from 0 to 1.
+ By default it is 30% (0.3) of our data.
+
+### Specifying Types
+
+If you'd like to specify dates, spatial data, etc., do so here. The structure must be like so:
+`('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted.
+The default type is `String`.
+
+See the full list of types [below](#Types).
+
+```python
+# Price Example
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
+
+# has missing Example
+pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
+
+# crf Example
+pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
+```
+
+# Types
+
+Dedupe supports a variety of datatypes; a full list with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).
+
+pandas-dedupe officially supports the following datatypes:
+
+- **String** - Standard string comparison using string distance metric. This is the default type.
+- **Text** - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
+- **Price** - For comparing positive, nonzero numerical values.
+- **DateTime** - For comparing dates.
+- **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance
+ metric, even though the points are in a geographically similar location. The LatLong type resolves
+ this by calculating the haversine distance between compared coordinates. LatLong requires
+ the field to be in the format (Lat, Long). The value can be a string, a tuple containing two
+ strings, a tuple containing two floats, or a tuple containing two integers. If the format
+ is not able to be processed, you will get a traceback.
+- **Exact** - Tests whether fields are an exact match.
+- **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.
+ The Exists type tests whether both, one, or neither of the fields are null.
+
+Additional supported parameters are:
+
+- **has missing** - Can be used if one of your data fields contains null values.
+- **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more
+ accurate in some cases, but runs much slower. Works with String and ShortString types.
+
+# Contributors
+
+[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter
+
+[Tawni Marrs](https://github.com/tawnimarrs) - Refactored code, added docstrings
+
+[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`.
+
+[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability.
+
+# Credits
+
+Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
+
+
+
+
+%package -n python3-pandas-dedupe
+Summary: The Dedupe library made easy with Pandas.
+Provides: python-pandas-dedupe
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pandas-dedupe
+# pandas-dedupe
+
+The Dedupe library made easy with Pandas.
+
+# Installation
+
+```
+pip install pandas-dedupe
+```
+
+# Video Tutorials
+
+[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)
+
+# Basic Usage
+
+A training file and a settings file will be created while running Dedupe.
+Keeping these files will eliminate the need to retrain your model in the future.
+
+If you would like to retrain your model from scratch, just delete the settings and training files.
+
+### Deduplication (dedupe_dataframe)
+`dedupe_dataframe` deduplicates a single dataset in which multiple records may refer to the same entity.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
+
+#send output to csv
+df_final.to_csv('deduplication_output.csv')
+```
+
+### Gazetteer deduplication (gazetteer_dataframe)
+`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette)
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df_clean = pd.read_csv('gazette.csv')
+df_messy = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)
+
+#send output to csv
+df_final.to_csv('gazetteer_deduplication_output.csv')
+```
+
+
+### Matching / Record Linkage
+
+Use identical field names when linking dataframes.
+Record linkage should only be used on dataframes that have been deduplicated.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframes
+dfa = pd.read_csv('file_a.csv')
+dfb = pd.read_csv('file_b.csv')
+
+#initiate matching
+df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
+
+#send output to csv
+df_final.to_csv('linkage_output.csv')
+```
+
+# Advanced Usage
+
+### Canonicalize Fields
+
+The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.
+
+```python
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
+```
+
+### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)
+
+Group records into clusters only if the cophenetic similarity of the cluster is greater than
+the threshold.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
+```
+
+### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)
+
+Setting `update_model=True` lets you update the existing model.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
+```
+
+### Recall Weight & Sample Size
+
+The `dedupe_dataframe()` function has two optional parameters specifying `recall_weight` and `sample_size`:
+
+- **recall_weight** - Ranges from 0 to 2. A value of 2 means recall matters twice as much
+ as precision.
+- **sample_size** - Specifies the sample size used for training as a float from 0 to 1.
+ By default it is 30% (0.3) of our data.
+
+### Specifying Types
+
+If you'd like to specify dates, spatial data, etc., do so here. The structure must be like so:
+`('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted.
+The default type is `String`.
+
+See the full list of types [below](#Types).
+
+```python
+# Price Example
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
+
+# has missing Example
+pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
+
+# crf Example
+pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
+```
+
+# Types
+
+Dedupe supports a variety of datatypes; a full list with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).
+
+pandas-dedupe officially supports the following datatypes:
+
+- **String** - Standard string comparison using string distance metric. This is the default type.
+- **Text** - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
+- **Price** - For comparing positive, nonzero numerical values.
+- **DateTime** - For comparing dates.
+- **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance
+ metric, even though the points are in a geographically similar location. The LatLong type resolves
+ this by calculating the haversine distance between compared coordinates. LatLong requires
+ the field to be in the format (Lat, Long). The value can be a string, a tuple containing two
+ strings, a tuple containing two floats, or a tuple containing two integers. If the format
+ is not able to be processed, you will get a traceback.
+- **Exact** - Tests whether fields are an exact match.
+- **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.
+ The Exists type tests whether both, one, or neither of the fields are null.
+
+Additional supported parameters are:
+
+- **has missing** - Can be used if one of your data fields contains null values.
+- **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more
+ accurate in some cases, but runs much slower. Works with String and ShortString types.
+
+# Contributors
+
+[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter
+
+[Tawni Marrs](https://github.com/tawnimarrs) - Refactored code, added docstrings
+
+[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`.
+
+[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability.
+
+# Credits
+
+Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
+
+
+
+
+%package help
+Summary: Development documents and examples for pandas-dedupe
+Provides: python3-pandas-dedupe-doc
+%description help
+# pandas-dedupe
+
+The Dedupe library made easy with Pandas.
+
+# Installation
+
+```
+pip install pandas-dedupe
+```
+
+# Video Tutorials
+
+[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)
+
+# Basic Usage
+
+A training file and a settings file will be created while running Dedupe.
+Keeping these files will eliminate the need to retrain your model in the future.
+
+If you would like to retrain your model from scratch, just delete the settings and training files.
+
+### Deduplication (dedupe_dataframe)
+`dedupe_dataframe` deduplicates a single dataset in which multiple records may refer to the same entity.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
+
+#send output to csv
+df_final.to_csv('deduplication_output.csv')
+```
+
+### Gazetteer deduplication (gazetteer_dataframe)
+`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette)
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df_clean = pd.read_csv('gazette.csv')
+df_messy = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)
+
+#send output to csv
+df_final.to_csv('gazetteer_deduplication_output.csv')
+```
+
+
+### Matching / Record Linkage
+
+Use identical field names when linking dataframes.
+Record linkage should only be used on dataframes that have been deduplicated.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframes
+dfa = pd.read_csv('file_a.csv')
+dfb = pd.read_csv('file_b.csv')
+
+#initiate matching
+df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
+
+#send output to csv
+df_final.to_csv('linkage_output.csv')
+```
+
+# Advanced Usage
+
+### Canonicalize Fields
+
+The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.
+
+```python
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
+```
+
+### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)
+
+Group records into clusters only if the cophenetic similarity of the cluster is greater than
+the threshold.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
+```
+
+### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)
+
+Setting `update_model=True` lets you update the existing model.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
+```
+
+### Recall Weight & Sample Size
+
+The `dedupe_dataframe()` function has two optional parameters specifying `recall_weight` and `sample_size`:
+
+- **recall_weight** - Ranges from 0 to 2. A value of 2 means recall matters twice as much
+ as precision.
+- **sample_size** - Specifies the sample size used for training as a float from 0 to 1.
+ By default it is 30% (0.3) of our data.
+
+### Specifying Types
+
+If you'd like to specify dates, spatial data, etc., do so here. The structure must be like so:
+`('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted.
+The default type is `String`.
+
+See the full list of types [below](#Types).
+
+```python
+# Price Example
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
+
+# has missing Example
+pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
+
+# crf Example
+pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
+```
+
+# Types
+
+Dedupe supports a variety of datatypes; a full list with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).
+
+pandas-dedupe officially supports the following datatypes:
+
+- **String** - Standard string comparison using string distance metric. This is the default type.
+- **Text** - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
+- **Price** - For comparing positive, nonzero numerical values.
+- **DateTime** - For comparing dates.
+- **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance
+ metric, even though the points are in a geographically similar location. The LatLong type resolves
+ this by calculating the haversine distance between compared coordinates. LatLong requires
+ the field to be in the format (Lat, Long). The value can be a string, a tuple containing two
+ strings, a tuple containing two floats, or a tuple containing two integers. If the format
+ is not able to be processed, you will get a traceback.
+- **Exact** - Tests whether fields are an exact match.
+- **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.
+ The Exists type tests whether both, one, or neither of the fields are null.
+
+Additional supported parameters are:
+
+- **has missing** - Can be used if one of your data fields contains null values.
+- **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more
+ accurate in some cases, but runs much slower. Works with String and ShortString types.
+
+# Contributors
+
+[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter
+
+[Tawni Marrs](https://github.com/tawnimarrs) - Refactored code, added docstrings
+
+[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`.
+
+[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability.
+
+# Credits
+
+Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
+
+
+
+
+%prep
+%autosetup -n pandas-dedupe-1.5.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pandas-dedupe -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 1.5.0-1
+- Package Spec generated
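
The `LatLong` entry in the package description says coordinates are compared by haversine distance rather than string distance. The following is a minimal standard-library sketch of that formula, applied to the two points quoted in the description. The function name `haversine_km` and the Earth-radius constant are assumptions made for illustration; this is not the code that Dedupe or pandas-dedupe actually runs.

```python
import math

# Mean Earth radius in kilometres; an assumed constant for this sketch.
EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points.
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# The two coordinates from the LatLong description.
p1 = (39.990334, 70.012)
p2 = (40.01, 69.98)
distance = haversine_km(p1, p2)
print(f"{distance:.2f} km")
```

As strings the two coordinates share few characters, yet the computed distance is only a few kilometres, which is the situation the `LatLong` type is described as handling.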