| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-pandas-dedupe.spec | 594 |
| -rw-r--r-- | sources | 1 |
3 files changed, 596 insertions, 0 deletions
@@ -0,0 +1 @@
+/pandas_dedupe-1.5.0.tar.gz
diff --git a/python-pandas-dedupe.spec b/python-pandas-dedupe.spec
new file mode 100644
index 0000000..058e756
--- /dev/null
+++ b/python-pandas-dedupe.spec
@@ -0,0 +1,594 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pandas-dedupe
+Version: 1.5.0
+Release: 1
+Summary: The Dedupe library made easy with Pandas.
+License: MIT
+URL: https://github.com/Lyonk71/pandas-dedupe
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/45/1f/f24ba1dbb5ff59f07dc8829c9d80c3ff9d1d4367f21d7d482243a92f3f4e/pandas_dedupe-1.5.0.tar.gz
+BuildArch: noarch
+
+Requires: python3-dedupe
+Requires: python3-unidecode
+Requires: python3-pandas
+
+%description
+# pandas-dedupe
+
+The Dedupe library made easy with Pandas.
+
+# Installation
+
+```
+pip install pandas-dedupe
+```
+
+# Video Tutorials
+
+[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)
+
+# Basic Usage
+
+A training file and a settings file will be created while running Dedupe.
+Keeping these files will eliminate the need to retrain your model in the future.
+
+If you would like to retrain your model from scratch, just delete the settings and training files.
+
+### Deduplication (dedupe_dataframe)
+`dedupe_dataframe` is for deduplication when you have data that can contain multiple records that can all refer to the same entity
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
+
+#send output to csv
+df_final.to_csv('deduplication_output.csv')
+```
+
+### Gazetteer deduplication (gazetteer_dataframe)
+`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette)
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df_clean = pd.read_csv('gazette.csv')
+df_messy = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)
+
+#send output to csv
+df_final.to_csv('gazetteer_deduplication_output.csv')
+```
+
+
+### Matching / Record Linkage
+
+Use identical field names when linking dataframes.
+Record linkage should only be used on dataframes that have been deduplicated.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframes
+dfa = pd.read_csv('file_a.csv')
+dfb = pd.read_csv('file_b.csv')
+
+#initiate matching
+df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
+
+#send output to csv
+df_final.to_csv('linkage_output.csv')
+```
+
+# Advanced Usage
+
+### Canonicalize Fields
+
+The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.
+
+```python
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
+```
+
+### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)
+
+Group records into clusters only if the cophenetic similarity of the cluster is greater than
+the threshold.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
+```
+
+### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)
+
+If `True`, it allows a user to update the existing model.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
+```
+
+### Recall Weight & Sample Size
+
+The `dedupe_dataframe()` function has two optional parameters specifying `recall_weight` and `sample_size`:
+
+- **recall_weight** - Ranges from 0 to 2. When set to 2, we are saying we care twice as much
+  about recall than we do about precision.
+- **sample_size** - Specifies the sample size used for training as a float from 0 to 1.
+  By default it is 30% (0.3) of our data.
+
+### Specifying Types
+
+If you'd like to specify dates, spatial data, etc, do so here. The structure must be like so:
+`('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted.
+The default type is `String`.
+
+See the full list of types [below](#Types).
+
+```python
+# Price Example
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
+
+# has missing Example
+pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
+
+# crf Example
+pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
+```
+
+# Types
+
+Dedupe supports a variety of datatypes; a full list with documentation can be found [here.](https://docs.dedupe.io/en/latest/Variable-definition.html#)
+
+pandas-dedupe officially supports the following datatypes:
+
+- **String** - Standard string comparison using string distance metric. This is the default type.
+- **Text** - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
+- **Price** - For comparing positive, non zero numerical values.
+- **DateTime** - For comparing dates.
+- **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance
+  metric, even though the points are in a geographically similar location. The LatLong type resolves
+  this by calculating the haversine distance between compared coordinates. LatLong requires
+  the field to be in the format (Lat, Long). The value can be a string, a tuple containing two
+  strings, a tuple containing two floats, or a tuple containing two integers. If the format
+  is not able to be processed, you will get a traceback.
+- **Exact** - Tests whether fields are an exact match.
+- **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.
+  The Exists type tests for whether both, one, or neither of fields are null.
+
+Additional supported parameters are:
+
+- **has missing** - Can be used if one of your data fields contains null values
+- **crf** - Use conditional random fields for comparisons rather than distance metric. May be more
+  accurate in some cases, but runs much slower. Works with String and ShortString types.
+
+# Contributors
+
+[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter
+
+[Tawni Marrs](https://github.com/tawnimarrs) - refactored code, added docstrings
+
+[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`.
+
+[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability.
+
+# Credits
+
+Many thanks to folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
+
+
+
+
+%package -n python3-pandas-dedupe
+Summary: The Dedupe library made easy with Pandas.
+Provides: python-pandas-dedupe
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-pandas-dedupe
+# pandas-dedupe
+
+The Dedupe library made easy with Pandas.
+
+# Installation
+
+```
+pip install pandas-dedupe
+```
+
+# Video Tutorials
+
+[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)
+
+# Basic Usage
+
+A training file and a settings file will be created while running Dedupe.
+Keeping these files will eliminate the need to retrain your model in the future.
+
+If you would like to retrain your model from scratch, just delete the settings and training files.
+
+### Deduplication (dedupe_dataframe)
+`dedupe_dataframe` is for deduplication when you have data that can contain multiple records that can all refer to the same entity
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
+
+#send output to csv
+df_final.to_csv('deduplication_output.csv')
+```
+
+### Gazetteer deduplication (gazetteer_dataframe)
+`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette)
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df_clean = pd.read_csv('gazette.csv')
+df_messy = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)
+
+#send output to csv
+df_final.to_csv('gazetteer_deduplication_output.csv')
+```
+
+
+### Matching / Record Linkage
+
+Use identical field names when linking dataframes.
+Record linkage should only be used on dataframes that have been deduplicated.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframes
+dfa = pd.read_csv('file_a.csv')
+dfb = pd.read_csv('file_b.csv')
+
+#initiate matching
+df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
+
+#send output to csv
+df_final.to_csv('linkage_output.csv')
+```
+
+# Advanced Usage
+
+### Canonicalize Fields
+
+The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.
+
+```python
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
+```
+
+### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)
+
+Group records into clusters only if the cophenetic similarity of the cluster is greater than
+the threshold.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
+```
+
+### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)
+
+If `True`, it allows a user to update the existing model.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
+```
+
+### Recall Weight & Sample Size
+
+The `dedupe_dataframe()` function has two optional parameters specifying `recall_weight` and `sample_size`:
+
+- **recall_weight** - Ranges from 0 to 2. When set to 2, we are saying we care twice as much
+  about recall than we do about precision.
+- **sample_size** - Specifies the sample size used for training as a float from 0 to 1.
+  By default it is 30% (0.3) of our data.
+
+### Specifying Types
+
+If you'd like to specify dates, spatial data, etc, do so here. The structure must be like so:
+`('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted.
+The default type is `String`.
+
+See the full list of types [below](#Types).
+
+```python
+# Price Example
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
+
+# has missing Example
+pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
+
+# crf Example
+pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
+```
+
+# Types
+
+Dedupe supports a variety of datatypes; a full list with documentation can be found [here.](https://docs.dedupe.io/en/latest/Variable-definition.html#)
+
+pandas-dedupe officially supports the following datatypes:
+
+- **String** - Standard string comparison using string distance metric. This is the default type.
+- **Text** - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
+- **Price** - For comparing positive, non zero numerical values.
+- **DateTime** - For comparing dates.
+- **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance
+  metric, even though the points are in a geographically similar location. The LatLong type resolves
+  this by calculating the haversine distance between compared coordinates. LatLong requires
+  the field to be in the format (Lat, Long). The value can be a string, a tuple containing two
+  strings, a tuple containing two floats, or a tuple containing two integers. If the format
+  is not able to be processed, you will get a traceback.
+- **Exact** - Tests whether fields are an exact match.
+- **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.
+  The Exists type tests for whether both, one, or neither of fields are null.
+
+Additional supported parameters are:
+
+- **has missing** - Can be used if one of your data fields contains null values
+- **crf** - Use conditional random fields for comparisons rather than distance metric. May be more
+  accurate in some cases, but runs much slower. Works with String and ShortString types.
+
+# Contributors
+
+[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter
+
+[Tawni Marrs](https://github.com/tawnimarrs) - refactored code, added docstrings
+
+[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`.
+
+[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability.
+
+# Credits
+
+Many thanks to folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
+
+
+
+
+%package help
+Summary: Development documents and examples for pandas-dedupe
+Provides: python3-pandas-dedupe-doc
+%description help
+# pandas-dedupe
+
+The Dedupe library made easy with Pandas.
+
+# Installation
+
+```
+pip install pandas-dedupe
+```
+
+# Video Tutorials
+
+[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)
+
+# Basic Usage
+
+A training file and a settings file will be created while running Dedupe.
+Keeping these files will eliminate the need to retrain your model in the future.
+
+If you would like to retrain your model from scratch, just delete the settings and training files.
+
+### Deduplication (dedupe_dataframe)
+`dedupe_dataframe` is for deduplication when you have data that can contain multiple records that can all refer to the same entity
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
+
+#send output to csv
+df_final.to_csv('deduplication_output.csv')
+```
+
+### Gazetteer deduplication (gazetteer_dataframe)
+`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette)
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframe
+df_clean = pd.read_csv('gazette.csv')
+df_messy = pd.read_csv('test_names.csv')
+
+#initiate deduplication
+df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)
+
+#send output to csv
+df_final.to_csv('gazetteer_deduplication_output.csv')
+```
+
+
+### Matching / Record Linkage
+
+Use identical field names when linking dataframes.
+Record linkage should only be used on dataframes that have been deduplicated.
+
+```python
+import pandas as pd
+import pandas_dedupe
+
+#load dataframes
+dfa = pd.read_csv('file_a.csv')
+dfb = pd.read_csv('file_b.csv')
+
+#initiate matching
+df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
+
+#send output to csv
+df_final.to_csv('linkage_output.csv')
+```
+
+# Advanced Usage
+
+### Canonicalize Fields
+
+The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.
+
+```python
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
+```
+
+### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)
+
+Group records into clusters only if the cophenetic similarity of the cluster is greater than
+the threshold.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
+```
+
+### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)
+
+If `True`, it allows a user to update the existing model.
+
+```python
+pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
+```
+
+### Recall Weight & Sample Size
+
+The `dedupe_dataframe()` function has two optional parameters specifying `recall_weight` and `sample_size`:
+
+- **recall_weight** - Ranges from 0 to 2. When set to 2, we are saying we care twice as much
+  about recall than we do about precision.
+- **sample_size** - Specifies the sample size used for training as a float from 0 to 1.
+  By default it is 30% (0.3) of our data.
+
+### Specifying Types
+
+If you'd like to specify dates, spatial data, etc, do so here. The structure must be like so:
+`('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted.
+The default type is `String`.
+
+See the full list of types [below](#Types).
+
+```python
+# Price Example
+pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
+
+# has missing Example
+pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
+
+# crf Example
+pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
+```
+
+# Types
+
+Dedupe supports a variety of datatypes; a full list with documentation can be found [here.](https://docs.dedupe.io/en/latest/Variable-definition.html#)
+
+pandas-dedupe officially supports the following datatypes:
+
+- **String** - Standard string comparison using string distance metric. This is the default type.
+- **Text** - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
+- **Price** - For comparing positive, non zero numerical values.
+- **DateTime** - For comparing dates.
+- **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance
+  metric, even though the points are in a geographically similar location. The LatLong type resolves
+  this by calculating the haversine distance between compared coordinates. LatLong requires
+  the field to be in the format (Lat, Long). The value can be a string, a tuple containing two
+  strings, a tuple containing two floats, or a tuple containing two integers. If the format
+  is not able to be processed, you will get a traceback.
+- **Exact** - Tests whether fields are an exact match.
+- **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.
+  The Exists type tests for whether both, one, or neither of fields are null.
+
+Additional supported parameters are:
+
+- **has missing** - Can be used if one of your data fields contains null values
+- **crf** - Use conditional random fields for comparisons rather than distance metric. May be more
+  accurate in some cases, but runs much slower. Works with String and ShortString types.
+
+# Contributors
+
+[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter
+
+[Tawni Marrs](https://github.com/tawnimarrs) - refactored code, added docstrings
+
+[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`.
+
+[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability.
+
+# Credits
+
+Many thanks to folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
+
+
+
+
+%prep
+%autosetup -n pandas-dedupe-1.5.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pandas-dedupe -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 1.5.0-1
+- Package Spec generated
@@ -0,0 +1 @@
+c4851fa65ec0cffd358726fd64a2e40a pandas_dedupe-1.5.0.tar.gz
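Reviewer note: the `sources` entry above pairs a 32-hex-digit digest with the tarball name. The sketch below shows one way to check a downloaded tarball against such an entry. It assumes the digest is MD5, which is inferred only from its length and is not stated anywhere in this diff. The helper names `parse_sources_line`, `file_md5`, and `verify` are hypothetical and not part of any packaging tool.

```python
import hashlib
from pathlib import Path


def parse_sources_line(line):
    """Split a '<digest> <filename>' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()  # assumption: the sources digest is MD5
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify(sources_line, directory):
    """True if the file named in sources_line, looked up in directory, matches its digest."""
    digest, filename = parse_sources_line(sources_line)
    return file_md5(Path(directory) / filename) == digest
```

With the tarball in the current directory, `verify("c4851fa65ec0cffd358726fd64a2e40a pandas_dedupe-1.5.0.tar.gz", ".")` would be expected to return `True` if the download is intact and the MD5 assumption holds.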
