%global _empty_manifest_terminate_build 0
Name:           python-pandas-dedupe
Version:        1.5.0
Release:        1
Summary:        The Dedupe library made easy with Pandas.
License:        MIT
URL:            https://github.com/Lyonk71/pandas-dedupe
Source0:        https://mirrors.nju.edu.cn/pypi/web/packages/45/1f/f24ba1dbb5ff59f07dc8829c9d80c3ff9d1d4367f21d7d482243a92f3f4e/pandas_dedupe-1.5.0.tar.gz
BuildArch:      noarch

Requires:       python3-dedupe
Requires:       python3-unidecode
Requires:       python3-pandas

%description
# pandas-dedupe

The Dedupe library made easy with Pandas.

# Installation

```
pip install pandas-dedupe
```

# Video Tutorials

[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)

# Basic Usage

A training file and a settings file will be created while running Dedupe.
Keeping these files will eliminate the need to retrain your model in the future.
If you would like to retrain your model from scratch, just delete the settings and training files.

### Deduplication (dedupe_dataframe)

`dedupe_dataframe` is for deduplication when your data can contain multiple records that all refer to the same entity.

```python
import pandas as pd
import pandas_dedupe

# load dataframe
df = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])

# send output to csv
df_final.to_csv('deduplication_output.csv')
```

### Gazetteer deduplication (gazetteer_dataframe)

`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette).

```python
import pandas as pd
import pandas_dedupe

# load dataframes
df_clean = pd.read_csv('gazette.csv')
df_messy = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)

# send output to csv
df_final.to_csv('gazetteer_deduplication_output.csv')
```

### Matching / Record Linkage

Use identical field names when linking dataframes.
Record linkage should only be used on dataframes that have already been deduplicated.

```python
import pandas as pd
import pandas_dedupe

# load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')

# initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

# send output to csv
df_final.to_csv('linkage_output.csv')
```

# Advanced Usage

### Canonicalize Fields

The `canonicalize` parameter will standardize names in a given cluster. The original fields are also kept.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'payment_type'], canonicalize=True)
```

### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)

Group records into clusters only if the cophenetic similarity of the cluster is greater than the threshold.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
```

### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)

If `True`, this allows the user to update the existing model.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
```

### Recall Weight & Sample Size

The `dedupe_dataframe()` function has two optional parameters, `recall_weight` and `sample_size` (see the sketch after this list):

- **recall_weight** - Ranges from 0 to 2. When set to 2, we are saying we care twice as much about recall as we do about precision.
- **sample_size** - Specifies the sample size used for training as a float from 0 to 1. By default it is 30% (0.3) of our data.
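A minimal sketch of how these two parameters might be passed, assuming keyword arguments; the values shown are illustrative, not recommendations:

```python
# Sketch: favour recall twice as much as precision and train on 10% of the
# rows instead of the default 30%. Values here are illustrative only.
df_final = pandas_dedupe.dedupe_dataframe(
    df,
    ['first_name', 'last_name'],
    recall_weight=2,
    sample_size=0.1,
)
```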
### Specifying Types

If you'd like to specify dates, spatial data, etc., do so here. The structure must be like so: `('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted. The default type is `String`. See the full list of types [below](#Types).

```python
# Price example
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', ('salary', 'Price')])

# has missing example
pandas_dedupe.link_dataframes(df, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])

# crf example
pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
```

# Types

Dedupe supports a variety of datatypes; a full list with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).

pandas-dedupe officially supports the following datatypes; a short field-definition sketch using several of them follows the parameter list below:

- **String** - Standard string comparison using a string distance metric. This is the default type.
- **Text** - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.
- **Price** - For comparing positive, non-zero numerical values.
- **DateTime** - For comparing dates.
- **LatLong** - (39.990334, 70.012) will not match (40.01, 69.98) using a string distance metric, even though the points are in a geographically similar location. The LatLong type resolves this by calculating the haversine distance between compared coordinates. LatLong requires the field to be in the format (Lat, Long). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format cannot be processed, you will get a traceback.
- **Exact** - Tests whether fields are an exact match.
- **Exists** - Sometimes the presence or absence of data can be useful in predicting a match. The Exists type tests whether both, one, or neither of the fields are null.

Additional supported parameters are:

- **has missing** - Can be used if one of your data fields contains null values.
- **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more accurate in some cases, but runs much slower. Works with the String and ShortString types.
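A rough illustration of mixing several of these types in one field definition; the column names `location`, `dob`, and `email` are hypothetical, and the tuples follow the `('field', 'type', 'additional_parameter')` structure described above:

```python
# Sketch: declaring geographic, date, and presence-only fields alongside a
# default String field. Column names are made up for this example.
pandas_dedupe.dedupe_dataframe(
    df,
    [
        ('location', 'LatLong'),             # (Lat, Long) tuples
        ('dob', 'DateTime', 'has missing'),  # dates, some of them null
        ('email', 'Exists'),                 # compare presence/absence only
        'last_name',                         # plain names default to String
    ],
)
```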
# Contributors

[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter

[Tawni Marrs](https://github.com/tawnimarrs) - Refactored code, added docstrings

[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`

[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability

# Credits

Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).

%package -n python3-pandas-dedupe
Summary:        The Dedupe library made easy with Pandas.
Provides:       python-pandas-dedupe
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip
%description -n python3-pandas-dedupe
# pandas-dedupe

The Dedupe library made easy with Pandas.

# Installation

```
pip install pandas-dedupe
```

# Video Tutorials

[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)

# Basic Usage

A training file and a settings file will be created while running Dedupe.
Keeping these files will eliminate the need to retrain your model in the future.
If you would like to retrain your model from scratch, just delete the settings and training files.

### Deduplication (dedupe_dataframe)

`dedupe_dataframe` is for deduplication when your data can contain multiple records that all refer to the same entity.

```python
import pandas as pd
import pandas_dedupe

# load dataframe
df = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])

# send output to csv
df_final.to_csv('deduplication_output.csv')
```

### Gazetteer deduplication (gazetteer_dataframe)

`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette).

```python
import pandas as pd
import pandas_dedupe

# load dataframes
df_clean = pd.read_csv('gazette.csv')
df_messy = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)

# send output to csv
df_final.to_csv('gazetteer_deduplication_output.csv')
```

### Matching / Record Linkage

Use identical field names when linking dataframes.
Record linkage should only be used on dataframes that have already been deduplicated.

```python
import pandas as pd
import pandas_dedupe

# load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')

# initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

# send output to csv
df_final.to_csv('linkage_output.csv')
```

# Advanced Usage

### Canonicalize Fields

The `canonicalize` parameter will standardize names in a given cluster. The original fields are also kept.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'payment_type'], canonicalize=True)
```

### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)

Group records into clusters only if the cophenetic similarity of the cluster is greater than the threshold.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
```

### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)

If `True`, this allows the user to update the existing model.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
```

### Recall Weight & Sample Size

The `dedupe_dataframe()` function has two optional parameters, `recall_weight` and `sample_size`:

- **recall_weight** - Ranges from 0 to 2. When set to 2, we are saying we care twice as much about recall as we do about precision.
- **sample_size** - Specifies the sample size used for training as a float from 0 to 1. By default it is 30% (0.3) of our data.

### Specifying Types

If you'd like to specify dates, spatial data, etc., do so here. The structure must be like so: `('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted. The default type is `String`. See the full list of types [below](#Types).
```python
# Price example
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', ('salary', 'Price')])

# has missing example
pandas_dedupe.link_dataframes(df, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])

# crf example
pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
```

# Types

Dedupe supports a variety of datatypes; a full list with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).

pandas-dedupe officially supports the following datatypes:

- **String** - Standard string comparison using a string distance metric. This is the default type.
- **Text** - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.
- **Price** - For comparing positive, non-zero numerical values.
- **DateTime** - For comparing dates.
- **LatLong** - (39.990334, 70.012) will not match (40.01, 69.98) using a string distance metric, even though the points are in a geographically similar location. The LatLong type resolves this by calculating the haversine distance between compared coordinates. LatLong requires the field to be in the format (Lat, Long). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format cannot be processed, you will get a traceback.
- **Exact** - Tests whether fields are an exact match.
- **Exists** - Sometimes the presence or absence of data can be useful in predicting a match. The Exists type tests whether both, one, or neither of the fields are null.

Additional supported parameters are:

- **has missing** - Can be used if one of your data fields contains null values.
- **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more accurate in some cases, but runs much slower. Works with the String and ShortString types.

# Contributors

[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter

[Tawni Marrs](https://github.com/tawnimarrs) - Refactored code, added docstrings

[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`

[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability

# Credits

Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).

%package help
Summary:        Development documents and examples for pandas-dedupe
Provides:       python3-pandas-dedupe-doc
%description help
# pandas-dedupe

The Dedupe library made easy with Pandas.

# Installation

```
pip install pandas-dedupe
```

# Video Tutorials

[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)

# Basic Usage

A training file and a settings file will be created while running Dedupe.
Keeping these files will eliminate the need to retrain your model in the future.
If you would like to retrain your model from scratch, just delete the settings and training files.
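If you do want to retrain from scratch, a small helper like the following can clear a previous run's files. The filenames below are placeholders, since the exact names depend on which pandas-dedupe function wrote them; substitute the settings and training files that actually appear in your working directory.

```python
# Sketch: force a full retrain by deleting the files a previous run created.
# The names below are placeholders -- replace them with the settings and
# training files that actually show up next to your script.
import os

for leftover in ['my_learned_settings', 'my_training.json']:  # placeholders
    if os.path.exists(leftover):
        os.remove(leftover)
```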
### Deduplication (dedupe_dataframe)

`dedupe_dataframe` is for deduplication when your data can contain multiple records that all refer to the same entity.

```python
import pandas as pd
import pandas_dedupe

# load dataframe
df = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])

# send output to csv
df_final.to_csv('deduplication_output.csv')
```

### Gazetteer deduplication (gazetteer_dataframe)

`gazetteer_dataframe` is for matching a messy dataset against a 'canonical dataset' (i.e. the gazette).

```python
import pandas as pd
import pandas_dedupe

# load dataframes
df_clean = pd.read_csv('gazette.csv')
df_messy = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)

# send output to csv
df_final.to_csv('gazetteer_deduplication_output.csv')
```

### Matching / Record Linkage

Use identical field names when linking dataframes.
Record linkage should only be used on dataframes that have already been deduplicated.

```python
import pandas as pd
import pandas_dedupe

# load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')

# initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

# send output to csv
df_final.to_csv('linkage_output.csv')
```

# Advanced Usage

### Canonicalize Fields

The `canonicalize` parameter will standardize names in a given cluster. The original fields are also kept.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'payment_type'], canonicalize=True)
```

### Update Threshold (dedupe_dataframe and gazetteer_dataframe only)

Group records into clusters only if the cophenetic similarity of the cluster is greater than the threshold.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)
```

### Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)

If `True`, this allows the user to update the existing model.

```python
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)
```

### Recall Weight & Sample Size

The `dedupe_dataframe()` function has two optional parameters, `recall_weight` and `sample_size`:

- **recall_weight** - Ranges from 0 to 2. When set to 2, we are saying we care twice as much about recall as we do about precision.
- **sample_size** - Specifies the sample size used for training as a float from 0 to 1. By default it is 30% (0.3) of our data.

### Specifying Types

If you'd like to specify dates, spatial data, etc., do so here. The structure must be like so: `('field', 'type', 'additional_parameter')`. The `additional_parameter` section can be omitted. The default type is `String`. See the full list of types [below](#Types).

```python
# Price example
pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', ('salary', 'Price')])

# has missing example
pandas_dedupe.link_dataframes(df, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])

# crf example
pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
```

# Types

Dedupe supports a variety of datatypes; a full list with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).

pandas-dedupe officially supports the following datatypes:

- **String** - Standard string comparison using a string distance metric. This is the default type.
- **Text** - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.
- **Price** - For comparing positive, non-zero numerical values.
- **DateTime** - For comparing dates.
- **LatLong** - (39.990334, 70.012) will not match (40.01, 69.98) using a string distance metric, even though the points are in a geographically similar location. The LatLong type resolves this by calculating the haversine distance between compared coordinates. LatLong requires the field to be in the format (Lat, Long). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format cannot be processed, you will get a traceback.
- **Exact** - Tests whether fields are an exact match.
- **Exists** - Sometimes the presence or absence of data can be useful in predicting a match. The Exists type tests whether both, one, or neither of the fields are null.

Additional supported parameters are:

- **has missing** - Can be used if one of your data fields contains null values.
- **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more accurate in some cases, but runs much slower. Works with the String and ShortString types.

# Contributors

[Tyler Marrs](http://tylermarrs.com/) - Refactored code, added docstrings, added `threshold` parameter

[Tawni Marrs](https://github.com/tawnimarrs) - Refactored code, added docstrings

[ieriii](https://github.com/ieriii) - Added `update_model` parameter, updated codebase to use `Dedupe 2.0`, added support for multiprocessing, added `gazetteer_dataframe`

[Daniel Marczin](https://github.com/dim5) - Extensive updates to documentation to enhance readability

# Credits

Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).

%prep
%autosetup -n pandas-dedupe-1.5.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-pandas-dedupe -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Sun Apr 23 2023 Python_Bot - 1.5.0-1
- Package Spec generated