%global _empty_manifest_terminate_build 0
Name:		python-polyfuzz
Version:	0.4.0
Release:	1
Summary:	PolyFuzz performs fuzzy string matching, grouping, and evaluation.
License:	MIT License
URL:		https://github.com/MaartenGr/PolyFuzz
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/fe/90/79ac771627a14ef47d16f2e3d1662332af790d5b942c8af55f1a32aa8ef6/polyfuzz-0.4.0.tar.gz
BuildArch:	noarch

Requires:	python3-numpy
Requires:	python3-scipy
Requires:	python3-pandas
Requires:	python3-tqdm
Requires:	python3-joblib
Requires:	python3-matplotlib
Requires:	python3-seaborn
Requires:	python3-rapidfuzz
Requires:	python3-scikit-learn
Requires:	python3-mkdocs
Requires:	python3-mkdocs-material
Requires:	python3-mkdocstrings
Requires:	python3-pytest
Requires:	python3-pytest-cov
Requires:	python3-torch
Requires:	python3-flair
Requires:	python3-sparse-dot-topn
Requires:	python3-sentence-transformers
Requires:	python3-spacy
Requires:	python3-gensim
Requires:	python3-tensorflow
Requires:	python3-tensorflow-hub
Requires:	python3-tensorflow-text

%description
[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keybert/) [![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE) [![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/) [![Build](https://img.shields.io/github/workflow/status/MaartenGr/polyfuzz/Code%20Checks/master)](https://pypi.org/project/polyfuzz/) [![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)
**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions. PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.

Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding techniques such as FastText and GloVe, and 🤗 transformers embeddings.

The corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).

## Installation

You can install **`PolyFuzz`** via pip:

```bash
pip install polyfuzz
```

Depending on the transformers and language backends you will be using, you may want to install more. The possible installations are:

```bash
pip install polyfuzz[sbert]
pip install polyfuzz[flair]
pip install polyfuzz[gensim]
pip install polyfuzz[spacy]
pip install polyfuzz[use]
```

If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, you can use `sparse_dot_topn`, which is installed via:

```bash
pip install polyfuzz[fast]
```
**Installation Issues**

You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many is to install it via conda before installing PolyFuzz:

```bash
conda install -c conda-forge sparse_dot_topn
```

If that does not work, I would advise you to look through their [issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or to continue using PolyFuzz without `sparse_dot_topn`.
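Since PolyFuzz works either way, you can probe for the optional speed-up at runtime. A minimal sketch (the `HAS_FAST` flag name is just an illustration, not part of the PolyFuzz API):

```python
# Sketch: detect whether the optional sparse_dot_topn speed-up is installed.
# PolyFuzz works without it, so a missing import is not an error here.
try:
    import sparse_dot_topn  # noqa: F401
    HAS_FAST = True
except ImportError:
    HAS_FAST = False

print("fast cosine similarity available:", HAS_FAST)
```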
## Getting Started

For an in-depth overview of the possibilities of `PolyFuzz`, you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or follow along with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).

### Quick Start

The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. We start by defining two lists, one to map from and one to map to. We use `TF-IDF` to create character-level n-grams and then compute the similarity between strings as the cosine similarity between their vector representations. We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
```

The resulting matches can be accessed through `model.get_matches()`:

```python
>>> model.get_matches()
         From      To    Similarity
0       apple   apple      1.000000
1      apples  apples      1.000000
2        appl   apple      0.783751
3       recal    None      0.000000
4       house   mouse      0.587927
5  similarity    None      0.000000
```

**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list on its own: `model.match(from_list)`.

**NOTE 2**: When instantiating `PolyFuzz`, we also could have used "EditDistance" or "Embeddings" to quickly access Levenshtein and FastText (English), respectively.

### Production

The `.match` function allows you to quickly extract similar strings. However, after selecting the right models, you may want to use PolyFuzz in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. Let's say that we have a list of words that we know to be correct, called `train_words`.
We want any incoming word to be mapped to one of the words in `train_words`. In other words, we `fit` on `train_words` and use `transform` on any incoming words:

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the example above, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are saved to be used again in `transform`. This speeds up `transform` quite a bit, since all TF-IDF representations are stored when applying `fit`.

Then, we can save and load the model as follows to use it in production:

```python
# Save the model
model.save("my_model")

# Load the model
loaded_model = PolyFuzz.load("my_model")
```

### Group Matches

We can group the `To` matches, as there might be significant overlap between strings in our `to_list`. To do this, we calculate the similarity within the strings in `to_list` and use `single linkage` to group strings with a high similarity. When we extract the new matches, we can see an additional column `Group` with the group each `To` match was assigned to:

```python
>>> model.group(link_min_similarity=0.75)
>>> model.get_matches()
         From      To    Similarity   Group
0       apple   apple      1.000000  apples
1      apples  apples      1.000000  apples
2        appl   apple      0.783751  apples
3       recal    None      0.000000    None
4       house   mouse      0.587927   mouse
5  similarity    None      0.000000    None
```

As can be seen above, we grouped `apple` and `apples` together, such that when a string is mapped to `apple` it falls in the cluster `[apples, apple]` and is mapped to the first instance in the cluster, which is `apples`.

### Precision-Recall Curve

Next, we would like to see how well our model is doing on our data.
We express our results as **`precision`** and **`recall`**, where precision is defined as the minimum similarity score before a match is correct, and recall as the percentage of matches found at a certain minimum similarity score. Creating the visualizations is as simple as:

```python
model.visualize_precision_recall()
```

## Models

Currently, the following models are implemented in PolyFuzz:

* TF-IDF
* EditDistance (you can use any distance measure, see the [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
* FastText and GloVe
* 🤗 Transformers

With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html). We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.

All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:

```python
from polyfuzz import PolyFuzz
from polyfuzz.models import EditDistance, TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
tfidf = TFIDF(min_similarity=0)
edit = EditDistance()

string_models = [bert, tfidf, edit]
model = PolyFuzz(string_models)
model.match(from_list, to_list)
```

To access the results, we can again call `get_matches`, but since we have multiple models we get a dictionary of dataframes back.
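The shape of that return value can be sketched as follows; plain dicts stand in for the pandas DataFrames that PolyFuzz actually returns, and the ids are assumed to match the `model_id`s used above:

```python
# Sketch only: with several models, get_matches() returns one result per
# model, keyed by its id. Plain dicts stand in for the real DataFrames.
all_matches = {
    "BERT":   {"From": ["apple"], "To": ["apple"], "Similarity": [1.0]},
    "TF-IDF": {"From": ["apple"], "To": ["apple"], "Similarity": [1.0]},
}

# Iterate over every model's results:
for model_id, matches in all_matches.items():
    print(model_id, matches["Similarity"])

# Or pick one model's results by id, mirroring model.get_matches("BERT"):
bert_matches = all_matches["BERT"]
```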
To access the results of a specific model, call `get_matches` with the correct id:

```python
>>> model.get_matches("BERT")
         From      To    Similarity
0       apple   apple      1.000000
1      apples  apples      1.000000
2        appl   apple      0.928045
3       recal  apples      0.825268
4       house   mouse      0.887524
5  similarity   mouse      0.791548
```

Finally, visualize the results to compare the models:

```python
model.visualize_precision_recall(kde=True)
```

## Custom Grouper

We can even use one of the `polyfuzz.models` as the grouper, in case you would like to use something other than the standard TF-IDF model:

```python
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)

edit_grouper = EditDistance(n_jobs=1)
model.group(edit_grouper)
```

## Custom Models

Although the options above are a great solution for comparing different models, what if you have developed your own? If you follow the structure of PolyFuzz's `BaseMatcher`, you can quickly implement any model you would like. Below, we implement the ratio similarity measure from RapidFuzz.
```python
import numpy as np
import pandas as pd
from rapidfuzz import fuzz
from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
    def match(self, from_list, to_list, **kwargs):
        # Calculate distances
        matches = [[fuzz.ratio(from_string, to_string) / 100
                    for to_string in to_list] for from_string in from_list]

        # Get best matches
        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
        scores = np.max(matches, axis=1)

        # Prepare dataframe
        matches = pd.DataFrame({'From': from_list,
                                'To': mappings,
                                'Similarity': scores})
        return matches
```

Then, we can simply create an instance of MyModel and pass it through PolyFuzz:

```python
custom_model = MyModel()
model = PolyFuzz(custom_model)
```

## Citation

To cite PolyFuzz in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020polyfuzz,
  author       = {Maarten Grootendorst},
  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.2.2},
  doi          = {10.5281/zenodo.4461050},
  url          = {https://doi.org/10.5281/zenodo.4461050}
}
```

## References

Below, you can find several resources that were used for, or inspired, the development of PolyFuzz:

**Edit distance algorithms**: These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:

* https://github.com/jamesturk/jellyfish
* https://github.com/ztane/python-Levenshtein
* https://github.com/seatgeek/fuzzywuzzy
* https://github.com/maxbachmann/rapidfuzz
* https://github.com/roy-ht/editdistance

**Other interesting repos**:

* https://github.com/ing-bank/sparse_dot_topn
  * Used in PolyFuzz for fast cosine similarity between sparse matrices

%package -n python3-polyfuzz
Summary:	PolyFuzz performs fuzzy string matching, grouping, and evaluation.
Provides:	python-polyfuzz
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip

%description -n python3-polyfuzz
[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keybert/) [![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE) [![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/) [![Build](https://img.shields.io/github/workflow/status/MaartenGr/polyfuzz/Code%20Checks/master)](https://pypi.org/project/polyfuzz/) [![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)

**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions. PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.

Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding techniques such as FastText and GloVe, and 🤗 transformers embeddings.

The corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).

## Installation

You can install **`PolyFuzz`** via pip:

```bash
pip install polyfuzz
```

Depending on the transformers and language backends you will be using, you may want to install more. The possible installations are:

```bash
pip install polyfuzz[sbert]
pip install polyfuzz[flair]
pip install polyfuzz[gensim]
pip install polyfuzz[spacy]
pip install polyfuzz[use]
```

If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, you can use `sparse_dot_topn`, which is installed via:

```bash
pip install polyfuzz[fast]
```
**Installation Issues**

You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many is to install it via conda before installing PolyFuzz:

```bash
conda install -c conda-forge sparse_dot_topn
```

If that does not work, I would advise you to look through their [issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or to continue using PolyFuzz without `sparse_dot_topn`.
## Getting Started

For an in-depth overview of the possibilities of `PolyFuzz`, you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or follow along with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).

### Quick Start

The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. We start by defining two lists, one to map from and one to map to. We use `TF-IDF` to create character-level n-grams and then compute the similarity between strings as the cosine similarity between their vector representations. We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
```

The resulting matches can be accessed through `model.get_matches()`:

```python
>>> model.get_matches()
         From      To    Similarity
0       apple   apple      1.000000
1      apples  apples      1.000000
2        appl   apple      0.783751
3       recal    None      0.000000
4       house   mouse      0.587927
5  similarity    None      0.000000
```

**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list on its own: `model.match(from_list)`.

**NOTE 2**: When instantiating `PolyFuzz`, we also could have used "EditDistance" or "Embeddings" to quickly access Levenshtein and FastText (English), respectively.

### Production

The `.match` function allows you to quickly extract similar strings. However, after selecting the right models, you may want to use PolyFuzz in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. Let's say that we have a list of words that we know to be correct, called `train_words`.
We want any incoming word to be mapped to one of the words in `train_words`. In other words, we `fit` on `train_words` and use `transform` on any incoming words:

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the example above, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are saved to be used again in `transform`. This speeds up `transform` quite a bit, since all TF-IDF representations are stored when applying `fit`.

Then, we can save and load the model as follows to use it in production:

```python
# Save the model
model.save("my_model")

# Load the model
loaded_model = PolyFuzz.load("my_model")
```

### Group Matches

We can group the `To` matches, as there might be significant overlap between strings in our `to_list`. To do this, we calculate the similarity within the strings in `to_list` and use `single linkage` to group strings with a high similarity. When we extract the new matches, we can see an additional column `Group` with the group each `To` match was assigned to:

```python
>>> model.group(link_min_similarity=0.75)
>>> model.get_matches()
         From      To    Similarity   Group
0       apple   apple      1.000000  apples
1      apples  apples      1.000000  apples
2        appl   apple      0.783751  apples
3       recal    None      0.000000    None
4       house   mouse      0.587927   mouse
5  similarity    None      0.000000    None
```

As can be seen above, we grouped `apple` and `apples` together, such that when a string is mapped to `apple` it falls in the cluster `[apples, apple]` and is mapped to the first instance in the cluster, which is `apples`.

### Precision-Recall Curve

Next, we would like to see how well our model is doing on our data.
We express our results as **`precision`** and **`recall`**, where precision is defined as the minimum similarity score before a match is correct, and recall as the percentage of matches found at a certain minimum similarity score. Creating the visualizations is as simple as:

```python
model.visualize_precision_recall()
```

## Models

Currently, the following models are implemented in PolyFuzz:

* TF-IDF
* EditDistance (you can use any distance measure, see the [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
* FastText and GloVe
* 🤗 Transformers

With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html). We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.

All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:

```python
from polyfuzz import PolyFuzz
from polyfuzz.models import EditDistance, TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
tfidf = TFIDF(min_similarity=0)
edit = EditDistance()

string_models = [bert, tfidf, edit]
model = PolyFuzz(string_models)
model.match(from_list, to_list)
```

To access the results, we can again call `get_matches`, but since we have multiple models we get a dictionary of dataframes back.
To access the results of a specific model, call `get_matches` with the correct id:

```python
>>> model.get_matches("BERT")
         From      To    Similarity
0       apple   apple      1.000000
1      apples  apples      1.000000
2        appl   apple      0.928045
3       recal  apples      0.825268
4       house   mouse      0.887524
5  similarity   mouse      0.791548
```

Finally, visualize the results to compare the models:

```python
model.visualize_precision_recall(kde=True)
```

## Custom Grouper

We can even use one of the `polyfuzz.models` as the grouper, in case you would like to use something other than the standard TF-IDF model:

```python
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)

edit_grouper = EditDistance(n_jobs=1)
model.group(edit_grouper)
```

## Custom Models

Although the options above are a great solution for comparing different models, what if you have developed your own? If you follow the structure of PolyFuzz's `BaseMatcher`, you can quickly implement any model you would like. Below, we implement the ratio similarity measure from RapidFuzz.
```python
import numpy as np
import pandas as pd
from rapidfuzz import fuzz
from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
    def match(self, from_list, to_list, **kwargs):
        # Calculate distances
        matches = [[fuzz.ratio(from_string, to_string) / 100
                    for to_string in to_list] for from_string in from_list]

        # Get best matches
        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
        scores = np.max(matches, axis=1)

        # Prepare dataframe
        matches = pd.DataFrame({'From': from_list,
                                'To': mappings,
                                'Similarity': scores})
        return matches
```

Then, we can simply create an instance of MyModel and pass it through PolyFuzz:

```python
custom_model = MyModel()
model = PolyFuzz(custom_model)
```

## Citation

To cite PolyFuzz in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020polyfuzz,
  author       = {Maarten Grootendorst},
  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.2.2},
  doi          = {10.5281/zenodo.4461050},
  url          = {https://doi.org/10.5281/zenodo.4461050}
}
```

## References

Below, you can find several resources that were used for, or inspired, the development of PolyFuzz:

**Edit distance algorithms**: These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:

* https://github.com/jamesturk/jellyfish
* https://github.com/ztane/python-Levenshtein
* https://github.com/seatgeek/fuzzywuzzy
* https://github.com/maxbachmann/rapidfuzz
* https://github.com/roy-ht/editdistance

**Other interesting repos**:

* https://github.com/ing-bank/sparse_dot_topn
  * Used in PolyFuzz for fast cosine similarity between sparse matrices

%package help
Summary:	Development documents and examples for polyfuzz
Provides:	python3-polyfuzz-doc

%description help
[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keybert/) [![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE) [![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/) [![Build](https://img.shields.io/github/workflow/status/MaartenGr/polyfuzz/Code%20Checks/master)](https://pypi.org/project/polyfuzz/) [![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)

**`PolyFuzz`** performs fuzzy string matching, string grouping, and contains extensive evaluation functions. PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.

Currently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding techniques such as FastText and GloVe, and 🤗 transformers embeddings.

The corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link&sk=0f765b76ceaba49363829c13dfdc9d98).

## Installation

You can install **`PolyFuzz`** via pip:

```bash
pip install polyfuzz
```

Depending on the transformers and language backends you will be using, you may want to install more. The possible installations are:

```bash
pip install polyfuzz[sbert]
pip install polyfuzz[flair]
pip install polyfuzz[gensim]
pip install polyfuzz[spacy]
pip install polyfuzz[use]
```

If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, you can use `sparse_dot_topn`, which is installed via:

```bash
pip install polyfuzz[fast]
```
**Installation Issues**

You might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many is to install it via conda before installing PolyFuzz:

```bash
conda install -c conda-forge sparse_dot_topn
```

If that does not work, I would advise you to look through their [issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or to continue using PolyFuzz without `sparse_dot_topn`.
## Getting Started

For an in-depth overview of the possibilities of `PolyFuzz`, you can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or follow along with the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).

### Quick Start

The main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. We start by defining two lists, one to map from and one to map to. We use `TF-IDF` to create character-level n-grams and then compute the similarity between strings as the cosine similarity between their vector representations. We only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
```

The resulting matches can be accessed through `model.get_matches()`:

```python
>>> model.get_matches()
         From      To    Similarity
0       apple   apple      1.000000
1      apples  apples      1.000000
2        appl   apple      0.783751
3       recal    None      0.000000
4       house   mouse      0.587927
5  similarity    None      0.000000
```

**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list on its own: `model.match(from_list)`.

**NOTE 2**: When instantiating `PolyFuzz`, we also could have used "EditDistance" or "Embeddings" to quickly access Levenshtein and FastText (English), respectively.

### Production

The `.match` function allows you to quickly extract similar strings. However, after selecting the right models, you may want to use PolyFuzz in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. Let's say that we have a list of words that we know to be correct, called `train_words`.
We want any incoming word to be mapped to one of the words in `train_words`. In other words, we `fit` on `train_words` and use `transform` on any incoming words:

```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)
```

In the example above, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are saved to be used again in `transform`. This speeds up `transform` quite a bit, since all TF-IDF representations are stored when applying `fit`.

Then, we can save and load the model as follows to use it in production:

```python
# Save the model
model.save("my_model")

# Load the model
loaded_model = PolyFuzz.load("my_model")
```

### Group Matches

We can group the `To` matches, as there might be significant overlap between strings in our `to_list`. To do this, we calculate the similarity within the strings in `to_list` and use `single linkage` to group strings with a high similarity. When we extract the new matches, we can see an additional column `Group` with the group each `To` match was assigned to:

```python
>>> model.group(link_min_similarity=0.75)
>>> model.get_matches()
         From      To    Similarity   Group
0       apple   apple      1.000000  apples
1      apples  apples      1.000000  apples
2        appl   apple      0.783751  apples
3       recal    None      0.000000    None
4       house   mouse      0.587927   mouse
5  similarity    None      0.000000    None
```

As can be seen above, we grouped `apple` and `apples` together, such that when a string is mapped to `apple` it falls in the cluster `[apples, apple]` and is mapped to the first instance in the cluster, which is `apples`.

### Precision-Recall Curve

Next, we would like to see how well our model is doing on our data.
We express our results as **`precision`** and **`recall`**, where precision is defined as the minimum similarity score before a match is correct, and recall as the percentage of matches found at a certain minimum similarity score. Creating the visualizations is as simple as:

```python
model.visualize_precision_recall()
```

## Models

Currently, the following models are implemented in PolyFuzz:

* TF-IDF
* EditDistance (you can use any distance measure, see the [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))
* FastText and GloVe
* 🤗 Transformers

With `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html). We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.

All models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:

```python
from polyfuzz import PolyFuzz
from polyfuzz.models import EditDistance, TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT")
tfidf = TFIDF(min_similarity=0)
edit = EditDistance()

string_models = [bert, tfidf, edit]
model = PolyFuzz(string_models)
model.match(from_list, to_list)
```

To access the results, we can again call `get_matches`, but since we have multiple models we get a dictionary of dataframes back.
To access the results of a specific model, call `get_matches` with the correct id:

```python
>>> model.get_matches("BERT")
         From      To    Similarity
0       apple   apple      1.000000
1      apples  apples      1.000000
2        appl   apple      0.928045
3       recal  apples      0.825268
4       house   mouse      0.887524
5  similarity   mouse      0.791548
```

Finally, visualize the results to compare the models:

```python
model.visualize_precision_recall(kde=True)
```

## Custom Grouper

We can even use one of the `polyfuzz.models` as the grouper, in case you would like to use something other than the standard TF-IDF model:

```python
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)

edit_grouper = EditDistance(n_jobs=1)
model.group(edit_grouper)
```

## Custom Models

Although the options above are a great solution for comparing different models, what if you have developed your own? If you follow the structure of PolyFuzz's `BaseMatcher`, you can quickly implement any model you would like. Below, we implement the ratio similarity measure from RapidFuzz.
```python
import numpy as np
import pandas as pd
from rapidfuzz import fuzz
from polyfuzz.models import BaseMatcher


class MyModel(BaseMatcher):
    def match(self, from_list, to_list, **kwargs):
        # Calculate distances
        matches = [[fuzz.ratio(from_string, to_string) / 100
                    for to_string in to_list] for from_string in from_list]

        # Get best matches
        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
        scores = np.max(matches, axis=1)

        # Prepare dataframe
        matches = pd.DataFrame({'From': from_list,
                                'To': mappings,
                                'Similarity': scores})
        return matches
```

Then, we can simply create an instance of MyModel and pass it through PolyFuzz:

```python
custom_model = MyModel()
model = PolyFuzz(custom_model)
```

## Citation

To cite PolyFuzz in your work, please use the following bibtex reference:

```bibtex
@misc{grootendorst2020polyfuzz,
  author       = {Maarten Grootendorst},
  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.2.2},
  doi          = {10.5281/zenodo.4461050},
  url          = {https://doi.org/10.5281/zenodo.4461050}
}
```

## References

Below, you can find several resources that were used for, or inspired, the development of PolyFuzz:

**Edit distance algorithms**: These algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:

* https://github.com/jamesturk/jellyfish
* https://github.com/ztane/python-Levenshtein
* https://github.com/seatgeek/fuzzywuzzy
* https://github.com/maxbachmann/rapidfuzz
* https://github.com/roy-ht/editdistance

**Other interesting repos**:

* https://github.com/ing-bank/sparse_dot_topn
  * Used in PolyFuzz for fast cosine similarity between sparse matrices

%prep
%autosetup -n polyfuzz-0.4.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-polyfuzz -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Apr 25 2023 Python_Bot - 0.4.0-1
- Package Spec generated