summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorCoprDistGit <infra@openeuler.org>2023-05-05 15:20:53 +0000
committerCoprDistGit <infra@openeuler.org>2023-05-05 15:20:53 +0000
commitc0b2d90c7f9a9012bff3a00a038b2589bb9502a0 (patch)
tree44dac69691ac2b2d39bd15c8f04b0eb4a2bdf22c
parent599f26f1c070e02abc9bcb5bcf6700ebb470c2f6 (diff)
automatic import of python-prefixspanopeneuler20.03
-rw-r--r--.gitignore1
-rw-r--r--python-prefixspan.spec1005
-rw-r--r--sources1
3 files changed, 1007 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..550142d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/prefixspan-0.5.2.tar.gz
diff --git a/python-prefixspan.spec b/python-prefixspan.spec
new file mode 100644
index 0000000..456259d
--- /dev/null
+++ b/python-prefixspan.spec
@@ -0,0 +1,1005 @@
+%global _empty_manifest_terminate_build 0
+Name: python-prefixspan
+Version: 0.5.2
+Release: 1
+Summary: PrefixSpan, BIDE, and FEAT in Python 3
+License: MIT
+URL: https://github.com/chuanconggao/PrefixSpan-py
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/c5/b0/e66e9f6e07a0b37aa0f5703c46f54bafbdf65dfba63994247676b19076c4/prefixspan-0.5.2.tar.gz
+BuildArch: noarch
+
+
+%description
+[![PyPI version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+[![PyPI pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+[![PyPI license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+
+**Featured on ImportPython [Issue 173](http://importpython.com/newsletter/no/173/). Thank you so much for support!**
+
+The shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/), the famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986) (in `closed.py`), and the frequent **generator** sequential pattern mining algorithm [FEAT](https://dl.acm.org/citation.cfm?doid=1367497.1367651) (in `generator.py`), as a unified and holistic algorithm framework.
+
+- BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.
+
+- FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.
+
+For simpler code, some general purpose functions have been moved to be part of a new library [extratools](https://github.com/chuanconggao/extratools).
+
+## Reference
+
+### Research Papers
+
+``` text
+PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.
+Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.
+Proceedings of the 17th International Conference on Data Engineering, 2001.
+```
+
+``` text
+BIDE: Efficient Mining of Frequent Closed Sequences.
+Jianyong Wang, Jiawei Han.
+Proceedings of the 20th International Conference on Data Engineering, 2004.
+```
+
+``` text
+Efficient mining of frequent sequence generators.
+Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.
+Proceedings of the 17th International Conference on World Wide Web, 2008.
+```
+
+### Alternative Implementations
+
+I created this project with the [original](https://github.com/chuanconggao/PrefixSpan-py/commit/441b04eca2174b3c92f6b6b2f50a30f1ffe4968c) minimal 15 lines implementation of PrefixSpan for educational purpose. However, as this project grows into a full feature library, its code size also inevitably grows. I have revised and reuploaded the original implementation as a GitHub Gist [here](https://gist.github.com/chuanconggao/4df9c1b06fa7f3ed854d5d96e2ae499f) for reference.
+
+You can also try my Scala [version](https://github.com/chuanconggao/PrefixSpan-scala) of PrefixSpan.
+
+## Features
+
+Outputs traditional single-item sequential patterns, where gaps are allowed between items.
+
+- Mining top-k patterns is supported, with respective optimizations on efficiency.
+
+- You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.
+
+- Custom key function, custom filter function, and custom callback function can be applied.
+
+## Installation
+
+This package is available on PyPI. Just use `pip3 install -U prefixspan` to install it.
+
+## CLI Usage
+
+You can simply use the algorithms on terminal.
+
+``` text
+Usage:
+ prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]
+
+ prefixspan-cli --help
+
+
+Options:
+ --text Treat each item as text instead of integer.
+
+ --closed Return only closed patterns.
+ --generator Return only generator patterns.
+
+ --key=<key> Custom key function. [default: ]
+ Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
+ --bound=<bound> The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
+ Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
+ Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).
+
+ --filter=<filter> Custom filter function. [default: ]
+ Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.
+
+ --minlen=<minlen> Minimum length of patterns. [default: 1]
+ --maxlen=<maxlen> Maximum length of patterns. [default: 1000]
+```
+
+* Sequences are read from standard input. Each sequence is integers separated by space, like this example:
+
+``` text
+cat test.dat
+
+0 1 2 3 4
+1 1 1 3 4
+2 1 2 2 0
+1 1 1 2 2
+```
+
+- When dealing with text data, please use the `--text` option. Each sequence is words separated by space, assuming stop words have been removed, like this example:
+
+``` text
+cat test.txt
+
+a b c d e
+b b b d e
+c b c c a
+b b b c c
+```
+
+* The patterns and their respective frequencies are printed to standard output.
+
+``` text
+prefixspan-cli frequent 2 test.dat
+
+0 : 2
+1 : 4
+1 2 : 3
+1 2 2 : 2
+1 3 : 2
+1 3 4 : 2
+1 4 : 2
+1 1 : 2
+1 1 1 : 2
+2 : 3
+2 2 : 2
+3 : 2
+3 4 : 2
+4 : 2
+```
+
+``` text
+prefixspan-cli frequent 2 --text test.txt
+
+a : 2
+b : 4
+b c : 3
+b c c : 2
+b d : 2
+b d e : 2
+b e : 2
+b b : 2
+b b b : 2
+c : 3
+c c : 2
+d : 2
+d e : 2
+e : 2
+```
+
+## API Usage
+
+Alternatively, you can use the algorithms via API.
+
+``` python
+from prefixspan import PrefixSpan
+
+db = [
+ [0, 1, 2, 3, 4],
+ [1, 1, 1, 3, 4],
+ [2, 1, 2, 2, 0],
+ [1, 1, 1, 2, 2],
+]
+
+ps = PrefixSpan(db)
+```
+
+For details of each parameter, please refer to the `PrefixSpan` class in `prefixspan/api.py`.
+
+``` python
+print(ps.frequent(2))
+# [(2, [0]),
+# (4, [1]),
+# (3, [1, 2]),
+# (2, [1, 2, 2]),
+# (2, [1, 3]),
+# (2, [1, 3, 4]),
+# (2, [1, 4]),
+# (2, [1, 1]),
+# (2, [1, 1, 1]),
+# (3, [2]),
+# (2, [2, 2]),
+# (2, [3]),
+# (2, [3, 4]),
+# (2, [4])]
+
+print(ps.topk(5))
+# [(4, [1]),
+# (3, [2]),
+# (3, [1, 2]),
+# (2, [1, 3]),
+# (2, [1, 3, 4])]
+
+
+print(ps.frequent(2, closed=True))
+
+print(ps.topk(5, closed=True))
+
+
+print(ps.frequent(2, generator=True))
+
+print(ps.topk(5, generator=True))
+```
+
+## Closed Patterns and Generator Patterns
+
+The closed patterns are much more compact due to the smaller number.
+
+- A pattern is closed if there is no super-pattern with the same frequency.
+
+``` text
+prefixspan-cli frequent 2 --closed test.dat
+
+0 : 2
+1 : 4
+1 2 : 3
+1 2 2 : 2
+1 3 4 : 2
+1 1 1 : 2
+```
+
+The generator patterns are even more compact due to both the smaller number and the shorter lengths.
+
+- A pattern is generator if there is no sub-pattern with the same frequency.
+
+- Due to the high compactness, generator patterns are useful as features for classification, etc.
+
+``` text
+prefixspan-cli frequent 2 --generator test.dat
+
+0 : 2
+1 1 : 2
+2 : 3
+2 2 : 2
+3 : 2
+4 : 2
+```
+
+There are patterns that are both closed and generator.
+
+``` text
+prefixspan-cli frequent 2 --closed --generator test.dat
+
+0 : 2
+```
+
+## Custom Key Function
+
+For both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
+
+- In default, `len(matches)` is used denoting the frequency of current pattern.
+
+- Alternatively, any key function can be used. As an example, `sum(len(db[i]) for i in matches)` can be used to find the satisfying patterns according to the number of matched items.
+
+- For efficiency, an anti-monotone upper-bound function should also be specified for pruning.
+
+ - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
+
+``` python
+print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
+# [(20, [1]),
+# (15, [2]),
+# (15, [1, 2]),
+# (10, [1, 3]),
+# (10, [1, 3, 4])]
+```
+
+## Custom Filter Function
+
+For both frequent and top-k algorithms, a custom filter function `filter=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
+
+- In default, `filter` is not applied and all the patterns are returned.
+
+- Alternatively, any function can be used. As an example, `matches[0][0] > 0` can be used to exclude the patterns covering the first sequence.
+
+``` python
+print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
+# [(2, [1, 1]),
+# (2, [1, 1, 1]),
+# (2, [1, 2, 2]),
+# (2, [2, 2]),
+# (1, [1, 2, 2, 0])]
+```
+
+## Custom Callback Function
+
+For both the frequent and the top-k algorithm, you can use a custom callback function `callback=lambda patt, matches: ...` instead of returning the normal results of patterns and their respective frequencies.
+
+- When callback function is specified, `None` is returned.
+
+- For large datasets, when mining frequent patterns, you can use callback function to process each pattern immediately, and avoid having a huge list of patterns consuming huge amount of memory.
+
+- The following example finds the longest frequent pattern covering each sequence.
+
+``` python
+coverage = [[] for i in range(len(db))]
+
+def cover(patt, matches):
+ for i, _ in matches:
+ coverage[i] = max(coverage[i], patt, key=len)
+
+
+ps.frequent(2, callback=cover)
+
+print(coverage)
+# [[1, 3, 4],
+# [1, 3, 4],
+# [1, 2, 2],
+# [1, 2, 2]]
+```
+
+## Tip
+
+I strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.
+
+%package -n python3-prefixspan
+Summary: PrefixSpan, BIDE, and FEAT in Python 3
+Provides: python-prefixspan
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-prefixspan
+[![PyPI version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+[![PyPI pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+[![PyPI license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+
+**Featured on ImportPython [Issue 173](http://importpython.com/newsletter/no/173/). Thank you so much for support!**
+
+The shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/), the famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986) (in `closed.py`), and the frequent **generator** sequential pattern mining algorithm [FEAT](https://dl.acm.org/citation.cfm?doid=1367497.1367651) (in `generator.py`), as a unified and holistic algorithm framework.
+
+- BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.
+
+- FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.
+
+For simpler code, some general purpose functions have been moved to be part of a new library [extratools](https://github.com/chuanconggao/extratools).
+
+## Reference
+
+### Research Papers
+
+``` text
+PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.
+Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.
+Proceedings of the 17th International Conference on Data Engineering, 2001.
+```
+
+``` text
+BIDE: Efficient Mining of Frequent Closed Sequences.
+Jianyong Wang, Jiawei Han.
+Proceedings of the 20th International Conference on Data Engineering, 2004.
+```
+
+``` text
+Efficient mining of frequent sequence generators.
+Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.
+Proceedings of the 17th International Conference on World Wide Web, 2008.
+```
+
+### Alternative Implementations
+
+I created this project with the [original](https://github.com/chuanconggao/PrefixSpan-py/commit/441b04eca2174b3c92f6b6b2f50a30f1ffe4968c) minimal 15 lines implementation of PrefixSpan for educational purpose. However, as this project grows into a full feature library, its code size also inevitably grows. I have revised and reuploaded the original implementation as a GitHub Gist [here](https://gist.github.com/chuanconggao/4df9c1b06fa7f3ed854d5d96e2ae499f) for reference.
+
+You can also try my Scala [version](https://github.com/chuanconggao/PrefixSpan-scala) of PrefixSpan.
+
+## Features
+
+Outputs traditional single-item sequential patterns, where gaps are allowed between items.
+
+- Mining top-k patterns is supported, with respective optimizations on efficiency.
+
+- You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.
+
+- Custom key function, custom filter function, and custom callback function can be applied.
+
+## Installation
+
+This package is available on PyPI. Just use `pip3 install -U prefixspan` to install it.
+
+## CLI Usage
+
+You can simply use the algorithms on terminal.
+
+``` text
+Usage:
+ prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]
+
+ prefixspan-cli --help
+
+
+Options:
+ --text Treat each item as text instead of integer.
+
+ --closed Return only closed patterns.
+ --generator Return only generator patterns.
+
+ --key=<key> Custom key function. [default: ]
+ Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
+ --bound=<bound> The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
+ Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
+ Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).
+
+ --filter=<filter> Custom filter function. [default: ]
+ Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.
+
+ --minlen=<minlen> Minimum length of patterns. [default: 1]
+ --maxlen=<maxlen> Maximum length of patterns. [default: 1000]
+```
+
+* Sequences are read from standard input. Each sequence is integers separated by space, like this example:
+
+``` text
+cat test.dat
+
+0 1 2 3 4
+1 1 1 3 4
+2 1 2 2 0
+1 1 1 2 2
+```
+
+- When dealing with text data, please use the `--text` option. Each sequence is words separated by space, assuming stop words have been removed, like this example:
+
+``` text
+cat test.txt
+
+a b c d e
+b b b d e
+c b c c a
+b b b c c
+```
+
+* The patterns and their respective frequencies are printed to standard output.
+
+``` text
+prefixspan-cli frequent 2 test.dat
+
+0 : 2
+1 : 4
+1 2 : 3
+1 2 2 : 2
+1 3 : 2
+1 3 4 : 2
+1 4 : 2
+1 1 : 2
+1 1 1 : 2
+2 : 3
+2 2 : 2
+3 : 2
+3 4 : 2
+4 : 2
+```
+
+``` text
+prefixspan-cli frequent 2 --text test.txt
+
+a : 2
+b : 4
+b c : 3
+b c c : 2
+b d : 2
+b d e : 2
+b e : 2
+b b : 2
+b b b : 2
+c : 3
+c c : 2
+d : 2
+d e : 2
+e : 2
+```
+
+## API Usage
+
+Alternatively, you can use the algorithms via API.
+
+``` python
+from prefixspan import PrefixSpan
+
+db = [
+ [0, 1, 2, 3, 4],
+ [1, 1, 1, 3, 4],
+ [2, 1, 2, 2, 0],
+ [1, 1, 1, 2, 2],
+]
+
+ps = PrefixSpan(db)
+```
+
+For details of each parameter, please refer to the `PrefixSpan` class in `prefixspan/api.py`.
+
+``` python
+print(ps.frequent(2))
+# [(2, [0]),
+# (4, [1]),
+# (3, [1, 2]),
+# (2, [1, 2, 2]),
+# (2, [1, 3]),
+# (2, [1, 3, 4]),
+# (2, [1, 4]),
+# (2, [1, 1]),
+# (2, [1, 1, 1]),
+# (3, [2]),
+# (2, [2, 2]),
+# (2, [3]),
+# (2, [3, 4]),
+# (2, [4])]
+
+print(ps.topk(5))
+# [(4, [1]),
+# (3, [2]),
+# (3, [1, 2]),
+# (2, [1, 3]),
+# (2, [1, 3, 4])]
+
+
+print(ps.frequent(2, closed=True))
+
+print(ps.topk(5, closed=True))
+
+
+print(ps.frequent(2, generator=True))
+
+print(ps.topk(5, generator=True))
+```
+
+## Closed Patterns and Generator Patterns
+
+The closed patterns are much more compact due to the smaller number.
+
+- A pattern is closed if there is no super-pattern with the same frequency.
+
+``` text
+prefixspan-cli frequent 2 --closed test.dat
+
+0 : 2
+1 : 4
+1 2 : 3
+1 2 2 : 2
+1 3 4 : 2
+1 1 1 : 2
+```
+
+The generator patterns are even more compact due to both the smaller number and the shorter lengths.
+
+- A pattern is generator if there is no sub-pattern with the same frequency.
+
+- Due to the high compactness, generator patterns are useful as features for classification, etc.
+
+``` text
+prefixspan-cli frequent 2 --generator test.dat
+
+0 : 2
+1 1 : 2
+2 : 3
+2 2 : 2
+3 : 2
+4 : 2
+```
+
+There are patterns that are both closed and generator.
+
+``` text
+prefixspan-cli frequent 2 --closed --generator test.dat
+
+0 : 2
+```
+
+## Custom Key Function
+
+For both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
+
+- In default, `len(matches)` is used denoting the frequency of current pattern.
+
+- Alternatively, any key function can be used. As an example, `sum(len(db[i]) for i in matches)` can be used to find the satisfying patterns according to the number of matched items.
+
+- For efficiency, an anti-monotone upper-bound function should also be specified for pruning.
+
+ - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
+
+``` python
+print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
+# [(20, [1]),
+# (15, [2]),
+# (15, [1, 2]),
+# (10, [1, 3]),
+# (10, [1, 3, 4])]
+```
+
+## Custom Filter Function
+
+For both frequent and top-k algorithms, a custom filter function `filter=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
+
+- In default, `filter` is not applied and all the patterns are returned.
+
+- Alternatively, any function can be used. As an example, `matches[0][0] > 0` can be used to exclude the patterns covering the first sequence.
+
+``` python
+print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
+# [(2, [1, 1]),
+# (2, [1, 1, 1]),
+# (2, [1, 2, 2]),
+# (2, [2, 2]),
+# (1, [1, 2, 2, 0])]
+```
+
+## Custom Callback Function
+
+For both the frequent and the top-k algorithm, you can use a custom callback function `callback=lambda patt, matches: ...` instead of returning the normal results of patterns and their respective frequencies.
+
+- When callback function is specified, `None` is returned.
+
+- For large datasets, when mining frequent patterns, you can use callback function to process each pattern immediately, and avoid having a huge list of patterns consuming huge amount of memory.
+
+- The following example finds the longest frequent pattern covering each sequence.
+
+``` python
+coverage = [[] for i in range(len(db))]
+
+def cover(patt, matches):
+ for i, _ in matches:
+ coverage[i] = max(coverage[i], patt, key=len)
+
+
+ps.frequent(2, callback=cover)
+
+print(coverage)
+# [[1, 3, 4],
+# [1, 3, 4],
+# [1, 2, 2],
+# [1, 2, 2]]
+```
+
+## Tip
+
+I strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.
+
+%package help
+Summary: Development documents and examples for prefixspan
+Provides: python3-prefixspan-doc
+%description help
+[![PyPI version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+[![PyPI pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+[![PyPI license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)
+
+**Featured on ImportPython [Issue 173](http://importpython.com/newsletter/no/173/). Thank you so much for support!**
+
+The shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/), the famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986) (in `closed.py`), and the frequent **generator** sequential pattern mining algorithm [FEAT](https://dl.acm.org/citation.cfm?doid=1367497.1367651) (in `generator.py`), as a unified and holistic algorithm framework.
+
+- BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.
+
+- FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.
+
+For simpler code, some general purpose functions have been moved to be part of a new library [extratools](https://github.com/chuanconggao/extratools).
+
+## Reference
+
+### Research Papers
+
+``` text
+PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.
+Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.
+Proceedings of the 17th International Conference on Data Engineering, 2001.
+```
+
+``` text
+BIDE: Efficient Mining of Frequent Closed Sequences.
+Jianyong Wang, Jiawei Han.
+Proceedings of the 20th International Conference on Data Engineering, 2004.
+```
+
+``` text
+Efficient mining of frequent sequence generators.
+Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.
+Proceedings of the 17th International Conference on World Wide Web, 2008.
+```
+
+### Alternative Implementations
+
+I created this project with the [original](https://github.com/chuanconggao/PrefixSpan-py/commit/441b04eca2174b3c92f6b6b2f50a30f1ffe4968c) minimal 15 lines implementation of PrefixSpan for educational purpose. However, as this project grows into a full feature library, its code size also inevitably grows. I have revised and reuploaded the original implementation as a GitHub Gist [here](https://gist.github.com/chuanconggao/4df9c1b06fa7f3ed854d5d96e2ae499f) for reference.
+
+You can also try my Scala [version](https://github.com/chuanconggao/PrefixSpan-scala) of PrefixSpan.
+
+## Features
+
+Outputs traditional single-item sequential patterns, where gaps are allowed between items.
+
+- Mining top-k patterns is supported, with respective optimizations on efficiency.
+
+- You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.
+
+- Custom key function, custom filter function, and custom callback function can be applied.
+
+## Installation
+
+This package is available on PyPI. Just use `pip3 install -U prefixspan` to install it.
+
+## CLI Usage
+
+You can simply use the algorithms on terminal.
+
+``` text
+Usage:
+ prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]
+
+ prefixspan-cli --help
+
+
+Options:
+ --text Treat each item as text instead of integer.
+
+ --closed Return only closed patterns.
+ --generator Return only generator patterns.
+
+ --key=<key> Custom key function. [default: ]
+ Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
+ --bound=<bound> The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
+ Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
+ Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).
+
+ --filter=<filter> Custom filter function. [default: ]
+ Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.
+
+ --minlen=<minlen> Minimum length of patterns. [default: 1]
+ --maxlen=<maxlen> Maximum length of patterns. [default: 1000]
+```
+
+* Sequences are read from standard input. Each sequence is integers separated by space, like this example:
+
+``` text
+cat test.dat
+
+0 1 2 3 4
+1 1 1 3 4
+2 1 2 2 0
+1 1 1 2 2
+```
+
+- When dealing with text data, please use the `--text` option. Each sequence is words separated by space, assuming stop words have been removed, like this example:
+
+``` text
+cat test.txt
+
+a b c d e
+b b b d e
+c b c c a
+b b b c c
+```
+
+* The patterns and their respective frequencies are printed to standard output.
+
+``` text
+prefixspan-cli frequent 2 test.dat
+
+0 : 2
+1 : 4
+1 2 : 3
+1 2 2 : 2
+1 3 : 2
+1 3 4 : 2
+1 4 : 2
+1 1 : 2
+1 1 1 : 2
+2 : 3
+2 2 : 2
+3 : 2
+3 4 : 2
+4 : 2
+```
+
+``` text
+prefixspan-cli frequent 2 --text test.txt
+
+a : 2
+b : 4
+b c : 3
+b c c : 2
+b d : 2
+b d e : 2
+b e : 2
+b b : 2
+b b b : 2
+c : 3
+c c : 2
+d : 2
+d e : 2
+e : 2
+```
+
+## API Usage
+
+Alternatively, you can use the algorithms via API.
+
+``` python
+from prefixspan import PrefixSpan
+
+db = [
+ [0, 1, 2, 3, 4],
+ [1, 1, 1, 3, 4],
+ [2, 1, 2, 2, 0],
+ [1, 1, 1, 2, 2],
+]
+
+ps = PrefixSpan(db)
+```
+
+For details of each parameter, please refer to the `PrefixSpan` class in `prefixspan/api.py`.
+
+``` python
+print(ps.frequent(2))
+# [(2, [0]),
+# (4, [1]),
+# (3, [1, 2]),
+# (2, [1, 2, 2]),
+# (2, [1, 3]),
+# (2, [1, 3, 4]),
+# (2, [1, 4]),
+# (2, [1, 1]),
+# (2, [1, 1, 1]),
+# (3, [2]),
+# (2, [2, 2]),
+# (2, [3]),
+# (2, [3, 4]),
+# (2, [4])]
+
+print(ps.topk(5))
+# [(4, [1]),
+# (3, [2]),
+# (3, [1, 2]),
+# (2, [1, 3]),
+# (2, [1, 3, 4])]
+
+
+print(ps.frequent(2, closed=True))
+
+print(ps.topk(5, closed=True))
+
+
+print(ps.frequent(2, generator=True))
+
+print(ps.topk(5, generator=True))
+```
+
+## Closed Patterns and Generator Patterns
+
+The closed patterns are much more compact due to the smaller number.
+
+- A pattern is closed if there is no super-pattern with the same frequency.
+
+``` text
+prefixspan-cli frequent 2 --closed test.dat
+
+0 : 2
+1 : 4
+1 2 : 3
+1 2 2 : 2
+1 3 4 : 2
+1 1 1 : 2
+```
+
+The generator patterns are even more compact due to both the smaller number and the shorter lengths.
+
+- A pattern is generator if there is no sub-pattern with the same frequency.
+
+- Due to the high compactness, generator patterns are useful as features for classification, etc.
+
+``` text
+prefixspan-cli frequent 2 --generator test.dat
+
+0 : 2
+1 1 : 2
+2 : 3
+2 2 : 2
+3 : 2
+4 : 2
+```
+
+There are patterns that are both closed and generator.
+
+``` text
+prefixspan-cli frequent 2 --closed --generator test.dat
+
+0 : 2
+```
+
+## Custom Key Function
+
+For both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
+
+- In default, `len(matches)` is used denoting the frequency of current pattern.
+
+- Alternatively, any key function can be used. As an example, `sum(len(db[i]) for i in matches)` can be used to find the satisfying patterns according to the number of matched items.
+
+- For efficiency, an anti-monotone upper-bound function should also be specified for pruning.
+
+ - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
+
+``` python
+print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
+# [(20, [1]),
+# (15, [2]),
+# (15, [1, 2]),
+# (10, [1, 3]),
+# (10, [1, 3, 4])]
+```
+
+## Custom Filter Function
+
+For both frequent and top-k algorithms, a custom filter function `filter=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.
+
+- In default, `filter` is not applied and all the patterns are returned.
+
+- Alternatively, any function can be used. As an example, `matches[0][0] > 0` can be used to exclude the patterns covering the first sequence.
+
+``` python
+print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
+# [(2, [1, 1]),
+# (2, [1, 1, 1]),
+# (2, [1, 2, 2]),
+# (2, [2, 2]),
+# (1, [1, 2, 2, 0])]
+```
+
+## Custom Callback Function
+
+For both the frequent and the top-k algorithm, you can use a custom callback function `callback=lambda patt, matches: ...` instead of returning the normal results of patterns and their respective frequencies.
+
+- When callback function is specified, `None` is returned.
+
+- For large datasets, when mining frequent patterns, you can use callback function to process each pattern immediately, and avoid having a huge list of patterns consuming huge amount of memory.
+
+- The following example finds the longest frequent pattern covering each sequence.
+
+``` python
+coverage = [[] for i in range(len(db))]
+
+def cover(patt, matches):
+ for i, _ in matches:
+ coverage[i] = max(coverage[i], patt, key=len)
+
+
+ps.frequent(2, callback=cover)
+
+print(coverage)
+# [[1, 3, 4],
+# [1, 3, 4],
+# [1, 2, 2],
+# [1, 2, 2]]
+```
+
+## Tip
+
+I strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.
+
+%prep
+%autosetup -n prefixspan-0.5.2
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-prefixspan -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.2-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..3b481b2
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+671beb44a55202bace8e54ead620aa9e prefixspan-0.5.2.tar.gz