| author | CoprDistGit <infra@openeuler.org> | 2023-04-12 06:41:03 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-04-12 06:41:03 +0000 |
| commit | 35c5188e02fc2c99cfb68ad44658e5438cf728f6 (patch) | |
| tree | 33c79771550d8e0273c6587e7fea94220500f452 | |
| parent | 8cc1cc1e0377b6ec8aa9576ed50b23f7788d5303 (diff) | |
automatic import of python-pyrodigal openeuler20.03
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-pyrodigal.spec | 791 |
| -rw-r--r-- | sources | 1 |
3 files changed, 793 insertions, 0 deletions
@@ -0,0 +1 @@
+/pyrodigal-2.1.0.tar.gz
diff --git a/python-pyrodigal.spec b/python-pyrodigal.spec
new file mode 100644
index 0000000..a8348f4
--- /dev/null
+++ b/python-pyrodigal.spec
@@ -0,0 +1,791 @@
+%global _empty_manifest_terminate_build 0
+Name: python-pyrodigal
+Version: 2.1.0
+Release: 1
+Summary: Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes.
+License: GPLv3
+URL: https://github.com/althonos/pyrodigal
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/aa/d2/d770bce91da80bd5e890edc0e9d4d8641450f5c11e192af99ade838016f0/pyrodigal-2.1.0.tar.gz
+
+
+%description
+# 🔥 Pyrodigal [](https://github.com/althonos/pyrodigal/stargazers)
+
+*Cython bindings and Python interface to [Prodigal](https://github.com/hyattpd/Prodigal/), an ORF
+finder for genomes and metagenomes. **Now with SIMD!***
+
+[](https://github.com/althonos/pyrodigal/actions)
+[](https://codecov.io/gh/althonos/pyrodigal/)
+[](https://choosealicense.com/licenses/gpl-3.0/)
+[](https://pypi.org/project/pyrodigal)
+[](https://anaconda.org/bioconda/pyrodigal)
+[](https://aur.archlinux.org/packages/python-pyrodigal)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://github.com/althonos/pyrodigal/)
+[](https://github.com/althonos/pyrodigal/issues)
+[](https://pyrodigal.readthedocs.io)
+[](https://github.com/althonos/pyrodigal/blob/main/CHANGELOG.md)
+[](https://pepy.tech/project/pyrodigal)
+[](https://doi.org/10.21105/joss.04296)
+
+
+## 🗺️ Overview
+
+Pyrodigal is a Python module that provides bindings to Prodigal using
+[Cython](https://cython.org/). It directly interacts with the Prodigal
+internals, which has the following advantages:
+
+- **single dependency**: Pyrodigal is distributed as a Python package, so you
+  can add it as a dependency to your project, and stop worrying about the
+  Prodigal binary being present on the end-user machine.
+- **no intermediate files**: Everything happens in memory, in a Python object
+  you fully control, so you don't have to invoke the Prodigal CLI using a
+  sub-process and temporary files. Sequences can be passed directly as
+  strings or bytes, which avoids the overhead of formatting your input to
+  FASTA for Prodigal.
+- **lower memory usage**: Pyrodigal is slightly more conservative when it comes
+  to using memory, which can help process very large sequences. It also lets
+  you save some more memory when running several *meta*-mode analyses.
+- **better performance**: Pyrodigal uses *SIMD* instructions to compute which
+  dynamic programming nodes can be ignored when scoring connections. This can
+  save from a third to half the runtime depending on the sequence. The [Benchmarks](https://pyrodigal.readthedocs.io/en/stable/benchmarks.html) page of the documentation contains comprehensive comparisons. See the [JOSS paper](https://doi.org/10.21105/joss.04296)
+  for details about how this is achieved.
+- **same results**: Pyrodigal is tested to make sure it produces
+  exactly the same results as Prodigal `v2.6.3+31b300a`. *This was verified
+  extensively by [Julian Hahnfeld](https://github.com/jhahnfeld) and can be
+  checked with his [comparison repository](https://github.com/jhahnfeld/prodigal-pyrodigal-comparison).*
+
+### 📋 Features
+
+The library now features everything from the original Prodigal CLI:
+
+- **run mode selection**: Choose between *single* mode, using a training
+  sequence to count nucleotide hexamers, or *metagenomic* mode, using
+  pre-trained data from different organisms (`prodigal -p`).
+- **region masking**: Prevent genes from being predicted across regions
+  containing unknown nucleotides (`prodigal -m`).
+- **closed ends**: Genes will be identified as running over edges if they
+  are larger than a certain size, but this can be disabled (`prodigal -c`).
+- **training configuration**: During the training process, a custom
+  translation table can be given (`prodigal -g`), and the Shine-Dalgarno motif
+  search can be forcefully bypassed (`prodigal -n`).
+- **output files**: Output files can be written in a format mostly
+  compatible with the Prodigal binary, including the protein translations
+  in FASTA format (`prodigal -a`), the gene sequences in FASTA format
+  (`prodigal -d`), or the potential gene scores in tabular format
+  (`prodigal -s`).
+- **training data persistence**: Getting training data from a sequence and
+  using it for other sequences is supported; in addition, a training data
+  file can be saved and loaded transparently (`prodigal -t`).
+
+In addition, the following **new** features are available:
+
+- **custom gene size threshold**: While Prodigal uses a minimum gene size
+  of 90 nucleotides (60 if on edge), Pyrodigal lets you customize this
+  threshold, allowing for smaller ORFs to be identified if needed.
+
+### 🐏 Memory
+
+Pyrodigal makes several changes compared to the original Prodigal binary
+regarding memory management:
+
+* Sequences are stored as raw bytes instead of compressed bitmaps. This means
+  that the sequence itself takes 3/8th more space, but since the memory used
+  for storing the sequence is often negligible compared to the memory used to
+  store dynamic programming nodes, this is an acceptable trade-off for better
+  performance when extracting said nodes.
+* Node arrays are dynamically allocated and grow exponentially instead of
+  being pre-allocated with a large size. On small sequences, this leads to
+  Pyrodigal using about 30% less memory.
+* Genes are stored in a more compact data structure than in Prodigal (which
+  reserves a buffer to store string data), saving around 1KiB per gene.
+
+
+### 🧶 Thread-safety
+
+[`pyrodigal.OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+instances are thread-safe. In addition, the
+[`find_genes`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder.find_genes)
+method is re-entrant. This means you can train an
+[`OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+instance once, and then use a pool to process sequences in parallel:
+```python
+import multiprocessing.pool
+import pyrodigal
+
+orf_finder = pyrodigal.OrfFinder()
+orf_finder.train(training_sequence)
+
+with multiprocessing.pool.ThreadPool() as pool:
+    predictions = pool.map(orf_finder.find_genes, sequences)
+```
+
+## 🔧 Installing
+
+Pyrodigal can be installed directly from [PyPI](https://pypi.org/project/pyrodigal/),
+which hosts some pre-built wheels for the x86-64 architecture (Linux/OSX/Windows)
+and the Aarch64 architecture (Linux only), as well as the code required to compile
+from source with Cython:
+```console
+$ pip install pyrodigal
+```
+
+Otherwise, Pyrodigal is also available as a [Bioconda](https://bioconda.github.io/)
+package:
+```console
+$ conda install -c bioconda pyrodigal
+```
+
+## 💡 Example
+
+Let's load a sequence from a
+[GenBank](http://www.insdc.org/files/feature_table.html) file, use an `OrfFinder`
+to find all the genes it contains, and print the proteins in two-line FASTA
+format.
+
+### 🔬 [Biopython](https://github.com/biopython/biopython)
+
+To use the [`OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+in single mode (corresponding to `prodigal -p single`, the default operation mode of Prodigal),
+you must explicitly call the
+[`train`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder.train) method
+with the sequence you want to use for training before trying to find genes,
+or you will get a [`RuntimeError`](https://docs.python.org/3/library/exceptions.html#RuntimeError):
+```python
+import Bio.SeqIO
+import pyrodigal
+
+record = Bio.SeqIO.read("sequence.gbk", "genbank")
+
+orf_finder = pyrodigal.OrfFinder()
+orf_finder.train(bytes(record.seq))
+genes = orf_finder.find_genes(bytes(record.seq))
+```
+
+However, in `meta` mode (corresponding to `prodigal -p meta`), you can find genes directly:
+```python
+import Bio.SeqIO
+import pyrodigal
+
+record = Bio.SeqIO.read("sequence.gbk", "genbank")
+
+orf_finder = pyrodigal.OrfFinder(meta=True)
+for i, pred in enumerate(orf_finder.find_genes(bytes(record.seq))):
+    print(f">{record.id}_{i+1}")
+    print(pred.translate())
+```
+
+*On older versions of Biopython (before 1.79) you will need to use
+`record.seq.encode()` instead of `bytes(record.seq)`*.
+
+
+### 🧪 [Scikit-bio](https://github.com/biocore/scikit-bio)
+
+```python
+import skbio.io
+import pyrodigal
+
+seq = next(skbio.io.read("sequence.gbk", "genbank"))
+
+orf_finder = pyrodigal.OrfFinder(meta=True)
+for i, pred in enumerate(orf_finder.find_genes(seq.values.view('B'))):
+    # name each protein after the GenBank LOCUS entry of the input sequence
+    print(f">{seq.metadata['LOCUS']['locus_name']}_{i+1}")
+    print(pred.translate())
+```
+
+*We need to use the [`view`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html)
+method to get the sequence viewable by Cython as an array of `unsigned char`.*
+
+
+## 🔖 Citation
+
+Pyrodigal is scientific software, with a
+[published paper](https://doi.org/10.21105/joss.04296)
+in the [Journal of Open-Source Software](https://joss.theoj.org/). Please
+cite both [Pyrodigal](https://doi.org/10.21105/joss.04296)
+and [Prodigal](https://doi.org/10.1186/1471-2105-11-119) if you are using it in
+an academic work, for instance as:
+
+> Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt *et al.*, 2010).
+
+Detailed references are available on the [Publications page](https://pyrodigal.readthedocs.io/en/stable/publications.html) of the
+[online documentation](https://pyrodigal.readthedocs.io/).
+
+## 💭 Feedback
+
+### ⚠️ Issue Tracker
+
+Found a bug? Have an enhancement request? Head over to the [GitHub issue
+tracker](https://github.com/althonos/pyrodigal/issues) if you need to report
+or ask something. If you are filing a bug, please include as much
+information as you can about the issue, and try to recreate the same bug
+in a simple, easily reproducible situation.
+
+### 🏗️ Contributing
+
+Contributions are more than welcome! See
+[`CONTRIBUTING.md`](https://github.com/althonos/pyrodigal/blob/main/CONTRIBUTING.md)
+for more details.
+
+## 📋 Changelog
+
+This project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html)
+and provides a [changelog](https://github.com/althonos/pyrodigal/blob/main/CHANGELOG.md)
+in the [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) format.
+
+
+## ⚖️ License
+
+This library is provided under the [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/).
+The Prodigal code was written by [Doug Hyatt](https://github.com/hyattpd) and is distributed under the
+terms of the GPLv3 as well. See `vendor/Prodigal/LICENSE` for more information. The `cpu_features` library was written by [Guillaume Chatelet](https://github.com/gchatelet) and is
+licensed under the terms of the [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/). See `vendor/cpu_features/LICENSE` for more information.
+
+*This project is in no way affiliated with, sponsored by, or otherwise endorsed
+by the [original Prodigal authors](https://github.com/hyattpd). It was developed
+by [Martin Larralde](https://github.com/althonos/) during his PhD project
+at the [European Molecular Biology Laboratory](https://www.embl.de/) in
+the [Zeller team](https://github.com/zellerlab).*
+
+
+%package -n python3-pyrodigal
+Summary: Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes.
+Provides: python-pyrodigal
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+BuildRequires: python3-cffi
+BuildRequires: gcc
+BuildRequires: gdb
+%description -n python3-pyrodigal
+# 🔥 Pyrodigal [](https://github.com/althonos/pyrodigal/stargazers)
+
+*Cython bindings and Python interface to [Prodigal](https://github.com/hyattpd/Prodigal/), an ORF
+finder for genomes and metagenomes. **Now with SIMD!***
+
+[](https://github.com/althonos/pyrodigal/actions)
+[](https://codecov.io/gh/althonos/pyrodigal/)
+[](https://choosealicense.com/licenses/gpl-3.0/)
+[](https://pypi.org/project/pyrodigal)
+[](https://anaconda.org/bioconda/pyrodigal)
+[](https://aur.archlinux.org/packages/python-pyrodigal)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://github.com/althonos/pyrodigal/)
+[](https://github.com/althonos/pyrodigal/issues)
+[](https://pyrodigal.readthedocs.io)
+[](https://github.com/althonos/pyrodigal/blob/main/CHANGELOG.md)
+[](https://pepy.tech/project/pyrodigal)
+[](https://doi.org/10.21105/joss.04296)
+
+
+## 🗺️ Overview
+
+Pyrodigal is a Python module that provides bindings to Prodigal using
+[Cython](https://cython.org/). It directly interacts with the Prodigal
+internals, which has the following advantages:
+
+- **single dependency**: Pyrodigal is distributed as a Python package, so you
+  can add it as a dependency to your project, and stop worrying about the
+  Prodigal binary being present on the end-user machine.
+- **no intermediate files**: Everything happens in memory, in a Python object
+  you fully control, so you don't have to invoke the Prodigal CLI using a
+  sub-process and temporary files. Sequences can be passed directly as
+  strings or bytes, which avoids the overhead of formatting your input to
+  FASTA for Prodigal.
+- **lower memory usage**: Pyrodigal is slightly more conservative when it comes
+  to using memory, which can help process very large sequences. It also lets
+  you save some more memory when running several *meta*-mode analyses.
+- **better performance**: Pyrodigal uses *SIMD* instructions to compute which
+  dynamic programming nodes can be ignored when scoring connections. This can
+  save from a third to half the runtime depending on the sequence. The [Benchmarks](https://pyrodigal.readthedocs.io/en/stable/benchmarks.html) page of the documentation contains comprehensive comparisons. See the [JOSS paper](https://doi.org/10.21105/joss.04296)
+  for details about how this is achieved.
+- **same results**: Pyrodigal is tested to make sure it produces
+  exactly the same results as Prodigal `v2.6.3+31b300a`. *This was verified
+  extensively by [Julian Hahnfeld](https://github.com/jhahnfeld) and can be
+  checked with his [comparison repository](https://github.com/jhahnfeld/prodigal-pyrodigal-comparison).*
+
+### 📋 Features
+
+The library now features everything from the original Prodigal CLI:
+
+- **run mode selection**: Choose between *single* mode, using a training
+  sequence to count nucleotide hexamers, or *metagenomic* mode, using
+  pre-trained data from different organisms (`prodigal -p`).
+- **region masking**: Prevent genes from being predicted across regions
+  containing unknown nucleotides (`prodigal -m`).
+- **closed ends**: Genes will be identified as running over edges if they
+  are larger than a certain size, but this can be disabled (`prodigal -c`).
+- **training configuration**: During the training process, a custom
+  translation table can be given (`prodigal -g`), and the Shine-Dalgarno motif
+  search can be forcefully bypassed (`prodigal -n`).
+- **output files**: Output files can be written in a format mostly
+  compatible with the Prodigal binary, including the protein translations
+  in FASTA format (`prodigal -a`), the gene sequences in FASTA format
+  (`prodigal -d`), or the potential gene scores in tabular format
+  (`prodigal -s`).
+- **training data persistence**: Getting training data from a sequence and
+  using it for other sequences is supported; in addition, a training data
+  file can be saved and loaded transparently (`prodigal -t`).
+
+In addition, the following **new** features are available:
+
+- **custom gene size threshold**: While Prodigal uses a minimum gene size
+  of 90 nucleotides (60 if on edge), Pyrodigal lets you customize this
+  threshold, allowing for smaller ORFs to be identified if needed.
+
+### 🐏 Memory
+
+Pyrodigal makes several changes compared to the original Prodigal binary
+regarding memory management:
+
+* Sequences are stored as raw bytes instead of compressed bitmaps. This means
+  that the sequence itself takes 3/8th more space, but since the memory used
+  for storing the sequence is often negligible compared to the memory used to
+  store dynamic programming nodes, this is an acceptable trade-off for better
+  performance when extracting said nodes.
+* Node arrays are dynamically allocated and grow exponentially instead of
+  being pre-allocated with a large size. On small sequences, this leads to
+  Pyrodigal using about 30% less memory.
+* Genes are stored in a more compact data structure than in Prodigal (which
+  reserves a buffer to store string data), saving around 1KiB per gene.
+
+
+### 🧶 Thread-safety
+
+[`pyrodigal.OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+instances are thread-safe. In addition, the
+[`find_genes`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder.find_genes)
+method is re-entrant. This means you can train an
+[`OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+instance once, and then use a pool to process sequences in parallel:
+```python
+import multiprocessing.pool
+import pyrodigal
+
+orf_finder = pyrodigal.OrfFinder()
+orf_finder.train(training_sequence)
+
+with multiprocessing.pool.ThreadPool() as pool:
+    predictions = pool.map(orf_finder.find_genes, sequences)
+```
+
+## 🔧 Installing
+
+Pyrodigal can be installed directly from [PyPI](https://pypi.org/project/pyrodigal/),
+which hosts some pre-built wheels for the x86-64 architecture (Linux/OSX/Windows)
+and the Aarch64 architecture (Linux only), as well as the code required to compile
+from source with Cython:
+```console
+$ pip install pyrodigal
+```
+
+Otherwise, Pyrodigal is also available as a [Bioconda](https://bioconda.github.io/)
+package:
+```console
+$ conda install -c bioconda pyrodigal
+```
+
+## 💡 Example
+
+Let's load a sequence from a
+[GenBank](http://www.insdc.org/files/feature_table.html) file, use an `OrfFinder`
+to find all the genes it contains, and print the proteins in two-line FASTA
+format.
+
+### 🔬 [Biopython](https://github.com/biopython/biopython)
+
+To use the [`OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+in single mode (corresponding to `prodigal -p single`, the default operation mode of Prodigal),
+you must explicitly call the
+[`train`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder.train) method
+with the sequence you want to use for training before trying to find genes,
+or you will get a [`RuntimeError`](https://docs.python.org/3/library/exceptions.html#RuntimeError):
+```python
+import Bio.SeqIO
+import pyrodigal
+
+record = Bio.SeqIO.read("sequence.gbk", "genbank")
+
+orf_finder = pyrodigal.OrfFinder()
+orf_finder.train(bytes(record.seq))
+genes = orf_finder.find_genes(bytes(record.seq))
+```
+
+However, in `meta` mode (corresponding to `prodigal -p meta`), you can find genes directly:
+```python
+import Bio.SeqIO
+import pyrodigal
+
+record = Bio.SeqIO.read("sequence.gbk", "genbank")
+
+orf_finder = pyrodigal.OrfFinder(meta=True)
+for i, pred in enumerate(orf_finder.find_genes(bytes(record.seq))):
+    print(f">{record.id}_{i+1}")
+    print(pred.translate())
+```
+
+*On older versions of Biopython (before 1.79) you will need to use
+`record.seq.encode()` instead of `bytes(record.seq)`*.
+
+
+### 🧪 [Scikit-bio](https://github.com/biocore/scikit-bio)
+
+```python
+import skbio.io
+import pyrodigal
+
+seq = next(skbio.io.read("sequence.gbk", "genbank"))
+
+orf_finder = pyrodigal.OrfFinder(meta=True)
+for i, pred in enumerate(orf_finder.find_genes(seq.values.view('B'))):
+    # name each protein after the GenBank LOCUS entry of the input sequence
+    print(f">{seq.metadata['LOCUS']['locus_name']}_{i+1}")
+    print(pred.translate())
+```
+
+*We need to use the [`view`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html)
+method to get the sequence viewable by Cython as an array of `unsigned char`.*
+
+
+## 🔖 Citation
+
+Pyrodigal is scientific software, with a
+[published paper](https://doi.org/10.21105/joss.04296)
+in the [Journal of Open-Source Software](https://joss.theoj.org/). Please
+cite both [Pyrodigal](https://doi.org/10.21105/joss.04296)
+and [Prodigal](https://doi.org/10.1186/1471-2105-11-119) if you are using it in
+an academic work, for instance as:
+
+> Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt *et al.*, 2010).
+
+Detailed references are available on the [Publications page](https://pyrodigal.readthedocs.io/en/stable/publications.html) of the
+[online documentation](https://pyrodigal.readthedocs.io/).
+
+## 💭 Feedback
+
+### ⚠️ Issue Tracker
+
+Found a bug? Have an enhancement request? Head over to the [GitHub issue
+tracker](https://github.com/althonos/pyrodigal/issues) if you need to report
+or ask something. If you are filing a bug, please include as much
+information as you can about the issue, and try to recreate the same bug
+in a simple, easily reproducible situation.
+
+### 🏗️ Contributing
+
+Contributions are more than welcome! See
+[`CONTRIBUTING.md`](https://github.com/althonos/pyrodigal/blob/main/CONTRIBUTING.md)
+for more details.
+
+## 📋 Changelog
+
+This project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html)
+and provides a [changelog](https://github.com/althonos/pyrodigal/blob/main/CHANGELOG.md)
+in the [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) format.
+
+
+## ⚖️ License
+
+This library is provided under the [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/).
+The Prodigal code was written by [Doug Hyatt](https://github.com/hyattpd) and is distributed under the
+terms of the GPLv3 as well. See `vendor/Prodigal/LICENSE` for more information. The `cpu_features` library was written by [Guillaume Chatelet](https://github.com/gchatelet) and is
+licensed under the terms of the [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/). See `vendor/cpu_features/LICENSE` for more information.
+
+*This project is in no way affiliated with, sponsored by, or otherwise endorsed
+by the [original Prodigal authors](https://github.com/hyattpd). It was developed
+by [Martin Larralde](https://github.com/althonos/) during his PhD project
+at the [European Molecular Biology Laboratory](https://www.embl.de/) in
+the [Zeller team](https://github.com/zellerlab).*
+
+
+%package help
+Summary: Development documents and examples for pyrodigal
+Provides: python3-pyrodigal-doc
+%description help
+# 🔥 Pyrodigal [](https://github.com/althonos/pyrodigal/stargazers)
+
+*Cython bindings and Python interface to [Prodigal](https://github.com/hyattpd/Prodigal/), an ORF
+finder for genomes and metagenomes. **Now with SIMD!***
+
+[](https://github.com/althonos/pyrodigal/actions)
+[](https://codecov.io/gh/althonos/pyrodigal/)
+[](https://choosealicense.com/licenses/gpl-3.0/)
+[](https://pypi.org/project/pyrodigal)
+[](https://anaconda.org/bioconda/pyrodigal)
+[](https://aur.archlinux.org/packages/python-pyrodigal)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://pypi.org/project/pyrodigal/#files)
+[](https://github.com/althonos/pyrodigal/)
+[](https://github.com/althonos/pyrodigal/issues)
+[](https://pyrodigal.readthedocs.io)
+[](https://github.com/althonos/pyrodigal/blob/main/CHANGELOG.md)
+[](https://pepy.tech/project/pyrodigal)
+[](https://doi.org/10.21105/joss.04296)
+
+
+## 🗺️ Overview
+
+Pyrodigal is a Python module that provides bindings to Prodigal using
+[Cython](https://cython.org/). It directly interacts with the Prodigal
+internals, which has the following advantages:
+
+- **single dependency**: Pyrodigal is distributed as a Python package, so you
+  can add it as a dependency to your project, and stop worrying about the
+  Prodigal binary being present on the end-user machine.
+- **no intermediate files**: Everything happens in memory, in a Python object
+  you fully control, so you don't have to invoke the Prodigal CLI using a
+  sub-process and temporary files. Sequences can be passed directly as
+  strings or bytes, which avoids the overhead of formatting your input to
+  FASTA for Prodigal.
+- **lower memory usage**: Pyrodigal is slightly more conservative when it comes
+  to using memory, which can help process very large sequences. It also lets
+  you save some more memory when running several *meta*-mode analyses.
+- **better performance**: Pyrodigal uses *SIMD* instructions to compute which
+  dynamic programming nodes can be ignored when scoring connections. This can
+  save from a third to half the runtime depending on the sequence. The [Benchmarks](https://pyrodigal.readthedocs.io/en/stable/benchmarks.html) page of the documentation contains comprehensive comparisons. See the [JOSS paper](https://doi.org/10.21105/joss.04296)
+  for details about how this is achieved.
+- **same results**: Pyrodigal is tested to make sure it produces
+  exactly the same results as Prodigal `v2.6.3+31b300a`. *This was verified
+  extensively by [Julian Hahnfeld](https://github.com/jhahnfeld) and can be
+  checked with his [comparison repository](https://github.com/jhahnfeld/prodigal-pyrodigal-comparison).*
+
+### 📋 Features
+
+The library now features everything from the original Prodigal CLI:
+
+- **run mode selection**: Choose between *single* mode, using a training
+  sequence to count nucleotide hexamers, or *metagenomic* mode, using
+  pre-trained data from different organisms (`prodigal -p`).
+- **region masking**: Prevent genes from being predicted across regions
+  containing unknown nucleotides (`prodigal -m`).
+- **closed ends**: Genes will be identified as running over edges if they
+  are larger than a certain size, but this can be disabled (`prodigal -c`).
+- **training configuration**: During the training process, a custom
+  translation table can be given (`prodigal -g`), and the Shine-Dalgarno motif
+  search can be forcefully bypassed (`prodigal -n`).
+- **output files**: Output files can be written in a format mostly
+  compatible with the Prodigal binary, including the protein translations
+  in FASTA format (`prodigal -a`), the gene sequences in FASTA format
+  (`prodigal -d`), or the potential gene scores in tabular format
+  (`prodigal -s`).
+- **training data persistence**: Getting training data from a sequence and
+  using it for other sequences is supported; in addition, a training data
+  file can be saved and loaded transparently (`prodigal -t`).
+
+In addition, the following **new** features are available:
+
+- **custom gene size threshold**: While Prodigal uses a minimum gene size
+  of 90 nucleotides (60 if on edge), Pyrodigal lets you customize this
+  threshold, allowing for smaller ORFs to be identified if needed.
+
+### 🐏 Memory
+
+Pyrodigal makes several changes compared to the original Prodigal binary
+regarding memory management:
+
+* Sequences are stored as raw bytes instead of compressed bitmaps. This means
+  that the sequence itself takes 3/8th more space, but since the memory used
+  for storing the sequence is often negligible compared to the memory used to
+  store dynamic programming nodes, this is an acceptable trade-off for better
+  performance when extracting said nodes.
+* Node arrays are dynamically allocated and grow exponentially instead of
+  being pre-allocated with a large size. On small sequences, this leads to
+  Pyrodigal using about 30% less memory.
+* Genes are stored in a more compact data structure than in Prodigal (which
+  reserves a buffer to store string data), saving around 1KiB per gene.
+
+
+### 🧶 Thread-safety
+
+[`pyrodigal.OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+instances are thread-safe. In addition, the
+[`find_genes`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder.find_genes)
+method is re-entrant. This means you can train an
+[`OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+instance once, and then use a pool to process sequences in parallel:
+```python
+import multiprocessing.pool
+import pyrodigal
+
+orf_finder = pyrodigal.OrfFinder()
+orf_finder.train(training_sequence)
+
+with multiprocessing.pool.ThreadPool() as pool:
+    predictions = pool.map(orf_finder.find_genes, sequences)
+```
+
+## 🔧 Installing
+
+Pyrodigal can be installed directly from [PyPI](https://pypi.org/project/pyrodigal/),
+which hosts some pre-built wheels for the x86-64 architecture (Linux/OSX/Windows)
+and the Aarch64 architecture (Linux only), as well as the code required to compile
+from source with Cython:
+```console
+$ pip install pyrodigal
+```
+
+Otherwise, Pyrodigal is also available as a [Bioconda](https://bioconda.github.io/)
+package:
+```console
+$ conda install -c bioconda pyrodigal
+```
+
+## 💡 Example
+
+Let's load a sequence from a
+[GenBank](http://www.insdc.org/files/feature_table.html) file, use an `OrfFinder`
+to find all the genes it contains, and print the proteins in two-line FASTA
+format.
+
+### 🔬 [Biopython](https://github.com/biopython/biopython)
+
+To use the [`OrfFinder`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder)
+in single mode (corresponding to `prodigal -p single`, the default operation mode of Prodigal),
+you must explicitly call the
+[`train`](https://pyrodigal.readthedocs.io/en/stable/api/orf_finder.html#pyrodigal.OrfFinder.train) method
+with the sequence you want to use for training before trying to find genes,
+or you will get a [`RuntimeError`](https://docs.python.org/3/library/exceptions.html#RuntimeError):
+```python
+import Bio.SeqIO
+import pyrodigal
+
+record = Bio.SeqIO.read("sequence.gbk", "genbank")
+
+orf_finder = pyrodigal.OrfFinder()
+orf_finder.train(bytes(record.seq))
+genes = orf_finder.find_genes(bytes(record.seq))
+```
+
+However, in `meta` mode (corresponding to `prodigal -p meta`), you can find genes directly:
+```python
+import Bio.SeqIO
+import pyrodigal
+
+record = Bio.SeqIO.read("sequence.gbk", "genbank")
+
+orf_finder = pyrodigal.OrfFinder(meta=True)
+for i, pred in enumerate(orf_finder.find_genes(bytes(record.seq))):
+    print(f">{record.id}_{i+1}")
+    print(pred.translate())
+```
+
+*On older versions of Biopython (before 1.79) you will need to use
+`record.seq.encode()` instead of `bytes(record.seq)`*.
+
+
+### 🧪 [Scikit-bio](https://github.com/biocore/scikit-bio)
+
+```python
+import skbio.io
+import pyrodigal
+
+seq = next(skbio.io.read("sequence.gbk", "genbank"))
+
+orf_finder = pyrodigal.OrfFinder(meta=True)
+for i, pred in enumerate(orf_finder.find_genes(seq.values.view('B'))):
+    # name each protein after the GenBank LOCUS entry of the input sequence
+    print(f">{seq.metadata['LOCUS']['locus_name']}_{i+1}")
+    print(pred.translate())
+```
+
+*We need to use the [`view`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html)
+method to get the sequence viewable by Cython as an array of `unsigned char`.*
+
+
+## 🔖 Citation
+
+Pyrodigal is scientific software, with a
+[published paper](https://doi.org/10.21105/joss.04296)
+in the [Journal of Open-Source Software](https://joss.theoj.org/). Please
+cite both [Pyrodigal](https://doi.org/10.21105/joss.04296)
+and [Prodigal](https://doi.org/10.1186/1471-2105-11-119) if you are using it in
+an academic work, for instance as:
+
+> Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt *et al.*, 2010).
+
+Detailed references are available on the [Publications page](https://pyrodigal.readthedocs.io/en/stable/publications.html) of the
+[online documentation](https://pyrodigal.readthedocs.io/).
+
+## 💭 Feedback
+
+### ⚠️ Issue Tracker
+
+Found a bug? Have an enhancement request? Head over to the [GitHub issue
+tracker](https://github.com/althonos/pyrodigal/issues) if you need to report
+or ask something. If you are filing a bug, please include as much
+information as you can about the issue, and try to recreate the same bug
+in a simple, easily reproducible situation.
+
+### 🏗️ Contributing
+
+Contributions are more than welcome! See
+[`CONTRIBUTING.md`](https://github.com/althonos/pyrodigal/blob/main/CONTRIBUTING.md)
+for more details.
+
+## 📋 Changelog
+
+This project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html)
+and provides a [changelog](https://github.com/althonos/pyrodigal/blob/main/CHANGELOG.md)
+in the [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) format.
+
+
+## ⚖️ License
+
+This library is provided under the [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/).
+The Prodigal code was written by [Doug Hyatt](https://github.com/hyattpd) and is distributed under the
+terms of the GPLv3 as well. See `vendor/Prodigal/LICENSE` for more information. The `cpu_features` library was written by [Guillaume Chatelet](https://github.com/gchatelet) and is
+licensed under the terms of the [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/). See `vendor/cpu_features/LICENSE` for more information.
+
+*This project is in no way affiliated with, sponsored by, or otherwise endorsed
+by the [original Prodigal authors](https://github.com/hyattpd). It was developed
+by [Martin Larralde](https://github.com/althonos/) during his PhD project
+at the [European Molecular Biology Laboratory](https://www.embl.de/) in
+the [Zeller team](https://github.com/zellerlab).*
+
+
+%prep
+%autosetup -n pyrodigal-2.1.0
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-pyrodigal -f filelist.lst
+%dir %{python3_sitearch}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed Apr 12 2023 Python_Bot <Python_Bot@openeuler.org> - 2.1.0-1
+- Package Spec generated
@@ -0,0 +1 @@
+5af868013a977d0ac37761bf1ef77a24 pyrodigal-2.1.0.tar.gz
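The `sources` entry at the end of the diff pairs a 32-character hexadecimal digest with the tarball name. Below is a minimal sketch of how a downloaded tarball could be checked against such an entry. It assumes the digest is MD5, which its length suggests but the commit does not state. The helper names `parse_sources_line`, `file_md5`, and `verify_source` are hypothetical and are not part of any openEuler tooling.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def parse_sources_line(line: str) -> Tuple[str, str]:
    """Split a 'DIGEST FILENAME' line into (digest, filename)."""
    digest, filename = line.split(maxsplit=1)
    return digest.lower(), filename.strip()


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large tarball is never fully in memory."""
    md5 = hashlib.md5()  # assumption: the `sources` digest is MD5
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_source(line: str, directory: Path) -> bool:
    """Return True when the file named in `line` matches its recorded digest."""
    digest, filename = parse_sources_line(line)
    return file_md5(directory / filename) == digest
```

With the tarball in the current directory, `verify_source("5af868013a977d0ac37761bf1ef77a24 pyrodigal-2.1.0.tar.gz", Path("."))` would report whether the download matches the recorded digest.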
