diff options
| -rw-r--r-- | .gitignore | 1 | ||||
| -rw-r--r-- | python-cctyper.spec | 1143 | ||||
| -rw-r--r-- | sources | 1 |
3 files changed, 1145 insertions, 0 deletions
@@ -0,0 +1 @@ +/cctyper-1.8.0.tar.gz diff --git a/python-cctyper.spec b/python-cctyper.spec new file mode 100644 index 0000000..cfc0ed8 --- /dev/null +++ b/python-cctyper.spec @@ -0,0 +1,1143 @@ +%global _empty_manifest_terminate_build 0 +Name: python-cctyper +Version: 1.8.0 +Release: 1 +Summary: CRISPRCasTyper: Automatic detection and subtyping of CRISPR-Cas operons +License: MIT License +URL: https://github.com/Russel88/CRISPRCasTyper +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/f5/be/0fcbfaab346c40da7987239ad432c3df0ee124a0494f48182ca44c3737dd/cctyper-1.8.0.tar.gz +BuildArch: noarch + + +%description +[](http://www.repostatus.org/#active) +[](https://anaconda.org/russel88/cctyper) + +# CRISPRCasTyper + +Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence. + +[CRISPRCasTyper and RepeatType are also available through a webserver](https://crisprcastyper.crispr.dk) + +This software finds Cas genes with a large suite of HMMs, then groups these HMMs into operons, and predicts the subtype of the operons based on a scoring scheme. +Furthermore, it finds CRISPR arrays with [minced](https://github.com/ctSkennerton/minced) and by BLASTing a large suite of known repeats, and using a kmer-based machine learning approach (extreme gradient boosting trees) it predicts the subtype of the CRISPR arrays based on the consensus repeat. +It then connects the Cas operons and CRISPR arrays, producing as output: +* CRISPR-Cas loci, with consensus subtype prediction based on both Cas genes (mostly) and CRISPR consensus repeats +* Orphan Cas operons, and their predicted subtype +* Orphan CRISPR arrays, and their predicted associated subtype + +#### It includes the following 46 subtypes/variants [(find typing scheme here)](https://typer.crispr.dk/#/typing): +* I-A, I-B, I-C, I-D, I-E, I-F, I-F (transposon), I-G, II-A, II-B, II-C, III-A, III-B, III-C, III-D, III-E, III-F, IV-A1, IV-A2, IV-A3, IV-B, IV-C, IV-D, IV-E, V-A, V-B1, V-B2, V-C, V-D, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-H, V-I, V-J, V-K, V-L, VI-A, VI-B1, VI-B2, VI-C, VI-D, VI-X, VI-Y. + +* All subtypes from the most recent Nature Reviews Microbiology (Makarova et al. 2020): [Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants](https://doi.org/10.1038/s41579-019-0299-x) +* Updated type IV subtypes and variants based on: [Type IV CRISPR–Cas systems are highly diverse and involved in competition between plasmids](https://doi.org/10.1093/nar/gkz1197) +* Type V-K: [RNA-guided DNA insertion with CRISPR-associated transposases](https://doi.org/10.1126/science.aax9181) +* Transposon associated type I-F: [Transposon-encoded CRISPR–Cas systems direct RNA-guided DNA integration](https://doi.org/10.1038/s41586-019-1323-z) +* New V-A variants: [Novel Type V-A CRISPR Effectors Are Active Nucleases with Expanded Targeting Capabilities](https://doi.org/10.1089/crispr.2020.0043) +* New Cas13s: [Programmable RNA editing with compact CRISPR–Cas13 systems from uncultivated microbes](https://doi.org/10.1038/s41592-021-01124-4) +* V-L (cas12l): [A new family of CRISPR-type V nucleases with C-rich PAM recognition](https://doi.org/10.15252/embr.202255481) + +#### It can automatically draw gene maps of CRISPR-Cas systems and orphan Cas operons and CRISPR arrays +##### in vector graphics format for direct use in scientific manuscripts +<img src='img/plot.svg' align="left" height="200" /> + +#### Citation +[Jakob Russel, Rafael Pinilla-Redondo, David Mayo-Muñoz, Shiraz A. Shah, Søren J. Sørensen - CRISPRCasTyper: Automated Identification, Annotation and Classification of CRISPR-Cas loci. The CRISPR Journal Dec 2020](https://doi.org/10.1089/crispr.2020.0059) + +Find a free to read version on [BioRxiv](https://doi.org/10.1101/2020.05.15.097824) + +# Table of contents +1. [Quick start](#quick) +2. [Installation](#install) +3. [CRISPRCasTyper - How to](#cctyperhow) + * [Plotting](#plot) +4. [RepeatType - How to](#repeattype) + * [Updated models](#repeatnew) +5. [RepeatType - Train](#repeattrain) +6. [Troubleshoot](#trouble) + +## Quick start <a name="quick"></a> + +```sh +conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper +conda activate cctyper +cctyper my.fasta my_output +``` + +## Installation <a name="install"></a> +CRISPRCasTyper can be installed either through conda or pip. + +It is advised to use conda, since this installs CRISPRCasTyper and all dependencies, and downloads the database in one go. + +### Conda +Use [miniconda](https://docs.conda.io/en/latest/miniconda.html) or [anaconda](https://www.anaconda.com/) to install. + +Create the environment with CRISPRCasTyper and all dependencies and database +```sh +conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper +``` + +### pip +If you have the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, minced, grep, sed) in your PATH you can install with pip + +Install cctyper python module +```sh +python -m pip install cctyper +``` + +Upgrade cctyper python module to the latest version +```sh +python -m pip install cctyper --upgrade +``` + + +#### When installing with pip, you need to download the database manually: +```sh +# Download and unpack +svn checkout https://github.com/Russel88/CRISPRCasTyper/trunk/data +tar -xvzf data/Profiles.tar.gz +mv Profiles/ data/ +rm data/Profiles.tar.gz + +# Tell CRISPRCasTyper where the data is: +# either by setting an environment variable (has to be done for each terminal session, or added to .bashrc): +export CCTYPER_DB="/path/to/data/" +# or by using the --db argument each time you run CRISPRCasTyper: +cctyper input.fa output --db /path/to/data/ +``` + +## CRISPRCasTyper - How to <a name="cctyperhow"></a> +CRISPRCasTyper takes as input a nucleotide fasta, and produces outputs with CRISPR-Cas predictions + +#### Activate environment +```sh +conda activate cctyper +``` + +#### Run with a nucleotide fasta as input +```sh +cctyper genome.fa my_output +``` + +#### If you have a complete circular genome (each entry in the fasta will be treated as having circular topology) +```sh +cctyper genome.fa my_output --circular +``` + +#### For metagenome assemblies and short contigs/plasmids/phages, change the prodigal mode +The default prodigal mode expects the input to be a single draft or complete genome +```sh +cctyper assembly.fa my_output --prodigal meta +``` + +#### Check the different options +```sh +cctyper -h +``` + +#### Output <a name="cctyperout"></a> +* **CRISPR_Cas.tab:** CRISPR_Cas loci, with consensus subtype prediction + * Contig: Sequence accession + * Operon: Operon ID (Sequence accession @ NUMBER) + * Operon_Pos: [Start, End] of operon + * Prediction: Consenus prediction based on both Cas operon and CRISPR arrays + * CRISPRs: CRISPRs adjacent to Cas operon + * Distances: Distances to CRISPRs from Cas operon + * Prediction_Cas: Subtype prediction based on Cas operon + * Prediction_CRISPRs: Subtype prediction of CRISPRs based on CRISPR repeat sequences +* **cas_operons.tab:** All certain Cas operons + * Contig: Sequence accession + * Operon: Operon ID (Sequence accession @ NUMBER) + * Start: Start of operon + * End: End of operon + * Prediction: Subtype prediction + * Complete_Interference: Percent completion of the interference module(s). Can be a list if best_type is a list (Hybrid and Ambiguous) + * Complete_Adaptation: Percent completion of the adaptation module(s). Can be a list if best_type is a list (Hybrid and Ambiguous) + * Best_type: Subtype with the highest score. If the score is high then Prediction = Best_type + * Best_score: Score of the highest scoring subtype + * Genes: List of Cas genes + * Positions: List of Gene IDs for the genes + * E-values: List of E-values for the genes + * CoverageSeq: List of sequence coverages for the genes + * CoverageHMM: List of HMM coverages for the genes + * Strand_Interference: Strand of interference module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no interference gene found + * Strand_Adaptation: Strand of adaptation module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no adaptation gene found +* **crisprs_all.tab:** All CRISPR arrays, also false positives + * Contig: Sequence accession + * CRISPR: CRISPR ID (minced: Sequence accession _ NUMBER; repeatBLAST: Sequence accession - NUMBER _ NUMBER) + * Start: Start of CRISPR + * End: End of CRISPR + * Consensus_repeat: Consensus repeat sequence + * N_repeats: Number of repeats + * Repeat_len: Length of repeat sequences + * Spacer_len_avg: Average spacer length + * Repeat_identity: Average identity of repeat sequences + * Spacer_identity: Average identity of spacer sequences + * Spacer_len_sem: Standard error of the mean of spacer lenghts + * Trusted: TRUE/FALSE, is the array trusted. Based on repeat/spacer identity, spacer sem, prediction probability and adjacency to a cas operon + * Prediction: Prediction of the associated subtype based on the repeat sequence + * Subtype: Subtype with highest prediction probability. Prediction = Subtype if Subtype_probability is high + * Subtype_probability: Probability of subtype prediction +* **crisprs_near_cas.tab:** CRISPRs part of CRISPR-Cas loci + * Same columns as crisprs_all.tab +* **crisprs_orphan.tab:** Orphan CRISPRs (those not in CRISPR_Cas.tab) + * Same columns as crisprs_all.tab +* **crisprs_putative.tab:** Low quality CRISPRs. Most likely false positives + * Same columns as crisprs_all.tab +* **cas_operons_orphan.tab:** Orphan Cas operons (those not in CRISPR_Cas.tab) + * Same columns as cas_operons.tab +* **CRISPR_Cas_putative.tab:** Putative CRISPR_Cas loci, often lonely Cas genes next to a CRISPR array + * Same columns as CRISPR_Cas.tab +* **cas_operons_putative.tab:** Putative Cas operons, mostly false positives, but also some ambiguous and partial systems + * Same columns as cas_operons.tab +* **spacers/*.fa:** Fasta files with all spacer sequences +* **hmmer.tab:** All HMM vs. ORF matches, unfiltered results + * Hmm: HMM name + * ORF: ORF name (Sequence accession _ Gene ID) + * tlen: ORF length + * qlen: HMM length + * Eval: E-value of alignment + * score: Alignment score + * start: ORF start + * end: ORF end + * Acc: Sequence accession + * Pos: Gene ID + * Cov_seq: Sequence coverage + * Cov_hmm: HMM coverage + * strand: Coding strand is like input (1) or reverse complement (-1) +* **genes.tab** All genes and their positions + * Contig: Sequence accession + * Start: Start of ORF + * End: End of ORF + * Strand: Coding strand is like input (1) or reverse complement (-1) + * Pos: Gene ID +* **arguments.tab:** File with arguments given to CRISPRCasTyper +* **hmmer.log** Error messages from HMMER (only produced if any errors were encountered) + +##### If run with `--keep_tmp` the following is also produced +* **prodigal.log** Log from prodigal +* **proteins.faa** Protein sequences +* **hmmer/*.tab** Alignment output from HMMER for each Cas HMM +* **minced.out:** CRISPR array output from minced +* **blast.tab:** BLAST output from repeat alignment against flanking regions of cas operons +* **Flank....:** Fasta of flanking regions near cas operons and BLAST database of this + +#### Notes on output +Files are only created if there is any data. For example, the CRISPR_Cas.tab file is only created if there are any CRISPR-Cas loci. + +### Plotting <a name="plot"></a> +CRISPRCasTyper will automatically plot a map of the CRISPR-Cas loci, orphan Cas operons, and orphan CRISPR arrays. + +These maps can be expanded (`--expand N`) by adding unknown genes and genes with alignment scores below the thresholds. This can help in identify potentially un-annotated genes in operons. You can generate new plots without having to re-run the entire pipeline by adding `--redo_typing` to the command. This will re-use the mappings and re-type the operons and re-make the plot, based on new thresholds and plot parameters. + +The plot below is run with `--expand 5000` + +* Arrays are in alternating black/white displaying the actual number of repeats/spacers, and with their predicted subtype association based on the consensus repeat sequence. +* The interference module is in yellow. +* The adaptation module is in blue. +* Cas6 is in red. +* Accessory genes are in purple +* Genes with alignment scores below the thresholds are lighter and with parentheses around names. +* Unknown genes are in gray (the number matches the genes.tab file) + +<img src='img/plot2.svg' align="left" height="350" /> + +## RepeatTyper - How to <a name="repeattype"></a> +With an input of CRISPR repeats (one per line, in a simple textfile) RepeatTyper will predict the subtype, based on the kmer composition of the repeat + +#### Activate environment +```sh +conda activate cctyper +``` + +#### Run with a simple textfile, containing only CRISPR repeats (in capital letters), one repeat per line. +```sh +repeatType repeats.txt +``` + +#### Output <a name="repeattypeout"></a> +The script prints: +* Repeat sequence +* Predicted subtype +* Probability of prediction + +#### Notes on output +* Predictions with probabilities below 0.75 are uncertain, and should be taken with a grain of salt. +* Prior to version 1.4.0 the curated repeatTyper model was included in CCTyper +* From version 1.4.0 and onwards updated repeatTyper models are included in CCTyper (see more information in the section below) +* The followinig subtypes are included in the updated model as per December 2022: + * I-A, I-B, I-C, I-D, I-E, I-F, I-F (Transposon), I-G + * II-A, II-B, II-C + * III-A, III-B, III-C, III-D, III-E, III-F + * IV-A1, IV-A2, IV-A3, IV-D, IV-E + * V-A, V-B1, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-I, V-J, V-K + * VI-A, VI-B1, VI-B2, VI-C, VI-D +* This is the accuracy per subtype (on an unseen test dataset): +* I-A 0.76 +* I-B 0.81 +* I-C 0.97 +* I-D 0.86 +* I-E 0.95 +* I-F 0.96 +* I-F_T 0.99 +* I-G 0.89 +* II-A 0.92 +* II-B 0.97 +* II-C 0.90 +* III-A 0.82 +* III-B 0.68 +* III-C 0.60 +* III-D 0.59 +* III-E 1.00 +* III-F 0.25 +* IV-A1 0.85 +* IV-A2 0.68 +* IV-A3 0.96 +* IV-D 0.85 +* IV-E 0.92 +* V-A 1.00 +* V-B1 0.90 +* V-E 0.30 +* V-F 0.87 +* V-F1 0.87 +* V-F2 0.90 +* V-F3 0.90 +* V-G 0.67 +* V-I 0.80 +* V-J 0.63 +* V-K 0.99 +* VI-A 0.96 +* VI-B1 0.96 +* VI-B2 1.00 +* VI-C 0.67 +* VI-D 0.97 + +### Updated RepeatTyper models <a name="repeatnew"></a> +The [CCTyper webserver](https://typer.crispr.dk) is crowdsourcing subtyped repeats and includes an updated RepeatTyper model based on a much larger set of repeats and contains additional subtypes compared to the curated RepeatTyper model. +This updated model is automatically retrained each month and the models can be downloaded [here](http://mibi.galaxy.bio.ku.dk/russel/repeattyper/). + +From version 1.4.0 and onwards of CCTyper the newest repeatTyper model is included upon release of the version. + +Each model contains a training report (xgb_report), where you can find the training log, and in the bottom the accuracy, both overall and per subtype. + +#### Use new model in CRISPRCasTyper +Save the original database files: +```sh +mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab +mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model +``` + +Move the new model into the database folder +```sh +mv repeat_model/* ${CCTYPER_DB}/ +``` + +##### CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction! + +## RepeatTyper - Train <a name="repeattrain"></a> +You can train the repeat classifier with your own set of subtyped repeats. With a tab-delimeted input where 1. column contains the subtypes and 2. column contains the CRISPR repeat sequences, RepeatTrain will train a CRISPR repeat classifier that is directly usable for both RepeatTyper and CRISPRCasTyper. + +#### Train +```sh +repeatTrain typed_repeats.tab my_classifier +``` + +#### Use new model in RepeatTyper +```sh +repeatType repeats.txt --db my_classifier +``` + +#### Use new model in CRISPRCasTyper +Save the original database files: +```sh +mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab +mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model +``` + +Move the new model into the database folder +```sh +mv my_classifier/* ${CCTYPER_DB}/ +``` + +##### CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction! + +## Troubleshoot <a name="trouble"></a> + +### Running out of memory +Large metagenomic assemblies with many small contigs can exhaust the RAM on your laptop. Fortunately, as metagenomic contigs are analysed separately (when run with `--prodigal meta`) a simple solution is to split the input into smaller chunks (e.g. with [pyfasta](https://pypi.org/project/pyfasta/#command-line-interface)) + + + + +%package -n python3-cctyper +Summary: CRISPRCasTyper: Automatic detection and subtyping of CRISPR-Cas operons +Provides: python-cctyper +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-cctyper +[](http://www.repostatus.org/#active) +[](https://anaconda.org/russel88/cctyper) + +# CRISPRCasTyper + +Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence. + +[CRISPRCasTyper and RepeatType are also available through a webserver](https://crisprcastyper.crispr.dk) + +This software finds Cas genes with a large suite of HMMs, then groups these HMMs into operons, and predicts the subtype of the operons based on a scoring scheme. +Furthermore, it finds CRISPR arrays with [minced](https://github.com/ctSkennerton/minced) and by BLASTing a large suite of known repeats, and using a kmer-based machine learning approach (extreme gradient boosting trees) it predicts the subtype of the CRISPR arrays based on the consensus repeat. +It then connects the Cas operons and CRISPR arrays, producing as output: +* CRISPR-Cas loci, with consensus subtype prediction based on both Cas genes (mostly) and CRISPR consensus repeats +* Orphan Cas operons, and their predicted subtype +* Orphan CRISPR arrays, and their predicted associated subtype + +#### It includes the following 46 subtypes/variants [(find typing scheme here)](https://typer.crispr.dk/#/typing): +* I-A, I-B, I-C, I-D, I-E, I-F, I-F (transposon), I-G, II-A, II-B, II-C, III-A, III-B, III-C, III-D, III-E, III-F, IV-A1, IV-A2, IV-A3, IV-B, IV-C, IV-D, IV-E, V-A, V-B1, V-B2, V-C, V-D, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-H, V-I, V-J, V-K, V-L, VI-A, VI-B1, VI-B2, VI-C, VI-D, VI-X, VI-Y. + +* All subtypes from the most recent Nature Reviews Microbiology (Makarova et al. 2020): [Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants](https://doi.org/10.1038/s41579-019-0299-x) +* Updated type IV subtypes and variants based on: [Type IV CRISPR–Cas systems are highly diverse and involved in competition between plasmids](https://doi.org/10.1093/nar/gkz1197) +* Type V-K: [RNA-guided DNA insertion with CRISPR-associated transposases](https://doi.org/10.1126/science.aax9181) +* Transposon associated type I-F: [Transposon-encoded CRISPR–Cas systems direct RNA-guided DNA integration](https://doi.org/10.1038/s41586-019-1323-z) +* New V-A variants: [Novel Type V-A CRISPR Effectors Are Active Nucleases with Expanded Targeting Capabilities](https://doi.org/10.1089/crispr.2020.0043) +* New Cas13s: [Programmable RNA editing with compact CRISPR–Cas13 systems from uncultivated microbes](https://doi.org/10.1038/s41592-021-01124-4) +* V-L (cas12l): [A new family of CRISPR-type V nucleases with C-rich PAM recognition](https://doi.org/10.15252/embr.202255481) + +#### It can automatically draw gene maps of CRISPR-Cas systems and orphan Cas operons and CRISPR arrays +##### in vector graphics format for direct use in scientific manuscripts +<img src='img/plot.svg' align="left" height="200" /> + +#### Citation +[Jakob Russel, Rafael Pinilla-Redondo, David Mayo-Muñoz, Shiraz A. Shah, Søren J. Sørensen - CRISPRCasTyper: Automated Identification, Annotation and Classification of CRISPR-Cas loci. The CRISPR Journal Dec 2020](https://doi.org/10.1089/crispr.2020.0059) + +Find a free to read version on [BioRxiv](https://doi.org/10.1101/2020.05.15.097824) + +# Table of contents +1. [Quick start](#quick) +2. [Installation](#install) +3. [CRISPRCasTyper - How to](#cctyperhow) + * [Plotting](#plot) +4. [RepeatType - How to](#repeattype) + * [Updated models](#repeatnew) +5. [RepeatType - Train](#repeattrain) +6. [Troubleshoot](#trouble) + +## Quick start <a name="quick"></a> + +```sh +conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper +conda activate cctyper +cctyper my.fasta my_output +``` + +## Installation <a name="install"></a> +CRISPRCasTyper can be installed either through conda or pip. + +It is advised to use conda, since this installs CRISPRCasTyper and all dependencies, and downloads the database in one go. + +### Conda +Use [miniconda](https://docs.conda.io/en/latest/miniconda.html) or [anaconda](https://www.anaconda.com/) to install. + +Create the environment with CRISPRCasTyper and all dependencies and database +```sh +conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper +``` + +### pip +If you have the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, minced, grep, sed) in your PATH you can install with pip + +Install cctyper python module +```sh +python -m pip install cctyper +``` + +Upgrade cctyper python module to the latest version +```sh +python -m pip install cctyper --upgrade +``` + + +#### When installing with pip, you need to download the database manually: +```sh +# Download and unpack +svn checkout https://github.com/Russel88/CRISPRCasTyper/trunk/data +tar -xvzf data/Profiles.tar.gz +mv Profiles/ data/ +rm data/Profiles.tar.gz + +# Tell CRISPRCasTyper where the data is: +# either by setting an environment variable (has to be done for each terminal session, or added to .bashrc): +export CCTYPER_DB="/path/to/data/" +# or by using the --db argument each time you run CRISPRCasTyper: +cctyper input.fa output --db /path/to/data/ +``` + +## CRISPRCasTyper - How to <a name="cctyperhow"></a> +CRISPRCasTyper takes as input a nucleotide fasta, and produces outputs with CRISPR-Cas predictions + +#### Activate environment +```sh +conda activate cctyper +``` + +#### Run with a nucleotide fasta as input +```sh +cctyper genome.fa my_output +``` + +#### If you have a complete circular genome (each entry in the fasta will be treated as having circular topology) +```sh +cctyper genome.fa my_output --circular +``` + +#### For metagenome assemblies and short contigs/plasmids/phages, change the prodigal mode +The default prodigal mode expects the input to be a single draft or complete genome +```sh +cctyper assembly.fa my_output --prodigal meta +``` + +#### Check the different options +```sh +cctyper -h +``` + +#### Output <a name="cctyperout"></a> +* **CRISPR_Cas.tab:** CRISPR_Cas loci, with consensus subtype prediction + * Contig: Sequence accession + * Operon: Operon ID (Sequence accession @ NUMBER) + * Operon_Pos: [Start, End] of operon + * Prediction: Consenus prediction based on both Cas operon and CRISPR arrays + * CRISPRs: CRISPRs adjacent to Cas operon + * Distances: Distances to CRISPRs from Cas operon + * Prediction_Cas: Subtype prediction based on Cas operon + * Prediction_CRISPRs: Subtype prediction of CRISPRs based on CRISPR repeat sequences +* **cas_operons.tab:** All certain Cas operons + * Contig: Sequence accession + * Operon: Operon ID (Sequence accession @ NUMBER) + * Start: Start of operon + * End: End of operon + * Prediction: Subtype prediction + * Complete_Interference: Percent completion of the interference module(s). Can be a list if best_type is a list (Hybrid and Ambiguous) + * Complete_Adaptation: Percent completion of the adaptation module(s). Can be a list if best_type is a list (Hybrid and Ambiguous) + * Best_type: Subtype with the highest score. If the score is high then Prediction = Best_type + * Best_score: Score of the highest scoring subtype + * Genes: List of Cas genes + * Positions: List of Gene IDs for the genes + * E-values: List of E-values for the genes + * CoverageSeq: List of sequence coverages for the genes + * CoverageHMM: List of HMM coverages for the genes + * Strand_Interference: Strand of interference module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no interference gene found + * Strand_Adaptation: Strand of adaptation module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no adaptation gene found +* **crisprs_all.tab:** All CRISPR arrays, also false positives + * Contig: Sequence accession + * CRISPR: CRISPR ID (minced: Sequence accession _ NUMBER; repeatBLAST: Sequence accession - NUMBER _ NUMBER) + * Start: Start of CRISPR + * End: End of CRISPR + * Consensus_repeat: Consensus repeat sequence + * N_repeats: Number of repeats + * Repeat_len: Length of repeat sequences + * Spacer_len_avg: Average spacer length + * Repeat_identity: Average identity of repeat sequences + * Spacer_identity: Average identity of spacer sequences + * Spacer_len_sem: Standard error of the mean of spacer lenghts + * Trusted: TRUE/FALSE, is the array trusted. Based on repeat/spacer identity, spacer sem, prediction probability and adjacency to a cas operon + * Prediction: Prediction of the associated subtype based on the repeat sequence + * Subtype: Subtype with highest prediction probability. Prediction = Subtype if Subtype_probability is high + * Subtype_probability: Probability of subtype prediction +* **crisprs_near_cas.tab:** CRISPRs part of CRISPR-Cas loci + * Same columns as crisprs_all.tab +* **crisprs_orphan.tab:** Orphan CRISPRs (those not in CRISPR_Cas.tab) + * Same columns as crisprs_all.tab +* **crisprs_putative.tab:** Low quality CRISPRs. Most likely false positives + * Same columns as crisprs_all.tab +* **cas_operons_orphan.tab:** Orphan Cas operons (those not in CRISPR_Cas.tab) + * Same columns as cas_operons.tab +* **CRISPR_Cas_putative.tab:** Putative CRISPR_Cas loci, often lonely Cas genes next to a CRISPR array + * Same columns as CRISPR_Cas.tab +* **cas_operons_putative.tab:** Putative Cas operons, mostly false positives, but also some ambiguous and partial systems + * Same columns as cas_operons.tab +* **spacers/*.fa:** Fasta files with all spacer sequences +* **hmmer.tab:** All HMM vs. ORF matches, unfiltered results + * Hmm: HMM name + * ORF: ORF name (Sequence accession _ Gene ID) + * tlen: ORF length + * qlen: HMM length + * Eval: E-value of alignment + * score: Alignment score + * start: ORF start + * end: ORF end + * Acc: Sequence accession + * Pos: Gene ID + * Cov_seq: Sequence coverage + * Cov_hmm: HMM coverage + * strand: Coding strand is like input (1) or reverse complement (-1) +* **genes.tab** All genes and their positions + * Contig: Sequence accession + * Start: Start of ORF + * End: End of ORF + * Strand: Coding strand is like input (1) or reverse complement (-1) + * Pos: Gene ID +* **arguments.tab:** File with arguments given to CRISPRCasTyper +* **hmmer.log** Error messages from HMMER (only produced if any errors were encountered) + +##### If run with `--keep_tmp` the following is also produced +* **prodigal.log** Log from prodigal +* **proteins.faa** Protein sequences +* **hmmer/*.tab** Alignment output from HMMER for each Cas HMM +* **minced.out:** CRISPR array output from minced +* **blast.tab:** BLAST output from repeat alignment against flanking regions of cas operons +* **Flank....:** Fasta of flanking regions near cas operons and BLAST database of this + +#### Notes on output +Files are only created if there is any data. For example, the CRISPR_Cas.tab file is only created if there are any CRISPR-Cas loci. + +### Plotting <a name="plot"></a> +CRISPRCasTyper will automatically plot a map of the CRISPR-Cas loci, orphan Cas operons, and orphan CRISPR arrays. + +These maps can be expanded (`--expand N`) by adding unknown genes and genes with alignment scores below the thresholds. This can help in identify potentially un-annotated genes in operons. You can generate new plots without having to re-run the entire pipeline by adding `--redo_typing` to the command. This will re-use the mappings and re-type the operons and re-make the plot, based on new thresholds and plot parameters. + +The plot below is run with `--expand 5000` + +* Arrays are in alternating black/white displaying the actual number of repeats/spacers, and with their predicted subtype association based on the consensus repeat sequence. +* The interference module is in yellow. +* The adaptation module is in blue. +* Cas6 is in red. +* Accessory genes are in purple +* Genes with alignment scores below the thresholds are lighter and with parentheses around names. +* Unknown genes are in gray (the number matches the genes.tab file) + +<img src='img/plot2.svg' align="left" height="350" /> + +## RepeatTyper - How to <a name="repeattype"></a> +With an input of CRISPR repeats (one per line, in a simple textfile) RepeatTyper will predict the subtype, based on the kmer composition of the repeat + +#### Activate environment +```sh +conda activate cctyper +``` + +#### Run with a simple textfile, containing only CRISPR repeats (in capital letters), one repeat per line. +```sh +repeatType repeats.txt +``` + +#### Output <a name="repeattypeout"></a> +The script prints: +* Repeat sequence +* Predicted subtype +* Probability of prediction + +#### Notes on output +* Predictions with probabilities below 0.75 are uncertain, and should be taken with a grain of salt. +* Prior to version 1.4.0 the curated repeatTyper model was included in CCTyper +* From version 1.4.0 and onwards updated repeatTyper models are included in CCTyper (see more information in the section below) +* The followinig subtypes are included in the updated model as per December 2022: + * I-A, I-B, I-C, I-D, I-E, I-F, I-F (Transposon), I-G + * II-A, II-B, II-C + * III-A, III-B, III-C, III-D, III-E, III-F + * IV-A1, IV-A2, IV-A3, IV-D, IV-E + * V-A, V-B1, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-I, V-J, V-K + * VI-A, VI-B1, VI-B2, VI-C, VI-D +* This is the accuracy per subtype (on an unseen test dataset): +* I-A 0.76 +* I-B 0.81 +* I-C 0.97 +* I-D 0.86 +* I-E 0.95 +* I-F 0.96 +* I-F_T 0.99 +* I-G 0.89 +* II-A 0.92 +* II-B 0.97 +* II-C 0.90 +* III-A 0.82 +* III-B 0.68 +* III-C 0.60 +* III-D 0.59 +* III-E 1.00 +* III-F 0.25 +* IV-A1 0.85 +* IV-A2 0.68 +* IV-A3 0.96 +* IV-D 0.85 +* IV-E 0.92 +* V-A 1.00 +* V-B1 0.90 +* V-E 0.30 +* V-F 0.87 +* V-F1 0.87 +* V-F2 0.90 +* V-F3 0.90 +* V-G 0.67 +* V-I 0.80 +* V-J 0.63 +* V-K 0.99 +* VI-A 0.96 +* VI-B1 0.96 +* VI-B2 1.00 +* VI-C 0.67 +* VI-D 0.97 + +### Updated RepeatTyper models <a name="repeatnew"></a> +The [CCTyper webserver](https://typer.crispr.dk) is crowdsourcing subtyped repeats and includes an updated RepeatTyper model based on a much larger set of repeats and contains additional subtypes compared to the curated RepeatTyper model. +This updated model is automatically retrained each month and the models can be downloaded [here](http://mibi.galaxy.bio.ku.dk/russel/repeattyper/). + +From version 1.4.0 and onwards of CCTyper the newest repeatTyper model is included upon release of the version. + +Each model contains a training report (xgb_report), where you can find the training log, and in the bottom the accuracy, both overall and per subtype. + +#### Use new model in CRISPRCasTyper +Save the original database files: +```sh +mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab +mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model +``` + +Move the new model into the database folder +```sh +mv repeat_model/* ${CCTYPER_DB}/ +``` + +##### CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction! + +## RepeatTyper - Train <a name="repeattrain"></a> +You can train the repeat classifier with your own set of subtyped repeats. With a tab-delimeted input where 1. column contains the subtypes and 2. column contains the CRISPR repeat sequences, RepeatTrain will train a CRISPR repeat classifier that is directly usable for both RepeatTyper and CRISPRCasTyper. + +#### Train +```sh +repeatTrain typed_repeats.tab my_classifier +``` + +#### Use new model in RepeatTyper +```sh +repeatType repeats.txt --db my_classifier +``` + +#### Use new model in CRISPRCasTyper +Save the original database files: +```sh +mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab +mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model +``` + +Move the new model into the database folder +```sh +mv my_classifier/* ${CCTYPER_DB}/ +``` + +##### CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction! + +## Troubleshoot <a name="trouble"></a> + +### Running out of memory +Large metagenomic assemblies with many small contigs can exhaust the RAM on your laptop. Fortunately, as metagenomic contigs are analysed separately (when run with `--prodigal meta`) a simple solution is to split the input into smaller chunks (e.g. with [pyfasta](https://pypi.org/project/pyfasta/#command-line-interface)) + + + + +%package help +Summary: Development documents and examples for cctyper +Provides: python3-cctyper-doc +%description help +[](http://www.repostatus.org/#active) +[](https://anaconda.org/russel88/cctyper) + +# CRISPRCasTyper + +Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence. + +[CRISPRCasTyper and RepeatType are also available through a webserver](https://crisprcastyper.crispr.dk) + +This software finds Cas genes with a large suite of HMMs, then groups these HMMs into operons, and predicts the subtype of the operons based on a scoring scheme. +Furthermore, it finds CRISPR arrays with [minced](https://github.com/ctSkennerton/minced) and by BLASTing a large suite of known repeats, and using a kmer-based machine learning approach (extreme gradient boosting trees) it predicts the subtype of the CRISPR arrays based on the consensus repeat. +It then connects the Cas operons and CRISPR arrays, producing as output: +* CRISPR-Cas loci, with consensus subtype prediction based on both Cas genes (mostly) and CRISPR consensus repeats +* Orphan Cas operons, and their predicted subtype +* Orphan CRISPR arrays, and their predicted associated subtype + +#### It includes the following 46 subtypes/variants [(find typing scheme here)](https://typer.crispr.dk/#/typing): +* I-A, I-B, I-C, I-D, I-E, I-F, I-F (transposon), I-G, II-A, II-B, II-C, III-A, III-B, III-C, III-D, III-E, III-F, IV-A1, IV-A2, IV-A3, IV-B, IV-C, IV-D, IV-E, V-A, V-B1, V-B2, V-C, V-D, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-H, V-I, V-J, V-K, V-L, VI-A, VI-B1, VI-B2, VI-C, VI-D, VI-X, VI-Y. + +* All subtypes from the most recent Nature Reviews Microbiology (Makarova et al. 2020): [Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants](https://doi.org/10.1038/s41579-019-0299-x) +* Updated type IV subtypes and variants based on: [Type IV CRISPR–Cas systems are highly diverse and involved in competition between plasmids](https://doi.org/10.1093/nar/gkz1197) +* Type V-K: [RNA-guided DNA insertion with CRISPR-associated transposases](https://doi.org/10.1126/science.aax9181) +* Transposon associated type I-F: [Transposon-encoded CRISPR–Cas systems direct RNA-guided DNA integration](https://doi.org/10.1038/s41586-019-1323-z) +* New V-A variants: [Novel Type V-A CRISPR Effectors Are Active Nucleases with Expanded Targeting Capabilities](https://doi.org/10.1089/crispr.2020.0043) +* New Cas13s: [Programmable RNA editing with compact CRISPR–Cas13 systems from uncultivated microbes](https://doi.org/10.1038/s41592-021-01124-4) +* V-L (cas12l): [A new family of CRISPR-type V nucleases with C-rich PAM recognition](https://doi.org/10.15252/embr.202255481) + +#### It can automatically draw gene maps of CRISPR-Cas systems and orphan Cas operons and CRISPR arrays +##### in vector graphics format for direct use in scientific manuscripts +<img src='img/plot.svg' align="left" height="200" /> + +#### Citation +[Jakob Russel, Rafael Pinilla-Redondo, David Mayo-Muñoz, Shiraz A. Shah, Søren J. Sørensen - CRISPRCasTyper: Automated Identification, Annotation and Classification of CRISPR-Cas loci. The CRISPR Journal Dec 2020](https://doi.org/10.1089/crispr.2020.0059) + +Find a free to read version on [BioRxiv](https://doi.org/10.1101/2020.05.15.097824) + +# Table of contents +1. [Quick start](#quick) +2. [Installation](#install) +3. [CRISPRCasTyper - How to](#cctyperhow) + * [Plotting](#plot) +4. [RepeatType - How to](#repeattype) + * [Updated models](#repeatnew) +5. [RepeatType - Train](#repeattrain) +6. [Troubleshoot](#trouble) + +## Quick start <a name="quick"></a> + +```sh +conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper +conda activate cctyper +cctyper my.fasta my_output +``` + +## Installation <a name="install"></a> +CRISPRCasTyper can be installed either through conda or pip. + +It is advised to use conda, since this installs CRISPRCasTyper and all dependencies, and downloads the database in one go. + +### Conda +Use [miniconda](https://docs.conda.io/en/latest/miniconda.html) or [anaconda](https://www.anaconda.com/) to install. + +Create the environment with CRISPRCasTyper and all dependencies and database +```sh +conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper +``` + +### pip +If you have the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, minced, grep, sed) in your PATH you can install with pip + +Install cctyper python module +```sh +python -m pip install cctyper +``` + +Upgrade cctyper python module to the latest version +```sh +python -m pip install cctyper --upgrade +``` + + +#### When installing with pip, you need to download the database manually: +```sh +# Download and unpack +svn checkout https://github.com/Russel88/CRISPRCasTyper/trunk/data +tar -xvzf data/Profiles.tar.gz +mv Profiles/ data/ +rm data/Profiles.tar.gz + +# Tell CRISPRCasTyper where the data is: +# either by setting an environment variable (has to be done for each terminal session, or added to .bashrc): +export CCTYPER_DB="/path/to/data/" +# or by using the --db argument each time you run CRISPRCasTyper: +cctyper input.fa output --db /path/to/data/ +``` + +## CRISPRCasTyper - How to <a name="cctyperhow"></a> +CRISPRCasTyper takes as input a nucleotide fasta, and produces outputs with CRISPR-Cas predictions + +#### Activate environment +```sh +conda activate cctyper +``` + +#### Run with a nucleotide fasta as input +```sh +cctyper genome.fa my_output +``` + +#### If you have a complete circular genome (each entry in the fasta will be treated as having circular topology) +```sh +cctyper genome.fa my_output --circular +``` + +#### For metagenome assemblies and short contigs/plasmids/phages, change the prodigal mode +The default prodigal mode expects the input to be a single draft or complete genome +```sh +cctyper assembly.fa my_output --prodigal meta +``` + +#### Check the different options +```sh +cctyper -h +``` + +#### Output <a name="cctyperout"></a> +* **CRISPR_Cas.tab:** CRISPR_Cas loci, with consensus subtype prediction + * Contig: Sequence accession + * Operon: Operon ID (Sequence accession @ NUMBER) + * Operon_Pos: [Start, End] of operon + * Prediction: Consenus prediction based on both Cas operon and CRISPR arrays + * CRISPRs: CRISPRs adjacent to Cas operon + * Distances: Distances to CRISPRs from Cas operon + * Prediction_Cas: Subtype prediction based on Cas operon + * Prediction_CRISPRs: Subtype prediction of CRISPRs based on CRISPR repeat sequences +* **cas_operons.tab:** All certain Cas operons + * Contig: Sequence accession + * Operon: Operon ID (Sequence accession @ NUMBER) + * Start: Start of operon + * End: End of operon + * Prediction: Subtype prediction + * Complete_Interference: Percent completion of the interference module(s). Can be a list if best_type is a list (Hybrid and Ambiguous) + * Complete_Adaptation: Percent completion of the adaptation module(s). Can be a list if best_type is a list (Hybrid and Ambiguous) + * Best_type: Subtype with the highest score. If the score is high then Prediction = Best_type + * Best_score: Score of the highest scoring subtype + * Genes: List of Cas genes + * Positions: List of Gene IDs for the genes + * E-values: List of E-values for the genes + * CoverageSeq: List of sequence coverages for the genes + * CoverageHMM: List of HMM coverages for the genes + * Strand_Interference: Strand of interference module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no interference gene found + * Strand_Adaptation: Strand of adaptation module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no adaptation gene found +* **crisprs_all.tab:** All CRISPR arrays, also false positives + * Contig: Sequence accession + * CRISPR: CRISPR ID (minced: Sequence accession _ NUMBER; repeatBLAST: Sequence accession - NUMBER _ NUMBER) + * Start: Start of CRISPR + * End: End of CRISPR + * Consensus_repeat: Consensus repeat sequence + * N_repeats: Number of repeats + * Repeat_len: Length of repeat sequences + * Spacer_len_avg: Average spacer length + * Repeat_identity: Average identity of repeat sequences + * Spacer_identity: Average identity of spacer sequences + * Spacer_len_sem: Standard error of the mean of spacer lenghts + * Trusted: TRUE/FALSE, is the array trusted. Based on repeat/spacer identity, spacer sem, prediction probability and adjacency to a cas operon + * Prediction: Prediction of the associated subtype based on the repeat sequence + * Subtype: Subtype with highest prediction probability. Prediction = Subtype if Subtype_probability is high + * Subtype_probability: Probability of subtype prediction +* **crisprs_near_cas.tab:** CRISPRs part of CRISPR-Cas loci + * Same columns as crisprs_all.tab +* **crisprs_orphan.tab:** Orphan CRISPRs (those not in CRISPR_Cas.tab) + * Same columns as crisprs_all.tab +* **crisprs_putative.tab:** Low quality CRISPRs. Most likely false positives + * Same columns as crisprs_all.tab +* **cas_operons_orphan.tab:** Orphan Cas operons (those not in CRISPR_Cas.tab) + * Same columns as cas_operons.tab +* **CRISPR_Cas_putative.tab:** Putative CRISPR_Cas loci, often lonely Cas genes next to a CRISPR array + * Same columns as CRISPR_Cas.tab +* **cas_operons_putative.tab:** Putative Cas operons, mostly false positives, but also some ambiguous and partial systems + * Same columns as cas_operons.tab +* **spacers/*.fa:** Fasta files with all spacer sequences +* **hmmer.tab:** All HMM vs. ORF matches, unfiltered results + * Hmm: HMM name + * ORF: ORF name (Sequence accession _ Gene ID) + * tlen: ORF length + * qlen: HMM length + * Eval: E-value of alignment + * score: Alignment score + * start: ORF start + * end: ORF end + * Acc: Sequence accession + * Pos: Gene ID + * Cov_seq: Sequence coverage + * Cov_hmm: HMM coverage + * strand: Coding strand is like input (1) or reverse complement (-1) +* **genes.tab** All genes and their positions + * Contig: Sequence accession + * Start: Start of ORF + * End: End of ORF + * Strand: Coding strand is like input (1) or reverse complement (-1) + * Pos: Gene ID +* **arguments.tab:** File with arguments given to CRISPRCasTyper +* **hmmer.log** Error messages from HMMER (only produced if any errors were encountered) + +##### If run with `--keep_tmp` the following is also produced +* **prodigal.log** Log from prodigal +* **proteins.faa** Protein sequences +* **hmmer/*.tab** Alignment output from HMMER for each Cas HMM +* **minced.out:** CRISPR array output from minced +* **blast.tab:** BLAST output from repeat alignment against flanking regions of cas operons +* **Flank....:** Fasta of flanking regions near cas operons and BLAST database of this + +#### Notes on output +Files are only created if there is any data. For example, the CRISPR_Cas.tab file is only created if there are any CRISPR-Cas loci. + +### Plotting <a name="plot"></a> +CRISPRCasTyper will automatically plot a map of the CRISPR-Cas loci, orphan Cas operons, and orphan CRISPR arrays. + +These maps can be expanded (`--expand N`) by adding unknown genes and genes with alignment scores below the thresholds. This can help in identify potentially un-annotated genes in operons. You can generate new plots without having to re-run the entire pipeline by adding `--redo_typing` to the command. This will re-use the mappings and re-type the operons and re-make the plot, based on new thresholds and plot parameters. + +The plot below is run with `--expand 5000` + +* Arrays are in alternating black/white displaying the actual number of repeats/spacers, and with their predicted subtype association based on the consensus repeat sequence. +* The interference module is in yellow. +* The adaptation module is in blue. +* Cas6 is in red. +* Accessory genes are in purple +* Genes with alignment scores below the thresholds are lighter and with parentheses around names. +* Unknown genes are in gray (the number matches the genes.tab file) + +<img src='img/plot2.svg' align="left" height="350" /> + +## RepeatTyper - How to <a name="repeattype"></a> +With an input of CRISPR repeats (one per line, in a simple textfile) RepeatTyper will predict the subtype, based on the kmer composition of the repeat + +#### Activate environment +```sh +conda activate cctyper +``` + +#### Run with a simple textfile, containing only CRISPR repeats (in capital letters), one repeat per line. +```sh +repeatType repeats.txt +``` + +#### Output <a name="repeattypeout"></a> +The script prints: +* Repeat sequence +* Predicted subtype +* Probability of prediction + +#### Notes on output +* Predictions with probabilities below 0.75 are uncertain, and should be taken with a grain of salt. +* Prior to version 1.4.0 the curated repeatTyper model was included in CCTyper +* From version 1.4.0 and onwards updated repeatTyper models are included in CCTyper (see more information in the section below) +* The followinig subtypes are included in the updated model as per December 2022: + * I-A, I-B, I-C, I-D, I-E, I-F, I-F (Transposon), I-G + * II-A, II-B, II-C + * III-A, III-B, III-C, III-D, III-E, III-F + * IV-A1, IV-A2, IV-A3, IV-D, IV-E + * V-A, V-B1, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-I, V-J, V-K + * VI-A, VI-B1, VI-B2, VI-C, VI-D +* This is the accuracy per subtype (on an unseen test dataset): +* I-A 0.76 +* I-B 0.81 +* I-C 0.97 +* I-D 0.86 +* I-E 0.95 +* I-F 0.96 +* I-F_T 0.99 +* I-G 0.89 +* II-A 0.92 +* II-B 0.97 +* II-C 0.90 +* III-A 0.82 +* III-B 0.68 +* III-C 0.60 +* III-D 0.59 +* III-E 1.00 +* III-F 0.25 +* IV-A1 0.85 +* IV-A2 0.68 +* IV-A3 0.96 +* IV-D 0.85 +* IV-E 0.92 +* V-A 1.00 +* V-B1 0.90 +* V-E 0.30 +* V-F 0.87 +* V-F1 0.87 +* V-F2 0.90 +* V-F3 0.90 +* V-G 0.67 +* V-I 0.80 +* V-J 0.63 +* V-K 0.99 +* VI-A 0.96 +* VI-B1 0.96 +* VI-B2 1.00 +* VI-C 0.67 +* VI-D 0.97 + +### Updated RepeatTyper models <a name="repeatnew"></a> +The [CCTyper webserver](https://typer.crispr.dk) is crowdsourcing subtyped repeats and includes an updated RepeatTyper model based on a much larger set of repeats and contains additional subtypes compared to the curated RepeatTyper model. +This updated model is automatically retrained each month and the models can be downloaded [here](http://mibi.galaxy.bio.ku.dk/russel/repeattyper/). + +From version 1.4.0 and onwards of CCTyper the newest repeatTyper model is included upon release of the version. + +Each model contains a training report (xgb_report), where you can find the training log, and in the bottom the accuracy, both overall and per subtype. + +#### Use new model in CRISPRCasTyper +Save the original database files: +```sh +mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab +mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model +``` + +Move the new model into the database folder +```sh +mv repeat_model/* ${CCTYPER_DB}/ +``` + +##### CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction! + +## RepeatTyper - Train <a name="repeattrain"></a> +You can train the repeat classifier with your own set of subtyped repeats. With a tab-delimeted input where 1. column contains the subtypes and 2. column contains the CRISPR repeat sequences, RepeatTrain will train a CRISPR repeat classifier that is directly usable for both RepeatTyper and CRISPRCasTyper. + +#### Train +```sh +repeatTrain typed_repeats.tab my_classifier +``` + +#### Use new model in RepeatTyper +```sh +repeatType repeats.txt --db my_classifier +``` + +#### Use new model in CRISPRCasTyper +Save the original database files: +```sh +mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab +mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model +``` + +Move the new model into the database folder +```sh +mv my_classifier/* ${CCTYPER_DB}/ +``` + +##### CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction! + +## Troubleshoot <a name="trouble"></a> + +### Running out of memory +Large metagenomic assemblies with many small contigs can exhaust the RAM on your laptop. Fortunately, as metagenomic contigs are analysed separately (when run with `--prodigal meta`) a simple solution is to split the input into smaller chunks (e.g. with [pyfasta](https://pypi.org/project/pyfasta/#command-line-interface)) + + + + +%prep +%autosetup -n cctyper-1.8.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-cctyper -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 1.8.0-1 +- Package Spec generated @@ -0,0 +1 @@ +70062be4fe7ed5b05ea68b2cbcfbc288 cctyper-1.8.0.tar.gz |
