%global _empty_manifest_terminate_build 0
Name:		python-mOTUlizer
Version:	0.3.2
Release:	1
Summary:	making OTUs from genomes, and stats on them. and even core-genomes
License:	GNU General Public License v3 (GPLv3)
URL:		https://github.com/moritzbuck/mOTUlizer/
Source0:	https://mirrors.aliyun.com/pypi/web/packages/ab/73/858918afeebb96748775d9c05ac94f69f1c27ad221ec654f63a8c688ded0/mOTUlizer-0.3.2.tar.gz
BuildArch:	noarch

Requires:	python3-igraph
Requires:	python3-biopython

%description
# mOTUlizer



**DISCLAIMER: there is another tool out there called mOTUs that creates OTU-tables directly from reads. If you are looking for that tool, this is the wrong page and you want to go ['here'](https://motu-tool.org/). But while you are on my page, why don't you check out mOTUlizer? It's cool, I swear.**

Utility to analyse a group of closely related MAGs/Genomes/bins/SUBs of more or less dubious origin. Right now it is composed of a number of programs:

* `mOTUlize.py` takes a set of genomes (I will use the term genome as a shorthand for a set of nucleotide sequences that presumably come from the same organism/population; it can be incomplete, redundant or contaminated) and clusters them into metagenomic Operational Taxonomic Units (mOTUs). Using similarity scores (by default ANI as computed by fastANI, but the user can provide other similarities), a network is built from the (user-defined) better-quality genomes (for historical reasons called MAGs) by thresholding the similarities at a specific value (95% by default). The connected components of this graph are the mOTUs. Additionally, lower-quality genomes (SUBs) are recruited to the mOTU of whichever MAG they are most similar to, if the similarity is above the threshold.

* `mOTUpan.py` computes the likelihood of gene-encoded traits to be expected in all of a set of genomes, e.g. of a trait to be in the core genome of a set of genomes (of possibly varying quality). Basically you provide to `mOTUpan` the set of proteomes of your genomes of interest (for example from the same mOTU or Genus) as well as a completeness prior of these genomes (for example [`checkm`](https://ecogenomics.github.io/CheckM/) output or a fixed value) and it computes gene clusters using [`mmseqs2`](https://github.com/soedinglab/MMseqs2), you can also provide your own genome encoded traits either as a `JSON`-file, or `TAB`-separated file (see example files). For each of these gene-clusters it will then compute the likelihood of it being in the core vs the likelihood of it not being, the ratio of these likelihoods will determine if a trait is considered core or not. This new partitioning can be used to update our completeness prior, and recomputed iteratively until convergence.

* `mOTUconvert.py` converts the output of diverse programs into input files for `mOTUpan.py`. It currently includes methods for [`mmseqs2`](https://github.com/soedinglab/MMseqs2), [`roary`](https://sanger-pathogens.github.io/Roary/), [`PPanGGOLiN`](https://github.com/labgem/PPanGGOLiN), [`eggNOGmapper`](https://github.com/eggnogdb/eggnog-mapper) and [`anvio`](https://merenlab.org/software/anvio/) pangenome databases.

* **experimental** `anvi-run-motupan.py`: an anvi'o-compatible version of `mOTUpan.py`. It has fewer options right now, but runs directly on an anvi'o pangenome database.
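
The likelihood-ratio idea behind `mOTUpan.py` can be illustrated with a toy model. This is only a sketch under loud assumptions, not the model actually implemented in mOTUlizer: a core trait is assumed to be observed in a genome with probability equal to that genome's completeness, and a non-core trait with that probability multiplied by a crude frequency estimate. The function name and the frequency estimate are hypothetical, and completeness is given as a fraction between 0 and 1 rather than on the 0 to 100 scale used in the input files.

```python
import math


def core_log_likelihood_ratio(presence, completeness):
    """Return log L(core) - log L(not core) for one trait (toy model).

    presence:     dict genome name -> True if the trait was observed
    completeness: dict genome name -> completeness as a fraction in (0, 1)
    """
    genomes = list(presence)
    observed = sum(1 for g in genomes if presence[g])
    mean_comp = sum(completeness[g] for g in genomes) / len(genomes)
    # ASSUMPTION: frequency of a non-core trait, crudely corrected for
    # incompleteness and capped just below 1 to keep the logarithms finite.
    freq = min(observed / len(genomes) / mean_comp, 0.999)

    llr = 0.0
    for g in genomes:
        c = completeness[g]
        if presence[g]:
            # core: observed with probability c; non-core: freq * c
            llr += math.log(c) - math.log(freq * c)
        else:
            # core: missed with probability 1 - c; non-core: 1 - freq * c
            llr += math.log(1.0 - c) - math.log(1.0 - freq * c)
    return llr


completeness = {"g1": 0.9, "g2": 0.9, "g3": 0.9}
everywhere = core_log_likelihood_ratio(
    {"g1": True, "g2": True, "g3": True}, completeness)
rare = core_log_likelihood_ratio(
    {"g1": True, "g2": False, "g3": False}, completeness)
```

A positive ratio favours the core hypothesis. In this toy example `everywhere` is positive and `rare` is negative.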

A number of example files are to be found in the `example_files`-folder. The `fasta`- and `gff`-files are the ones used for all the other files; these are generated by the always fantastic [`prokka`](example_files/fnas/). There is also some reading material in `mOTUlizer/doc` (a poster, a presentation and a very early paper draft, but at least it has the maths in it); the paper will eventually be available there!

## INSTALL

With conda:

```
conda install -c bioconda  motulizer
```

With pip:

```
pip install mOTUlizer
```

manually:

```
git clone https://github.com/moritzbuck/mOTUlizer.git
cd mOTUlizer
python setup.py install
```


## USAGE

### mOTUlize

Makes OTUs and computes some stats. It needs [`fastANI`](https://github.com/ParBLiSS/FastANI) in the `PATH` if you do not provide a file for `--similarities`. To work around fastANI's memory-greedy nature, it is run in blocks if needed.

simply run with:
```
mOTUlize.py --fnas example_files/fnas/*.fna -o output.tsv
```

Loads of little options if you do : `mOTUlize.py -h`

#### Key options:

* `--checkm`: provide a file containing completenesses and contaminations (see the `mOTUpan` section for details). This time, though, contamination is used. If the file is not provided, all genomes are assumed to be 'MAGs', i.e. high quality.

* `--similarities`: provide a file that contains pairwise similarities of genomes. The parser will ignore any lines containing the word query (yeah I know, dodgy, but that is it for now; it's based on the output of fastANI), and needs at least three columns separated by `TAB`s. The first two columns are the two genome names, the third a similarity value between 0 and 100. As of now, the similarities are assumed to be asymmetrical: only pairs where both directions pass the similarity threshold are kept for the network.
* `--keep-simi-file` : saves the similarity file generated by `fastANI`, can be used directly with `--similarities`. Good to use if you have a lot of genomes as the `fastANI` part is the slowest part, you can then use the same file with different cutoffs to tailor your clustering.
* `--MAG-completeness`/`--MAG-contamination`: controls which genomes are considered high quality, used for creating the mOTU-network
* `--SUB-completeness`/`--SUB-contamination`: controls which genomes are satellite unclassified bins (SUBs), lower-quality bins that could be recruited to the MAGs (i.e. satellites around mOTUs; ok, not sure the acronym is really good, fine)
* `--similarity-cutoff`: similarity cutoff used to generate the mOTU network. By default 95, as in 95% ANI cutoff which has been recorded as a weirdly universal cutoff that separates species. This number is reported at a number of places [I will cite them soon, I promise, I will fill this with citations], and freaks me out. I thought it was an artefact, but if the completeness threshold is high enough, it is weirdly universal, found very very few exceptions... anyhow, you can change it.
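
The clustering described above can be sketched in a few lines. This is a minimal illustration, not the code of `mOTUlize.py`: the function names are hypothetical, the comparison is assumed to be "greater than or equal to the cutoff", and the recruitment of SUBs is left out. It follows the stated rules: lines containing the word query are skipped, both directions of a pair must pass the cutoff, and the connected components are the mOTUs.

```python
from collections import defaultdict


def parse_similarities(lines):
    """Parse TAB-separated 'genome1, genome2, similarity' lines into a dict."""
    sims = {}
    for line in lines:
        if "query" in line:          # header-like lines are ignored
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        sims[(fields[0], fields[1])] = float(fields[2])
    return sims


def cluster_motus(sims, genomes, cutoff=95.0):
    """Return connected components of the thresholded similarity graph."""
    parent = {g: g for g in genomes}

    def find(g):
        # union-find lookup with path halving
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in sims.items():
        if a == b or a not in parent or b not in parent:
            continue
        # ASSUMPTION: an edge needs both a->b and b->a at or above the cutoff
        if value >= cutoff and sims.get((b, a), 0.0) >= cutoff:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for g in genomes:
        groups[find(g)].add(g)
    return sorted(sorted(members) for members in groups.values())


example_lines = [
    "query\tref\tani",
    "A\tB\t97.1",
    "B\tA\t96.8",
    "B\tC\t95.5",
    "C\tB\t94.0",
]
motus = cluster_motus(parse_similarities(example_lines), ["A", "B", "C"])
```

Here `A` and `B` end up in the same mOTU, while `C` stays alone because the `C` to `B` direction is below the cutoff.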

### mOTUpan

An intro video [here](https://www.youtube.com/watch?v=VIeV1Gg5NS4):

[![mOTUpan for beginners](https://img.youtube.com/vi/VIeV1Gg5NS4/0.jpg)](https://www.youtube.com/watch?v=VIeV1Gg5NS4)


```
mOTUpan.py -h
```

Simplest command to run (needs mmseqs2 installed), but many options:

```
mOTUpan.py --faas *.faa -o output.tsv
```

#### Key options:

Check all flags with `--help`, but here are some key ones explained in a bit more detail:

* `--boots BOOTS` : runs `BOOTS` bootstraps, where artificial genomes are generated using the gene-partitioning obtained with `mOTUpan` (i.e. the core genes are in all artificial genomes, the others are a gene-pool with their frequency conserved); these genomes are then rarefied according to the posterior completeness estimates. The bootstrap will provide an estimate of the false positive rate (i.e. the fraction of core genes that might not be core), the recall (the fraction of core genes that have been classified as such in the bootstrap), and 'lowest false', the lowest frequency of any false positive found (it should be high, meaning that your possible false positives are actually highly prevalent in your genome-set). A higher number of bootstraps gives you a standard deviation for these numbers.

* `--cog_file` : can be used as an alternative to `--faas`; you can use it to provide your own gene-clusters (or other genetically encoded traits). The file should either be a `JSON`-file encoding a dictionary where the keys are the genome names and the values are lists of traits/genes (example in `example_files/example_genome2cog.json`), or a `TAB`-separated file, where the first column is the genome name, followed by `TAB`-separated trait/gene-names (example in `example_files/example_genome2cog.tsv`).

* `--genome2cog_only` : only runs the gene-clustering (`mmseqs easy-cluster`), returns a `JSON`-datastructure compatible with `--cog_file`

* `--checkm` : provide a file with Completenesses and Contaminations (Redundancy). It accepts the output of `checkm` or any other `TAB`-separated file with at least three columns: `Bin Id`, `Completeness`, and `Contamination`. `Bin Id` should be the genome names (e.g. file name minus extension), `Completeness` and `Contamination` values between 0 and 100. Note that `Contamination` is not actually used yet... A normal `checkm` output file is available at `example_files/example_checkm.txt` and a more generic completeness file that also works at `example_files/example_generic_completeness.tsv`.

* `--max_iter` : maximum number of iterations for the recursive aspect of motupan. You might want to put that to `1` if you have only few traits that would not be sufficient to estimate completeness.
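
The input formats described for `--cog_file` and `--checkm` are simple enough to generate by hand. The helpers below are a hypothetical sketch (their names and the example genome and trait names are made up for illustration); they only follow the column layout stated above and have not been checked against the parser in `mOTUpan.py`.

```python
import json

# genome name -> list of traits/genes found in that genome
genome2cog = {
    "genome_A": ["cog1", "cog2", "cog3"],
    "genome_B": ["cog1", "cog3"],
}


def to_cog_json(mapping):
    """JSON flavour: a dictionary of genome names to lists of traits."""
    return json.dumps(mapping, indent=2)


def to_cog_tsv(mapping):
    """TAB flavour: genome name first, then its traits, one genome per line."""
    rows = ["\t".join([genome] + list(traits))
            for genome, traits in mapping.items()]
    return "\n".join(rows) + "\n"


def to_completeness_tsv(quality):
    """Generic completeness file with the three required columns.

    quality: dict genome name -> (completeness, contamination), both 0-100
    """
    rows = ["Bin Id\tCompleteness\tContamination"]
    for genome, (completeness, contamination) in quality.items():
        rows.append(f"{genome}\t{completeness}\t{contamination}")
    return "\n".join(rows) + "\n"


cog_tsv = to_cog_tsv(genome2cog)
quality_tsv = to_completeness_tsv({"genome_A": (95.2, 1.3)})
```

Writing either string to a file gives something shaped like the example files, which remain the reference for the exact expected layout.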

### anvi-run-motupan

You need an anvi'o pangenome database and, if you have it, the genome storage (for completenesses). Then simply:

```
# if you want just a tsv :

anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db -o MY_OUTPUT.tsv

# if you want to update the db, so it shows up in anvi-display-pan

anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db --store-in-db

```

### mOTUconvert

A small program generating appropriate input files for `mOTUpan.py` from the output of some of my favorite, or the public's favorite programs. It assumes the IDs in your protein `fasta`-file to be  `${genome_name}_[0-9]*` so genome-name separated from a number by an underscore. The gene name could have an underscore in it... But it might be risky, I did not code this very cleanly...
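
The ID convention above can be illustrated with a short sketch. It is not the parsing code of `mOTUconvert.py`; the function names are hypothetical, and it assumes that the genome name is everything before the last underscore and that at least one digit follows it.

```python
import re
from collections import defaultdict

# genome name, one underscore, then a run of digits at the end of the ID
ID_PATTERN = re.compile(r"^(?P<genome>.+)_(?P<number>[0-9]+)$")


def genome_of(protein_id):
    """Return the genome name encoded in a protein ID."""
    match = ID_PATTERN.match(protein_id)
    if match is None:
        raise ValueError(f"unexpected protein ID: {protein_id!r}")
    return match.group("genome")


def group_by_genome(protein_ids):
    """Collect protein IDs per genome, keeping input order."""
    grouped = defaultdict(list)
    for protein_id in protein_ids:
        grouped[genome_of(protein_id)].append(protein_id)
    return dict(grouped)


grouped = group_by_genome(["my_genome_00012", "my_genome_00013", "other_1"])
```

Because the pattern is greedy, underscores inside the genome name survive: `my_genome_00012` is assigned to `my_genome`.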

Runs as :

```
# check possible input file-types within

mOTUconvert.py --list

# running it
mOTUconvert.py  --in_type INFILE_TYPE INFILE > OUTPUT
# or
mOTUconvert.py  --in_type INFILE_TYPE -o OUTPUT INFILE

# you can then run mOTUpan as

mOTUpan.py --cog_file OUTPUT
```

In the `example_files`-folder a number of example input and output files are available.

## Citing and additional doc

Preprint for mOTUpan available on [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.06.25.449606v1.full):

>mOTUpan: a robust Bayesian approach to leverage metagenome assembled genomes for core-genome estimation
>Moritz Buck, Maliheh Mehrshad, and Stefan Bertilsson
>bioRxiv 2021.06.25.449606; doi: https://doi.org/10.1101/2021.06.25.449606

A draft of a `release note` for mOTUlize is in the doc-folder, as well as the source of the previously mentioned mOTUpan paper and some slides




%package -n python3-mOTUlizer
Summary:	making OTUs from genomes, and stats on them. and even core-genomes
Provides:	python-mOTUlizer
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-mOTUlizer
# mOTUlizer



**DISCLAIMER: there is another tool out there called mOTUs that creates OTU-tables directly from reads. If you are looking for that tool, this is the wrong page and you want to go ['here'](https://motu-tool.org/). But while you are on my page, why don't you check out mOTUlizer? It's cool, I swear.**

Utility to analyse a group of closely related MAGs/Genomes/bins/SUBs of more or less dubious origin. Right now it is composed of a number of programs:

* `mOTUlize.py` takes a set of genomes (I will use the term genome as a shorthand for a set of nucleotide sequences that presumably come from the same organism/population; it can be incomplete, redundant or contaminated) and clusters them into metagenomic Operational Taxonomic Units (mOTUs). Using similarity scores (by default ANI as computed by fastANI, but the user can provide other similarities), a network is built from the (user-defined) better-quality genomes (for historical reasons called MAGs) by thresholding the similarities at a specific value (95% by default). The connected components of this graph are the mOTUs. Additionally, lower-quality genomes (SUBs) are recruited to the mOTU of whichever MAG they are most similar to, if the similarity is above the threshold.

* `mOTUpan.py` computes the likelihood of gene-encoded traits to be expected in all of a set of genomes, e.g. of a trait to be in the core genome of a set of genomes (of possibly varying quality). Basically you provide to `mOTUpan` the set of proteomes of your genomes of interest (for example from the same mOTU or Genus) as well as a completeness prior of these genomes (for example [`checkm`](https://ecogenomics.github.io/CheckM/) output or a fixed value) and it computes gene clusters using [`mmseqs2`](https://github.com/soedinglab/MMseqs2), you can also provide your own genome encoded traits either as a `JSON`-file, or `TAB`-separated file (see example files). For each of these gene-clusters it will then compute the likelihood of it being in the core vs the likelihood of it not being, the ratio of these likelihoods will determine if a trait is considered core or not. This new partitioning can be used to update our completeness prior, and recomputed iteratively until convergence.

* `mOTUconvert.py` converts the output of diverse programs into input files for `mOTUpan.py`. It currently includes methods for [`mmseqs2`](https://github.com/soedinglab/MMseqs2), [`roary`](https://sanger-pathogens.github.io/Roary/), [`PPanGGOLiN`](https://github.com/labgem/PPanGGOLiN), [`eggNOGmapper`](https://github.com/eggnogdb/eggnog-mapper) and [`anvio`](https://merenlab.org/software/anvio/) pangenome databases.

* **experimental** `anvi-run-motupan.py`: an anvi'o-compatible version of `mOTUpan.py`. It has fewer options right now, but runs directly on an anvi'o pangenome database.

A number of example files are to be found in the `example_files`-folder. The `fasta`- and `gff`-files are the ones used for all the other files; these are generated by the always fantastic [`prokka`](example_files/fnas/). There is also some reading material in `mOTUlizer/doc` (a poster, a presentation and a very early paper draft, but at least it has the maths in it); the paper will eventually be available there!

## INSTALL

With conda:

```
conda install -c bioconda  motulizer
```

With pip:

```
pip install mOTUlizer
```

manually:

```
git clone https://github.com/moritzbuck/mOTUlizer.git
cd mOTUlizer
python setup.py install
```


## USAGE

### mOTUlize

Makes OTUs and computes some stats. It needs [`fastANI`](https://github.com/ParBLiSS/FastANI) in the `PATH` if you do not provide a file for `--similarities`. To work around fastANI's memory-greedy nature, it is run in blocks if needed.

simply run with:
```
mOTUlize.py --fnas example_files/fnas/*.fna -o output.tsv
```

Loads of little options if you do : `mOTUlize.py -h`

#### Key options:

* `--checkm`: provide a file containing completenesses and contaminations (see the `mOTUpan` section for details). This time, though, contamination is used. If the file is not provided, all genomes are assumed to be 'MAGs', i.e. high quality.

* `--similarities`: provide a file that contains pairwise similarities of genomes. The parser will ignore any lines containing the word query (yeah I know, dodgy, but that is it for now; it's based on the output of fastANI), and needs at least three columns separated by `TAB`s. The first two columns are the two genome names, the third a similarity value between 0 and 100. As of now, the similarities are assumed to be asymmetrical: only pairs where both directions pass the similarity threshold are kept for the network.
* `--keep-simi-file` : saves the similarity file generated by `fastANI`, can be used directly with `--similarities`. Good to use if you have a lot of genomes as the `fastANI` part is the slowest part, you can then use the same file with different cutoffs to tailor your clustering.
* `--MAG-completeness`/`--MAG-contamination`: controls which genomes are considered high quality, used for creating the mOTU-network
* `--SUB-completeness`/`--SUB-contamination`: controls which genomes are satellite unclassified bins (SUBs), lower-quality bins that could be recruited to the MAGs (i.e. satellites around mOTUs; ok, not sure the acronym is really good, fine)
* `--similarity-cutoff`: similarity cutoff used to generate the mOTU network. By default 95, as in 95% ANI cutoff which has been recorded as a weirdly universal cutoff that separates species. This number is reported at a number of places [I will cite them soon, I promise, I will fill this with citations], and freaks me out. I thought it was an artefact, but if the completeness threshold is high enough, it is weirdly universal, found very very few exceptions... anyhow, you can change it.

### mOTUpan

An intro video [here](https://www.youtube.com/watch?v=VIeV1Gg5NS4):

[![mOTUpan for beginners](https://img.youtube.com/vi/VIeV1Gg5NS4/0.jpg)](https://www.youtube.com/watch?v=VIeV1Gg5NS4)


```
mOTUpan.py -h
```

Simplest command to run (needs mmseqs2 installed), but many options:

```
mOTUpan.py --faas *.faa -o output.tsv
```

#### Key options:

Check all flags with `--help`, but here are some key ones explained in a bit more detail:

* `--boots BOOTS` : runs `BOOTS` bootstraps, where artificial genomes are generated using the gene-partitioning obtained with `mOTUpan` (i.e. the core genes are in all artificial genomes, the others are a gene-pool with their frequency conserved); these genomes are then rarefied according to the posterior completeness estimates. The bootstrap will provide an estimate of the false positive rate (i.e. the fraction of core genes that might not be core), the recall (the fraction of core genes that have been classified as such in the bootstrap), and 'lowest false', the lowest frequency of any false positive found (it should be high, meaning that your possible false positives are actually highly prevalent in your genome-set). A higher number of bootstraps gives you a standard deviation for these numbers.

* `--cog_file` : can be used as an alternative to `--faas`; you can use it to provide your own gene-clusters (or other genetically encoded traits). The file should either be a `JSON`-file encoding a dictionary where the keys are the genome names and the values are lists of traits/genes (example in `example_files/example_genome2cog.json`), or a `TAB`-separated file, where the first column is the genome name, followed by `TAB`-separated trait/gene-names (example in `example_files/example_genome2cog.tsv`).

* `--genome2cog_only` : only runs the gene-clustering (`mmseqs easy-cluster`), returns a `JSON`-datastructure compatible with `--cog_file`

* `--checkm` : provide a file with Completenesses and Contaminations (Redundancy). It accepts the output of `checkm` or any other `TAB`-separated file with at least three columns: `Bin Id`, `Completeness`, and `Contamination`. `Bin Id` should be the genome names (e.g. file name minus extension), `Completeness` and `Contamination` values between 0 and 100. Note that `Contamination` is not actually used yet... A normal `checkm` output file is available at `example_files/example_checkm.txt` and a more generic completeness file that also works at `example_files/example_generic_completeness.tsv`.

* `--max_iter` : maximum number of iterations for the recursive aspect of motupan. You might want to put that to `1` if you have only few traits that would not be sufficient to estimate completeness.

### anvi-run-motupan

You need an anvi'o pangenome database and, if you have it, the genome storage (for completenesses). Then simply:

```
# if you want just a tsv :

anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db -o MY_OUTPUT.tsv

# if you want to update the db, so it shows up in anvi-display-pan

anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db --store-in-db

```

### mOTUconvert

A small program generating appropriate input files for `mOTUpan.py` from the output of some of my favorite, or the public's favorite programs. It assumes the IDs in your protein `fasta`-file to be  `${genome_name}_[0-9]*` so genome-name separated from a number by an underscore. The gene name could have an underscore in it... But it might be risky, I did not code this very cleanly...

Runs as :

```
# check possible input file-types within

mOTUconvert.py --list

# running it
mOTUconvert.py  --in_type INFILE_TYPE INFILE > OUTPUT
# or
mOTUconvert.py  --in_type INFILE_TYPE -o OUTPUT INFILE

# you can then run mOTUpan as

mOTUpan.py --cog_file OUTPUT
```

In the `example_files`-folder a number of example input and output files are available.

## Citing and additional doc

Preprint for mOTUpan available on [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.06.25.449606v1.full):

>mOTUpan: a robust Bayesian approach to leverage metagenome assembled genomes for core-genome estimation
>Moritz Buck, Maliheh Mehrshad, and Stefan Bertilsson
>bioRxiv 2021.06.25.449606; doi: https://doi.org/10.1101/2021.06.25.449606

A draft of a `release note` for mOTUlize is in the doc-folder, as well as the source of the previously mentioned mOTUpan paper and some slides




%package help
Summary:	Development documents and examples for mOTUlizer
Provides:	python3-mOTUlizer-doc
%description help
# mOTUlizer



**DISCLAIMER: there is another tool out there called mOTUs that creates OTU-tables directly from reads. If you are looking for that tool, this is the wrong page and you want to go ['here'](https://motu-tool.org/). But while you are on my page, why don't you check out mOTUlizer? It's cool, I swear.**

Utility to analyse a group of closely related MAGs/Genomes/bins/SUBs of more or less dubious origin. Right now it is composed of a number of programs:

* `mOTUlize.py` takes a set of genomes (I will use the term genome as a shorthand for a set of nucleotide sequences that presumably come from the same organism/population; it can be incomplete, redundant or contaminated) and clusters them into metagenomic Operational Taxonomic Units (mOTUs). Using similarity scores (by default ANI as computed by fastANI, but the user can provide other similarities), a network is built from the (user-defined) better-quality genomes (for historical reasons called MAGs) by thresholding the similarities at a specific value (95% by default). The connected components of this graph are the mOTUs. Additionally, lower-quality genomes (SUBs) are recruited to the mOTU of whichever MAG they are most similar to, if the similarity is above the threshold.

* `mOTUpan.py` computes the likelihood of gene-encoded traits to be expected in all of a set of genomes, e.g. of a trait to be in the core genome of a set of genomes (of possibly varying quality). Basically you provide to `mOTUpan` the set of proteomes of your genomes of interest (for example from the same mOTU or Genus) as well as a completeness prior of these genomes (for example [`checkm`](https://ecogenomics.github.io/CheckM/) output or a fixed value) and it computes gene clusters using [`mmseqs2`](https://github.com/soedinglab/MMseqs2), you can also provide your own genome encoded traits either as a `JSON`-file, or `TAB`-separated file (see example files). For each of these gene-clusters it will then compute the likelihood of it being in the core vs the likelihood of it not being, the ratio of these likelihoods will determine if a trait is considered core or not. This new partitioning can be used to update our completeness prior, and recomputed iteratively until convergence.

* `mOTUconvert.py` converts the output of diverse programs into input files for `mOTUpan.py`. It currently includes methods for [`mmseqs2`](https://github.com/soedinglab/MMseqs2), [`roary`](https://sanger-pathogens.github.io/Roary/), [`PPanGGOLiN`](https://github.com/labgem/PPanGGOLiN), [`eggNOGmapper`](https://github.com/eggnogdb/eggnog-mapper) and [`anvio`](https://merenlab.org/software/anvio/) pangenome databases.

* **experimental** `anvi-run-motupan.py`: an anvi'o-compatible version of `mOTUpan.py`. It has fewer options right now, but runs directly on an anvi'o pangenome database.

A number of example files are to be found in the `example_files`-folder. The `fasta`- and `gff`-files are the ones used for all the other files; these are generated by the always fantastic [`prokka`](example_files/fnas/). There is also some reading material in `mOTUlizer/doc` (a poster, a presentation and a very early paper draft, but at least it has the maths in it); the paper will eventually be available there!

## INSTALL

With conda:

```
conda install -c bioconda  motulizer
```

With pip:

```
pip install mOTUlizer
```

manually:

```
git clone https://github.com/moritzbuck/mOTUlizer.git
cd mOTUlizer
python setup.py install
```


## USAGE

### mOTUlize

Makes OTUs and computes some stats. It needs [`fastANI`](https://github.com/ParBLiSS/FastANI) in the `PATH` if you do not provide a file for `--similarities`. To work around fastANI's memory-greedy nature, it is run in blocks if needed.

simply run with:
```
mOTUlize.py --fnas example_files/fnas/*.fna -o output.tsv
```

Loads of little options if you do : `mOTUlize.py -h`

#### Key options:

* `--checkm`: provide a file containing completenesses and contaminations (see the `mOTUpan` section for details). This time, though, contamination is used. If the file is not provided, all genomes are assumed to be 'MAGs', i.e. high quality.

* `--similarities`: provide a file that contains pairwise similarities of genomes. The parser will ignore any lines containing the word query (yeah I know, dodgy, but that is it for now; it's based on the output of fastANI), and needs at least three columns separated by `TAB`s. The first two columns are the two genome names, the third a similarity value between 0 and 100. As of now, the similarities are assumed to be asymmetrical: only pairs where both directions pass the similarity threshold are kept for the network.
* `--keep-simi-file` : saves the similarity file generated by `fastANI`, can be used directly with `--similarities`. Good to use if you have a lot of genomes as the `fastANI` part is the slowest part, you can then use the same file with different cutoffs to tailor your clustering.
* `--MAG-completeness`/`--MAG-contamination`: controls which genomes are considered high quality, used for creating the mOTU-network
* `--SUB-completeness`/`--SUB-contamination`: controls which genomes are satellite unclassified bins (SUBs), lower-quality bins that could be recruited to the MAGs (i.e. satellites around mOTUs; ok, not sure the acronym is really good, fine)
* `--similarity-cutoff`: similarity cutoff used to generate the mOTU network. By default 95, as in 95% ANI cutoff which has been recorded as a weirdly universal cutoff that separates species. This number is reported at a number of places [I will cite them soon, I promise, I will fill this with citations], and freaks me out. I thought it was an artefact, but if the completeness threshold is high enough, it is weirdly universal, found very very few exceptions... anyhow, you can change it.

### mOTUpan

An intro video [here](https://www.youtube.com/watch?v=VIeV1Gg5NS4):

[![mOTUpan for beginners](https://img.youtube.com/vi/VIeV1Gg5NS4/0.jpg)](https://www.youtube.com/watch?v=VIeV1Gg5NS4)


```
mOTUpan.py -h
```

Simplest command to run (needs mmseqs2 installed), but many options:

```
mOTUpan.py --faas *.faa -o output.tsv
```

#### Key options:

Check all flags with `--help`, but here are some key ones explained in a bit more detail:

* `--boots BOOTS` : runs `BOOTS` bootstraps, where artificial genomes are generated using the gene-partitioning obtained with `mOTUpan` (i.e. the core genes are in all artificial genomes, the others are a gene-pool with their frequency conserved); these genomes are then rarefied according to the posterior completeness estimates. The bootstrap will provide an estimate of the false positive rate (i.e. the fraction of core genes that might not be core), the recall (the fraction of core genes that have been classified as such in the bootstrap), and 'lowest false', the lowest frequency of any false positive found (it should be high, meaning that your possible false positives are actually highly prevalent in your genome-set). A higher number of bootstraps gives you a standard deviation for these numbers.

* `--cog_file` : can be used as an alternative to `--faas`; you can use it to provide your own gene-clusters (or other genetically encoded traits). The file should either be a `JSON`-file encoding a dictionary where the keys are the genome names and the values are lists of traits/genes (example in `example_files/example_genome2cog.json`), or a `TAB`-separated file, where the first column is the genome name, followed by `TAB`-separated trait/gene-names (example in `example_files/example_genome2cog.tsv`).

* `--genome2cog_only` : only runs the gene-clustering (`mmseqs easy-cluster`), returns a `JSON`-datastructure compatible with `--cog_file`

* `--checkm` : provide a file with Completenesses and Contaminations (Redundancy). It accepts the output of `checkm` or any other `TAB`-separated file with at least three columns: `Bin Id`, `Completeness`, and `Contamination`. `Bin Id` should be the genome names (e.g. file name minus extension), `Completeness` and `Contamination` values between 0 and 100. Note that `Contamination` is not actually used yet... A normal `checkm` output file is available at `example_files/example_checkm.txt` and a more generic completeness file that also works at `example_files/example_generic_completeness.tsv`.

* `--max_iter` : maximum number of iterations for the recursive aspect of motupan. You might want to put that to `1` if you have only few traits that would not be sufficient to estimate completeness.

### anvi-run-motupan

You need an anvi'o pangenome database and, if you have it, the genome storage (for completenesses). Then simply:

```
# if you want just a tsv :

anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db -o MY_OUTPUT.tsv

# if you want to update the db, so it shows up in anvi-display-pan

anvi-run-motupan.py -p MYPANGENOME-PAN.db -g MYGENOMES.db --store-in-db

```

### mOTUconvert

A small program generating appropriate input files for `mOTUpan.py` from the output of some of my favorite, or the public's favorite programs. It assumes the IDs in your protein `fasta`-file to be  `${genome_name}_[0-9]*` so genome-name separated from a number by an underscore. The gene name could have an underscore in it... But it might be risky, I did not code this very cleanly...

Runs as :

```
# check possible input file-types within

mOTUconvert.py --list

# running it
mOTUconvert.py  --in_type INFILE_TYPE INFILE > OUTPUT
# or
mOTUconvert.py  --in_type INFILE_TYPE -o OUTPUT INFILE

# you can then run mOTUpan as

mOTUpan.py --cog_file OUTPUT
```

In the `example_files`-folder a number of example input and output files are available.

## Citing and additional doc

Preprint for mOTUpan available on [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.06.25.449606v1.full):

>mOTUpan: a robust Bayesian approach to leverage metagenome assembled genomes for core-genome estimation
>Moritz Buck, Maliheh Mehrshad, and Stefan Bertilsson
>bioRxiv 2021.06.25.449606; doi: https://doi.org/10.1101/2021.06.25.449606

A draft of a `release note` for mOTUlize is in the doc-folder, as well as the source of the previously mentioned mOTUpan paper and some slides




%prep
%autosetup -n mOTUlizer-0.3.2

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-mOTUlizer -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 0.3.2-1
- Package Spec generated