| author | CoprDistGit <infra@openeuler.org> | 2023-05-29 10:45:13 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-29 10:45:13 +0000 |
| commit | 9d62639c40f209e196b0b4050497a6c15316d92b (patch) | |
| tree | eb1ac2285d37495b9cb9aa1cb705df3e0559cb1a /python-scetm.spec | |
| parent | a75981c665f578dfa0403f1834f725ea2d6d1abc (diff) | |
automatic import of python-scetm
Diffstat (limited to 'python-scetm.spec')
| -rw-r--r-- | python-scetm.spec | 406 |
1 files changed, 406 insertions, 0 deletions
diff --git a/python-scetm.spec b/python-scetm.spec new file mode 100644 index 0000000..d8304d5 --- /dev/null +++ b/python-scetm.spec @@ -0,0 +1,406 @@ +%global _empty_manifest_terminate_build 0 +Name: python-scETM +Version: 0.5.0 +Release: 1 +Summary: Single cell embedded topic model for integrated scRNA-seq data analysis. +License: BSD 3-Clause License +URL: https://github.com/hui2000ji/scETM/ +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/bb/76/7eb2b1d0a50783dd5cf5b189c335608b4ed626cf3ebe6c910c7c88aca424/scETM-0.5.0.tar.gz +BuildArch: noarch + +Requires: python3-torch +Requires: python3-numpy +Requires: python3-matplotlib +Requires: python3-scikit-learn +Requires: python3-h5py +Requires: python3-pandas +Requires: python3-tqdm +Requires: python3-anndata +Requires: python3-scanpy +Requires: python3-scipy +Requires: python3-louvain +Requires: python3-leidenalg +Requires: python3-psutil + +%description +# scETM: single-cell Embedded Topic Model +A generative topic model that facilitates integrative analysis of large-scale single-cell RNA sequencing data. + +The full description of scETM and its application on published single cell RNA-seq datasets are available [here](https://www.biorxiv.org/content/10.1101/2021.01.13.426593v1). + +This repository includes detailed instructions for installation and requirements, demos, and scripts used for the benchmarking of 7 other state-of-art methods. + + +## Contents ## + +- [scETM: single-cell Embedded Topic Model](#scetm-single-cell-embedded-topic-model) + - [Contents](#contents) + - [1 Model Overview](#1-model-overview) + - [2 Installation](#2-installation) + - [3 Usage](#3-usage) + - [Data format](#data-format) + - [A taste of scETM](#a-taste-of-scetm) + - [p-scETM](#p-scetm) + - [Transfer learning](#transfer-learning) + - [Tensorboard Integration](#tensorboard-integration) + - [4 Benchmarking](#4-benchmarking) + +## 1 Model Overview + + +**(a)** Probabilistic graphical model of scETM. 
We model the scRNA-profile read count matrix y<sub>d,g</sub> in cell d and gene g across S subjects or studies by a multinomial distribution with the rate parameterized by cell topic mixture θ, topic embedding α, gene embedding ρ, and batch effects λ. **(b)** Matrix factorization view of scETM. **(c)** Encoder architecture for inferring the cell topic mixture θ. + +## 2 Installation +Python version: 3.7+ +scETM is included in PyPI, so you can install it by + +```bash +pip install scETM +``` + +To enable GPU computing (which significantly boosts the performance), please install [PyTorch](https://pytorch.org/) with GPU support **before** installing scETM. + +## 3 Usage +**A step-by-step scETM tutorial can be found in [here](/notebooks/scETM%20introductory%20tutorial.ipynb).** + +### Data format +scETM requires a cells-by-genes matrix `adata` as input, in the format of an AnnData object. Detailed description about AnnData can be found [here](https://anndata.readthedocs.io/en/latest/). + +By default, scETM looks for batch information in the 'batch_indices' column of the `adata.obs` DataFrame, and cell type identity in the 'cell_types' column. If your data stores the batch and cell type information in different columns, pass them to the `batch_col` and `cell_type_col` arguments, respectively, when calling scETM functions. + +### A taste of scETM + +```python +from scETM import scETM, UnsupervisedTrainer, evaluate +import anndata + +# Prepare the source dataset, Mouse Pancreas +mp = anndata.read_h5ad("MousePancreas.h5ad") +# Initialize model +model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True) +# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging. +trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results") +# Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches. 
+trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4) +# Obtain scETM cell, gene and topic embeddings. Unnormalized cell embeddings will be stored at mp.obsm['delta'], normalized cell embeddings at mp.obsm['theta'], gene embeddings at mp.varm['rho'], topic embeddings at mp.uns['alpha']. +model.get_all_embeddings_and_nll(mp) +# Evaluate the model and save the embedding plot +evaluate(mp, embedding_key="delta", plot_fname="scETM_MP", plot_dir="figures/scETM_MP") +``` + +### p-scETM +p-scETM is a variant of scETM where part or all of the the gene embedding matrix ρ is fixed to a pathways-by-genes matrix, which can be downloaded from the [pathDIP4 pathway database](http://ophid.utoronto.ca/pathDIP/Download.jsp). We only keep pathways that contain more than 5 genes. + +If it is desired to fix the gene embedding matrix ρ during training, let trainable_gene_emb_dim be zero. In this case, the gene set used to train the model would be the intersection of the genes in the scRNA-seq data and the genes in the gene-by-pathway matrix. Otherwise, if trainable_gene_emb_dim is set to a positive value, all the genes in the scRNA-seq data would be kept. + +### Transfer learning + +```python +from scETM import scETM, UnsupervisedTrainer, prepare_for_transfer +import anndata + +# Prepare the source dataset, Mouse Pancreas +mp = anndata.read_h5ad("MousePancreas.h5ad") +# Initialize model +model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True) +# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging. +trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results") +# Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches. 
+trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4) + +# Load the target dataset, Human Pancreas +hp = anndata.read_h5ad('HumanPancreas.h5ad') +# Align the source dataset's gene names (which are mouse genes) to the target dataset (which are human genes) +mp_genes = mp.var_names.str.upper() +mp_genes.drop_duplicates(inplace=True) +# Generate a new model and a modified dataset from the previously trained model and the mp_genes +model, hp = prepare_for_transfer(model, hp, mp_genes, + keep_tgt_unique_genes=True, # Keep target-unique genes in the model and the target dataset + fix_shared_genes=True # Fix parameters related to shared genes in the model +) +# Instantiate another trainer to fine-tune the model +trainer = UnsupervisedTrainer(model, hp, train_instance_name="HP_all_fix", ckpt_dir="../results", init_lr=5e-4) +trainer.train(n_epochs=800, eval_every=200) +``` + +### Tensorboard Integration +If a Tensorboard SummaryWriter is passed to the `writer` argument of the `UnsupervisedTrainer.train` method, the package will store. + +## 4 Benchmarking +The commands used for running [Harmony](https://github.com/immunogenomics/harmony), [Scanorama](https://github.com/brianhie/scanorama), [Seurat](https://satijalab.org/seurat/), [scVAE-GM](https://github.com/scvae/scvae), [scVI](https://github.com/YosefLab/scvi-tools), [LIGER](https://github.com/welch-lab/liger), [scVI-LD](https://www.biorxiv.org/content/10.1101/737601v1.full.pdf) are available in the [scripts](/scripts) folder. + + + + +%package -n python3-scETM +Summary: Single cell embedded topic model for integrated scRNA-seq data analysis. +Provides: python-scETM +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-scETM +# scETM: single-cell Embedded Topic Model +A generative topic model that facilitates integrative analysis of large-scale single-cell RNA sequencing data. 
+ +The full description of scETM and its application on published single cell RNA-seq datasets are available [here](https://www.biorxiv.org/content/10.1101/2021.01.13.426593v1). + +This repository includes detailed instructions for installation and requirements, demos, and scripts used for the benchmarking of 7 other state-of-art methods. + + +## Contents ## + +- [scETM: single-cell Embedded Topic Model](#scetm-single-cell-embedded-topic-model) + - [Contents](#contents) + - [1 Model Overview](#1-model-overview) + - [2 Installation](#2-installation) + - [3 Usage](#3-usage) + - [Data format](#data-format) + - [A taste of scETM](#a-taste-of-scetm) + - [p-scETM](#p-scetm) + - [Transfer learning](#transfer-learning) + - [Tensorboard Integration](#tensorboard-integration) + - [4 Benchmarking](#4-benchmarking) + +## 1 Model Overview + + +**(a)** Probabilistic graphical model of scETM. We model the scRNA-profile read count matrix y<sub>d,g</sub> in cell d and gene g across S subjects or studies by a multinomial distribution with the rate parameterized by cell topic mixture θ, topic embedding α, gene embedding ρ, and batch effects λ. **(b)** Matrix factorization view of scETM. **(c)** Encoder architecture for inferring the cell topic mixture θ. + +## 2 Installation +Python version: 3.7+ +scETM is included in PyPI, so you can install it by + +```bash +pip install scETM +``` + +To enable GPU computing (which significantly boosts the performance), please install [PyTorch](https://pytorch.org/) with GPU support **before** installing scETM. + +## 3 Usage +**A step-by-step scETM tutorial can be found in [here](/notebooks/scETM%20introductory%20tutorial.ipynb).** + +### Data format +scETM requires a cells-by-genes matrix `adata` as input, in the format of an AnnData object. Detailed description about AnnData can be found [here](https://anndata.readthedocs.io/en/latest/). 
+ +By default, scETM looks for batch information in the 'batch_indices' column of the `adata.obs` DataFrame, and cell type identity in the 'cell_types' column. If your data stores the batch and cell type information in different columns, pass them to the `batch_col` and `cell_type_col` arguments, respectively, when calling scETM functions. + +### A taste of scETM + +```python +from scETM import scETM, UnsupervisedTrainer, evaluate +import anndata + +# Prepare the source dataset, Mouse Pancreas +mp = anndata.read_h5ad("MousePancreas.h5ad") +# Initialize model +model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True) +# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging. +trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results") +# Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches. +trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4) +# Obtain scETM cell, gene and topic embeddings. Unnormalized cell embeddings will be stored at mp.obsm['delta'], normalized cell embeddings at mp.obsm['theta'], gene embeddings at mp.varm['rho'], topic embeddings at mp.uns['alpha']. +model.get_all_embeddings_and_nll(mp) +# Evaluate the model and save the embedding plot +evaluate(mp, embedding_key="delta", plot_fname="scETM_MP", plot_dir="figures/scETM_MP") +``` + +### p-scETM +p-scETM is a variant of scETM where part or all of the the gene embedding matrix ρ is fixed to a pathways-by-genes matrix, which can be downloaded from the [pathDIP4 pathway database](http://ophid.utoronto.ca/pathDIP/Download.jsp). We only keep pathways that contain more than 5 genes. + +If it is desired to fix the gene embedding matrix ρ during training, let trainable_gene_emb_dim be zero. 
In this case, the gene set used to train the model would be the intersection of the genes in the scRNA-seq data and the genes in the gene-by-pathway matrix. Otherwise, if trainable_gene_emb_dim is set to a positive value, all the genes in the scRNA-seq data would be kept. + +### Transfer learning + +```python +from scETM import scETM, UnsupervisedTrainer, prepare_for_transfer +import anndata + +# Prepare the source dataset, Mouse Pancreas +mp = anndata.read_h5ad("MousePancreas.h5ad") +# Initialize model +model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True) +# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging. +trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results") +# Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches. +trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4) + +# Load the target dataset, Human Pancreas +hp = anndata.read_h5ad('HumanPancreas.h5ad') +# Align the source dataset's gene names (which are mouse genes) to the target dataset (which are human genes) +mp_genes = mp.var_names.str.upper() +mp_genes.drop_duplicates(inplace=True) +# Generate a new model and a modified dataset from the previously trained model and the mp_genes +model, hp = prepare_for_transfer(model, hp, mp_genes, + keep_tgt_unique_genes=True, # Keep target-unique genes in the model and the target dataset + fix_shared_genes=True # Fix parameters related to shared genes in the model +) +# Instantiate another trainer to fine-tune the model +trainer = UnsupervisedTrainer(model, hp, train_instance_name="HP_all_fix", ckpt_dir="../results", init_lr=5e-4) +trainer.train(n_epochs=800, eval_every=200) +``` + +### Tensorboard Integration +If a Tensorboard SummaryWriter is passed to the `writer` argument of the `UnsupervisedTrainer.train` method, the package will store. 
+ +## 4 Benchmarking +The commands used for running [Harmony](https://github.com/immunogenomics/harmony), [Scanorama](https://github.com/brianhie/scanorama), [Seurat](https://satijalab.org/seurat/), [scVAE-GM](https://github.com/scvae/scvae), [scVI](https://github.com/YosefLab/scvi-tools), [LIGER](https://github.com/welch-lab/liger), [scVI-LD](https://www.biorxiv.org/content/10.1101/737601v1.full.pdf) are available in the [scripts](/scripts) folder. + + + + +%package help +Summary: Development documents and examples for scETM +Provides: python3-scETM-doc +%description help +# scETM: single-cell Embedded Topic Model +A generative topic model that facilitates integrative analysis of large-scale single-cell RNA sequencing data. + +The full description of scETM and its application on published single cell RNA-seq datasets are available [here](https://www.biorxiv.org/content/10.1101/2021.01.13.426593v1). + +This repository includes detailed instructions for installation and requirements, demos, and scripts used for the benchmarking of 7 other state-of-art methods. + + +## Contents ## + +- [scETM: single-cell Embedded Topic Model](#scetm-single-cell-embedded-topic-model) + - [Contents](#contents) + - [1 Model Overview](#1-model-overview) + - [2 Installation](#2-installation) + - [3 Usage](#3-usage) + - [Data format](#data-format) + - [A taste of scETM](#a-taste-of-scetm) + - [p-scETM](#p-scetm) + - [Transfer learning](#transfer-learning) + - [Tensorboard Integration](#tensorboard-integration) + - [4 Benchmarking](#4-benchmarking) + +## 1 Model Overview + + +**(a)** Probabilistic graphical model of scETM. We model the scRNA-profile read count matrix y<sub>d,g</sub> in cell d and gene g across S subjects or studies by a multinomial distribution with the rate parameterized by cell topic mixture θ, topic embedding α, gene embedding ρ, and batch effects λ. **(b)** Matrix factorization view of scETM. **(c)** Encoder architecture for inferring the cell topic mixture θ. 
+ +## 2 Installation +Python version: 3.7+ +scETM is included in PyPI, so you can install it by + +```bash +pip install scETM +``` + +To enable GPU computing (which significantly boosts the performance), please install [PyTorch](https://pytorch.org/) with GPU support **before** installing scETM. + +## 3 Usage +**A step-by-step scETM tutorial can be found in [here](/notebooks/scETM%20introductory%20tutorial.ipynb).** + +### Data format +scETM requires a cells-by-genes matrix `adata` as input, in the format of an AnnData object. Detailed description about AnnData can be found [here](https://anndata.readthedocs.io/en/latest/). + +By default, scETM looks for batch information in the 'batch_indices' column of the `adata.obs` DataFrame, and cell type identity in the 'cell_types' column. If your data stores the batch and cell type information in different columns, pass them to the `batch_col` and `cell_type_col` arguments, respectively, when calling scETM functions. + +### A taste of scETM + +```python +from scETM import scETM, UnsupervisedTrainer, evaluate +import anndata + +# Prepare the source dataset, Mouse Pancreas +mp = anndata.read_h5ad("MousePancreas.h5ad") +# Initialize model +model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True) +# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging. +trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results") +# Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches. +trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4) +# Obtain scETM cell, gene and topic embeddings. Unnormalized cell embeddings will be stored at mp.obsm['delta'], normalized cell embeddings at mp.obsm['theta'], gene embeddings at mp.varm['rho'], topic embeddings at mp.uns['alpha']. 
+model.get_all_embeddings_and_nll(mp) +# Evaluate the model and save the embedding plot +evaluate(mp, embedding_key="delta", plot_fname="scETM_MP", plot_dir="figures/scETM_MP") +``` + +### p-scETM +p-scETM is a variant of scETM where part or all of the the gene embedding matrix ρ is fixed to a pathways-by-genes matrix, which can be downloaded from the [pathDIP4 pathway database](http://ophid.utoronto.ca/pathDIP/Download.jsp). We only keep pathways that contain more than 5 genes. + +If it is desired to fix the gene embedding matrix ρ during training, let trainable_gene_emb_dim be zero. In this case, the gene set used to train the model would be the intersection of the genes in the scRNA-seq data and the genes in the gene-by-pathway matrix. Otherwise, if trainable_gene_emb_dim is set to a positive value, all the genes in the scRNA-seq data would be kept. + +### Transfer learning + +```python +from scETM import scETM, UnsupervisedTrainer, prepare_for_transfer +import anndata + +# Prepare the source dataset, Mouse Pancreas +mp = anndata.read_h5ad("MousePancreas.h5ad") +# Initialize model +model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True) +# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging. +trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results") +# Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches. 
+trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4) + +# Load the target dataset, Human Pancreas +hp = anndata.read_h5ad('HumanPancreas.h5ad') +# Align the source dataset's gene names (which are mouse genes) to the target dataset (which are human genes) +mp_genes = mp.var_names.str.upper() +mp_genes.drop_duplicates(inplace=True) +# Generate a new model and a modified dataset from the previously trained model and the mp_genes +model, hp = prepare_for_transfer(model, hp, mp_genes, + keep_tgt_unique_genes=True, # Keep target-unique genes in the model and the target dataset + fix_shared_genes=True # Fix parameters related to shared genes in the model +) +# Instantiate another trainer to fine-tune the model +trainer = UnsupervisedTrainer(model, hp, train_instance_name="HP_all_fix", ckpt_dir="../results", init_lr=5e-4) +trainer.train(n_epochs=800, eval_every=200) +``` + +### Tensorboard Integration +If a Tensorboard SummaryWriter is passed to the `writer` argument of the `UnsupervisedTrainer.train` method, the package will store. + +## 4 Benchmarking +The commands used for running [Harmony](https://github.com/immunogenomics/harmony), [Scanorama](https://github.com/brianhie/scanorama), [Seurat](https://satijalab.org/seurat/), [scVAE-GM](https://github.com/scvae/scvae), [scVI](https://github.com/YosefLab/scvi-tools), [LIGER](https://github.com/welch-lab/liger), [scVI-LD](https://www.biorxiv.org/content/10.1101/737601v1.full.pdf) are available in the [scripts](/scripts) folder. 
+ + + + +%prep +%autosetup -n scETM-0.5.0 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-scETM -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Mon May 29 2023 Python_Bot <Python_Bot@openeuler.org> - 0.5.0-1 +- Package Spec generated |
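The transfer-learning example embedded in the package description aligns mouse gene names to human ones by upper-casing them and dropping duplicates before calling `prepare_for_transfer`. The following is a minimal plain-Python sketch of that alignment step only, not scETM's implementation: the helper `align_gene_names` and the toy gene lists are hypothetical and exist purely for illustration. In current pandas releases, `Index.drop_duplicates` returns a new index rather than modifying it in place, so the sketch likewise builds a new list instead of mutating its input.

```python
def align_gene_names(source_genes, target_genes):
    """Upper-case source gene names, drop duplicates (keeping the first
    occurrence), and split the target genes into shared and target-only.

    Hypothetical helper for illustration; not part of the scETM API.
    """
    seen = set()
    aligned = []
    for gene in source_genes:
        name = gene.upper()
        if name not in seen:
            seen.add(name)
            aligned.append(name)
    # Genes present in both datasets after alignment.
    shared = [g for g in target_genes if g in seen]
    # Genes that exist only in the target dataset.
    target_only = [g for g in target_genes if g not in seen]
    return aligned, shared, target_only


# Toy data (assumed, not taken from the MousePancreas/HumanPancreas files).
mouse = ["Ins1", "Ins2", "Gcg", "INS1", "Sst"]
human = ["INS", "GCG", "SST", "PPY"]

aligned, shared, target_only = align_gene_names(mouse, human)
print(aligned)
print(shared)
print(target_only)
```

In the description's example, the shared genes are the ones whose parameters `fix_shared_genes=True` keeps fixed, and the target-only genes are the ones `keep_tgt_unique_genes=True` retains in the model and dataset.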
