%global _empty_manifest_terminate_build 0
Name:		python-cap-genomics
Version:	0.1.40
Release:	1
Summary:	Cohort Analysis Platform
License:	MIT
URL:		https://github.com/ArashLab/CAP
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/e9/ac/fe6e5ecbaff8c6f51886a7bb1fd2b20b1c557fba41d1cf800daed27b2058/cap-genomics-0.1.40.tar.gz
BuildArch:	noarch

Requires:	python3-hail
Requires:	python3-munch
Requires:	python3-jsonschema
Requires:	python3-pyarrow
Requires:	python3-fastparquet

%description
# Cohort Analysis Platform (CAP)
## What is CAP?
In short, CAP automates some of the common genomic processing pipelines and facilitates analysis of the outcomes by creating rich reports. As a simple example, consider a genomic case-control study, also known as a Genome-Wide Association Study (GWAS). Here is the simplest workflow you can imagine:

1. Get the data into the format supported by the software you are using.
2. Compute quality metrics for samples and variants (SNPs), and then prune (clean up) the data.
3. Perform Principal Component Analysis (PCA), create plots, and make sure the case and control groups are not separated in the plot.
4. Perform logistic regression or other statistical tests to measure the strength of association for each variant.
5. Create a Manhattan plot and dive into regions with strong associations to find a convincing argument.

Unfortunately, the simple pipeline above may not reveal the answer in one go. You may need to iterate through it many times, changing the parameters and adding more and more steps, such as relatedness checks. If you are unlucky and still cannot find the answer, you need to try a different pipeline, such as rare-variant, polygenic, epistatic, or complex-disease analysis, each of which requires many iterations to be tuned for your data.

You may run tens or hundreds of experiments with your datasets, which makes it very difficult to keep track of everything and manage all the data you have produced.

You also have to deal with a chicken-and-egg paradox. If you don't perform an analysis to perfection (i.e. create the most effective report and visualisation of the outcome), you may miss important information your pipeline is capable of discovering. On the other hand, you may not have enough time for such perfection, as many of the analyses simply fail and you need to move on quickly. But then how can you be sure an analysis is useless if you haven't done it with perfection?

That is why **we feel the necessity of automating widely-used genomic workflows**. CAP is our response to this necessity. CAP performs a wide range of analyses on your cohort and provides you with comprehensive reports. By studying these reports you gain an in-depth understanding of your data and can plan your research more effectively by focusing on the analyses that seem promising. **CAP cannot give you the ultimate answer, but it helps you move in the right direction towards the answer.**

## CAP, Hail and Spark (Also non-Spark)
[Hail](https://hail.is/) is a Python library for the analysis of genomic data (and more) with an extensive list of functionalities. Hail is built on top of [Apache Spark](https://spark.apache.org/), which allows data to be processed on a cluster of computers (or a single computer if you wish). That means 100 GB of data is no longer a big deal. **CAP uses Hail as its main analysis platform.** However, CAP is not limited to Hail and can integrate other tools, whether they are implemented in Spark (e.g. [Glow](https://glow.readthedocs.io/en/latest/blogs/glowgr-blog/glowgr-blog.html)) or are ordinary software (e.g. [plink](https://www.cog-genomics.org/plink/), [bcftools](http://samtools.github.io/bcftools/bcftools.html) and [VEP](https://asia.ensembl.org/info/docs/tools/vep/script/index.html)).

## How does CAP work?
All the processing steps and their parameters are described in a workload file (YAML or JSON). Reading genotype data from a VCF file, splitting multi-allelic sites, and computing Hardy-Weinberg statistics are three example steps in a workload. Each processing step is linked to one of the functions implemented in CAP. You can change the parameters and reorder, add, or delete steps. But if you want to add a novel step (e.g. imputing sex with a new algorithm), you need to implement its logic in the CAP source code first.
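The three example steps above might be expressed in a workload file roughly like this. This is a hypothetical sketch only: the step and parameter names below are illustrative, not CAP's actual workload schema (see the project documentation for the real format).

```yaml
# Hypothetical workload sketch -- step and parameter names are
# illustrative, not CAP's actual schema.
steps:
  - name: importGenotypes
    params:
      format: vcf
      path: input/genotypes.vcf.gz
  - name: splitMultiAllelic
  - name: hardyWeinberg
    params:
      minPValue: 1e-6
```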

**CAP reads the workload file and executes every single step.** Note that CAP is designed to automate some of the standard and routine pipelines and is not as flexible as working with Hail or any other tools directly. Yet we try to provide as much flexibility as possible through parameters.
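To make the idea concrete, here is a minimal sketch of a workload-driven executor in plain Python. This is not CAP's actual implementation or API; the step names, the functions, and the dispatch table are hypothetical stand-ins for the functions implemented in CAP.

```python
# Minimal sketch of a workload-driven executor (hypothetical, not CAP's API).
import json

# Stand-in step implementations; in CAP these would call Hail or other tools.
def import_genotypes(path="input.vcf"):
    return f"loaded {path}"

def split_multi_allelic():
    return "split multi-allelic sites"

def hardy_weinberg(min_p_value=1e-6):
    return f"HWE filter at p >= {min_p_value}"

# Each step name in the workload maps to one implemented function.
STEP_REGISTRY = {
    "importGenotypes": import_genotypes,
    "splitMultiAllelic": split_multi_allelic,
    "hardyWeinberg": hardy_weinberg,
}

def run_workload(workload_text):
    """Parse a JSON workload and execute every step in order."""
    workload = json.loads(workload_text)
    results = []
    for step in workload["steps"]:
        func = STEP_REGISTRY[step["name"]]
        # Pass the step's parameters (if any) as keyword arguments.
        results.append(func(**step.get("params", {})))
    return results

workload = """
{"steps": [
  {"name": "importGenotypes", "params": {"path": "genotypes.vcf"}},
  {"name": "splitMultiAllelic"},
  {"name": "hardyWeinberg", "params": {"min_p_value": 1e-6}}
]}
"""
print(run_workload(workload))
```

The point of the sketch is the dispatch pattern: the workload file fixes the order and parameters of the steps, while the logic of each step lives in the source code, which is why a genuinely new step requires a code change.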

## Installation
You can simply install CAP using pip.
```bash
pip install cap-genomics
```
We strongly recommend doing so in a virtual environment (e.g. using conda).
```bash
conda create --name capenv python=3.9
conda activate capenv
pip install cap-genomics
```
The above installation **does not** include the example files used in our [tutorial page on GitHub](https://github.com/ArashLab/CAP/blob/main/TUTORIAL.md). To run the examples you need to clone this repository.
```bash
git clone https://github.com/ArashLab/CAP.git
```

Note that Hail is a requirement of CAP, so when you install CAP using pip, Hail is also installed on your system. However, Hail has some dependencies which must be installed prior to the pip installation. See [here](https://hail.is/docs/0.2/getting_started.html#installing-hail) for more details.

## Documentation
See our [documentation page on GitHub](https://github.com/ArashLab/CAP/blob/main/DOCUMENTATION.md)

## Tutorial
See our [tutorial page on GitHub](https://github.com/ArashLab/CAP/blob/main/TUTORIAL.md) for step-by-step examples.







%package -n python3-cap-genomics
Summary:	Cohort Analysis Platform
Provides:	python-cap-genomics
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-cap-genomics
# Cohort Analysis Platform (CAP)
## What is CAP?
In short, CAP automates some of the common genomic processing pipelines and facilitates analysis of the outcomes by creating rich reports. As a simple example, consider a genomic case-control study, also known as a Genome-Wide Association Study (GWAS). Here is the simplest workflow you can imagine:

1. Get the data into the format supported by the software you are using.
2. Compute quality metrics for samples and variants (SNPs), and then prune (clean up) the data.
3. Perform Principal Component Analysis (PCA), create plots, and make sure the case and control groups are not separated in the plot.
4. Perform logistic regression or other statistical tests to measure the strength of association for each variant.
5. Create a Manhattan plot and dive into regions with strong associations to find a convincing argument.

Unfortunately, the simple pipeline above may not reveal the answer in one go. You may need to iterate through it many times, changing the parameters and adding more and more steps, such as relatedness checks. If you are unlucky and still cannot find the answer, you need to try a different pipeline, such as rare-variant, polygenic, epistatic, or complex-disease analysis, each of which requires many iterations to be tuned for your data.

You may run tens or hundreds of experiments with your datasets, which makes it very difficult to keep track of everything and manage all the data you have produced.

You also have to deal with a chicken-and-egg paradox. If you don't perform an analysis to perfection (i.e. create the most effective report and visualisation of the outcome), you may miss important information your pipeline is capable of discovering. On the other hand, you may not have enough time for such perfection, as many of the analyses simply fail and you need to move on quickly. But then how can you be sure an analysis is useless if you haven't done it with perfection?

That is why **we feel the necessity of automating widely-used genomic workflows**. CAP is our response to this necessity. CAP performs a wide range of analyses on your cohort and provides you with comprehensive reports. By studying these reports you gain an in-depth understanding of your data and can plan your research more effectively by focusing on the analyses that seem promising. **CAP cannot give you the ultimate answer, but it helps you move in the right direction towards the answer.**

## CAP, Hail and Spark (Also non-Spark)
[Hail](https://hail.is/) is a Python library for the analysis of genomic data (and more) with an extensive list of functionalities. Hail is built on top of [Apache Spark](https://spark.apache.org/), which allows data to be processed on a cluster of computers (or a single computer if you wish). That means 100 GB of data is no longer a big deal. **CAP uses Hail as its main analysis platform.** However, CAP is not limited to Hail and can integrate other tools, whether they are implemented in Spark (e.g. [Glow](https://glow.readthedocs.io/en/latest/blogs/glowgr-blog/glowgr-blog.html)) or are ordinary software (e.g. [plink](https://www.cog-genomics.org/plink/), [bcftools](http://samtools.github.io/bcftools/bcftools.html) and [VEP](https://asia.ensembl.org/info/docs/tools/vep/script/index.html)).

## How does CAP work?
All the processing steps and their parameters are described in a workload file (YAML or JSON). Reading genotype data from a VCF file, splitting multi-allelic sites, and computing Hardy-Weinberg statistics are three example steps in a workload. Each processing step is linked to one of the functions implemented in CAP. You can change the parameters and reorder, add, or delete steps. But if you want to add a novel step (e.g. imputing sex with a new algorithm), you need to implement its logic in the CAP source code first.

**CAP reads the workload file and executes every single step.** Note that CAP is designed to automate some of the standard and routine pipelines and is not as flexible as working with Hail or any other tools directly. Yet we try to provide as much flexibility as possible through parameters.

## Installation
You can simply install CAP using pip.
```bash
pip install cap-genomics
```
We strongly recommend doing so in a virtual environment (e.g. using conda).
```bash
conda create --name capenv python=3.9
conda activate capenv
pip install cap-genomics
```
The above installation **does not** include the example files used in our [tutorial page on GitHub](https://github.com/ArashLab/CAP/blob/main/TUTORIAL.md). To run the examples you need to clone this repository.
```bash
git clone https://github.com/ArashLab/CAP.git
```

Note that Hail is a requirement of CAP, so when you install CAP using pip, Hail is also installed on your system. However, Hail has some dependencies which must be installed prior to the pip installation. See [here](https://hail.is/docs/0.2/getting_started.html#installing-hail) for more details.

## Documentation
See our [documentation page on GitHub](https://github.com/ArashLab/CAP/blob/main/DOCUMENTATION.md)

## Tutorial
See our [tutorial page on GitHub](https://github.com/ArashLab/CAP/blob/main/TUTORIAL.md) for step-by-step examples.







%package help
Summary:	Development documents and examples for cap-genomics
Provides:	python3-cap-genomics-doc
%description help
# Cohort Analysis Platform (CAP)
## What is CAP?
In short, CAP automates some of the common genomic processing pipelines and facilitates analysis of the outcomes by creating rich reports. As a simple example, consider a genomic case-control study, also known as a Genome-Wide Association Study (GWAS). Here is the simplest workflow you can imagine:

1. Get the data into the format supported by the software you are using.
2. Compute quality metrics for samples and variants (SNPs), and then prune (clean up) the data.
3. Perform Principal Component Analysis (PCA), create plots, and make sure the case and control groups are not separated in the plot.
4. Perform logistic regression or other statistical tests to measure the strength of association for each variant.
5. Create a Manhattan plot and dive into regions with strong associations to find a convincing argument.

Unfortunately, the simple pipeline above may not reveal the answer in one go. You may need to iterate through it many times, changing the parameters and adding more and more steps, such as relatedness checks. If you are unlucky and still cannot find the answer, you need to try a different pipeline, such as rare-variant, polygenic, epistatic, or complex-disease analysis, each of which requires many iterations to be tuned for your data.

You may run tens or hundreds of experiments with your datasets, which makes it very difficult to keep track of everything and manage all the data you have produced.

You also have to deal with a chicken-and-egg paradox. If you don't perform an analysis to perfection (i.e. create the most effective report and visualisation of the outcome), you may miss important information your pipeline is capable of discovering. On the other hand, you may not have enough time for such perfection, as many of the analyses simply fail and you need to move on quickly. But then how can you be sure an analysis is useless if you haven't done it with perfection?

That is why **we feel the necessity of automating widely-used genomic workflows**. CAP is our response to this necessity. CAP performs a wide range of analyses on your cohort and provides you with comprehensive reports. By studying these reports you gain an in-depth understanding of your data and can plan your research more effectively by focusing on the analyses that seem promising. **CAP cannot give you the ultimate answer, but it helps you move in the right direction towards the answer.**

## CAP, Hail and Spark (Also non-Spark)
[Hail](https://hail.is/) is a Python library for the analysis of genomic data (and more) with an extensive list of functionalities. Hail is built on top of [Apache Spark](https://spark.apache.org/), which allows data to be processed on a cluster of computers (or a single computer if you wish). That means 100 GB of data is no longer a big deal. **CAP uses Hail as its main analysis platform.** However, CAP is not limited to Hail and can integrate other tools, whether they are implemented in Spark (e.g. [Glow](https://glow.readthedocs.io/en/latest/blogs/glowgr-blog/glowgr-blog.html)) or are ordinary software (e.g. [plink](https://www.cog-genomics.org/plink/), [bcftools](http://samtools.github.io/bcftools/bcftools.html) and [VEP](https://asia.ensembl.org/info/docs/tools/vep/script/index.html)).

## How does CAP work?
All the processing steps and their parameters are described in a workload file (YAML or JSON). Reading genotype data from a VCF file, splitting multi-allelic sites, and computing Hardy-Weinberg statistics are three example steps in a workload. Each processing step is linked to one of the functions implemented in CAP. You can change the parameters and reorder, add, or delete steps. But if you want to add a novel step (e.g. imputing sex with a new algorithm), you need to implement its logic in the CAP source code first.

**CAP reads the workload file and executes every single step.** Note that CAP is designed to automate some of the standard and routine pipelines and is not as flexible as working with Hail or any other tools directly. Yet we try to provide as much flexibility as possible through parameters.

## Installation
You can simply install CAP using pip.
```bash
pip install cap-genomics
```
We strongly recommend doing so in a virtual environment (e.g. using conda).
```bash
conda create --name capenv python=3.9
conda activate capenv
pip install cap-genomics
```
The above installation **does not** include the example files used in our [tutorial page on GitHub](https://github.com/ArashLab/CAP/blob/main/TUTORIAL.md). To run the examples you need to clone this repository.
```bash
git clone https://github.com/ArashLab/CAP.git
```

Note that Hail is a requirement of CAP, so when you install CAP using pip, Hail is also installed on your system. However, Hail has some dependencies which must be installed prior to the pip installation. See [here](https://hail.is/docs/0.2/getting_started.html#installing-hail) for more details.

## Documentation
See our [documentation page on GitHub](https://github.com/ArashLab/CAP/blob/main/DOCUMENTATION.md)

## Tutorial
See our [tutorial page on GitHub](https://github.com/ArashLab/CAP/blob/main/TUTORIAL.md) for step-by-step examples.







%prep
%autosetup -n cap-genomics-0.1.40

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-cap-genomics -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Thu May 18 2023 Python_Bot <Python_Bot@openeuler.org> - 0.1.40-1
- Package Spec generated