
%global _empty_manifest_terminate_build 0
Name:		python-splicejunxchx
Version:	2.8
Release:	1
Summary:	Characterize the splice junctions reported in a STAR SJ.out.tab file
License:	MIT
URL:		https://github.com/ayushkumar-umms/splice-junction-characterization
Source0:	https://mirrors.aliyun.com/pypi/web/packages/37/7a/288af759a73c8968d4e213220722ee32c1155ed14813b9f5b16195b2579f/splicejunxchx-2.8.tar.gz
BuildArch:	noarch


%description

# splicejunxchx

Splicejunxchx is a Python pipeline that takes the splice junctions output by STAR (SJ.out.tab) and a GTF file and characterizes the 5' and 3' splice
sites of each junction.

The pipeline includes the following capabilities:
- Determine if 5'/3' end is in a gene, transcript, exon, intron, 5'UTR, CDS, 3'UTR, start codon, or stop codon
- Determine if 5'/3' splice site (ss) is in a constitutive exon or intron
- Determine if 5'/3' end is annotated based on information in the GTF file
- Find the closest ss upstream and downstream of the 5' and 3' ss of the analyzed junction

Additional capabilities with required dependencies:
- The 51 bases centered on each splice site (needs bedtools)
- The 2 bases of the 5'/3' ss (needs bedtools)
- maxEnt score (needs the maxEnt Perl files to be downloaded)
- An average phyloP score over N nucleotides around each splice site (needs bigWigToBedGraph)


## Installation

First, you must have Python >= 3.6, pandas >= 0.23.4, and gtf2csv (see below):

```bash
pip install git+https://github.com/zyxue/gtf2csv.git#egg=gtf2csv
```

See the following websites to get bedtools, bigWigToBedGraph, and the maxEnt scoring files:
- [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)
- [bigWigToBedGraph](http://hgdownload.cse.ucsc.edu/admin/exe/)
- For [maxEnt](http://hollywood.mit.edu/burgelab/maxent/download/), make sure you download fordownload.tar.gz and put score5.pl, score3.pl, me2x5, and the directory splicemodels in the root where you plan to run splicejunxchx

Now, install splicejunxchx using:

```bash
pip install splicejunxchx
```

## Usage

We suggest creating a 'data' directory in the root directory where you plan to run this code. The data directory stores temporary files, including two CSV files generated from the GTF file: one listing all splice junctions and one listing constitutive exons.

The full set of options accepted by splicejunxchx is:

```bash
splicejunxchx -h [-seqs SEQUENCE_FILE] [-supp SUPPORT_FILES SUPPORT_FILES] [-phyloP PHYLOPSCORES PHYLOPSCORES] [-maxEnt] inputs inputs output_file
```

There are several ways to utilize this pipeline. First, the basic way is to input the gtf.gz file and the SJ.out.tab file and name the output file. This will output the splice junctions with basic information regarding where each splice site lies according to the GTF and where the other closest splice sites are located. To run this command:

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv
```

If you are interested in adding sequence information, you must have bedtools installed (with getfasta function) and then add the .fa file after -seq:

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -seq data/Homo_sapiens.GRCh38.95.fa 
```

To reduce run time, support files for reported splice junctions and constitutive exons can be provided to splicejunxchx if available.
- The reported splice junctions file must be a CSV with the following columns: [seqname,start,end,strand]
- The constitutive exons file must be a CSV with the following columns: [seqname,start,end,strand,exon_id,gene_id]
- The pipeline generates both files on the first run, so they can be reused on later runs that use the same GTF file with different splice junctions

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -supp data/all_splice_junctions.csv data/cons_exons.csv
```

To include maxEnt scores, place score5.pl, score3.pl, me2x5, and the directory splicemodels in the root where you plan to run splicejunxchx, and add the -maxEnt flag.

Lastly, to incorporate phyloP scores, the -phyloP option takes two inputs: the phyloP score file as a bigWig (.bw), and the number of nucleotides around each splice site for which individual phyloP scores are requested. This number must be even and cannot exceed 200.

```bash
splicejunxchx raw/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -phyloP data/hg38.phyloP.bw 20
```


## Notes

### General notes about needed files and output structure

Make sure that the GTF file provided does not list intron locations. This pipeline assumes that the only features present in the GTF file are: gene, transcript, exon, five_prime_utr, CDS, three_prime_utr, start_codon, stop_codon, and Selenocysteine.

For splice junctions with an unidentified strand (strand = 0), the pipeline creates two copies of the junction, setting strand = 1 for one copy and strand = 2 for the other.
- EX: If JNC92 has a strand of 0
    - The pipeline creates two junctions called JNC92.1 (strand = 1) and JNC92.2 (strand = 2)
- To find the splice junctions that were originally strand = 0, search for junctions whose 'unidentified_strand' column is set to 1

### Columns in Output File 

The following columns are described in more detail:

- Motif: [0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT]
- STAR_annotation: Both the 5' and 3' splice sites are annotated as one splice junction according to STAR
- Unidentified_strand: The splice junction originally had an undefined strand (strand = 0); this copy assumes the positive or negative strand (see the 'strand' column for the assumption)
- 5'_in_constitutiveexon: Name of the gene followed by the coordinates, else NA
- 5'_in_constitutiveintron: Name of the gene followed by the coordinates, else NA
- 5'_in_CDS: Whether the 5' end is in a coding sequence region
- 5'phyloPscore: Average score over the N nucleotides around the splice site
- 5'phylopList: List of phyloP values ordered from lowest to highest coordinate
- 5'bases_maxEnt and 3'bases_maxEnt: The sequence needed to compute a maxEnt score
- The corresponding 3' columns follow the same logic


## Acknowledgments
- Athma Pai and Eraj Khokhar for guidance and support
- Zyxue for gtf2csv: https://github.com/zyxue/gtf2csv
- Yeo G and Burge C.B., Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals, RECOMB 2003 (Journal Comp. Bio in press)


%package -n python3-splicejunxchx
Summary:	Characterize the splice junctions reported in a STAR SJ.out.tab file
Provides:	python-splicejunxchx
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-splicejunxchx

# splicejunxchx

Splicejunxchx is a Python pipeline that takes the splice junctions output by STAR (SJ.out.tab) and a GTF file and characterizes the 5' and 3' splice
sites of each junction.

The pipeline includes the following capabilities:
- Determine if 5'/3' end is in a gene, transcript, exon, intron, 5'UTR, CDS, 3'UTR, start codon, or stop codon
- Determine if 5'/3' splice site (ss) is in a constitutive exon or intron
- Determine if 5'/3' end is annotated based on information in the GTF file
- Find the closest ss upstream and downstream of the 5' and 3' ss of the analyzed junction

Additional capabilities with required dependencies:
- The 51 bases centered on each splice site (needs bedtools)
- The 2 bases of the 5'/3' ss (needs bedtools)
- maxEnt score (needs the maxEnt Perl files to be downloaded)
- An average phyloP score over N nucleotides around each splice site (needs bigWigToBedGraph)


## Installation

First, you must have Python >= 3.6, pandas >= 0.23.4, and gtf2csv (see below):

```bash
pip install git+https://github.com/zyxue/gtf2csv.git#egg=gtf2csv
```

See the following websites to get bedtools, bigWigToBedGraph, and the maxEnt scoring files:
- [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)
- [bigWigToBedGraph](http://hgdownload.cse.ucsc.edu/admin/exe/)
- For [maxEnt](http://hollywood.mit.edu/burgelab/maxent/download/), make sure you download fordownload.tar.gz and put score5.pl, score3.pl, me2x5, and the directory splicemodels in the root where you plan to run splicejunxchx

Now, install splicejunxchx using:

```bash
pip install splicejunxchx
```

## Usage

We suggest creating a 'data' directory in the root directory where you plan to run this code. The data directory stores temporary files, including two CSV files generated from the GTF file: one listing all splice junctions and one listing constitutive exons.

The full set of options accepted by splicejunxchx is:

```bash
splicejunxchx -h [-seqs SEQUENCE_FILE] [-supp SUPPORT_FILES SUPPORT_FILES] [-phyloP PHYLOPSCORES PHYLOPSCORES] [-maxEnt] inputs inputs output_file
```

There are several ways to utilize this pipeline. First, the basic way is to input the gtf.gz file and the SJ.out.tab file and name the output file. This will output the splice junctions with basic information regarding where each splice site lies according to the GTF and where the other closest splice sites are located. To run this command:

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv
```

If you are interested in adding sequence information, you must have bedtools installed (with getfasta function) and then add the .fa file after -seq:

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -seq data/Homo_sapiens.GRCh38.95.fa 
```

To reduce run time, support files for reported splice junctions and constitutive exons can be provided to splicejunxchx if available.
- The reported splice junctions file must be a CSV with the following columns: [seqname,start,end,strand]
- The constitutive exons file must be a CSV with the following columns: [seqname,start,end,strand,exon_id,gene_id]
- The pipeline generates both files on the first run, so they can be reused on later runs that use the same GTF file with different splice junctions

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -supp data/all_splice_junctions.csv data/cons_exons.csv
```

To include maxEnt scores, place score5.pl, score3.pl, me2x5, and the directory splicemodels in the root where you plan to run splicejunxchx, and add the -maxEnt flag.

Lastly, to incorporate phyloP scores, the -phyloP option takes two inputs: the phyloP score file as a bigWig (.bw), and the number of nucleotides around each splice site for which individual phyloP scores are requested. This number must be even and cannot exceed 200.

```bash
splicejunxchx raw/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -phyloP data/hg38.phyloP.bw 20
```


## Notes

### General notes about needed files and output structure

Make sure that the GTF file provided does not list intron locations. This pipeline assumes that the only features present in the GTF file are: gene, transcript, exon, five_prime_utr, CDS, three_prime_utr, start_codon, stop_codon, and Selenocysteine.

For splice junctions with an unidentified strand (strand = 0), the pipeline creates two copies of the junction, setting strand = 1 for one copy and strand = 2 for the other.
- EX: If JNC92 has a strand of 0
    - The pipeline creates two junctions called JNC92.1 (strand = 1) and JNC92.2 (strand = 2)
- To find the splice junctions that were originally strand = 0, search for junctions whose 'unidentified_strand' column is set to 1

### Columns in Output File 

The following columns are described in more detail:

- Motif: [0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT]
- STAR_annotation: Both the 5' and 3' splice sites are annotated as one splice junction according to STAR
- Unidentified_strand: The splice junction originally had an undefined strand (strand = 0); this copy assumes the positive or negative strand (see the 'strand' column for the assumption)
- 5'_in_constitutiveexon: Name of the gene followed by the coordinates, else NA
- 5'_in_constitutiveintron: Name of the gene followed by the coordinates, else NA
- 5'_in_CDS: Whether the 5' end is in a coding sequence region
- 5'phyloPscore: Average score over the N nucleotides around the splice site
- 5'phylopList: List of phyloP values ordered from lowest to highest coordinate
- 5'bases_maxEnt and 3'bases_maxEnt: The sequence needed to compute a maxEnt score
- The corresponding 3' columns follow the same logic


## Acknowledgments
- Athma Pai and Eraj Khokhar for guidance and support
- Zyxue for gtf2csv: https://github.com/zyxue/gtf2csv
- Yeo G and Burge C.B., Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals, RECOMB 2003 (Journal Comp. Bio in press)


%package help
Summary:	Development documents and examples for splicejunxchx
Provides:	python3-splicejunxchx-doc
%description help

# splicejunxchx

Splicejunxchx is a Python pipeline that takes the splice junctions output by STAR (SJ.out.tab) and a GTF file and characterizes the 5' and 3' splice
sites of each junction.

The pipeline includes the following capabilities:
- Determine if 5'/3' end is in a gene, transcript, exon, intron, 5'UTR, CDS, 3'UTR, start codon, or stop codon
- Determine if 5'/3' splice site (ss) is in a constitutive exon or intron
- Determine if 5'/3' end is annotated based on information in the GTF file
- Find the closest ss upstream and downstream of the 5' and 3' ss of the analyzed junction

Additional capabilities with required dependencies:
- The 51 bases centered on each splice site (needs bedtools)
- The 2 bases of the 5'/3' ss (needs bedtools)
- maxEnt score (needs the maxEnt Perl files to be downloaded)
- An average phyloP score over N nucleotides around each splice site (needs bigWigToBedGraph)


## Installation

First, you must have Python >= 3.6, pandas >= 0.23.4, and gtf2csv (see below):

```bash
pip install git+https://github.com/zyxue/gtf2csv.git#egg=gtf2csv
```

See the following websites to get bedtools, bigWigToBedGraph, and the maxEnt scoring files:
- [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)
- [bigWigToBedGraph](http://hgdownload.cse.ucsc.edu/admin/exe/)
- For [maxEnt](http://hollywood.mit.edu/burgelab/maxent/download/), make sure you download fordownload.tar.gz and put score5.pl, score3.pl, me2x5, and the directory splicemodels in the root where you plan to run splicejunxchx

Now, install splicejunxchx using:

```bash
pip install splicejunxchx
```

## Usage

We suggest creating a 'data' directory in the root directory where you plan to run this code. The data directory stores temporary files, including two CSV files generated from the GTF file: one listing all splice junctions and one listing constitutive exons.

The full set of options accepted by splicejunxchx is:

```bash
splicejunxchx -h [-seqs SEQUENCE_FILE] [-supp SUPPORT_FILES SUPPORT_FILES] [-phyloP PHYLOPSCORES PHYLOPSCORES] [-maxEnt] inputs inputs output_file
```

There are several ways to utilize this pipeline. First, the basic way is to input the gtf.gz file and the SJ.out.tab file and name the output file. This will output the splice junctions with basic information regarding where each splice site lies according to the GTF and where the other closest splice sites are located. To run this command:

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv
```

If you are interested in adding sequence information, you must have bedtools installed (with getfasta function) and then add the .fa file after -seq:

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -seq data/Homo_sapiens.GRCh38.95.fa 
```

To reduce run time, support files for reported splice junctions and constitutive exons can be provided to splicejunxchx if available.
- The reported splice junctions file must be a CSV with the following columns: [seqname,start,end,strand]
- The constitutive exons file must be a CSV with the following columns: [seqname,start,end,strand,exon_id,gene_id]
- The pipeline generates both files on the first run, so they can be reused on later runs that use the same GTF file with different splice junctions

```bash
splicejunxchx raw_data/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -supp data/all_splice_junctions.csv data/cons_exons.csv
```
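
As a rough illustration of the column requirements for the two support files, the sketch below checks a CSV header before the file is passed to -supp. The helper `missing_columns` and the two constants are hypothetical names used only for this example; they are not part of splicejunxchx. Only the column names are taken from this description.

```python
import csv
import io

# Column names as listed in this description; the constant names are hypothetical.
JUNCTION_COLUMNS = ["seqname", "start", "end", "strand"]
EXON_COLUMNS = JUNCTION_COLUMNS + ["exon_id", "gene_id"]


def missing_columns(handle, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(handle), [])  # an empty file yields an empty header
    return [name for name in required if name not in header]


# In-memory stand-in for a support file; a real check would open the CSV instead.
example = io.StringIO("seqname,start,end,strand\n1,100,200,1\n")
missing = missing_columns(example, EXON_COLUMNS)
```

An empty list means the header has every required column; here `missing` holds the two exon-specific columns, since the example header only describes a junction file.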

To include maxEnt scores, place score5.pl, score3.pl, me2x5, and the directory splicemodels in the root where you plan to run splicejunxchx, and add the -maxEnt flag.

Lastly, to incorporate phyloP scores, the -phyloP option takes two inputs: the phyloP score file as a bigWig (.bw), and the number of nucleotides around each splice site for which individual phyloP scores are requested. This number must be even and cannot exceed 200.

```bash
splicejunxchx raw/Homo_sapiens.GRCh38.95.gtf.gz raw_data/ERR152SJ.out.tab output/final_splice_junc.csv -phyloP data/hg38.phyloP.bw 20
```
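
The constraint on the second -phyloP input (even, at most 200) can be checked before launching a long run. The following is a minimal sketch; `validate_phylop_window` is a hypothetical helper, not a function provided by splicejunxchx, and the requirement that the number be positive is an assumption added for the example.

```python
def validate_phylop_window(value):
    """Return the window size as an int, or raise ValueError if it is unusable."""
    n = int(value)
    if n <= 0:
        # Assumption for this sketch: a non-positive window is meaningless.
        raise ValueError("window must be a positive number of nucleotides")
    if n > 200:
        raise ValueError("window cannot be more than 200 nucleotides")
    if n & 1:
        # The lowest bit is set only for odd numbers.
        raise ValueError("window must be an even number")
    return n


window = validate_phylop_window("20")  # value as it would arrive from the command line
```

Values such as 201 or 21 raise `ValueError`, while 20 (the value used in the command above) passes.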


## Notes

### General notes about needed files and output structure

Make sure that the GTF file provided does not list intron locations. This pipeline assumes that the only features present in the GTF file are: gene, transcript, exon, five_prime_utr, CDS, three_prime_utr, start_codon, stop_codon, and Selenocysteine.

For splice junctions with an unidentified strand (strand = 0), the pipeline creates two copies of the junction, setting strand = 1 for one copy and strand = 2 for the other.
- EX: If JNC92 has a strand of 0
    - The pipeline creates two junctions called JNC92.1 (strand = 1) and JNC92.2 (strand = 2)
- To find the splice junctions that were originally strand = 0, search for junctions whose 'unidentified_strand' column is set to 1
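
The duplication of unstranded junctions can be pictured with the sketch below. It is not the pipeline's own code: the function `split_unstranded` and the `name` key are hypothetical, and setting `unidentified_strand` to 0 for junctions that already have a strand is an assumption (this description only states the value 1 for duplicated junctions).

```python
def split_unstranded(junctions):
    """Duplicate strand-0 junctions into a '.1' (strand 1) and a '.2' (strand 2) copy."""
    result = []
    for junction in junctions:
        if junction["strand"] == 0:
            for strand in (1, 2):
                copy = dict(junction)
                copy["name"] = f"{junction['name']}.{strand}"
                copy["strand"] = strand
                copy["unidentified_strand"] = 1
                result.append(copy)
        else:
            kept = dict(junction)
            kept["unidentified_strand"] = 0  # assumed value for stranded junctions
            result.append(kept)
    return result


junctions = [{"name": "JNC92", "strand": 0}, {"name": "JNC93", "strand": 1}]
expanded = split_unstranded(junctions)
names = [j["name"] for j in expanded]
```

With the JNC92 example from above, `names` contains JNC92.1, JNC92.2, and the untouched JNC93, in that order.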

### Columns in Output File 

The following columns are described in more detail:

- Motif: [0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT]
- STAR_annotation: Both the 5' and 3' splice sites are annotated as one splice junction according to STAR
- Unidentified_strand: The splice junction originally had an undefined strand (strand = 0); this copy assumes the positive or negative strand (see the 'strand' column for the assumption)
- 5'_in_constitutiveexon: Name of the gene followed by the coordinates, else NA
- 5'_in_constitutiveintron: Name of the gene followed by the coordinates, else NA
- 5'_in_CDS: Whether the 5' end is in a coding sequence region
- 5'phyloPscore: Average score over the N nucleotides around the splice site
- 5'phylopList: List of phyloP values ordered from lowest to highest coordinate
- 5'bases_maxEnt and 3'bases_maxEnt: The sequence needed to compute a maxEnt score
- The corresponding 3' columns follow the same logic
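
When reading the output file, the numeric Motif codes can be turned back into readable labels. The mapping below copies the codes listed for the Motif column; the names `MOTIF_LABELS` and `motif_label` are hypothetical and exist only in this sketch.

```python
# Codes and labels as listed for the Motif column in this description.
MOTIF_LABELS = {
    0: "non-canonical",
    1: "GT/AG",
    2: "CT/AC",
    3: "GC/AG",
    4: "CT/GC",
    5: "AT/AC",
    6: "GT/AT",
}


def motif_label(code):
    """Translate a Motif code (int or numeric string) into its label."""
    return MOTIF_LABELS[int(code)]


labels = [motif_label(code) for code in ("1", 0, 5)]
```

Accepting strings as well as integers is a convenience for values read straight from a CSV, where every field arrives as text. A code outside 0 to 6 raises `KeyError`.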


## Acknowledgments
- Athma Pai and Eraj Khokhar for guidance and support
- Zyxue for gtf2csv: https://github.com/zyxue/gtf2csv
- Yeo G and Burge C.B., Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals, RECOMB 2003 (Journal Comp. Bio in press)


%prep
%autosetup -n splicejunxchx-2.8

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-splicejunxchx -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Tue Jun 20 2023 Python_Bot <Python_Bot@openeuler.org> - 2.8-1
- Package Spec generated