path: root/python-ppscore.spec
%global _empty_manifest_terminate_build 0
Name:		python-ppscore
Version:	1.3.0
Release:	1
Summary:	Python implementation of the Predictive Power Score (PPS)
License:	MIT
URL:		https://github.com/8080labs/ppscore/
Source0:	https://mirrors.nju.edu.cn/pypi/web/packages/98/ae/7bb55069891bd36eed1f66be6b3caf153a0ec13eb4179c87fce1c105d49c/ppscore-1.3.0.tar.gz
BuildArch:	noarch


%description
# ppscore - a Python implementation of the Predictive Power Score (PPS)

### From the makers of [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


__If you don't know yet what the Predictive Power Score is, please read the following blog post:__

__[RIP correlation. Introducing the Predictive Power Score](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598?sk=7ac6697576053896fb27d3356dd6db32)__

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).


- [Installation](#installation)
- [Getting started](#getting-started)
- [API](#api)
- [Calculation of the PPS](#calculation-of-the-pps)
- [About](#about)


## Installation

> You need Python 3.6 or above.

From the terminal (or Anaconda prompt in Windows), enter:

```bash
pip install -U ppscore
```


## Getting started

> The examples refer to the current version (1.3.0) of ppscore. [See changes](https://github.com/8080labs/ppscore/blob/master/CHANGELOG.md)

First, let's create some data:

```python
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
```

Based on the dataframe we can calculate the PPS of x predicting y:

```python
pps.score(df, "x", "y")
```

We can calculate the PPS of all the predictors in the dataframe against a target y:

```python
pps.predictors(df, "y")
```

Here is how we can calculate the PPS matrix between all columns:

```python
pps.matrix(df)
```


### Visualization of the results
To visualize the results, you can use seaborn or your favorite visualization library.

__Plotting the PPS predictors:__

```python
import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")
```

__Plotting the PPS matrix:__

(This needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data)

```python
import seaborn as sns
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
```


## API

### ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=123, invalid_score=0, catch_errors=True)

Calculate the Predictive Power Score (PPS) for "x predicts y"

- The score always ranges from 0 to 1 and is data-type agnostic.

- A score of 0 means that the column x cannot predict the column y better than a naive baseline model.

- A score of 1 means that the column x can perfectly predict the column y given the model.

- A score between 0 and 1 states the ratio of how much potential predictive power the model achieved compared to the baseline model.


#### Parameters

- __df__ : pandas.DataFrame
    - Dataframe that contains the columns x and y
- __x__ : str
    - Name of the column x which acts as the feature
- __y__ : str
    - Name of the column y which acts as the target
- __sample__ : int or `None`
    - Number of rows for sampling. The sampling decreases the calculation time of the PPS.
    If `None` there will be no sampling.
- __cross_validation__ : int
    - Number of cross-validation folds. This determines the minimum number of observations required: with 4 folds, for example, a pattern can only be detected if the same observation occurs at least 4 times. Increasing the fold count raises this minimum. This is important because below that limit sklearn will throw an error and the PPS cannot be calculated
- __random_seed__ : int or `None`
    - Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
    If the value is set, the results will be reproducible. If the value is `None` a new random number is drawn at the start of each calculation.
- __invalid_score__ : any
    - The score that is returned when a calculation is not valid, e.g. because the data type was not supported.
- __catch_errors__ : bool
    - If `True`, all errors will be caught and reported as `unknown_error`, which is convenient. If `False`, errors will be raised; this is helpful for inspecting and debugging them.


#### Returns

- __Dict__:
    - A dict that contains multiple fields about the resulting PPS.
    The dict enables introspection into the calculations that have been performed under the hood


### ppscore.predictors(df, y, output="df", sorted=True, **kwargs)

Calculate the Predictive Power Score (PPS) for all columns in the dataframe against a target (y) column

#### Parameters
- __df__ : pandas.DataFrame
    - The dataframe that contains the data
- __y__ : str
    - Name of the column y which acts as the target
- __output__ : str - potential values: "df", "list"
    - Control the type of the output. Either return a df or a list with all the PPS score dicts
- __sorted__ : bool
    - Whether or not to sort the output dataframe/list by the ppscore
- __kwargs__ :
    - Other keyword arguments that are forwarded to the pps.score method, e.g. __sample__, __cross_validation__, __random_seed__, __invalid_score__, __catch_errors__

#### Returns

- __pandas.DataFrame__ or list of PPS dicts:
    - Either a df or a list of all the PPS dicts, depending on the `output` argument


### ppscore.matrix(df, output="df", sorted=False, **kwargs)

Calculate the Predictive Power Score (PPS) matrix for all columns in the dataframe

#### Parameters

- __df__ : pandas.DataFrame
    - The dataframe that contains the data
- __output__ : str - potential values: "df", "list"
    - Control the type of the output. Either return a df or a list with all the PPS score dicts
- __sorted__ : bool
    - Whether or not to sort the output dataframe/list by the ppscore
- __kwargs__ :
    - Other keyword arguments that are forwarded to the pps.score method, e.g. __sample__, __cross_validation__, __random_seed__, __invalid_score__, __catch_errors__

#### Returns

- __pandas.DataFrame__ or list of PPS dicts:
    - Either a df or a list of all the PPS dicts, depending on the `output` argument


## Calculation of the PPS

> If you are uncertain about some details, feel free to jump into the code to have a look at the exact implementation

There are multiple ways to calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:

- The score is calculated using only a single feature to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
- The score is calculated on the test sets of a 4-fold cross-validation (the number of folds is adjustable via `cross_validation`). For classification, `StratifiedKFold` is used; for regression, a regular `KFold`. Please note that __this sampling might not be valid for time series data sets__
- All rows which have a missing value in the feature or the target column are dropped
- If the dataset has more than 5,000 rows, the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via `sample`. However, in most scenarios the results will be very similar
- There is no grid search for optimal model parameters
- The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible you can set the random seed (`random_seed`).
- If the score cannot be calculated, the package will not raise an error but return an object where `is_valid_score` is `False`. The reported score will be `invalid_score`. We chose this behavior because we want to give you a quick overview where significant predictive power exists without you having to handle errors or edge cases. However, when you want to explicitly handle the errors, you can still do so.

### Learning algorithm

As a learning algorithm, we currently use a Decision Tree because the Decision Tree has the following properties:
- can detect any non-linear bivariate relationship
- good predictive power in a wide variety of use cases
- low requirements for feature preprocessing
- robust model which can handle outliers and does not easily overfit
- can be used for classification and regression
- can be calculated quicker than many other algorithms

We differentiate the exact implementation based on the data type of the target column:
- If the target column is numeric, we use the `sklearn.tree.DecisionTreeRegressor`
- If the target column is categoric, we use the `sklearn.tree.DecisionTreeClassifier`

> Please note that we prefer a general good performance on a wide variety of use cases over better performance in some narrow use cases. If you have a proposal for a better/different learning algorithm, please open an issue

However, please note why we actively decided against the following algorithms:

- Correlation or Linear Regression: cannot detect non-linear bivariate relationships without extensive preprocessing
- GAMs: might have problems with very unsmooth functions
- SVM: potentially bad performance if the wrong kernel is selected
- Random Forest/Gradient Boosted Tree: slower than a single Decision Tree
- Neural Networks and Deep Learning: slower calculation than a Decision Tree and also needs more feature preprocessing

### Data preprocessing

Even though the Decision Tree is a very flexible learning algorithm, we need to perform the following preprocessing steps if a column represents categoric values - that means it has the pandas dtype `object`, `category`, `string` or `boolean`:
- If the target column is categoric, we use the `sklearn.preprocessing.LabelEncoder`
- If the feature column is categoric, we use the `sklearn.preprocessing.OneHotEncoder`


### Choosing the prediction case

> This logic was updated in version 1.0.0.

The choice of the case (`classification` or `regression`) has an influence on the final PPS, and thus it is important that the correct case is chosen. The case is chosen based on the data types of the columns. That means, for example, that if you want to change the case from `regression` to `classification`, you have to change the data type from `float` to `string`.

Here are the two main cases:
- A __classification__ is chosen if the target has the dtype `object`, `category`, `string` or `boolean`
- A __regression__ is chosen if the target has the dtype `float` or `int`
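
The selection logic above can be sketched as follows. `choose_case` is a hypothetical helper written for illustration; ppscore performs this check internally:

```python
def choose_case(target_dtype):
    """Map a pandas dtype name to the prediction case (illustrative sketch)."""
    categoric = {"object", "category", "string", "boolean"}
    numeric = {"float", "float64", "int", "int64"}
    if target_dtype in categoric:
        return "classification"
    if target_dtype in numeric:
        return "regression"
    return "unsupported"  # e.g. datetime targets yield an invalid score

print(choose_case("float64"))   # regression
print(choose_case("category"))  # classification
```

To switch a column from `regression` to `classification`, convert its dtype first, e.g. `df["y"] = df["y"].astype(str)`.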


### Cases and their score metrics

Each case uses a different evaluation score for calculating the final predictive power score (PPS).

#### Regression

For a regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and never smaller than 0):
> PPS = 1 - (MAE_model / MAE_naive)
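
This normalization can be sketched in plain Python. `regression_pps` is a hypothetical helper, not part of the ppscore API:

```python
def regression_pps(mae_model, mae_naive):
    """PPS = 1 - (MAE_model / MAE_naive), clipped so it is never below 0."""
    if mae_naive == 0:  # constant target: the naive median model is already perfect
        return 0.0
    return max(0.0, 1.0 - mae_model / mae_naive)

# A model that halves the naive error earns a PPS of 0.5:
print(regression_pps(mae_model=0.5, mae_naive=1.0))  # 0.5
```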

#### Classification

If the task is a classification, we use the weighted F1 score as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0; precision and recall contribute equally. The weighted F1 takes into account the precision and recall of all classes, weighted by their support, as described [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). As a baseline score (F1_naive), we calculate the weighted F1 of a model that always predicts the most common class of the target column (F1_most_common) and of a model that predicts random values (F1_random); F1_naive is the maximum of the two. The PPS is the result of the following normalization (and never smaller than 0):
> PPS = (F1_model - F1_naive) / (1 - F1_naive)
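
Again as a sketch, with `classification_pps` a hypothetical helper, not part of the ppscore API:

```python
def classification_pps(f1_model, f1_most_common, f1_random):
    """PPS = (F1_model - F1_naive) / (1 - F1_naive), clipped so it is never below 0."""
    f1_naive = max(f1_most_common, f1_random)
    if f1_naive >= 1.0:  # the baseline is already perfect, no headroom left
        return 0.0
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))

# Halfway between the best baseline (0.5) and a perfect score:
print(classification_pps(f1_model=0.75, f1_most_common=0.5, f1_random=0.25))  # 0.5
```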

### Special cases

There are various cases in which the PPS can be defined without fitting a model to save computation time or in which the PPS cannot be calculated at all. Those cases are described below.

#### Valid scores
In the following cases, the PPS is defined but we can save ourselves the computation time:
- __feature_is_id__ means that the feature column is categoric (see above for __classification__) and that all categories appear only once. Such a feature can never predict a target during cross-validation and thus the PPS is 0.
- __target_is_id__ means that the target column is categoric (see above for __classification__) and that all categories appear only once. Thus, the PPS is 0 because an ID column cannot be predicted by any other column as part of a cross-validation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
- __target_is_constant__ means that the target column only has a single value and thus the PPS is 0 because any column and baseline can perfectly predict a column that only has a single value. Therefore, the feature does not add any predictive power and we want to communicate that.
- __predict_itself__ means that the feature and target columns are the same and thus the PPS is 1 because a column can always perfectly predict its own value. Also, this leads to the typical diagonal of 1 that we are used to from the correlation matrix.

#### Invalid scores and other errors
In the following cases, the PPS is not defined and the score is set to `invalid_score`:
- __target_is_datetime__ means that the target column has a datetime data type which is not supported. A possible solution might be to convert the target column to a string column.
- __target_data_type_not_supported__ means that the target column has a data type which is not supported. A possible solution might be to convert the target column to another data type.
- __empty_dataframe_after_dropping_na__ occurs when there are no valid rows left after rows with missing values have been dropped. A possible solution might be to replace the missing values with valid values.
- Last but not least, __unknown_error__ occurs for all other errors that might raise an exception. This case is only reported when `catch_errors` is `True`. If you want to inspect or debug the underlying error, please set `catch_errors` to `False`.

## Citing ppscore
[![DOI](https://zenodo.org/badge/256518683.svg)](https://zenodo.org/badge/latestdoi/256518683)

## About
ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore` you might want to check out our other project [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


%package -n python3-ppscore
Summary:	Python implementation of the Predictive Power Score (PPS)
Provides:	python-ppscore
BuildRequires:	python3-devel
BuildRequires:	python3-setuptools
BuildRequires:	python3-pip
%description -n python3-ppscore
# ppscore - a Python implementation of the Predictive Power Score (PPS)

### From the makers of [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


__If you don't know yet what the Predictive Power Score is, please read the following blog post:__

__[RIP correlation. Introducing the Predictive Power Score](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598?sk=7ac6697576053896fb27d3356dd6db32)__

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).


- [Installation](#installation)
- [Getting started](#getting-started)
- [API](#api)
- [Calculation of the PPS](#calculation-of-the-pps)
- [About](#about)


## Installation

> You need Python 3.6 or above.

From the terminal (or Anaconda prompt in Windows), enter:

```bash
pip install -U ppscore
```


## Getting started

> The examples refer to the current version (1.3.0) of ppscore. [See changes](https://github.com/8080labs/ppscore/blob/master/CHANGELOG.md)

First, let's create some data:

```python
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
```

Based on the dataframe we can calculate the PPS of x predicting y:

```python
pps.score(df, "x", "y")
```

We can calculate the PPS of all the predictors in the dataframe against a target y:

```python
pps.predictors(df, "y")
```

Here is how we can calculate the PPS matrix between all columns:

```python
pps.matrix(df)
```


### Visualization of the results
To visualize the results, you can use seaborn or your favorite visualization library.

__Plotting the PPS predictors:__

```python
import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")
```

__Plotting the PPS matrix:__

(This needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data)

```python
import seaborn as sns
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
```


## API

### ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=123, invalid_score=0, catch_errors=True)

Calculate the Predictive Power Score (PPS) for "x predicts y"

- The score always ranges from 0 to 1 and is data-type agnostic.

- A score of 0 means that the column x cannot predict the column y better than a naive baseline model.

- A score of 1 means that the column x can perfectly predict the column y given the model.

- A score between 0 and 1 states the ratio of how much potential predictive power the model achieved compared to the baseline model.


#### Parameters

- __df__ : pandas.DataFrame
    - Dataframe that contains the columns x and y
- __x__ : str
    - Name of the column x which acts as the feature
- __y__ : str
    - Name of the column y which acts as the target
- __sample__ : int or `None`
    - Number of rows for sampling. The sampling decreases the calculation time of the PPS.
    If `None` there will be no sampling.
- __cross_validation__ : int
    - Number of cross-validation folds. This determines the minimum number of observations required: with 4 folds, for example, a pattern can only be detected if the same observation occurs at least 4 times. Increasing the fold count raises this minimum. This is important because below that limit sklearn will throw an error and the PPS cannot be calculated
- __random_seed__ : int or `None`
    - Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
    If the value is set, the results will be reproducible. If the value is `None` a new random number is drawn at the start of each calculation.
- __invalid_score__ : any
    - The score that is returned when a calculation is not valid, e.g. because the data type was not supported.
- __catch_errors__ : bool
    - If `True`, all errors will be caught and reported as `unknown_error`, which is convenient. If `False`, errors will be raised; this is helpful for inspecting and debugging them.


#### Returns

- __Dict__:
    - A dict that contains multiple fields about the resulting PPS.
    The dict enables introspection into the calculations that have been performed under the hood
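
As a rough illustration, here is how such a result dict might be consumed. The exact field names (`ppscore`, `is_valid_score`) are assumptions based on this document, not a verified API contract; inspect a real result to confirm them:

```python
# Hypothetical shape of a result dict; the field names are assumptions.
result = {"x": "x", "y": "y", "ppscore": 0.67, "is_valid_score": True}

if result["is_valid_score"]:
    print(f"PPS of {result['x']} -> {result['y']}: {result['ppscore']}")
else:
    print("The score could not be calculated")
```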


### ppscore.predictors(df, y, output="df", sorted=True, **kwargs)

Calculate the Predictive Power Score (PPS) for all columns in the dataframe against a target (y) column

#### Parameters
- __df__ : pandas.DataFrame
    - The dataframe that contains the data
- __y__ : str
    - Name of the column y which acts as the target
- __output__ : str - potential values: "df", "list"
    - Control the type of the output. Either return a df or a list with all the PPS score dicts
- __sorted__ : bool
    - Whether or not to sort the output dataframe/list by the ppscore
- __kwargs__ :
    - Other keyword arguments that are forwarded to the pps.score method, e.g. __sample__, __cross_validation__, __random_seed__, __invalid_score__, __catch_errors__

#### Returns

- __pandas.DataFrame__ or list of PPS dicts:
    - Either a df or a list of all the PPS dicts, depending on the `output` argument


### ppscore.matrix(df, output="df", sorted=False, **kwargs)

Calculate the Predictive Power Score (PPS) matrix for all columns in the dataframe

#### Parameters

- __df__ : pandas.DataFrame
    - The dataframe that contains the data
- __output__ : str - potential values: "df", "list"
    - Control the type of the output. Either return a df or a list with all the PPS score dicts
- __sorted__ : bool
    - Whether or not to sort the output dataframe/list by the ppscore
- __kwargs__ :
    - Other keyword arguments that are forwarded to the pps.score method, e.g. __sample__, __cross_validation__, __random_seed__, __invalid_score__, __catch_errors__

#### Returns

- __pandas.DataFrame__ or list of PPS dicts:
    - Either a df or a list of all the PPS dicts, depending on the `output` argument


## Calculation of the PPS

> If you are uncertain about some details, feel free to jump into the code to have a look at the exact implementation

There are multiple ways to calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:

- The score is calculated using only a single feature to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
- The score is calculated on the test sets of a 4-fold cross-validation (the number of folds is adjustable via `cross_validation`). For classification, `StratifiedKFold` is used; for regression, a regular `KFold`. Please note that __this sampling might not be valid for time series data sets__
- All rows which have a missing value in the feature or the target column are dropped
- If the dataset has more than 5,000 rows, the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via `sample`. However, in most scenarios the results will be very similar
- There is no grid search for optimal model parameters
- The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible you can set the random seed (`random_seed`).
- If the score cannot be calculated, the package will not raise an error but return an object where `is_valid_score` is `False`. The reported score will be `invalid_score`. We chose this behavior because we want to give you a quick overview where significant predictive power exists without you having to handle errors or edge cases. However, when you want to explicitly handle the errors, you can still do so.

### Learning algorithm

As a learning algorithm, we currently use a Decision Tree because the Decision Tree has the following properties:
- can detect any non-linear bivariate relationship
- good predictive power in a wide variety of use cases
- low requirements for feature preprocessing
- robust model which can handle outliers and does not easily overfit
- can be used for classification and regression
- can be calculated quicker than many other algorithms

We differentiate the exact implementation based on the data type of the target column:
- If the target column is numeric, we use the `sklearn.tree.DecisionTreeRegressor`
- If the target column is categoric, we use the `sklearn.tree.DecisionTreeClassifier`

> Please note that we prefer a general good performance on a wide variety of use cases over better performance in some narrow use cases. If you have a proposal for a better/different learning algorithm, please open an issue

However, please note why we actively decided against the following algorithms:

- Correlation or Linear Regression: cannot detect non-linear bivariate relationships without extensive preprocessing
- GAMs: might have problems with very unsmooth functions
- SVM: potentially bad performance if the wrong kernel is selected
- Random Forest/Gradient Boosted Tree: slower than a single Decision Tree
- Neural Networks and Deep Learning: slower calculation than a Decision Tree and also needs more feature preprocessing

### Data preprocessing

Even though the Decision Tree is a very flexible learning algorithm, we need to perform the following preprocessing steps if a column represents categoric values - that means it has the pandas dtype `object`, `category`, `string` or `boolean`:
- If the target column is categoric, we use the `sklearn.preprocessing.LabelEncoder`
- If the feature column is categoric, we use the `sklearn.preprocessing.OneHotEncoder`


### Choosing the prediction case

> This logic was updated in version 1.0.0.

The choice of the case (`classification` or `regression`) has an influence on the final PPS, and thus it is important that the correct case is chosen. The case is chosen based on the data types of the columns. That means, for example, that if you want to change the case from `regression` to `classification`, you have to change the data type from `float` to `string`.

Here are the two main cases:
- A __classification__ is chosen if the target has the dtype `object`, `category`, `string` or `boolean`
- A __regression__ is chosen if the target has the dtype `float` or `int`
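
The selection logic above can be sketched as follows. `choose_case` is a hypothetical helper written for illustration; ppscore performs this check internally:

```python
def choose_case(target_dtype):
    """Map a pandas dtype name to the prediction case (illustrative sketch)."""
    categoric = {"object", "category", "string", "boolean"}
    numeric = {"float", "float64", "int", "int64"}
    if target_dtype in categoric:
        return "classification"
    if target_dtype in numeric:
        return "regression"
    return "unsupported"  # e.g. datetime targets yield an invalid score

print(choose_case("float64"))   # regression
print(choose_case("category"))  # classification
```

To switch a column from `regression` to `classification`, convert its dtype first, e.g. `df["y"] = df["y"].astype(str)`.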


### Cases and their score metrics

Each case uses a different evaluation score for calculating the final predictive power score (PPS).

#### Regression

For a regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and never smaller than 0):
> PPS = 1 - (MAE_model / MAE_naive)
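
This normalization can be sketched in plain Python. `regression_pps` is a hypothetical helper, not part of the ppscore API:

```python
def regression_pps(mae_model, mae_naive):
    """PPS = 1 - (MAE_model / MAE_naive), clipped so it is never below 0."""
    if mae_naive == 0:  # constant target: the naive median model is already perfect
        return 0.0
    return max(0.0, 1.0 - mae_model / mae_naive)

# A model that halves the naive error earns a PPS of 0.5:
print(regression_pps(mae_model=0.5, mae_naive=1.0))  # 0.5
```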

#### Classification

If the task is a classification, we use the weighted F1 score as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0; precision and recall contribute equally. The weighted F1 takes into account the precision and recall of all classes, weighted by their support, as described [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). As a baseline score (F1_naive), we calculate the weighted F1 of a model that always predicts the most common class of the target column (F1_most_common) and of a model that predicts random values (F1_random); F1_naive is the maximum of the two. The PPS is the result of the following normalization (and never smaller than 0):
> PPS = (F1_model - F1_naive) / (1 - F1_naive)
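
Again as a sketch, with `classification_pps` a hypothetical helper, not part of the ppscore API:

```python
def classification_pps(f1_model, f1_most_common, f1_random):
    """PPS = (F1_model - F1_naive) / (1 - F1_naive), clipped so it is never below 0."""
    f1_naive = max(f1_most_common, f1_random)
    if f1_naive >= 1.0:  # the baseline is already perfect, no headroom left
        return 0.0
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))

# Halfway between the best baseline (0.5) and a perfect score:
print(classification_pps(f1_model=0.75, f1_most_common=0.5, f1_random=0.25))  # 0.5
```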

### Special cases

There are various cases in which the PPS can be defined without fitting a model to save computation time or in which the PPS cannot be calculated at all. Those cases are described below.

#### Valid scores
In the following cases, the PPS is defined but we can save ourselves the computation time:
- __feature_is_id__ means that the feature column is categoric (see above for __classification__) and that all categories appear only once. Such a feature can never predict a target during cross-validation and thus the PPS is 0.
- __target_is_id__ means that the target column is categoric (see above for __classification__) and that all categories appear only once. Thus, the PPS is 0 because an ID column cannot be predicted by any other column as part of a cross-validation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
- __target_is_constant__ means that the target column only has a single value and thus the PPS is 0 because any column and baseline can perfectly predict a column that only has a single value. Therefore, the feature does not add any predictive power and we want to communicate that.
- __predict_itself__ means that the feature and target columns are the same and thus the PPS is 1 because a column can always perfectly predict its own value. Also, this leads to the typical diagonal of 1 that we are used to from the correlation matrix.

#### Invalid scores and other errors
In the following cases, the PPS is not defined and the score is set to `invalid_score`:
- __target_is_datetime__ means that the target column has a datetime data type which is not supported. A possible solution might be to convert the target column to a string column.
- __target_data_type_not_supported__ means that the target column has a data type which is not supported. A possible solution might be to convert the target column to another data type.
- __empty_dataframe_after_dropping_na__ occurs when there are no valid rows left after rows with missing values have been dropped. A possible solution might be to replace the missing values with valid values.
- Last but not least, __unknown_error__ occurs for all other errors that might raise an exception. This case is only reported when `catch_errors` is `True`. If you want to inspect or debug the underlying error, please set `catch_errors` to `False`.

## Citing ppscore
[![DOI](https://zenodo.org/badge/256518683.svg)](https://zenodo.org/badge/latestdoi/256518683)

## About
ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore` you might want to check out our other project [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


%package help
Summary:	Development documents and examples for ppscore
Provides:	python3-ppscore-doc
%description help
# ppscore - a Python implementation of the Predictive Power Score (PPS)

### From the makers of [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


__If you don't know yet what the Predictive Power Score is, please read the following blog post:__

__[RIP correlation. Introducing the Predictive Power Score](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598?sk=7ac6697576053896fb27d3356dd6db32)__

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).


- [Installation](#installation)
- [Getting started](#getting-started)
- [API](#api)
- [Calculation of the PPS](#calculation-of-the-pps)
- [About](#about)


## Installation

> You need Python 3.6 or above.

From the terminal (or Anaconda prompt in Windows), enter:

```bash
pip install -U ppscore
```


## Getting started

> The examples refer to the newest version (1.2.0) of ppscore. [See changes](https://github.com/8080labs/ppscore/blob/master/CHANGELOG.md)

First, let's create some data:

```python
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
```

Based on the dataframe we can calculate the PPS of x predicting y:

```python
pps.score(df, "x", "y")
```

We can calculate the PPS of all the predictors in the dataframe against a target y:

```python
pps.predictors(df, "y")
```

Here is how we can calculate the PPS matrix between all columns:

```python
pps.matrix(df)
```


### Visualization of the results
For the visualization of the results you can use seaborn or your favorite viz library.

__Plotting the PPS predictors:__

```python
import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")
```

__Plotting the PPS matrix:__

(This needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data)

```python
import seaborn as sns
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
```


## API

### ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=123, invalid_score=0, catch_errors=True)

Calculate the Predictive Power Score (PPS) for "x predicts y"

- The score always ranges from 0 to 1 and is data-type agnostic.

- A score of 0 means that the column x cannot predict the column y better than a naive baseline model.

- A score of 1 means that the column x can perfectly predict the column y given the model.

- A score between 0 and 1 states the ratio of how much potential predictive power the model achieved compared to the baseline model.


#### Parameters

- __df__ : pandas.DataFrame
    - Dataframe that contains the columns x and y
- __x__ : str
    - Name of the column x which acts as the feature
- __y__ : str
    - Name of the column y which acts as the target
- __sample__ : int or `None`
    - Number of rows for sampling. The sampling decreases the calculation time of the PPS.
    If `None` there will be no sampling.
- __cross_validation__ : int
    - Number of iterations during cross-validation. For example, with 4 iterations a pattern can only be detected for observations that occur at least 4 times; increasing the number of iterations also increases this required minimum. This limit matters because below it sklearn throws an error and the PPS cannot be calculated.
- __random_seed__ : int or `None`
    - Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
    If the value is set, the results will be reproducible. If the value is `None` a new random number is drawn at the start of each calculation.
- __invalid_score__ : any
    - The score that is returned when a calculation is not valid, e.g. because the data type was not supported.
- __catch_errors__ : bool
    - If `True`, all errors are caught and reported as `unknown_error`, which is convenient. If `False`, errors are raised, which is helpful for inspecting and debugging them.


#### Returns

- __Dict__:
    - A dict that contains multiple fields about the resulting PPS.
    The dict enables introspection into the calculations that have been performed under the hood


### ppscore.predictors(df, y, output="df", sorted=True, **kwargs)

Calculate the Predictive Power Score (PPS) for all columns in the dataframe against a target (y) column

#### Parameters
- __df__ : pandas.DataFrame
    - The dataframe that contains the data
- __y__ : str
    - Name of the column y which acts as the target
- __output__ : str - potential values: "df", "list"
    - Control the type of the output. Either return a df or a list with all the PPS score dicts
- __sorted__ : bool
    - Whether or not to sort the output dataframe/list by the ppscore
- __kwargs__ :
    - Other keyword arguments that are forwarded to the pps.score method, e.g. __sample__, __cross_validation__, __random_seed__, __invalid_score__, __catch_errors__

#### Returns

- __pandas.DataFrame__ or list of PPS dicts:
    - Either returns a df or a list of all the PPS dicts. This can be influenced by the output argument


### ppscore.matrix(df, output="df", sorted=False, **kwargs)

Calculate the Predictive Power Score (PPS) matrix for all columns in the dataframe

#### Parameters

- __df__ : pandas.DataFrame
    - The dataframe that contains the data
- __output__ : str - potential values: "df", "list"
    - Control the type of the output. Either return a df or a list with all the PPS score dicts
- __sorted__ : bool
    - Whether or not to sort the output dataframe/list by the ppscore
- __kwargs__ :
    - Other keyword arguments that are forwarded to the pps.score method, e.g. __sample__, __cross_validation__, __random_seed__, __invalid_score__, __catch_errors__

#### Returns

- __pandas.DataFrame__ or list of PPS dicts:
    - Either returns a df or a list of all the PPS dicts. This can be influenced by the output argument


## Calculation of the PPS

> If you are uncertain about some details, feel free to jump into the code to have a look at the exact implementation

There are multiple ways to calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:

- The score is calculated using only 1 feature trying to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
- The score is calculated on the test sets of a 4-fold cross-validation (the number is adjustable via `cross_validation`). For classification, StratifiedKFold is used; for regression, a normal KFold. Please note that __this sampling might not be valid for time series data sets__
- All rows which have a missing value in the feature or the target column are dropped
- If the dataset has more than 5,000 rows, the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via `sample`. However, in most scenarios the results will be very similar
- There is no grid search for optimal model parameters
- The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible you can set the random seed (`random_seed`).
- If the score cannot be calculated, the package will not raise an error but return an object where `is_valid_score` is `False`. The reported score will be `invalid_score`. We chose this behavior because we want to give you a quick overview where significant predictive power exists without you having to handle errors or edge cases. However, when you want to explicitly handle the errors, you can still do so.
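The sampling and reproducibility points above can be illustrated with numpy alone (a sketch of the same principle, not ppscore's internal code; `sample_rows` is a hypothetical helper):

```python
import numpy as np

def sample_rows(n_rows, sample, random_seed=None):
    """Draw a random subset of row indices, reproducibly when seeded."""
    rng = np.random.default_rng(random_seed)
    if sample is None or n_rows <= sample:
        return np.arange(n_rows)  # small enough: keep every row
    return rng.choice(n_rows, size=sample, replace=False)

a = sample_rows(1_000_000, 5_000, random_seed=123)
b = sample_rows(1_000_000, 5_000, random_seed=123)
assert (a == b).all()  # same seed, same subset: reproducible results
```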

### Learning algorithm

As a learning algorithm, we currently use a Decision Tree because the Decision Tree has the following properties:
- can detect any non-linear bivariate relationship
- good predictive power in a wide variety of use cases
- low requirements for feature preprocessing
- robust model which can handle outliers and does not easily overfit
- can be used for classification and regression
- can be calculated quicker than many other algorithms

We differentiate the exact implementation based on the data type of the target column:
- If the target column is numeric, we use the `sklearn.tree.DecisionTreeRegressor`
- If the target column is categoric, we use the `sklearn.tree.DecisionTreeClassifier`

> Please note that we prefer a general good performance on a wide variety of use cases over better performance in some narrow use cases. If you have a proposal for a better/different learning algorithm, please open an issue

However, please note why we actively decided against the following algorithms:

- Correlation or Linear Regression: cannot detect non-linear bivariate relationships without extensive preprocessing
- GAMs: might have problems with very unsmooth functions
- SVM: potentially bad performance if the wrong kernel is selected
- Random Forest/Gradient Boosted Tree: slower than a single Decision Tree
- Neural Networks and Deep Learning: slower calculation than a Decision Tree and also needs more feature preprocessing

### Data preprocessing

Even though the Decision Tree is a very flexible learning algorithm, we need to perform the following preprocessing steps if a column represents categoric values - that means it has the pandas dtype `object`, `category`, `string` or `boolean`:
- If the target column is categoric, we use the `sklearn.preprocessing.LabelEncoder`
- If the feature column is categoric, we use the `sklearn.preprocessing.OneHotEncoder`
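The effect of these two steps can be illustrated with pandas equivalents (a sketch of the idea, not the exact sklearn calls the package makes):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "label": ["a", "b", "a"]})

# Target column: map each category to an integer code,
# as sklearn's LabelEncoder does for a categoric target.
target_codes, _ = pd.factorize(df["label"])  # codes in order of first appearance

# Feature column: expand into one indicator column per category,
# as sklearn's OneHotEncoder does for a categoric feature.
feature_onehot = pd.get_dummies(df["color"], prefix="color")

print(list(target_codes))               # [0, 1, 0]
print(feature_onehot.columns.tolist())  # ['color_blue', 'color_red']
```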


### Choosing the prediction case

> This logic was updated in version 1.0.0.

The choice of the case (`classification` or `regression`) influences the final PPS, so it is important that the correct case is chosen. The case is chosen based on the data types of the columns. That means, for example, that if you want to change the case from `regression` to `classification`, you have to change the data type of the target column from `float` to `string`.

Here are the two main cases:
- A __classification__ is chosen if the target has the dtype `object`, `category`, `string` or `boolean`
- A __regression__ is chosen if the target has the dtype `float` or `int`
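These rules can be sketched as a small dispatch function on the pandas dtype (a simplified sketch, not ppscore's exact implementation; `prediction_case` is a hypothetical name):

```python
import pandas as pd

def prediction_case(target):
    """Pick classification or regression from the target dtype."""
    # bool must be checked before numeric: pandas treats bool as numeric
    if (
        pd.api.types.is_bool_dtype(target)
        or isinstance(target.dtype, pd.CategoricalDtype)
        or pd.api.types.is_object_dtype(target)
        or pd.api.types.is_string_dtype(target)
    ):
        return "classification"
    if pd.api.types.is_numeric_dtype(target):
        return "regression"
    raise ValueError(f"unsupported target dtype: {target.dtype}")

print(prediction_case(pd.Series([1.5, 2.5])))              # regression
print(prediction_case(pd.Series([1.5, 2.5]).astype(str)))  # classification
```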


### Cases and their score metrics

Each case uses a different evaluation score for calculating the final predictive power score (PPS).

#### Regression

In the case of a regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher values are worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and never smaller than 0):
> PPS = 1 - (MAE_model / MAE_naive)

#### Classification

If the task is a classification, we compute the weighted F1 score as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contribution of precision and recall to the F1 score is equal. The weighted F1 takes into account the precision and recall of all classes weighted by their support, as described [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). As a baseline score (F1_naive), we calculate the weighted F1 score for a model that always predicts the most common class of the target column (F1_most_common) and for a model that predicts random values (F1_random). F1_naive is set to the maximum of F1_most_common and F1_random. The PPS is the result of the following normalization (and never smaller than 0):
> PPS = (F1_model - F1_naive) / (1 - F1_naive)

### Special cases

There are various cases in which the PPS can be defined without fitting a model to save computation time or in which the PPS cannot be calculated at all. Those cases are described below.

#### Valid scores
In the following cases, the PPS is defined but we can save ourselves the computation time:
- __feature_is_id__ means that the feature column is categoric (see above for __classification__) and that all categories appear only once. Such a feature can never predict a target during cross-validation and thus the PPS is 0.
- __target_is_id__ means that the target column is categoric (see above for __classification__) and that all categories appear only once. Thus, the PPS is 0 because an ID column cannot be predicted by any other column as part of a cross-validation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
- __target_is_constant__ means that the target column only has a single value and thus the PPS is 0 because any column and baseline can perfectly predict a column that only has a single value. Therefore, the feature does not add any predictive power and we want to communicate that.
- __predict_itself__ means that the feature and target columns are the same and thus the PPS is 1 because a column can always perfectly predict its own value. Also, this leads to the typical diagonal of 1 that we are used to from the correlation matrix.
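The two ID shortcuts above boil down to a single uniqueness test, which can be sketched with pandas (an illustration, not the package's exact check; `is_id_column` is a hypothetical name):

```python
import pandas as pd

def is_id_column(series):
    """True if every non-missing value occurs exactly once.

    Such a column cannot be learned during cross-validation because
    no category ever appears in both the train and the test fold.
    """
    values = series.dropna()
    return values.nunique() == len(values)

print(is_id_column(pd.Series(["u1", "u2", "u3"])))  # True: an ID column
print(is_id_column(pd.Series(["a", "b", "a"])))     # False: "a" repeats
```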

#### Invalid scores and other errors
In the following cases, the PPS is not defined and the score is set to `invalid_score`:
- __target_is_datetime__ means that the target column has a datetime data type which is not supported. A possible solution might be to convert the target column to a string column.
- __target_data_type_not_supported__ means that the target column has a data type which is not supported. A possible solution might be to convert the target column to another data type.
- __empty_dataframe_after_dropping_na__ occurs when there are no valid rows left after rows with missing values have been dropped. A possible solution might be to replace the missing values with valid values.
- Last but not least, __unknown_error__ occurs for all other errors that might raise an exception. This case is only reported when `catch_errors` is `True`. If you want to inspect or debug the underlying error, please set `catch_errors` to `False`.

## Citing ppscore
[![DOI](https://zenodo.org/badge/256518683.svg)](https://zenodo.org/badge/latestdoi/256518683)

## About
ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore` you might want to check out our other project [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


%prep
%autosetup -n ppscore-1.3.0

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-ppscore -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Sun Apr 23 2023 Python_Bot <Python_Bot@openeuler.org> - 1.3.0-1
- Package Spec generated