| field | value | date |
|---|---|---|
| author | CoprDistGit <infra@openeuler.org> | 2023-05-17 04:50:29 +0000 |
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-17 04:50:29 +0000 |
| commit | 0b30e69bad30bb3d62ea0f74c6a38aff495a674c (patch) | |
| tree | ab9c71e70af511e3a02fc94f8f8ea6326ad56952 | |
| parent | 2f891dc233c878495067a0135c6eef2e50508e50 (diff) | |
automatic import of python-light-famd
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-light-famd.spec | 1075 |
| -rw-r--r-- | sources | 1 |

3 files changed, 1077 insertions, 0 deletions
@@ -0,0 +1 @@
+/light_famd-0.0.3.tar.gz
diff --git a/python-light-famd.spec b/python-light-famd.spec
new file mode 100644
index 0000000..b9de5ce
--- /dev/null
+++ b/python-light-famd.spec
@@ -0,0 +1,1075 @@
+%global _empty_manifest_terminate_build 0
+Name: python-light-famd
+Version: 0.0.3
+Release: 1
+Summary: Light Factor Analysis of Mixed Data
+License: BSD License
+URL: https://github.com/Cauchemare/Light_FAMD
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/9e/f0/60e56c2e3c00e33cfeab5d54dfdb917fa960fd8d178fb57be1320af7010b/light_famd-0.0.3.tar.gz
+BuildArch: noarch
+
+Requires: python3-scikit-learn
+Requires: python3-scipy
+Requires: python3-pandas
+Requires: python3-numpy
+
+%description
+
+# Light_FAMD
+
+`Light_FAMD` is a library for processing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.
+
+## Table of contents
+
+- [Usage](#usage)
+  - [Guidelines](#guidelines)
+  - [Principal component analysis (PCA)](#principal-component-analysis-pca)
+  - [Correspondence analysis (CA)](#correspondence-analysis-ca)
+  - [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
+  - [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
+  - [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
+- [Going faster](#going-faster)
+
+`Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`), which are included with Anaconda.
+
+### Guidelines
+
+Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means the `fit_transform`, `set_params` and `get_params` methods can be used directly.
+
+Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html). This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components. To obtain a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the `random_state` parameter.
+
+The randomised version of SVD is an iterative method. Because each of `light_famd`'s algorithms uses SVD, they all possess an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be; on the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.
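The `n_iter`/precision trade-off described above can be sketched directly with scikit-learn's `randomized_svd`, the routine `Light_FAMD` builds on. The matrix shape and parameter values below are illustrative only:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Illustrative data: a tall random matrix from which only a few components are wanted.
rng = np.random.RandomState(42)
X = rng.rand(1000, 50)

# Few power iterations: fastest, slightly less precise.
U2, s2, Vt2 = randomized_svd(X, n_components=5, n_iter=2, random_state=42)

# More power iterations: slower, closer to the exact truncated SVD.
U7, s7, Vt7 = randomized_svd(X, n_components=5, n_iter=7, random_state=42)

# Fixing random_state makes both runs reproducible; the leading singular
# values already agree closely at n_iter=2.
print(s2[0], s7[0])
```

For matrices like this, the leading singular values converge after very few iterations, which is why a low `n_iter` is usually enough.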
+
+In this package the inheritance relationships are as shown below (A -> B: A is the superclass of B):
+
+- PCA -> MFA -> FAMD
+- CA -> MCA
+
+Choose a method depending on your situation:
+
+- All your variables are numeric: use principal component analysis (`PCA`)
+- You have a contingency table: use correspondence analysis (`CA`)
+- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`MCA`)
+- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
+- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)
+
+The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
+
+- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
+- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
+- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
+- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
+- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
+- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)
+
+Notice that `Light_FAMD` doesn't support sparse input; see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and big data.
+
+### Principal-Component-Analysis: PCA
+
+**PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3,
+    copy=True, check_input=True, random_state=None, engine='auto'):
+
+**Args:**
+- `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
+- `rescale_with_std` (bool): Whether to divide each column by its standard deviation or not.
+- `n_components` (int): The number of principal components to compute.
+- `n_iter` (int): The number of iterations used for computing the SVD.
+- `copy` (bool): Whether to perform the computations in place or not.
+- `check_input` (bool): Whether to check the consistency of the inputs or not.
+- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
+- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data. If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.
+
+Returns: an ndarray of shape (M, k), where M is the number of samples and k the number of components.
+
+**Examples:**
+```
+>>> import numpy as np
+>>> import pandas as pd
+>>> np.random.seed(42)  # This is for doctests reproducibility
+
+>>> from light_famd import PCA
+>>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
+>>> pca = PCA(n_components=2)
+>>> pca.fit(X)
+PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
+    random_state=None, rescale_with_mean=True, rescale_with_std=True)
+
+>>> print(pca.explained_variance_)
+[20.20385109 8.48246239]
+
+>>> print(pca.explained_variance_ratio_)
+[0.6734617029875277, 0.28274874633810754]
+
+>>> print(pca.column_correlation(X))  # Pearson correlation between each component and each original column; reported as NaN when its p-value >= 0.05.
+          0        1
+A -0.953482      NaN
+B  0.907314      NaN
+C       NaN  0.84211
+
+>>> print(pca.transform(X))
+[[-0.82262005 0.11730656]
+ [ 0.05359079 1.62298683]
+ [ 1.03052849 0.79973099]
+ [-0.24313366 0.25651395]
+ [-0.94630387 -1.04943025]
+ [-0.70591749 -0.01282583]
+ [-0.39948373 -1.52612436]
+ [ 2.70164194 0.38048482]
+ [-2.49373351 0.53655273]
+ [ 1.8254311 -1.12519545]]
+>>> print(pca.fit_transform(X))
+[[-0.82262005 0.11730656]
+ [ 0.05359079 1.62298683]
+ [ 1.03052849 0.79973099]
+ [-0.24313366 0.25651395]
+ [-0.94630387 -1.04943025]
+ [-0.70591749 -0.01282583]
+ [-0.39948373 -1.52612436]
+ [ 2.70164194 0.38048482]
+ [-2.49373351 0.53655273]
+ [ 1.8254311 -1.12519545]]
+
+```
+### Correspondence-Analysis: CA
+
+**CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None,
+    engine='auto'):
+
+**Args:**
+- `n_components` (int): The number of principal components to compute.
+- `n_iter` (int): The number of iterations used for computing the SVD.
+- `copy` (bool): Whether to perform the computations in place or not.
+- `check_input` (bool): Whether to check the consistency of the inputs or not.
+- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
+- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data. If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.
+
+Returns: an ndarray of shape (M, k), where M is the number of samples and k the number of components.
+
+**Examples:**
+```
+>>> import numpy as np
+>>> import pandas as pd
+>>> from light_famd import CA
+>>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
+>>> ca = CA(n_components=2, n_iter=2)
+>>> ca.fit(X)
+CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
+   random_state=None)
+
+>>> print(ca.explained_variance_)
+[0.16892141 0.0746376 ]
+
+>>> print(ca.explained_variance_ratio_)
+[0.5650580210934917, 0.2496697790527281]
+
+>>> print(ca.transform(X))
+[[ 0.23150854 -0.39167802]
+ [ 0.36006095 0.00301414]
+ [-0.48192602 -0.13002647]
+ [-0.06333533 -0.21475652]
+ [-0.16438708 -0.10418312]
+ [-0.38129126 -0.16515196]
+ [ 0.2721296 0.46923757]
+ [ 0.82953753 0.20638333]
+ [-0.500007 0.36897935]
+ [ 0.57932474 -0.1023383 ]]
+
+>>> print(ca.fit_transform(X))
+[[ 0.23150854 -0.39167802]
+ [ 0.36006095 0.00301414]
+ [-0.48192602 -0.13002647]
+ [-0.06333533 -0.21475652]
+ [-0.16438708 -0.10418312]
+ [-0.38129126 -0.16515196]
+ [ 0.2721296 0.46923757]
+ [ 0.82953753 0.20638333]
+ [-0.500007 0.36897935]
+ [ 0.57932474 -0.1023383 ]]
+```
+
+### Multiple-Correspondence-Analysis: MCA
+The `MCA` class inherits from the `CA` class.
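Inheriting from `CA` reflects how MCA works: conceptually, MCA is CA applied to the complete disjunctive (one-hot indicator) table of the categorical data. A minimal sketch of that indicator table with plain `pandas` (illustrative only, not Light_FAMD internals):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.choice(list('abc'), size=(6, 2)), columns=['A', 'B'])

# Complete disjunctive table: one 0/1 column per observed category of each
# variable; MCA then amounts to running CA on this indicator matrix.
indicator = pd.get_dummies(X).astype(int)

# Each row sums to the number of original variables, because exactly one
# category is active per variable.
print(indicator.sum(axis=1).tolist())
```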
+
+```
+>>> import numpy as np
+>>> import pandas as pd
+>>> from light_famd import MCA
+>>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
+>>> print(X)
+   A  B  C  D
+0  d  e  a  d
+1  e  d  b  b
+2  e  d  a  e
+3  b  b  e  d
+4  b  d  b  b
+5  c  b  a  e
+6  e  d  b  a
+7  d  c  d  d
+8  b  c  d  a
+9  a  e  c  c
+>>> mca = MCA(n_components=2)
+>>> mca.fit(X)
+MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
+    random_state=None)
+
+>>> print(mca.explained_variance_)
+[0.90150495 0.76979456]
+
+>>> print(mca.explained_variance_ratio_)
+[0.24040131974598467, 0.20527854948955893]
+
+>>> print(mca.transform(X))
+[[ 0.55603013 0.7016272 ]
+ [-0.73558629 -1.17559462]
+ [-0.44972794 -0.4973024 ]
+ [-0.16248444 0.95706908]
+ [-0.66969377 -0.79951057]
+ [-0.21267777 0.39953562]
+ [-0.67921667 -0.8707747 ]
+ [ 0.05058625 1.34573057]
+ [-0.31952341 0.77285922]
+ [ 2.62229391 -0.83363941]]
+
+>>> print(mca.fit_transform(X))
+[[ 0.55603013 0.7016272 ]
+ [-0.73558629 -1.17559462]
+ [-0.44972794 -0.4973024 ]
+ [-0.16248444 0.95706908]
+ [-0.66969377 -0.79951057]
+ [-0.21267777 0.39953562]
+ [-0.67921667 -0.8707747 ]
+ [ 0.05058625 1.34573057]
+ [-0.31952341 0.77285922]
+ [ 2.62229391 -0.83363941]]
+
+```
+### Multiple-Factor-Analysis: MFA
+The `MFA` class inherits from the `PCA` class.
+Since the `FAMD` class inherits from `MFA`, and the only extra step in `FAMD` compared to its superclass `MFA` is determining the `groups` parameter, we skip this chapter and go directly to `FAMD`.
+
+### Factor-Analysis-of-Mixed-Data: FAMD
+The `FAMD` class inherits from the `MFA` class, which means you have access to all of the methods and properties of `MFA`.
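Since the MFA chapter is skipped, here is a conceptual sketch of the group weighting MFA performs and FAMD builds on: each group of variables is centred and divided by its leading singular value so that no single group dominates, and a global PCA is then run on the weighted blocks. The `groups` mapping and column names below are illustrative, not Light_FAMD's exact API:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(20, 4), columns=['a1', 'a2', 'b1', 'b2'])
groups = {'A': ['a1', 'a2'], 'B': ['b1', 'b2']}  # illustrative grouping

weighted = []
for cols in groups.values():
    block = X[cols] - X[cols].mean()                       # centre the group
    top_sv = np.linalg.svd(block.to_numpy(), compute_uv=False)[0]
    weighted.append(block / top_sv)                        # balance group influence

# Global PCA on the concatenated, group-weighted blocks.
Z = pd.concat(weighted, axis=1)
components = PCA(n_components=2).fit_transform(Z)
print(components.shape)
```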
+``` +>>>import pandas as pd +>>>from light_famd import FAMD +>>>X_n = pd.DataFrame(data=np.random.randint(0,100,size=(10,2)),columns=list('AB')) +>>>X_c =pd.DataFrame(np.random.choice(list('abcde'),size=(10,4),replace=True),columns =list('CDEF')) +>>>X=pd.concat([X_n,X_c],axis=1) +>>>print(X) + A B C D E F +0 96 19 b d b e +1 11 46 b d a e +2 0 89 a a a c +3 13 63 c a e d +4 37 36 d b e c +5 10 99 a b d c +6 76 2 c a d e +7 32 5 c a e d +8 49 9 c e e e +9 4 22 c c b d + +>>>famd = FAMD(n_components=2) +>>>famd.fit(X) +MCA PROCESS MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99% +Out: +FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2, + random_state=None) + +>>>print(famd.explained_variance_) +[17.40871219 9.73440949] + +>>>print(famd.explained_variance_ratio_) +[0.32596621039327284, 0.1822701494502082] + +>>> print(famd.column_correlation(X)) + 0 1 +A NaN NaN +B NaN NaN +C_a NaN NaN +C_b NaN 0.824458 +C_c 0.922220 NaN +C_d NaN NaN +D_a NaN NaN +D_b NaN NaN +D_c NaN NaN +D_d NaN 0.824458 +D_e NaN NaN +E_a NaN NaN +E_b NaN NaN +E_d NaN NaN +E_e NaN NaN +F_c NaN -0.714447 +F_d 0.673375 NaN +F_e NaN 0.839324 + + + +>>>print(famd.transform(X)) +[[ 2.23848136 5.75809647] + [ 2.0845175 4.78930072] + [ 2.6682068 -2.78991262] + [ 6.2962962 -1.57451325] + [ 2.52140085 -3.28279729] + [ 1.58256681 -3.73135011] + [ 5.19476759 1.18333717] + [ 6.35288446 -1.33186723] + [ 5.02971134 1.6216402 ] + [ 4.05754963 0.69620997]] + +>>>print(famd.fit_transform(X)) +MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99% +[[ 2.23848136 5.75809647] + [ 2.0845175 4.78930072] + [ 2.6682068 -2.78991262] + [ 6.2962962 -1.57451325] + [ 2.52140085 -3.28279729] + [ 1.58256681 -3.73135011] + [ 5.19476759 1.18333717] + [ 6.35288446 -1.33186723] + [ 5.02971134 1.6216402 ] + [ 4.05754963 0.69620997]] + +``` + + + + +## Going faster + +By default `light_famd` uses `sklearn`'s randomized SVD implementation. 
One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend. For now the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`, or see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that automatically selects the SVD solver depending on the structure of the input:
+
+```python
+>>> import Light_FAMD
+>>> pca = Light_FAMD.PCA(engine='fbpca')
+```
+
+%package -n python3-light-famd
+Summary: Light Factor Analysis of Mixed Data
+Provides: python-light-famd
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-light-famd
+
+# Light_FAMD
+
+`Light_FAMD` is a library for processing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.
+
+## Table of contents
+
+- [Usage](#usage)
+  - [Guidelines](#guidelines)
+  - [Principal component analysis (PCA)](#principal-component-analysis-pca)
+  - [Correspondence analysis (CA)](#correspondence-analysis-ca)
+  - [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
+  - [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
+  - [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
+- [Going faster](#going-faster)
+
+`Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`), which are included with Anaconda.
+
+### Guidelines
+
+Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means the `fit_transform`, `set_params` and `get_params` methods can be used directly.
+
+Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html). This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components. To obtain a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the `random_state` parameter.
+
+The randomised version of SVD is an iterative method. Because each of `light_famd`'s algorithms uses SVD, they all possess an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be; on the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.
+
+In this package the inheritance relationships are as shown below (A -> B: A is the superclass of B):
+
+- PCA -> MFA -> FAMD
+- CA -> MCA
+
+Choose a method depending on your situation:
+
+- All your variables are numeric: use principal component analysis (`PCA`)
+- You have a contingency table: use correspondence analysis (`CA`)
+- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`MCA`)
+- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
+- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)
+
+The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
+
+- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
+- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
+- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
+- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
+- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
+- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)
+
+Notice that `Light_FAMD` doesn't support sparse input; see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and big data.
+
+### Principal-Component-Analysis: PCA
+
+**PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3,
+    copy=True, check_input=True, random_state=None, engine='auto'):
+
+**Args:**
+- `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
+- `rescale_with_std` (bool): Whether to divide each column by it's standard deviation or not. +- `n_components` (int): The number of principal components to compute. +- `n_iter` (int): The number of iterations used for computing the SVD. +- `copy` (bool): Whether to perform the computations inplace or not. +- `check_input` (bool): Whether to check the consistency of the inputs or not. +- `engine`(string):"auto":randomized_svd,"fbpca":Facebook's randomized SVD implementation +- `random_state`(int, RandomState instance or None, optional (default=None):The seed of the -pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. +Return ndarray (M,k),M:Number of samples,K:Number of components. + +**Examples:** +``` +>>>import numpy as np +>>> np.random.seed(42) # This is for doctests reproducibility + +>>>from light_famd import PCA +>>>X = pd.DataFrame(np.random.randint(0,10,size=(10,3)),columns=list('ABC')) +>>>pca = PCA(n_components=2) +>>>pca.fit(X) +PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3, + random_state=None, rescale_with_mean=True, rescale_with_std=True) + +>>>print(pca.explained_variance_) +[20.20385109 8.48246239] + +>>>print(pca.explained_variance_ratio_) +[0.6734617029875277, 0.28274874633810754] +>>>print(pca.column_correlation(X)) # pearson correlation between component and original column,while p-value >=0.05 this similarity is `Nan`. 
+ 0 1 +A -0.953482 NaN +B 0.907314 NaN +C NaN 0.84211 + +>>>print(pca.transform(X)) +[[-0.82262005 0.11730656] + [ 0.05359079 1.62298683] + [ 1.03052849 0.79973099] + [-0.24313366 0.25651395] + [-0.94630387 -1.04943025] + [-0.70591749 -0.01282583] + [-0.39948373 -1.52612436] + [ 2.70164194 0.38048482] + [-2.49373351 0.53655273] + [ 1.8254311 -1.12519545]] +>>>print(pca.fit_transform(X)) +[[-0.82262005 0.11730656] + [ 0.05359079 1.62298683] + [ 1.03052849 0.79973099] + [-0.24313366 0.25651395] + [-0.94630387 -1.04943025] + [-0.70591749 -0.01282583] + [-0.39948373 -1.52612436] + [ 2.70164194 0.38048482] + [-2.49373351 0.53655273] + [ 1.8254311 -1.12519545]] + +``` +### Correspondence-Analysis: CA + +**CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None, + engine='auto'): + +**Args:** +- `n_components` (int): The number of principal components to compute. +- `copy` (bool): Whether to perform the computations inplace or not. +- `check_input` (bool): Whether to check the consistency of the inputs or not. +- `engine`(string):"auto":randomized_svd,"fbpca":Facebook's randomized SVD implementation +- `random_state`(int, RandomState instance or None, optional (default=None):The seed of the -pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. + +Return ndarray (M,k),M:Number of samples,K:Number of components. 
+ +**Examples:** +``` +>>>import numpy as np +>>>from light_famd import CA +>>>X = pd.DataFrame(data=np.random.randint(0,100,size=(10,4)),columns=list('ABCD')) +>>>ca=CA(n_components=2,n_iter=2) +>>>ca.fit(X) +CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2, + random_state=None) + +>>> print(ca.explained_variance_) +[0.16892141 0.0746376 ] + +>>>print(ca.explained_variance_ratio_) +[0.5650580210934917, 0.2496697790527281] + +>>>print(ca.transform(X)) +[[ 0.23150854 -0.39167802] + [ 0.36006095 0.00301414] + [-0.48192602 -0.13002647] + [-0.06333533 -0.21475652] + [-0.16438708 -0.10418312] + [-0.38129126 -0.16515196] + [ 0.2721296 0.46923757] + [ 0.82953753 0.20638333] + [-0.500007 0.36897935] + [ 0.57932474 -0.1023383 ]] + +>>>print(ca.fit_transform(X)) +[[ 0.23150854 -0.39167802] + [ 0.36006095 0.00301414] + [-0.48192602 -0.13002647] + [-0.06333533 -0.21475652] + [-0.16438708 -0.10418312] + [-0.38129126 -0.16515196] + [ 0.2721296 0.46923757] + [ 0.82953753 0.20638333] + [-0.500007 0.36897935] + [ 0.57932474 -0.1023383 ]] +``` + +### Multiple-Correspondence-Analysis: MCA +MCA class inherits from CA class. 
+ +``` +>>>import pandas as pd +>>>from light_famd import MCA +>>>X=pd.DataFrame(np.random.choice(list('abcde'),size=(10,4),replace=True),columns =list('ABCD')) +>>>print(X) + A B C D +0 d e a d +1 e d b b +2 e d a e +3 b b e d +4 b d b b +5 c b a e +6 e d b a +7 d c d d +8 b c d a +9 a e c c +>>>mca=MCA(n_components=2) +>>>mca.fit(X) +MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10, + random_state=None) + +>>>print(mca.explained_variance_) +[0.90150495 0.76979456] + +>>>print(mca.explained_variance_ratio_) +[0.24040131974598467, 0.20527854948955893] + +>>>print(mca.transform(X)) +[[ 0.55603013 0.7016272 ] + [-0.73558629 -1.17559462] + [-0.44972794 -0.4973024 ] + [-0.16248444 0.95706908] + [-0.66969377 -0.79951057] + [-0.21267777 0.39953562] + [-0.67921667 -0.8707747 ] + [ 0.05058625 1.34573057] + [-0.31952341 0.77285922] + [ 2.62229391 -0.83363941]] + +>>>print(mca.fit_transform(X)) +[[ 0.55603013 0.7016272 ] + [-0.73558629 -1.17559462] + [-0.44972794 -0.4973024 ] + [-0.16248444 0.95706908] + [-0.66969377 -0.79951057] + [-0.21267777 0.39953562] + [-0.67921667 -0.8707747 ] + [ 0.05058625 1.34573057] + [-0.31952341 0.77285922] + [ 2.62229391 -0.83363941]] + +``` +### Multiple-Factor-Analysis: MFA +MFA class inherits from PCA class. +Since FAMD class inherits from MFA and the only thing to do for FAMD is to determine `groups` parameter compare to its superclass `MFA`.therefore we skip this chapiter and go directly to `FAMD`. + + +### Factor-Analysis-of-Mixed-Data: FAMD +The `FAMD` inherits from the `MFA` class, which entails that you have access to all it's methods and properties of `MFA` class. 
+``` +>>>import pandas as pd +>>>from light_famd import FAMD +>>>X_n = pd.DataFrame(data=np.random.randint(0,100,size=(10,2)),columns=list('AB')) +>>>X_c =pd.DataFrame(np.random.choice(list('abcde'),size=(10,4),replace=True),columns =list('CDEF')) +>>>X=pd.concat([X_n,X_c],axis=1) +>>>print(X) + A B C D E F +0 96 19 b d b e +1 11 46 b d a e +2 0 89 a a a c +3 13 63 c a e d +4 37 36 d b e c +5 10 99 a b d c +6 76 2 c a d e +7 32 5 c a e d +8 49 9 c e e e +9 4 22 c c b d + +>>>famd = FAMD(n_components=2) +>>>famd.fit(X) +MCA PROCESS MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99% +Out: +FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2, + random_state=None) + +>>>print(famd.explained_variance_) +[17.40871219 9.73440949] + +>>>print(famd.explained_variance_ratio_) +[0.32596621039327284, 0.1822701494502082] + +>>> print(famd.column_correlation(X)) + 0 1 +A NaN NaN +B NaN NaN +C_a NaN NaN +C_b NaN 0.824458 +C_c 0.922220 NaN +C_d NaN NaN +D_a NaN NaN +D_b NaN NaN +D_c NaN NaN +D_d NaN 0.824458 +D_e NaN NaN +E_a NaN NaN +E_b NaN NaN +E_d NaN NaN +E_e NaN NaN +F_c NaN -0.714447 +F_d 0.673375 NaN +F_e NaN 0.839324 + + + +>>>print(famd.transform(X)) +[[ 2.23848136 5.75809647] + [ 2.0845175 4.78930072] + [ 2.6682068 -2.78991262] + [ 6.2962962 -1.57451325] + [ 2.52140085 -3.28279729] + [ 1.58256681 -3.73135011] + [ 5.19476759 1.18333717] + [ 6.35288446 -1.33186723] + [ 5.02971134 1.6216402 ] + [ 4.05754963 0.69620997]] + +>>>print(famd.fit_transform(X)) +MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99% +[[ 2.23848136 5.75809647] + [ 2.0845175 4.78930072] + [ 2.6682068 -2.78991262] + [ 6.2962962 -1.57451325] + [ 2.52140085 -3.28279729] + [ 1.58256681 -3.73135011] + [ 5.19476759 1.18333717] + [ 6.35288446 -1.33186723] + [ 5.02971134 1.6216402 ] + [ 4.05754963 0.69620997]] + +``` + + + + +## Going faster + +By default `light_famd` uses `sklearn`'s randomized SVD implementation. 
One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend. For now the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`, or see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that automatically selects the SVD solver depending on the structure of the input:
+
+```python
+>>> import Light_FAMD
+>>> pca = Light_FAMD.PCA(engine='fbpca')
+```
+
+%package help
+Summary: Development documents and examples for light-famd
+Provides: python3-light-famd-doc
+%description help
+
+# Light_FAMD
+
+`Light_FAMD` is a library for processing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.
+
+## Table of contents
+
+- [Usage](#usage)
+  - [Guidelines](#guidelines)
+  - [Principal component analysis (PCA)](#principal-component-analysis-pca)
+  - [Correspondence analysis (CA)](#correspondence-analysis-ca)
+  - [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
+  - [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
+  - [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
+- [Going faster](#going-faster)
+
+`Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`), which are included with Anaconda.
+
+### Guidelines
+
+Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means the `fit_transform`, `set_params` and `get_params` methods can be used directly.
+
+Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html). This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components. To obtain a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the `random_state` parameter.
+
+The randomised version of SVD is an iterative method. Because each of `light_famd`'s algorithms uses SVD, they all possess an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be; on the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.
+
+In this package the inheritance relationships are as shown below (A -> B: A is the superclass of B):
+
+- PCA -> MFA -> FAMD
+- CA -> MCA
+
+Choose a method depending on your situation:
+
+- All your variables are numeric: use principal component analysis (`PCA`)
+- You have a contingency table: use correspondence analysis (`CA`)
+- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`MCA`)
+- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
+- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)
+
+The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
+
+- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
+- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
+- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
+- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
+- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
+- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)
+
+Notice that `Light_FAMD` doesn't support sparse input; see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and big data.
+
+### Principal-Component-Analysis: PCA
+
+**PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3,
+    copy=True, check_input=True, random_state=None, engine='auto'):
+
+**Args:**
+- `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
+- `rescale_with_std` (bool): Whether to divide each column by it's standard deviation or not. +- `n_components` (int): The number of principal components to compute. +- `n_iter` (int): The number of iterations used for computing the SVD. +- `copy` (bool): Whether to perform the computations inplace or not. +- `check_input` (bool): Whether to check the consistency of the inputs or not. +- `engine`(string):"auto":randomized_svd,"fbpca":Facebook's randomized SVD implementation +- `random_state`(int, RandomState instance or None, optional (default=None):The seed of the -pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. +Return ndarray (M,k),M:Number of samples,K:Number of components. + +**Examples:** +``` +>>>import numpy as np +>>> np.random.seed(42) # This is for doctests reproducibility + +>>>from light_famd import PCA +>>>X = pd.DataFrame(np.random.randint(0,10,size=(10,3)),columns=list('ABC')) +>>>pca = PCA(n_components=2) +>>>pca.fit(X) +PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3, + random_state=None, rescale_with_mean=True, rescale_with_std=True) + +>>>print(pca.explained_variance_) +[20.20385109 8.48246239] + +>>>print(pca.explained_variance_ratio_) +[0.6734617029875277, 0.28274874633810754] +>>>print(pca.column_correlation(X)) # pearson correlation between component and original column,while p-value >=0.05 this similarity is `Nan`. 
+          0        1
+A -0.953482      NaN
+B  0.907314      NaN
+C       NaN  0.84211
+
+>>>print(pca.transform(X))
+[[-0.82262005 0.11730656]
+ [ 0.05359079 1.62298683]
+ [ 1.03052849 0.79973099]
+ [-0.24313366 0.25651395]
+ [-0.94630387 -1.04943025]
+ [-0.70591749 -0.01282583]
+ [-0.39948373 -1.52612436]
+ [ 2.70164194 0.38048482]
+ [-2.49373351 0.53655273]
+ [ 1.8254311 -1.12519545]]
+>>>print(pca.fit_transform(X))
+[[-0.82262005 0.11730656]
+ [ 0.05359079 1.62298683]
+ [ 1.03052849 0.79973099]
+ [-0.24313366 0.25651395]
+ [-0.94630387 -1.04943025]
+ [-0.70591749 -0.01282583]
+ [-0.39948373 -1.52612436]
+ [ 2.70164194 0.38048482]
+ [-2.49373351 0.53655273]
+ [ 1.8254311 -1.12519545]]
+
+```
+### Correspondence-Analysis: CA
+
+**CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None,
+    engine='auto'):
+
+**Args:**
+- `n_components` (int): The number of principal components to compute.
+- `n_iter` (int): The number of iterations used for computing the SVD.
+- `copy` (bool): Whether to perform the computations in place or not.
+- `check_input` (bool): Whether to check the consistency of the inputs or not.
+- `engine` (string): `"auto"` uses scikit-learn's randomized SVD; `"fbpca"` uses Facebook's randomized SVD implementation.
+- `random_state` (int, RandomState instance or None, default None): The seed of the pseudo-random number generator to use when shuffling the data. If int, it is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the RandomState instance used by `np.random` is used.
+
+Returns an ndarray of shape (M, K), where M is the number of samples and K is the number of components.
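For intuition, CA diagonalizes the matrix of standardized residuals of the contingency table: with correspondence matrix `P = N / n`, row masses `r`, and column masses `c`, the squared singular values of `S = (P - r c^T) / sqrt(r_i c_j)` are the principal inertias reported in `explained_variance_`. A minimal NumPy sketch (an illustration of the math, not the library's internal code):

```python
import numpy as np

# Toy contingency table of counts (rows x columns).
N = np.array([[20, 10,  5],
              [10, 25, 10],
              [ 5, 10, 30]], dtype=float)

P = N / N.sum()          # correspondence matrix
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# Standardized residuals: (P - r c^T) / sqrt(r_i * c_j)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The squared singular values of S are the principal inertias,
# i.e. the quantities CA reports as explained variances.
_, singular_values, _ = np.linalg.svd(S)
print(singular_values ** 2)
```

The sum of these squared singular values is the total inertia, which equals the chi-square statistic of the table divided by the grand total.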
+
+**Examples:**
+```
+>>>import numpy as np
+>>>import pandas as pd
+>>>from light_famd import CA
+>>>X = pd.DataFrame(data=np.random.randint(0,100,size=(10,4)),columns=list('ABCD'))
+>>>ca=CA(n_components=2,n_iter=2)
+>>>ca.fit(X)
+CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
+  random_state=None)
+
+>>> print(ca.explained_variance_)
+[0.16892141 0.0746376 ]
+
+>>>print(ca.explained_variance_ratio_)
+[0.5650580210934917, 0.2496697790527281]
+
+>>>print(ca.transform(X))
+[[ 0.23150854 -0.39167802]
+ [ 0.36006095 0.00301414]
+ [-0.48192602 -0.13002647]
+ [-0.06333533 -0.21475652]
+ [-0.16438708 -0.10418312]
+ [-0.38129126 -0.16515196]
+ [ 0.2721296 0.46923757]
+ [ 0.82953753 0.20638333]
+ [-0.500007 0.36897935]
+ [ 0.57932474 -0.1023383 ]]
+
+>>>print(ca.fit_transform(X))
+[[ 0.23150854 -0.39167802]
+ [ 0.36006095 0.00301414]
+ [-0.48192602 -0.13002647]
+ [-0.06333533 -0.21475652]
+ [-0.16438708 -0.10418312]
+ [-0.38129126 -0.16515196]
+ [ 0.2721296 0.46923757]
+ [ 0.82953753 0.20638333]
+ [-0.500007 0.36897935]
+ [ 0.57932474 -0.1023383 ]]
+```
+
+### Multiple-Correspondence-Analysis: MCA
+The `MCA` class inherits from the `CA` class.
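Conceptually, MCA is just CA applied to the one-hot indicator matrix of the categorical table. A small pandas sketch of that indicator matrix (illustrative, not the library's internals):

```python
import pandas as pd

X = pd.DataFrame({'A': ['a', 'b', 'a'],
                  'B': ['x', 'x', 'y']})

# One indicator column per (variable, category) pair.
Z = pd.get_dummies(X)
print(Z.columns.tolist())        # ['A_a', 'A_b', 'B_x', 'B_y']

# Every row sums to the number of original variables (here 2),
# which is what makes CA on Z equivalent to MCA on X.
print(Z.sum(axis=1).tolist())    # [2, 2, 2]
```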
+
+```
+>>>import numpy as np
+>>>import pandas as pd
+>>>from light_famd import MCA
+>>>X=pd.DataFrame(np.random.choice(list('abcde'),size=(10,4),replace=True),columns=list('ABCD'))
+>>>print(X)
+   A  B  C  D
+0  d  e  a  d
+1  e  d  b  b
+2  e  d  a  e
+3  b  b  e  d
+4  b  d  b  b
+5  c  b  a  e
+6  e  d  b  a
+7  d  c  d  d
+8  b  c  d  a
+9  a  e  c  c
+>>>mca=MCA(n_components=2)
+>>>mca.fit(X)
+MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
+  random_state=None)
+
+>>>print(mca.explained_variance_)
+[0.90150495 0.76979456]
+
+>>>print(mca.explained_variance_ratio_)
+[0.24040131974598467, 0.20527854948955893]
+
+>>>print(mca.transform(X))
+[[ 0.55603013 0.7016272 ]
+ [-0.73558629 -1.17559462]
+ [-0.44972794 -0.4973024 ]
+ [-0.16248444 0.95706908]
+ [-0.66969377 -0.79951057]
+ [-0.21267777 0.39953562]
+ [-0.67921667 -0.8707747 ]
+ [ 0.05058625 1.34573057]
+ [-0.31952341 0.77285922]
+ [ 2.62229391 -0.83363941]]
+
+>>>print(mca.fit_transform(X))
+[[ 0.55603013 0.7016272 ]
+ [-0.73558629 -1.17559462]
+ [-0.44972794 -0.4973024 ]
+ [-0.16248444 0.95706908]
+ [-0.66969377 -0.79951057]
+ [-0.21267777 0.39953562]
+ [-0.67921667 -0.8707747 ]
+ [ 0.05058625 1.34573057]
+ [-0.31952341 0.77285922]
+ [ 2.62229391 -0.83363941]]
+
+```
+### Multiple-Factor-Analysis: MFA
+The `MFA` class inherits from the `PCA` class.
+Since `FAMD` inherits from `MFA`, and the only thing `FAMD` adds over its superclass `MFA` is determining the `groups` parameter, we skip this chapter and go directly to `FAMD`.
+
+
+### Factor-Analysis-of-Mixed-Data: FAMD
+The `FAMD` class inherits from the `MFA` class, which means you have access to all the methods and properties of the `MFA` class.
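Under the hood, FAMD balances the two variable types before a joint SVD: numeric columns are standardized as in PCA, and categorical columns are one-hot encoded, centered, and weighted by their category proportions as in MCA. A rough preprocessing sketch (an illustration of the idea, not the library's exact internals):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'A': [96, 11, 0, 13],
                  'B': [19, 46, 89, 63],
                  'C': ['b', 'b', 'a', 'c']})

num = X.select_dtypes('number')
cat = X.select_dtypes(exclude='number')

# Numeric part: standardize (zero mean, unit variance), as PCA does.
Z_num = (num - num.mean()) / num.std(ddof=0)

# Categorical part: one-hot encode, then center and rescale each
# indicator column by the square root of its category proportion
# (MCA-style weighting).
D = pd.get_dummies(cat).astype(float)
p = D.mean()
Z_cat = (D - p) / np.sqrt(p)

# Joint matrix fed to the SVD: 2 numeric + 3 indicator columns here.
Z = pd.concat([Z_num, Z_cat], axis=1)
print(Z.shape)   # (4, 5)
```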
+```
+>>>import numpy as np
+>>>import pandas as pd
+>>>from light_famd import FAMD
+>>>X_n = pd.DataFrame(data=np.random.randint(0,100,size=(10,2)),columns=list('AB'))
+>>>X_c = pd.DataFrame(np.random.choice(list('abcde'),size=(10,4),replace=True),columns=list('CDEF'))
+>>>X = pd.concat([X_n,X_c],axis=1)
+>>>print(X)
+    A   B  C  D  E  F
+0  96  19  b  d  b  e
+1  11  46  b  d  a  e
+2   0  89  a  a  a  c
+3  13  63  c  a  e  d
+4  37  36  d  b  e  c
+5  10  99  a  b  d  c
+6  76   2  c  a  d  e
+7  32   5  c  a  e  d
+8  49   9  c  e  e  e
+9   4  22  c  c  b  d
+
+>>>famd = FAMD(n_components=2)
+>>>famd.fit(X)
+MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99%
+FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
+  random_state=None)
+
+>>>print(famd.explained_variance_)
+[17.40871219 9.73440949]
+
+>>>print(famd.explained_variance_ratio_)
+[0.32596621039327284, 0.1822701494502082]
+
+>>> print(famd.column_correlation(X))
+            0         1
+A         NaN       NaN
+B         NaN       NaN
+C_a       NaN       NaN
+C_b       NaN  0.824458
+C_c  0.922220       NaN
+C_d       NaN       NaN
+D_a       NaN       NaN
+D_b       NaN       NaN
+D_c       NaN       NaN
+D_d       NaN  0.824458
+D_e       NaN       NaN
+E_a       NaN       NaN
+E_b       NaN       NaN
+E_d       NaN       NaN
+E_e       NaN       NaN
+F_c       NaN -0.714447
+F_d  0.673375       NaN
+F_e       NaN  0.839324
+
+>>>print(famd.transform(X))
+[[ 2.23848136 5.75809647]
+ [ 2.0845175 4.78930072]
+ [ 2.6682068 -2.78991262]
+ [ 6.2962962 -1.57451325]
+ [ 2.52140085 -3.28279729]
+ [ 1.58256681 -3.73135011]
+ [ 5.19476759 1.18333717]
+ [ 6.35288446 -1.33186723]
+ [ 5.02971134 1.6216402 ]
+ [ 4.05754963 0.69620997]]
+
+>>>print(famd.fit_transform(X))
+MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99%
+[[ 2.23848136 5.75809647]
+ [ 2.0845175 4.78930072]
+ [ 2.6682068 -2.78991262]
+ [ 6.2962962 -1.57451325]
+ [ 2.52140085 -3.28279729]
+ [ 1.58256681 -3.73135011]
+ [ 5.19476759 1.18333717]
+ [ 6.35288446 -1.33186723]
+ [ 5.02971134 1.6216402 ]
+ [ 4.05754963 0.69620997]]
+
+```
+
+
+## Going faster
+
+By default `light_famd` uses `sklearn`'s randomized SVD implementation.
One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend. For now, the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`, or see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that selects the SVD solver automatically depending on the structure of the input:
+
+```python
+>>> from light_famd import PCA
+>>> pca = PCA(engine='fbpca')
+
+```
+
+
+
+
+
+%prep
+%autosetup -n light-famd-0.0.3
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-light-famd -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 17 2023 Python_Bot <Python_Bot@openeuler.org> - 0.0.3-1
+- Package Spec generated
@@ -0,0 +1 @@
+bc6e6e28443acc65cd49a6f6778d1018 light_famd-0.0.3.tar.gz
