%global _empty_manifest_terminate_build 0
Name:           python-light-famd
Version:        0.0.3
Release:        1
Summary:        Light Factor Analysis of Mixed Data
License:        BSD License
URL:            https://github.com/Cauchemare/Light_FAMD
Source0:        https://mirrors.aliyun.com/pypi/web/packages/9e/f0/60e56c2e3c00e33cfeab5d54dfdb917fa960fd8d178fb57be1320af7010b/light_famd-0.0.3.tar.gz
BuildArch:      noarch

Requires:       python3-scikit-learn
Requires:       python3-scipy
Requires:       python3-pandas
Requires:       python3-numpy

%description
# Light_FAMD

`Light_FAMD` is a library for processing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.

## Table of contents

- [Usage](#usage)
- [Guidelines](#guidelines)
- [Principal component analysis (PCA)](#principal-component-analysis-pca)
- [Correspondence analysis (CA)](#correspondence-analysis-ca)
- [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
- [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
- [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
- [Going faster](#going-faster)

`Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`), which are included with Anaconda.

### Guidelines

Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means the `fit_transform`, `set_params` and `get_params` methods can be used directly.

Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html).
This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you only want to extract a small number of components. To obtain a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the `random_state` parameter.

The randomised version of SVD is an iterative method. Because each of `light_famd`'s algorithms uses SVD, they all have an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be. On the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.

In this package the inheritance relationships are as shown below (A -> B: A is the superclass of B):

- PCA -> MFA -> FAMD
- CA -> MCA

Choose the method that matches your situation:

- All your variables are numeric: use principal component analysis (`PCA`)
- You have a contingency table: use correspondence analysis (`CA`)
- You have more than two variables and they are all categorical: use multiple correspondence analysis (`MCA`)
- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)

The next subsections give an overview of each method along with usage information.
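The `n_iter`/`random_state` behaviour described above can be seen directly with scikit-learn's `randomized_svd`, the routine `Light_FAMD` builds on. A minimal standalone sketch on random data (it does not require `light_famd` itself):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Random 100 x 50 matrix standing in for a preprocessed data table.
rng = np.random.RandomState(0)
X = rng.rand(100, 50)

# Approximate truncated SVD keeping the top 2 components.
# `n_iter` trades precision for speed; a fixed `random_state`
# makes the randomised result reproducible.
U, s, Vt = randomized_svd(X, n_components=2, n_iter=3, random_state=42)
print(U.shape, s.shape, Vt.shape)  # (100, 2) (2,) (2, 50)

# Same seed -> identical results on a second run.
U2, s2, Vt2 = randomized_svd(X, n_components=2, n_iter=3, random_state=42)
print(np.allclose(s, s2))  # True
```

With `random_state=None` the two runs would generally differ slightly, which is why the guidelines above recommend fixing the seed for reproducible results.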
The following papers give a good overview of the field of factor analysis if you want to go deeper:

- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)

Note that `Light_FAMD` doesn't support sparse input; see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and big data.

### Principal-Component-Analysis: PCA

**PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3, copy=True, check_input=True, random_state=None, engine='auto'):

**Args:**
- `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
- `rescale_with_std` (bool): Whether to divide each column by its standard deviation or not.
- `n_components` (int): The number of principal components to compute.
- `n_iter` (int): The number of iterations used for computing the SVD.
- `copy` (bool): Whether to perform the computations in place or not.
- `check_input` (bool): Whether to check the consistency of the inputs or not.
- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data.
If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.

**Examples:**

```
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)  # This is for doctest reproducibility
>>> from light_famd import PCA
>>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
    random_state=None, rescale_with_mean=True, rescale_with_std=True)
>>> print(pca.explained_variance_)
[20.20385109  8.48246239]
>>> print(pca.explained_variance_ratio_)
[0.6734617029875277, 0.28274874633810754]
>>> print(pca.column_correlation(X))  # Pearson correlation between each component and each original column; entries with p-value >= 0.05 are NaN.
          0        1
A -0.953482      NaN
B  0.907314      NaN
C       NaN  0.84211
>>> print(pca.transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
>>> print(pca.fit_transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
```

### Correspondence-Analysis: CA

**CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None, engine='auto'):

**Args:**
- `n_components` (int): The number of principal components to compute.
- `n_iter` (int): The number of iterations used for computing the SVD.
- `copy` (bool): Whether to perform the computations in place or not.
- `check_input` (bool): Whether to check the consistency of the inputs or not.
- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data. If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.

**Examples:**

```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import CA
>>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> ca = CA(n_components=2, n_iter=2)
>>> ca.fit(X)
CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
   random_state=None)
>>> print(ca.explained_variance_)
[0.16892141 0.0746376 ]
>>> print(ca.explained_variance_ratio_)
[0.5650580210934917, 0.2496697790527281]
>>> print(ca.transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
>>> print(ca.fit_transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
```

### Multiple-Correspondence-Analysis: MCA

The `MCA` class inherits from the `CA` class.
```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import MCA
>>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
>>> print(X)
   A  B  C  D
0  d  e  a  d
1  e  d  b  b
2  e  d  a  e
3  b  b  e  d
4  b  d  b  b
5  c  b  a  e
6  e  d  b  a
7  d  c  d  d
8  b  c  d  a
9  a  e  c  c
>>> mca = MCA(n_components=2)
>>> mca.fit(X)
MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
    random_state=None)
>>> print(mca.explained_variance_)
[0.90150495 0.76979456]
>>> print(mca.explained_variance_ratio_)
[0.24040131974598467, 0.20527854948955893]
>>> print(mca.transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
>>> print(mca.fit_transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
```

### Multiple-Factor-Analysis: MFA

The `MFA` class inherits from the `PCA` class. Since `FAMD` inherits from `MFA`, and the only difference from its superclass is that `FAMD` determines the `groups` parameter itself, we skip this section and go directly to `FAMD`.

### Factor-Analysis-of-Mixed-Data: FAMD

`FAMD` inherits from the `MFA` class, so you have access to all of `MFA`'s methods and properties.
```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import FAMD
>>> X_n = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
>>> X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('CDEF'))
>>> X = pd.concat([X_n, X_c], axis=1)
>>> print(X)
    A   B  C  D  E  F
0  96  19  b  d  b  e
1  11  46  b  d  a  e
2   0  89  a  a  a  c
3  13  63  c  a  e  d
4  37  36  d  b  e  c
5  10  99  a  b  d  c
6  76   2  c  a  d  e
7  32   5  c  a  e  d
8  49   9  c  e  e  e
9   4  22  c  c  b  d
>>> famd = FAMD(n_components=2)
>>> famd.fit(X)
MCA PROCESS
MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99%
FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
     random_state=None)
>>> print(famd.explained_variance_)
[17.40871219  9.73440949]
>>> print(famd.explained_variance_ratio_)
[0.32596621039327284, 0.1822701494502082]
>>> print(famd.column_correlation(X))
            0         1
A         NaN       NaN
B         NaN       NaN
C_a       NaN       NaN
C_b       NaN  0.824458
C_c  0.922220       NaN
C_d       NaN       NaN
D_a       NaN       NaN
D_b       NaN       NaN
D_c       NaN       NaN
D_d       NaN  0.824458
D_e       NaN       NaN
E_a       NaN       NaN
E_b       NaN       NaN
E_d       NaN       NaN
E_e       NaN       NaN
F_c       NaN -0.714447
F_d  0.673375       NaN
F_e       NaN  0.839324
>>> print(famd.transform(X))
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]
>>> print(famd.fit_transform(X))
MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99%
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]
```

## Going faster

By default `light_famd` uses `sklearn`'s randomized SVD implementation. One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend.
For now the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`, or see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that selects the SVD solver automatically depending on the structure of the input:

```python
>>> import light_famd
>>> pca = light_famd.PCA(engine='fbpca')
```

%package -n python3-light-famd
Summary:        Light Factor Analysis of Mixed Data
Provides:       python-light-famd
BuildRequires:  python3-devel
BuildRequires:  python3-setuptools
BuildRequires:  python3-pip

%description -n python3-light-famd
# Light_FAMD

`Light_FAMD` is a library for processing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.

## Table of contents

- [Usage](#usage)
- [Guidelines](#guidelines)
- [Principal component analysis (PCA)](#principal-component-analysis-pca)
- [Correspondence analysis (CA)](#correspondence-analysis-ca)
- [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
- [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
- [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
- [Going faster](#going-faster)

`Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`), which are included with Anaconda.
### Guidelines

Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means the `fit_transform`, `set_params` and `get_params` methods can be used directly.

Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html). This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you only want to extract a small number of components. To obtain a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the `random_state` parameter.

The randomised version of SVD is an iterative method. Because each of `light_famd`'s algorithms uses SVD, they all have an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be. On the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.

In this package the inheritance relationships are as shown below (A -> B: A is the superclass of B):

- PCA -> MFA -> FAMD
- CA -> MCA

Choose the method that matches your situation:

- All your variables are numeric: use principal component analysis (`PCA`)
- You have a contingency table: use correspondence analysis (`CA`)
- You have more than two variables and they are all categorical: use multiple correspondence analysis (`MCA`)
- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)

The next subsections give an overview of each method along with usage information.
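The `n_iter`/`random_state` behaviour described above can be seen directly with scikit-learn's `randomized_svd`, the routine `Light_FAMD` builds on. A minimal standalone sketch on random data (it does not require `light_famd` itself):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Random 100 x 50 matrix standing in for a preprocessed data table.
rng = np.random.RandomState(0)
X = rng.rand(100, 50)

# Approximate truncated SVD keeping the top 2 components.
# `n_iter` trades precision for speed; a fixed `random_state`
# makes the randomised result reproducible.
U, s, Vt = randomized_svd(X, n_components=2, n_iter=3, random_state=42)
print(U.shape, s.shape, Vt.shape)  # (100, 2) (2,) (2, 50)

# Same seed -> identical results on a second run.
U2, s2, Vt2 = randomized_svd(X, n_components=2, n_iter=3, random_state=42)
print(np.allclose(s, s2))  # True
```

With `random_state=None` the two runs would generally differ slightly, which is why the guidelines above recommend fixing the seed for reproducible results.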
The following papers give a good overview of the field of factor analysis if you want to go deeper:

- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)

Note that `Light_FAMD` doesn't support sparse input; see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and big data.

### Principal-Component-Analysis: PCA

**PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3, copy=True, check_input=True, random_state=None, engine='auto'):

**Args:**
- `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
- `rescale_with_std` (bool): Whether to divide each column by its standard deviation or not.
- `n_components` (int): The number of principal components to compute.
- `n_iter` (int): The number of iterations used for computing the SVD.
- `copy` (bool): Whether to perform the computations in place or not.
- `check_input` (bool): Whether to check the consistency of the inputs or not.
- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data.
If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.

**Examples:**

```
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)  # This is for doctest reproducibility
>>> from light_famd import PCA
>>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
    random_state=None, rescale_with_mean=True, rescale_with_std=True)
>>> print(pca.explained_variance_)
[20.20385109  8.48246239]
>>> print(pca.explained_variance_ratio_)
[0.6734617029875277, 0.28274874633810754]
>>> print(pca.column_correlation(X))  # Pearson correlation between each component and each original column; entries with p-value >= 0.05 are NaN.
          0        1
A -0.953482      NaN
B  0.907314      NaN
C       NaN  0.84211
>>> print(pca.transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
>>> print(pca.fit_transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
```

### Correspondence-Analysis: CA

**CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None, engine='auto'):

**Args:**
- `n_components` (int): The number of principal components to compute.
- `n_iter` (int): The number of iterations used for computing the SVD.
- `copy` (bool): Whether to perform the computations in place or not.
- `check_input` (bool): Whether to check the consistency of the inputs or not.
- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data. If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.

**Examples:**

```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import CA
>>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> ca = CA(n_components=2, n_iter=2)
>>> ca.fit(X)
CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
   random_state=None)
>>> print(ca.explained_variance_)
[0.16892141 0.0746376 ]
>>> print(ca.explained_variance_ratio_)
[0.5650580210934917, 0.2496697790527281]
>>> print(ca.transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
>>> print(ca.fit_transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
```

### Multiple-Correspondence-Analysis: MCA

The `MCA` class inherits from the `CA` class.
```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import MCA
>>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
>>> print(X)
   A  B  C  D
0  d  e  a  d
1  e  d  b  b
2  e  d  a  e
3  b  b  e  d
4  b  d  b  b
5  c  b  a  e
6  e  d  b  a
7  d  c  d  d
8  b  c  d  a
9  a  e  c  c
>>> mca = MCA(n_components=2)
>>> mca.fit(X)
MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
    random_state=None)
>>> print(mca.explained_variance_)
[0.90150495 0.76979456]
>>> print(mca.explained_variance_ratio_)
[0.24040131974598467, 0.20527854948955893]
>>> print(mca.transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
>>> print(mca.fit_transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
```

### Multiple-Factor-Analysis: MFA

The `MFA` class inherits from the `PCA` class. Since `FAMD` inherits from `MFA`, and the only difference from its superclass is that `FAMD` determines the `groups` parameter itself, we skip this section and go directly to `FAMD`.

### Factor-Analysis-of-Mixed-Data: FAMD

`FAMD` inherits from the `MFA` class, so you have access to all of `MFA`'s methods and properties.
```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import FAMD
>>> X_n = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
>>> X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('CDEF'))
>>> X = pd.concat([X_n, X_c], axis=1)
>>> print(X)
    A   B  C  D  E  F
0  96  19  b  d  b  e
1  11  46  b  d  a  e
2   0  89  a  a  a  c
3  13  63  c  a  e  d
4  37  36  d  b  e  c
5  10  99  a  b  d  c
6  76   2  c  a  d  e
7  32   5  c  a  e  d
8  49   9  c  e  e  e
9   4  22  c  c  b  d
>>> famd = FAMD(n_components=2)
>>> famd.fit(X)
MCA PROCESS
MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99%
FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
     random_state=None)
>>> print(famd.explained_variance_)
[17.40871219  9.73440949]
>>> print(famd.explained_variance_ratio_)
[0.32596621039327284, 0.1822701494502082]
>>> print(famd.column_correlation(X))
            0         1
A         NaN       NaN
B         NaN       NaN
C_a       NaN       NaN
C_b       NaN  0.824458
C_c  0.922220       NaN
C_d       NaN       NaN
D_a       NaN       NaN
D_b       NaN       NaN
D_c       NaN       NaN
D_d       NaN  0.824458
D_e       NaN       NaN
E_a       NaN       NaN
E_b       NaN       NaN
E_d       NaN       NaN
E_e       NaN       NaN
F_c       NaN -0.714447
F_d  0.673375       NaN
F_e       NaN  0.839324
>>> print(famd.transform(X))
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]
>>> print(famd.fit_transform(X))
MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99%
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]
```

## Going faster

By default `light_famd` uses `sklearn`'s randomized SVD implementation. One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend.
For now the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`, or see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that selects the SVD solver automatically depending on the structure of the input:

```python
>>> import light_famd
>>> pca = light_famd.PCA(engine='fbpca')
```

%package help
Summary:        Development documents and examples for light-famd
Provides:       python3-light-famd-doc

%description help
# Light_FAMD

`Light_FAMD` is a library for processing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiple correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.

## Table of contents

- [Usage](#usage)
- [Guidelines](#guidelines)
- [Principal component analysis (PCA)](#principal-component-analysis-pca)
- [Correspondence analysis (CA)](#correspondence-analysis-ca)
- [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)
- [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)
- [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)
- [Going faster](#going-faster)

`Light_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`), which are included with Anaconda.

### Guidelines

Each base estimator (`CA`, `PCA`) provided by `Light_FAMD` extends scikit-learn's `TransformerMixin` and `BaseEstimator`, which means the `fit_transform`, `set_params` and `get_params` methods can be used directly.
Under the hood `Light_FAMD` uses a [randomised version of SVD](https://scikit-learn.org/dev/modules/generated/sklearn.utils.extmath.randomized_svd.html). This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomization to speed up the computations. It is particularly fast on large matrices from which you only want to extract a small number of components. To obtain a further speed-up, `n_iter` can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the `random_state` parameter.

The randomised version of SVD is an iterative method. Because each of `light_famd`'s algorithms uses SVD, they all have an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be. On the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.

In this package the inheritance relationships are as shown below (A -> B: A is the superclass of B):

- PCA -> MFA -> FAMD
- CA -> MCA

Choose the method that matches your situation:

- All your variables are numeric: use principal component analysis (`PCA`)
- You have a contingency table: use correspondence analysis (`CA`)
- You have more than two variables and they are all categorical: use multiple correspondence analysis (`MCA`)
- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)
- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)

The next subsections give an overview of each method along with usage information.
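The `n_iter`/`random_state` behaviour described above can be seen directly with scikit-learn's `randomized_svd`, the routine `Light_FAMD` builds on. A minimal standalone sketch on random data (it does not require `light_famd` itself):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Random 100 x 50 matrix standing in for a preprocessed data table.
rng = np.random.RandomState(0)
X = rng.rand(100, 50)

# Approximate truncated SVD keeping the top 2 components.
# `n_iter` trades precision for speed; a fixed `random_state`
# makes the randomised result reproducible.
U, s, Vt = randomized_svd(X, n_components=2, n_iter=3, random_state=42)
print(U.shape, s.shape, Vt.shape)  # (100, 2) (2,) (2, 50)

# Same seed -> identical results on a second run.
U2, s2, Vt2 = randomized_svd(X, n_components=2, n_iter=3, random_state=42)
print(np.allclose(s, s2))  # True
```

With `random_state=None` the two runs would generally differ slightly, which is why the guidelines above recommend fixing the seed for reproducible results.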
The following papers give a good overview of the field of factor analysis if you want to go deeper:

- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)

Note that `Light_FAMD` doesn't support sparse input; see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that handles sparse and big data.

### Principal-Component-Analysis: PCA

**PCA**(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3, copy=True, check_input=True, random_state=None, engine='auto'):

**Args:**
- `rescale_with_mean` (bool): Whether to subtract each column's mean or not.
- `rescale_with_std` (bool): Whether to divide each column by its standard deviation or not.
- `n_components` (int): The number of principal components to compute.
- `n_iter` (int): The number of iterations used for computing the SVD.
- `copy` (bool): Whether to perform the computations in place or not.
- `check_input` (bool): Whether to check the consistency of the inputs or not.
- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data.
If int, `random_state` is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.

**Examples:**

```
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)  # This is for doctest reproducibility
>>> from light_famd import PCA
>>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
    random_state=None, rescale_with_mean=True, rescale_with_std=True)
>>> print(pca.explained_variance_)
[20.20385109  8.48246239]
>>> print(pca.explained_variance_ratio_)
[0.6734617029875277, 0.28274874633810754]
>>> print(pca.column_correlation(X))  # Pearson correlation between each component and each original column; entries with p-value >= 0.05 are NaN.
          0        1
A -0.953482      NaN
B  0.907314      NaN
C       NaN  0.84211
>>> print(pca.transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
>>> print(pca.fit_transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
```

### Correspondence-Analysis: CA

**CA**(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None, engine='auto'):

**Args:**
- `n_components` (int): The number of principal components to compute.
- `n_iter` (int): The number of iterations used for computing the SVD.
- `copy` (bool): Whether to perform the computations in place or not.
- `check_input` (bool): Whether to check the consistency of the inputs or not.
- `engine` (string): `"auto"` for scikit-learn's `randomized_svd`, `"fbpca"` for Facebook's randomized SVD implementation.
- `random_state` (int, RandomState instance or None, optional, default=None): The seed of the pseudo random number generator to use when shuffling the data. If int, `random_state` is the seed used by the random number generator; if a RandomState instance, `random_state` is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`.

**Returns:** ndarray of shape (M, k), where M is the number of samples and k is the number of components.

**Examples:**

```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import CA
>>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> ca = CA(n_components=2, n_iter=2)
>>> ca.fit(X)
CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
  random_state=None)
>>> print(ca.explained_variance_)
[0.16892141 0.0746376 ]
>>> print(ca.explained_variance_ratio_)
[0.5650580210934917, 0.2496697790527281]
>>> print(ca.transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
>>> print(ca.fit_transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
```

### Multiple-Correspondence-Analysis: MCA

The MCA class inherits from the CA class.
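Conceptually, MCA amounts to running CA on the one-hot (indicator) matrix of the categorical table. Below is a minimal sketch of that core computation using only numpy and pandas; it is illustrative and not part of `Light_FAMD`'s API.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.choice(list('abcde'), size=(10, 4)), columns=list('ABCD'))

# One-hot encode the categorical table: MCA is CA on this indicator matrix.
Z = pd.get_dummies(X).to_numpy(dtype=float)

# CA core: standardized residuals of the correspondence matrix, then SVD.
P = Z / Z.sum()
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal row coordinates for the first two components.
row_coords = (U[:, :2] * sigma[:2]) / np.sqrt(r)[:, None]
print(row_coords.shape)  # (10, 2)
```

The squared singular values `sigma**2` play the role of the explained variances reported by `MCA.explained_variance_`.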
```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import MCA
>>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
>>> print(X)
   A  B  C  D
0  d  e  a  d
1  e  d  b  b
2  e  d  a  e
3  b  b  e  d
4  b  d  b  b
5  c  b  a  e
6  e  d  b  a
7  d  c  d  d
8  b  c  d  a
9  a  e  c  c
>>> mca = MCA(n_components=2)
>>> mca.fit(X)
MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
  random_state=None)
>>> print(mca.explained_variance_)
[0.90150495 0.76979456]
>>> print(mca.explained_variance_ratio_)
[0.24040131974598467, 0.20527854948955893]
>>> print(mca.transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
>>> print(mca.fit_transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
```

### Multiple-Factor-Analysis: MFA

The MFA class inherits from the PCA class. Since FAMD inherits from MFA and, compared to its superclass, only requires the `groups` parameter to be determined, we skip this chapter and go directly to FAMD.

### Factor-Analysis-of-Mixed-Data: FAMD

The FAMD class inherits from the MFA class, so you have access to all the methods and properties of MFA.
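FAMD can be thought of as a PCA run on a combined matrix in which numeric columns are standardized and categorical columns are one-hot encoded, weighted by level frequency, and centered. The sketch below illustrates that preprocessing with numpy and pandas; it is a simplified illustration of the idea, not `Light_FAMD`'s internal code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_n = pd.DataFrame(rng.integers(0, 100, size=(10, 2)), columns=list('AB'))
X_c = pd.DataFrame(rng.choice(list('abcde'), size=(10, 4)), columns=list('CDEF'))

# Standardize the numeric block (mean 0, unit variance).
num = (X_n - X_n.mean()) / X_n.std(ddof=0)

# One-hot encode the categorical block; weight each dummy column by
# 1/sqrt(frequency) and center it, so rare levels are not drowned out.
dummies = pd.get_dummies(X_c).astype(float)
freq = dummies.mean()
cat = (dummies / np.sqrt(freq)) - np.sqrt(freq)

# PCA (via SVD) on the combined matrix yields FAMD-style components.
M = pd.concat([num, cat], axis=1).to_numpy()
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
coords = U[:, :2] * sigma[:2]
print(coords.shape)  # (10, 2)
```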
```
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import FAMD
>>> X_n = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
>>> X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('CDEF'))
>>> X = pd.concat([X_n, X_c], axis=1)
>>> print(X)
    A   B  C  D  E  F
0  96  19  b  d  b  e
1  11  46  b  d  a  e
2   0  89  a  a  a  c
3  13  63  c  a  e  d
4  37  36  d  b  e  c
5  10  99  a  b  d  c
6  76   2  c  a  d  e
7  32   5  c  a  e  d
8  49   9  c  e  e  e
9   4  22  c  c  b  d
>>> famd = FAMD(n_components=2)
>>> famd.fit(X)
MCA PROCESS
MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99%
FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
  random_state=None)
>>> print(famd.explained_variance_)
[17.40871219  9.73440949]
>>> print(famd.explained_variance_ratio_)
[0.32596621039327284, 0.1822701494502082]
>>> print(famd.column_correlation(X))
            0         1
A         NaN       NaN
B         NaN       NaN
C_a       NaN       NaN
C_b       NaN  0.824458
C_c  0.922220       NaN
C_d       NaN       NaN
D_a       NaN       NaN
D_b       NaN       NaN
D_c       NaN       NaN
D_d       NaN  0.824458
D_e       NaN       NaN
E_a       NaN       NaN
E_b       NaN       NaN
E_d       NaN       NaN
E_e       NaN       NaN
F_c       NaN -0.714447
F_d  0.673375       NaN
F_e       NaN  0.839324
>>> print(famd.transform(X))
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]
>>> print(famd.fit_transform(X))
MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99%
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]
```

## Going faster

By default `light_famd` uses `sklearn`'s randomized SVD implementation. One of the goals of `Light_FAMD` is to make it possible to use a different SVD backend.
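The default backend is scikit-learn's `randomized_svd`, linked earlier in this README. It can also be called directly, which makes the roles of `n_iter` and `random_state` described above easy to see; the matrix here is just illustrative data.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))

# Fixing random_state makes the randomized decomposition reproducible;
# raising n_iter trades computation time for precision.
U, sigma, Vt = randomized_svd(A, n_components=2, n_iter=3, random_state=42)
print(U.shape, sigma.shape, Vt.shape)  # (1000, 2) (2,) (2, 50)
```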
For now the only other supported backend is [Facebook's randomized SVD implementation](https://research.facebook.com/blog/fast-randomized-svd/), called [fbpca](http://fbpca.readthedocs.org/en/latest/). You can use it by setting the `engine` parameter to `'fbpca'`, or see [Truncated_FAMD](https://github.com/Cauchemare/Truncated_FAMD) for an alternative that selects the SVD solver automatically depending on the structure of the input:

```python
>>> import light_famd
>>> pca = light_famd.PCA(engine='fbpca')
```

%prep
%autosetup -n light_famd-0.0.3

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
	find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
	find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
	find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
	find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
	find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .

%files -n python3-light-famd -f filelist.lst
%dir %{python3_sitelib}/*

%files help -f doclist.lst
%{_docdir}/*

%changelog
* Thu Jun 08 2023 Python_Bot - 0.0.3-1
- Package Spec generated