author    | CoprDistGit <infra@openeuler.org> | 2023-05-05 12:49:26 +0000
committer | CoprDistGit <infra@openeuler.org> | 2023-05-05 12:49:26 +0000
commit    | 9e24ec97814a2798b7141d7b64eff16d89d81727 (patch)
tree      | 4e50bf9b8a7e6d1aa8807805710b8c7d421e2b96
parent    | 1891e86409a1c6e904bd6b56861e7d6bd77e79c9 (diff)
automatic import of python-c-lasso (openeuler20.03)
-rw-r--r-- | .gitignore          |    1
-rw-r--r-- | python-c-lasso.spec | 1625
-rw-r--r-- | sources             |    1
3 files changed, 1627 insertions, 0 deletions
@@ -0,0 +1 @@ +/c-lasso-1.0.11.tar.gz diff --git a/python-c-lasso.spec b/python-c-lasso.spec new file mode 100644 index 0000000..76fc7ad --- /dev/null +++ b/python-c-lasso.spec @@ -0,0 +1,1625 @@ +%global _empty_manifest_terminate_build 0 +Name: python-c-lasso +Version: 1.0.11 +Release: 1 +Summary: Algorithms for constrained Lasso problems +License: MIT +URL: https://github.com/Leo-Simpson/CLasso +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/2f/2b/e668407260df3d2779b12d77eb85aa065cec19dad57e368a519d949c293f/c-lasso-1.0.11.tar.gz +BuildArch: noarch + +Requires: python3-numpy +Requires: python3-h5py +Requires: python3-scipy +Requires: python3-sphinx +Requires: python3-sphinx-gallery +Requires: python3-sphinx-rtd-theme +Requires: python3-numpydoc +Requires: python3-matplotlib +Requires: python3-pandas +Requires: python3-pytest +Requires: python3-pytest-cov + +%description +[](https://arxiv.org/abs/2011.00898) +[](https://doi.org/10.21105/joss.02844) + +<img src="https://i.imgur.com/2nGwlux.png" alt="c-lasso" height="145" align="right"/> + +# c-lasso: a Python package for constrained sparse regression and classification + + +c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality +constraints on the model parameters. For detailed info, one can check the [documentation](https://c-lasso.readthedocs.io/en/latest/). + +The forward model is assumed to be: + +<img src="https://latex.codecogs.com/gif.latex?y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad&space;C\beta=0" title="y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad C\beta=0" /> + +Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown coefficients and σ an +unknown scale. + +The package handles several different estimators for inferring β (and σ), including +the constrained Lasso, the constrained scaled Lasso, sparse Huber M-estimation with linear equality constraints, and regularized Support Vector Machines. +Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve +the underlying convex optimization problems. + +We also include two model selection strategies for determining the sparsity of the model parameters: k-fold cross-validation and stability selection. + +This package is intended to fill the gap between popular python tools such as [scikit-learn](https://scikit-learn.org/stable/) which CANNOT solve sparse constrained problems and general-purpose optimization solvers that do not scale well or are inaccurate (see [benchmarks](./benchmark/README.md)) for the considered problems. In its current stage, however, c-lasso is not yet compatible with the scikit-learn API but rather a stand-alone tool. + +Below we show several use cases of the package, including an application of sparse *log-contrast* +regression tasks for *compositional* microbiome data. + +The code builds on results from several papers which can be found in the [References](#references). We also refer to the accompanying [JOSS paper submission](https://github.com/Leo-Simpson/c-lasso/blob/master/paper/paper.md), also available on [arXiv](https://arxiv.org/pdf/2011.00898.pdf). 
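+
+As a quick illustration of the forward model above, here is a minimal NumPy sketch (not part of the package; the all-ones constraint matrix mirrors the zero-sum setting used in the examples below):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+n, d = 100, 20
+X = rng.standard_normal((n, d))          # predictor data
+
+C = np.ones((1, d))                      # zero-sum constraint: C @ beta = sum(beta) = 0
+beta = np.zeros(d)
+beta[:4] = [1.0, -1.0, 2.0, -2.0]        # sparse coefficients that sum to zero
+assert np.allclose(C @ beta, 0.0)
+
+sigma = 0.5
+y = X @ beta + sigma * rng.standard_normal(n)   # y = X beta + sigma * eps
+```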
+ +## Table of Contents + +* [Installation](#installation) +* [Regression and classification problems](#regression-and-classification-problems) +* [Getting started](#getting-started) +* [Log-contrast regression for microbiome data](#log-contrast-regression-for-microbiome-data) +* [Optimization schemes](#optimization-schemes) + + +* [References](#references) + + +## Installation + +c-lasso is available on pip. You can install the package +in the shell using + +```shell +pip install c-lasso +``` +To use the c-lasso package in Python, type + +```python + +from classo import classo_problem +# one can add auxiliary functions as well such as random_data or csv_to_np +``` + +The `c-lasso` package depends on the following Python packages: + +- `numpy`; +- `matplotlib`; +- `scipy`; +- `pandas`; +- `pytest` (for tests) + +## Regression and classification problems + +The c-lasso package can solve six different types of estimation problems: +four regression-type and two classification-type formulations. + +#### [R1] Standard constrained Lasso regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;||&space;X\beta-y&space;||^2&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" /> + +This is the standard Lasso problem with linear equality constraints on the β vector. +The objective function combines Least-Squares for model fitting with l1 penalty for sparsity. + +#### [R2] Constrained sparse Huber regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;h_{\rho}(X\beta-y&space;)&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" /> + +This regression problem uses the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) as objective function +for robust model fitting with l1 and linear equality constraints on the β vector. The parameter ρ=1.345. + +#### [R3] Constrained scaled Lasso regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\frac{||&space;X\beta&space;-&space;y||^2}{\sigma}&space;+&space;\frac{n}{2}&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \frac{|| X\beta - y||^2}{\sigma} + \frac{n}{2} \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" /> + +This formulation is similar to [R1] but allows for joint estimation of the (constrained) β vector and +the standard deviation σ in a concomitant fashion (see [References](#references) [4,5] for further info). +This is the default problem formulation in c-lasso. 
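+
+For orientation, the regression formulations above (and [R4], described next) are selected through flags on a `classo_problem` instance. A short sketch follows; the flag names are taken from the Advanced example later in this README, and the exact mapping is an assumption rather than documented API:
+
+```python
+from classo import classo_problem
+
+problem = classo_problem(X, y, C)        # X, y, C as in the forward-model sketch above
+
+# Default formulation is [R3] (concomitant scale estimation).
+problem.formulation.concomitant = False  # presumably switches to [R1]
+problem.formulation.huber = True         # with concomitant = False, presumably [R2]
+# huber = True together with concomitant = True would give [R4] (next subsection)
+```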
+ +#### [R4] Constrained sparse Huber regression with concomitant scale estimation: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\left(&space;h_{\rho}&space;\left(&space;\frac{&space;X\beta&space;-&space;y}{\sigma}&space;\right)+&space;n&space;\right)&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \left( h_{\rho} \left( \frac{ X\beta - y}{\sigma} \right)+ n \right) \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" /> + +This formulation combines [R2] and [R3] to allow robust joint estimation of the (constrained) β vector and +the scale σ in a concomitant fashion (see [References](#references) [4,5] for further info). + +#### [C1] Constrained sparse classification with Square Hinge loss: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d}&space;\sum_{i=1}^n&space;l(y_i&space;x_i^\top&space;\beta)&space;+&space;\lambda&space;\left\lVert&space;\beta\right\rVert_1&space;\qquad&space;s.t.&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n l(y_i x_i \beta) + \lambda \left\lVert \beta\right\rVert_1 \qquad s.t. \qquad C\beta = 0" /> + +where the x<sub>i</sub> are the rows of X and l is defined as: + +<img src="https://latex.codecogs.com/gif.latex?l(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&&space;if&space;\quad&space;r&space;\leq&space;1&space;\\&space;0&space;&if&space;\quad&space;r&space;\geq&space;1&space;\end{cases}" title="l(r) = \begin{cases} (1-r)^2 & if \quad r \leq 1 \\ 0 &if \quad r \geq 1 \end{cases}" /> + +This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss +with (constrained) sparse β vector estimation. + +#### [C2] Constrained sparse classification with Huberized Square Hinge loss: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d}&space;\sum_{i=1}^n&space;l_{\rho}(y_i&space;x_i^\top\beta)&space;+&space;\lambda&space;\left\lVert&space;\beta\right\rVert_1&space;\qquad&space;s.t.&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n l_{\rho}(y_i x_i\beta) + \lambda \left\lVert \beta\right\rVert_1 \qquad s.t. \qquad C\beta = 0" /> + +where the x<sub>i</sub> are the rows of X and l<sub>ρ</sub> is defined as: + +<img src="https://latex.codecogs.com/gif.latex?l_{\rho}(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&if&space;\quad&space;\rho&space;\leq&space;r&space;\leq&space;1&space;\\&space;(1-\rho)(1+\rho-2r)&space;&&space;if&space;\quad&space;r&space;\leq&space;\rho&space;\\&space;0&space;&if&space;\quad&space;r&space;\geq&space;1&space;\end{cases}" title="l_{\rho}(r) = \begin{cases} (1-r)^2 &if \quad \rho \leq r \leq 1 \\ (1-\rho)(1+\rho-2r) & if \quad r \leq \rho \\ 0 &if \quad r \geq 1 \end{cases}" /> + + +This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification +with (constrained) sparse β vector estimation. + + +## Getting started + +#### Basic example + +We begin with a basic example that shows how to run c-lasso on synthetic data. 
This example and the next one can be found on the notebook 'Synthetic data Notebook.ipynb' + +The c-lasso package includes +the routine ```random_data``` that allows you to generate problem instances using normally distributed data. + +```python +m, d, d_nonzero, k, sigma = 100, 200, 5, 1, 0.5 +(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=1) +``` +This code snippet generates a problem instance with sparse β in dimension +d=100 (sparsity d_nonzero=5). The design matrix X comprises n=100 samples generated from an i.i.d standard normal +distribution. The dimension of the constraint matrix C is d x k matrix. The noise level is σ=0.5. +The input ```zerosum=True``` implies that C is the all-ones vector and Cβ=0. The n-dimensional outcome vector y +and the regression vector β is then generated to satisfy the given constraints. + +Next we can define a default c-lasso problem instance with the generated data: +```python +problem = classo_problem(X, y, C) +``` +You can look at the generated problem instance by typing: + +```python +print(problem) +``` + +This gives you a summary of the form: + +``` +FORMULATION: R3 + +MODEL SELECTION COMPUTED: + Stability selection + +STABILITY SELECTION PARAMETERS: + numerical_method : not specified + method : first + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lamin = 0.01 + Nlam = 50 +``` +As we have not specified any problem, algorithm, or model selection settings, this problem instance +represents the *default* settings for a c-lasso instance: +- The problem is of regression type and uses formulation [R3], i.e. with concomitant scale estimation. +- The *default* optimization scheme is the path algorithm (see [Optimization schemes](#optimization-schemes) for further info). +- For model selection, stability selection at a theoretically derived λ value is used (see [Reference](#references) [4] for details). Stability selection comprises a relatively large number of parameters. For a description of the settings, we refer to the more advanced examples below and the API. + +You can solve the corresponding c-lasso problem instance using + +```python +problem.solve() +``` + +After completion, the results of the optimization and model selection routines +can be visualized using + +```python +print(problem.solution) +``` + +The command shows the running time(s) for the c-lasso problem instance, and the selected variables for sability selection + +``` +STABILITY SELECTION : + Selected variables : 7 63 148 164 168 + Running time : 1.546s + +``` + +Here, we only used stability selection as *default* model selection strategy. +The command also allows you to inspect the computed stability profile for all variables +at the theoretical λ + + + + +The refitted β values on the selected support are also displayed in the next plot + + + + +#### Advanced example + +In the next example, we show how one can specify different aspects of the problem +formulation and model selection strategy. 
+ +```python +m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5 +(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum = True, seed = 4) +problem = classo_problem(X, y, C) +problem.formulation.huber = True +problem.formulation.concomitant = False +problem.model_selection.CV = True +problem.model_selection.LAMfixed = True +problem.model_selection.PATH = True +problem.model_selection.StabSelparameters.method = 'max' +problem.model_selection.CVparameters.seed = 1 +problem.model_selection.LAMfixedparameters.rescaled_lam = True +problem.model_selection.LAMfixedparameters.lam = .1 + +problem.solve() +print(problem) + +print(problem.solution) + +``` + +Results : +``` + FORMULATION: R2 + + MODEL SELECTION COMPUTED: + Lambda fixed + Path + Cross Validation + Stability selection + + LAMBDA FIXED PARAMETERS: + numerical_method = Path-Alg + rescaled lam : True + threshold = 0.09 + lam = 0.1 + theoretical_lam = 0.224 + + PATH PARAMETERS: + numerical_method : Path-Alg + lamin = 0.001 + Nlam = 80 + + + CROSS VALIDATION PARAMETERS: + numerical_method : Path-Alg + one-SE method : True + Nsubset = 5 + lamin = 0.001 + Nlam = 80 + + + STABILITY SELECTION PARAMETERS: + numerical_method : Path-Alg + method : max + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lamin = 0.01 + Nlam = 50 + + LAMBDA FIXED : + Selected variables : 17 59 123 + Running time : 0.104s + + PATH COMPUTATION : + Running time : 0.638s + + CROSS VALIDATION : + Selected variables : 16 17 57 59 64 73 74 76 93 115 123 134 137 181 + Running time : 2.1s + + STABILITY SELECTION : + Selected variables : 17 59 76 123 137 + Running time : 6.062s + +``` + + + + + + + + + + + + + + + +## Log-contrast regression for microbiome data + +In the [the accompanying notebook](./examples/example-notebook.ipynb) we study several microbiome data sets. We showcase two examples below. + +#### BMI prediction using the COMBO dataset + +We first consider the [COMBO data set](./examples/COMBO_data) and show how to predict Body Mass Index (BMI) from microbial genus abundances and two non-compositional covariates using "filtered_data". + +```python +from classo import csv_to_np, classo_problem, clr + +# Load microbiome and covariate data X +X0 = csv_to_np('COMBO_data/complete_data/GeneraCounts.csv', begin = 0).astype(float) +X_C = csv_to_np('COMBO_data/CaloriData.csv', begin = 0).astype(float) +X_F = csv_to_np('COMBO_data/FatData.csv', begin = 0).astype(float) + +# Load BMI measurements y +y = csv_to_np('COMBO_data/BMI.csv', begin = 0).astype(float)[:, 0] +labels = csv_to_np('COMBO_data/complete_data/GeneraPhylo.csv').astype(str)[:, -1] + + +# Normalize/transform data +y = y - np.mean(y) #BMI data (n = 96) +X_C = X_C - np.mean(X_C, axis = 0) #Covariate data (Calorie) +X_F = X_F - np.mean(X_F, axis = 0) #Covariate data (Fat) +X0 = clr(X0, 1 / 2).T + +# Set up design matrix and zero-sum constraints for 45 genera +X = np.concatenate((X0, X_C, X_F, np.ones((len(X0), 1))), axis = 1) # Joint microbiome and covariate data and offset +label = np.concatenate([labels, np.array(['Calorie', 'Fat', 'Bias'])]) +C = np.ones((1, len(X[0]))) +C[0, -1], C[0, -2], C[0, -3] = 0., 0., 0. 
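+
+# Note (added for clarity): the zero-sum constraint applies only to the 45 genus
+# columns; the last three entries of C are zeroed above so that the Calorie, Fat
+# and Bias columns remain unconstrained.
+# Optional sanity check (illustrative, not in the original notebook):
+# assert C.shape == (1, X.shape[1]) and np.all(C[0, -3:] == 0)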
+ + +# Set up c-lassso problem +problem = classo_problem(X, y, C, label = label) + + +# Use stability selection with theoretical lambda [Combettes & Müller, 2020b] +problem.model_selection.StabSelparameters.method = 'lam' +problem.model_selection.StabSelparameters.threshold_label = 0.5 + +# Use formulation R3 +problem.formulation.concomitant = True + +problem.solve() +print(problem) +print(problem.solution) + +# Use formulation R4 +problem.formulation.huber = True +problem.formulation.concomitant = True + +problem.solve() +print(problem) +print(problem.solution) + +``` + + + + + + + + + + +<!--- +<img src="https://i.imgur.com/8tFmM8T.png" alt="Central Park Soil Microbiome" height="250" align="right"/> +#### pH prediction using the Central Park soil dataset +The next microbiome example considers the [Central Park Soil dataset](./examples/pH_data) from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988). The sample locations are shown in the Figure on the right.) +--> + +#### pH prediction using the 88 soils dataset + +The next microbiome example considers the [88 soils dataset](./examples/pH_data) from [Lauber et al., 2009](https://pubmed.ncbi.nlm.nih.gov/19502440/). + +The task is to predict pH concentration in the soil from microbial abundance data. A similar analysis is available +in [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1) +with Central Park soil data from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988). + +Code to run this application is available in [the accompanying notebook](./examples/example-notebook.ipynb) under `pH data`. Below is a summary of a c-lasso problem instance (using the R3 formulation). + +``` +FORMULATION: R3 + +MODEL SELECTION COMPUTED: + Lambda fixed + Path + Stability selection + +LAMBDA FIXED PARAMETERS: + numerical_method = Path-Alg + rescaled lam : True + threshold = 0.004 + lam : theoretical + theoretical_lam = 0.2182 + +PATH PARAMETERS: + numerical_method : Path-Alg + lamin = 0.001 + Nlam = 80 + + +STABILITY SELECTION PARAMETERS: + numerical_method : Path-Alg + method : lam + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lam = theoretical + theoretical_lam = 0.3085 +``` + +The c-lasso estimation results are summarized below: + +``` +LAMBDA FIXED : + Sigma = 0.198 + Selected variables : 14 18 19 39 43 57 62 85 93 94 104 107 + Running time : 0.008s + + PATH COMPUTATION : + Running time : 0.12s + + STABILITY SELECTION : + Selected variables : 2 12 15 + Running time : 0.287s +``` + + + + + + + + + + + + +## Optimization schemes + +The available problem formulations [R1-C2] require different algorithmic strategies for +efficiently solving the underlying optimization problem. We have implemented four +algorithms (with provable convergence guarantees) that vary in generality and are not +necessarily applicable to all problems. For each problem type, c-lasso has a default algorithm +setting that proved to be the fastest in our numerical experiments. + +### Path algorithms (Path-Alg) +This is the default algorithm for non-concomitant problems [R1,R3,C1,C2]. +The algorithm uses the fact that the solution path along λ is piecewise- +affine (as shown, e.g., in [1]). When Least-Squares is used as objective function, +we derive a novel efficient procedure that allows us to also derive the +solution for the concomitant problem [R2] along the path with little extra computational overhead. 
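+
+If you want to override the solver used for a particular model-selection task, the printed parameter summaries above expose a `numerical_method` field per task. The sketch below sets it explicitly; the attribute paths follow the `StabSelparameters`/`CVparameters` pattern used in the Advanced example, and the identifier `'DR'` for the Douglas-Rachford method described below is an assumption ('Path-Alg' is the only value shown in the summaries):
+
+```python
+problem = classo_problem(X, y, C)
+
+problem.model_selection.StabSelparameters.numerical_method = 'Path-Alg'
+problem.model_selection.CV = True
+problem.model_selection.CVparameters.numerical_method = 'DR'   # assumed identifier
+
+problem.solve()
+```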
+
+### Projected primal-dual splitting method (P-PDS):
+This algorithm is derived from [2] and belongs to the class of
+proximal splitting algorithms. It extends the classical Forward-Backward (FB)
+(aka proximal gradient descent) algorithm to handle an additional linear equality constraint
+via projection. In the absence of a linear constraint, the method reduces to FB.
+This method can solve problem [R1]. For the Huber problem [R3],
+P-PDS can solve the mean-shift formulation of the problem (see [6]).
+
+### Projection-free primal-dual splitting method (PF-PDS):
+This algorithm is a special case of an algorithm proposed in [3] (Eq. 4.5) and also belongs to the class of
+proximal splitting algorithms. The algorithm does not require projection operators,
+which may be beneficial when C has a more complex structure. In the absence of a linear constraint,
+the method reduces to the Forward-Backward-Forward scheme. This method can solve problem [R1].
+For the Huber problem [R3], PF-PDS can solve the mean-shift formulation of the problem (see [6]).
+
+### Douglas-Rachford-type splitting method (DR)
+This algorithm is the most general and can solve all regression problems
+[R1-R4]. It is based on Douglas-Rachford splitting in a higher-dimensional product space.
+It makes use of the proximity operators of the perspective of the LS objective (see [4,5]).
+The Huber problem with concomitant scale [R4] is reformulated as a scaled Lasso problem
+with the mean shift (see [6]) and thus solved in (n + d) dimensions.
+
+
+
+## References
+
+* [1] B. R. Gaines, J. Kim, and H. Zhou, [Algorithms for Fitting the Constrained Lasso](https://www.tandfonline.com/doi/abs/10.1080/10618600.2018.1473777?journalCode=ucgs20), J. Comput. Graph. Stat., vol. 27, no. 4, pp. 861–871, 2018.
+
+* [2] L. Briceno-Arias and S. L. Rivera, [A Projected Primal-Dual Method for Solving Constrained Monotone Inclusions](https://link.springer.com/article/10.1007/s10957-018-1430-2?shared-article-renderer), J. Optim. Theory Appl., vol. 180, no. 3, March 2019.
+
+* [3] P. L. Combettes and J.-C. Pesquet, [Primal-Dual Splitting Algorithm for Solving Inclusions with Mixtures of Composite, Lipschitzian, and Parallel-Sum Type Monotone Operators](https://arxiv.org/pdf/1107.0081.pdf), Set-Valued and Variational Analysis, vol. 20, pp. 307-330, 2012.
+
+* [4] P. L. Combettes and C. L. Müller, [Perspective M-estimation via proximal decomposition](https://arxiv.org/abs/1805.06098), Electronic Journal of Statistics, 2020, [Journal version](https://projecteuclid.org/euclid.ejs/1578452535).
+
+* [5] P. L. Combettes and C. L. Müller, [Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications](https://arxiv.org/abs/1903.01050), Statistics in Biosciences, 2020.
+
+* [6] A. Mishra and C. L. Müller, [Robust regression with compositional covariates](https://arxiv.org/abs/1909.04990), arXiv, 2019.
+
+* [7] S. Rosset and J. Zhu, [Piecewise linear regularized solution paths](https://projecteuclid.org/euclid.aos/1185303996), Ann. Stat., vol. 35, no. 3, pp. 1012–1030, 2007.
+
+* [8] J. Bien, X. Yan, L. Simpson, and C. L. Müller, [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1), bioRxiv, 2020.
+ + + + + + +%package -n python3-c-lasso +Summary: Algorithms for constrained Lasso problems +Provides: python-c-lasso +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-c-lasso +[](https://arxiv.org/abs/2011.00898) +[](https://doi.org/10.21105/joss.02844) + +<img src="https://i.imgur.com/2nGwlux.png" alt="c-lasso" height="145" align="right"/> + +# c-lasso: a Python package for constrained sparse regression and classification + + +c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality +constraints on the model parameters. For detailed info, one can check the [documentation](https://c-lasso.readthedocs.io/en/latest/). + +The forward model is assumed to be: + +<img src="https://latex.codecogs.com/gif.latex?y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad&space;C\beta=0" title="y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad C\beta=0" /> + +Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown coefficients and σ an +unknown scale. + +The package handles several different estimators for inferring β (and σ), including +the constrained Lasso, the constrained scaled Lasso, sparse Huber M-estimation with linear equality constraints, and regularized Support Vector Machines. +Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve +the underlying convex optimization problems. + +We also include two model selection strategies for determining the sparsity of the model parameters: k-fold cross-validation and stability selection. + +This package is intended to fill the gap between popular python tools such as [scikit-learn](https://scikit-learn.org/stable/) which CANNOT solve sparse constrained problems and general-purpose optimization solvers that do not scale well or are inaccurate (see [benchmarks](./benchmark/README.md)) for the considered problems. In its current stage, however, c-lasso is not yet compatible with the scikit-learn API but rather a stand-alone tool. + +Below we show several use cases of the package, including an application of sparse *log-contrast* +regression tasks for *compositional* microbiome data. + +The code builds on results from several papers which can be found in the [References](#references). We also refer to the accompanying [JOSS paper submission](https://github.com/Leo-Simpson/c-lasso/blob/master/paper/paper.md), also available on [arXiv](https://arxiv.org/pdf/2011.00898.pdf). + +## Table of Contents + +* [Installation](#installation) +* [Regression and classification problems](#regression-and-classification-problems) +* [Getting started](#getting-started) +* [Log-contrast regression for microbiome data](#log-contrast-regression-for-microbiome-data) +* [Optimization schemes](#optimization-schemes) + + +* [References](#references) + + +## Installation + +c-lasso is available on pip. 
You can install the package +in the shell using + +```shell +pip install c-lasso +``` +To use the c-lasso package in Python, type + +```python + +from classo import classo_problem +# one can add auxiliary functions as well such as random_data or csv_to_np +``` + +The `c-lasso` package depends on the following Python packages: + +- `numpy`; +- `matplotlib`; +- `scipy`; +- `pandas`; +- `pytest` (for tests) + +## Regression and classification problems + +The c-lasso package can solve six different types of estimation problems: +four regression-type and two classification-type formulations. + +#### [R1] Standard constrained Lasso regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;||&space;X\beta-y&space;||^2&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" /> + +This is the standard Lasso problem with linear equality constraints on the β vector. +The objective function combines Least-Squares for model fitting with l1 penalty for sparsity. + +#### [R2] Constrained sparse Huber regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;h_{\rho}(X\beta-y&space;)&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" /> + +This regression problem uses the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) as objective function +for robust model fitting with l1 and linear equality constraints on the β vector. The parameter ρ=1.345. + +#### [R3] Constrained scaled Lasso regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\frac{||&space;X\beta&space;-&space;y||^2}{\sigma}&space;+&space;\frac{n}{2}&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \frac{|| X\beta - y||^2}{\sigma} + \frac{n}{2} \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" /> + +This formulation is similar to [R1] but allows for joint estimation of the (constrained) β vector and +the standard deviation σ in a concomitant fashion (see [References](#references) [4,5] for further info). +This is the default problem formulation in c-lasso. + +#### [R4] Constrained sparse Huber regression with concomitant scale estimation: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\left(&space;h_{\rho}&space;\left(&space;\frac{&space;X\beta&space;-&space;y}{\sigma}&space;\right)+&space;n&space;\right)&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \left( h_{\rho} \left( \frac{ X\beta - y}{\sigma} \right)+ n \right) \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" /> + +This formulation combines [R2] and [R3] to allow robust joint estimation of the (constrained) β vector and +the scale σ in a concomitant fashion (see [References](#references) [4,5] for further info). 
+ +#### [C1] Constrained sparse classification with Square Hinge loss: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d}&space;\sum_{i=1}^n&space;l(y_i&space;x_i^\top&space;\beta)&space;+&space;\lambda&space;\left\lVert&space;\beta\right\rVert_1&space;\qquad&space;s.t.&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n l(y_i x_i \beta) + \lambda \left\lVert \beta\right\rVert_1 \qquad s.t. \qquad C\beta = 0" /> + +where the x<sub>i</sub> are the rows of X and l is defined as: + +<img src="https://latex.codecogs.com/gif.latex?l(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&&space;if&space;\quad&space;r&space;\leq&space;1&space;\\&space;0&space;&if&space;\quad&space;r&space;\geq&space;1&space;\end{cases}" title="l(r) = \begin{cases} (1-r)^2 & if \quad r \leq 1 \\ 0 &if \quad r \geq 1 \end{cases}" /> + +This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss +with (constrained) sparse β vector estimation. + +#### [C2] Constrained sparse classification with Huberized Square Hinge loss: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d}&space;\sum_{i=1}^n&space;l_{\rho}(y_i&space;x_i^\top\beta)&space;+&space;\lambda&space;\left\lVert&space;\beta\right\rVert_1&space;\qquad&space;s.t.&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n l_{\rho}(y_i x_i\beta) + \lambda \left\lVert \beta\right\rVert_1 \qquad s.t. \qquad C\beta = 0" /> + +where the x<sub>i</sub> are the rows of X and l<sub>ρ</sub> is defined as: + +<img src="https://latex.codecogs.com/gif.latex?l_{\rho}(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&if&space;\quad&space;\rho&space;\leq&space;r&space;\leq&space;1&space;\\&space;(1-\rho)(1+\rho-2r)&space;&&space;if&space;\quad&space;r&space;\leq&space;\rho&space;\\&space;0&space;&if&space;\quad&space;r&space;\geq&space;1&space;\end{cases}" title="l_{\rho}(r) = \begin{cases} (1-r)^2 &if \quad \rho \leq r \leq 1 \\ (1-\rho)(1+\rho-2r) & if \quad r \leq \rho \\ 0 &if \quad r \geq 1 \end{cases}" /> + + +This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification +with (constrained) sparse β vector estimation. + + +## Getting started + +#### Basic example + +We begin with a basic example that shows how to run c-lasso on synthetic data. This example and the next one can be found on the notebook 'Synthetic data Notebook.ipynb' + +The c-lasso package includes +the routine ```random_data``` that allows you to generate problem instances using normally distributed data. + +```python +m, d, d_nonzero, k, sigma = 100, 200, 5, 1, 0.5 +(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=1) +``` +This code snippet generates a problem instance with sparse β in dimension +d=100 (sparsity d_nonzero=5). The design matrix X comprises n=100 samples generated from an i.i.d standard normal +distribution. The dimension of the constraint matrix C is d x k matrix. The noise level is σ=0.5. +The input ```zerosum=True``` implies that C is the all-ones vector and Cβ=0. The n-dimensional outcome vector y +and the regression vector β is then generated to satisfy the given constraints. 
+ +Next we can define a default c-lasso problem instance with the generated data: +```python +problem = classo_problem(X, y, C) +``` +You can look at the generated problem instance by typing: + +```python +print(problem) +``` + +This gives you a summary of the form: + +``` +FORMULATION: R3 + +MODEL SELECTION COMPUTED: + Stability selection + +STABILITY SELECTION PARAMETERS: + numerical_method : not specified + method : first + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lamin = 0.01 + Nlam = 50 +``` +As we have not specified any problem, algorithm, or model selection settings, this problem instance +represents the *default* settings for a c-lasso instance: +- The problem is of regression type and uses formulation [R3], i.e. with concomitant scale estimation. +- The *default* optimization scheme is the path algorithm (see [Optimization schemes](#optimization-schemes) for further info). +- For model selection, stability selection at a theoretically derived λ value is used (see [Reference](#references) [4] for details). Stability selection comprises a relatively large number of parameters. For a description of the settings, we refer to the more advanced examples below and the API. + +You can solve the corresponding c-lasso problem instance using + +```python +problem.solve() +``` + +After completion, the results of the optimization and model selection routines +can be visualized using + +```python +print(problem.solution) +``` + +The command shows the running time(s) for the c-lasso problem instance, and the selected variables for sability selection + +``` +STABILITY SELECTION : + Selected variables : 7 63 148 164 168 + Running time : 1.546s + +``` + +Here, we only used stability selection as *default* model selection strategy. +The command also allows you to inspect the computed stability profile for all variables +at the theoretical λ + + + + +The refitted β values on the selected support are also displayed in the next plot + + + + +#### Advanced example + +In the next example, we show how one can specify different aspects of the problem +formulation and model selection strategy. 
+ +```python +m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5 +(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum = True, seed = 4) +problem = classo_problem(X, y, C) +problem.formulation.huber = True +problem.formulation.concomitant = False +problem.model_selection.CV = True +problem.model_selection.LAMfixed = True +problem.model_selection.PATH = True +problem.model_selection.StabSelparameters.method = 'max' +problem.model_selection.CVparameters.seed = 1 +problem.model_selection.LAMfixedparameters.rescaled_lam = True +problem.model_selection.LAMfixedparameters.lam = .1 + +problem.solve() +print(problem) + +print(problem.solution) + +``` + +Results : +``` + FORMULATION: R2 + + MODEL SELECTION COMPUTED: + Lambda fixed + Path + Cross Validation + Stability selection + + LAMBDA FIXED PARAMETERS: + numerical_method = Path-Alg + rescaled lam : True + threshold = 0.09 + lam = 0.1 + theoretical_lam = 0.224 + + PATH PARAMETERS: + numerical_method : Path-Alg + lamin = 0.001 + Nlam = 80 + + + CROSS VALIDATION PARAMETERS: + numerical_method : Path-Alg + one-SE method : True + Nsubset = 5 + lamin = 0.001 + Nlam = 80 + + + STABILITY SELECTION PARAMETERS: + numerical_method : Path-Alg + method : max + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lamin = 0.01 + Nlam = 50 + + LAMBDA FIXED : + Selected variables : 17 59 123 + Running time : 0.104s + + PATH COMPUTATION : + Running time : 0.638s + + CROSS VALIDATION : + Selected variables : 16 17 57 59 64 73 74 76 93 115 123 134 137 181 + Running time : 2.1s + + STABILITY SELECTION : + Selected variables : 17 59 76 123 137 + Running time : 6.062s + +``` + + + + + + + + + + + + + + + +## Log-contrast regression for microbiome data + +In the [the accompanying notebook](./examples/example-notebook.ipynb) we study several microbiome data sets. We showcase two examples below. + +#### BMI prediction using the COMBO dataset + +We first consider the [COMBO data set](./examples/COMBO_data) and show how to predict Body Mass Index (BMI) from microbial genus abundances and two non-compositional covariates using "filtered_data". + +```python +from classo import csv_to_np, classo_problem, clr + +# Load microbiome and covariate data X +X0 = csv_to_np('COMBO_data/complete_data/GeneraCounts.csv', begin = 0).astype(float) +X_C = csv_to_np('COMBO_data/CaloriData.csv', begin = 0).astype(float) +X_F = csv_to_np('COMBO_data/FatData.csv', begin = 0).astype(float) + +# Load BMI measurements y +y = csv_to_np('COMBO_data/BMI.csv', begin = 0).astype(float)[:, 0] +labels = csv_to_np('COMBO_data/complete_data/GeneraPhylo.csv').astype(str)[:, -1] + + +# Normalize/transform data +y = y - np.mean(y) #BMI data (n = 96) +X_C = X_C - np.mean(X_C, axis = 0) #Covariate data (Calorie) +X_F = X_F - np.mean(X_F, axis = 0) #Covariate data (Fat) +X0 = clr(X0, 1 / 2).T + +# Set up design matrix and zero-sum constraints for 45 genera +X = np.concatenate((X0, X_C, X_F, np.ones((len(X0), 1))), axis = 1) # Joint microbiome and covariate data and offset +label = np.concatenate([labels, np.array(['Calorie', 'Fat', 'Bias'])]) +C = np.ones((1, len(X[0]))) +C[0, -1], C[0, -2], C[0, -3] = 0., 0., 0. 
+ + +# Set up c-lassso problem +problem = classo_problem(X, y, C, label = label) + + +# Use stability selection with theoretical lambda [Combettes & Müller, 2020b] +problem.model_selection.StabSelparameters.method = 'lam' +problem.model_selection.StabSelparameters.threshold_label = 0.5 + +# Use formulation R3 +problem.formulation.concomitant = True + +problem.solve() +print(problem) +print(problem.solution) + +# Use formulation R4 +problem.formulation.huber = True +problem.formulation.concomitant = True + +problem.solve() +print(problem) +print(problem.solution) + +``` + + + + + + + + + + +<!--- +<img src="https://i.imgur.com/8tFmM8T.png" alt="Central Park Soil Microbiome" height="250" align="right"/> +#### pH prediction using the Central Park soil dataset +The next microbiome example considers the [Central Park Soil dataset](./examples/pH_data) from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988). The sample locations are shown in the Figure on the right.) +--> + +#### pH prediction using the 88 soils dataset + +The next microbiome example considers the [88 soils dataset](./examples/pH_data) from [Lauber et al., 2009](https://pubmed.ncbi.nlm.nih.gov/19502440/). + +The task is to predict pH concentration in the soil from microbial abundance data. A similar analysis is available +in [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1) +with Central Park soil data from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988). + +Code to run this application is available in [the accompanying notebook](./examples/example-notebook.ipynb) under `pH data`. Below is a summary of a c-lasso problem instance (using the R3 formulation). + +``` +FORMULATION: R3 + +MODEL SELECTION COMPUTED: + Lambda fixed + Path + Stability selection + +LAMBDA FIXED PARAMETERS: + numerical_method = Path-Alg + rescaled lam : True + threshold = 0.004 + lam : theoretical + theoretical_lam = 0.2182 + +PATH PARAMETERS: + numerical_method : Path-Alg + lamin = 0.001 + Nlam = 80 + + +STABILITY SELECTION PARAMETERS: + numerical_method : Path-Alg + method : lam + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lam = theoretical + theoretical_lam = 0.3085 +``` + +The c-lasso estimation results are summarized below: + +``` +LAMBDA FIXED : + Sigma = 0.198 + Selected variables : 14 18 19 39 43 57 62 85 93 94 104 107 + Running time : 0.008s + + PATH COMPUTATION : + Running time : 0.12s + + STABILITY SELECTION : + Selected variables : 2 12 15 + Running time : 0.287s +``` + + + + + + + + + + + + +## Optimization schemes + +The available problem formulations [R1-C2] require different algorithmic strategies for +efficiently solving the underlying optimization problem. We have implemented four +algorithms (with provable convergence guarantees) that vary in generality and are not +necessarily applicable to all problems. For each problem type, c-lasso has a default algorithm +setting that proved to be the fastest in our numerical experiments. + +### Path algorithms (Path-Alg) +This is the default algorithm for non-concomitant problems [R1,R3,C1,C2]. +The algorithm uses the fact that the solution path along λ is piecewise- +affine (as shown, e.g., in [1]). When Least-Squares is used as objective function, +we derive a novel efficient procedure that allows us to also derive the +solution for the concomitant problem [R2] along the path with little extra computational overhead. 
+ +### Projected primal-dual splitting method (P-PDS): +This algorithm is derived from [2] and belongs to the class of +proximal splitting algorithms. It extends the classical Forward-Backward (FB) +(aka proximal gradient descent) algorithm to handle an additional linear equality constraint +via projection. In the absence of a linear constraint, the method reduces to FB. +This method can solve problem [R1]. For the Huber problem [R3], +P-PDS can solve the mean-shift formulation of the problem (see [6]). + +### Projection-free primal-dual splitting method (PF-PDS): +This algorithm is a special case of an algorithm proposed in [3] (Eq.4.5) and also belongs to the class of +proximal splitting algorithms. The algorithm does not require projection operators +which may be beneficial when C has a more complex structure. In the absence of a linear constraint, +the method reduces to the Forward-Backward-Forward scheme. This method can solve problem [R1]. +For the Huber problem [R3], PF-PDS can solve the mean-shift formulation of the problem (see [6]). + +### Douglas-Rachford-type splitting method (DR) +This algorithm is the most general algorithm and can solve all regression problems +[R1-R4]. It is based on Doulgas Rachford splitting in a higher-dimensional product space. +It makes use of the proximity operators of the perspective of the LS objective (see [4,5]) +The Huber problem with concomitant scale [R4] is reformulated as scaled Lasso problem +with the mean shift (see [6]) and thus solved in (n + d) dimensions. + + + +## References + +* [1] B. R. Gaines, J. Kim, and H. Zhou, [Algorithms for Fitting the Constrained Lasso](https://www.tandfonline.com/doi/abs/10.1080/10618600.2018.1473777?journalCode=ucgs20), J. Comput. Graph. Stat., vol. 27, no. 4, pp. 861–871, 2018. + +* [2] L. Briceno-Arias and S.L. Rivera, [A Projected Primal–Dual Method for Solving Constrained Monotone Inclusions](https://link.springer.com/article/10.1007/s10957-018-1430-2?shared-article-renderer), J. Optim. Theory Appl., vol. 180, Issue 3, March 2019. + +* [3] P. L. Combettes and J.C. Pesquet, [Primal-Dual Splitting Algorithm for Solving Inclusions with Mixtures of Composite, Lipschitzian, and Parallel-Sum Type Monotone Operators](https://arxiv.org/pdf/1107.0081.pdf), Set-Valued and Variational Analysis, vol. 20, pp. 307-330, 2012. + +* [4] P. L. Combettes and C. L. Müller, [Perspective M-estimation via proximal decomposition](https://arxiv.org/abs/1805.06098), Electronic Journal of Statistics, 2020, [Journal version](https://projecteuclid.org/euclid.ejs/1578452535) + +* [5] P. L. Combettes and C. L. Müller, [Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications](https://arxiv.org/abs/1903.01050), Statistics in Bioscience, 2020. + +* [6] A. Mishra and C. L. Müller, [Robust regression with compositional covariates](https://arxiv.org/abs/1909.04990), arXiv, 2019. + +* [7] S. Rosset and J. Zhu, [Piecewise linear regularized solution paths](https://projecteuclid.org/euclid.aos/1185303996), Ann. Stat., vol. 35, no. 3, pp. 1012–1030, 2007. + +* [8] J. Bien, X. Yan, L. Simpson, and C. L. Müller, [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1), biorxiv, 2020. 
+ + + + + + +%package help +Summary: Development documents and examples for c-lasso +Provides: python3-c-lasso-doc +%description help +[](https://arxiv.org/abs/2011.00898) +[](https://doi.org/10.21105/joss.02844) + +<img src="https://i.imgur.com/2nGwlux.png" alt="c-lasso" height="145" align="right"/> + +# c-lasso: a Python package for constrained sparse regression and classification + + +c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality +constraints on the model parameters. For detailed info, one can check the [documentation](https://c-lasso.readthedocs.io/en/latest/). + +The forward model is assumed to be: + +<img src="https://latex.codecogs.com/gif.latex?y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad&space;C\beta=0" title="y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad C\beta=0" /> + +Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown coefficients and σ an +unknown scale. + +The package handles several different estimators for inferring β (and σ), including +the constrained Lasso, the constrained scaled Lasso, sparse Huber M-estimation with linear equality constraints, and regularized Support Vector Machines. +Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve +the underlying convex optimization problems. + +We also include two model selection strategies for determining the sparsity of the model parameters: k-fold cross-validation and stability selection. + +This package is intended to fill the gap between popular python tools such as [scikit-learn](https://scikit-learn.org/stable/) which CANNOT solve sparse constrained problems and general-purpose optimization solvers that do not scale well or are inaccurate (see [benchmarks](./benchmark/README.md)) for the considered problems. In its current stage, however, c-lasso is not yet compatible with the scikit-learn API but rather a stand-alone tool. + +Below we show several use cases of the package, including an application of sparse *log-contrast* +regression tasks for *compositional* microbiome data. + +The code builds on results from several papers which can be found in the [References](#references). We also refer to the accompanying [JOSS paper submission](https://github.com/Leo-Simpson/c-lasso/blob/master/paper/paper.md), also available on [arXiv](https://arxiv.org/pdf/2011.00898.pdf). + +## Table of Contents + +* [Installation](#installation) +* [Regression and classification problems](#regression-and-classification-problems) +* [Getting started](#getting-started) +* [Log-contrast regression for microbiome data](#log-contrast-regression-for-microbiome-data) +* [Optimization schemes](#optimization-schemes) + + +* [References](#references) + + +## Installation + +c-lasso is available on pip. 
You can install the package +in the shell using + +```shell +pip install c-lasso +``` +To use the c-lasso package in Python, type + +```python + +from classo import classo_problem +# one can add auxiliary functions as well such as random_data or csv_to_np +``` + +The `c-lasso` package depends on the following Python packages: + +- `numpy`; +- `matplotlib`; +- `scipy`; +- `pandas`; +- `pytest` (for tests) + +## Regression and classification problems + +The c-lasso package can solve six different types of estimation problems: +four regression-type and two classification-type formulations. + +#### [R1] Standard constrained Lasso regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;||&space;X\beta-y&space;||^2&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" /> + +This is the standard Lasso problem with linear equality constraints on the β vector. +The objective function combines Least-Squares for model fitting with l1 penalty for sparsity. + +#### [R2] Constrained sparse Huber regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;h_{\rho}(X\beta-y&space;)&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" /> + +This regression problem uses the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) as objective function +for robust model fitting with l1 and linear equality constraints on the β vector. The parameter ρ=1.345. + +#### [R3] Constrained scaled Lasso regression: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\frac{||&space;X\beta&space;-&space;y||^2}{\sigma}&space;+&space;\frac{n}{2}&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \frac{|| X\beta - y||^2}{\sigma} + \frac{n}{2} \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" /> + +This formulation is similar to [R1] but allows for joint estimation of the (constrained) β vector and +the standard deviation σ in a concomitant fashion (see [References](#references) [4,5] for further info). +This is the default problem formulation in c-lasso. + +#### [R4] Constrained sparse Huber regression with concomitant scale estimation: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\left(&space;h_{\rho}&space;\left(&space;\frac{&space;X\beta&space;-&space;y}{\sigma}&space;\right)+&space;n&space;\right)&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \left( h_{\rho} \left( \frac{ X\beta - y}{\sigma} \right)+ n \right) \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" /> + +This formulation combines [R2] and [R3] to allow robust joint estimation of the (constrained) β vector and +the scale σ in a concomitant fashion (see [References](#references) [4,5] for further info). 
+ +#### [C1] Constrained sparse classification with Square Hinge loss: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d}&space;\sum_{i=1}^n&space;l(y_i&space;x_i^\top&space;\beta)&space;+&space;\lambda&space;\left\lVert&space;\beta\right\rVert_1&space;\qquad&space;s.t.&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n l(y_i x_i \beta) + \lambda \left\lVert \beta\right\rVert_1 \qquad s.t. \qquad C\beta = 0" /> + +where the x<sub>i</sub> are the rows of X and l is defined as: + +<img src="https://latex.codecogs.com/gif.latex?l(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&&space;if&space;\quad&space;r&space;\leq&space;1&space;\\&space;0&space;&if&space;\quad&space;r&space;\geq&space;1&space;\end{cases}" title="l(r) = \begin{cases} (1-r)^2 & if \quad r \leq 1 \\ 0 &if \quad r \geq 1 \end{cases}" /> + +This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss +with (constrained) sparse β vector estimation. + +#### [C2] Constrained sparse classification with Huberized Square Hinge loss: + +<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d}&space;\sum_{i=1}^n&space;l_{\rho}(y_i&space;x_i^\top\beta)&space;+&space;\lambda&space;\left\lVert&space;\beta\right\rVert_1&space;\qquad&space;s.t.&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n l_{\rho}(y_i x_i\beta) + \lambda \left\lVert \beta\right\rVert_1 \qquad s.t. \qquad C\beta = 0" /> + +where the x<sub>i</sub> are the rows of X and l<sub>ρ</sub> is defined as: + +<img src="https://latex.codecogs.com/gif.latex?l_{\rho}(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&if&space;\quad&space;\rho&space;\leq&space;r&space;\leq&space;1&space;\\&space;(1-\rho)(1+\rho-2r)&space;&&space;if&space;\quad&space;r&space;\leq&space;\rho&space;\\&space;0&space;&if&space;\quad&space;r&space;\geq&space;1&space;\end{cases}" title="l_{\rho}(r) = \begin{cases} (1-r)^2 &if \quad \rho \leq r \leq 1 \\ (1-\rho)(1+\rho-2r) & if \quad r \leq \rho \\ 0 &if \quad r \geq 1 \end{cases}" /> + + +This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification +with (constrained) sparse β vector estimation. + + +## Getting started + +#### Basic example + +We begin with a basic example that shows how to run c-lasso on synthetic data. This example and the next one can be found on the notebook 'Synthetic data Notebook.ipynb' + +The c-lasso package includes +the routine ```random_data``` that allows you to generate problem instances using normally distributed data. + +```python +m, d, d_nonzero, k, sigma = 100, 200, 5, 1, 0.5 +(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=1) +``` +This code snippet generates a problem instance with sparse β in dimension +d=100 (sparsity d_nonzero=5). The design matrix X comprises n=100 samples generated from an i.i.d standard normal +distribution. The dimension of the constraint matrix C is d x k matrix. The noise level is σ=0.5. +The input ```zerosum=True``` implies that C is the all-ones vector and Cβ=0. The n-dimensional outcome vector y +and the regression vector β is then generated to satisfy the given constraints. 
+ +Next we can define a default c-lasso problem instance with the generated data: +```python +problem = classo_problem(X, y, C) +``` +You can look at the generated problem instance by typing: + +```python +print(problem) +``` + +This gives you a summary of the form: + +``` +FORMULATION: R3 + +MODEL SELECTION COMPUTED: + Stability selection + +STABILITY SELECTION PARAMETERS: + numerical_method : not specified + method : first + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lamin = 0.01 + Nlam = 50 +``` +As we have not specified any problem, algorithm, or model selection settings, this problem instance +represents the *default* settings for a c-lasso instance: +- The problem is of regression type and uses formulation [R3], i.e. with concomitant scale estimation. +- The *default* optimization scheme is the path algorithm (see [Optimization schemes](#optimization-schemes) for further info). +- For model selection, stability selection at a theoretically derived λ value is used (see [Reference](#references) [4] for details). Stability selection comprises a relatively large number of parameters. For a description of the settings, we refer to the more advanced examples below and the API. + +You can solve the corresponding c-lasso problem instance using + +```python +problem.solve() +``` + +After completion, the results of the optimization and model selection routines +can be visualized using + +```python +print(problem.solution) +``` + +The command shows the running time(s) for the c-lasso problem instance, and the selected variables for sability selection + +``` +STABILITY SELECTION : + Selected variables : 7 63 148 164 168 + Running time : 1.546s + +``` + +Here, we only used stability selection as *default* model selection strategy. +The command also allows you to inspect the computed stability profile for all variables +at the theoretical λ + + + + +The refitted β values on the selected support are also displayed in the next plot + + + + +#### Advanced example + +In the next example, we show how one can specify different aspects of the problem +formulation and model selection strategy. 
+ +```python +m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5 +(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum = True, seed = 4) +problem = classo_problem(X, y, C) +problem.formulation.huber = True +problem.formulation.concomitant = False +problem.model_selection.CV = True +problem.model_selection.LAMfixed = True +problem.model_selection.PATH = True +problem.model_selection.StabSelparameters.method = 'max' +problem.model_selection.CVparameters.seed = 1 +problem.model_selection.LAMfixedparameters.rescaled_lam = True +problem.model_selection.LAMfixedparameters.lam = .1 + +problem.solve() +print(problem) + +print(problem.solution) + +``` + +Results : +``` + FORMULATION: R2 + + MODEL SELECTION COMPUTED: + Lambda fixed + Path + Cross Validation + Stability selection + + LAMBDA FIXED PARAMETERS: + numerical_method = Path-Alg + rescaled lam : True + threshold = 0.09 + lam = 0.1 + theoretical_lam = 0.224 + + PATH PARAMETERS: + numerical_method : Path-Alg + lamin = 0.001 + Nlam = 80 + + + CROSS VALIDATION PARAMETERS: + numerical_method : Path-Alg + one-SE method : True + Nsubset = 5 + lamin = 0.001 + Nlam = 80 + + + STABILITY SELECTION PARAMETERS: + numerical_method : Path-Alg + method : max + B = 50 + q = 10 + percent_nS = 0.5 + threshold = 0.7 + lamin = 0.01 + Nlam = 50 + + LAMBDA FIXED : + Selected variables : 17 59 123 + Running time : 0.104s + + PATH COMPUTATION : + Running time : 0.638s + + CROSS VALIDATION : + Selected variables : 16 17 57 59 64 73 74 76 93 115 123 134 137 181 + Running time : 2.1s + + STABILITY SELECTION : + Selected variables : 17 59 76 123 137 + Running time : 6.062s + +``` + + + + + + + + + + + + + + + +## Log-contrast regression for microbiome data + +In the [the accompanying notebook](./examples/example-notebook.ipynb) we study several microbiome data sets. We showcase two examples below. + +#### BMI prediction using the COMBO dataset + +We first consider the [COMBO data set](./examples/COMBO_data) and show how to predict Body Mass Index (BMI) from microbial genus abundances and two non-compositional covariates using "filtered_data". + +```python +from classo import csv_to_np, classo_problem, clr + +# Load microbiome and covariate data X +X0 = csv_to_np('COMBO_data/complete_data/GeneraCounts.csv', begin = 0).astype(float) +X_C = csv_to_np('COMBO_data/CaloriData.csv', begin = 0).astype(float) +X_F = csv_to_np('COMBO_data/FatData.csv', begin = 0).astype(float) + +# Load BMI measurements y +y = csv_to_np('COMBO_data/BMI.csv', begin = 0).astype(float)[:, 0] +labels = csv_to_np('COMBO_data/complete_data/GeneraPhylo.csv').astype(str)[:, -1] + + +# Normalize/transform data +y = y - np.mean(y) #BMI data (n = 96) +X_C = X_C - np.mean(X_C, axis = 0) #Covariate data (Calorie) +X_F = X_F - np.mean(X_F, axis = 0) #Covariate data (Fat) +X0 = clr(X0, 1 / 2).T + +# Set up design matrix and zero-sum constraints for 45 genera +X = np.concatenate((X0, X_C, X_F, np.ones((len(X0), 1))), axis = 1) # Joint microbiome and covariate data and offset +label = np.concatenate([labels, np.array(['Calorie', 'Fat', 'Bias'])]) +C = np.ones((1, len(X[0]))) +C[0, -1], C[0, -2], C[0, -3] = 0., 0., 0. 

# Set up c-lasso problem
problem = classo_problem(X, y, C, label = label)

# Use stability selection with theoretical lambda [Combettes & Müller, 2020b]
problem.model_selection.StabSelparameters.method = 'lam'
problem.model_selection.StabSelparameters.threshold_label = 0.5

# Use formulation R3
problem.formulation.concomitant = True

problem.solve()
print(problem)
print(problem.solution)

# Use formulation R4
problem.formulation.huber = True
problem.formulation.concomitant = True

problem.solve()
print(problem)
print(problem.solution)
```

<!---
<img src="https://i.imgur.com/8tFmM8T.png" alt="Central Park Soil Microbiome" height="250" align="right"/>
#### pH prediction using the Central Park soil dataset
The next microbiome example considers the [Central Park Soil dataset](./examples/pH_data) from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988). The sample locations are shown in the Figure on the right.
-->

#### pH prediction using the 88 soils dataset

The next microbiome example considers the [88 soils dataset](./examples/pH_data) from [Lauber et al., 2009](https://pubmed.ncbi.nlm.nih.gov/19502440/).

The task is to predict soil pH from microbial abundance data. A similar analysis is available
in [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1)
with Central Park soil data from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988).

Code to run this application is available in [the accompanying notebook](./examples/example-notebook.ipynb) under `pH data`. Below is a summary of a c-lasso problem instance (using the R3 formulation).

```
FORMULATION: R3

MODEL SELECTION COMPUTED:
     Lambda fixed
     Path
     Stability selection

LAMBDA FIXED PARAMETERS:
     numerical_method = Path-Alg
     rescaled lam : True
     threshold = 0.004
     lam : theoretical
     theoretical_lam = 0.2182

PATH PARAMETERS:
     numerical_method : Path-Alg
     lamin = 0.001
     Nlam = 80

STABILITY SELECTION PARAMETERS:
     numerical_method : Path-Alg
     method : lam
     B = 50
     q = 10
     percent_nS = 0.5
     threshold = 0.7
     lam = theoretical
     theoretical_lam = 0.3085
```

The c-lasso estimation results are summarized below:

```
LAMBDA FIXED :
     Sigma  =  0.198
     Selected variables :  14    18    19    39    43    57    62    85    93    94    104    107
     Running time :  0.008s

PATH COMPUTATION :
     Running time :  0.12s

STABILITY SELECTION :
     Selected variables :  2    12    15
     Running time :  0.287s
```


## Optimization schemes

The available problem formulations [R1-C2] require different algorithmic strategies for
efficiently solving the underlying optimization problem. We have implemented four
algorithms (with provable convergence guarantees) that vary in generality and are not
necessarily applicable to all problems. For each problem type, c-lasso has a default algorithm
setting that proved to be the fastest in our numerical experiments.

### Path algorithms (Path-Alg)
This is the default algorithm for non-concomitant problems [R1,R2,C1,C2].
The algorithm uses the fact that the solution path along λ is piecewise-affine
(as shown, e.g., in [1]). When least squares is used as the objective function,
we derive a novel efficient procedure that also yields the solution of the
concomitant problem [R3] along the path with little extra computational overhead.
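
As a small illustration of how a non-default solver could be requested, the sketch below overrides the numerical method used by two model selection routines. The attribute paths (`PATHparameters.numerical_method`, `StabSelparameters.numerical_method`) and the `'DR'` label are assumptions inferred from the parameter summaries printed above and from the advanced example; treat this as a sketch to check against the c-lasso API rather than a definitive reference.

```python
from classo import classo_problem, random_data

# Toy data, mirroring the synthetic example above
m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5
(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum = True, seed = 4)

problem = classo_problem(X, y, C)
problem.model_selection.PATH = True

# Assumed attribute names, mirroring the "numerical_method" fields of the printed summaries:
# keep the default path algorithm for the λ-path ...
problem.model_selection.PATHparameters.numerical_method = 'Path-Alg'
# ... and request the Douglas-Rachford solver (described below) for stability selection
problem.model_selection.StabSelparameters.numerical_method = 'DR'

problem.solve()
print(problem.solution)
```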

### Projected primal-dual splitting method (P-PDS)
This algorithm is derived from [2] and belongs to the class of
proximal splitting algorithms. It extends the classical Forward-Backward (FB)
(aka proximal gradient descent) algorithm to handle an additional linear equality constraint
via projection. In the absence of a linear constraint, the method reduces to FB.
This method can solve problem [R1]. For the Huber problem [R2],
P-PDS can solve the mean-shift formulation of the problem (see [6]).

### Projection-free primal-dual splitting method (PF-PDS)
This algorithm is a special case of an algorithm proposed in [3] (Eq. 4.5) and also belongs to the class of
proximal splitting algorithms. The algorithm does not require projection operators,
which may be beneficial when C has a more complex structure. In the absence of a linear constraint,
the method reduces to the Forward-Backward-Forward scheme. This method can solve problem [R1].
For the Huber problem [R2], PF-PDS can solve the mean-shift formulation of the problem (see [6]).

### Douglas-Rachford-type splitting method (DR)
This algorithm is the most general one and can solve all regression problems
[R1-R4]. It is based on Douglas-Rachford splitting in a higher-dimensional product space.
It makes use of the proximity operators of the perspective of the LS objective (see [4,5]).
The Huber problem with concomitant scale [R4] is reformulated as a scaled Lasso problem
with the mean shift (see [6]) and is thus solved in (n + d) dimensions.


## References

* [1] B. R. Gaines, J. Kim, and H. Zhou, [Algorithms for Fitting the Constrained Lasso](https://www.tandfonline.com/doi/abs/10.1080/10618600.2018.1473777?journalCode=ucgs20), J. Comput. Graph. Stat., vol. 27, no. 4, pp. 861–871, 2018.

* [2] L. Briceño-Arias and S. L. Rivera, [A Projected Primal–Dual Method for Solving Constrained Monotone Inclusions](https://link.springer.com/article/10.1007/s10957-018-1430-2?shared-article-renderer), J. Optim. Theory Appl., vol. 180, no. 3, 2019.

* [3] P. L. Combettes and J.-C. Pesquet, [Primal-Dual Splitting Algorithm for Solving Inclusions with Mixtures of Composite, Lipschitzian, and Parallel-Sum Type Monotone Operators](https://arxiv.org/pdf/1107.0081.pdf), Set-Valued and Variational Analysis, vol. 20, pp. 307–330, 2012.

* [4] P. L. Combettes and C. L. Müller, [Perspective M-estimation via proximal decomposition](https://arxiv.org/abs/1805.06098), Electronic Journal of Statistics, 2020 ([journal version](https://projecteuclid.org/euclid.ejs/1578452535)).

* [5] P. L. Combettes and C. L. Müller, [Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications](https://arxiv.org/abs/1903.01050), Statistics in Biosciences, 2020.

* [6] A. Mishra and C. L. Müller, [Robust regression with compositional covariates](https://arxiv.org/abs/1909.04990), arXiv, 2019.

* [7] S. Rosset and J. Zhu, [Piecewise linear regularized solution paths](https://projecteuclid.org/euclid.aos/1185303996), Ann. Stat., vol. 35, no. 3, pp. 1012–1030, 2007.

* [8] J. Bien, X. Yan, L. Simpson, and C. L. Müller, [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1), bioRxiv, 2020.
+ + + + + + +%prep +%autosetup -n c-lasso-1.0.11 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-c-lasso -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Fri May 05 2023 Python_Bot <Python_Bot@openeuler.org> - 1.0.11-1 +- Package Spec generated @@ -0,0 +1 @@ +e6c7ca5c6456bf96865da173e72fb7b9 c-lasso-1.0.11.tar.gz |