%global _empty_manifest_terminate_build 0
Name: python-c-lasso
Version: 1.0.11
Release: 1
Summary: Algorithms for constrained Lasso problems
License: MIT
URL: https://github.com/Leo-Simpson/CLasso
Source0: https://mirrors.nju.edu.cn/pypi/web/packages/2f/2b/e668407260df3d2779b12d77eb85aa065cec19dad57e368a519d949c293f/c-lasso-1.0.11.tar.gz
BuildArch: noarch
Requires: python3-numpy
Requires: python3-h5py
Requires: python3-scipy
Requires: python3-sphinx
Requires: python3-sphinx-gallery
Requires: python3-sphinx-rtd-theme
Requires: python3-numpydoc
Requires: python3-matplotlib
Requires: python3-pandas
Requires: python3-pytest
Requires: python3-pytest-cov
%description
[![arXiv](https://img.shields.io/badge/arXiv-2011.00898-b31b1b.svg)](https://arxiv.org/abs/2011.00898)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.02844/status.svg)](https://doi.org/10.21105/joss.02844)
# c-lasso: a Python package for constrained sparse regression and classification
c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality
constraints on the model parameters. For detailed info, one can check the [documentation](https://c-lasso.readthedocs.io/en/latest/).
The forward model is assumed to be:
y = Xβ + σε,  subject to  Cβ = 0
Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown coefficients, σ an unknown scale, and ε the noise.
The package handles several different estimators for inferring β (and σ), including
the constrained Lasso, the constrained scaled Lasso, sparse Huber M-estimation with linear equality constraints, and regularized Support Vector Machines.
Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve
the underlying convex optimization problems.
We also include two model selection strategies for determining the sparsity of the model parameters: k-fold cross-validation and stability selection.
This package is intended to fill the gap between popular Python tools such as [scikit-learn](https://scikit-learn.org/stable/), which cannot solve sparse constrained problems, and general-purpose optimization solvers, which do not scale well or are inaccurate for the considered problems (see [benchmarks](./benchmark/README.md)). In its current stage, however, c-lasso is not yet compatible with the scikit-learn API and is instead a stand-alone tool.
Below we show several use cases of the package, including an application of sparse *log-contrast*
regression to *compositional* microbiome data.
The code builds on results from several papers, which can be found in the [References](#references). We also refer to the accompanying [JOSS paper](https://github.com/Leo-Simpson/c-lasso/blob/master/paper/paper.md), also available on [arXiv](https://arxiv.org/pdf/2011.00898.pdf).
## Table of Contents
* [Installation](#installation)
* [Regression and classification problems](#regression-and-classification-problems)
* [Getting started](#getting-started)
* [Log-contrast regression for microbiome data](#log-contrast-regression-for-microbiome-data)
* [Optimization schemes](#optimization-schemes)
* [References](#references)
## Installation
c-lasso is available on PyPI. You can install the package
from the shell using
```shell
pip install c-lasso
```
To use the c-lasso package in Python, type
```python
from classo import classo_problem
# auxiliary functions such as random_data or csv_to_np can be imported as well
```
The `c-lasso` package depends on the following Python packages:
- `numpy`
- `matplotlib`
- `scipy`
- `pandas`
- `pytest` (for tests)
## Regression and classification problems
The c-lasso package can solve six different types of estimation problems:
four regression-type and two classification-type formulations.
#### [R1] Standard constrained Lasso regression:
minimize over β:  ||Xβ − y||² + λ||β||₁  subject to  Cβ = 0
This is the standard Lasso problem with linear equality constraints on the β vector.
The objective function combines Least-Squares for model fitting with an l1 penalty for sparsity.
#### [R2] Constrained sparse Huber regression:
minimize over β:  h_ρ(Xβ − y) + λ||β||₁  subject to  Cβ = 0
This regression problem uses the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) h_ρ as objective function
for robust model fitting, with an l1 penalty and linear equality constraints on the β vector. Here h_ρ sums the Huber function over the residuals (quadratic for |r| ≤ ρ, linear beyond), with the standard parameter choice ρ=1.345.
#### [R3] Constrained scaled Lasso regression:
This formulation is similar to [R1] but allows for joint estimation of the (constrained) β vector and
the standard deviation σ in a concomitant fashion (see [References](#references) [4,5] for further info).
This is the default problem formulation in c-lasso.
#### [R4] Constrained sparse Huber regression with concomitant scale estimation:
This formulation combines [R2] and [R3] to allow robust joint estimation of the (constrained) β vector and
the scale σ in a concomitant fashion (see [References](#references) [4,5] for further info).
#### [C1] Constrained sparse classification with Square Hinge loss:
minimize over β:  Σᵢ l(yᵢ xᵢᵀβ) + λ||β||₁  subject to  Cβ = 0
where the xᵢ are the rows of X, yᵢ ∈ {−1, 1}, and the squared hinge loss is defined as l(r) = max(1 − r, 0)².
This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss
with (constrained) sparse β vector estimation.
#### [C2] Constrained sparse classification with Huberized Square Hinge loss:
minimize over β:  Σᵢ l_ρ(yᵢ xᵢᵀβ) + λ||β||₁  subject to  Cβ = 0
where the xᵢ are the rows of X and l_ρ is the Huberized squared hinge loss: l_ρ(r) = max(1 − r, 0)² for r ≥ ρ, continued linearly as (1 − ρ)(1 + ρ − 2r) for r < ρ.
This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification
with (constrained) sparse β vector estimation.
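In code, these formulations are selected through flags on a problem instance. The sketch below uses the attribute names that appear in the examples later in this README; the `classification` flag is an assumption and is not shown elsewhere in this document:
```python
from classo import classo_problem, random_data

# Small synthetic instance (see "Getting started" below for details).
(X, C, y), sol = random_data(100, 200, 5, 1, 0.5, zerosum=True, seed=1)
problem = classo_problem(X, y, C)

# The default is [R3] (concomitant least squares). The regression
# formulations are combinations of two flags:
problem.formulation.concomitant = False  # [R1]: plain constrained Lasso
problem.formulation.huber = True         # together with concomitant=False: [R2]
# huber=True with concomitant=True would select [R4].
# problem.formulation.classification = True  # assumed flag for [C1]/[C2]
```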
## Getting started
#### Basic example
We begin with a basic example that shows how to run c-lasso on synthetic data. This example and the next one can be found in the notebook `Synthetic data Notebook.ipynb`.
The c-lasso package includes
the routine ```random_data``` that allows you to generate problem instances using normally distributed data.
```python
m, d, d_nonzero, k, sigma = 100, 200, 5, 1, 0.5
(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=1)
```
This code snippet generates a problem instance with a sparse β in dimension
d=200 (sparsity d_nonzero=5). The design matrix X comprises m=100 samples drawn from an i.i.d. standard normal
distribution. The constraint matrix C has dimension k × d. The noise level is σ=0.5.
The input ```zerosum=True``` implies that C is the all-ones row vector and Cβ=0. The m-dimensional outcome vector y
and the regression vector β are then generated to satisfy the given constraints.
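As a quick, illustrative sanity check (not part of the package), the shapes and the constraint can be verified directly; this assumes `sol` holds the ground-truth β used to generate y:
```python
import numpy as np

print(X.shape, C.shape, y.shape)  # expected: (100, 200) (1, 200) (100,)
assert np.allclose(C @ sol, 0)    # ground-truth β satisfies the zero-sum constraint
```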
Next we can define a default c-lasso problem instance with the generated data:
```python
problem = classo_problem(X, y, C)
```
You can look at the generated problem instance by typing:
```python
print(problem)
```
This gives you a summary of the form:
```
FORMULATION: R3
MODEL SELECTION COMPUTED:
Stability selection
STABILITY SELECTION PARAMETERS:
numerical_method : not specified
method : first
B = 50
q = 10
percent_nS = 0.5
threshold = 0.7
lamin = 0.01
Nlam = 50
```
As we have not specified any problem, algorithm, or model selection settings, this problem instance
represents the *default* settings for a c-lasso instance:
- The problem is of regression type and uses formulation [R3], i.e., with concomitant scale estimation.
- The *default* optimization scheme is the path algorithm (see [Optimization schemes](#optimization-schemes) for further info).
- For model selection, stability selection at a theoretically derived λ value is used (see [References](#references) [4] for details). Stability selection comprises a relatively large number of parameters; for a description of the settings, we refer to the more advanced examples below and the API. The most common ones can be overridden directly on the problem instance, as sketched below.
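A minimal sketch of overriding the most common stability selection settings (attribute names mirror the printed summary above):
```python
# Adjust stability selection before calling problem.solve().
problem.model_selection.StabSelparameters.B = 50            # number of subsamples
problem.model_selection.StabSelparameters.q = 10            # variables drawn per subsample
problem.model_selection.StabSelparameters.percent_nS = 0.5  # fraction of samples per subsample
problem.model_selection.StabSelparameters.threshold = 0.7   # selection-frequency threshold
```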
You can solve the corresponding c-lasso problem instance using
```python
problem.solve()
```
After completion, the results of the optimization and model selection routines
can be visualized using
```python
print(problem.solution)
```
The command shows the running time(s) for the c-lasso problem instance and the selected variables for stability selection:
```
STABILITY SELECTION :
Selected variables : 7 63 148 164 168
Running time : 1.546s
```
Here, we only used stability selection as the *default* model selection strategy.
The command also allows you to inspect the computed stability profile for all variables
at the theoretical λ:
![1.StabSel](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/basic/StabSel.png)
The refitted β values on the selected support are also displayed in the next plot:
![beta](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/basic/beta.png)
#### Advanced example
In the next example, we show how one can specify different aspects of the problem
formulation and model selection strategy.
```python
m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5
(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum = True, seed = 4)
problem = classo_problem(X, y, C)
problem.formulation.huber = True
problem.formulation.concomitant = False
problem.model_selection.CV = True
problem.model_selection.LAMfixed = True
problem.model_selection.PATH = True
problem.model_selection.StabSelparameters.method = 'max'
problem.model_selection.CVparameters.seed = 1
problem.model_selection.LAMfixedparameters.rescaled_lam = True
problem.model_selection.LAMfixedparameters.lam = .1
problem.solve()
print(problem)
print(problem.solution)
```
Results:
```
FORMULATION: R2
MODEL SELECTION COMPUTED:
Lambda fixed
Path
Cross Validation
Stability selection
LAMBDA FIXED PARAMETERS:
numerical_method = Path-Alg
rescaled lam : True
threshold = 0.09
lam = 0.1
theoretical_lam = 0.224
PATH PARAMETERS:
numerical_method : Path-Alg
lamin = 0.001
Nlam = 80
CROSS VALIDATION PARAMETERS:
numerical_method : Path-Alg
one-SE method : True
Nsubset = 5
lamin = 0.001
Nlam = 80
STABILITY SELECTION PARAMETERS:
numerical_method : Path-Alg
method : max
B = 50
q = 10
percent_nS = 0.5
threshold = 0.7
lamin = 0.01
Nlam = 50
LAMBDA FIXED :
Selected variables : 17 59 123
Running time : 0.104s
PATH COMPUTATION :
Running time : 0.638s
CROSS VALIDATION :
Selected variables : 16 17 57 59 64 73 74 76 93 115 123 134 137 181
Running time : 2.1s
STABILITY SELECTION :
Selected variables : 17 59 76 123 137
Running time : 6.062s
```
![2.StabSel](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/advanced/StabSel.png)
![2.StabSel-beta](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/advanced/StabSel-beta.png)
![2.CV-beta](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/advanced/CVbeta.png)
![2.CV-graph](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/advanced/CV.png)
![2.LAM-beta](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/advanced/beta.png)
![2.Path](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/advanced/Beta-path.png)
## Log-contrast regression for microbiome data
In [the accompanying notebook](./examples/example-notebook.ipynb) we study several microbiome data sets. We showcase two examples below.
#### BMI prediction using the COMBO dataset
We first consider the [COMBO data set](./examples/COMBO_data) and show how to predict Body Mass Index (BMI) from microbial genus abundances and two non-compositional covariates, using the `filtered_data` variant of the dataset.
```python
from classo import csv_to_np, classo_problem, clr
import numpy as np

# Load microbiome and covariate data X
X0 = csv_to_np('COMBO_data/complete_data/GeneraCounts.csv', begin = 0).astype(float)
X_C = csv_to_np('COMBO_data/CaloriData.csv', begin = 0).astype(float)
X_F = csv_to_np('COMBO_data/FatData.csv', begin = 0).astype(float)
# Load BMI measurements y
y = csv_to_np('COMBO_data/BMI.csv', begin = 0).astype(float)[:, 0]
labels = csv_to_np('COMBO_data/complete_data/GeneraPhylo.csv').astype(str)[:, -1]
# Normalize/transform data
y = y - np.mean(y) #BMI data (n = 96)
X_C = X_C - np.mean(X_C, axis = 0) #Covariate data (Calorie)
X_F = X_F - np.mean(X_F, axis = 0) #Covariate data (Fat)
X0 = clr(X0, 1 / 2).T
# Set up design matrix and zero-sum constraints for 45 genera
X = np.concatenate((X0, X_C, X_F, np.ones((len(X0), 1))), axis = 1) # Joint microbiome and covariate data and offset
label = np.concatenate([labels, np.array(['Calorie', 'Fat', 'Bias'])])
C = np.ones((1, len(X[0])))
C[0, -1], C[0, -2], C[0, -3] = 0., 0., 0.
# Set up c-lasso problem
problem = classo_problem(X, y, C, label = label)
# Use stability selection with theoretical lambda [Combettes & Müller, 2020b]
problem.model_selection.StabSelparameters.method = 'lam'
problem.model_selection.StabSelparameters.threshold_label = 0.5
# Use formulation R3
problem.formulation.concomitant = True
problem.solve()
print(problem)
print(problem.solution)
# Use formulation R4
problem.formulation.huber = True
problem.formulation.concomitant = True
problem.solve()
print(problem)
print(problem.solution)
```
![3.Stability profile R3](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/exampleFilteredCOMBO/R3-StabSel.png)
![3.Beta solution R3](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/exampleFilteredCOMBO/R3-StabSel-beta.png)
![3.Stability profile R4](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/exampleFilteredCOMBO/R4-StabSel.png)
![3.Beta solution R4](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/exampleFilteredCOMBO/R4-StabSel-beta.png)
#### pH prediction using the 88 soils dataset
The next microbiome example considers the [88 soils dataset](./examples/pH_data) from [Lauber et al., 2009](https://pubmed.ncbi.nlm.nih.gov/19502440/).
The task is to predict the pH of the soil from microbial abundance data. A similar analysis is available
in [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1)
with Central Park soil data from [Ramirez et al.](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2014.1988).
Code to run this application is available in [the accompanying notebook](./examples/example-notebook.ipynb) under `pH data`. A minimal setup sketch is shown below, followed by a summary of the resulting c-lasso problem instance (using the R3 formulation).
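The sketch assumes `X`, `y`, and `label` have been loaded as in the notebook, and that `classo_problem` falls back to the zero-sum constraint when no C is given:
```python
# Assumes X, y, label are prepared as in the accompanying notebook.
problem = classo_problem(X, y, label = label)  # assumed default: zero-sum constraint
problem.model_selection.PATH = True
problem.model_selection.LAMfixed = True
problem.model_selection.StabSelparameters.method = 'lam'
problem.solve()
print(problem)
```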
```
FORMULATION: R3
MODEL SELECTION COMPUTED:
Lambda fixed
Path
Stability selection
LAMBDA FIXED PARAMETERS:
numerical_method = Path-Alg
rescaled lam : True
threshold = 0.004
lam : theoretical
theoretical_lam = 0.2182
PATH PARAMETERS:
numerical_method : Path-Alg
lamin = 0.001
Nlam = 80
STABILITY SELECTION PARAMETERS:
numerical_method : Path-Alg
method : lam
B = 50
q = 10
percent_nS = 0.5
threshold = 0.7
lam = theoretical
theoretical_lam = 0.3085
```
The c-lasso estimation results are summarized below:
```
LAMBDA FIXED :
Sigma = 0.198
Selected variables : 14 18 19 39 43 57 62 85 93 94 104 107
Running time : 0.008s
PATH COMPUTATION :
Running time : 0.12s
STABILITY SELECTION :
Selected variables : 2 12 15
Running time : 0.287s
```
![Ex4.1](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/examplePH/R3-Beta-path.png)
![Ex4.2](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/examplePH/R3-Sigma-path.png)
![Ex4.3](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/examplePH/R3-StabSel.png)
![Ex4.4](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/examplePH/R3-StabSel-beta.png)
![Ex4.5](https://github.com/Leo-Simpson/c-lasso/blob/master/figures/examplePH/R3-beta.png)
## Optimization schemes
The available problem formulations [R1-C2] require different algorithmic strategies for
efficiently solving the underlying optimization problem. We have implemented four
algorithms (with provable convergence guarantees) that vary in generality and are not
necessarily applicable to all problems. For each problem type, c-lasso has a default algorithm
setting that proved to be the fastest in our numerical experiments.
### Path algorithms (Path-Alg)
This is the default algorithm for the non-concomitant problems [R1,R2,C1,C2].
The algorithm uses the fact that the solution path along λ is piecewise affine
(as shown, e.g., in [1]). When Least-Squares is used as objective function,
we derive a novel efficient procedure that also yields the
solution of the concomitant problem [R3] along the path with little extra computational overhead.
### Projected primal-dual splitting method (P-PDS)
This algorithm is derived from [2] and belongs to the class of
proximal splitting algorithms. It extends the classical Forward-Backward (FB)
algorithm (a.k.a. proximal gradient descent) to handle an additional linear equality constraint
via projection. In the absence of a linear constraint, the method reduces to FB.
This method can solve problem [R1]. For the Huber problem [R2],
P-PDS can solve the mean-shift formulation of the problem (see [6]).
### Projection-free primal-dual splitting method (PF-PDS)
This algorithm is a special case of an algorithm proposed in [3] (Eq. 4.5) and also belongs to the class of
proximal splitting algorithms. The algorithm does not require projection operators,
which may be beneficial when C has a more complex structure. In the absence of a linear constraint,
the method reduces to the Forward-Backward-Forward scheme. This method can solve problem [R1].
For the Huber problem [R2], PF-PDS can solve the mean-shift formulation of the problem (see [6]).
### Douglas-Rachford-type splitting method (DR)
This is the most general algorithm and can solve all regression problems
[R1-R4]. It is based on Douglas-Rachford splitting in a higher-dimensional product space
and makes use of the proximity operators of the perspective of the LS objective (see [4,5]).
The Huber problem with concomitant scale [R4] is reformulated as a scaled Lasso problem
with mean shift (see [6]) and is thus solved in (n + d) dimensions.
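Each model-selection task selects its solver through the `numerical_method` field shown in the printed summaries above. A minimal sketch (the value 'Path-Alg' appears in the summaries; 'DR' and the `PATHparameters` name are assumptions by analogy with the attributes used in the examples):
```python
# Choose the solver per task before problem.solve().
problem.model_selection.PATHparameters.numerical_method = 'Path-Alg'  # assumed attribute name
problem.model_selection.StabSelparameters.numerical_method = 'DR'     # assumed solver string
```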
## References
* [1] B. R. Gaines, J. Kim, and H. Zhou, [Algorithms for Fitting the Constrained Lasso](https://www.tandfonline.com/doi/abs/10.1080/10618600.2018.1473777?journalCode=ucgs20), J. Comput. Graph. Stat., vol. 27, no. 4, pp. 861–871, 2018.
* [2] L. Briceño-Arias and S. López Rivera, [A Projected Primal-Dual Method for Solving Constrained Monotone Inclusions](https://link.springer.com/article/10.1007/s10957-018-1430-2?shared-article-renderer), J. Optim. Theory Appl., vol. 180, no. 3, March 2019.
* [3] P. L. Combettes and J.-C. Pesquet, [Primal-Dual Splitting Algorithm for Solving Inclusions with Mixtures of Composite, Lipschitzian, and Parallel-Sum Type Monotone Operators](https://arxiv.org/pdf/1107.0081.pdf), Set-Valued and Variational Analysis, vol. 20, pp. 307–330, 2012.
* [4] P. L. Combettes and C. L. Müller, [Perspective M-estimation via proximal decomposition](https://arxiv.org/abs/1805.06098), Electronic Journal of Statistics, 2020 ([journal version](https://projecteuclid.org/euclid.ejs/1578452535)).
* [5] P. L. Combettes and C. L. Müller, [Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications](https://arxiv.org/abs/1903.01050), Statistics in Biosciences, 2020.
* [6] A. Mishra and C. L. Müller, [Robust regression with compositional covariates](https://arxiv.org/abs/1909.04990), arXiv, 2019.
* [7] S. Rosset and J. Zhu, [Piecewise linear regularized solution paths](https://projecteuclid.org/euclid.aos/1185303996), Ann. Stat., vol. 35, no. 3, pp. 1012–1030, 2007.
* [8] J. Bien, X. Yan, L. Simpson, and C. L. Müller, [Tree-Aggregated Predictive Modeling of Microbiome Data](https://www.biorxiv.org/content/10.1101/2020.09.01.277632v1), bioRxiv, 2020.
%package -n python3-c-lasso
Summary: Algorithms for constrained Lasso problems
Provides: python-c-lasso
BuildRequires: python3-devel
BuildRequires: python3-setuptools
BuildRequires: python3-pip
%description -n python3-c-lasso
c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality constraints on the model parameters. It provides constrained Lasso, constrained scaled Lasso, sparse Huber M-estimation with linear equality constraints, and regularized Support Vector Machine estimators, together with path and proximal splitting solvers and two model selection strategies (k-fold cross-validation and stability selection). For details, see the full description above and the documentation at https://c-lasso.readthedocs.io/en/latest/.
%package help
Summary: Development documents and examples for c-lasso
Provides: python3-c-lasso-doc
%description help
This package contains development documents and examples for c-lasso, a Python package for constrained sparse regression and classification. See the main package description for an overview of the available estimators, solvers, and model selection strategies.
%prep
%autosetup -n c-lasso-1.0.11
%build
%py3_build
%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
%files -n python3-c-lasso -f filelist.lst
%dir %{python3_sitelib}/*
%files help -f doclist.lst
%{_docdir}/*
%changelog
* Fri May 05 2023 Python_Bot - 1.0.11-1
- Package Spec generated