author | CoprDistGit <infra@openeuler.org> | 2023-04-23 03:48:06 +0000 |
---|---|---|
committer | CoprDistGit <infra@openeuler.org> | 2023-04-23 03:48:06 +0000 |
commit | deae515f1561977827a0c9cf8f074c66cb682576 (patch) | |
tree | 7cfc51484021e624f5beda867b150fae425eec76 | |
parent | 4a865dde6825fce09c50bb276712686dacdba8bb (diff) |
automatic import of python-featurewiz (openeuler20.03)
-rw-r--r-- | .gitignore | 1
-rw-r--r-- | python-featurewiz.spec | 788
-rw-r--r-- | sources | 2
3 files changed, 312 insertions, 479 deletions
@@ -1 +1,2 @@ /featurewiz-0.2.6.tar.gz +/featurewiz-0.2.8.tar.gz diff --git a/python-featurewiz.spec b/python-featurewiz.spec index 58b6863..da622f2 100644 --- a/python-featurewiz.spec +++ b/python-featurewiz.spec @@ -1,11 +1,11 @@ %global _empty_manifest_terminate_build 0 Name: python-featurewiz -Version: 0.2.6 +Version: 0.2.8 Release: 1 Summary: Select Best Features from your data set - any size - now with XGBoost! License: Apache License 2.0 URL: https://github.com/AutoViML/featurewiz -Source0: https://mirrors.nju.edu.cn/pypi/web/packages/a3/37/8bf60ab38cc03571208d98ec7148246b9e62aa0b2b4c44160b53d23eb664/featurewiz-0.2.6.tar.gz +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/d3/bd/8c6df689a4a1b8f38a030cda6c840ec9b30c9c9b11f801d734c839627262/featurewiz-0.2.8.tar.gz BuildArch: noarch Requires: python3-Pillow @@ -30,106 +30,46 @@ Requires: python3-xlrd %description # featurewiz - - -<p> - -## Update (October 2022): FeatureWiz 2.0 is here. -<ol> -<li><b>featurewiz 2.0 is here. You have two small performance improvements:</li> </b> -1. SULOV method now has a higher correlation limit of 0.90 as default. This means fewer variables are removed and hence more vars are selected. You can always set it back to the old limit by setting `corr_limit`=0.70 if you want.<br> -2. Recursive XGBoost algorithm is tighter in that it selects fewer features in each iteration. To see how many it selects, set `verbose` flag to 1. <br> -The net effect is that the same number of features are selected but they are better at producing more accurate models. Try it out and let us know. </ol> - -## Update (September 2022): You can now skip SULOV method using skip_sulov flag -<ol> -<li>featurewiz now has a new input: `skip_sulov` flag is here. You can set it to `True` to skip the SULOV method if needed.</li> -</ol> - -## Update (August 2022): Silent mode with verbose=0 -<ol> -<li><b>featurewiz now has a "silent" mode which you can set using the "verbose=0" option.</b> It will run silently with no charts or graphs and very minimal verbose output. Hope this helps!<br></li> -</ol> -## Update (May 2022) -<ol> -<li><b>featurewiz as of version 0.1.50 or higher has multiple high performance models</b> that you can use to build highly performant models once you have completed feature selection. These models are based on LightGBM and XGBoost and have even Stacking and Blending ensembles. You can find them as functions starting with "simple_" and "complex_" under featurewiz. All the best!<br></li> -</ol> -## Update (March 2022) +`featurewiz` is a powerful feature selection library that has a number of features that make it stand out from the competition, including: <ol> -<li><b>featurewiz as of version 0.1.04 or higher can read `feather-format` files at blazing speeds.</b> See example below on how to convert your CSV files to feather. 
Then you can feed those '.ftr' files to featurewiz and it will read it 10-100X faster!<br></li> +<li>It provides one of the best automatic feature selection algorithms (Minimum Redundancy Maximum Relevance (MRMR)) described by wikipedia as: <a href="https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection">"The MRMR selection has been found to be more powerful than the maximum relevance feature selection"</a> such as Boruta.</li> +<li>It selects the best number of un-correlated features that have maximum mutual information about the target without having to specify the number of features</li> +<li>It is fast and easy to use, and comes with a number of helpful features, such as a built-in categorical-to-numeric encoder and a powerful feature engineering module</li> +<li>It is well-documented, and it comes with a number of <a href="https://github.com/AutoViML/featurewiz/tree/main/examples">examples</a>.</li> +<li>It is actively maintained, and it is regularly updated with new features and bug fixes.</li> </ol> +If you are looking for a single feature selection library, we would definitely recommend checking out featurewiz. It is a powerful tool that can help you to improve the performance of your machine learning models. - -<ol> -<li><b>featurewiz now runs at blazing speeds thanks to using GPU's by default.</b> So if you are running a large data set on Colab and/or Kaggle, make sure you turn on the GPU kernels. featurewiz will automatically detect that GPU is turned on and will utilize XGBoost using GPU-hist. That will ensure it will crunch your datasets even faster. I have tested it with a very large data set and it reduced the running time from 52 mins to 1 minute! That's a 98% reduction in running time using GPU compared to CPU!<br></li> -</ol> -## Update (Jan 2022) -<ol> -<li><b>FeatureWiz as of version 0.0.90 or higher is a scikit-learn compatible feature selection transformer.</b> You can perform fit and predict as follows. You will get a Transformer that can select the top variables from your dataset. You can also use it in sklearn pipelines as a Transformer.</li> - -``` -from featurewiz import FeatureWiz -features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', -dask_xgboost_flag=False, nrows=None, verbose=2) -X_train_selected = features.fit_transform(X_train, y_train) -X_test_selected = features.transform(X_test) -features.features ### provides the list of selected features ### -``` - -<li><b>Featurewiz is now upgraded with XGBOOST 1.5.1 for DASK for blazing fast performance</b> even for very large data sets! Set `dask_xgboost_flag = True` to run dask + xgboost.</li> -<li><b>Featurewiz now runs with a default setting of `nrows=None`.</b> This means it will run using all rows. But if you want it to run faster, then you can change `nrows` to 1000 or whatever, so it will sample that many rows and run.</li> -<li><b>Featurewiz has lots of new fast model builder functions:</b> that you can use to build highly performant models with the features selected by featurewiz. They are:<br> -1. <b>simple_LightGBM_model()</b> - simple regression and classification with one target label<br> -2. <b>simple_XGBoost_model()</b> - simple regression and classification with one target label<br> -3. <b>complex_LightGBM_model()</b> - more complex multi-label and multi-class models<br> -4. <b>complex_XGBoost_model()</b> - more complex multi-label and multi-class models<br> -5. <b>Stacking_Classifier()</b>: Stacking model that can handle multi-label, multi-class problems<br> -6. 
<b>Stacking_Regressor()</b>: Stacking model that can handle multi-label, regression problems<br> -7. <b>Blending_Regressor()</b>: Blending model that can handle multi-label, regression problems<br></li> -</ol> - -## Good News! -As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:<br> - -``` - conda install -c conda-forge featurewiz -``` - -### If the above conda install fails, you can try installing featurewiz this way: -##Step 1: Install featurewiz first<br> - -``` - !pip install featurewiz --ignore-installed --no-deps - !pip install xlrd --ignore-installed --no-deps -``` - -##Step 2: Next, install Pillow since Kaggle has an incompatible version. <br> +# Table of Contents +<ul> +<li><a href="#introduction">What is featurewiz</a></li> +<li><a href="#working">How it works</a></li> +<li><a href="#tips">Tips for using featurewiz</a></li> +<li><a href="#install">How to install featurewiz</a></li> +<li><a href="#usage">Usage</a></li> +<li><a href="#api">API</a></li> +<li><a href="#additional">Additional Tips</a></li> +<li><a href="#maintainers">Maintainers</a></li> +<li><a href="#contributing">Contributing</a></li> +<li><a href="#license">License</a></li> +<li><a href="#disclaimer">Disclaimer</a></li> +</ul> +<p> -``` - !pip install Pillow==9.0.0 -``` + -## What is featurewiz? +## Introduction `featurewiz` a new python library for creating and selecting the best features in your data set fast! `featurewiz` can be used in one or two ways. Both are explained below. -## 1. Feature Engineering +### 1. Feature Engineering <p>The first step is not absolutely necessary but it can be used to create new features that may or may not be helpful (be careful with automated feature engineering tools!).<p> 1. <b>Performing Feature Engineering</b>: One of the gaps in open source AutoML tools and especially Auto_ViML has been the lack of feature engineering capabilities that high powered competitions such as Kaggle required. The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables was difficult and sifting through those hundreds of new features to find best features was difficult and left only to "experts" or "professionals". featurewiz was created to help you in this endeavor.<br> <p>featurewiz now enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that, using the next step, featurewiz will sift through numerous such variables and find only the least correlated and most relevant features to your model. All in one step!.<br> -You must use this syntax for feature engg. Otherwise, featurewiz will give an error: - -``` -import featurewiz as FW -outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', - header=0, test_data='',feature_engg='', category_encoders='', - dask_xgboost_flag=False, nrows=None) -``` -  -## 2. Feature Selection +### 2. Feature Selection <p>The second step is Feature Selection. `featurewiz` uses the MRMR (Minimum Redundancy Maximum Relevance) algorithm as the basis for its feature selection. <br> <b> Why do Feature Selection</b>? Once you have created 100's of new features, you still have three questions left to answer: 1. How do we interpret those newly created features? 
@@ -140,65 +80,38 @@ All are very important questions and featurewiz answers them by using the SULOV <p><b>SULOV</b>: SULOV stands for `Searching for Uncorrelated List of Variables`. The SULOV algorithm is based on the Minimum-Redundancy-Maximum-Relevance (MRMR) <a href="https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b">algorithm explained in this article</a> as one of the best feature selection methods. To understand how MRMR works and how it is different from `Boruta` and other feature selection methods, see the chart below. Here "Minimal Optimal" refers to the MRMR and featurewiz kind of algorithms while "all-relevant" refers to Boruta kind of algorithms.<br>  -<br> -The working of the SULOV algorithm is as follows: + +## Working +`featurewiz` performs feature selection in 2 steps. Each step is explained below. +<b>The working of the `SULOV` algorithm</b> is as follows: <ol> -<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)). -<li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method. So its suitable for all kinds of variables and target. -<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score. -<li>What’s left is the ones with the highest Information scores and least correlation with each other. +<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).</li> +<li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method. So its suitable for all kinds of variables and target.</li> +<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score.</li> +<li>What’s left is the ones with the highest Information scores and least correlation with each other.</li> </ol>  -<b>Recursive XGBoost</b>: Once SULOV has selected variables that have high mutual information scores with least less correlation amongst them, we use XGBoost to repeatedly find best features among the remaining variables after SULOV. The Recursive XGBoost method is explained in this chart below. -Here is how it works: +<b>The working of the Recursive XGBoost</b> is as follows: +Once SULOV has selected variables that have high mutual information scores with least less correlation amongst them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables after SULOV. <ol> -<li>Select all variables in data set and the full data split into train and valid sets. -<li>Find top X features (could be 10) on train using valid for early stopping (to prevent over-fitting) -<li>Then take next set of vars and find top X -<li>Do this 5 times. Combine all selected features and de-duplicate them. +<li>Select all variables in data set and the full data split into train and valid sets.</li> +<li>Find top X features (could be 10) on train using valid for early stopping (to prevent over-fitting)</li> +<li>Then take next set of vars and find top X</li> +<li>Do this 5 times. Combine all selected features and de-duplicate them.</li> </ol>  -<b>Building the simplest and most "interpretable" model</b>: featurewiz represents the "next best" step you must perform after doing feature engineering since you might have added some highly correlated or even useless features when you use automated feature engineering. 
featurewiz ensures you have the least number of features needed to build a high performing or equivalent model. - -<b>A WORD OF CAUTION:</b> Just because you can engineer new features, doesn't mean you should always create tons of new features. You must make sure you understand what the new features stand for before you attempt to build a model with these (sometimes useless) features. featurewiz displays the SULOV chart which can show you how the 100's of newly created variables added to your dataset are highly correlated to each other and were removed. This will help you understand how feature selection works in featurewiz. - -## Table of Contents -<ul> -<li><a href="#background">Background</a></li> -<li><a href="#install">Install</a></li> -<li><a href="#usage">Usage</a></li> -<li><a href="#api">API</a></li> -<li><a href="#maintainers">Maintainers</a></li> -<li><a href="#contributing">Contributing</a></li> -<li><a href="#license">License</a></li> -</ul> - -## Background - - - -To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)<br> - -<p>featurewiz was designed for selecting High Performance variables with the fewest steps. - -In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).<br> -<p> -featurewiz is every Data Scientist's feature wizard that will:<ol> -<li><b>Automatically pre-process data</b>: you can send in your entire dataframe "as is" and featurewiz will classify and change/label encode categorical variables changes to help XGBoost processing. It classifies variables as numeric or categorical or NLP or date-time variables automatically so it can use them correctly to model.<br> -<li><b>Perform feature engineering automatically</b>: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option. -<li><b>Perform feature reduction automatically</b>. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!<br> -<li><b>Explain SULOV method graphically </b> using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph. <br> -<li><b>Build a fast LightGBM model </b> using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. 
It is a new module, so check it out.<br> -</ol> - -<b>*** Notes of Gratitude ***</b>:<br> +## Tips +Here are some additional tips for ML engineers and data scientists when using featurewiz: <ol> -<li><b>Alex Lekov</b> (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).</li> -<li><b>Category Encoders</b> library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html </li> +<li><b>Always cross-validate your results</b>: When you use a feature selection tool, it is important to cross-validate your results. This means that you should split your data into a training set and a test set. Use the training set to select features, and then evaluate your model on the test set. This will help you to ensure that your model is not overfitting to the training data.</li> +<li><b>Use multiple feature selection tools</b>: It is a good idea to use multiple feature selection tools and compare the results. This will help you to get a better understanding of which features are most important for your data.</li> +<li><b>Don't forget to engineer new features</b>: Feature selection is only one part of the process of building a good machine learning model. You should also spend time engineering your features to make them as informative as possible. This can involve things like creating new features, transforming existing features, and removing irrelevant features.</li> +<li><b>Don't overfit your model</b>: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.</li> +<li><b>Start with a small number of features</b>: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.</li> </ol> ## Install @@ -208,21 +121,6 @@ featurewiz is every Data Scientist's feature wizard that will:<ol> <li><b>featurewiz is built using xgboost, dask, numpy, pandas and matplotlib</b>. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "dask", "XGBoost" and "networkx" library. Optionally, it uses LightGBM for fast modeling, which it installs automatically. </li> <li><b>We use "networkx" library for charts and interpretability</b>. <br>But if you don't have these libraries, featurewiz will install those for you automatically.</li> </ol> -- [Anaconda](https://docs.anaconda.com/anaconda/install/) - -To clone featurewiz, it is better to create a new environment, and install the required dependencies: - -To install from PyPi: - -``` -conda create -n <your_env_name> python=3.7 anaconda -conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>` -pip install featurewiz --ignore-installed --no-deps -pip install lazytransform -or -pip install git+https://github.com/AutoViML/featurewiz.git -``` - To install from source: ``` @@ -235,29 +133,52 @@ cd featurewiz pip install -r requirements.txt ``` +## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps! 
+<a href="updates.md">Check out more latest updates from this page</a><br> +As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:<br> + +``` + conda install -c conda-forge featurewiz +``` + +### If the above conda install fails, you can try installing featurewiz this way: +##Step 1: Install featurewiz first<br> + +``` + !pip install featurewiz --ignore-installed --no-deps + !pip install xlrd --ignore-installed --no-deps +``` + +##Step 2: Next, install Pillow since Kaggle has an incompatible version. <br> + +``` + !pip install Pillow==9.0.0 +``` + ## Usage -As of Jan 2022, you now invoke featurewiz in two ways for two different goals. For feature selection, you must use the scikit-learn compatible fit and predict transformer syntax such as below. +For feature selection, you must use the newer syntax which is similar to the scikit-learn fit and predict transformer syntax below. ``` from featurewiz import FeatureWiz -features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2) -X_train_selected = features.fit_transform(X_train, y_train) -X_test_selected = features.transform(X_test) -features.features ### provides the list of selected features ### +fwiz = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2) +X_train_selected = fwiz.fit_transform(X_train, y_train) +X_test_selected = fwiz.transform(X_test) +### get list of selected features ### +fwiz.features ``` Alternatively, you can use featurewiz for feature engineering using this older syntax. Otherwise, it will give an error. If you want to combine feature engg and then feature selection, you must use this older syntax: ``` -import featurewiz as FW -outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', +import featurewiz as fwiz +outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', header=0, test_data='',feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None) ``` `outputs`: There will always be multiple objects in output. The objects in that tuple can vary: -1. "features" and "train": It be a list (of selected features) and one dataframe (if you sent in train only) +1. "features" and "trainm": It be a list (of selected features) and one dataframe (if you sent in train only) 2. "trainm" and "testm": It can be two dataframes when you send in both test and train but with selected features. <ol> <li>Both the selected features and dataframes are ready for you to now to do further modeling. @@ -305,7 +226,8 @@ outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose= The mean target value (regardless of the feature value). - `dask_xgboost_flag`: Default is False. Set to True to use dask_xgboost estimator. You can turn it off if it gives an error. Then it will use pandas and regular xgboost to do the job. - `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask. -**Return values** + +**Output values** - `outputs`: Output is always a tuple. We can call our outputs in that tuple: out1 and out2. - `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get: - 1. 
`features`: It will be a list (of selected features) and @@ -314,6 +236,28 @@ outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose= - 1. `trainm`: a modified train dataframe with engineered and selected features from dataname and - 2. `testm`: a modified test dataframe with engineered and selected features from test_data. +## Additional + + + +To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)<br> +<p>featurewiz was designed for selecting High Performance variables with the fewest steps. +In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).<br> +<p> +featurewiz is every Data Scientist's feature wizard that will:<ol> +<li><b>Automatically pre-process data</b>: you can send in your entire dataframe "as is" and featurewiz will classify and change/label encode categorical variables changes to help XGBoost processing. It classifies variables as numeric or categorical or NLP or date-time variables automatically so it can use them correctly to model.<br> +<li><b>Perform feature engineering automatically</b>: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option. +<li><b>Perform feature reduction automatically</b>. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!<br> +<li><b>Explain SULOV method graphically </b> using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph. <br> +<li><b>Build a fast XGBoost or LightGBM model using the features selected by featurewiz</b>. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.<br> +</ol> + +<b>*** A Note of Gratitude ***</b>:<br> +<ol> +<li><b>Alex Lekov</b> (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).</li> +<li><b>Category Encoders</b> library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html </li> +</ol> + ## Maintainers * [@AutoViML](https://github.com/AutoViML) @@ -342,106 +286,46 @@ BuildRequires: python3-setuptools BuildRequires: python3-pip %description -n python3-featurewiz # featurewiz - - -<p> - -## Update (October 2022): FeatureWiz 2.0 is here. 
+`featurewiz` is a powerful feature selection library that has a number of features that make it stand out from the competition, including: <ol> -<li><b>featurewiz 2.0 is here. You have two small performance improvements:</li> </b> -1. SULOV method now has a higher correlation limit of 0.90 as default. This means fewer variables are removed and hence more vars are selected. You can always set it back to the old limit by setting `corr_limit`=0.70 if you want.<br> -2. Recursive XGBoost algorithm is tighter in that it selects fewer features in each iteration. To see how many it selects, set `verbose` flag to 1. <br> -The net effect is that the same number of features are selected but they are better at producing more accurate models. Try it out and let us know. </ol> - -## Update (September 2022): You can now skip SULOV method using skip_sulov flag -<ol> -<li>featurewiz now has a new input: `skip_sulov` flag is here. You can set it to `True` to skip the SULOV method if needed.</li> -</ol> - -## Update (August 2022): Silent mode with verbose=0 -<ol> -<li><b>featurewiz now has a "silent" mode which you can set using the "verbose=0" option.</b> It will run silently with no charts or graphs and very minimal verbose output. Hope this helps!<br></li> -</ol> -## Update (May 2022) -<ol> -<li><b>featurewiz as of version 0.1.50 or higher has multiple high performance models</b> that you can use to build highly performant models once you have completed feature selection. These models are based on LightGBM and XGBoost and have even Stacking and Blending ensembles. You can find them as functions starting with "simple_" and "complex_" under featurewiz. All the best!<br></li> +<li>It provides one of the best automatic feature selection algorithms (Minimum Redundancy Maximum Relevance (MRMR)) described by wikipedia as: <a href="https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection">"The MRMR selection has been found to be more powerful than the maximum relevance feature selection"</a> such as Boruta.</li> +<li>It selects the best number of un-correlated features that have maximum mutual information about the target without having to specify the number of features</li> +<li>It is fast and easy to use, and comes with a number of helpful features, such as a built-in categorical-to-numeric encoder and a powerful feature engineering module</li> +<li>It is well-documented, and it comes with a number of <a href="https://github.com/AutoViML/featurewiz/tree/main/examples">examples</a>.</li> +<li>It is actively maintained, and it is regularly updated with new features and bug fixes.</li> </ol> -## Update (March 2022) -<ol> -<li><b>featurewiz as of version 0.1.04 or higher can read `feather-format` files at blazing speeds.</b> See example below on how to convert your CSV files to feather. Then you can feed those '.ftr' files to featurewiz and it will read it 10-100X faster!<br></li> -</ol> - - -<ol> -<li><b>featurewiz now runs at blazing speeds thanks to using GPU's by default.</b> So if you are running a large data set on Colab and/or Kaggle, make sure you turn on the GPU kernels. featurewiz will automatically detect that GPU is turned on and will utilize XGBoost using GPU-hist. That will ensure it will crunch your datasets even faster. I have tested it with a very large data set and it reduced the running time from 52 mins to 1 minute! 
That's a 98% reduction in running time using GPU compared to CPU!<br></li> -</ol> -## Update (Jan 2022) -<ol> -<li><b>FeatureWiz as of version 0.0.90 or higher is a scikit-learn compatible feature selection transformer.</b> You can perform fit and predict as follows. You will get a Transformer that can select the top variables from your dataset. You can also use it in sklearn pipelines as a Transformer.</li> +If you are looking for a single feature selection library, we would definitely recommend checking out featurewiz. It is a powerful tool that can help you to improve the performance of your machine learning models. -``` -from featurewiz import FeatureWiz -features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', -dask_xgboost_flag=False, nrows=None, verbose=2) -X_train_selected = features.fit_transform(X_train, y_train) -X_test_selected = features.transform(X_test) -features.features ### provides the list of selected features ### -``` - -<li><b>Featurewiz is now upgraded with XGBOOST 1.5.1 for DASK for blazing fast performance</b> even for very large data sets! Set `dask_xgboost_flag = True` to run dask + xgboost.</li> -<li><b>Featurewiz now runs with a default setting of `nrows=None`.</b> This means it will run using all rows. But if you want it to run faster, then you can change `nrows` to 1000 or whatever, so it will sample that many rows and run.</li> -<li><b>Featurewiz has lots of new fast model builder functions:</b> that you can use to build highly performant models with the features selected by featurewiz. They are:<br> -1. <b>simple_LightGBM_model()</b> - simple regression and classification with one target label<br> -2. <b>simple_XGBoost_model()</b> - simple regression and classification with one target label<br> -3. <b>complex_LightGBM_model()</b> - more complex multi-label and multi-class models<br> -4. <b>complex_XGBoost_model()</b> - more complex multi-label and multi-class models<br> -5. <b>Stacking_Classifier()</b>: Stacking model that can handle multi-label, multi-class problems<br> -6. <b>Stacking_Regressor()</b>: Stacking model that can handle multi-label, regression problems<br> -7. <b>Blending_Regressor()</b>: Blending model that can handle multi-label, regression problems<br></li> -</ol> - -## Good News! -As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:<br> - -``` - conda install -c conda-forge featurewiz -``` - -### If the above conda install fails, you can try installing featurewiz this way: -##Step 1: Install featurewiz first<br> - -``` - !pip install featurewiz --ignore-installed --no-deps - !pip install xlrd --ignore-installed --no-deps -``` - -##Step 2: Next, install Pillow since Kaggle has an incompatible version. <br> +# Table of Contents +<ul> +<li><a href="#introduction">What is featurewiz</a></li> +<li><a href="#working">How it works</a></li> +<li><a href="#tips">Tips for using featurewiz</a></li> +<li><a href="#install">How to install featurewiz</a></li> +<li><a href="#usage">Usage</a></li> +<li><a href="#api">API</a></li> +<li><a href="#additional">Additional Tips</a></li> +<li><a href="#maintainers">Maintainers</a></li> +<li><a href="#contributing">Contributing</a></li> +<li><a href="#license">License</a></li> +<li><a href="#disclaimer">Disclaimer</a></li> +</ul> +<p> -``` - !pip install Pillow==9.0.0 -``` + -## What is featurewiz? +## Introduction `featurewiz` a new python library for creating and selecting the best features in your data set fast! 
`featurewiz` can be used in one or two ways. Both are explained below. -## 1. Feature Engineering +### 1. Feature Engineering <p>The first step is not absolutely necessary but it can be used to create new features that may or may not be helpful (be careful with automated feature engineering tools!).<p> 1. <b>Performing Feature Engineering</b>: One of the gaps in open source AutoML tools and especially Auto_ViML has been the lack of feature engineering capabilities that high powered competitions such as Kaggle required. The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables was difficult and sifting through those hundreds of new features to find best features was difficult and left only to "experts" or "professionals". featurewiz was created to help you in this endeavor.<br> <p>featurewiz now enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that, using the next step, featurewiz will sift through numerous such variables and find only the least correlated and most relevant features to your model. All in one step!.<br> -You must use this syntax for feature engg. Otherwise, featurewiz will give an error: - -``` -import featurewiz as FW -outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', - header=0, test_data='',feature_engg='', category_encoders='', - dask_xgboost_flag=False, nrows=None) -``` -  -## 2. Feature Selection +### 2. Feature Selection <p>The second step is Feature Selection. `featurewiz` uses the MRMR (Minimum Redundancy Maximum Relevance) algorithm as the basis for its feature selection. <br> <b> Why do Feature Selection</b>? Once you have created 100's of new features, you still have three questions left to answer: 1. How do we interpret those newly created features? @@ -452,65 +336,38 @@ All are very important questions and featurewiz answers them by using the SULOV <p><b>SULOV</b>: SULOV stands for `Searching for Uncorrelated List of Variables`. The SULOV algorithm is based on the Minimum-Redundancy-Maximum-Relevance (MRMR) <a href="https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b">algorithm explained in this article</a> as one of the best feature selection methods. To understand how MRMR works and how it is different from `Boruta` and other feature selection methods, see the chart below. Here "Minimal Optimal" refers to the MRMR and featurewiz kind of algorithms while "all-relevant" refers to Boruta kind of algorithms.<br>  -<br> -The working of the SULOV algorithm is as follows: + +## Working +`featurewiz` performs feature selection in 2 steps. Each step is explained below. +<b>The working of the `SULOV` algorithm</b> is as follows: <ol> -<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)). -<li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method. So its suitable for all kinds of variables and target. -<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score. -<li>What’s left is the ones with the highest Information scores and least correlation with each other. 
+<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).</li> +<li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method. So its suitable for all kinds of variables and target.</li> +<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score.</li> +<li>What’s left is the ones with the highest Information scores and least correlation with each other.</li> </ol>  -<b>Recursive XGBoost</b>: Once SULOV has selected variables that have high mutual information scores with least less correlation amongst them, we use XGBoost to repeatedly find best features among the remaining variables after SULOV. The Recursive XGBoost method is explained in this chart below. -Here is how it works: +<b>The working of the Recursive XGBoost</b> is as follows: +Once SULOV has selected variables that have high mutual information scores with least less correlation amongst them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables after SULOV. <ol> -<li>Select all variables in data set and the full data split into train and valid sets. -<li>Find top X features (could be 10) on train using valid for early stopping (to prevent over-fitting) -<li>Then take next set of vars and find top X -<li>Do this 5 times. Combine all selected features and de-duplicate them. +<li>Select all variables in data set and the full data split into train and valid sets.</li> +<li>Find top X features (could be 10) on train using valid for early stopping (to prevent over-fitting)</li> +<li>Then take next set of vars and find top X</li> +<li>Do this 5 times. Combine all selected features and de-duplicate them.</li> </ol>  -<b>Building the simplest and most "interpretable" model</b>: featurewiz represents the "next best" step you must perform after doing feature engineering since you might have added some highly correlated or even useless features when you use automated feature engineering. featurewiz ensures you have the least number of features needed to build a high performing or equivalent model. - -<b>A WORD OF CAUTION:</b> Just because you can engineer new features, doesn't mean you should always create tons of new features. You must make sure you understand what the new features stand for before you attempt to build a model with these (sometimes useless) features. featurewiz displays the SULOV chart which can show you how the 100's of newly created variables added to your dataset are highly correlated to each other and were removed. This will help you understand how feature selection works in featurewiz. - -## Table of Contents -<ul> -<li><a href="#background">Background</a></li> -<li><a href="#install">Install</a></li> -<li><a href="#usage">Usage</a></li> -<li><a href="#api">API</a></li> -<li><a href="#maintainers">Maintainers</a></li> -<li><a href="#contributing">Contributing</a></li> -<li><a href="#license">License</a></li> -</ul> - -## Background - - - -To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)<br> - -<p>featurewiz was designed for selecting High Performance variables with the fewest steps. - -In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. 
Your experience may vary).<br> -<p> -featurewiz is every Data Scientist's feature wizard that will:<ol> -<li><b>Automatically pre-process data</b>: you can send in your entire dataframe "as is" and featurewiz will classify and change/label encode categorical variables changes to help XGBoost processing. It classifies variables as numeric or categorical or NLP or date-time variables automatically so it can use them correctly to model.<br> -<li><b>Perform feature engineering automatically</b>: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option. -<li><b>Perform feature reduction automatically</b>. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!<br> -<li><b>Explain SULOV method graphically </b> using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph. <br> -<li><b>Build a fast LightGBM model </b> using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.<br> -</ol> - -<b>*** Notes of Gratitude ***</b>:<br> +## Tips +Here are some additional tips for ML engineers and data scientists when using featurewiz: <ol> -<li><b>Alex Lekov</b> (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).</li> -<li><b>Category Encoders</b> library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html </li> +<li><b>Always cross-validate your results</b>: When you use a feature selection tool, it is important to cross-validate your results. This means that you should split your data into a training set and a test set. Use the training set to select features, and then evaluate your model on the test set. This will help you to ensure that your model is not overfitting to the training data.</li> +<li><b>Use multiple feature selection tools</b>: It is a good idea to use multiple feature selection tools and compare the results. This will help you to get a better understanding of which features are most important for your data.</li> +<li><b>Don't forget to engineer new features</b>: Feature selection is only one part of the process of building a good machine learning model. You should also spend time engineering your features to make them as informative as possible. 
This can involve things like creating new features, transforming existing features, and removing irrelevant features.</li> +<li><b>Don't overfit your model</b>: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.</li> +<li><b>Start with a small number of features</b>: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.</li> </ol> ## Install @@ -520,21 +377,6 @@ featurewiz is every Data Scientist's feature wizard that will:<ol> <li><b>featurewiz is built using xgboost, dask, numpy, pandas and matplotlib</b>. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "dask", "XGBoost" and "networkx" library. Optionally, it uses LightGBM for fast modeling, which it installs automatically. </li> <li><b>We use "networkx" library for charts and interpretability</b>. <br>But if you don't have these libraries, featurewiz will install those for you automatically.</li> </ol> -- [Anaconda](https://docs.anaconda.com/anaconda/install/) - -To clone featurewiz, it is better to create a new environment, and install the required dependencies: - -To install from PyPi: - -``` -conda create -n <your_env_name> python=3.7 anaconda -conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>` -pip install featurewiz --ignore-installed --no-deps -pip install lazytransform -or -pip install git+https://github.com/AutoViML/featurewiz.git -``` - To install from source: ``` @@ -547,29 +389,52 @@ cd featurewiz pip install -r requirements.txt ``` +## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps! +<a href="updates.md">Check out more latest updates from this page</a><br> +As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:<br> + +``` + conda install -c conda-forge featurewiz +``` + +### If the above conda install fails, you can try installing featurewiz this way: +##Step 1: Install featurewiz first<br> + +``` + !pip install featurewiz --ignore-installed --no-deps + !pip install xlrd --ignore-installed --no-deps +``` + +##Step 2: Next, install Pillow since Kaggle has an incompatible version. <br> + +``` + !pip install Pillow==9.0.0 +``` + ## Usage -As of Jan 2022, you now invoke featurewiz in two ways for two different goals. For feature selection, you must use the scikit-learn compatible fit and predict transformer syntax such as below. +For feature selection, you must use the newer syntax which is similar to the scikit-learn fit and predict transformer syntax below. 
``` from featurewiz import FeatureWiz -features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2) -X_train_selected = features.fit_transform(X_train, y_train) -X_test_selected = features.transform(X_test) -features.features ### provides the list of selected features ### +fwiz = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2) +X_train_selected = fwiz.fit_transform(X_train, y_train) +X_test_selected = fwiz.transform(X_test) +### get list of selected features ### +fwiz.features ``` Alternatively, you can use featurewiz for feature engineering using this older syntax. Otherwise, it will give an error. If you want to combine feature engg and then feature selection, you must use this older syntax: ``` -import featurewiz as FW -outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', +import featurewiz as fwiz +outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', header=0, test_data='',feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None) ``` `outputs`: There will always be multiple objects in output. The objects in that tuple can vary: -1. "features" and "train": It be a list (of selected features) and one dataframe (if you sent in train only) +1. "features" and "trainm": It be a list (of selected features) and one dataframe (if you sent in train only) 2. "trainm" and "testm": It can be two dataframes when you send in both test and train but with selected features. <ol> <li>Both the selected features and dataframes are ready for you to now to do further modeling. @@ -617,7 +482,8 @@ outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose= The mean target value (regardless of the feature value). - `dask_xgboost_flag`: Default is False. Set to True to use dask_xgboost estimator. You can turn it off if it gives an error. Then it will use pandas and regular xgboost to do the job. - `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask. -**Return values** + +**Output values** - `outputs`: Output is always a tuple. We can call our outputs in that tuple: out1 and out2. - `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get: - 1. `features`: It will be a list (of selected features) and @@ -626,6 +492,28 @@ outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose= - 1. `trainm`: a modified train dataframe with engineered and selected features from dataname and - 2. `testm`: a modified test dataframe with engineered and selected features from test_data. +## Additional + + + +To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)<br> +<p>featurewiz was designed for selecting High Performance variables with the fewest steps. +In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).<br> +<p> +featurewiz is every Data Scientist's feature wizard that will:<ol> +<li><b>Automatically pre-process data</b>: you can send in your entire dataframe "as is" and featurewiz will classify and change/label encode categorical variables changes to help XGBoost processing. 
It classifies variables as numeric or categorical or NLP or date-time variables automatically so it can use them correctly to model.<br> +<li><b>Perform feature engineering automatically</b>: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option. +<li><b>Perform feature reduction automatically</b>. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!<br> +<li><b>Explain SULOV method graphically </b> using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph. <br> +<li><b>Build a fast XGBoost or LightGBM model using the features selected by featurewiz</b>. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.<br> +</ol> + +<b>*** A Note of Gratitude ***</b>:<br> +<ol> +<li><b>Alex Lekov</b> (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).</li> +<li><b>Category Encoders</b> library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html </li> +</ol> + ## Maintainers * [@AutoViML](https://github.com/AutoViML) @@ -651,106 +539,46 @@ Summary: Development documents and examples for featurewiz Provides: python3-featurewiz-doc %description help # featurewiz - - -<p> - -## Update (October 2022): FeatureWiz 2.0 is here. -<ol> -<li><b>featurewiz 2.0 is here. You have two small performance improvements:</li> </b> -1. SULOV method now has a higher correlation limit of 0.90 as default. This means fewer variables are removed and hence more vars are selected. You can always set it back to the old limit by setting `corr_limit`=0.70 if you want.<br> -2. Recursive XGBoost algorithm is tighter in that it selects fewer features in each iteration. To see how many it selects, set `verbose` flag to 1. <br> -The net effect is that the same number of features are selected but they are better at producing more accurate models. Try it out and let us know. </ol> - -## Update (September 2022): You can now skip SULOV method using skip_sulov flag -<ol> -<li>featurewiz now has a new input: `skip_sulov` flag is here. 
You can set it to `True` to skip the SULOV method if needed.</li> -</ol> - -## Update (August 2022): Silent mode with verbose=0 +`featurewiz` is a powerful feature selection library that has a number of features that make it stand out from the competition, including: <ol> -<li><b>featurewiz now has a "silent" mode which you can set using the "verbose=0" option.</b> It will run silently with no charts or graphs and very minimal verbose output. Hope this helps!<br></li> +<li>It provides one of the best automatic feature selection algorithms (Minimum Redundancy Maximum Relevance (MRMR)) described by wikipedia as: <a href="https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection">"The MRMR selection has been found to be more powerful than the maximum relevance feature selection"</a> such as Boruta.</li> +<li>It selects the best number of un-correlated features that have maximum mutual information about the target without having to specify the number of features</li> +<li>It is fast and easy to use, and comes with a number of helpful features, such as a built-in categorical-to-numeric encoder and a powerful feature engineering module</li> +<li>It is well-documented, and it comes with a number of <a href="https://github.com/AutoViML/featurewiz/tree/main/examples">examples</a>.</li> +<li>It is actively maintained, and it is regularly updated with new features and bug fixes.</li> </ol> -## Update (May 2022) -<ol> -<li><b>featurewiz as of version 0.1.50 or higher has multiple high performance models</b> that you can use to build highly performant models once you have completed feature selection. These models are based on LightGBM and XGBoost and have even Stacking and Blending ensembles. You can find them as functions starting with "simple_" and "complex_" under featurewiz. All the best!<br></li> -</ol> -## Update (March 2022) -<ol> -<li><b>featurewiz as of version 0.1.04 or higher can read `feather-format` files at blazing speeds.</b> See example below on how to convert your CSV files to feather. Then you can feed those '.ftr' files to featurewiz and it will read it 10-100X faster!<br></li> -</ol> - - -<ol> -<li><b>featurewiz now runs at blazing speeds thanks to using GPU's by default.</b> So if you are running a large data set on Colab and/or Kaggle, make sure you turn on the GPU kernels. featurewiz will automatically detect that GPU is turned on and will utilize XGBoost using GPU-hist. That will ensure it will crunch your datasets even faster. I have tested it with a very large data set and it reduced the running time from 52 mins to 1 minute! That's a 98% reduction in running time using GPU compared to CPU!<br></li> -</ol> -## Update (Jan 2022) -<ol> -<li><b>FeatureWiz as of version 0.0.90 or higher is a scikit-learn compatible feature selection transformer.</b> You can perform fit and predict as follows. You will get a Transformer that can select the top variables from your dataset. You can also use it in sklearn pipelines as a Transformer.</li> - -``` -from featurewiz import FeatureWiz -features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', -dask_xgboost_flag=False, nrows=None, verbose=2) -X_train_selected = features.fit_transform(X_train, y_train) -X_test_selected = features.transform(X_test) -features.features ### provides the list of selected features ### -``` - -<li><b>Featurewiz is now upgraded with XGBOOST 1.5.1 for DASK for blazing fast performance</b> even for very large data sets! 
Set `dask_xgboost_flag = True` to run dask + xgboost.</li> -<li><b>Featurewiz now runs with a default setting of `nrows=None`.</b> This means it will run using all rows. But if you want it to run faster, then you can change `nrows` to 1000 or whatever, so it will sample that many rows and run.</li> -<li><b>Featurewiz has lots of new fast model builder functions:</b> that you can use to build highly performant models with the features selected by featurewiz. They are:<br> -1. <b>simple_LightGBM_model()</b> - simple regression and classification with one target label<br> -2. <b>simple_XGBoost_model()</b> - simple regression and classification with one target label<br> -3. <b>complex_LightGBM_model()</b> - more complex multi-label and multi-class models<br> -4. <b>complex_XGBoost_model()</b> - more complex multi-label and multi-class models<br> -5. <b>Stacking_Classifier()</b>: Stacking model that can handle multi-label, multi-class problems<br> -6. <b>Stacking_Regressor()</b>: Stacking model that can handle multi-label, regression problems<br> -7. <b>Blending_Regressor()</b>: Blending model that can handle multi-label, regression problems<br></li> -</ol> - -## Good News! -As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:<br> - -``` - conda install -c conda-forge featurewiz -``` - -### If the above conda install fails, you can try installing featurewiz this way: -##Step 1: Install featurewiz first<br> - -``` - !pip install featurewiz --ignore-installed --no-deps - !pip install xlrd --ignore-installed --no-deps -``` +If you are looking for a single feature selection library, we would definitely recommend checking out featurewiz. It is a powerful tool that can help you to improve the performance of your machine learning models. -##Step 2: Next, install Pillow since Kaggle has an incompatible version. <br> +# Table of Contents +<ul> +<li><a href="#introduction">What is featurewiz</a></li> +<li><a href="#working">How it works</a></li> +<li><a href="#tips">Tips for using featurewiz</a></li> +<li><a href="#install">How to install featurewiz</a></li> +<li><a href="#usage">Usage</a></li> +<li><a href="#api">API</a></li> +<li><a href="#additional">Additional Tips</a></li> +<li><a href="#maintainers">Maintainers</a></li> +<li><a href="#contributing">Contributing</a></li> +<li><a href="#license">License</a></li> +<li><a href="#disclaimer">Disclaimer</a></li> +</ul> +<p> -``` - !pip install Pillow==9.0.0 -``` + -## What is featurewiz? +## Introduction `featurewiz` a new python library for creating and selecting the best features in your data set fast! `featurewiz` can be used in one or two ways. Both are explained below. -## 1. Feature Engineering +### 1. Feature Engineering <p>The first step is not absolutely necessary but it can be used to create new features that may or may not be helpful (be careful with automated feature engineering tools!).<p> 1. <b>Performing Feature Engineering</b>: One of the gaps in open source AutoML tools and especially Auto_ViML has been the lack of feature engineering capabilities that high powered competitions such as Kaggle required. The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables was difficult and sifting through those hundreds of new features to find best features was difficult and left only to "experts" or "professionals". 
featurewiz was created to help you in this endeavor.<br>
<p>featurewiz now enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that, using the next step, featurewiz will sift through numerous such variables and find only the least correlated and most relevant features to your model. All in one step!<br>
-You must use this syntax for feature engg. Otherwise, featurewiz will give an error:
-
-```
-import featurewiz as FW
-outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',',
-        header=0, test_data='',feature_engg='', category_encoders='',
-        dask_xgboost_flag=False, nrows=None)
-```
-
-## 2. Feature Selection
+### 2. Feature Selection
<p>The second step is Feature Selection. `featurewiz` uses the MRMR (Minimum Redundancy Maximum Relevance) algorithm as the basis for its feature selection. <br>
<b>Why do Feature Selection?</b> Once you have created 100's of new features, you still have three questions left to answer:
1. How do we interpret those newly created features?
@@ -761,65 +589,38 @@ All are very important questions and featurewiz answers them by using the SULOV
<p><b>SULOV</b>: SULOV stands for `Searching for Uncorrelated List of Variables`. The SULOV algorithm is based on the Minimum-Redundancy-Maximum-Relevance (MRMR) <a href="https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b">algorithm explained in this article</a> as one of the best feature selection methods. To understand how MRMR works and how it is different from `Boruta` and other feature selection methods, see the chart below. Here "Minimal Optimal" refers to the MRMR and featurewiz kind of algorithms, while "all-relevant" refers to Boruta kind of algorithms.<br>

-<br>
-The working of the SULOV algorithm is as follows:
+
+## Working
+`featurewiz` performs feature selection in 2 steps. Each step is explained below.
+<b>The working of the `SULOV` algorithm</b> is as follows (an illustrative sketch follows this list):
<ol>
-<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).
-<li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method. So its suitable for all kinds of variables and target.
-<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score.
-<li>What’s left is the ones with the highest Information scores and least correlation with each other.
+<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).</li>
+<li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.</li>
+<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score.</li>
+<li>What's left are the ones with the highest Information scores and the least correlation with each other.</li>
</ol>

-<b>Recursive XGBoost</b>: Once SULOV has selected variables that have high mutual information scores with least less correlation amongst them, we use XGBoost to repeatedly find best features among the remaining variables after SULOV. The Recursive XGBoost method is explained in this chart below.
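Before picking the Recursive XGBoost description back up, here is the promised illustrative sketch of the SULOV idea in plain pandas and scikit-learn. It only mirrors the four steps listed above and is not featurewiz's internal code; the breast-cancer demo dataset, the 0.7 threshold and the `mutual_info_classif` scorer are assumptions made for the example.

```
# Illustrative sketch of the SULOV idea (not featurewiz's internal code):
# for every highly correlated feature pair, drop the one with the lower
# Mutual Information Score (MIS) against the target.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

corr_limit = 0.7                      # assumed threshold (corr_limit in featurewiz)
corr = X.corr().abs()                 # absolute pairwise correlations
mis = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

removed = set()
cols = list(X.columns)
for i, left in enumerate(cols):
    for right in cols[i + 1:]:
        if corr.loc[left, right] > corr_limit:
            # knock off the member of the pair with the lower MIS score
            removed.add(left if mis[left] < mis[right] else right)

selected = [c for c in cols if c not in removed]
print(f"kept {len(selected)} of {len(cols)} features:", selected)
```

In practice you never run this by hand: featurewiz applies SULOV for you, and with `verbose=2` it also draws the correlation/MIS network chart mentioned earlier.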
-Here is how it works: +<b>The working of the Recursive XGBoost</b> is as follows: +Once SULOV has selected variables that have high mutual information scores with least less correlation amongst them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables after SULOV. <ol> -<li>Select all variables in data set and the full data split into train and valid sets. -<li>Find top X features (could be 10) on train using valid for early stopping (to prevent over-fitting) -<li>Then take next set of vars and find top X -<li>Do this 5 times. Combine all selected features and de-duplicate them. +<li>Select all variables in data set and the full data split into train and valid sets.</li> +<li>Find top X features (could be 10) on train using valid for early stopping (to prevent over-fitting)</li> +<li>Then take next set of vars and find top X</li> +<li>Do this 5 times. Combine all selected features and de-duplicate them.</li> </ol>  -<b>Building the simplest and most "interpretable" model</b>: featurewiz represents the "next best" step you must perform after doing feature engineering since you might have added some highly correlated or even useless features when you use automated feature engineering. featurewiz ensures you have the least number of features needed to build a high performing or equivalent model. - -<b>A WORD OF CAUTION:</b> Just because you can engineer new features, doesn't mean you should always create tons of new features. You must make sure you understand what the new features stand for before you attempt to build a model with these (sometimes useless) features. featurewiz displays the SULOV chart which can show you how the 100's of newly created variables added to your dataset are highly correlated to each other and were removed. This will help you understand how feature selection works in featurewiz. - -## Table of Contents -<ul> -<li><a href="#background">Background</a></li> -<li><a href="#install">Install</a></li> -<li><a href="#usage">Usage</a></li> -<li><a href="#api">API</a></li> -<li><a href="#maintainers">Maintainers</a></li> -<li><a href="#contributing">Contributing</a></li> -<li><a href="#license">License</a></li> -</ul> - -## Background - - - -To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)<br> - -<p>featurewiz was designed for selecting High Performance variables with the fewest steps. - -In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).<br> -<p> -featurewiz is every Data Scientist's feature wizard that will:<ol> -<li><b>Automatically pre-process data</b>: you can send in your entire dataframe "as is" and featurewiz will classify and change/label encode categorical variables changes to help XGBoost processing. It classifies variables as numeric or categorical or NLP or date-time variables automatically so it can use them correctly to model.<br> -<li><b>Perform feature engineering automatically</b>: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. 
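As a companion to the Recursive XGBoost recipe listed under "Working" above, here is a rough sketch of that loop using the plain `xgboost` package. It only illustrates the recipe as described (split the columns into a few chunks, keep the top features of each chunk by importance, then combine and de-duplicate); the chunk count, the `top_x` cutoff and the use of `feature_importances_` are assumptions for the example, not featurewiz's exact implementation.

```
# Rough sketch of the "Recursive XGBoost" recipe described above
# (illustrative only; featurewiz's real implementation differs in details).
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

top_x = 10       # "top X features" per round (assumed)
n_rounds = 5     # "do this 5 times": read here as 5 chunks of columns (assumed)
selected = []

for chunk in np.array_split(np.array(X.columns, dtype=object), n_rounds):
    cols = list(chunk)
    model = xgb.XGBClassifier(n_estimators=100, eval_metric="logloss")
    # featurewiz also uses the validation set for early stopping at this point;
    # it is omitted here so the sketch runs unchanged on any recent xgboost version.
    model.fit(X_train[cols], y_train, eval_set=[(X_valid[cols], y_valid)], verbose=False)
    importances = pd.Series(model.feature_importances_, index=cols)
    selected.extend(importances.sort_values(ascending=False).head(top_x).index)

# combine the features chosen in each round and de-duplicate them
selected = list(dict.fromkeys(selected))
print(selected)
```

featurewiz runs this kind of loop for you on the SULOV survivors, so you normally only see the final, de-duplicated feature list.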
However, be careful with this option. You can very easily create a monster with this option. -<li><b>Perform feature reduction automatically</b>. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!<br> -<li><b>Explain SULOV method graphically </b> using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph. <br> -<li><b>Build a fast LightGBM model </b> using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.<br> -</ol> - -<b>*** Notes of Gratitude ***</b>:<br> +## Tips +Here are some additional tips for ML engineers and data scientists when using featurewiz: <ol> -<li><b>Alex Lekov</b> (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).</li> -<li><b>Category Encoders</b> library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html </li> +<li><b>Always cross-validate your results</b>: When you use a feature selection tool, it is important to cross-validate your results. This means that you should split your data into a training set and a test set. Use the training set to select features, and then evaluate your model on the test set. This will help you to ensure that your model is not overfitting to the training data.</li> +<li><b>Use multiple feature selection tools</b>: It is a good idea to use multiple feature selection tools and compare the results. This will help you to get a better understanding of which features are most important for your data.</li> +<li><b>Don't forget to engineer new features</b>: Feature selection is only one part of the process of building a good machine learning model. You should also spend time engineering your features to make them as informative as possible. This can involve things like creating new features, transforming existing features, and removing irrelevant features.</li> +<li><b>Don't overfit your model</b>: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.</li> +<li><b>Start with a small number of features</b>: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.</li> </ol> ## Install @@ -829,21 +630,6 @@ featurewiz is every Data Scientist's feature wizard that will:<ol> <li><b>featurewiz is built using xgboost, dask, numpy, pandas and matplotlib</b>. It should run on most Python 3 Anaconda installations. 
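The first tip above (always cross-validate your results) is worth a concrete example. The sketch below keeps feature selection strictly on the training split and scores a model on the untouched test split, using the FeatureWiz transformer exactly as shown in the Usage section further down; the breast-cancer demo data and the RandomForest estimator are placeholders for your own data and model.

```
# Minimal cross-validation sketch: select features on the training split only,
# then evaluate on the held-out split that featurewiz never saw.
from featurewiz import FeatureWiz
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

fwiz = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='',
                  dask_xgboost_flag=False, nrows=None, verbose=0)
X_train_selected = fwiz.fit_transform(X_train, y_train)   # selection sees train only
X_test_selected = fwiz.transform(X_test)                  # test is only transformed

clf = RandomForestClassifier(random_state=0).fit(X_train_selected, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test_selected)))
```

If the held-out score drops sharply compared to the training score, revisit the engineered features before blaming the selection step.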
You won't have to import any special libraries other than "dask", "XGBoost" and "networkx" library. Optionally, it uses LightGBM for fast modeling, which it installs automatically. </li> <li><b>We use "networkx" library for charts and interpretability</b>. <br>But if you don't have these libraries, featurewiz will install those for you automatically.</li> </ol> -- [Anaconda](https://docs.anaconda.com/anaconda/install/) - -To clone featurewiz, it is better to create a new environment, and install the required dependencies: - -To install from PyPi: - -``` -conda create -n <your_env_name> python=3.7 anaconda -conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>` -pip install featurewiz --ignore-installed --no-deps -pip install lazytransform -or -pip install git+https://github.com/AutoViML/featurewiz.git -``` - To install from source: ``` @@ -856,29 +642,52 @@ cd featurewiz pip install -r requirements.txt ``` +## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps! +<a href="updates.md">Check out more latest updates from this page</a><br> +As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:<br> + +``` + conda install -c conda-forge featurewiz +``` + +### If the above conda install fails, you can try installing featurewiz this way: +##Step 1: Install featurewiz first<br> + +``` + !pip install featurewiz --ignore-installed --no-deps + !pip install xlrd --ignore-installed --no-deps +``` + +##Step 2: Next, install Pillow since Kaggle has an incompatible version. <br> + +``` + !pip install Pillow==9.0.0 +``` + ## Usage -As of Jan 2022, you now invoke featurewiz in two ways for two different goals. For feature selection, you must use the scikit-learn compatible fit and predict transformer syntax such as below. +For feature selection, you must use the newer syntax which is similar to the scikit-learn fit and predict transformer syntax below. ``` from featurewiz import FeatureWiz -features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2) -X_train_selected = features.fit_transform(X_train, y_train) -X_test_selected = features.transform(X_test) -features.features ### provides the list of selected features ### +fwiz = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2) +X_train_selected = fwiz.fit_transform(X_train, y_train) +X_test_selected = fwiz.transform(X_test) +### get list of selected features ### +fwiz.features ``` Alternatively, you can use featurewiz for feature engineering using this older syntax. Otherwise, it will give an error. If you want to combine feature engg and then feature selection, you must use this older syntax: ``` -import featurewiz as FW -outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', +import featurewiz as fwiz +outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', header=0, test_data='',feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None) ``` `outputs`: There will always be multiple objects in output. The objects in that tuple can vary: -1. "features" and "train": It be a list (of selected features) and one dataframe (if you sent in train only) +1. "features" and "trainm": It be a list (of selected features) and one dataframe (if you sent in train only) 2. 
"trainm" and "testm": It can be two dataframes when you send in both test and train but with selected features. <ol> <li>Both the selected features and dataframes are ready for you to now to do further modeling. @@ -926,7 +735,8 @@ outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose= The mean target value (regardless of the feature value). - `dask_xgboost_flag`: Default is False. Set to True to use dask_xgboost estimator. You can turn it off if it gives an error. Then it will use pandas and regular xgboost to do the job. - `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask. -**Return values** + +**Output values** - `outputs`: Output is always a tuple. We can call our outputs in that tuple: out1 and out2. - `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get: - 1. `features`: It will be a list (of selected features) and @@ -935,6 +745,28 @@ outputs = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose= - 1. `trainm`: a modified train dataframe with engineered and selected features from dataname and - 2. `testm`: a modified test dataframe with engineered and selected features from test_data. +## Additional + + + +To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)<br> +<p>featurewiz was designed for selecting High Performance variables with the fewest steps. +In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).<br> +<p> +featurewiz is every Data Scientist's feature wizard that will:<ol> +<li><b>Automatically pre-process data</b>: you can send in your entire dataframe "as is" and featurewiz will classify and change/label encode categorical variables changes to help XGBoost processing. It classifies variables as numeric or categorical or NLP or date-time variables automatically so it can use them correctly to model.<br> +<li><b>Perform feature engineering automatically</b>: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option. +<li><b>Perform feature reduction automatically</b>. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!<br> +<li><b>Explain SULOV method graphically </b> using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph. 
<br> +<li><b>Build a fast XGBoost or LightGBM model using the features selected by featurewiz</b>. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.<br> +</ol> + +<b>*** A Note of Gratitude ***</b>:<br> +<ol> +<li><b>Alex Lekov</b> (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).</li> +<li><b>Category Encoders</b> library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html </li> +</ol> + ## Maintainers * [@AutoViML](https://github.com/AutoViML) @@ -956,7 +788,7 @@ This project is not an official Google project. It is not supported by Google an %prep -%autosetup -n featurewiz-0.2.6 +%autosetup -n featurewiz-0.2.8 %build %py3_build @@ -996,5 +828,5 @@ mv %{buildroot}/doclist.lst . %{_docdir}/* %changelog -* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.2.6-1 +* Sun Apr 23 2023 Python_Bot <Python_Bot@openeuler.org> - 0.2.8-1 - Package Spec generated @@ -1 +1 @@ -d4697aa8e2a48e0858cdb06da6488f4b featurewiz-0.2.6.tar.gz +6ef04ee14d53fbfcc8bd98456b3e1341 featurewiz-0.2.8.tar.gz |