Machine learning is the scientific study of algorithms and statistical models that perform a specific task effectively without explicit instructions. Machine learning algorithms fall into two broad families: supervised and unsupervised.
In supervised learning, the target is already known and is used to train the model.
- Classification: When target variable is categorical
- Regression: When target variable is continuous
In unsupervised learning, the target is not known and the patterns must be discovered by the model.
- Clustering: Customer segmentation
- Association: Market basket analysis
The guide has three sections:
- Feature engineering
- Feature selection
- Hyper-parameter optimization
| Feature Engineering | Feature Selection | Hyper-parameter Optimization |
| --- | --- | --- |
| Variable types; Variable characteristics; Variable transformation; Treating categorical variables; Feature scaling; Discretization or binning; Missing data imputation; Outlier treatment | Filter method; Wrapper method; Embedded method | Challenges; Search algorithms; Cross-validation |
Feature engineering is the process of using domain knowledge of the data to create features or variables to use in machine learning. The following topics are covered in this section:
- Types of variables
- Variable characteristics
- Variable transformation
- Treating categorical variables
- Feature scaling
- Discretization or binning
- Missing value imputation
- Outlier treatment
- Variable types
- Numerical variable: can be discrete or continuous. Discrete variable takes only whole numbers. Continuous variable takes any value within some range.
- Categorical variable: can be ordinal or nominal. Ordinal variable takes categories that can be meaningfully ordered. Nominal variable takes labels that have no intrinsic order.
- Mixed variables: contain numbers or labels, either across different observations or within the same observation
- Variable characteristics
- Cardinality: the number of distinct labels in a variable. As cardinality increases, the chance of over-fitting also increases.
- Skewed distribution: one tail is longer than the other. For skewed distributions, the median is a better imputation value than the mean.
- Magnitude: impacts the regression coefficients; features with larger magnitudes dominate those with smaller magnitudes. Feature scaling helps bring all features into the same range.
- Missing data: occurs when no value is stored for an observation in a variable. Can have a significant impact on the model.
- Outliers: an outlier is a data point significantly different from the remaining data. Depending on the context, outliers either deserve special attention or should be ignored entirely.
- Variable transformation
If the distribution of a variable is skewed, transformations can be applied to bring it closer to a normal distribution.
- Logarithmic (X>0)
- Exponential (can overflow for large values of X)
- Reciprocal (X ≠ 0)
- Box-Cox (X>0)
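As a rough sketch, the transformations above can be applied with NumPy and SciPy; the skewed variable here is simulated purely for illustration:

```python
import numpy as np
from scipy import stats

# Simulated right-skewed variable (all values > 0), for illustration only.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_x = np.log(x)                 # logarithmic: requires x > 0
recip_x = 1.0 / x                 # reciprocal: requires x != 0
boxcox_x, lam = stats.boxcox(x)   # Box-Cox: requires x > 0, lambda is fitted

# The transformed variable should be far less skewed than the original.
print(stats.skew(x), stats.skew(log_x))
```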
- Treating categorical variables
Machine learning algorithms work only with numerical variables. Hence, categories are replaced with numerical representations so that machine learning models can use these variables.
- One hot encoding: encodes each category as a Boolean (dummy) variable; for K categories, K-1 dummies are created. One hot encoding of top categories considers only the most frequent categories.
- Ordinal encoding: replaces the categories with integers from 0 to K-1, assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models.
- Count or frequency encoding: categories are replaced by the count or percentage of observations in that category. Captures the representation of each label.
- Target guided encoding: helps obtain a monotonic relationship between the variable and the target. Categories are replaced with integers from 1 to K, where K is the number of distinct categories in the variable, and the ordering is informed by the mean of the target for each category. Probability ratio encoding replaces each category by the odds ratio or weight of evidence.
- Mean encoding: replaces each category with the average of the target values for that category.
- Rare label encoding: rare labels are those that appear in only a tiny proportion of the observations in the dataset. These labels are grouped together into a single label.
- Binary encoding: categories are first mapped to integers, then represented by their binary digits across several columns. Compact, but the resulting columns lack human-readable meaning.
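A minimal sketch of a few of these encodings using pandas; the toy city/target data is invented for illustration:

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target.
df = pd.DataFrame({"city": ["NY", "SF", "NY", "LA", "SF", "NY"],
                   "target": [1, 0, 1, 0, 1, 1]})

# One-hot encoding: drop_first=True yields K-1 dummies for K categories.
dummies = pd.get_dummies(df["city"], drop_first=True)

# Count / frequency encoding: replace each label with its relative frequency.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Mean (target) encoding: replace each label with the mean target for it.
means = df.groupby("city")["target"].mean()
df["city_mean"] = df["city"].map(means)
```

In practice, target and mean encodings should be fit on the training set only and then applied to the test set, to avoid leakage.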
Feature scaling is the method used to normalize the range of values. This is done to bring all the variables at the same scale.
- Standardization [Z = (x-u)/s]: It preserves the shape of the variable with mean = 0 and standard deviation = 1. It preserves outliers.
- Mean normalization [Z = (x-mean) / (max-min)]: Rescales the range of the variable with mean = 0. It may alter the shape of the variable.
- Min max scaler [Z = (x-min) / (max-min)]: rescales the variable to the range [0, 1], so all values are non-negative. It is sensitive to outliers.
- Maximum absolute scaler [Z = x / max|x|]: rescales the range of the variable. The mean is not centered at 0 and the variance is not scaled.
- Scaling to median and IQR [Z = (x − median) / (Q3 − Q1)]: centers the median at zero and is robust to outliers.
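These scalers are all available in scikit-learn; a small sketch (the outlier value 100 is invented to show the difference):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

z_std = StandardScaler().fit_transform(X)     # mean 0, std 1
z_minmax = MinMaxScaler().fit_transform(X)    # range [0, 1]
z_robust = RobustScaler().fit_transform(X)    # (x - median) / IQR
```

Note how the single outlier squeezes the min-max-scaled inliers toward 0, while the robust scaler keeps them spread out.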
Discretization is the process of transforming continuous variables into discrete ones by creating a set of contiguous intervals. It is also called binning.
- Equal width: divides the variable range into K bins of the same width. It does not improve the spread of values; the distribution remains the same.
- Equal frequency: divides the variable into K bins containing the same number of observations, so interval boundaries correspond to quantiles. It handles outliers and improves the spread of the variable.
- K-means: applies k-means clustering to the continuous variable and assigns observations to bins according to the cluster centroids.
- Decision trees: uses a decision tree to identify the optimal bins. It produces a discrete variable with a monotonic relationship to the target and handles outliers.
- Note on monotonic relationship: Re-order the intervals so that we get monotonic relationship with target. Monotonic relationship improves performance of machine learning models and creates shallower trees.
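scikit-learn's `KBinsDiscretizer` supports equal-width, equal-frequency, and k-means binning via its `strategy` parameter; a brief sketch on simulated data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Simulated skewed continuous variable, for illustration only.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(500, 1))

# strategy='uniform' -> equal-width bins, 'quantile' -> equal-frequency,
# 'kmeans' -> bin edges derived from 1-D k-means centroids.
eq_freq = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
binned = eq_freq.fit_transform(x)
```

With `strategy="quantile"`, each of the 4 bins ends up with roughly 125 of the 500 observations.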
Missing data imputation is the act of replacing missing data with statistical estimates of the missing values. The goal is to produce a complete dataset that can be used to train machine learning models.
- Complete Case Analysis: listwise deletion, i.e., discarding any observation with a missing value in any variable, so that only observations with complete information are analyzed. Suitable for numerical and categorical variables. Should be used when data is missing at random and no more than 5% of the data is missing.
- Mean or Median Imputation: consists of replacing all occurrences of missing values within a variable with its mean or median. Suitable for numerical variables: if the variable is normally distributed, use the mean; if the distribution is skewed, use the median. Should be used when data is missing at random and the missing observations mostly look like the majority of the data. The mean or median should be calculated on the training set only and then used to replace missing values in both the train and test sets, to avoid over-fitting.
- Arbitrary Value Imputation: consists of replacing missing values with an arbitrary value, e.g., 'missing' for categorical variables and 999 for numerical ones. Should be used when data is not missing at random. Works well with tree-based algorithms but not with linear or logistic regression.
- Frequent Category Imputation: Mode imputation consists of replacing all occurrences of missing values within a variable with mode. Suitable for categorical variable. Should be used when data is missing at random and the missing observations most likely look like majority of observations.
- Missing Category Imputation: Consists of treating missing data as an additional label or category. This is widely used method for categorical variables.
- Random Sample Imputation: consists of taking a random observation from the pool of available values of the variable and using it to fill the missing value. Suitable for both numerical and categorical variables.
- Missing indicator: an additional binary variable indicating whether the value was missing for an observation. Suitable for numerical and categorical variables. It should be used together with methods that assume data is missing at random (mean, median, or mode imputation and random sample imputation): if data is missing at random, it is captured by the mean or median; if not, it is captured by the binary indicator. If more than 5% of the data is missing, it is advisable to add a missing indicator.
- KNN Imputation: Determines missing data points as weighted average of the values of its K nearest neighbors. KNN is trained on other variables, the K nearest neighbors is determined and weighted average is taken to impute the missing value. Suitable when a small percentage of the data is missing.
- MICE (Multiple Imputation by Chained Equations): a series of models in which each variable is modeled conditionally on the other variables in the data. Each incomplete variable is imputed by a separate model.
- MissForest: MICE implemented with random forests. Works well with mixed data types and can handle non-linear relationships.
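Several of these methods have scikit-learn implementations (`IterativeImputer` approximates MICE); a small sketch on a toy array with missing values invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with one missing value per column, for illustration only.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

median_filled = SimpleImputer(strategy="median").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
mice_like = IterativeImputer(random_state=0).fit_transform(X)  # MICE-style
```

As the text notes, each imputer should be fit on the training set and then applied to both train and test data.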
An outlier is a data point that is significantly different from the remaining data. Outliers may impact the performance of linear models; their impact on tree-based algorithms is minimal. Outliers can be identified using the Gaussian distribution (mean ± 3 standard deviations), the interquartile range (Q3 − Q1 rule), or extreme percentiles (e.g., the 1st and 99th percentiles).
- Trimming: remove outliers from the dataset. However, this can remove a large proportion of the data.
- Capping: cap values at a maximum and/or minimum; no data is removed. However, it distorts the variable's distribution.
- Missing data: treat the outliers as missing data and impute them.
- Discretization: place the outliers in the lowest and highest bins.
- Arbitrary capping: cap the minimum and maximum values using domain knowledge of the variable.
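A minimal sketch of IQR-based detection followed by capping and trimming, using an invented toy sample:

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # 95 is an outlier

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

capped = np.clip(x, lower, upper)           # capping: clamp to the bounds
trimmed = x[(x >= lower) & (x <= upper)]    # trimming: drop the outlier
```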
Feature selection is the process of selecting a subset of relevant features for use in machine learning model building. The following topics are covered in this section:
- Filter method
- Wrapper Method
- Embedded Method
- Filter Method
Filter methods rely on the characteristics of data and are model agnostic. They tend to be less computationally expensive and suitable for quick screening.
- Constant features: the same value for all observations. Detect by checking the standard deviation or the count of unique values.
- Quasi-constant features: the same value for most observations. Detect with a variance threshold or the count of distinct values.
- Duplicated features: identical features. Retain only one of each duplicated pair.
- Correlation: It refers to the degree to which a pair of variables is linearly related. Correlated predictor variables provide redundant information. Good feature subset contains features highly correlated with the target, yet uncorrelated to each other.
- Fisher score (chi-square): a statistical test best suited to determining the difference between expected and observed frequencies. The smaller the p-value, the greater the importance.
- Univariate (one-way ANOVA): tests the hypothesis that two or more samples have the same mean. Assumes samples are independent, normally distributed, and have homogeneous variance. Variables with p-value > 0.05 are considered unimportant for predicting Y.
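A quick sketch of two filter methods in scikit-learn on the iris dataset; the 0.2 variance threshold is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Quasi-constant filter: drop features whose variance is below the threshold.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Univariate ANOVA F-test: keep the 2 features most related to the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
```

For categorical features against a categorical target, `chi2` can be used as the `score_func` instead.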
Wrapper methods use a machine learning model to score feature subsets. A new model is trained on each candidate subset, and the search usually finds the best-performing subset.
- Step forward: begin with no features and add one feature at a time (mlxtend)
- Recursive feature addition: if adding the feature improves performance by more than a threshold, keep the feature
- Condition: increase > threshold
- Step backward: begin with all features and remove one feature at a time (mlxtend)
- Recursive feature elimination: if removing the feature decreases performance by less than a threshold, drop the feature
- Condition: decrease < threshold
- Exhaustive: Tries all possible feature combinations
- Stop condition: When the performance does not increase beyond a certain threshold or decrease beyond a certain threshold
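The text mentions mlxtend; scikit-learn offers an equivalent `SequentialFeatureSelector`, sketched here on the iris dataset (the choice of 2 features and logistic regression is illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Step forward: start with no features, add the best one each round.
sfs = SequentialFeatureSelector(model, n_features_to_select=2,
                                direction="forward", cv=3)
X_fwd = sfs.fit_transform(X, y)

# Step backward: start with all features, remove one at a time.
sbs = SequentialFeatureSelector(model, n_features_to_select=2,
                                direction="backward", cv=3)
X_bwd = sbs.fit_transform(X, y)
```

Each candidate subset is scored by cross-validation, which is why wrapper methods are much more expensive than filter methods.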
Embedded methods perform feature selection as part of the model construction process and consider the interaction between the model and the features. They are faster than wrapper methods and more accurate than filter methods.
- Regularization: Consists of adding a penalty to the different parameters of the model to reduce the freedom of the model. Helps to improve generalization ability of the model.
- Lasso (L1): Shrinks some parameters to zero (feature elimination)
- Ridge (L2): As the penalization increases the coefficients approach zero (no feature is eliminated)
- Tree-based importance: build a machine learning model (decision tree, random forest, or gradient boosting) and calculate feature importance. Remove the least important feature and repeat until a stopping condition is met.
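A small sketch of Lasso-based (L1) embedded selection with scikit-learn's `SelectFromModel`; the `alpha` value is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so the penalty is fair

# The L1 penalty shrinks some coefficients exactly to zero (feature elimination).
lasso = Lasso(alpha=10.0).fit(X, y)
selected = SelectFromModel(lasso, prefit=True).transform(X)
```

With a Ridge (L2) penalty instead, coefficients only approach zero, so no feature would be eliminated.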
The objective of the learning algorithm is to find a function that reduces error over a dataset. Hyper-parameters are not directly learned by the model and are important to prevent over-fitting. Hyper-parameters are specified outside the training procedure and they control the flexibility of the model.
The process of finding the best hyper-parameter for a given dataset is called hyper-parameter optimization. The objective is to minimize the generalization error. Generalization is the ability of an algorithm to be effective across various inputs. The search for the best hyper-parameter consists of hyper-parameter space, method of sampling, cross-validation scheme and performance metrics.
The following topics are covered in this section:
- Search algorithms
- Cross validation
It is impossible to define a formula that yields the best hyper-parameters, so different combinations must be evaluated. A few hyper-parameters affect performance greatly, but most have little effect. Hence, it is important to identify the hyper-parameters that impact the performance of the machine learning algorithm and to optimize those.
Below is the list of hyper-parameters that were found to have a huge effect on the performance of respective machine learning algorithms:
| Algorithm | Influential hyper-parameters |
| --- | --- |
| Decision Tree | Max depth: the maximum depth of the tree. Min samples leaf: the minimum number of samples required at a leaf node. |
| Random Forest | N estimators: the number of trees in the forest. Max depth: the maximum depth of the tree. Min samples leaf: the minimum number of samples required at a leaf node. |
| Gradient Boosting | Loss: the loss function to be optimized ('deviance' refers to logistic regression, 'exponential' to the AdaBoost algorithm). N estimators: the number of boosting stages; gradient boosting is fairly robust to over-fitting, so a large number usually gives better performance. Max depth: the maximum depth of the tree. Min samples leaf: the minimum number of samples required at a leaf node. |
| K Nearest Neighbors | N neighbors: the number of neighbors to use. |
| Artificial Neural Network | Activation: activation function for the hidden layers (default relu). Solver: the solver for weight optimization (default adam). Hidden layer sizes: the i-th element is the number of neurons in the i-th hidden layer. |
- Manual search: identify regions of promising hyper-parameters to delimit a grid search. It also helps one become familiar with the hyper-parameters and their effect on the model. However, it lacks reproducibility and does not explore the entire hyper-parameter space.
- Grid search: exhaustive search through a specified subset of a learning algorithm's hyper-parameters; it examines all possible combinations. However, it is computationally expensive, the hyper-parameter values must be specified manually, and it is not ideal for continuous hyper-parameters.
- Random search: hyper-parameter values are drawn at random from specified distributions, and the user chooses how many combinations to examine. Compared with grid search it loses only a little efficiency in low-dimensional spaces, while being far more efficient in high-dimensional ones.
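A brief sketch of grid search versus random search with scikit-learn; the parameter grid is invented for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}

# Grid search: evaluates all 2 x 2 = 4 combinations.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3).fit(X, y)

# Random search: evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, n_iter=3, cv=3,
                          random_state=0).fit(X, y)
```

`best_params_` and `best_score_` on the fitted search objects report the winning combination and its cross-validated score.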
The training set is divided into k folds. The model is trained on k−1 folds and tested on the k-th fold; the process is repeated k times and the final performance is averaged. Cross-validation can be used to select the best hyper-parameters, select the best-performing model, and estimate the generalization error of a given model. Stratified k-fold cross-validation is useful when the dataset is imbalanced: each fold has a similar proportion of observations of each class, and the test sets do not overlap.
The performance of a machine learning model should be consistent across different datasets. When a model performs well on the training set but not on live data, it has over-fit the training data.
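A minimal sketch of stratified k-fold cross-validation with scikit-learn (the choice of model and k=5 is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold preserves the class proportions of the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# One accuracy value per fold; report mean and spread.
print(scores.mean(), scores.std())
```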
- Under-fit: High Bias
- Over-fit: High Variance