Feature Engineering for initial insights

Whether you’re a seasoned Db2 professional or you’re just beginning, keeping up with Db2 metrics is no trivial task, and you will have to become familiar with lots of metrics, parameter names and meaningful numbers if you want to be successful and have a deep understanding of your environment, which is especially true if you’re given the task of doing capacity planning for your subsystems.

 

Now let’s think for a moment, how do you begin with this never ending task? Traces? Batch reports from your Db2 monitoring tools? RMF data? Depending on the type of study, you can model the behavior of your subsystems themselves, batch workloads, trends for online applications, or a specific study for a particular application, and this choice will give you a different subset of parameters that offers hints on your Db2 environment, however, you will have dozens of metrics to choose from. Sounds familiar? Congratulations! You became a data scientist before you knew it, and it won’t come to you as a surprise to hear that for data scientists, mostly 80% of the time is spent on data cleaning, or more formally, Feature Engineering.

 

What is Feature Engineering?

Let’s recall or define quickly, all samples in your data are called observations, while their characteristics or columns (viewed as a table) are called features. So what is Feature Engineering? Your goal might be to show your data in a chart for a presentation, do calculations on it, or you might be considering exploring Machine Learning or applying another statistical approach, and your data will have to be cleaned, transformed or scaled for a model to work on it. For example, if you have a data set that contains multiple features and you want to visualize them to find out if there are visible groups within it, you will have to choose representative features or reduce the number of dimensions within your data. Another example is the fact that not all of your data will be in numbers, and it might be represented in categories so it will require different treatment. Feature Engineering covers this and much more.

 

The data set

This article will use a data set extracted from Kaggle, and it contains sample data from parameters that are significant for an application to a Masters Program (i.e. scores like GRE, TOEFL, and some more), it also contains a target feature called “Chance of Admit”, which represents the chance for admission given the current features (parameters). If you wish to experiment and learn, I strongly recommend browsing through Kaggle data sets. This could be application data stored in one of your databases, or data extracted from your Db2 monitors that you have saved over time, SMF trace data, you name it. The point is to obtain insights from your data for further processing.

 

The framework

For the following examples, I will be using Python 3 using the previous data set in CSV format, however, you can directly extract it from a Db2 table using ODBC or JDBC access through Spark. After that, I will work with the open source libraries Pandas, matplotlib and seaborn to manipulate and show the data. Lastly, I will show a quick example using the open source library for Machine Learning, scikit-learn.

 

Importing the data

 

I’m using the data from a CSV file, so I’m using the direct function from Pandas to read it, and I’m also dropping the first column because it’s just a numbering index from the original data.

df = pd.read_csv('Admission_Predict_Ver1.1.csv')

df = df.drop(['Serial No.'], axis=1)

df.head()

 

 

GRE Score

TOEFL Score

University Rating

SOP

LOR

CGPA

Research

Chance of Admit

0

337

118

4

4.5

4.5

9.65

1

0.92

1

324

107

4

4.0

4.5

8.87

1

0.76

2

316

104

3

3.0

3.5

8.00

1

0.72

3

322

110

3

3.5

2.5

8.67

1

0.80

4

314

103

2

2.0

3.0

8.21

0

0.65

 

Filtering data

At this point we have a data set that contains a target feature (chance of admit in our example), but I can think about saving my Db2 monitoring data, adding an extra column indicating the relative health of my environment and saving it over time (or a simple binary value that means OK or NOT OK to point out a moment of stress that occurred at this moment). I might have a considerable amount of features and I have a new problem, how do I choose which features were relevant for this target feature? Or in my example, how can I tell which parameters are more relevant for a good chance for being admitted? We will need to filter to choose the most relevant features from the data we have.

 

One of the common ways to do so, which is already implemented in Pandas, is to use Pearson correlation. Pearson correlation is a statistical concept that measures the linear correlation between two variables, and it is given by a single number ranging from -1 to 1, where -1 is a total negative correlation, 1 is a total positive correlation and 0 represents no linear correlation. How do you interpret this? If the value is positive, it means that ‘if a variable is high, so is the other’, and if it’s negative, it means that ‘when one variable is high, the other tends to be low’. Going back to our example, the following lines will calculate the Pearson correlation coefficient from each feature with the ‘chance of admit’ feature, and it will be displayed in a heatmap so we can view all results at once.

 

plt.figure(figsize=(12,10))

correlation_matrix = df.corr()

sns.heatmap(correlation_matrix, annot=True, cmap=plt.cm.Reds)

plt.show()

 

How do you read a heatmap? Like any other table, you just cross the a name from a row to one in the columns and find the number. This number will indicate the Pearson correlation coefficient between those two variables. The colors represent the scale, where a darker color shows a higher correlation and light color shows a very low correlation. For this example, we’re only interested in crossing rows against the column ‘Chance of Admit’ since that’s the target feature.

 

Heatmap_Feature_Engineering

 

If we cross the name of the features to the target feature, we will see that the parameter with the lowest correlation coefficient is the feature called ‘Research’, while the feature with the highest correlation is the field called ‘CGPA’ (for this example, this represents the undergraduate GPA score of the applicant). What can I do with this information? If we have a high number of features, we can use this approach to filter out the features that are not significant and then work with a delimited number of features to create a chart or for example, or maybe choose the order of features to use in a Decision Tree for Machine Learning.

 

This is one simple approach, but if your objective is to work with linear regression, you’ll need to work iteratively as independent features are assumed to be uncorrelated to each other. Applying the previous process in an iterative way will take out variables that are highly correlated to each other.

Side note: Linear and non linear correlation

Correlation is a measure between two variables to explain their deviations from the respective means are. If the ratio of change is constant, then we can conclude it is a linear correlation (and it can be shown in a plot by a straight line). On the other hand, a non linear correlation means the ratio of change is not constant and it is commonly represented in a plot by a curved line.

 

However, we live in a real world and we see real examples out there, so you will certainly be dealing with outliers (an observation distant from other observations, i.e. a sample with a highly unusual value for a certain feature) within your data. For this situation the Pearson correlation might not be the right choice for you. The Spearman rank’s correlation coefficient will measure the correlation between two variables in a more robust way as it also works if the correlation between the variables is linear or not. We can explore this alternative with our features with the following code:

def SpearmanCorr(xs, ys):

xs = pd.Series(xs)

ys = pd.Series(ys)

return xs.corr(ys, method='spearman')

 

And now we use this function on our data set

sp_corr={}

columns = list(df.columns)

for i in columns[:-1]:

sp_corr[i] = SpearmanCorr(df[i], df[columns[-1]])

 

This will give us a dictionary that shows the Spearman’s correlation coefficient for each feature:

Feature Spearman’s correlation coefficient
'GRE Score' 0.822201159536554
'TOEFL Score' 0.793634163203685
'University Rating' 0.703742462611185
'SOP' 0.702799393449472
'LOR ' 0.643627140892576
'CGPA' 0.888785659961307
'Research' 0.565715494451781

 

 

For this particular example there’s no significant difference, however, this is a useful technique when handling outliers, which you will have to face for certain.

 

Normalization

Have you noticed that, in my example, the first two features appear to be in different scales with respect for the rest? Many Machine Learning algorithms will unwelcome such data and will expect data to be within the same range, often between 0 and 1, where the maximum possible value will be scaled down to 1 and the rest of the values will be normalized. The open source library, scikit-learn, provides a ready-to-use MinMaxScaler that uses the following formula:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) X.min(axis=0))

X_scaled = X_std * (max - min) + min

 

Applied to our example, you can use the following code, which will result in a numpy array.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(df)

print(scaler.transform(df))

 

[[0.94 0.92857143 0.75 ... 0.91346154 1. 0.92063492]

[0.68 0.53571429 0.75 ... 0.66346154 1. 0.66666667]

[0.52 0.42857143 0.5 ... 0.38461538 1. 0.6031746 ]

...

[0.8 1. 1. ... 0.88461538 1. 0.93650794]

[0.44 0.39285714 0.75 ... 0.5224359 0. 0.61904762]

[0.74 0.75 0.75 ... 0.71794872 0. 0.79365079]]

 

Lasso regularization

Another common approach is Lasso regularization, which belongs to the family of embedded methods. Lasso stands for “least absolute shrinkage and selection operator”, and it’s a method that can locate irrelevant features to penalize them by turning them into zeroes. To show this concept, I’ll switch to another data set already included in the scikit-learn library, which is the famous Boston data set (a data set commonly used to predict house prices and useful if you want to learn about data science). The code to try Lasso regularization is as follows:

from sklearn.datasets import load_boston
from sklearn.linear_model import LassoCV

#We load the data set
x = load_boston()
df = pd.DataFrame(x.data, columns = x.feature_names)
df["MEDV"] = x.target

#Now we remove the target feature MEDV from the original data set
X = df.drop("MEDV",1)

#And create another vector with only the target feature
y = df["MEDV"]

#We can now use Lasso regularization
reg = LassoCV()
reg.fit(X, y)

 

Now we can ask how many features were selected and how many were discarded by the attribute reg.coef_ from the regressor object.

 

To display the names of the features and their weight (which will contain zero in case the feature has been penalized):

 

print( pd.Series(reg.coef_, index = X.columns) )

 

Feature Weight according to Lasso
CRIM -0.074266
ZN 0.049454
INDUS -0.000000
CHAS 0.000000
NOX -0.000000
RM 1.804385
AGE 0.011333
DIS -0.813244
RAD 0.272284
TAX -0.015425
PTRATIO -0.742872
B 0.008926
LSTAT -0.703654

 

The values in zero, whether negative or positive, represent irrelevant features according to our regressor, so they were discarded in the process of evaluation (meaning, the moment we execute the ‘fit’ function).

 

How did this process decide to discard the features shown in zeroes? In a nutshell, LassoCV uses an iterative process to predict the target feature, and in each iteration the regressor takes a different set of features. At the end of the process all prediction scores are compared and Lasso identifies the features that did not significantly contribute to make a prediction.

 

We can now print how many features were selected and how many were discarded in the process.

print('Our example has selected {0} variables'.format(str(sum(reg.coef_ != 0))))

print('Our example also discarded {0} variables'.format(sum(reg.coef_ == 0)))

 

Our example has selected 10 variables

Our example also discarded 3 variables

 

Closing thoughts

 

Feature Regression is an art, and we have barely scratched the surface as we only dealt with numerical data. Dates, strings and categorical data have their own set of approaches and even so, their usefulness will vary depending on your objective and it often requires domain knowledge to be able to discard or even create new features out of your existing features. However, with time and practice, you will be able to show insight from your data and provide more value from it (and more value from your skills too). To close the note, I go back to the curious idea I mentioned at the beginning, if you’re collecting hundreds of Db2 metrics, how do you choose which ones are the ones most meaningful to you? Maybe you’re automatically doing Feature Engineering as part of your daily job, so why not giving it more attention?

 

 

References

Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019

MinMaxScaler. Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html. Retrieved on April 18th, 2019.

LassoCV. Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html. Retrieved on April 18th, 2019.

Downey, A. (2011). Think stats. Sebastopol, CA: O'Reilly.

 

3 Likes
Recent Stories
Early stage predicate evaluation with DECFLOAT and implicit casts

Meet “the Crusher” - or how we learned to love query acceleration

May sum up at IDUG