[Intermediate Deep Learning] Feature Engineering

happynaraepapa 2025. 3. 13. 15:55

Source:
https://www.kaggle.com/code/ryanholbrook/what-is-feature-engineering

Intro.
#From the intermediate courses onward I won't translate everything, only the important parts.
Welcome to Feature Engineering!
In this course you'll learn about one of the most important steps on the way to building a great machine learning model: feature engineering. You'll learn how to:

  • determine which features are the most important with mutual information (mutual information scores how much each feature tells you about the target; see the sketch after this list)
  • invent new features in several real-world problem domains (derive new features from domain knowledge about the problem)
  • encode high-cardinality categoricals with a target encoding (replace category labels with statistics of the target to improve accuracy)
  • create segmentation features with k-means clustering (use cluster assignments as segment features)
  • decompose a dataset's variation into features with principal component analysis (break the dataset's variation down into component features with PCA)
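
A minimal sketch of the first bullet, using scikit-learn's mutual_info_regression; the data and feature names below are made up for demonstration (the course covers this properly in a later lesson):

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Made-up data: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Length": rng.uniform(10, 50, size=200),
    "Noise": rng.normal(size=200),
})
y = X["Length"] ** 2  # the target depends only on Length

# Higher scores mean a feature carries more information about the target
scores = mutual_info_regression(X, y, random_state=0)
print(pd.Series(scores, index=X.columns).sort_values(ascending=False))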

The hands-on exercises build up to a complete notebook that applies all of these techniques to make a submission to the House Prices Getting Started competition. After completing this course, you'll have several ideas that you can use to further improve your performance.

Are you ready? Let's go!

The Goal of Feature Engineering
The goal of feature engineering is simply to make your data better suited to the problem at hand.
#In other words, feature engineering means reshaping your data so that it better fits the problem you are actually trying to solve.

Consider "apparent temperature" measures like the heat index and the wind chill. These quantities attempt to measure the perceived temperature to humans based on air temperature, humidity, and wind speed, things which we can measure directly.
#Apparent temperature: how warm or cold it actually feels, as opposed to the measured air temperature
#Heat index: apparent temperature based on air temperature and humidity
#Wind chill (index): apparent temperature based on air temperature and wind speed

You could think of an apparent temperature as the result of a kind of feature engineering, an attempt to make the observed data more relevant to what we actually care about: how it actually feels outside!
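
As a toy illustration (my own sketch, not part of the lesson), you could derive a crude "feels like" feature from directly measured quantities. The column names and the simplified formula are invented for demonstration; they are not the real heat-index or wind-chill formulas:

import pandas as pd

# Hypothetical weather measurements; column names are made up for this sketch
weather = pd.DataFrame({
    "AirTempC": [30.0, 32.0, 35.0],
    "Humidity": [0.40, 0.60, 0.80],  # relative humidity, 0 to 1
    "WindKph": [20.0, 10.0, 5.0],
})

# Deliberately simplified "feels like" feature: humidity makes heat feel hotter,
# wind makes it feel cooler. Not an official apparent-temperature formula.
weather["FeelsLikeC"] = (
    weather["AirTempC"] + 5.0 * weather["Humidity"] - 0.1 * weather["WindKph"]
)
print(weather)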

You might perform feature engineering to:

  • improve a model's predictive performance
  • reduce computational or data needs
  • improve interpretability of the results

A Guiding Principle of Feature Engineering

For a feature to be useful, it must have a relationship to the target that your model is able to learn. Linear models, for instance, are only able to learn linear relationships. So, when using a linear model, your goal is to transform the features to make their relationship to the target linear.
...
#For example, a linear model can only learn linear relationships, so when using one, the goal is to transform the features until their relationship to the target is linear.

The key idea here is that a transformation you apply to a feature becomes in essence a part of the model itself. Say you were trying to predict the Price of square plots of land from the Length of one side. Fitting a linear model directly to Length gives poor results: the relationship is not linear.
#Example: a model that predicts the Price of a square plot of land from the Length of one side. Fitting a linear model directly to this dataset gives poor results, because the relationship between Length and Price is not linear.

A linear model fits poorly with only Length as a feature.

If we square the Length feature to get 'Area', however, we create a linear relationship. Adding Area to the feature set means this linear model can now fit a parabola.
#Squaring the Length feature gives us Area, which has a linear relationship with Price. Adding Area to the feature set means the linear model can now fit a parabola.
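
A minimal sketch of this idea on synthetic data (my own illustration, not the lesson's code); the R^2 score from LinearRegression shows how much the fit improves once the derived Area feature is added:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic square plots: Price is proportional to Area = Length ** 2, plus noise
rng = np.random.default_rng(0)
length = rng.uniform(10, 100, size=200)
price = 3.0 * length**2 + rng.normal(scale=500, size=200)

X = pd.DataFrame({"Length": length})
print("R^2, Length only: ", LinearRegression().fit(X, price).score(X, price))

# Derived feature: squaring Length makes the relationship to Price linear
X["Area"] = X["Length"] ** 2
print("R^2, Length + Area:", LinearRegression().fit(X, price).score(X, price))
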
(abridged)

Example - Concrete Formulations

To illustrate these ideas we'll see how adding a few synthetic features to a dataset can improve the predictive performance of a random forest model.
#We'll see how adding a few synthetic features to the dataset improves a random forest model's predictive performance.

The Concrete dataset contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear. The task for this dataset is to predict a concrete's compressive strength given its formulation.

#The dataset contains various concrete formulations along with each one's compressive strength; the task is to predict the compressive strength from the formulation.
#Python code
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the concrete formulations dataset
df = pd.read_csv("../input/fe-course-data/concrete.csv")

You can see here the various ingredients going into each variety of concrete. We'll see in a moment how adding some additional synthetic features derived from these can help a model to learn important relationships among them.
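
#The notebook displays the dataframe at this point (output omitted here); the same peek at the ingredient columns:

df.head()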

We'll first establish a baseline by training the model on the un-augmented dataset. This will help us determine whether our new features are actually useful.

Establishing baselines like this is good practice at the start of the feature engineering process. A baseline score can help you decide whether your new features are worth keeping, or whether you should discard them and possibly try something else.

X = df.copy()
y = X.pop("CompressiveStrength")

# Train and score baseline model
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
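# cross_val_score returns the negative MAE, so flip the sign to get a positive error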
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")
MAE Baseline Score: 8.232
If you ever cook at home, you might know that the ratio of ingredients in a recipe is usually a better predictor of how the recipe turns out than their absolute amounts. We might reason then that ratios of the features above would be a good predictor of CompressiveStrength.

The cell below adds three new ratio features to the dataset.

X = df.copy()
y = X.pop("CompressiveStrength")

# Create synthetic features
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]

# Train and score model on dataset with additional ratio features
model = RandomForestRegressor(criterion="absolute_error", random_state=0)
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")



MAE Score with Ratio Features: 7.948
And sure enough, performance improved! This is evidence that these new ratio features exposed important information to the model that it wasn't detecting before.