[Intermediate Machine Learning] Missing Values


happynaraepapa 2025. 1. 16. 14:58

https://www.kaggle.com/code/alexisbcook/missing-values


In this tutorial, you will learn three approaches to dealing with missing values. Then you'll compare the effectiveness of these approaches on a real-world dataset.
Here we learn three ways to handle missing values, then compare how well each one works on a real-world dataset.

Introduction

There are many ways data can end up with missing values. For example,
Missing values arise for many reasons. For example:

A 2 bedroom house won't include a value for the size of a third bedroom.
If a house has only two bedrooms, there is no data for the area of a third bedroom.
A survey respondent may choose not to share his income.
A survey respondent may decline to disclose their income (it is personal information).

Most machine learning libraries (including scikit-learn) give an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
Most machine learning libraries (scikit-learn included) raise an error if you try to build a model from data that contains missing values.
There are broadly three ways to handle missing values.

Three Approaches

1) A Simple Option: Drop Columns with Missing Values
1) Column drop: discard the columns that contain missing values.

The simplest option is to drop columns with missing values. Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach.
This is the simplest method, but you lose all the data in the dropped column, and sometimes that data is very important.

As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!
For example, if a single entry out of 10,000 rows is missing and you drop the whole column because of it, you throw away the other 9,999 entries.
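To make the cost of dropping concrete, here is a minimal sketch on a made-up four-row table (the column names and values are hypothetical, not from the Melbourne data). A single missing entry in `area` is enough for `dropna(axis=1)` to remove the entire column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'area' has one missing entry out of four.
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "area": [80.0, np.nan, 120.0, 100.0],
})

# Drop every column that contains at least one missing value.
dropped = toy.dropna(axis=1)

# Only 'rooms' survives; the three valid 'area' values are lost too.
print(dropped.columns.tolist())
```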

2) A Better Option: Imputation
Imputation (filling in substitute values)
Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column.
This method fills each missing value with some other value.
For example, we can use the column mean (computed from the values that are not missing) as the substitute.

The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
There is no guarantee that the substitute is the right value. It is used because it usually gives a better model than simply dropping the whole column.
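As a small illustration of mean imputation, the sketch below fills the gap in a made-up table with plain pandas (the data is hypothetical; the tutorial itself uses scikit-learn's SimpleImputer later on). The mean of `area` is taken over the three values that are present.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing 'area' value.
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "area": [80.0, np.nan, 120.0, 100.0],
})

# DataFrame.mean() skips NaN by default, so the 'area' mean is
# computed from 80, 120 and 100 only.
filled = toy.fillna(toy.mean())

print(filled)
```

The filled-in value is only a plausible guess, but every other entry in the column stays available to the model.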

3) An Extension To Imputation
Improved imputation
Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.
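Imputation is the standard method, but there is no guarantee that the substitute resembles the real value that belonged there. It may be systematically too high or too low, or the rows with missing values may be unusual in some other way.
In some cases, where the missing values occurred (which rows) is itself useful information, and the model predicts better when it knows which values were originally missing.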

In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries.
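In this approach we fill the missing values with substitutes as before, and for each column that had missing entries we add a new column that marks which rows were imputed.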

In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
Sometimes this leads to a better model, and sometimes it does not help at all.

Example

In the example, we will work with the Melbourne Housing dataset. Our model will use information such as the number of rooms and land size to predict home price.
In this example we use the Melbourne Housing dataset. We will predict home price from information such as land size and the number of rooms.

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.
The data loading step is not explained again here.
Refer to the earlier material and split the dataset into
X_train, X_valid, y_train, y_valid. (Mind the upper and lower case.)
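For orientation, the split could look roughly like the sketch below. It is an assumption, not the notebook's exact loading code: the helper name `make_splits`, the file name in the comment, the 80/20 ratio and the numeric-only filter are all illustrative choices, and a tiny stand-in table replaces the real CSV so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def make_splits(data, target="Price"):
    """Split a housing DataFrame into numeric train/validation sets."""
    y = data[target]
    # Keep numeric predictors only, so no categorical encoding is needed.
    X = data.drop(columns=[target]).select_dtypes(include="number")
    return train_test_split(X, y, train_size=0.8, test_size=0.2,
                            random_state=0)

# Stand-in data (hypothetical). With the real file you would instead
# load it, e.g. data = pd.read_csv("melb_data.csv")  # path is an assumption
data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2, 5, 3, 4, 2, 3],
    "Landsize": [200.0, 350.0, np.nan, 410.0, 150.0,
                 600.0, 320.0, np.nan, 180.0, 300.0],
    "Suburb": list("ABCDEFGHIJ"),
    "Price": [500, 750, 900, 800, 450, 1200, 700, 950, 480, 720],
})

X_train, X_valid, y_train, y_valid = make_splits(data)
print(X_train.shape, X_valid.shape)
```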

Define Function to Measure Quality of Each Approach

We define a function score_dataset() to compare different approaches to dealing with missing values. This function reports the mean absolute error (MAE) from a random forest model.
We define a function score_dataset() and use it to see what MAE a random forest (RF) model produces under each approach.
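The post does not show the body of score_dataset(), so here is a minimal sketch of what such a function could look like. The hyperparameters (`n_estimators=10`, `random_state=0`) are assumptions, and the synthetic data at the end is only there to show that the function runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest on the training split and return validation MAE."""
    # Hyperparameters are assumptions; a fixed seed keeps runs comparable.
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Quick check on synthetic data (not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
mae = score_dataset(X[:80], X[80:], y[:80], y[80:])
print(mae)
```

Because the model and its seed stay fixed, any difference in MAE between approaches comes from how the missing values were handled.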

Score from Approach 1 (Drop Columns with Missing Values)
1) Drop the columns that contain missing values

Since we are working with both training and validation sets, we are careful to drop the same columns in both DataFrames.
Note that the same columns must be removed from both the training and the validation dataset.

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# See the list comprehension covered in the earlier lesson on lists.
# It is the one-expression form of:
#   for col in X_train.columns:
#       if X_train[col].isnull().any():
#           (collect col)
# That is, check each column of X_train in turn; if any of its entries
# is null, keep that column's name. The collected names form the list.

# Drop columns in training and validation data
# Now drop those columns from both the training data and the validation data.

reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
MAE from Approach 1 (Drop columns with missing values):
183550.22137772635
# The RF is then run on these reduced training and validation sets, which gives the MAE above.

Score from Approach 2 (Imputation)
Next, we use SimpleImputer to replace missing values with the mean value along each column.
This time we use simple imputation: each missing value is replaced with the mean of its column.

Although it's simple, filling in the mean value generally performs quite well (but this varies by dataset). While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation, for instance), the complex strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.
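It is simple, but it usually does not fall behind imputation methods that are mathematically far more complex.
And as the example below shows, scikit-learn already provides a module for it.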

# Additional import needed.
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
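# Note that the my_imputer instance uses fit_transform on the training data but transform on the validation data: fit_transform learns each column's mean from X_train, and transform reuses those same means for X_valid. Both return NumPy arrays, so each result is wrapped in pd.DataFrame to get a DataFrame back.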

# Imputation removed column names; put them back
# The methods above return plain arrays, so the DataFrame column headers are lost; we put them back here.
# X_train.columns holds the column labels.

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# The rest is the same as before, so the explanation is omitted.
print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
MAE from Approach 2 (Imputation):
178166.46269899711

We see that Approach 2 has lower MAE than Approach 1, so Approach 2 performed better on this dataset.
Approach 2 gave a lower MAE than Approach 1, so we can say it performed better on this dataset.
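A side note on the "put the column names back" step: the labels can also be passed when the DataFrame is built. The sketch below uses small made-up frames (`toy_train` and `toy_valid` are hypothetical) and additionally keeps the original row index, which the two-step version resets to 0, 1, 2, ...

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical) with non-default row labels.
toy_train = pd.DataFrame(
    {"Rooms": [2, 3, 4], "Landsize": [200.0, np.nan, 400.0]},
    index=[10, 11, 12],
)
toy_valid = pd.DataFrame(
    {"Rooms": [3], "Landsize": [np.nan]},
    index=[20],
)

imputer = SimpleImputer()  # default strategy is the column mean

# Learn the means on the training frame, then reuse them on validation.
imputed_train = pd.DataFrame(imputer.fit_transform(toy_train),
                             columns=toy_train.columns,
                             index=toy_train.index)
imputed_valid = pd.DataFrame(imputer.transform(toy_valid),
                             columns=toy_valid.columns,
                             index=toy_valid.index)

print(imputed_valid)
```

The validation gap is filled with the training mean of Landsize, not with a statistic computed from the validation rows.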

Score from Approach 3 (An Extension to Imputation)

Next, we impute the missing values, while also keeping track of which values were imputed.
Next, consider the third method: impute the missing values while keeping track of which entries were missing.

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

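# This part of the code is tricky.
# cols_with_missing holds the names of the columns that have missing values in X_train.
# For each such column, X_train_plus[col].isnull() is True on the rows
# where the value is missing and False everywhere else.
# Storing that result in a new column named col + '_was_missing' records,
# for every row, whether the value was originally empty (True)
# or not (False).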

# The rest is the same as before, so the explanation is omitted.
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
MAE from Approach 3 (An Extension to Imputation):
178927.503183954

As we can see, Approach 3 performed slightly worse than Approach 2.
From the above, Approach 3 performed slightly worse than Approach 2.
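For reference, scikit-learn can build the "was missing" columns itself through SimpleImputer's `add_indicator=True` option. This is an alternative to the manual loop above, not what the tutorial does; the toy frame and the `_was_missing` naming in the sketch are assumptions chosen to mirror the code in this post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical): only 'Landsize' has a missing value.
toy_train = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [200.0, np.nan, 400.0],
})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted.
imputer = SimpleImputer(add_indicator=True)
values = imputer.fit_transform(toy_train)

# Name the extra columns after the features they flag.
flagged = [toy_train.columns[i] for i in imputer.indicator_.features_]
names = list(toy_train.columns) + [c + "_was_missing" for c in flagged]
result = pd.DataFrame(values, columns=names)

print(result)
```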

So, why did imputation perform better than dropping the columns?
So why did imputation do better than simply dropping the columns?

The training data has 10864 rows and 12 columns, where three columns contain missing data. For each column, less than half of the entries are missing. Thus, dropping the columns removes a lot of useful information, and so it makes sense that imputation would perform better.
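The training set has 10864 rows and 12 columns, and 3 of those columns contain missing values. In each of them fewer than half of the entries are missing, so dropping the columns throws away a lot of usable data. It makes sense that using that data gave better performance than discarding it.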

# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64
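The "less than half" claim can be checked from the counts printed above. The sketch below re-enters those counts by hand; on the real data, `X_train.isnull().mean()` would give the same fractions directly.

```python
import pandas as pd

# Numbers copied from the output above.
n_rows = 10864
missing_counts = pd.Series({"Car": 49, "BuildingArea": 5156, "YearBuilt": 4307})

# Share of missing entries per column, in percent.
missing_share = (missing_counts / n_rows * 100).round(1)
print(missing_share)
```

BuildingArea is missing in roughly 47% of rows and YearBuilt in roughly 40%, so more than half of each column is still real data, and Car is almost complete.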

The rest is omitted.
(End)