[Intermediate Machine Learning] Pipelines

happynaraepapa 2025. 1. 23. 11:24

Source:
https://www.kaggle.com/code/alexisbcook/pipelines
...
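#ML modeling involves many steps that repeat from project to project; only the data changes.
#A pipeline is simply a ready-made tool/framework for streamlining that sequence of steps.
#The translation below quickly covers only the parts that matter and skips the rest.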

Introduction
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
A pipeline lets you handle preprocessing and modeling together as a single bundle.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
Many people do not use pipelines, but they offer the following advantages.
...
(omitted)
...
Example
Using the Melbourne housing data:

Step 1: Define Preprocessing Steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:
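Preprocessing is done with the ColumnTransformer class. It bundles the individual operations: imputing replacement values for numerical columns, and applying ordinal or one-hot encoding to categorical columns. (Side note: the missing-value handling covered earlier is a typical preprocessing task.)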
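
Since the original code is skipped here, the following is only a minimal sketch of what bundling preprocessing steps with ColumnTransformer can look like. The toy DataFrame, its column names, and the choice of mean and most-frequent imputation are assumptions made for illustration; they are not taken from the source notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the Melbourne dataset): two numerical columns
# and one categorical column, each with one missing value.
X = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 4.0],
    "land_size": [150.0, np.nan, 300.0, 420.0],
    "type": ["h", "u", np.nan, "h"],
})
numerical_cols = ["rooms", "land_size"]
categorical_cols = ["type"]

# Numerical columns: replace missing values with the column mean.
numerical_transformer = SimpleImputer(strategy="mean")

# Categorical columns: fill missing values with the most frequent category,
# then one-hot encode. Categories unseen during fit are encoded as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Bundle both transformers; sparse_threshold=0 forces a dense array as output.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    sparse_threshold=0,
)

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
print(X_prepared)
```

The output has the two imputed numerical columns first, followed by one column per category of the hypothetical "type" feature.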

...(omitted)...
Step 2: Define the Model

Next we define the random forest (RF) model, which is done much the same way as before.
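
As a rough illustration, defining the model might look like the sketch below. The hyperparameter values are assumptions chosen for the example, not values confirmed by the source.

```python
from sklearn.ensemble import RandomForestRegressor

# n_estimators and random_state are illustrative choices;
# fixing random_state makes repeated runs reproducible.
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Nothing is fitted yet: the model is only trained once the pipeline is fit.
print(model)
```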

Step 3: Create and Evaluate the Pipeline
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:
Now we use the Pipeline class to bundle the preprocessing above with the model. A few points are worth noting.

With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
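With a pipeline, preprocessing the training data and fitting the model takes a single line of code. Without one, we would have to find the missing values, impute them, encode the categorical variables, and fit the model as separate steps.
This gets even messier when the data contains both numerical and categorical variables.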

With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)
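Likewise, when evaluating the model, we pass the raw X_valid straight into the bundle; preprocessing runs automatically and then the predictions are generated. Without a pipeline, we would have to repeat every preprocessing step described above on the validation data ourselves.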
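
The two points above can be sketched end to end as follows. This is a hedged example, not the tutorial's code: the synthetic data, column names, imputation strategies, and hyperparameters are all assumptions. It fits the pipeline in one call, predicts on the raw validation features, and then repeats the same work by hand to show that the manual route needs more steps for the same predictions.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical synthetic data standing in for the housing data.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=n).astype(float),
    "land_size": rng.uniform(100, 800, size=n),
    "type": rng.choice(["h", "u", "t"], size=n),
})
y = 100_000 * X["rooms"] + 500 * X["land_size"] + rng.normal(0, 10_000, size=n)

# Knock out some values so the imputers have work to do.
X.loc[X.sample(frac=0.1, random_state=0).index, "rooms"] = np.nan
X.loc[X.sample(frac=0.1, random_state=1).index, "type"] = np.nan

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Preprocessing bundle: impute numerical columns, impute + one-hot categorical.
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["rooms", "land_size"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["type"]),
])
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling into one estimator.
my_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])

# One line preprocesses the training data and fits the model.
my_pipeline.fit(X_train, y_train)

# The raw validation features go straight into predict().
preds = my_pipeline.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print("MAE:", mae)

# The same work without a pipeline: each step must be remembered and ordered.
manual_prep = clone(preprocessor)
manual_model = clone(model)
manual_model.fit(manual_prep.fit_transform(X_train), y_train)
manual_preds = manual_model.predict(manual_prep.transform(X_valid))
print("Same predictions:", np.allclose(preds, manual_preds))
```

Because the pipeline stores the fitted imputers and encoder, the validation data is transformed with statistics learned from the training split only, which is easy to get wrong when the steps are run by hand.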

...
(omitted)
...