[Deep Learning Basics] Deep Reinforcement Learning

by happynaraepapa 2025. 4. 10. 15:37

sources:
https://www.kaggle.com/code/alexisbcook/deep-reinforcement-learning

Deep Reinforcement Learning


Introduction
So far, our agents have relied on detailed information about how to play the game. The heuristic really provides a lot of guidance about how to select moves!
So far, our agents (programs) could only play the game when given detailed information.
The heuristic provides guidance on which move to select.

In this tutorial, you'll learn how to use reinforcement learning to build an intelligent agent without the use of a heuristic. Instead, we will gradually refine the agent's strategy over time, simply by playing the game and trying to maximize the winning rate.
Here, instead of a heuristic, we will build an intelligent agent using reinforcement learning, gradually improving the agent's strategy through repeated play and trying to maximize its win rate.

In this notebook, we won't be able to explore this complex field in detail, but you'll learn about the big picture and explore code that you can use to train your own agent.
Rather than the coding details, let's focus on the big picture while exploring the code.

Neural Networks
It's difficult to come up with a perfect heuristic. Improving the heuristic generally entails playing the game many times, to determine specific cases where the agent could have made better choices. And, it can prove challenging to interpret what exactly is going wrong, and ultimately to fix old mistakes without accidentally introducing new ones.
It is hard to come up with a perfect heuristic.
Improving a heuristic generally means playing the game many times to find cases where the agent could have chosen better.
In the end, you have to fix past mistakes without accidentally introducing new ones.

Wouldn't it be much easier if we had a more systematic way of improving the agent with gameplay experience?
What if we could improve the agent with gameplay experience in a more systematic way?

In this tutorial, towards this goal, we'll replace the heuristic with a neural network.
Here, we will replace the heuristic with a neural network.

The network accepts the current board as input. And, it outputs a probability for each possible move.
(From here on, I won't add a translation for every sentence; let's follow the original text.)

Then, the agent selects a move by sampling from these probabilities. For instance, for the game board in the image above, the agent selects column 4 with 50% probability.

This way, to encode a good gameplay strategy, we need only amend the weights of the network so that for every possible game board, it assigns higher probabilities to better moves.
To amend the gameplay strategy, we only need to change the network's weights.

At least in theory, that's our goal. In practice, we won't actually check if that's the case -- since remember that Connect Four has over 4 trillion possible game boards!
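
To make this concrete, here is a minimal sketch (not code from the notebook) of such a policy network in PyTorch; the class name, layer sizes, and board encoding are assumptions for illustration. The network maps a 6x7 Connect Four board to a probability for each of the 7 columns, and the agent samples a move from those probabilities.

```python
# Minimal sketch (illustrative only): a policy network that maps a 6x7
# Connect Four board to a probability for each of the 7 columns.
# Class name, layer sizes, and board encoding are assumptions.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, rows=6, cols=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                 # (batch, rows, cols) -> (batch, rows*cols)
            nn.Linear(rows * cols, 64),
            nn.ReLU(),
            nn.Linear(64, cols),          # one score per column
            nn.Softmax(dim=-1),           # scores -> move probabilities
        )

    def forward(self, board):
        return self.net(board.float())

policy = PolicyNet()
board = torch.zeros(1, 6, 7)                        # an empty board
probs = policy(board)                               # shape (1, 7), sums to 1
move = torch.multinomial(probs, num_samples=1)      # sample a column
```

Encoding a good strategy then means adjusting the weights of this network so that, for every board, better columns get higher probabilities.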


Setup
How can we approach the task of amending the weights of the network, in practice? Here's the approach we'll take in this lesson:

After each move, we give the agent a reward that tells it how well it did:
How do we adjust the network weights after each move? We give the agent an appropriate reward for how it played (a small code sketch of these rules follows the list below).

If the agent wins the game in that move, we give it a reward of +1.
Else if the agent plays an invalid move (which ends the game), we give it a reward of -10.
Else if the opponent wins the game in its next move (i.e., the agent failed to prevent its opponent from winning), we give the agent a reward of -1.
Else, the agent gets a reward of 1/42.
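
A minimal sketch of these per-move reward rules in plain Python (the function name and arguments are illustrative assumptions, not the notebook's code):

```python
# Illustrative sketch of the per-move reward rules listed above.
# The function name and arguments are assumptions, not the notebook's code.
def move_reward(agent_won, invalid_move, opponent_wins_next, rows=6, cols=7):
    if agent_won:               # the agent wins the game with this move
        return 1.0
    if invalid_move:            # an invalid move ends the game
        return -10.0
    if opponent_wins_next:      # the agent failed to block the opponent
        return -1.0
    return 1.0 / (rows * cols)  # otherwise a small bonus of 1/42
```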

At the end of each game, the agent adds up its reward. We refer to the sum of rewards as the agent's cumulative reward.
The agent collects rewards during the game, and once the game is over we look at its accumulated (cumulative) reward.

For instance, if the game lasted 8 moves (each player played four times), and the agent ultimately won, then its cumulative reward is 3*(1/42) + 1.

If the game lasted 11 moves (and the opponent went first, so the agent played five times), and the opponent won in its final move, then the agent's cumulative reward is 4*(1/42) - 1.

If the game ends in a draw, then the agent played exactly 21 moves, and it gets a cumulative reward of 21*(1/42).
If the game lasted 7 moves and ended with the agent selecting an invalid move, the agent gets a cumulative reward of 3*(1/42) - 10.
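
As a quick sanity check of the arithmetic in these examples (the 1/42 bonus for each non-terminal agent move, plus the terminal reward):

```python
# Quick check of the cumulative-reward examples above.
bonus = 1 / 42
print(3 * bonus + 1)    # win after 4 agent moves    -> about  1.07
print(4 * bonus - 1)    # loss after 5 agent moves   -> about -0.90
print(21 * bonus)       # draw after 21 agent moves  ->        0.50
print(3 * bonus - 10)   # invalid 4th agent move     -> about -9.93
```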
Our goal is to find the weights of the neural network that (on average) maximize the agent's cumulative reward.

This idea of using reward to track the performance of an agent is a core idea in the field of reinforcement learning. Once we define the problem in this way, we can use any of a variety of reinforcement learning algorithms to produce an agent.
The idea is to track the reward as a measure of the agent's performance; that is, we adjust the network weights so that the agent earns a higher reward.


Reinforcement Learning
There are many different reinforcement learning algorithms, such as DQN, A2C, and PPO, among others. All of these algorithms use a similar process to produce an agent:
There are many reinforcement learning algorithms.
DQN, A2C, PPO, etc. are names of algorithms.
They all train the agent in a similar way.

Initially, the weights are set to random values.
First, the network weights are set to random values.

As the agent plays the game, the algorithm continually tries out new values for the weights, to see how the cumulative reward is affected, on average. Over time, after playing many games, we get a good idea of how the weights affect cumulative reward, and the algorithm settles towards weights that performed better.

By playing the game over and over, the weights are adjusted in the direction that maximizes the cumulative reward.
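
The real algorithms do this with gradients and much more machinery, but as a cartoon of "try new weights, keep whatever raises the average cumulative reward," here is a toy random-search sketch. Everything here is a stand-in for illustration; in particular, play_one_game is a hypothetical function that plays a full game with a policy parameterized by the given weights and returns its cumulative reward.

```python
# Toy illustration only: random search over weights, keeping whichever
# weights give a higher average cumulative reward.
import numpy as np

def average_reward(weights, play_one_game, n_games=50):
    # Estimate the average cumulative reward of a policy with these weights.
    return np.mean([play_one_game(weights) for _ in range(n_games)])

def random_search(play_one_game, n_weights=100, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    best_w = rng.normal(size=n_weights)            # start from random weights
    best_r = average_reward(best_w, play_one_game)
    for _ in range(n_iters):
        candidate = best_w + 0.1 * rng.normal(size=n_weights)   # try new weights
        r = average_reward(candidate, play_one_game)
        if r > best_r:                             # settle towards better weights
            best_w, best_r = candidate, r
    return best_w, best_r
```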

Of course, we have glossed over the details here, and there's a lot of complexity involved in this process. For now, we focus on the big picture!
Of course, we've skipped over the details, but for now let's focus on the big picture.

This way, we'll end up with an agent that tries to win the game (so it gets the final reward of +1, and avoids the -1 and -10) and tries to make the game last as long as possible (so that it collects the 1/42 bonus as many times as it can).
This way, the agent learns to win the game and to make the game last as long as possible, so it can collect as many bonuses as possible.

You might argue that it doesn't really make sense to want the game to last as long as possible -- this might result in a very inefficient agent that doesn't play obvious winning moves early in gameplay.
You might think that making the game last as long as possible conflicts with winning as quickly as possible (since the latter seems like the better strategy).
And, your intuition would be correct -- this will make the agent take longer to play a winning move! The reason we include the 1/42 bonus is to help the algorithms we'll use to converge better.
That is not wrong (it does take longer), but the reason we include the 1/42 bonus is to help the algorithms we will use converge better.

Further discussion is outside of the scope of this course, but you can learn more by reading about the "temporal credit assignment problem" and "reward shaping".
In the next section, we'll use the Proximal Policy Optimization (PPO) algorithm to create an agent.
Further discussion is outside the scope of this course, but read about the topics mentioned above (the temporal credit assignment problem and reward shaping).
Here, we will use the PPO (Proximal Policy Optimization) algorithm to create the agent.
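
The omitted part of the notebook trains this agent with an off-the-shelf PPO implementation. The sketch below assumes a Gym-compatible Connect Four environment wrapper (called ConnectFourGym here purely for illustration) that exposes the board as its observation and hands out the rewards described above, plus the stable-baselines3 library.

```python
# Minimal sketch, not the notebook verbatim. ConnectFourGym is assumed to be
# a Gym-compatible wrapper around the ConnectX game (defined elsewhere) that
# returns the board as the observation and uses the reward scheme above.
import numpy as np
from stable_baselines3 import PPO

env = ConnectFourGym()                      # hypothetical environment wrapper
model = PPO("MlpPolicy", env, verbose=0)    # policy network managed by PPO
model.learn(total_timesteps=50_000)         # play many games, adjusting weights

def agent(obs, config):
    # Reshape the flat ConnectX board into the grid layout the wrapper is
    # assumed to expose, then let the trained policy pick a column.
    board = np.array(obs["board"]).reshape(1, config.rows, config.columns)
    action, _ = model.predict(board)
    return int(action)
```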


(The rest is omitted.)
