
VisualBasic Machine Learning, Step 1: Q-Learning


28 Mar 2016

CPOL

7 min read


Machine learning plays the Snake game.

Introduction

Over the past few weeks, Google's AlphaGO has caught the world's attention: an artificial neural network demonstrated its extraordinary ability and, thanks to deep learning techniques, AlphaGO even defeated one of the world's top Go players.

For now, that kind of artificial intelligence still only exists in the movies, because our world is far too complex for an AI program. The world inside a video game is a simplified version of the real world: we cannot create a human-like AI for the real world, but creating a game-player AI inside a game world is possible.

For me, the happiest thing is playing video games, and the Mario games are my favorite, since the NES game Mario Bros. accompanied me through my childhood. MarI/O, a Mario game machine player that Seth Bling built using an evolving neural network technique, inspired me to step into the field of machine learning.

MarI/O video: https://www.youtube.com/watch?v=qv6UVOQ0F44

Machine learning plays the Mario game through neuroevolution

Inspired by the great work of MarI/O, I decided to develop my own AI engine framework, not for serious research, just for fun and to satisfy my personal hobby.

Q-Learning Details

Q-Learning is a model-free machine learning algorithm, and for a machine learning beginner like me it is one of the simplest algorithms in ML. In general, Q-Learning consists of three steps:

How QL_AI works: an overview of the three steps of Q-Learning

Stage 1: Environment state as input

How does the snake see the world?

To generate the environment state as input, the snake must first be able to see the world. MarI/O simplifies the state input by abstracting the game world into a set of blocks, so in this Snake game we simply abstract the world into two points: the head of the snake and its food (if we ignore the deadly walls and any block obstacles). These two points are enough to model the snake and the external world. Since we only want to train the snake to eat the food, we only need to teach the snake how to approach the food, and two elements are used to generate a state so that the QTable can produce an identity state:
  1. A description of the relative position between the snake and its food
  2. The current movement direction of the snake
So we can let the snake see the world simply by exporting the snake's location and the food's location to the QL_AI.
' Get environment state as input       
Dim pre = Distance(_snakeGame.Snake.Location, _snakeGame.food.Location)

Call _stat.SetState(_stat.GetNextState(Nothing))


Imports Microsoft.VisualBasic.DataMining.Framework.QLearning
Imports Microsoft.VisualBasic.GamePads.EngineParts
Imports Microsoft.VisualBasic.GamePads

''' <summary>
''' Environment state inputs
''' </summary>
Public Structure GameControl : Implements ICloneable

    ''' <summary>
    ''' The relative position between the head of the snake and his food
    ''' </summary>
    Dim position As Controls
    ''' <summary>
    ''' The current movement direction of the snake.
    ''' </summary>
    Dim moveDIR As Controls

    ''' <summary>
    ''' hash key for the QTable
    ''' </summary>
    ''' <returns></returns>
    Public Overrides Function ToString() As String
        Return $"{CInt(position)},{vbTab}{CInt(moveDIR)}"
    End Function

    Public Function Clone() As Object Implements ICloneable.Clone
        Return New GameControl With {
            .position = position,
            .moveDIR = moveDIR
        }
    End Function
End Structure

Public Class QState : Inherits QState(Of GameControl)

    Dim game As Snake.GameEngine

    Sub New(game As Snake.GameEngine)
        Me.game = game
    End Sub

    ''' <summary>
    ''' The position relationship of the snake head and his food consists of 
    ''' the current environment state
    ''' </summary>
    ''' <param name="action">The current action</param>
    ''' <returns></returns>
    Public Overrides Function GetNextState(action As Integer) As GameControl
        Dim pos As Controls = Position(game.Snake.Location, game.food.Location, False)
        Dim stat = New GameControl With {
            .moveDIR = game.Snake.Direction,
            .position = pos
        }  ' The current action plus the current state make up one state in Q-learning
        Return stat
    End Function
End Class
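
To make the hash key concrete, here is a small illustrative snippet (the numeric output depends on the values of the Controls enumeration in Microsoft.VisualBasic.GamePads, and the assumption that Controls.Up here describes the food lying in the up direction relative to the snake head): it builds one state and prints the string that the QTable uses as its dictionary key.

' Illustrative only: the printed key has the form "<CInt(position)>,<Tab><CInt(moveDIR)>",
' with the integer values coming from the framework's Controls enumeration.
Dim state As New GameControl With {
    .position = Controls.Up,     ' assumed: the food lies above the snake head
    .moveDIR = Controls.Right    ' the snake is currently moving right
}

Console.WriteLine(state.ToString())  ' the hash key stored in the QTable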

Stage 2: Controls as output

The core of the Q-Learning algorithm is the QTable. The QTable basically consists of two elements: the environment states and the Q values of the actions associated with a specific environment state. In the code, the element index inside the action state object stands for an action, e.g. 0 stands for the Up action, 1 stands for the Down action, and so on. Since the action output of this Q-Learning program consists of only the four direction controls (up, down, left and right), the action range of the QTable is 4.

The basics of the QTable

Imports System
Imports System.Collections
Imports System.Collections.Generic
Imports Microsoft.VisualBasic.DataMining.Framework.QLearning.DataModel
Imports Microsoft.VisualBasic

Namespace QLearning

    ''' <summary>
    ''' The heart of the Q-learning algorithm, the QTable contains the table
    ''' which maps states, actions and their Q values. This class has elaborate
    ''' documentation, and should be the focus of the students' body of work
    ''' for the purposes of this tutorial.
    '''
    ''' @author A.Liapis (Original author), A. Hartzen (2013 modifications); 
    ''' xie.guigang@gcmodeller.org (2016 modifications)
    ''' </summary>
    Public MustInherit Class QTable(Of T As ICloneable)
        Implements IQTable

        ''' <summary>
        ''' The table variable stores the Q-table, where the state is saved
        ''' directly as the actual map. Each map state has an array of Q values
        ''' for all the actions available for that state.
        ''' </summary>
        Public ReadOnly Property Table As Dictionary(Of Action) _
               Implements IQTable.Table

        ''' <summary>
        ''' The actionRange variable determines the number of actions available
        ''' at any map state, and therefore the number of Q values in each entry
        ''' of the Q-table.
        ''' </summary>
        Public ReadOnly Property ActionRange As Integer Implements IQTable.ActionRange

#Region "E-GREEDY Q-LEARNING SPECIFIC VARIABLES"

        ''' <summary>
        ''' For e-greedy Q-learning, when taking an action, a random number is
        ''' checked against the explorationChance variable: if the number is
        ''' below the explorationChance, then exploration takes place picking
        ''' an action at random. Note that the explorationChance is not a final
        ''' because it is customary that the exploration chance changes as the
        ''' training goes on.
        ''' </summary>
        Public Property ExplorationChance As Single = 0.05F _
               Implements IQTable.ExplorationChance
        ''' <summary>
        ''' The discount factor is saved as the gammaValue variable. The
        ''' discount factor determines the importance of future rewards.
        ''' If the gammaValue is 0, then the AI will only consider immediate
        ''' rewards, while with a gammaValue near 1 (but below 1), the AI will
        ''' try to maximize the long-term reward even if it is many moves away.
        ''' </summary>
        Public Property GammaValue As Single = 0.9F Implements IQTable.GammaValue
        ''' <summary>
        ''' The learningRate determines how new information affects accumulated
        ''' information from previous instances. If the learningRate is 1, then
        ''' the new information completely overrides any previous information.
        ''' Note that the learningRate is not a final because it is
        ''' customary that the learningRate changes as the
        ''' training goes on.
        ''' </summary>
        Public Property LearningRate As Single = 0.15F Implements IQTable.LearningRate

For example, in the code shown above, the key of the Table variable in the QTable class is a string that represents a certain environment state, and the dictionary value holds the Q values of the actions. In this Snake game, each element in the dictionary has four values; each Q value's element index stands for one of the four direction buttons on the joystick, and the Q values determine which direction button is the best action for the program to press.

In the initial state of the QTable, there are no elements in the QTable dictionary, but as the game keeps running, more and more environment states get stored in the QTable, so you can think of the QTable as a person's accumulated experience of doing something.
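
As an illustration (the state codes and Q values below are made up), a dump of two entries of such a table, printed with the Action.ToString format shown later in this article, could look like this:

[ 1,	4 ]     --> [0, 0.85, -0.2, 0.1]
[ 3,	2 ]     --> [0.45, 0, 0, 0]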

Exploration or experience

In this QL_AI snake game controller program, we use the ε-greedy algorithm to let the program choose how to deal with the current environment state: try a new exploration, or act based on its previous experience:

''' <summary>
''' For this example, the getNextAction function uses an e-greedy
''' approach, having exploration happen if the exploration chance
''' is rolled.
''' ( **** Note that the value returned by this function is the index of the best choice, so some further conversion may be required **** )
''' </summary>
''' <param name="map"> current map (state) </param>
''' <returns> the action to be taken by the calling program </returns>
Public Overridable Function NextAction(map As T) As Integer
    _prevState = CType(map.Clone(), T)

    If __randomGenerator.NextDouble() < ExplorationChance Then
        _prevAction = __explore()
    Else
        _prevAction = __getBestAction(map)
    End If

    Return _prevAction
End Function
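
The __explore() helper called in NextAction above is not listed in this article; a minimal sketch of it, assuming it simply rolls a random action index within ActionRange, could look like this:

''' <summary>
''' Exploration: pick a random action index, ignoring the accumulated Q values.
''' (A sketch of the helper referenced by NextAction, not the article's exact code.)
''' </summary>
Private Function __explore() As Integer
    Return __randomGenerator.Next(ActionRange)
End Function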

Defining the best action

As described above, the core of the Q-Learning algorithm is the Q values (rewards and penalties) of the actions under a specific environment state that are applied to the QTable object, so we should define an action object to represent the best action under a specific environment state:

Imports Microsoft.VisualBasic.ComponentModel.Collection.Generic
Imports Microsoft.VisualBasic.Serialization

Namespace QLearning

    ''' <summary>
    ''' One specific environment state has some possible actions,
    ''' but there is just one best action for the current environment state, based on the
    ''' accumulated Q values.
    ''' </summary>
    Public Class Action : Implements sIdEnumerable

        ''' <summary>
        ''' The environment variables state as inputs for the machine.
        ''' </summary>
        ''' <returns></returns>
        Public Property EnvirState As String Implements sIdEnumerable.Identifier
        ''' <summary>
        ''' Actions for the current state.
        ''' </summary>
        ''' <returns></returns>
        Public Property Qvalues As Single()

        ''' <summary>
        ''' Environment -> actions' Q-values
        ''' </summary>
        ''' <returns></returns>
        Public Overrides Function ToString() As String
            Return $"[ {EnvirState} ] {vbTab}--> {Qvalues.GetJson}"
        End Function

    End Class
End Namespace

From the class code defined above, we know that under the current environment state EnvirState the program has several action choices. Each action is encoded as an index into the Qvalues array, and each array element in the Qvalues property represents the reward of that action under the current state EnvirState. A higher element value in Qvalues means a higher action reward under the current state, so we simply let the program return the index of the maximum value in Qvalues, and that index can then be decoded into the best action. Getting the best action under the current state looks like the following function:

''' <summary>
''' The getBestAction function uses a greedy approach for finding
''' the best action to take. Note that if all Q values for the current
''' state are equal (such as all 0 if the state has never been visited
''' before), then getBestAction will always choose the same action.
''' If such an action is invalid, this may lead to a deadlock as the
''' map state never changes: for situations like these, exploration
''' can get the algorithm out of this deadlock.
''' </summary>
''' <param name="map"> current map (state) </param>
''' <returns> the action with the highest Q value </returns>
Private Function __getBestAction(map As T) As Integer
    Dim rewards() As Single = Me.__getActionsQValues(map)
    Dim maxRewards As Single = Single.NegativeInfinity
    Dim indexMaxRewards As Integer = 0

    For i As Integer = 0 To rewards.Length - 1
       ' Get the max element value and its index in the Qvalues

       If maxRewards < rewards(i) Then
          maxRewards = rewards(i)
          indexMaxRewards = i
       End If
    Next i

    ' decode this index value as the action controls
    Return indexMaxRewards
End Function

In this Snake game, the program's joystick has only four direction keys, so the Qvalues property of the Action class has four elements, representing the Q value of each direction button that the machine program can press.

The four elements in Qvalues represent the Q value of each action.

After the index of the best action is returned from the QTable according to the current environment state input, we can decode that action index into the joystick controls of the snake:

Dim index As Integer = Q.NextAction(_stat.Current)
Dim action As Controls

Select Case index
    Case 0
        action = Controls.Up
    Case 1
        action = Controls.Down
    Case 2
        action = Controls.Left
    Case 3
        action = Controls.Right
    Case Else
        action = Controls.NotBind
End Select

Call _snakeGame.Invoke(action)

So we can summarize how the program takes an action under the current environment state:

If the random number falls within the ExplorationChance, the program takes a random action to explore its new world.

If not, the program makes the best action decision based on the current environment state and the history records in the QTable; in other words, it acts based on its experience.

Stage 3: Feedback (learning or adaptation)

As shown in the __getBestAction function above, the program first gets the Q values for the current environment state, and then, by comparing them, it can choose the action index that maximizes the reward in that direction.

Learning step 1: See the world

To build up the program's experience of doing something, a dictionary add method is implemented for adding new environment states to the QTable, which enables the program to learn a new world.

''' <summary>
''' This helper function is used for entering the map state into the
''' HashMap </summary>
''' <param name="map"> </param>
''' <returns> String used as a key for the HashMap </returns>
Protected MustOverride Function __getMapString(map As T) As String

''' <summary>
''' The getActionsQValues function returns an array of Q values for
''' all the actions available at any state. Note that if the current
''' map state does not already exist in the Q table (never visited
''' before), then it is initiated with Q values of 0 for all of the
''' available actions.
''' </summary>
''' <param name="map"> current map (state) </param>
''' <returns> an array of Q values for all the actions available at any state </returns>
Private Function __getActionsQValues(map As T) As Single()
    Dim actions() As Single = GetValues(map)
    If actions Is Nothing Then ' This state does not exist yet, so add it as a new entry
        Dim initialActions(ActionRange - 1) As Single
        For i As Integer = 0 To ActionRange - 1
            initialActions(i) = 0.0F
        Next i

        ' If the current environment state is not yet in the program's memory,
        ' then store it; this is the so-called learning step
        _Table += New Action With {  
              .EnvirState = __getMapString(map),
              .Qvalues = initialActions
        }
        Return initialActions
    End If
    Return actions
End Function

Learning step 2: Improve the experience

To teach a child, we usually give some reward when the child behaves the way we expect, and it is the same in Q-Learning: we use rewards to improve the program's experience of an action under a specific environment state:

''' <summary>
''' The updateQvalue is the heart of the Q-learning algorithm. Based on
''' the reward gained by taking the action prevAction while being in the
''' state prevState, the updateQvalue must update the Q value of that
''' {prevState, prevAction} entry in the Q table. In order to do that,
''' the Q value of the best action of the current map state must also
''' be calculated.
''' </summary>
''' <param name="reward"> at the current map state </param>
''' <param name="map"> current map state (for finding the best action of the
''' current map state) </param>
Public Overridable Sub UpdateQvalue(reward As Integer, map As T)
    Dim preVal() As Single = Me.__getActionsQValues(Me._prevState)
    preVal(Me._prevAction) += Me.LearningRate * (reward + Me.GammaValue * _
    Me.__getActionsQValues(map)(Me.__getBestAction(map)) - preVal(Me._prevAction))
End Sub

The Q value update function can be expressed as the following formula:
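
Written out from the UpdateQvalue code above, the update is:

Q ← Q + R × ( reward + γ × T − Q )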

where Q is the Q value of the action taken in the previous environment state, T is the Q value of the best action in the current environment state E, R stands for the learning rate, and γ is the discount factor (the GammaValue in the code above).

If the program's action brings it closer to our goal, then we update the Q value of that action with a reward; if not, we update the Q value of the action with a penalty score. After several training loops, the program will work in the way we expect. Q-Learning works just like classical conditioning in dogs.
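
Putting the pieces together, a minimal sketch of this feedback wiring in the game loop might look like the snippet below. The ±1 reward values and the DecodeAction helper (standing in for the Select Case decoding block shown earlier) are assumptions for illustration, not the article's exact code:

' Measure the distance between the snake head and the food before and after the
' move, then reward the action if the snake got closer and punish it otherwise.
Dim pre = Distance(_snakeGame.Snake.Location, _snakeGame.food.Location)
Dim index As Integer = Q.NextAction(_stat.Current)

Call _snakeGame.Invoke(DecodeAction(index))    ' press the decoded joystick button

Dim post = Distance(_snakeGame.Snake.Location, _snakeGame.food.Location)
Dim reward As Integer = If(post < pre, 1, -1)  ' closer -> reward, farther -> penalty

Dim nextState As GameControl = _stat.GetNextState(index)
Call Q.UpdateQvalue(reward, nextState)
Call _stat.SetState(nextState)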

Classical conditioning: a statue of Ivan Pavlov and one of his dogs

Running the Example

To save your time in running this example, you can download the trained snake QL_AI QTable data from this link

and start the snake with the trained data using the following command:

snakeAI /start /load "filepath-to-the-download-dataset"

Otherwise, you can also retrain the snake yourself. To train the snake from scratch, just double-click the snake application, and it will start running from a fresh state.

This article just tries to explain the basic concept of machine learning through the simplest Snake game. I hope you can get some inspiration from it and build a more powerful AI program that can handle more complex environment situations.

Hope you enjoy it! ;-)
