
Deep Learning AI on Low-Power Microcontrollers: MNIST Handwriting Recognition on Arm Cortex-M Devices Using TensorFlow Lite Micro

Raphael Mun


16 Apr 2020

CPOL



In this article, we'll build a fully functional MNIST handwriting recognition app and use TensorFlow Lite to run AI inference on a low-power STMicroelectronics microcontroller built around an Arm Cortex-M7 processor.

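Handwritten digit recognition with TensorFlow and MNIST has become a common introduction to artificial intelligence (AI) and machine learning (ML). "MNIST" is the Modified National Institute of Standards and Technology database, which contains 70,000 examples of handwritten digits. It is a widely used image source for training image processing systems and ML software.

While ML tutorials using TensorFlow and MNIST are common, until recently they were typically demonstrated on full-fledged x86 processing environments with workstation-class GPUs. More recently, there have been examples that run on more powerful smartphones, but even these pocket-sized computing environments usually provide multicore CPUs and powerful dedicated GPUs.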

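Today, however, you can create a fully functional MNIST handwriting recognition app even on a low-power microcontroller. To demonstrate, we'll build one, like the one pictured below, that uses TensorFlow Lite to run AI inference on a low-power STMicroelectronics microcontroller based on the 32-bit Arm Cortex-M7 processor. An implementation like this shows that robust ML applications can run on battery-powered, standalone devices across a wide range of edge IoT and handheld scenarios.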

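You'll need a few things to build this project:

An Arm Cortex-M powered microcontroller device. I'll be using the STM32F746G Discovery board, but any device with an Arm Cortex-M processor should work. You can also check this list of devices that support TensorFlow Lite for Microcontrollers.
Your favorite C++ IDE toolchain for embedded device development. I'll be using the free, cross-platform PlatformIO IDE.
The TensorFlow Lite for Microcontrollers C++ library, to compile into your project.

You can find the code for this project on GitHub.

This article assumes you're familiar with C/C++ and ML, but don't worry if you aren't. You can still follow along and try deploying the project to your own device!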

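Quick Overview

Before we get started, let's look at the steps required to run a deep learning AI project on a microcontroller with TensorFlow:

Train a prediction model using a dataset (MNIST handwritten digits)
Convert the model to TensorFlow Lite
Create the embedded application
Generate sample data
Deploy and test the application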

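To make this process quicker and easier, I've created a Jupyter notebook on Google Colab that handles the first two steps for you right in your browser, without installing and configuring Python on your machine. It can also serve as a reference for other projects, since it contains all the code needed to train and evaluate the MNIST model with TensorFlow, convert the model to TensorFlow Lite for Microcontrollers for offline use, and generate a C array version of the model that compiles easily into any C++ program.

If you'd like to skip ahead to the embedded application in step 3, first click Runtime > Run all in the notebook menu to generate the model.h file. You can download it from the file list on the left, or download the prebuilt model from the GitHub repository to include in your project.

If you'd like to run these steps on your local machine, make sure you're using TensorFlow 2.0 or later, and use Anaconda to install and run Python. If you use the Jupyter notebook mentioned earlier, you don't need to worry about installing TensorFlow 2.0, because it's already included in the notebook environment.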

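Training a TensorFlow Model with MNIST

Keras is a high-level neural network Python library often used for prototyping AI solutions. It's integrated with TensorFlow and includes a built-in MNIST dataset of 60,000 training images and 10,000 test images that can be accessed directly within TensorFlow.

To predict handwritten digits, I used this dataset to train a relatively simple model that takes a 28x28 image as its input shape and outputs to 10 categories through a Softmax activation function, with a single hidden layer in between. This was enough to reach an accuracy of 96.6%, but you can add more hidden layers or units if you'd like.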
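To make the output layer a little more concrete, here is a minimal C++ sketch of the Softmax step, which turns ten raw scores into probabilities that add up to 1. The helper name `softmax10` is an illustrative assumption; it is not taken from the notebook or from TensorFlow.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative sketch: Softmax over the 10 digit classes.
// `softmax10` is a hypothetical helper, not a TensorFlow function.
std::array<float, 10> softmax10( const std::array<float, 10>& scores ) {
  // Subtract the largest score first so std::exp cannot overflow.
  const float max_score = *std::max_element( scores.begin(), scores.end() );

  std::array<float, 10> probs{};
  float sum = 0.0f;
  for( std::size_t i = 0; i < scores.size(); i++ ) {
    probs[ i ] = std::exp( scores[ i ] - max_score );
    sum += probs[ i ];
  }
  assert( sum > 0.0f ); // The largest score always contributes exp(0) = 1

  for( float& p : probs ) {
    p /= sum; // Normalize so the 10 probabilities add up to 1
  }
  return probs;
}
```

The class with the highest probability is the predicted digit, which is why the embedded code later only needs to find the largest output value.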

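For a more in-depth discussion of using the MNIST dataset in TensorFlow, I recommend looking at some of the many excellent TensorFlow tutorials online, such as O'Reilly's "Not another MNIST tutorial with TensorFlow". You can also refer to the TensorFlow sine wave model example in this notebook to get familiar with training and evaluating TensorFlow models and converting them to TensorFlow Lite for Microcontrollers.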

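Converting the Model to TensorFlow Lite

The model I created in step 1 is useful and accurate, but its file size and memory usage make it hard to port to or run on an embedded device. This is where TensorFlow Lite comes in: its runtime is optimized for mobile, embedded, and IoT devices, and it delivers low latency with a very small footprint (just a few kilobytes!). It lets you trade off accuracy, speed, and size to choose a model that fits your needs.

In this case, I need TensorFlow Lite to take up as little flash space and memory as possible while staying fast, so it's worth giving up a little accuracy in exchange.

To shrink the size further, the TensorFlow Lite Converter supports model quantization, which switches calculations from 32-bit floating-point values to 8-bit integers, since the high precision of floating-point values often isn't necessary. This can significantly reduce the model size and also improve performance.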
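As a rough illustration of what quantization does to a single value, the sketch below maps a float to an unsigned 8-bit integer and back using the affine scheme real = scale * (q - zero_point). The function names and the example `scale` and `zero_point` values are assumptions chosen for readability; the converter picks its own parameters for each tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative sketch of affine 8-bit quantization (hypothetical helpers).
// `scale` and `zero_point` are example parameters, not converter output.
uint8_t quantize( float value, float scale, int zero_point ) {
  assert( scale > 0.0f );
  const int q = static_cast<int>( std::lround( value / scale ) ) + zero_point;
  // Clamp to the range an unsigned 8-bit integer can represent
  return static_cast<uint8_t>( std::max( 0, std::min( 255, q ) ) );
}

float dequantize( uint8_t q, float scale, int zero_point ) {
  return scale * static_cast<float>( static_cast<int>( q ) - zero_point );
}
```

With a scale of 0.5 and a zero point of 128, the value 0.3 is stored as 129 and read back as 0.5. That rounding error is the accuracy being traded for a value that takes one byte instead of four.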

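I couldn't get the quantized model to work correctly and consistently with the Softmax function on my STM32F7 Discovery device; it failed with a "failed to invoke" error. The TensorFlow Lite Converter is under active development, and some model structures aren't supported yet. For example, it converts some weights to int8 values instead of uint8, and int8 isn't supported yet. At least, not for now.

That said, if the converter supports all of the elements used in your model, it can dramatically shrink your trained model, and it takes only a few lines of code to enable, so I recommend giving it a try. The required lines are simply commented out in my notebook, so you can uncomment them, generate the final model, and see whether it works on your device.

Microcontroller-based embedded devices in the field usually have limited storage. On a test bench, I could always use a larger memory card for external storage. However, to simulate an environment with no access to external storage for the .tflite file, I can export the model as code so that it lives inside the application itself.

I added a Python script at the end of my notebook to handle this part and turn the model into a model.h file. If you prefer, you can also convert the generated tflite file to a C array on Linux with the xxd -i shell command. Download this file from the left-hand menu and get ready to add it to your embedded application project in the next step.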

import binascii

def convert_to_c_array(data: bytes) -> str:
  # Format the raw bytes as comma-separated hex literals, 10 per line
  hexstr = binascii.hexlify(data).decode("UTF-8")
  hexstr = hexstr.upper()
  array = ["0x" + hexstr[i:i + 2] for i in range(0, len(hexstr), 2)]
  array = [array[i:i+10] for i in range(0, len(array), 10)]
  return ",\n  ".join([", ".join(e) for e in array])

# Read the model file produced by the TensorFlow Lite Converter
with open("model.tflite", "rb") as f:
  tflite_binary = f.read()
ascii_bytes = convert_to_c_array(tflite_binary)
# The parentheses let the expression continue onto the next line
c_file = ("const unsigned char tf_model[] = {\n  " + ascii_bytes +
  "\n};\nunsigned int tf_model_len = " + str(len(tflite_binary)) + ";")
# print(c_file)
with open("model.h", "w") as f:
  f.write(c_file)

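Creating the Embedded Application

Now we're ready to put our trained MNIST model to work on an actual low-power microcontroller. Your exact steps may vary depending on your toolchain, but these are the steps I took with the PlatformIO IDE and my STM32F746G Discovery device:

First, create a new application project, configure it for your Arm Cortex-M powered device, and get your main setup and loop functions ready. I chose the Stm32Cube framework so that I could output the results to the screen. If you're using Stm32Cube, you can download the stm32_app.h and stm32_app.c files from the repository and create a main.cpp containing the setup and loop functions, like this: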

#include "stm32_app.h"

void setup() {
}

void loop() {
}

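Add or download the TensorFlow Lite Micro library. For the PlatformIO IDE, I've preconfigured the library for you, so you can download the tfmicro folder from here into your project's lib folder and add it as a library dependency in your platformio.ini file: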

[env:disco_f746ng]
platform = ststm32
board = disco_f746ng
framework = stm32cube
lib_deps = tfmicro

Include the TensorFlow Lite library headers at the top of your code, like this:

#include "stm32_app.h"
#include "tensorflow/lite/experimental/micro/kernels/all_ops_resolver.h"
#include "tensorflow/lite/experimental/micro/micro_error_reporter.h"
#include "tensorflow/lite/experimental/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "tensorflow/lite/version.h"

void setup() {
}

void loop() {
}

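Copy the model.h file you converted earlier into this project's include folder, and add an include for it below the TensorFlow headers. Then save and build to make sure everything compiles without errors.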

#include "model.h"

Define the following global variables for TensorFlow, which you'll use throughout your code:

// Globals
const tflite::Model* model = nullptr;
tflite::MicroInterpreter* interpreter = nullptr;
tflite::ErrorReporter* reporter = nullptr;
TfLiteTensor* input = nullptr;
TfLiteTensor* output = nullptr;
constexpr int kTensorArenaSize = 5000; // Just pick a big enough number
uint8_t tensor_arena[ kTensorArenaSize ] = { 0 };
float* input_buffer = nullptr;
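As a sanity check on "a big enough number", the arithmetic below gives a lower bound for the arena from the tensor shapes described in this article (a 28x28 float input and 10 float outputs). It is only a sketch: the constant names are made up for illustration, and the real requirement is higher because the hidden layer's activations and the interpreter's own bookkeeping also come out of the arena.

```cpp
#include <cstddef>

// Illustrative arithmetic only. The shapes come from the article;
// the constant names are hypothetical.
static_assert( sizeof( float ) == 4, "this estimate assumes 4-byte floats" );

constexpr std::size_t kInputBytes  = 28 * 28 * sizeof( float ); // 3136 bytes
constexpr std::size_t kOutputBytes = 10 * sizeof( float );      // 40 bytes
constexpr std::size_t kMinimumArenaBytes = kInputBytes + kOutputBytes;

// The lower bound has to fit inside the 5000-byte arena chosen above.
static_assert( kMinimumArenaBytes < 5000, "arena is smaller than its lower bound" );
```

If AllocateTensors() fails on your device, increasing kTensorArenaSize is a reasonable first thing to try.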

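In your setup function, load the model, set up the TensorFlow runner, allocate the tensors, and save the input and output tensors along with a pointer to the input buffer, which we'll treat as a float array. Your function should now look like this: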

void setup() {
  // Load Model
  static tflite::MicroErrorReporter error_reporter;
  reporter = &error_reporter;
  reporter->Report( "Let's use AI to recognize some numbers!" );

  model = tflite::GetModel( tf_model );
  if( model->version() != TFLITE_SCHEMA_VERSION ) {
    reporter->Report( 
      "Model is schema version: %d\nSupported schema version is: %d", 
      model->version(), TFLITE_SCHEMA_VERSION );
    return;
  }
 
  // Set up our TF runner
  static tflite::ops::micro::AllOpsResolver resolver;
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize, reporter );
  interpreter = &static_interpreter;
 
  // Allocate memory from the tensor_arena for the model's tensors.
  TfLiteStatus allocate_status = interpreter->AllocateTensors();
  if( allocate_status != kTfLiteOk ) {
    reporter->Report( "AllocateTensors() failed" );
    return;
  }

  // Obtain pointers to the model's input and output tensors.
  input = interpreter->input(0);
  output = interpreter->output(0);

  // Save the input buffer to put our MNIST images into
  input_buffer = input->data.f;
}

Get TensorFlow ready to run on your Arm Cortex-M device, with a short one-second delay between each call to the loop, like this:

void loop() {
  // Run our model
  TfLiteStatus invoke_status = interpreter->Invoke();
  if( invoke_status != kTfLiteOk ) {
    reporter->Report( "Invoke failed" );
    return;
  }
 
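  // Requires <algorithm>, <iterator>, and <cstdio> at the top of the file
  float* result = output->data.f;
  char resultText[ 256 ];
  // std::distance returns a ptrdiff_t, so cast it to int to match %d
  const int digit = static_cast<int>(
      std::distance( result, std::max_element( result, result + 10 ) ) );
  snprintf( resultText, sizeof( resultText ), "It looks like the number: %d", digit );
  draw_text( resultText, 0xFF0000FF );

  // Wait one second before running again
  delay( 1000 );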
}
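The loop above picks the predicted digit by finding the position of the largest output value. The same idea can be sketched as a small standalone helper; `argmax` is a hypothetical name used here for illustration and is not part of TensorFlow Lite.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Illustrative helper: index of the largest of `count` scores.
// `argmax` is a hypothetical name, not a TensorFlow Lite call.
int argmax( const float* scores, std::size_t count ) {
  assert( scores != nullptr && count > 0 );
  // std::max_element returns a pointer to the largest element
  const float* best = std::max_element( scores, scores + count );
  // The pointer's offset from the start of the array is the class index
  return static_cast<int>( std::distance( scores, best ) );
}
```

Because the model has one output per digit, the index returned for the 10 output values is the recognized digit itself. If two scores tie, std::max_element returns the first one.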

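Your application is now ready to go. It's just waiting for us to feed it some fun MNIST test images to process!

Generating MNIST Sample Data for Embedding

Next, let's get some handwritten digit images for our device to read.

To add these images to the program without relying on external storage, we can convert 100 MNIST images ahead of time from JPEG into 1-bit monochrome bitmaps and store them as C arrays, just like our TensorFlow model. For this, I used an open source web tool called image2cpp, which does most of the work for us in a single batch. If you'd like to generate them yourself, parse the pixels, encode eight pixels at a time into one byte, and write the bytes out in C array format, as shown below.

Note: The web tool generates code for the Arduino IDE, so find and remove every instance of PROGMEM in the generated code; it will then compile with PlatformIO.

For example, this test image of a handwritten zero should be converted to the following array (shown here as the tool emits it, before PROGMEM is removed):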

// 'mnist_0_1', 28x28px
const unsigned char mnist_1 [] PROGMEM = {
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x07, 0x00, 0x00, 0x00, 0x07, 0x00, 0x00, 0x00, 0x0f, 0x00, 0x00, 0x00, 0x1f, 0x80, 0x00,
  0x00, 0x3f, 0xe0, 0x00, 0x00, 0x7f, 0xf0, 0x00, 0x00, 0x7e, 0x30, 0x00, 0x00, 0xfc, 0x38, 0x00,
  0x00, 0xf0, 0x1c, 0x00, 0x00, 0xe0, 0x1c, 0x00, 0x00, 0xc0, 0x1e, 0x00, 0x00, 0xc0, 0x1c, 0x00,
  0x01, 0xc0, 0x3c, 0x00, 0x01, 0xc0, 0xf8, 0x00, 0x01, 0xc1, 0xf8, 0x00, 0x01, 0xcf, 0xf0, 0x00,
  0x00, 0xff, 0xf0, 0x00, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x7f, 0x00, 0x00, 0x00, 0x1c, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
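For readers who would rather generate such arrays themselves, the packing step described above might be sketched as follows. This is an assumption-laden illustration rather than the tool's actual code: the function name `pack_mnist_image`, the one-byte-per-pixel grayscale input, and the threshold of 128 are all choices made for the example. The bit order (leftmost pixel in the most significant bit, 4 bytes per row) matches the array shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: pack a 28x28 grayscale image (one byte per pixel)
// into 1 bit per pixel, most significant bit first, 4 bytes per row.
// `pack_mnist_image` and the default threshold are assumptions.
void pack_mnist_image( const uint8_t* pixels, unsigned char* packed,
                       uint8_t threshold = 128 ) {
  assert( pixels != nullptr && packed != nullptr );
  const int kSize = 28;
  const int kBytesPerRow = 4; // 28 bits rounded up to a whole 32 bits

  // Start with every bit cleared, including the 4 padding bits per row
  for( int i = 0; i < kSize * kBytesPerRow; i++ ) {
    packed[ i ] = 0;
  }
  for( int y = 0; y < kSize; y++ ) {
    for( int x = 0; x < kSize; x++ ) {
      if( pixels[ y * kSize + x ] >= threshold ) {
        // Set the bit for this pixel; x % 8 == 0 lands on bit 7
        packed[ y * kBytesPerRow + x / 8 ] |=
            static_cast<unsigned char>( 0x80 >> ( x % 8 ) );
      }
    }
  }
}
```

Each packed image takes 112 bytes (28 rows of 4 bytes) instead of the 784 bytes needed at one byte per pixel, which matters when 100 test images have to fit in flash alongside the model.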

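Save your generated images to a new file in the project named mnist.h, or, if you'd like to save time and skip this step, download my version directly from GitHub.

At the bottom of the file, I combined all of the arrays into one final collection so that we can pick a random image to process every second: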

const unsigned char* test_images[] = {
  mnist_1, mnist_2, mnist_3, mnist_4, mnist_5, 
  mnist_6, mnist_7, mnist_8, mnist_9, mnist_10,
  mnist_11, mnist_12, mnist_13, mnist_14, mnist_15, 
  mnist_16, mnist_17, mnist_18, mnist_19, mnist_20,
  mnist_21, mnist_22, mnist_23, mnist_24, mnist_25, 
  mnist_26, mnist_27, mnist_28, mnist_29, mnist_30,
  mnist_31, mnist_32, mnist_33, mnist_34, mnist_35, 
  mnist_36, mnist_37, mnist_38, mnist_39, mnist_40,
  mnist_41, mnist_42, mnist_43, mnist_44, mnist_45, 
  mnist_46, mnist_47, mnist_48, mnist_49, mnist_50,
  mnist_51, mnist_52, mnist_53, mnist_54, mnist_55, 
  mnist_56, mnist_57, mnist_58, mnist_59, mnist_60,
  mnist_61, mnist_62, mnist_63, mnist_64, mnist_65, 
  mnist_66, mnist_67, mnist_68, mnist_69, mnist_70,
  mnist_71, mnist_72, mnist_73, mnist_74, mnist_75, 
  mnist_76, mnist_77, mnist_78, mnist_79, mnist_80,
  mnist_81, mnist_82, mnist_83, mnist_84, mnist_85, 
  mnist_86, mnist_87, mnist_88, mnist_89, mnist_90,
  mnist_91, mnist_92, mnist_93, mnist_94, mnist_95, 
  mnist_96, mnist_97, mnist_98, mnist_99, mnist_100,
};

Don't forget to include your new image header at the top of your code:

#include "mnist.h"

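Testing the MNIST Images

With these sample images added to your code, you can add two helper functions: one to read a monochrome image into the input vector, and another to render it on the built-in display. Here are the functions I placed right above the setup function: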

void bitmap_to_float_array( float* dest, const unsigned char* bitmap ) { // Populate dest with the monochrome 1bpp bitmap
  int pixel = 0;
  for( int y = 0; y < 28; y++ ) {
    for( int x = 0; x < 28; x++ ) {
      int B = x / 8; // the Byte # of the row
      int b = x % 8; // the Bit # of the Byte
      // Each row is 4 bytes wide; bit 7 of a byte is its leftmost pixel
      dest[ pixel ] = ( ( bitmap[ y * 4 + B ] >> ( 7 - b ) ) & 0x1 ) ? 1.0f : 0.0f;
      pixel++;
    }
  }
}

void draw_input_buffer() {
  clear_display();
  for( int y = 0; y < 28; y++ ) {
    for( int x = 0; x < 28; x++ ) {
      draw_pixel( x + 16, y + 3, input_buffer[ y * 28 + x ] > 0 ? 0xFFFFFFFF : 0xFF000000 );
    }
  }
}

Finally, in our loop, we can pick a random test image, read it into the input buffer, and draw it to the display, like this:

void loop() {
  // Pick a random test image for input
  const int num_test_images = ( sizeof( test_images ) / 
                                sizeof( *test_images ) );
  bitmap_to_float_array( input_buffer, 
                         test_images[ rand() % num_test_images ] );
  draw_input_buffer();
 
  // Run our model
  ...
}

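If all goes well, your project will build and deploy, and you'll see your microcontroller recognizing the handwritten digits and printing some good results! Can you believe it?

What's Next?

Now that you've seen how a low-power Arm Cortex-M microcontroller can tap into the deep learning capabilities of TensorFlow, you're ready to do much more! From detecting different kinds of animals and objects to training a device to understand speech or answer questions, you and your device can open up possibilities that were once thought to require high-powered computers and devices.

The TensorFlow team has some great examples for TensorFlow Lite for Microcontrollers on their GitHub. Read through these best practices to make sure you get the most out of your AI project when it runs on an Arm Cortex-M device.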