Human Action Recognition with the OpenVINO™ Toolkit

Paula_Ramos

September 19, 2022

CPOL

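In this article, you will learn how to use the OpenVINO™ AI toolkit for real-time human action recognition with synchronous inference.

I joined Intel a few months ago, and I am excited to share what I have been working on. Today I will walk you through my first notebook, which covers human action recognition. I hope you enjoy it and can apply it to your own development work.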

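Human action recognition is an AI capability that finds and classifies a wide range of activities in recorded or live video. For example, if you have a large collection of home videos and want to find one particular memory, human action recognition is the simplest and fastest way to do it. The traditional approach requires you to spend a lot of manual effort and time reviewing every video you own until you find the right one. With human action recognition, you can train an AI model to classify and organize your videos automatically by the activities they contain, so you can find and access your most treasured memories in seconds.

This capability also applies to businesses such as manufacturing. For example, it can provide human workers with solutions that recognize the tasks they perform, give feedback on gestures, and keep them safe by identifying hazards and alerting managers.

These are only a few examples of what human action recognition can do. Over the next few years, I expect to see many more new and exciting use cases in this field. After you run this notebook, let me know which other areas you think could benefit from this AI capability. For now, let's get started.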

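About This Notebook

For this notebook, I use the DeepMind Kinetics-400 human action video dataset, which contains 400 actions in total. These include person actions (for example, writing, drinking, laughing), person-person actions (for example, hugging, shaking hands, playing poker), and person-object actions (riding a scooter, washing clothes, blowing up balloons). The model can also tell apart closely related actions within the same group, such as braiding hair versus brushing hair, salsa dancing versus robot dancing, and playing violin versus playing guitar (Figure 1). For more information about the labels and the dataset, see the research paper "The Kinetics Human Action Video Dataset."

Figure 1. Human action recognition with the OpenVINO™ toolkit

You can run this notebook on a general-purpose computer; no hardware accelerator is needed. One big advantage of the OpenVINO AI toolkit is that it is designed for edge computing, so it can optimize your AI inference models to run efficiently on GPUs, CPUs, and VPUs. Again, none of these accelerators is required. You can use a variety of video sources, such as a video clip from a URL, a locally stored file, or a webcam feed.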

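I also use an action recognition model from Open Model Zoo, which provides a wide variety of pretrained deep learning models and demo applications. The model I use is based on the Video Transformer approach with a ResNet34 architecture (Figure 2). It consists of two models:

The encoder, from the PyTorch framework, with input shape [1x3x224x224]: a batch size of 1, 3 color channels, and an image size of 224x224 pixels. Its output shape is [1x512x1x1], which represents the embedding of the processed frame.
The decoder, also from the PyTorch framework, with input shape [1x16x512]: a batch size of 1, 16 frames that make up a one-second clip, and an embedding dimension of 512.

Figure 2. Pipeline of the human action recognition notebook.

I chose to analyze 16 frames per second, which is the number of frames the Kinetics-400 authors use on average to compute the class scores. The frames are preprocessed so that only the center-cropped image is analyzed, as shown in the GIF in Figure 1.

Together, the two models form a sequence-to-sequence (Seq2Seq) system that recognizes the human activities in the Kinetics-400 dataset. Because the annotations are not exhaustive, the model's performance is not optimal, but it helps us understand the pipeline.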
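
To make the encoder/decoder hand-off concrete, here is a minimal sketch of the control flow: one encoder call per frame and one decoder call per full window of 16 embeddings. The names `run_pipeline`, `encode_frame`, and `decode_clip` are hypothetical stand-ins, not notebook functions, and the non-overlapping windows are an assumption made for illustration.

```python
from typing import Callable, List, Sequence

SAMPLE_DURATION = 16  # frames per clip, as described in the article


def run_pipeline(
    frames: Sequence,
    encode_frame: Callable,
    decode_clip: Callable,
    sample_duration: int = SAMPLE_DURATION,
) -> List:
    """Encode every frame; decode once per full window of embeddings."""
    embeddings = []
    results = []
    for frame in frames:
        embeddings.append(encode_frame(frame))
        if len(embeddings) == sample_duration:
            results.append(decode_clip(embeddings))
            embeddings = []  # assumption: start a fresh, non-overlapping window
    return results


# Stand-in models: the "embedding" is just the frame index, and the
# "decoder" reports the first and last index it received.
clips = run_pipeline(
    frames=range(40),
    encode_frame=lambda frame: frame,
    decode_clip=lambda window: (window[0], window[-1]),
)
print(clips)  # [(0, 15), (16, 31)]
```

With 40 frames, two full windows are decoded and the remaining 8 frames stay in the buffer waiting for more input.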

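You can start recognizing actions in your own videos as follows:

Prepare your installation using OpenVINO Notebooks.
Prepare your video source, either a webcam or a video file that contains common activities you want to detect. Check the dataset labels for the names of the actions that can be detected.
Open the Jupyter notebook on your computer. The notebook runs on Windows, macOS, and Ubuntu, in a range of web browsers.

Real-Time Action Recognition with OpenVINO™

Now I will show you some highlights of the notebook.

1. Download the models

We use Open Model Zoo tools such as omz_downloader, a command-line tool that automatically creates a directory structure and downloads the selected model. In this case, it is the “action-recognition-0001” model from Open Model Zoo.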

if not os.path.exists(model_path_decoder) or not os.path.exists(model_path_encoder):
    download_command = f"omz_downloader " \
                       f"--name {model_name} " \
                       f"--precision {precision} " \
                       f"--output_dir {base_model_dir}"
    ! $download_command
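
The `! $download_command` line only works inside a Jupyter/IPython session. As a rough sketch of how the same step might look in a plain Python script, the command can be built as an argument list and run with `subprocess`. The helper names `build_download_command` and `download_if_missing` are hypothetical, and the `"FP16"` precision and `"model"` output directory are placeholder values assumed for illustration; only the three command-line flags come from the snippet above.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def build_download_command(model_name: str, precision: str, output_dir: str) -> List[str]:
    """Build the omz_downloader call as an argument list (no shell quoting needed)."""
    return [
        "omz_downloader",
        "--name", model_name,
        "--precision", precision,
        "--output_dir", output_dir,
    ]


def download_if_missing(model_paths: Iterable[str], command: List[str]) -> bool:
    """Run the download only when at least one model file is absent."""
    if all(Path(path).exists() for path in model_paths):
        return False
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return True


# Placeholder values, assumed for this example.
command = build_download_command("action-recognition-0001", "FP16", "model")
print(" ".join(command))
```

Calling `download_if_missing([model_path_encoder, model_path_decoder], command)` would then mirror the existence check in the notebook cell.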

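2. Model initialization

To start inference, initialize the inference engine, read the network and weights from file, load the model onto the selected device (CPU in my case), and get the input and output nodes.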

from typing import Tuple

from openvino.runtime import Core

# Initialize inference engine
ie_core = Core()

def model_init(model_path: str) -> Tuple:
    """
    Read the network and weights from file, load the
    model on the CPU and get the input and output nodes

    :param model_path: model architecture path *.xml
    :returns:
             input_key: Input node for model
             output_key: Output node for model
             compiled_model: Compiled model
    """

    # Read the network and corresponding weights from file
    model = ie_core.read_model(model=model_path)
    # Compile the model for the CPU (you can use GPU or MYRIAD as well)
    compiled_model = ie_core.compile_model(model=model, device_name="CPU")
    # Get the input and output nodes
    input_key = compiled_model.input(0)
    output_key = compiled_model.output(0)
    return input_key, output_key, compiled_model

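3. Helper functions

You need a fair amount of code to prepare and visualize your results: create a center-cropped ROI, resize the image, and put text information on each frame.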
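
The article does not list the helper code, so the following is only a sketch of the geometry behind the resize and center-crop helpers, written with plain integers so that it runs without an image library. The function names `scaled_size` and `center_crop_roi` are hypothetical; the actual pixel resizing would be done with an image library such as OpenCV.

```python
from typing import Tuple


def scaled_size(height: int, width: int, size: int) -> Tuple[int, int]:
    """Return (height, width) after scaling so the shorter side equals `size`."""
    scale = size / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_roi(height: int, width: int) -> Tuple[int, int, int, int]:
    """Return (y0, y1, x0, x1) of the largest centered square."""
    side = min(height, width)
    y0 = (height - side) // 2
    x0 = (width - side) // 2
    return y0, y0 + side, x0, x0 + side


# A 720p frame scaled for a 224x224 encoder input.
h, w = scaled_size(720, 1280, 224)
roi = center_crop_roi(h, w)
print((h, w), roi)  # (224, 398) (0, 224, 87, 311)
```

On a NumPy frame, the crop itself would then be the slice `frame[y0:y1, x0:x1]`.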

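4. AI functions

This is where the magic happens.

a. Preprocess the frame before running the encoder (preprocessing)

Before passing a frame through the encoder, prepare the image: scale it so that its shortest dimension equals the chosen size, then center-crop it into a square so that the width and height are equal. The frame must then be transposed from Height-Width-Channels (HWC) to Channels-Height-Width (CHW).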

def preprocessing(frame: np.ndarray, size: int) -> Tuple:
    """
    Preparing frame before Encoder.
    The image should be scaled to its shortest dimension at "size"
    and cropped, centered, and squared so that both width and
    height have lengths "size". Frame must be transposed from
    Height-Width-Channels (HWC) to Channels-Height-Width (CHW).

    :param frame: input frame
    :param size: input size to encoder model
    :returns: resized and cropped frame (with a batch axis), and the crop ROI
    """
    # Adaptive resize (adaptive_resize is one of the notebook's helper functions)
    preprocessed = adaptive_resize(frame, size)
    # Center crop (center_crop is one of the notebook's helper functions)
    (preprocessed, roi) = center_crop(preprocessed)
    # Transpose frame HWC -> CHW and add a leading batch axis
    preprocessed = preprocessed.transpose((2, 0, 1))[None,]
    return preprocessed, roi
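
As a quick check of the last step, the sketch below applies the same transpose and `[None,]` indexing to a dummy frame and prints the resulting shape, which matches the encoder input shape [1x3x224x224] quoted earlier. The all-zeros frame is a made-up stand-in for a real, already cropped video frame.

```python
import numpy as np

# A dummy 224x224 RGB frame in Height-Width-Channels (HWC) layout.
frame_hwc = np.zeros((224, 224, 3), dtype=np.uint8)

# Move the channel axis first, then add a leading batch axis with [None,].
batched_chw = frame_hwc.transpose((2, 0, 1))[None,]

print(batched_chw.shape)  # (1, 3, 224, 224)
```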

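b. Encoder inference per frame (encoder)

This function calls the network previously configured for the encoder model (compiled_model) and extracts the data from the output node. The result is then appended to an array for use by the decoder.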

def encoder(
    preprocessed: np.ndarray,
    compiled_model: CompiledModel
) -> np.ndarray:
    """
    Encoder inference per frame. This function calls the network previously
    configured for the encoder model (compiled_model) and extracts the data
    from the output node. The caller appends the result to an array to be
    used by the decoder.

    :param preprocessed: preprocessed frame
    :param compiled_model: Encoder model network
    :returns: infer_result_encoder: embedding of the frame
    """
    output_key_en = compiled_model.output(0)

    # Get results on action-recognition-0001-encoder model
    infer_result_encoder = compiled_model([preprocessed])[output_key_en]
    return infer_result_encoder

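c. Decoder inference per set of frames (decoder)

This function concatenates the embeddings from the encoder output and transposes the array to match the decoder input size. It calls the network previously configured for the decoder model (compiled_model_de), extracts the logits (yes, logits are a real thing: they are the raw, unnormalized scores), and normalizes them to get confidence values along the specified axis. It then decodes the top probabilities into the corresponding label names.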

def decoder(encoder_output: List, compiled_model_de: CompiledModel) -> Tuple:
    """
    Decoder inference per set of frames. This function concatenates the embeddings
    from the encoder output and transposes the array to match the decoder input size.
    Calls the network previously configured for the decoder model (compiled_model_de), extracts
    the logits and normalizes them to get confidence values along the specified axis.
    Decodes top probabilities into corresponding label names

    :param encoder_output: embeddings for 16 frames
    :param compiled_model_de: Decoder model network
    :returns: decoded_labels: The k most probable actions from the labels list
              decoded_top_probs: confidence for the k most probable actions
    """
    # Concatenate sample_duration frames in just one array
    decoder_input = np.concatenate(encoder_output, axis=0)
    # Organize input shape vector to the Decoder (shape: [1x16x512])
    decoder_input = decoder_input.transpose((2, 0, 1, 3))
    decoder_input = np.squeeze(decoder_input, axis=3)
    output_key_de = compiled_model_de.output(0)
    # Get results on action-recognition-0001-decoder model
    result_de = compiled_model_de([decoder_input])[output_key_de]
    # Normalize logits to get confidence values along specified axis
    # (subtracting the maximum keeps the exponentials numerically stable)
    probs = softmax(result_de - np.max(result_de))
    # Decode top probabilities into corresponding label names
    decoded_labels, decoded_top_probs = decode_output(probs, labels, top_k=3)
    return decoded_labels, decoded_top_probs
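
The decoder relies on two helpers, `softmax` and `decode_output`, whose bodies are not shown in this article. The following is one possible NumPy sketch of them, not the notebook's own implementation; the four-entry label list and the logit values are made up for the example.

```python
from typing import List, Tuple

import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Normalize logits along the last axis into probabilities."""
    exp = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)


def decode_output(
    probs: np.ndarray, labels: List[str], top_k: int = 3
) -> Tuple[List[str], np.ndarray]:
    """Return the top_k labels and their probabilities, most likely first."""
    flat = probs.ravel()  # a [1 x N] batch of one becomes a flat vector
    top_ind = np.argsort(-flat)[:top_k]
    return [labels[i] for i in top_ind], flat[top_ind]


# Toy example with four classes instead of the 400 in Kinetics-400.
labels = ["writing", "drinking", "laughing", "hugging"]
logits = np.array([[1.0, 3.0, 0.5, 2.0]])
probs = softmax(logits - np.max(logits))
top_labels, top_probs = decode_output(probs, labels, top_k=3)
print(top_labels)  # ['drinking', 'hugging', 'writing']
```

Softmax is invariant to subtracting a constant from every logit, so the `- np.max(...)` at the call site changes nothing mathematically; it only guards against overflow in the exponential.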

Run the Complete Notebook Pipeline

Now, let's see the notebook in action.

Choose the video on which you want to run the complete workflow.

video_file = "https://archive.org/serve/ISSVideoResourceLifeOnStation720p/ISS%20Video%20Resource_LifeOnStation_720p.mp4"
run_action_recognition(source=video_file, flip=False, use_popup=False, skip_first_frames=600)

Select the webcam and run the complete workflow again.

run_action_recognition(source=0, flip=False, use_popup=False, skip_first_frames=0)

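Congratulations! You are done. I hope you found this topic interesting and useful for your application development. 😉

To learn more about the OpenVINO toolkit and its capabilities, visit https://www.openvino.ai/. For more hands-on AI training, check out our AI Dev Team Adventures.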

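Notices and Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.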