AI Social Distancing Detector: Improving People Detection with YOLO Object Detection

Dawid Borycki

December 11, 2020

CPOL

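In this final article of the series, we will improve our Python console application for AI-powered social distancing detection.

We already know how to detect people in images from a webcam or a video file and how to calculate the distances between them. However, we found that the underlying AI model (MobileNet) does not always perform well: it fails to detect some of the people in the image.

We will improve this by adopting the state-of-the-art YOLO (You Only Look Once) object detector. There are many tutorials and descriptions of YOLO on the web, so I will not go into detail here. Instead, I will focus on adapting our application to use YOLO rather than MobileNet. In the end, we will achieve the results shown in the figure below. You can find the companion code, with all the required models and video files, here.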

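Loading YOLO Object Detection

To use YOLO for object detection, follow the same path as with MobileNet.

Specifically, first load and configure the model. Then, preprocess the input image so that it is compatible with the YOLO input. Next, run inference and parse the results at the output of the YOLO neural network.

I implemented all of the above in the YoloInference class (see yolo_inference.py in the Part_08 folder). I started by loading the YOLO model. A pre-trained model consists of three files:

config – contains the parameters of the YOLO neural network.
weights – stores the weights of the neural network.
labels – a text file with the labels of the detected objects.

With MobileNet, the config and weights are in a single *.tflite file. Here, the two are separate.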

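To load the YOLO network, I used the readNetFromDarknet method from OpenCV's DNN (deep neural network) module. It returns an [object](https://docs.opencv.ac.cn/4.x/d7/d8d/group__dnn.html#ga1094b367f775c2663c065003f8c01261) that represents the network. This object is similar to the Interpreter in TensorFlow. With this interpreter, I can get information about the network's outputs (see the YoloInference class).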

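def load_model_and_configure(self, config_file_path, weights_file_path): 
    # Load YOLO (assumes module-level imports: import cv2 as opencv, import numpy as np)
    self.interpreter = opencv.dnn.readNetFromDarknet(config_file_path, weights_file_path)
 
    # Get the names of the output layers. getUnconnectedOutLayers returns 1-based
    # indices; flatten() handles both the nested and the flat form that
    # different OpenCV versions return
    layer_names = self.interpreter.getLayerNames()
    output_layer_indices = np.array(self.interpreter.getUnconnectedOutLayers()).flatten()
    self.output_layers = [layer_names[i - 1] for i in output_layer_indices]
 
    # Set the input image size accepted by YOLO and scaling factor
    self.input_image_size = (608, 608)
    self.scaling_factor = 1 / 255.0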

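Note that the method above also sets two members of the YoloInference class:

input_image_size – stores the size of the image passed to the YOLO network. I took these values from the config file.
scaling_factor – a number by which each image pixel is multiplied before inference. With this scaling, the image pixels are converted from integers (values from 0 to 255) to floats (values from 0 to 1).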
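
As an illustration of where the (608, 608) value comes from, here is a minimal sketch that reads the width and height from the [net] section of Darknet-style config text. The function name read_yolo_input_size and the sample config fragment are hypothetical; the article itself hard-codes the size.

```python
def read_yolo_input_size(config_text):
    """Return (width, height) from the [net] section of Darknet-style config text."""
    current_section = None
    values = {}
    for raw_line in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace
        line = raw_line.split('#', 1)[0].strip()
        if not line:
            continue
        # Section headers look like [net] or [convolutional]
        if line.startswith('[') and line.endswith(']'):
            current_section = line[1:-1].strip()
            continue
        # Only collect key=value pairs that belong to the [net] section
        if current_section == 'net' and '=' in line:
            key, value = line.split('=', 1)
            values[key.strip()] = value.strip()
    return int(values['width']), int(values['height'])


# Hypothetical config fragment, used only for illustration
sample_config = """
[net]
batch=1
width=608
height=608
channels=3

[convolutional]
filters=32
"""

input_image_size = read_yolo_input_size(sample_config)
scaling_factor = 1 / 255.0  # maps pixel values 0..255 to roughly 0..1
```

A hand-written parser is used here because Darknet config files repeat section names such as [convolutional], which Python's configparser rejects by default.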

Then, I call the load_model_and_configure function in the constructor of the YoloInference class. I also load the labels (using the same method as for MobileNet).

def __init__(self, config_file_path, weights_file_path, labels_file_path):
    # Load model
    self.load_model_and_configure(config_file_path, weights_file_path)
 
    # Load labels
    self.load_labels_from_file(labels_file_path)

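Running Inference

With the model loaded, we can prepare the input image and then run inference. To preprocess the image, I used the following method: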

def prepare_image(self, image):    
    # Converts image to the blob using scaling factor and input image size accepted by YOLO
    blob = opencv.dnn.blobFromImage(image, self.scaling_factor, 
        self.input_image_size, swapRB=True, crop=False)
 
    return blob

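The method calls blobFromImage from OpenCV's DNN module, which accepts the pixel scaling factor and the target image size. There are two additional parameters: swapRB and crop. The first swaps the red and blue channels. This is required because OpenCV images use the BGR channel order; after the swap, the channels are in RGB order. The second indicates whether the image should be cropped to fit the expected input size. With crop=False, the image is simply resized to that size, without preserving its aspect ratio.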
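
To make the pixel scaling and channel swap concrete, here is a simplified pure-Python sketch of those two steps on a tiny image. It is not OpenCV's implementation: it skips resizing, and it uses nested lists instead of the 4-D NumPy array (batch, channels, height, width) that blobFromImage returns. The function name to_blob_like is hypothetical.

```python
def to_blob_like(image_bgr, scaling_factor=1 / 255.0):
    """Scale pixels and reorder BGR -> RGB, returning [1][3][height][width] nested lists.

    image_bgr is a list of rows; each row is a list of [B, G, R] pixels.
    """
    height = len(image_bgr)
    width = len(image_bgr[0])
    channels_rgb = []
    # Pick R (index 2), then G (index 1), then B (index 0) from each BGR pixel
    for channel_index in (2, 1, 0):
        plane = [
            [image_bgr[y][x][channel_index] * scaling_factor for x in range(width)]
            for y in range(height)
        ]
        channels_rgb.append(plane)
    # Wrap in an outer list to mimic the batch dimension
    return [channels_rgb]


# A 1x2 image: the first pixel is pure blue, the second is pure red (BGR order)
tiny_image = [[[255, 0, 0], [0, 0, 255]]]
blob = to_blob_like(tiny_image)
red_plane, green_plane, blue_plane = blob[0]
```

After the call, the red plane holds a value close to 1 only for the second pixel, and the blue plane only for the first, which is the effect of swapRB=True combined with the 1/255 scaling.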

Then, I run inference (see the detect_people function in YoloInference):

image = self.prepare_image(image)
 
# Set the blob as the interpreter (neural network) input
self.interpreter.setInput(image)
 
# Run inference
output_layers = self.interpreter.forward(self.output_layers)

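Information about the detected objects is encoded in the output_layers variable, which is a list of the network's outputs. We need to parse these outputs to get the detection results.

Interpreting the Results

To process the output layers, I used two for loops. The first iterates over the layers. The second analyzes the detection results of each layer.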

# Process output layers
detected_people = []
 
for output_layer in output_layers:     
    for detection_result in output_layer: 
        object_info = self.parse_detection_result(input_image_size,
            detection_result, threshold)                
 
        if(object_info is not None): 
            detected_people.append(object_info)

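In the code above, I used the helper method parse_detection_result. It accepts three parameters:

input_image_size – the size of the original input image.
detection_result – a single item (detection) from an output layer.
threshold – the score threshold. Detections with a score below this value are rejected.

Given these inputs, the parse_detection_result method decodes the object label and its score, and then looks for objects labeled 'person'. Next, the method decodes the object's bounding box and converts it to a rectangle. This conversion keeps the code compatible with the other parts of the application (for the conversion method, see yolo_inference.py in the companion code). Finally, the method packs the rectangle, label, and score into a Python dictionary. The dictionary has one more entry: box. I will use it later to improve object location detection.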

def parse_detection_result(self, input_image_size, detection_result, threshold): 
    # Get the object label and detection score
    label, score = self.get_object_label_and_detection_score(detection_result)
    
    # Store only objects with the score above the threshold and label 'person'
    if(score > threshold and label == 'person'):
        box = detection_result[0:4]
        
        return {
            'rectangle': self.convert_bounding_box_to_rectangle_points(
                box, input_image_size),
            'label': label,
            'score': float(score),
            'box' : self.adjust_bounding_box_to_image(box, input_image_size)
        }
    else:
        return None
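
The conversion helpers themselves live in the companion code and are not shown here. As a rough sketch of what such a conversion can look like, the example below assumes the box holds a center point, width, and height that are normalized to the 0 to 1 range, and that image_size is (width, height). The function names box_to_pixel_xywh and box_to_corner_points are hypothetical and are not the author's helpers.

```python
def box_to_pixel_xywh(box, image_size):
    """Convert a normalized (center_x, center_y, width, height) box to pixel [left, top, width, height]."""
    center_x, center_y, width, height = box
    image_width, image_height = image_size
    pixel_width = int(width * image_width)
    pixel_height = int(height * image_height)
    # Move from the box center to its top-left corner
    left = int(center_x * image_width - pixel_width / 2)
    top = int(center_y * image_height - pixel_height / 2)
    return [left, top, pixel_width, pixel_height]


def box_to_corner_points(box, image_size):
    """Return the (top-left, bottom-right) pixel points of the same box."""
    left, top, pixel_width, pixel_height = box_to_pixel_xywh(box, image_size)
    return (left, top), (left + pixel_width, top + pixel_height)


# A box centered in a 640x480 image, a quarter of its width and half of its height
example_box = (0.5, 0.5, 0.25, 0.5)
pixel_box = box_to_pixel_xywh(example_box, (640, 480))
corner_points = box_to_corner_points(example_box, (640, 480))
```

The [left, top, width, height] form is the one that NMS-style filtering typically expects, while the two corner points are convenient for drawing.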

To decode the label and score, I used another helper:

def get_object_label_and_detection_score(self, detection_result):
    scores = detection_result[5:]    
 
    class_id = np.argmax(scores)
 
    return self.labels[class_id], scores[class_id]

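It takes the raw scores from the detection result, computes the [position](https://stackoverflow.com/questions/20919150/how-to-get-the-index-of-the-maximum-value-in-a-numpy-array) of the maximum score, and uses that position to look up the corresponding label.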
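
The same lookup can be traced by hand on a made-up detection row. The sketch below assumes the row layout implied by the slices in the code above: four box values, one further value (the objectness score in Darknet YOLO models), and then one score per class. The row values and the three-label list are invented for illustration, and plain Python stands in for np.argmax.

```python
def decode_label_and_score(detection_row, labels):
    """Return the (label, score) pair with the highest class score in one detection row."""
    # Skip the 4 box values and the 1 value that follows them
    class_scores = detection_row[5:]
    # Index of the largest class score (what np.argmax would return)
    class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
    return labels[class_id], class_scores[class_id]


# Hypothetical labels and detection row, for illustration only
labels = ['person', 'bicycle', 'car']
detection_row = [0.5, 0.5, 0.2, 0.6, 0.9, 0.85, 0.05, 0.10]

label, score = decode_label_and_score(detection_row, labels)
```

Here the largest class score sits at index 0 of the class scores, so the row decodes to the first label in the list.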

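Previewing Detected People

We can now test the YOLO detector with our video file. To do so, I used most of the previously developed components, including the video reader, image helper, and distance analyzer. I also imported the YoloInference class. Here is the complete script: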

import sys
 
sys.path.insert(1, '../Part_03/')
sys.path.insert(1, '../Part_05/')
sys.path.insert(1, '../Part_06/')
 
from yolo_inference import YoloInference as model
from image_helper import ImageHelper as imgHelper
from video_reader import VideoReader as videoReader
from distance_analyzer import DistanceAnalyzer as analyzer
 
if __name__ == "__main__": 
    # Load and prepare model
    config_file_path = '../Models/03_yolo.cfg'    
    weights_file_path = '../Models/04_yolo.weights'
    labels_file_path = '../Models/05_yolo-labels.txt'
 
    # Initialize model
    ai_model = model(config_file_path, weights_file_path, labels_file_path)
 
    # Initialize video reader
    video_file_path = '../Videos/01.mp4'
    video_reader = videoReader(video_file_path)
 
    # Detection and preview parameters
    score_threshold = 0.5
    delay_between_frames = 5
 
    # Perform object detection in the video sequence
    while(True):
        # Get frame from the video file
        frame = video_reader.read_next_frame()
 
        # If frame is None, then break the loop
        if(frame is None):
            break
        
        # Perform detection        
        results = ai_model.detect_people(frame, score_threshold)
 
        imgHelper.display_image_with_detected_objects(frame, 
            results, delay_between_frames)

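The script looks almost identical to the one we developed for MobileNet. The only difference is that we use YoloInference instead of Inference. After running the code above, you should see the results shown in the figure below. It is immediately apparent that YOLO detected every person in the image, but we have many overlapping bounding boxes. Let's see how to remove them.

Filtering Out Overlapping Bounding Boxes

From each group of overlapping bounding boxes, we need to select the best one (the one with the highest score). Fortunately, we do not need to implement everything from scratch. OpenCV has a dedicated function for this: NMSBoxes in the DNN module. It uses the non-maximum suppression (NMS) algorithm to filter out the redundant boxes.

NMSBoxes accepts four input parameters:

boxes – the list of bounding boxes.
scores – the list of scores.
threshold – the score threshold.
nms_threshold – the threshold for the NMS algorithm.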
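
To show what the two thresholds do, here is a simplified pure-Python sketch of greedy non-maximum suppression. It is not OpenCV's implementation; it assumes boxes in [left, top, width, height] form and measures overlap with intersection over union (IoU). The function names and sample boxes are hypothetical.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as [left, top, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    overlap_width = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_height = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_width * overlap_height
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, score_threshold, nms_threshold):
    """Return the indices of the boxes to keep, best score first."""
    # Discard low-score boxes, then visit the rest from the highest score down
    candidates = [i for i, score in enumerate(scores) if score > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(intersection_over_union(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept


# Two heavily overlapping boxes around one person, plus a separate box
sample_boxes = [[100, 100, 50, 100], [105, 102, 50, 100], [300, 120, 40, 90]]
sample_scores = [0.9, 0.8, 0.7]

kept_indices = non_maximum_suppression(sample_boxes, sample_scores, 0.5, 0.75)
```

The first two sample boxes have an IoU of about 0.79, so with an NMS threshold of 0.75 the lower-scoring one is dropped, while a threshold of 0.8 would keep both. A lower NMS threshold therefore removes more boxes.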

With the following code, I get the boxes and scores from the detection results. It extracts only the value of the given field from each dictionary:

def get_values_from_detection_results_by_key(self, detection_results, dict_key):        
    return [detection_result[dict_key] for detection_result in detection_results]

Then, to integrate NMSBoxes, I added another helper to the YoloInference class:

def filter_detections(self, detected_people, threshold, nms_threshold):
    # Get scores and boxes
    scores = self.get_values_from_detection_results_by_key(detected_people, 'score')
    boxes = self.get_values_from_detection_results_by_key(detected_people, 'box')
    
    # Get best detections
    best_detections_indices = opencv.dnn.NMSBoxes(boxes, 
        scores, threshold, nms_threshold)                
 
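    # Return filtered people; np.array(...).flatten() also handles the empty
    # result that NMSBoxes returns when no detections are left
    return [detected_people[i] for i in np.array(best_detections_indices).flatten()]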

Finally, I call filter_detections in detect_people as follows:

def detect_people(self, image, threshold):
    # Store the original image size
    input_image_size = image.shape[-2::-1]
 
    # Preprocess image to get the blob
    image = self.prepare_image(image)
 
    # Set the blob as the interpreter (neural network) input
    self.interpreter.setInput(image)
 
    # Run inference
    output_layers = self.interpreter.forward(self.output_layers)
    
    # Process output layers
    detected_people = []        
    for output_layer in output_layers:            
        for detection_result in output_layer:                
            object_info = self.parse_detection_result(input_image_size, 
                detection_result, threshold)                
 
            if(object_info is not None):                    
                detected_people.append(object_info)
 
    # Filter out overlapping detections
    nms_threshold = 0.75
    detected_people = self.filter_detections(detected_people, threshold, nms_threshold)
    
    return detected_people

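Putting Things Together

With all of the above pieces in place, we can now modify the main script as follows (for the full code, see main.py in the Part_08 folder):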

# Get frame from the video file
frame = video_reader.read_next_frame()
 
# If frame is None, then break the loop
if(frame is None):
    break
 
# Perform detection        
results = ai_model.detect_people(frame, score_threshold)
 
#imgHelper.display_image_with_detected_objects(frame, results, delay_between_frames) 
 
# Find people that are too close
proximity_distance_threshold = 150
people_that_are_too_close = analyzer.find_people_that_are_too_close(
    results, proximity_distance_threshold)
 
#Indicate those people in the image
imgHelper.indicate_people_that_are_too_close(
    frame, people_that_are_too_close, delay_between_frames)

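The script sets up the AI model, opens the sample video file, and finds people who are too close to each other. Here, I set the distance threshold to 150 pixels. After running main.py, you will get the results shown in the introduction, which achieves the goal of our AI-powered social distancing detector.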
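
The distance analyzer comes from an earlier part of the series and is not reproduced here. As a rough sketch of the kind of check the 150-pixel threshold implies, the example below compares the centers of every pair of detections. It assumes each detection stores its 'rectangle' as a (top-left, bottom-right) pair of pixel points; that format, the function names, and the sample data are assumptions, and the actual DistanceAnalyzer may work differently.

```python
import itertools
import math


def rectangle_center(rectangle):
    """Center point of a rectangle given as ((left, top), (right, bottom))."""
    (left, top), (right, bottom) = rectangle
    return (left + right) / 2, (top + bottom) / 2


def find_pairs_closer_than(detections, distance_threshold):
    """Return every pair of detections whose centers are closer than the threshold (in pixels)."""
    close_pairs = []
    for first, second in itertools.combinations(detections, 2):
        first_x, first_y = rectangle_center(first['rectangle'])
        second_x, second_y = rectangle_center(second['rectangle'])
        # Euclidean distance between the two centers
        distance = math.hypot(first_x - second_x, first_y - second_y)
        if distance < distance_threshold:
            close_pairs.append((first, second))
    return close_pairs


# Hypothetical detections: the first two people stand 100 pixels apart
sample_detections = [
    {'label': 'person', 'rectangle': ((80, 50), (120, 150))},
    {'label': 'person', 'rectangle': ((180, 50), (220, 150))},
    {'label': 'person', 'rectangle': ((480, 50), (520, 150))},
]

too_close = find_pairs_closer_than(sample_detections, 150)
```

Because the threshold is expressed in pixels, it depends on the camera position and the image resolution, so a value such as 150 would need tuning for a different video.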

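Summary

In this article, we implemented the final version of our application, which indicates people who violate social distancing rules, given images from a camera or a video file.

We started this exciting journey by learning computer vision tasks with OpenCV (image acquisition and display). Then, we learned about image annotation, object detection with TensorFlow Lite, and how to calculate the distances between detected objects. Finally, we integrated the state-of-the-art YOLO object detector to make our application more robust.

This brings our journey to an end. I hope you enjoyed this series of articles! I encourage you to build on it, and perhaps even find new applications.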