AI Queue Length Detection: R-CNN for Custom Object Detection Using Keras

MehreenTahir

27 Oct 2020

CPOL

In this article, we explore some other algorithms used for object detection and learn how to implement them for custom object detection.

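When working with object detection algorithms, the basic approach is to localize the objects of interest by drawing bounding boxes around them. Because an image can contain several objects of interest, and the number of occurrences is not known in advance, the output layer would need a variable length. This means the object detection problem cannot be solved by building a standard deep neural network made of fully connected layers. One workaround is to take different regions of interest from the image and use a neural network to detect whether the desired object is present in each region. This approach also breaks down, because the desired object can appear at different aspect ratios and locations in the image, which leads to a huge number of regions and becomes computationally intractable.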
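To get a feel for why exhaustive region checking becomes intractable, the following sketch counts how many windows a naive scan would have to classify. The scales, aspect ratios, and stride are hypothetical values chosen only for illustration; the image size matches the sample annotation used later in this article.

```python
def count_windows(img_w, img_h, window_sizes, stride):
    """Count every placement of each (width, height) window that fits in the image."""
    total = 0
    for w, h in window_sizes:
        if w > img_w or h > img_h:
            continue  # this window does not fit at all
        cols = (img_w - w) // stride + 1
        rows = (img_h - h) // stride + 1
        total += cols * rows
    return total

# Hypothetical setup: 5 scales x 3 aspect ratios (width / height), stride of 8 pixels
scales = [64, 128, 192, 256, 320]
aspect_ratios = [0.5, 1.0, 2.0]
sizes = [(int(s * r ** 0.5), int(s / r ** 0.5)) for s in scales for r in aspect_ratios]

n = count_windows(594, 720, sizes, stride=8)
print("windows to classify:", n)
```

Even with this coarse grid, the count is far above the roughly 2,000 proposals that R-CNN works with, and a finer stride or more scales only makes it worse.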

Algorithms such as R-CNN, Fast R-CNN, and YOLO were developed to solve this problem. In this article, we'll implement R-CNN to detect people in a given image.

R-CNN

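Regions with CNN features (R-CNN) was proposed by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik in 2014. The idea is that instead of running detection over a huge number of regions, we pass the image through selective search and extract about 2,000 regions from it, called region proposals. We then only have to classify these 2,000 proposed regions. Next, we calculate the intersection over union (IOU) of each proposed region against the ground-truth data and use it to assign labels. To make all of this concrete, we'll implement R-CNN from scratch with Keras here, and we'll cover R-CNN in more detail in later articles of the series.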

Preparing the Dataset for Object Detection

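We'll use the INRIAPerson dataset, which is readily available on Kaggle. The dataset contains two subdirectories, one for test data and one for training data, and each holds images together with their associated annotations. Image annotation is essentially the labeling of data in an image so that objects become perceivable to AI and machine learning models. In general, annotated images may contain people, vehicles, or any other type of object for the machine to recognize. The INRIAPerson dataset, however, was created specifically for detecting people in image files, so it only contains annotations for people. The following annotation file makes the concept clearer.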

<?xml version="1.0" ?>
<annotation>
   	<folder>VOC2007</folder>
   	<filename>crop_000010.png</filename>
   	<source>
          	<database>PASperson Database</database>
          	<annotation>PASperson</annotation>
   	</source>
   	<size>
          	<width>594</width>
          	<height>720</height>
          	<depth>3</depth>
   	</size>
   	<segmented>0</segmented>
   	<object>
          	<name>person</name>
          	<pose>Unspecified</pose>
          	<truncated>0</truncated>
          	<difficult>0</difficult>
          	<bndbox>
                 	<xmin>194</xmin>
                 	<ymin>127</ymin>
                 	<xmax>413</xmax>
                 	<ymax>647</ymax>
          	</bndbox>
   	</object>
</annotation>

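We'll use these annotations to teach our model to recognize the object. Before moving on, we need to parse the annotations into CSV files and extract the data we need. Python provides the ElementTree API for parsing XML files. The function below loads and parses the XML annotation files.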

import glob
import xml.etree.ElementTree as ET

def parse_xml_to_csv(path):
    # iterate over all files to extract the bounding boxes of the people in the corresponding image
    for xml_annot in glob.glob(path + '/*.xml'):
        # load and parse the file, then get the root of the document
        tree = ET.parse(xml_annot)
        root = tree.getroot()
        people = root.findall('object')
        csv_path = xml_annot[:-4] + ".csv"
        with open(csv_path, "w") as file:
            # first line: number of annotated objects in the image
            file.write(str(len(people)))
            # one line per object: xmin ymin xmax ymax
            for person in people:
                bndbox = person.find('bndbox')
                coords = " ".join(bndbox.find(tag).text for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
                file.write("\n" + coords)

Call the function above, passing the path to the annotation files as the argument:

annot_path = "./Annotations"
parse_xml_to_csv(annot_path)  # writes the CSV files; the function returns nothing

Once the function finishes, you'll find a converted CSV file next to each XML annotation.
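As a quick check of the file layout, here is a minimal sketch that reads one converted file back into the box dictionaries used later in the article. The helper name read_boxes is hypothetical, and the sample text is what the conversion should produce for the annotation shown above (an object count on the first line, then one "xmin ymin xmax ymax" line per person).

```python
def read_boxes(csv_text):
    """Parse the text of a converted annotation file into (count, list of boxes)."""
    lines = [ln.strip() for ln in csv_text.splitlines() if ln.strip()]
    count = int(lines[0])  # first line: number of annotated people
    boxes = []
    for ln in lines[1:]:
        xmin, ymin, xmax, ymax = (int(v) for v in ln.split())
        boxes.append({"x1": xmin, "y1": ymin, "x2": xmax, "y2": ymax})
    return count, boxes

# Expected content for crop_000010.xml from the example above
sample = "1\n194 127 413 647\n"
count, boxes = read_boxes(sample)
print(count, boxes)
```

To read a real file, pass in the result of open(csv_path).read() instead of the sample string.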

Implementing R-CNN Using Keras

With the data prepared, we can move on to implementing R-CNN. First, let's import all the libraries we're going to use.

import os                        # to interact with OS
import cv2                       # to perform selective search on images
import keras                     # to implement neural net
import numpy as np               # to work with arrays
import pandas as pd              # for data manipulation
import tensorflow as tf          # for deep learning models
import matplotlib.pyplot as plt  # for plotting

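As mentioned earlier, searching for regions of interest is computationally expensive, so we'll aim for an efficient solution here. Selective search computes similarity based on color, texture, size, or shape and hierarchically merges the most similar regions. The process continues until the whole image becomes a single region. OpenCV implements selective search through the createSelectiveSearchSegmentation function, which lives in the ximgproc module of the opencv-contrib-python package. Add the optimization and selective search to your solution as shown below: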

# OpenCV optimization
cv2.setUseOptimized(True)
# selective search
selective_search = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()

Now, if we apply selective search to a test image, it will generate bounding boxes around the target objects.

At this point, we're interested in how accurate our bounding boxes currently are. For that, we can use intersection over union (IOU), an evaluation metric that measures the accuracy of object detectors. It is calculated by dividing the area of overlap (area of intersection) between the predicted bounding box and the ground-truth bounding box by the total area covered by both (area of union).

def compute_iou(box1, box2):
    # corners of the intersection rectangle
    x_left = max(box1['x1'], box2['x1'])
    y_top = max(box1['y1'], box2['y1'])
    x_right = min(box1['x2'], box2['x2'])
    y_bottom = min(box1['y2'], box2['y2'])

    # the boxes do not overlap at all
    if x_right < x_left or y_bottom < y_top:
        return 0.0

    intersection_area = (x_right - x_left) * (y_bottom - y_top)

    box1_area = (box1['x2'] - box1['x1']) * (box1['y2'] - box1['y1'])
    box2_area = (box2['x2'] - box2['x1']) * (box2['y2'] - box2['y1'])

    union_area = box1_area + box2_area - intersection_area

    iou = intersection_area / union_area

    return iou
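Since selective search returns thousands of rectangles per image, it can help to score all of them against a ground-truth box in one step. The following is a sketch of a vectorized variant using NumPy, not part of the original implementation; the function name iou_one_to_many is hypothetical, and it assumes every box has a positive area. Proposals are given as (x, y, w, h) rows, the layout selective search returns.

```python
import numpy as np

def iou_one_to_many(gt, rects):
    """IoU between one ground-truth box (x1, y1, x2, y2) and N proposals (x, y, w, h)."""
    rects = np.asarray(rects, dtype=float)
    px1, py1 = rects[:, 0], rects[:, 1]
    px2, py2 = px1 + rects[:, 2], py1 + rects[:, 3]
    gx1, gy1, gx2, gy2 = gt

    # width and height of each intersection, clipped at zero when there is no overlap
    inter_w = np.clip(np.minimum(px2, gx2) - np.maximum(px1, gx1), 0, None)
    inter_h = np.clip(np.minimum(py2, gy2) - np.maximum(py1, gy1), 0, None)
    inter = inter_w * inter_h

    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    return inter / union

gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10),   # identical box
             (5, 0, 10, 10),   # shifted right by half its width
             (20, 20, 5, 5)]   # no overlap
ious = iou_one_to_many(gt, proposals)
print(ious)
```

The identical box scores 1, the half-shifted box scores 1/3 (intersection 50, union 150), and the disjoint box scores 0.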

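Now we need to preprocess the data to create a dataset we can pass to our model. We'll iterate over all images and set each one as the base image for selective search. Then we'll iterate over the first 2,000 proposed regions from selective search and calculate the IOU so that we can label the regions containing our object of interest (a person). Each region is labeled according to whether the object is present and appended to our training_images array.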

# `path` (the image directory) and `annot` (the directory holding the CSV annotations)
# are assumed to be defined before this loop runs, e.g. annot = annot_path
training_images = []
training_labels = []
for annot_file in os.listdir(annot):
    # the annotation directory also holds the original XML files; skip them
    if not annot_file.endswith(".csv"):
        continue
    try:
        filename = annot_file.split(".")[0] + ".png"
        img = cv2.imread(os.path.join(path, filename))
        dataframe = pd.read_csv(os.path.join(annot, annot_file))
        ground_truth_values = []
        for row in dataframe.iterrows():
            x1, y1, x2, y2 = (int(v) for v in row[1].iloc[0].split(" "))
            ground_truth_values.append({"x1": x1, "x2": x2, "y1": y1, "y2": y2})

        # set the image as the base image for selective search
        selective_search.setBaseImage(img)

        # initialize fast selective search
        selective_search.switchToSelectiveSearchFast()

        # get the proposed regions
        ssresults = selective_search.process()
        imout = img.copy()
        counter = 0      # positive samples collected for this image
        f_counter = 0    # negative samples collected for this image
        flag = 0
        fflag = 0
        bflag = 0
        for idx, result in enumerate(ssresults):
            # iterate over the first 2000 results from selective search to calculate IOU
            if idx < 2000 and flag == 0:
                for val in ground_truth_values:
                    x, y, w, h = result
                    iou = compute_iou(val, {"x1": x, "x2": x + w, "y1": y, "y2": y + h})

                    # limit the maximum positive samples to 20
                    if counter < 20:
                        # IOU > 0.70 is the goodness measure for a positive, i.e. person detected
                        if iou > 0.70:
                            image = imout[y:y + h, x:x + w]
                            resized = cv2.resize(image, (224, 224), interpolation=cv2.INTER_AREA)
                            training_images.append(resized)
                            training_labels.append(1)
                            counter += 1
                    else:
                        fflag = 1

                    # limit the maximum negative samples to 20
                    if f_counter < 20:
                        if iou < 0.3:
                            image = imout[y:y + h, x:x + w]
                            resized = cv2.resize(image, (224, 224), interpolation=cv2.INTER_AREA)
                            training_images.append(resized)
                            training_labels.append(0)
                            f_counter += 1
                    else:
                        bflag = 1
                # stop once both the positive and the negative quota are full
                if fflag == 1 and bflag == 1:
                    flag = 1
    except Exception as err:
        print(err)
        continue
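The labeling rule buried in the loop above can be isolated into a small function, which makes the thresholds and the per-image caps easier to see. This is a simplified sketch: it assumes one IoU value per proposal (for example, against a single ground-truth box), whereas the loop above compares each proposal with every ground-truth box. The name label_proposals and its parameters are hypothetical.

```python
def label_proposals(ious, pos_thresh=0.70, neg_thresh=0.30, max_pos=20, max_neg=20):
    """Return (proposal_index, label) pairs: 1 = person, 0 = background.

    Proposals with an IoU between the two thresholds are ambiguous and skipped.
    """
    samples = []
    n_pos = n_neg = 0
    for idx, iou in enumerate(ious):
        if iou > pos_thresh and n_pos < max_pos:
            samples.append((idx, 1))
            n_pos += 1
        elif iou < neg_thresh and n_neg < max_neg:
            samples.append((idx, 0))
            n_neg += 1
        if n_pos >= max_pos and n_neg >= max_neg:
            break  # both quotas are full, nothing more to collect
    return samples

example = label_proposals([0.85, 0.10, 0.50, 0.72, 0.05])
print(example)  # [(0, 1), (1, 0), (3, 1), (4, 0)]
```

The proposal with IoU 0.50 is neither a confident positive nor a clear negative, so it does not enter the training set.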

Now training_images and training_labels hold the inputs (X) and labels (y) for our model. Let's start with the imports for the model.

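from keras.layers import Dense
from keras import Model
from keras import optimizers
from keras.optimizers import Adam
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.preprocessing import LabelBinarizer        # to encode the labels
from sklearn.model_selection import train_test_split    # to split the dataset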

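An R-CNN model can in theory be trained from scratch, but that takes a lot of time and performs poorly. Here we'll use transfer learning to save time and get better performance. You can use ImageNet or COCO weights for transfer learning, depending on your preference; below we load the ImageNet weights that Keras provides for VGG16.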

vggmodel = VGG16(weights='imagenet', include_top=True)
vggmodel.summary()

The snippet above prints a summary of the VGG16 layers.

Next, we'll freeze the first ten layers of the model by setting trainable to False.

for layer in vggmodel.layers[:10]:
    layer.trainable = False

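We only care whether a person is present or not, which means we have just two classes to predict, so we'll add a dense layer with two units and softmax activation. The reason for using softmax is that it ensures the outputs sum to 1 (that is, the outputs are probabilities).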

X= vggmodel.layers[-2].output
predictions = Dense(2, activation="softmax")(X)
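To see why softmax outputs can be read as probabilities, here is a minimal NumPy sketch of the function applied to two raw scores. The input values are made up for illustration, and this is a standalone re-implementation, not the code Keras runs internally.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into non-negative values that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 0.5])  # hypothetical raw outputs of the two units
probs = softmax(scores)
print(probs, probs.sum())
```

The larger score receives the larger share, and the two values always add up to 1, so a threshold on one output can be interpreted as a confidence level.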

Finally, we'll compile the model using the Adam optimizer.

model_final = Model(vggmodel.input, predictions)
model_final.compile(loss=keras.losses.categorical_crossentropy, optimizer=Adam(learning_rate=0.001), metrics=["accuracy"])

Our model is now created. Before moving on, we need to encode the labels of the dataset. We can use LabelBinarizer for the encoding.

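class Label_Binarizer(LabelBinarizer):
    def transform(self, y_old):
        Y = super().transform(y_old)
        # for two classes LabelBinarizer returns a single column; append its complement
        if self.y_type_ == 'binary':
            return np.hstack((Y, 1 - Y))
        else:
            return Y

    def inverse_transform(self, Y, threshold=None):
        if self.y_type_ == 'binary':
            return super().inverse_transform(Y[:, 0], threshold)
        else:
            return super().inverse_transform(Y, threshold)

# convert the collected samples to NumPy arrays before encoding
X_new = np.array(training_images)
y_new = np.array(training_labels)
encoded = Label_Binarizer()
Y = encoded.fit_transform(y_new)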
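The effect of this encoding on 0/1 labels can be reproduced with plain Python, which makes the column order explicit. The helper names below are hypothetical and the sketch only covers the two-class case used in this article.

```python
def encode_binary(labels):
    """Map label 1 (person) to [1, 0] and label 0 (background) to [0, 1]."""
    return [[y, 1 - y] for y in labels]

def decode_binary(encoded):
    """Recover the original 0/1 labels from the first column."""
    return [row[0] for row in encoded]

labels = [1, 0, 0, 1]
encoded_labels = encode_binary(labels)
print(encoded_labels)                 # [[1, 0], [0, 1], [0, 1], [1, 0]]
print(decode_binary(encoded_labels))  # [1, 0, 0, 1]
```

Because the first column marks the "person" class, the first output of the network is the probability that a region contains a person, which is why the prediction code in this article checks out[0][0].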

We also need to split the dataset into training and test sets, which can be done with train_test_split from sklearn. Here we keep 80% of the data for training and 20% for testing.

X_train, X_test , y_train, y_test = train_test_split(X_new,Y,test_size=0.20)

Keras provides ImageDataGenerator to feed the dataset to the model. You can also apply horizontal or vertical flips and rotations to augment the dataset.

train_data_prep = ImageDataGenerator(horizontal_flip=True, vertical_flip=True, rotation_range=90)
trainingdata = train_data_prep.flow(x=X_train, y=y_train)
test_data_prep = ImageDataGenerator(horizontal_flip=True, vertical_flip=True, rotation_range=90)
testingdata = test_data_prep.flow(x=X_test, y=y_test)

Adding Keras Callbacks

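Training deep neural networks takes a long time, and we always run the risk of wasting computational resources. To avoid this, Keras provides two callbacks: EarlyStopping and ModelCheckpoint. EarlyStopping is called at the end of every epoch. It stops training as soon as the monitored metric stops improving, which lets you configure an arbitrarily large number of epochs. ModelCheckpoint is also called after every epoch and automatically saves the best-performing model. We can train our model with both callbacks at once, as shown below. The call uses fit, which accepts generators in current Keras releases; older releases used fit_generator for this.

checkpoint = ModelCheckpoint("rcnn.h5", monitor='val_loss', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', save_freq='epoch')
early = EarlyStopping(monitor='val_loss', min_delta=0, patience=100, verbose=1, mode='auto')
hist = model_final.fit(trainingdata, steps_per_epoch=10, epochs=500, validation_data=testingdata, validation_steps=2, callbacks=[checkpoint, early])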
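The patience rule behind early stopping can be illustrated without Keras. The following sketch is a simplified model of the idea, not the Keras implementation: it walks through a list of validation losses and reports the epoch at which training would halt. The function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # new best value: this is when a checkpoint would be saved
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [0.90, 0.70, 0.71, 0.72, 0.73]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # 5
```

The loss last improves in epoch 2, so with a patience of 3 the run ends after epoch 5, and the checkpoint from epoch 2 is the one that survives. With patience=100, as in the call above, training tolerates a much longer plateau before stopping.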

Testing Our Model

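Our model has now been trained and saved as rcnn.h5, and we're ready to make predictions with it. We'll follow the same steps as before: iterate over all images and set each one as the base image for selective search. We then pass the selective search results to our model for prediction, and whenever the model detects a person in a region, we draw a bounding box around it.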

count = 0
for i in os.listdir(path):
    count += 1
    image = cv2.imread(os.path.join(path, i))
    selective_search.setBaseImage(image)
    selective_search.switchToSelectiveSearchFast()
    ssresults = selective_search.process()
    imout = image.copy()
    for idx, res in enumerate(ssresults):
        if idx < 2000:
            x, y, w, h = res
            # crop from the untouched image so earlier rectangles do not leak into later crops
            test_image = image[y:y + h, x:x + w]
            resized = cv2.resize(test_image, (224, 224), interpolation=cv2.INTER_AREA)
            batch = np.expand_dims(resized, axis=0)
            out = model_final.predict(batch)
            # out[0][0] is the predicted probability of the "person" class
            if out[0][0] > 0.65:
                cv2.rectangle(imout, (x, y), (x + w, y + h), (0, 255, 0), 1, cv2.LINE_AA)
    plt.figure()
    # OpenCV loads images as BGR; convert to RGB for matplotlib
    plt.imshow(cv2.cvtColor(imout, cv2.COLOR_BGR2RGB))

Note: The results shown come from a training run that was stopped early.

Limitations of R-CNN

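R-CNN has several drawbacks. At its root, it still implements a sliding window. The only difference is that it is actually implemented as a convolution, which is more efficient than the traditional sliding-window technique. It still needs to run a full CNN forward pass for each of the 2,000 region proposals, and it has a complex, multi-stage training pipeline, which leads to serious performance problems. In addition, because of the long test time, R-CNN is not feasible for real-time use or for crowded areas.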
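A back-of-the-envelope calculation shows how the per-proposal forward pass adds up. The timing below is a hypothetical figure, not a measurement; replace it with a value measured on your own hardware.

```python
def detection_time_seconds(num_proposals, seconds_per_forward_pass):
    """Total inference time for one image when every proposal needs its own forward pass."""
    return num_proposals * seconds_per_forward_pass

per_pass = 0.02  # hypothetical: 20 ms per 224x224 crop
total = detection_time_seconds(2000, per_pass)
print(f"about {total:.0f} s per image")
```

Under that assumption a single image takes about 40 seconds, which is far from the several frames per second a live camera feed would need.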

What's Next?

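In this article, we learned how to implement our first custom object detector using a deep neural network in Keras. We also discussed some limitations of the approach. In the next article of the series, we'll try to overcome the limitations of R-CNN and estimate the number of people present in an area.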