How to Use Deep Learning and OpenCV to Detect Groups of Targets in an Image

Updated on July 1, 2022

When detecting objects and people, you sometimes need to identify groups of people within a single frame for later processing. There are many ways to solve this problem; in this article, I describe an approach that achieves the desired result by combining deep learning with computer vision.

First, you need to define what counts as a group. Here, a group is three or more people who are visually close to one another in the frame.

Tools

Python is used for this task, since many machine learning algorithms have implementations written in it, and the language is simple and elegant. The video has to be processed frame by frame, and the frames can be obtained through the OpenCV library. Each frame is then treated as a separate image, as in the OpenCV tutorials.
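Frame-by-frame reading can be sketched as follows. This is a minimal illustration, assuming the capture object follows the interface of OpenCV's cv2.VideoCapture, whose read() method returns a success flag and a frame. The helper name iter_frames and the file path in the usage comment are hypothetical and not taken from the original project.

```python
def iter_frames(capture):
    """Yield frames until the capture reports that no more are available."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        yield frame


# Usage, assuming OpenCV is installed and "video.mp4" is a placeholder path:
#   import cv2
#   capture = cv2.VideoCapture("video.mp4")
#   for frame in iter_frames(capture):
#       ...  # each frame is a NumPy array of shape (height, width, 3)
#   capture.release()
```

Keeping the loop in a generator separates reading from processing, so the detection step only ever sees one image at a time.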

The next choice was how to detect the targets in each image; the method needed to be lightweight and fast. YOLO was chosen: it runs quite quickly, even on a CPU, and it has many implementations and pre-trained models.

How does YOLO work?

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High-scoring regions of the image are considered detections.

YOLO instead applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

This model has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by the global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. According to the YOLO authors, this makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.

Deep Learning Backend

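The model is easy to load via GluonCV (a deep learning toolkit for computer vision), which uses the MXNet deep learning library as its backend. GluonCV also has good documentation and many implemented algorithms.

To pass a video frame to MXNet, you need to convert the NumPy array that OpenCV returns into an MXNet NDArray. OpenCV delivers frames in BGR channel order, while the pre-trained models expect RGB, so it is usually worth converting the frame with cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) first.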

import mxnet
right_frame = mxnet.nd.array(frame).astype('uint8')  # GluonCV's demos cast frames to uint8 before transform_test

Processing and finding groups

Next, the YOLO implementation yolo3_darknet53_coco was chosen, pre-trained on the COCO dataset. This dataset already includes a “person” class, so no time had to be spent training the neural network.

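from gluoncv import data, model_zoo

# Load YOLOv3 with a Darknet-53 backbone, pre-trained on COCO
net = model_zoo.get_model(
    'yolo3_darknet53_coco',
    pretrained=True
)
# Resize and normalize the frame: x is the network input, img is the resized image for display
x, img = data.transforms.presets.yolo.transform_test(
    right_frame,
    short=512,
    max_size=1024
)
# Run detection on the prepared frame
class_IDs, scores, bounding_boxes = net(x)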

Processing yields three arrays: the class indices (class_IDs), the confidence score for each predicted class (scores), and the bounding boxes (bounding_boxes). Once you have the bounding boxes, you can find groups by checking which person boxes intersect, using the coordinates of each rectangle's corner points.
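The step from the raw network output to a set of person boxes is not shown in the article, so here is one possible sketch. It assumes the three arrays have been flattened to plain Python lists, that "person" has index 0 in the COCO class list (which can be confirmed with net.classes), and that a score threshold of 0.5 is acceptable. The function name select_persons, the threshold, and the dictionary layout are all illustrative assumptions.

```python
PERSON_CLASS_ID = 0    # assumed index of "person" in the COCO class list
SCORE_THRESHOLD = 0.5  # assumed cut-off; tune it for your footage


def select_persons(class_ids, scores, boxes,
                   person_id=PERSON_CLASS_ID, threshold=SCORE_THRESHOLD):
    """Return {detection index: (xmin, ymin, xmax, ymax)} for confident persons."""
    targets = {}
    for index, (class_id, score, box) in enumerate(zip(class_ids, scores, boxes)):
        # Padded or unused detection slots carry a class id of -1 and are skipped here
        if int(class_id) == person_id and score >= threshold:
            targets[index] = tuple(box)
    return targets


# Usage with the network output (first image of the batch):
#   ids = class_IDs[0].asnumpy().ravel().tolist()
#   confs = scores[0].asnumpy().ravel().tolist()
#   boxes = bounding_boxes[0].asnumpy().tolist()
#   targets = select_persons(ids, confs, boxes)
```

Keying the dictionary by detection index gives every person a stable identifier that the grouping step can refer to.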

The next steps are as follows: if the bounding boxes of two targets intersect, that pair is added to the list target_intersection. Then we loop over the pairs and check whether either target already belongs to an existing group. If so, the pair is merged into that group; otherwise it starts a new one.
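The intersection test itself is not listed in the article. A minimal sketch of such a helper, assuming each box is given as (xmin, ymin, xmax, ymax), could look like this. Treating boxes that merely touch at an edge as not intersecting is a design assumption made here, not something the original specifies.

```python
def bounding_box_intersection(box_a, box_b):
    """Return True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    ax_min, ay_min, ax_max, ay_max = box_a
    bx_min, by_min, bx_max, by_max = box_b
    # Two boxes overlap only if their extents overlap on both axes
    overlap_x = ax_min < bx_max and bx_min < ax_max
    overlap_y = ay_min < by_max and by_min < ay_max
    return overlap_x and overlap_y
```

If people standing close together but not overlapping should also count as neighbors, one option is to enlarge each box by a small margin before testing.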

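# Collect each pair of targets whose bounding boxes intersect (k1 < k2 avoids duplicate pairs)
target_intersection = [
    (k1, k2)
    for k1, v1 in targets.items()
    for k2, v2 in targets.items()
    if k1 < k2 and bounding_box_intersection(v1, v2)
]
groups = []
for pair in target_intersection:
    merged = set(pair)
    remaining = []
    for group in groups:
        # Absorb every existing group that shares a target with this pair
        if merged & group:
            merged |= group
        else:
            remaining.append(group)
    # The pair either extends the groups it touched or starts a new one
    groups = remaining + [merged]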
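One way to make the grouping independent of the order in which pairs are visited, and to apply the "three or more people" rule from the definition of a group, is a small union-find structure. The sketch below is an assumption about how a function such as find_groups could be written, since its body is not shown in the article. The signature, the intersects parameter, and the min_size default are illustrative.

```python
def find_groups(targets, intersects, min_size=3):
    """Return the sizes of all groups with at least min_size members.

    targets maps an id to a bounding box; intersects(box_a, box_b) is a
    caller-supplied overlap test.
    """
    parent = {key: key for key in targets}

    def find(key):
        # Follow parent links to the representative, shortening the path on the way
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    keys = list(targets)
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:
            if intersects(targets[k1], targets[k2]):
                # Link the two components that k1 and k2 belong to
                parent[find(k1)] = find(k2)

    # Count members per component and keep only components that are large enough
    sizes = {}
    for key in keys:
        root = find(key)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted((n for n in sizes.values() if n >= min_size), reverse=True)
```

With this approach, a person whose box overlaps two previously separate clusters joins them into one group, and pairs of only two people are dropped from the result.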

Result

After the analysis has completed, the program reports the groups it identified, as shown.

In the picture, the targets are highlighted with bounding boxes, each labeled with the class name and the confidence score for that class. The program result:

groups = find_groups(persons)
print(f"Result = {groups}")
Result = [4, 3]

Result = [4, 3] means that two groups of people were found, with four people in the first and three in the second. A result of [] means that no groups of people were detected in the image.

Conclusion

The ability to recognise groups and targets within images has been a long-anticipated advancement in technology, and AI and ML can perform this task far faster than a human can. The approach described here identifies groups of people within a frame using deep learning and OpenCV. It is fast, works with a pre-trained model, and can be used in video surveillance systems and video montages.
