How To: Detect Groups of Targets on an Image Using Deep Learning and OpenCV

When detecting objects and people, it is sometimes necessary to identify groups of people within a single frame for further processing. There are many ways to solve this problem; in this article I describe an approach that achieves the desired result by combining deep learning with computer vision.

First, we need to define what counts as a group. Here, a group is three or more people who are visually close to each other in the frame.


Python is used for the implementation, since many machine learning algorithms have Python implementations and the language is simple and elegant. The video has to be processed frame by frame, and the frames can be obtained with the OpenCV library. Each frame is then treated as a separate image, as in the OpenCV tutorials.
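
Frame-by-frame reading can be sketched as follows. The helper name `iter_frames` and its `step` parameter are assumptions made for illustration; the only requirement is an object with an OpenCV-style `read()` method that returns a success flag and a frame, which is what `cv2.VideoCapture` provides.

```python
def iter_frames(capture, step=1):
    """Yield (index, frame) pairs from an OpenCV-style capture object.

    `capture` is anything whose read() returns (ok, frame), such as
    cv2.VideoCapture. `step` is a hypothetical knob for skipping frames.
    """
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of stream or read error
            break
        if index % step == 0:
            yield index, frame
        index += 1


# Possible use with OpenCV (the file name is a placeholder):
#   import cv2
#   capture = cv2.VideoCapture("video.mp4")
#   for index, frame in iter_frames(capture, step=5):
#       ...  # frame is a NumPy array in BGR channel order
#   capture.release()
```

Processing only every n-th frame is one simple way to trade accuracy over time for speed when the detector is slower than the video frame rate.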

The next choice was how to detect the AI targets in the image: the method had to be lightweight and fast. YOLO was chosen. It runs fairly quickly, even on a CPU, and it has many implementations and pretrained models.

How does YOLO work?

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High scoring regions of the image are considered detections.

YOLO instead applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.
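
To make the weighting concrete, here is a toy sketch. It assumes the common YOLO formulation in which each predicted box carries an objectness confidence and per-class probabilities, and the score of a class is their product. The function name, the threshold, and the sample numbers are made up for illustration and do not come from a real model.

```python
def score_boxes(predictions, threshold=0.5):
    """Keep (box, class_name, score) for boxes whose best score passes threshold.

    Each prediction is (box, objectness, {class_name: probability}).
    """
    kept = []
    for box, objectness, class_probs in predictions:
        # Most likely class for this box and its probability.
        name, prob = max(class_probs.items(), key=lambda item: item[1])
        score = objectness * prob  # box confidence weighted by class probability
        if score >= threshold:
            kept.append((box, name, score))
    return kept


predictions = [
    ((10, 20, 110, 220), 0.9, {"person": 0.8, "dog": 0.1}),
    ((300, 40, 360, 90), 0.4, {"person": 0.5, "dog": 0.3}),
]
detections = score_boxes(predictions)  # only the first box passes the threshold
```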

YOLO has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. According to its authors, this makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.

Deep Learning Backend

The model itself is easily loaded through GluonCV, a deep learning toolkit for computer vision that uses the MXNet deep learning library as its backend. The library is well documented and implements many algorithms.

To pass a video frame to MXNet, the NumPy array has to be converted to an MXNet array.

right_frame = mxnet.nd.array(frame)

Processing and finding groups

Next, the YOLO implementation yolo3_darknet53_coco, pre-trained on the COCO dataset, was chosen. This dataset already contains a “person” class, so no time had to be spent training the neural network.

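from gluoncv import model_zoo, data
# The arguments below are assumed; the original listing is cut off after each "(".
net = model_zoo.get_model('yolo3_darknet53_coco', pretrained=True)
x, img = data.transforms.presets.yolo.transform_test(right_frame)
class_IDs, scores, bounding_boxes = net(x)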

Processing yields three arrays: the class indices (class_IDs), the confidence score of each predicted class (scores), and the bounding boxes (bounding_boxes). With the bounding boxes in hand, groups can be found by checking which target boxes intersect, using the corner coordinates of each rectangle.
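
Before grouping, the detections have to be narrowed down to people. The sketch below is one way to do that, under several assumptions: the three outputs have already been converted to plain per-detection sequences for a single image (with GluonCV this might be done by calling `.asnumpy()` on the first batch element), `person` is class index 0 (worth checking against `net.classes`), and a score threshold of 0.5 is acceptable. The function name `select_persons` is hypothetical.

```python
def select_persons(class_ids, scores, boxes, person_id=0, threshold=0.5):
    """Return {index: (xmin, ymin, xmax, ymax)} for confident person detections.

    Inputs are parallel per-detection sequences for one image.
    person_id=0 and threshold=0.5 are assumptions, not values from the article.
    """
    persons = {}
    for index, (class_id, score, box) in enumerate(zip(class_ids, scores, boxes)):
        if int(class_id) == person_id and score >= threshold:
            persons[index] = tuple(float(c) for c in box)
    return persons


# Made-up detections: a person, another class, a low-score person, a padding row.
persons = select_persons(
    class_ids=[0, 16, 0, -1],
    scores=[0.98, 0.91, 0.30, -1.0],
    boxes=[(10, 20, 110, 220), (5, 5, 50, 50),
           (200, 30, 260, 210), (-1, -1, -1, -1)],
)
```

The result is a dictionary keyed by detection index, which matches the `targets.items()` iteration used in the grouping code.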

The following steps were then performed: if the bounding boxes of two targets intersect, the pair is added to the array target_intersection. Then, looping over target_intersection, we check whether either target of a pair already belongs to an existing group. If so, that group is updated; otherwise a new group is created.
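
The article does not show the `bounding_box_intersection` helper. A minimal sketch follows, assuming each box is given as (xmin, ymin, xmax, ymax) and that boxes which merely touch along an edge do not count as intersecting; both are assumptions.

```python
def bounding_box_intersection(box_a, box_b):
    """Return True if two (xmin, ymin, xmax, ymax) boxes overlap.

    Boxes that only share an edge or a corner are treated as not
    intersecting; that choice is an assumption.
    """
    ax_min, ay_min, ax_max, ay_max = box_a
    bx_min, by_min, bx_max, by_max = box_b
    # Two rectangles overlap only if their projections overlap on both axes.
    overlap_x = ax_min < bx_max and bx_min < ax_max
    overlap_y = ay_min < by_max and by_min < ay_max
    return overlap_x and overlap_y
```

If people standing close together without overlapping should also count as neighbors, one option is to enlarge each box by a small margin before the test.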

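# Collect every pair of distinct targets whose bounding boxes overlap.
target_intersection = [
   (k1, k2)  # assumed: a pair is stored as a tuple of target keys
   for k1, v1 in targets.items()
   for k2, v2 in targets.items()
   if k1 != k2 and bounding_box_intersection(v1, v2)
]
groups = []
for pair in target_intersection:
   found_group = False
   for group in groups:
      if set(pair).intersection(group):
         # One of the two targets is already in this group: add both (assumed).
         group.update(pair)
         found_group = True
   if not found_group:
      # Neither target belongs to a group yet: start a new one (assumed).
      groups.append(set(pair))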
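
A pairwise loop of this kind has one subtle case: a pair that bridges two existing groups should cause those groups to merge. A possible alternative, which is not the author's implementation, is a small union-find that handles merging automatically and then keeps only clusters of three or more, per the definition of a group given earlier. The signature of `find_groups` here, including the `overlaps` and `min_size` parameters, is hypothetical.

```python
from itertools import combinations


def find_groups(targets, overlaps, min_size=3):
    """Return the sizes of clusters of connected targets, largest first.

    targets:  {key: box}
    overlaps: function(box_a, box_b) -> bool
    min_size: smallest cluster that counts as a group (assumed to be 3)
    """
    parent = {key: key for key in targets}

    def find(key):
        # Follow parent links to the root, shortening the path on the way.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for (k1, v1), (k2, v2) in combinations(targets.items(), 2):
        if overlaps(v1, v2):
            parent[find(k1)] = find(k2)  # merge the two clusters

    sizes = {}
    for key in targets:
        root = find(key)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted((n for n in sizes.values() if n >= min_size), reverse=True)


# Toy example with 1-D "boxes" (start, end) to keep the numbers readable.
def overlaps_1d(a, b):
    return a[0] < b[1] and b[0] < a[1]


targets = {"a": (0, 4), "b": (3, 7), "c": (6, 9), "d": (20, 25), "e": (24, 30)}
print(find_groups(targets, overlaps_1d))  # prints [3]
```

In the toy data, a, b and c form a chain of overlaps and count as one group of three, while d and e overlap only each other and are dropped because two people fall below the minimum group size.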


In the picture, the AI targets are highlighted with bounding boxes, each labeled with the class name and the confidence score for that class. The program output:

groups = find_groups(persons)
print(f"Result = {groups}")
Result = [6, 5]

Result = [6, 5] means there are two groups of people, with six people in the first and five in the second. An empty result, [], means that there are no groups of people in the image.


This article presented a concept for finding groups of AI targets in a frame using deep learning and OpenCV. The method is fast and accurate, and it can also be used in video surveillance systems, video editing, and similar applications.

Updated on October 11, 2021