The Real-Time Detection Transformer (RT-DETR), developed by Baidu, is an advanced end-to-end object detector that delivers real-time performance with high accuracy. It builds on Vision Transformer (ViT) ideas to handle multiscale features efficiently, decoupling intra-scale interaction from cross-scale fusion. RT-DETR is also highly flexible: inference speed can be adjusted by using a different number of decoder layers, with no retraining required. The model performs especially well on accelerated backends such as CUDA with TensorRT, surpassing many other real-time object detectors in both speed and accuracy.
In the accompanying research paper by Baidu Inc., "DETRs Beat YOLOs on Real-Time Object Detection," the authors analyze the negative impact of non-maximum suppression (NMS) on real-time detectors, propose an efficient hybrid encoder for multi-scale feature processing, and introduce IoU-aware query selection to further improve performance. On COCO val2017 with a T4 GPU, RT-DETR-L achieves 53.0% AP at 114 FPS and RT-DETR-X achieves 54.8% AP at 74 FPS, outperforming YOLO detectors of the same scale in both speed and accuracy. RT-DETR-R50 achieves 53.1% AP at 108 FPS, beating DINO-DeformableDETR-R50 by 2.2% AP while running about 21 times faster.
Figure 1. Compared to previous state-of-the-art real-time object detectors, RT-DETR achieves superior performance.
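For readers who want to try RT-DETR quickly, the snippet below is a minimal inference sketch. It assumes the Ultralytics Python package, which distributes COCO-pretrained RT-DETR weights; the exact weight file name and the sample image URL are assumptions and may differ in your setup.

```python
# Minimal inference sketch, assuming the `ultralytics` package is installed
# (pip install ultralytics) and provides COCO-pretrained RT-DETR checkpoints.
from ultralytics import RTDETR

# Weight file name is an assumption; substitute the checkpoint you actually have.
model = RTDETR("rtdetr-l.pt")

# Run inference on a sample image; results contain boxes, confidences, and class ids.
results = model("https://ultralytics.com/images/bus.jpg")
for r in results:
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)
```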
Object detection involves identifying and localizing specific objects within images or videos, and object detection models have practical applications across many fields:
Autonomous Vehicles: Essential for enabling autonomous vehicles to detect and track pedestrians, vehicles, traffic signs, and other road objects.
Retail Analytics: Helps in tracking and analyzing customer behavior, monitoring inventory, and reducing theft by identifying suspicious activities.
Facial Recognition: Face detection is a key component of facial recognition systems, used for access control, identity verification, and security.
Environmental Monitoring: Useful for tracking wildlife movements, monitoring deforestation, and assessing ecosystem changes.
Gesture Recognition: Facilitates interaction with devices by interpreting human gestures, used in gaming and virtual reality applications.
Agriculture: Assists in crop monitoring, pest detection, and yield estimation by identifying and analyzing plants, fruits, and pests in agricultural images.
These are just a few examples, as object detection plays a crucial role in many other areas.
Recently, transformer-based detectors have shown remarkable performance by using Vision Transformer (ViT) style encoders to process multiscale features efficiently, decoupling intra-scale interaction from cross-scale fusion. These models are also highly adaptable: inference speed can be tuned by varying the number of decoder layers used, without retraining.
Model Architecture
The RT-DETR model comprises a backbone, a hybrid encoder, and a transformer decoder with auxiliary prediction heads. The architecture leverages features from the last three stages of the backbone (S3, S4, S5) as input to the encoder, which uses intra-scale interaction and cross-scale fusion to transform multi-scale features into an image feature sequence. IoU-aware query selection is then applied to choose a fixed number of image features from the encoder output as initial queries for the decoder. The decoder, along with auxiliary prediction heads, iteratively refines these queries to generate object boxes and confidence scores.
Figure 2. Overview of RT-DETR.
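To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline just described: backbone stages S3-S5, a (heavily simplified) encoder that produces a flattened multi-scale feature sequence, top-K query selection on that sequence, and a decoder with prediction heads. All module choices, channel sizes, and stand-in layers are assumptions for illustration; this is not the reference implementation.

```python
# Illustrative pipeline sketch only; module names, sizes, and stand-ins are assumptions.
import torch
import torch.nn as nn

class RTDETRPipelineSketch(nn.Module):
    def __init__(self, num_classes=80, hidden_dim=256, num_queries=300):
        super().__init__()
        # Toy backbone stand-in: three strided convs emulate stages S3, S4, S5.
        self.s3 = nn.Conv2d(3, hidden_dim, 3, stride=8, padding=1)
        self.s4 = nn.Conv2d(hidden_dim, hidden_dim, 3, stride=2, padding=1)
        self.s5 = nn.Conv2d(hidden_dim, hidden_dim, 3, stride=2, padding=1)
        # Hybrid-encoder stand-in: a 1x1 projection per scale before flattening.
        self.proj = nn.ModuleList(nn.Conv2d(hidden_dim, hidden_dim, 1) for _ in range(3))
        # Score head used on the encoder output for query selection.
        self.enc_score_head = nn.Linear(hidden_dim, num_classes)
        # Decoder stand-in: refines selected queries against the full feature sequence.
        layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(hidden_dim, num_classes)
        self.bbox_head = nn.Linear(hidden_dim, 4)
        self.num_queries = num_queries

    def forward(self, images):
        s3 = self.s3(images)                       # stride 8
        s4 = self.s4(s3)                           # stride 16
        s5 = self.s5(s4)                           # stride 32
        feats = [p(f).flatten(2).transpose(1, 2)   # (B, H*W, C) per scale
                 for p, f in zip(self.proj, (s3, s4, s5))]
        memory = torch.cat(feats, dim=1)           # multi-scale feature sequence
        # Query selection: keep the top-K tokens by classification confidence.
        scores = self.enc_score_head(memory).sigmoid().max(-1).values
        topk = scores.topk(self.num_queries, dim=1).indices
        queries = memory.gather(1, topk.unsqueeze(-1).expand(-1, -1, memory.size(-1)))
        # Decoder refines the queries; prediction heads output boxes and class logits.
        refined = self.decoder(queries, memory)
        return self.bbox_head(refined).sigmoid(), self.class_head(refined)

boxes, logits = RTDETRPipelineSketch()(torch.randn(1, 3, 640, 640))
print(boxes.shape, logits.shape)   # (1, 300, 4), (1, 300, 80)
```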
A novel Efficient Hybrid Encoder is proposed for RT-DETR. It consists of two modules: the Attention-based Intra-scale Feature Interaction (AIFI) module, which applies self-attention only to the highest-level feature map (S5), and the CNN-based Cross-scale Feature-fusion Module (CCFM), which fuses features across scales with convolutions. Additionally, to create scalable versions of RT-DETR, the ResNet backbone is replaced with HGNetv2.
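Below is a hedged sketch of those two modules: AIFI runs a standard transformer encoder layer over the flattened S5 map only (intra-scale interaction), while CCFM fuses scales with convolutions (cross-scale fusion). The layer counts, widths, and the simple upsample-and-add fusion are illustrative assumptions, not the exact RT-DETR design.

```python
# Hedged sketch of AIFI and CCFM; sizes and fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIFI(nn.Module):
    """Attention-based Intra-scale Feature Interaction, applied to S5 only."""
    def __init__(self, dim=256, nhead=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead, dim_feedforward=1024,
                                                batch_first=True)

    def forward(self, s5):
        b, c, h, w = s5.shape
        tokens = s5.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = self.layer(tokens)                  # self-attention within one scale
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class CCFM(nn.Module):
    """CNN-based Cross-scale Feature-fusion Module (simplified top-down path)."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse4 = nn.Conv2d(dim, dim, 3, padding=1)
        self.fuse3 = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, s3, s4, s5):
        # Top-down: upsample the higher-level map and fuse it into the next scale.
        p4 = self.fuse4(s4 + F.interpolate(s5, size=s4.shape[-2:], mode="nearest"))
        p3 = self.fuse3(s3 + F.interpolate(p4, size=s3.shape[-2:], mode="nearest"))
        return p3, p4, s5

# Usage sketch: S3/S4/S5 would come from the backbone after projection to `dim` channels.
s3, s4, s5 = (torch.randn(1, 256, s, s) for s in (80, 40, 20))
s5 = AIFI()(s5)                      # intra-scale interaction on S5 only
p3, p4, p5 = CCFM()(s3, s4, s5)      # cross-scale fusion across the pyramid
print(p3.shape, p4.shape, p5.shape)
```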
Key Features
Efficient Hybrid Encoder: RT-DETR employs an efficient hybrid encoder that processes multiscale features by separating intra-scale interaction from cross-scale fusion. This Vision Transformer-inspired design lowers computational cost and makes real-time object detection feasible.
IoU-Aware Query Selection: RT-DETR improves the initialization of object queries through IoU-aware query selection, so the decoder starts from encoder features that are both confidently classified and well localized, which improves detection accuracy; a minimal sketch of the idea follows this list.
Adaptable Inference Speed: RT-DETR supports flexible inference speed by using a different number of decoder layers at run time, with no retraining required (see the second sketch after this list). This flexibility makes it practical for a wide range of real-time object detection applications.
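First, a hedged sketch of the training-time idea behind IoU-aware query selection: the classification target of a matched query is set to its IoU with the ground-truth box, so high classification scores also imply well-localized boxes, and the top-K scores used to pick initial queries become more reliable. The IoU-weighted BCE form and the torchvision helper below are illustrative assumptions, not the paper's exact loss.

```python
# Illustrative IoU-aware classification loss; not the paper's exact formulation.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def iou_aware_cls_loss(pred_logits, pred_boxes, gt_boxes, gt_labels):
    """pred_logits: (N, C); pred_boxes, gt_boxes: (N, 4) xyxy matched pairs; gt_labels: (N,)."""
    # IoU of each matched prediction with its ground-truth box.
    iou = box_iou(pred_boxes, gt_boxes).diagonal()
    # Use the IoU as a soft classification target for the ground-truth class.
    target = torch.zeros_like(pred_logits)
    target[torch.arange(len(gt_labels)), gt_labels] = iou
    return F.binary_cross_entropy_with_logits(pred_logits, target)

# Tiny demo with random matched pairs (5 queries, 80 classes).
logits = torch.randn(5, 80)
boxes = torch.rand(5, 4)
boxes[:, 2:] += boxes[:, :2]                      # ensure valid xyxy boxes
labels = torch.randint(0, 80, (5,))
print(iou_aware_cls_loss(logits, boxes, boxes.clone(), labels))   # IoU = 1 for identical boxes
```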
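Second, a hedged sketch of why the inference speed is adjustable: every decoder layer is trained with its own auxiliary prediction head, so at inference you can stop after any layer and still get valid boxes, trading a little accuracy for speed. The generic decoder below is a stand-in, not RT-DETR's actual decoder.

```python
# Generic truncatable decoder stand-in; layer count and head design are assumptions.
import torch
import torch.nn as nn

class TruncatableDecoder(nn.Module):
    def __init__(self, dim=256, nhead=8, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # One prediction head per layer (auxiliary heads), all trained jointly.
        self.heads = nn.ModuleList(nn.Linear(dim, 4) for _ in range(num_layers))

    def forward(self, queries, memory, num_layers=None):
        """Run only the first `num_layers` decoder layers at inference time."""
        n = num_layers or len(self.layers)
        for layer, head in zip(self.layers[:n], self.heads[:n]):
            queries = layer(queries, memory)
            boxes = head(queries).sigmoid()
        return boxes

decoder = TruncatableDecoder()
queries, memory = torch.randn(1, 300, 256), torch.randn(1, 8400, 256)
fast = decoder(queries, memory, num_layers=3)    # fewer layers: faster, slightly less accurate
full = decoder(queries, memory)                  # all six layers: best accuracy
print(fast.shape, full.shape)
```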