Comparison of Backbones for Microscopic Object Detection Algorithms


Natthaphon Hongcharoen
Parinya Sanguansat
Sanparith Marukatat


Modern object detection methods mostly consist of a feature extractor and a detection part. With the rise of Vision Transformers and more advanced variants of Convolutional Neural Networks, we present comparative experimental results of using different feature extractors with the Cascade Faster R-CNN object detection technique. We also demonstrate that initializing the entire object detection model from complete pre-trained weights matters more than a slight gain in feature extractor performance that requires all detection layers to be randomly initialized. The trained models were evaluated using the mean Average Precision (mAP) metric on unseen laboratory-generated data, and by visual evaluation on real-world data from medical diagnoses. Modern Vision Transformer techniques such as PVT and Swin significantly outperformed traditional Convolutional Neural Network models such as ResNet and ResNeXt: PVT V2 achieved 78% mAP at IoU 0.7 with only the feature extractor pre-trained on the ImageNet dataset, compared to 60.5% for ResNet 101 and 59.2% for ResNeXt 101-64x4 with similar weight initialization. The results also show a significant increase in accuracy when the entire pre-trained model is used as the weight initializer for every layer except the final output. ResNet 50 and ResNet 101 achieved 75.6% and 77.2% mAP respectively, a significant improvement over 59.5% and 60.5%. ResNeXt with a pre-trained detector achieved 78.8% and 79.2% at cardinalities of 64 and 32 respectively, better than PVT V2 with a randomly initialized detector part.
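To make the "mAP at IoU 0.7" criterion concrete, the following is a minimal sketch of how a single predicted box could be checked against a ground-truth box. It is an illustration only, not the authors' evaluation code: the function names `iou` and `is_true_positive` and the `(x1, y1, x2, y2)` corner format are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.7):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

With a threshold of 0.7, a 10x10 prediction shifted by one pixel from its ground truth still counts as a match (IoU of 90/110), while a shift of two pixels does not (IoU of 80/120). Average Precision is then computed from the resulting matches ranked by detection confidence, and mAP averages it over classes.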

Article Details

How to Cite
Hongcharoen, N., Sanguansat, P., & Marukatat, S. (2023). Comparison of Backbones for Microscopic Object Detection Algorithms. INTERNATIONAL SCIENTIFIC JOURNAL OF ENGINEERING AND TECHNOLOGY (ISJET), 7(1), 25–40. Retrieved from
Research Article

