Aryn/deformable-detr-DocLayNet – open-source Layout Model
The Deformable DETR model, trained on DocLayNet, achieves 57.1 mAP for document layout object detection using a transformer architecture. It is available on Hugging Face and has been downloaded 108,960 times in the past month.
The Deformable DETR model, trained on the DocLayNet dataset, is designed for object detection tasks. It uses an encoder-decoder transformer architecture with a convolutional backbone, incorporating two heads for class label prediction and bounding box regression. The model employs object queries to identify specific objects within images, with a typical configuration of 100 queries for datasets like COCO. Training involves a bipartite matching loss, which aligns predicted classes and bounding boxes with ground truth annotations using the Hungarian matching algorithm. The model achieves a mean Average Precision (mAP) of 57.1 on the DocLayNet dataset, which consists of 80,000 annotated pages across 11 classes.
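The bipartite matching step can be illustrated with a minimal sketch. This is not the exact DETR loss (the real matching cost also includes a generalized IoU term and per-term weights); the function name hungarian_match, the simplified cost (negative class probability plus L1 box distance), and the toy tensors are illustrative assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Match ground-truth objects to predictions with a simplified DETR-style cost.

    pred_logits: (num_queries, num_classes), pred_boxes: (num_queries, 4)
    gt_labels:   (num_gt,),                  gt_boxes:   (num_gt, 4)
    """
    probs = pred_logits.softmax(-1)                     # (num_queries, num_classes)
    cost_class = -probs[:, gt_labels]                   # (num_queries, num_gt)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 distance between boxes
    cost = cost_class + cost_bbox                       # the full loss also adds a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx  # indices of matched predictions and ground truths

# Toy example: 100 object queries, 11 DocLayNet classes, 3 ground-truth boxes on a page
matches = hungarian_match(torch.randn(100, 11), torch.rand(100, 4),
                          torch.tensor([0, 3, 7]), torch.rand(3, 4))
print(matches)
```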
To utilize the model, users can import necessary libraries, load an image, and process it through the model to obtain detection results. The outputs include bounding boxes and class logits, which can be filtered based on a confidence threshold. The model is accessible via the Hugging Face platform, where users can find additional resources and related models. The Deformable DETR model is licensed under Apache 2.0, and its development is documented in the paper "Deformable DETR: Deformable Transformers for End-to-End Object Detection." The model has been downloaded 108,960 times in the last month, indicating significant interest and usage within the community.
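A minimal inference sketch along those lines, assuming the standard transformers object-detection API; the local image path "page.png" and the 0.7 confidence threshold are placeholders, not values from the model card.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

# Load the image processor and model from the Hugging Face Hub
processor = AutoImageProcessor.from_pretrained("Aryn/deformable-detr-DocLayNet")
model = DeformableDetrForObjectDetection.from_pretrained("Aryn/deformable-detr-DocLayNet")

# Load a document page image (placeholder path)
image = Image.open("page.png").convert("RGB")

# Run the model
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert class logits and boxes to detections, keeping scores above a confidence threshold
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(c, 1) for c in box.tolist()]
    print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box}")
```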
Related
DETRs Beat YOLOs on Real-Time Object Detection
DETRs outperform YOLOs with the RT-DETR model, which balances speed and accuracy by adjusting the number of decoder layers. RT-DETR-R50 / R101 achieve 53.1% / 54.3% AP on COCO at 108 / 74 FPS on a T4 GPU, and RT-DETR-R50 surpasses DINO-R50 by 2.2% AP while running about 21 times faster.
The Illustrated Transformer
Jay Alammar's blog explores The Transformer model, highlighting its attention mechanism for faster training. It outperforms Google's NMT in some tasks, emphasizing parallelizability. The blog simplifies components like self-attention and multi-headed attention for better understanding.
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
A decoding strategy named DoLa reduces hallucinations in large language models without external knowledge. It contrasts logits from different layers to enhance truthfulness, improving factual generation by 12-17% in tasks like TruthfulQA.
Depth Anything V2
Depth Anything V2 is a monocular depth estimation model trained on synthetic and real images, offering improved details, robustness, and speed compared to previous models. It focuses on enhancing predictions using synthetic images and large-scale pseudo-labeled real images.
What Happened to BERT and T5?
Yi Tay analyzes transformer model evolution, emphasizing denoising methods over BERT-like models. He discusses encoder-decoder structures, bidirectional attention, and the value of denoising objectives for efficient language modeling.