Read Detection Boxes from Model Output
| Field | Value |
|---|---|
| Difficulty | Intermediate |
| Estimated Read Time | 15-20 minutes |
| Labels | postprocessing, boxdecode, detection |
Concept
Decode raw model output into usable bounding boxes using SimaBoxDecode — thresholding, NMS, and coordinate mapping built into one postprocessing stage. Read both decoded tensors and the raw byte format so you can handle any runtime shape.
BoxDecode is a highly optimized detection postprocessing path for vision workloads. It transforms inference tensors into final bounding-box results with thresholding and NMS in a single step.
Common box-decode controls in this tutorial:
- decode_type (for example yolov8): selects model-family decode behavior.
- score_threshold: drops low-confidence detections early.
- nms_iou_threshold: controls overlap suppression aggressiveness.
- top_k: limits final detection count for deterministic downstream cost.
- original_width, original_height: maps decoded boxes to the source image coordinate space.
Use-case guidance
- Too many noisy boxes: increase score_threshold and/or reduce top_k.
- Duplicate overlapping boxes: lower nms_iou_threshold to make suppression stricter.
- Missed true positives: decrease score_threshold cautiously.
- Boxes appear scaled/offset incorrectly: verify original_width and original_height match real source frames.
- Porting between detector variants: ensure decode_type matches the model family expected by the model archive.
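To get a feel for how score_threshold and top_k interact before changing the decoder configuration, the two controls can be imitated on the host. The sketch below is only an illustration: filter_detections is a hypothetical helper, the sample scores are made up, and whether the real decoder compares with >= or > is an assumption.

```python
from typing import List, Tuple

# A detection here is (score, class_id); box geometry is omitted for brevity.
Detection = Tuple[float, int]


def filter_detections(dets: List[Detection], score_threshold: float,
                      top_k: int) -> List[Detection]:
    """Keep detections scoring at least score_threshold, then the top_k best."""
    kept = [d for d in dets if d[0] >= score_threshold]
    kept.sort(key=lambda d: d[0], reverse=True)
    return kept[:top_k]


# Made-up scores, purely for illustration.
dets = [(0.91, 0), (0.30, 2), (0.62, 0), (0.12, 5), (0.58, 1)]
loose = filter_detections(dets, score_threshold=0.25, top_k=100)
strict = filter_detections(dets, score_threshold=0.55, top_k=2)
print(len(loose), len(strict))
```

Raising the threshold removes weak candidates first; top_k then caps whatever survives, which is why the two are usually tuned together.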
APIs introduced
- pyneat.ModelOptions() with .decode_type, .score_threshold, .nms_iou_threshold, .top_k, .original_width/height.
- pyneat.Tensor.from_numpy(array, image_format=...): build the input tensor.
- sample.tensor with dtype=UInt8: the packed BBOX byte buffer (wire format documented below).
Prerequisites
Chapter 001. Chapter 004 for ModelOptions basics.
Learning Process
- Configure model/postproc options for a detector-style pipeline.
- Run deterministic preproc + inference + boxdecode flow.
- Inspect decoded output signals (box count, output kind/fields).
Run
Python:
python3 share/sima-neat/tutorials/006_read_detection_boxes/read_detection_boxes.py \
--model /tmp/yolo_v8s.tar.gz --width 640 --height 640
C++ (prebuilt):
./lib/sima-neat/tutorials/tutorial_006_read_detection_boxes \
--model /tmp/yolo_v8s.tar.gz --image /path/to/frame.jpg
C++ (build from source):
./build.sh --target tutorial_006_read_detection_boxes
./build/tutorials-standalone/tutorial_006_read_detection_boxes \
--model /tmp/yolo_v8s.tar.gz --image /path/to/frame.jpg
To integrate this chapter's C++ source into your own project with a custom CMakeLists.txt (no extras folder required), see How to Run Tutorials on the landing page.
In Practice
SimaBoxDecode emits a single output tensor tagged BBOX. The tensor carries a
packed byte buffer that the runtime parser interprets into floating-point
detections. Understanding that two-level contract (wire buffer vs. parsed
Box records) is the key to reading the output from either Python or C++.
Sample wrapper
Regardless of language, the decode stage produces one Sample per input frame
with:
| Field | Value |
|---|---|
| kind | SampleKind.Tensor (single-tensor sample, not a bundle) |
| payload_tag / format | "BBOX" |
| media_type | "application/vnd.simaai.tensor" |
| tensor.dtype | UInt8 |
| tensor.shape | rank-1: [N_bytes], where N_bytes is the model archive-packed buffer capacity (for example [20160] on the stock YOLOv8 pack) |
| fields | empty (all payload lives in tensor) |
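Before parsing, it can help to confirm that a sample really follows this contract. The sketch below uses a SimpleNamespace stand-in rather than a real pyneat sample; the attribute names payload_tag and tensor.shape are taken from the table, but how the real object exposes them is an assumption, and check_bbox_sample is a hypothetical helper.

```python
from types import SimpleNamespace


def check_bbox_sample(sample) -> int:
    """Return the byte capacity of a BBOX sample, or raise ValueError."""
    if sample.payload_tag != "BBOX":
        raise ValueError(f"unexpected payload tag: {sample.payload_tag!r}")
    shape = list(sample.tensor.shape)
    if len(shape) != 1:
        raise ValueError(f"expected a rank-1 tensor, got shape {shape}")
    return shape[0]


# Stand-in object with the documented fields (not a real pyneat sample).
fake = SimpleNamespace(payload_tag="BBOX",
                       tensor=SimpleNamespace(shape=[20160]))
capacity = check_bbox_sample(fake)
print(capacity)  # 20160
```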
The tensor shape is a byte count, not a detection count. The packed bytes
hold both a small header and a contiguous array of fixed-size box records.
N_bytes is determined by the model archive's buffers.input[0].size field (inside
the boxdecode stage's config JSON) and bounds the maximum number of
detections the decoder can emit in a single frame (see "Override contract"
below for how runtime dims interact with packaged values).
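The bound on detections follows from simple arithmetic on the buffer size. As a minimal sketch, assuming the 4-byte count header and 24-byte records documented in this chapter and no extra per-detection payload (max_detections is a hypothetical helper):

```python
HEADER_BYTES = 4    # uint32 detection count
RECORD_BYTES = 24   # one packed box record


def max_detections(n_bytes: int) -> int:
    """Largest number of whole box records that fit after the header."""
    if n_bytes < HEADER_BYTES:
        raise ValueError("buffer is too small to hold the count header")
    return (n_bytes - HEADER_BYTES) // RECORD_BYTES


print(max_detections(20160))  # 839
```

On the stock pack this structural limit is far above a typical top_k, so top_k is normally the tighter bound.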
Packed wire format
The uint8 buffer is laid out little-endian:
offset  size  content
------  ----  -------
0       4     uint32 N = number of valid detections in this frame
4       24    RawBox[0]
28      24    RawBox[1]
.       .     ...
.       .     RawBox[N-1]
(trailing bytes up to buffer capacity are padding, ignored)
Each RawBox record is 24 bytes:
| Offset in record | Size | Type | Field | Meaning |
|---|---|---|---|---|
| 0 | 4 | int32 | x | top-left x, in source pixels |
| 4 | 4 | int32 | y | top-left y, in source pixels |
| 8 | 4 | int32 | w | width, in source pixels |
| 12 | 4 | int32 | h | height, in source pixels |
| 16 | 4 | float32 | score | post-NMS detection confidence in [0.0, 1.0] (the value detection_threshold gates on) |
| 20 | 4 | int32 | class_id | predicted class id (model-defined; 0-indexed; class-name map lives in the model archive metadata) |
The canonical Python struct format matching one record is "<iiiifi"
(little-endian, 4 signed ints, one float, one signed int).
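For unit tests it is handy to build a buffer in this layout without running a model. The following is a sketch only: pack_bbox_buffer is a hypothetical helper, not part of pyneat, and the box values are made up.

```python
import struct

HEADER = struct.Struct("<I")        # uint32 detection count
RECORD = struct.Struct("<iiiifi")   # x, y, w, h, score, class_id


def pack_bbox_buffer(boxes, capacity: int) -> bytes:
    """Pack (x, y, w, h, score, class_id) tuples into a zero-padded buffer."""
    needed = HEADER.size + RECORD.size * len(boxes)
    if needed > capacity:
        raise ValueError(f"{len(boxes)} boxes need {needed} bytes, "
                         f"capacity is {capacity}")
    buf = bytearray(capacity)       # trailing bytes stay zero (padding)
    HEADER.pack_into(buf, 0, len(boxes))
    for i, box in enumerate(boxes):
        RECORD.pack_into(buf, HEADER.size + RECORD.size * i, *box)
    return bytes(buf)


payload = pack_bbox_buffer([(10, 20, 100, 50, 0.5, 3)], capacity=64)
print(RECORD.size, len(payload))  # 24 64
```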
The runtime's parsing helpers (parse_bbox_bytes and decode_bbox_tensor in
include/pipeline/DetectionTypes.h) expand each RawBox into a Box struct for
downstream code; tests/unit_testing/unit_detection_types_bbox_test.cpp pins
the wire contract:
struct Box {
float x1, y1, x2, y2; // x2 = x + w, y2 = y + h; clamped to [0, img_w|h]
float score;
int class_id;
};
Coordinate space
Coordinates decoded from BBOX are in original-image pixels, the same
coordinate system you passed as original_width / original_height (or that
the model archive was packaged with). They are not normalized to [0, 1], and they
are not expressed in the model's internal letterboxed input space. The
parser clamps (x1, y1, x2, y2) to [0, original_width] / [0, original_height]
so caller code can draw them directly on the source frame.
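The corner conversion and clamping can be mirrored in Python when working from the raw bytes. This is a sketch of the behavior described above, not the runtime's own code; the Box tuple and raw_to_box are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    class_id: int


def raw_to_box(x, y, w, h, score, class_id, img_w, img_h) -> Box:
    """Convert (x, y, w, h) to corners and clamp them to the image bounds."""
    def clamp(value, upper):
        return float(min(max(value, 0), upper))

    return Box(clamp(x, img_w), clamp(y, img_h),
               clamp(x + w, img_w), clamp(y + h, img_h),
               score, class_id)


# A box that starts left of the image and runs past the bottom edge.
box = raw_to_box(-8, 600, 120, 80, 0.5, 0, img_w=640, img_h=640)
print(box[:4])  # (0.0, 600.0, 112.0, 640.0)
```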
Worked example
With the tutorial's runtime configuration (original_width = 640,
original_height = 640, top_k = 100) and the stock YOLOv8 pack
(buffers.input[0].size = 20160 in the boxdecode config), a single decoded
frame yields:
- out.kind == SampleKind.Tensor
- out.payload_tag == "BBOX"
- out.tensor.dtype == UInt8, out.tensor.shape == [20160]
- Bytes [0:4] give N in little-endian; 0 <= N <= 100 because top_k = 100. An N of 0 means "no detections above threshold this frame": iterate zero times and emit nothing.
- Bytes [4 : 4 + 24 * N] hold the valid detections; everything after that offset is zero/padding and must be ignored.
Reading a box in Python is a struct.unpack_from:
import struct
payload = out.tensor.copy_payload_bytes()
count = struct.unpack_from("<I", payload, 0)[0]
for i in range(count):
    x, y, w, h, score, cls = struct.unpack_from("<iiiifi", payload, 4 + 24 * i)
    # (x, y, w, h) in source pixels; x2 = x + w, y2 = y + h
In C++ the stages::BoxDecode helper returns a BoxDecodeResult that has
already done this unpack for you: result.boxes[i] is a Box with
(x1, y1, x2, y2) already populated from (x, y, x+w, y+h) and clamped to
the image.
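When parsing the bytes yourself in Python, it is worth guarding against a count that does not fit the buffer, so a corrupted header fails loudly instead of reading padding. A defensive sketch, assuming the layout documented above (parse_bbox_payload is a hypothetical helper, not a pyneat function):

```python
import struct

HEADER_BYTES = 4
RECORD_BYTES = 24


def parse_bbox_payload(payload: bytes):
    """Return a list of (x, y, w, h, score, class_id) tuples."""
    if len(payload) < HEADER_BYTES:
        raise ValueError("payload is shorter than the 4-byte header")
    (count,) = struct.unpack_from("<I", payload, 0)
    capacity = (len(payload) - HEADER_BYTES) // RECORD_BYTES
    if count > capacity:
        raise ValueError(f"header claims {count} boxes, "
                         f"buffer holds at most {capacity}")
    # Only the first `count` records are valid; the rest is padding.
    return [struct.unpack_from("<iiiifi", payload,
                               HEADER_BYTES + RECORD_BYTES * i)
            for i in range(count)]


empty = parse_bbox_payload(bytes(20160))  # all-zero buffer: N == 0
one = parse_bbox_payload(
    struct.pack("<Iiiiifi", 1, 10, 20, 100, 50, 0.5, 3) + bytes(8))
print(len(empty), one)
```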
Override contract: runtime dims vs. packaged model archive defaults
SimaBoxDecode is constructed from a trained model archive that ships with packaged
defaults for decode_type, detection_threshold, nms_iou_threshold,
top_k, original_width, and original_height. The public constructor
SimaBoxDecode(const Model& model,
const std::string& decode_type = "",
int original_width = 0, int original_height = 0,
double detection_threshold = 0.0,
double nms_iou_threshold = 0.0,
int top_k = 0);
and its Python twin pyneat.nodes.sima_box_decode(model, ...) use a simple
"positive overrides, zero/empty preserves" rule per field.
Naming note. detection_threshold is the name used by SimaBoxDecode's constructor. ModelOptions.score_threshold (used in the Python tutorial) is plumbed into that same argument. The two names refer to the same underlying control.
| Runtime argument | Value passed | Behavior |
|---|---|---|
| decode_type | "" (empty) | preserve model archive / model-path inference |
| decode_type | non-empty string | override the model archive value for this run |
| original_width / original_height | 0 | preserve model archive packaged dimension |
| original_width / original_height | positive int | rewrite original_width / original_height in the effective config |
| detection_threshold | 0.0 | preserve model archive packaged threshold |
| detection_threshold | > 0.0 | override (also triggers the YOLOv8 cliff-warning below) |
| nms_iou_threshold | 0.0 | preserve model archive packaged NMS IoU |
| nms_iou_threshold | > 0.0 | override |
| top_k | 0 | preserve model archive packaged top-K |
| top_k | > 0 | override |
The rule is strictly per-field:
- Python path: the tutorial overrides every field because ModelOptions sets positive values.
- C++ path: read_detection_boxes.cpp passes 0.55f, 0.5f, 100 (so detection_threshold, nms_iou_threshold, and top_k are overridden) plus bgr.cols, bgr.rows positively (so original_width / original_height are overridden too).
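The per-field rule can be modeled as a small merge function. The sketch below only illustrates the "positive overrides, zero/empty preserves" behavior; effective_config is a hypothetical helper, and the packaged values are invented for the example rather than read from any real model archive.

```python
def _is_override(value) -> bool:
    """Non-empty strings and positive numbers override; "" and 0 preserve."""
    if isinstance(value, str):
        return value != ""
    return value > 0


def effective_config(packaged: dict, decode_type="", original_width=0,
                     original_height=0, detection_threshold=0.0,
                     nms_iou_threshold=0.0, top_k=0) -> dict:
    runtime = {
        "decode_type": decode_type,
        "original_width": original_width,
        "original_height": original_height,
        "detection_threshold": detection_threshold,
        "nms_iou_threshold": nms_iou_threshold,
        "top_k": top_k,
    }
    merged = dict(packaged)
    for key, value in runtime.items():
        if _is_override(value):     # each field is decided on its own
            merged[key] = value
    return merged


# Invented packaged defaults, for illustration only.
packaged = {"decode_type": "yolov8", "original_width": 1280,
            "original_height": 720, "detection_threshold": 0.25,
            "nms_iou_threshold": 0.45, "top_k": 300}
cfg = effective_config(packaged, original_width=640, original_height=640,
                       top_k=100)
print(cfg)
```

Here only the dimensions and top_k change; the thresholds and decode type keep their packaged values because their arguments were left at zero or empty.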
Practical consequences:
- If your model archive was packed for a different resolution than your source frames, pass original_width and original_height explicitly so coordinates land in source pixels.
- Leaving detection_threshold and nms_iou_threshold at 0.0 is the safest way to get the model archive's validated defaults; only override when you are deliberately retuning.
- Be deliberate with a low detection_threshold. The lower it is, the more candidate boxes survive thresholding, and NMS cost grows with the square of the surviving-box count, so a very low threshold can sharply increase postprocess compute and latency. Lower it only as far as you need to catch weak detections; pair it with top_k to cap the worst case.
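To see why the surviving-box count matters, consider a textbook greedy NMS that counts its pairwise IoU checks. This is a reference sketch for intuition only; it is not the SimaBoxDecode implementation, and the boxes are made up.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, iou_threshold: float):
    """boxes: (x1, y1, x2, y2, score). Returns (kept, comparisons)."""
    kept, comparisons = [], 0
    for cand in sorted(boxes, key=lambda b: b[4], reverse=True):
        suppressed = False
        for winner in kept:
            comparisons += 1
            if iou(cand, winner) > iou_threshold:
                suppressed = True   # overlaps a higher-scoring box
                break
        if not suppressed:
            kept.append(cand)
    return kept, comparisons


boxes = [(0, 0, 100, 100, 0.9), (10, 10, 110, 110, 0.8),
         (200, 200, 300, 300, 0.7)]
kept_strict, n_strict = greedy_nms(boxes, iou_threshold=0.5)
kept_loose, n_loose = greedy_nms(boxes, iou_threshold=0.7)
print(len(kept_strict), n_strict, len(kept_loose), n_loose)
```

When no boxes overlap, every candidate is compared against every kept box, so n candidates cost n(n-1)/2 checks; that is the quadratic growth a low threshold exposes.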
Decode types and tensor contracts
BoxDecodeType is a typed API (simaai::neat::BoxDecodeType / neat.BoxDecodeType) and should always be set explicitly for decode stages. The runtime contract below comes from internals/gst_plugins/genericboxdecode_v2/gstneatboxdecode.cpp (infer_num_classes, infer_yolo_decoupled_classes, infer_yolo_packed_classes, compute_required_output_size).
Core tensor contract rules:
- YOLO-family decode types (yolo, yolov5*, yolov7*, yolov8*, yolov9*, yolov10*):
  - Decoupled heads: class-head depths must be repeatable and > 4.
  - Packed heads: each head depth must satisfy depth = 3 * (num_classes + 5) and be consistent across heads.
- yolo26: decoupled grouped heads with 4-channel raw l/t/r/b bbox tensors and repeatable class-head depths > 4.
- detr: class channels are inferred from the maximum depth across heads, and must be > 4.
- Other non-YOLO decode types (effdet, rcnn-stage1, centernet): fallback class inference uses max depth and requires > 4.
- Segmentation decode tokens (*-seg) enable segmentation-like output sizing in v2 (adds mask payload per detection).
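The packed-head rule can be checked by hand before loading a model. The sketch below inverts depth = 3 * (num_classes + 5) and enforces consistency across heads; packed_num_classes is a hypothetical helper that illustrates the rule and is not the plugin's infer_yolo_packed_classes.

```python
def packed_num_classes(head_depths, anchors_per_cell: int = 3) -> int:
    """Infer num_classes from packed YOLO head depths, or raise ValueError."""
    classes = set()
    for depth in head_depths:
        if depth % anchors_per_cell != 0:
            raise ValueError(f"depth {depth} is not a multiple of "
                             f"{anchors_per_cell}")
        n = depth // anchors_per_cell - 5   # 4 box terms + 1 objectness
        if n <= 0:
            raise ValueError(f"depth {depth} leaves no class channels")
        classes.add(n)
    if len(classes) != 1:
        raise ValueError(f"inconsistent or missing head depths: "
                         f"{list(head_depths)}")
    return classes.pop()


print(packed_num_classes([255, 255, 255]))  # 80
```

A depth of 255 on every head corresponds to 80 classes, since 3 * (80 + 5) = 255.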
| API enum | Backend token | Expected contract |
|---|---|---|
| BoxDecodeType::Yolo | yolo | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV5 | yolov5 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV5Seg | yolov5-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV7 | yolov7 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV7Seg | yolov7-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV8 | yolov8 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV8Seg | yolov8-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV8Pose | yolov8-pose | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV9 | yolov9 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV9Seg | yolov9-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV10 | yolov10 | YOLO decoupled or packed depth contract |
| BoxDecodeType::YoloV10Seg | yolov10-seg | YOLO depth contract + segmentation path |
| BoxDecodeType::YoloV26 | yolo26 | YOLO26 grouped raw l/t/r/b bbox heads + class-score heads |
| BoxDecodeType::Detr | detr | num_classes = max(depth) (must be > 4) |
| BoxDecodeType::EffDet | effdet | fallback max-depth inference (> 4) |
| BoxDecodeType::RcnnStage1 | rcnn-stage1 | fallback max-depth inference (> 4) |
| BoxDecodeType::Centernet | centernet | fallback max-depth inference (> 4) |
Fail-fast behavior:
- stages::BoxDecodeOptions requires explicit construction with a decode type.
- stages::BoxDecode(...) and nodes::SimaBoxDecode(...) fail fast on BoxDecodeType::Unspecified.
Setting the decode type explicitly:
C++:
simaai::neat::stages::BoxDecodeOptions opt(simaai::neat::BoxDecodeType::YoloV8);
opt.detection_threshold = 0.25;
opt.nms_iou_threshold = 0.5;
opt.top_k = 100;
Python:
opt = neat.ModelOptions()
opt.decode_type = neat.BoxDecodeType.YoloV8
Code
// Decompose model execution into stages: Preproc -> Infer -> BoxDecode.
//
// Usage:
//   tutorial_006_read_detection_boxes --model /path/to/yolo_v8s.tar.gz --image /path/to.jpg
#include "neat.h"
#include "pipeline/StageRun.h"
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace {

// Look up "--key value" on the command line; returns false if the key is absent.
bool get_arg(int argc, char** argv, const std::string& key, std::string& out) {
  for (int i = 1; i + 1 < argc; ++i) {
    if (key == argv[i]) {
      out = argv[i + 1];
      return true;
    }
  }
  return false;
}

} // namespace

int main(int argc, char** argv) {
  try {
    std::string model_path, image;
    if (!get_arg(argc, argv, "--model", model_path) || !get_arg(argc, argv, "--image", image)) {
      std::cerr << "Usage: tutorial_006_read_detection_boxes --model <path> --image <path>\n";
      return 1;
    }

    cv::Mat bgr = cv::imread(image, cv::IMREAD_COLOR);
    if (bgr.empty())
      throw std::runtime_error("failed to load image: " + image);

    simaai::neat::Model::Options opt;
    opt.preprocess.color_convert.input_format = simaai::neat::PreprocessColorFormat::BGR;
    opt.preprocess.input_max_width = bgr.cols;
    opt.preprocess.input_max_height = bgr.rows;
    opt.preprocess.input_max_depth = bgr.channels();
    opt.decode_type = simaai::neat::BoxDecodeType::YoloV8;
    simaai::neat::Model model(model_path, opt);

    // CORE LOGIC
    // Stage-by-stage: each stages::* call runs one piece of the model pipeline.
    simaai::neat::TensorList pre = simaai::neat::stages::Preproc(std::vector<cv::Mat>{bgr}, model);
    simaai::neat::Sample infer_samples = simaai::neat::stages::Infer(
        simaai::neat::Sample{simaai::neat::sample_from_tensors(pre)}, model);
    if (infer_samples.empty())
      throw std::runtime_error("infer stage returned no samples");
    simaai::neat::Sample infer = infer_samples.front();

    simaai::neat::stages::BoxDecodeOptions box(simaai::neat::BoxDecodeType::YoloV8);
    (void)box.decode_type;
    (void)bgr.cols;
    (void)bgr.rows;
    box.detection_threshold = 0.55;
    box.nms_iou_threshold = 0.5;
    box.top_k = 100;

    // BoxDecode parses the "BBOX" tensor into {x1, y1, x2, y2, score, class_id}
    // entries clamped to original_width x original_height source pixels.
    simaai::neat::BoxDecodeResultList decoded_results =
        simaai::neat::stages::BoxDecodeResults(simaai::neat::Sample{infer}, model, box);
    if (decoded_results.empty())
      throw std::runtime_error("boxdecode result parser returned no results");
    const simaai::neat::BoxDecodeResult& decoded = decoded_results.front();

    std::cout << "boxes=" << decoded.boxes.size() << "\n";
    std::cout << "[OK] 006_read_detection_boxes\n";
    return 0;
  } catch (const std::exception& e) {
    std::cerr << "[FAIL] " << e.what() << "\n";
    return 1;
  }
}