Read Detection Boxes from Model Output

Field	Value
Difficulty	Intermediate
Estimated Read Time	15-20 minutes
Labels	`postprocessing`, `boxdecode`, `detection`

Concept

Decode raw model output into usable bounding boxes using SimaBoxDecode — thresholding, NMS, and coordinate mapping built into one postprocessing stage. Read both decoded tensors and the raw byte format so you can handle any runtime shape.

BoxDecode is a highly optimized detection postprocessing path for vision workloads. It transforms inference tensors into final bounding-box results with thresholding and NMS in a single step.

Common box-decode controls in this tutorial:

decode_type (for example yolov8): selects model-family decode behavior.
score_threshold: drops low-confidence detections early.
nms_iou_threshold: controls overlap suppression aggressiveness.
top_k: limits final detection count for deterministic downstream cost.
original_width, original_height: maps decoded boxes to the source image coordinate space.

Use-case guidance

Too many noisy boxes: increase score_threshold and/or reduce top_k.
Duplicate overlapping boxes: lower nms_iou_threshold to make suppression stricter.
Missed true positives: decrease score_threshold cautiously.
Boxes appear scaled/offset incorrectly: verify original_width and original_height match real source frames.
Porting between detector variants: ensure decode_type matches the model family expected by the model archive.
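
The effect of these knobs can be seen in a small pure-Python sketch. It is not SimaBoxDecode's implementation: the tuple layout, the greedy class-agnostic NMS, and the `>=` comparison at the threshold are all assumptions made for illustration, and the sample detections are invented.

```python
from typing import List, Tuple

# (x1, y1, x2, y2, score): illustrative layout, not the runtime's type.
Det = Tuple[float, float, float, float, float]


def iou(a: Det, b: Det) -> float:
    """Intersection-over-union of two corner-format boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(dets: List[Det], score_threshold: float,
                nms_iou_threshold: float, top_k: int) -> List[Det]:
    """Threshold, then greedy NMS in descending score order, capped at top_k."""
    candidates = sorted((d for d in dets if d[4] >= score_threshold),
                        key=lambda d: d[4], reverse=True)
    kept: List[Det] = []
    for d in candidates:
        if len(kept) >= top_k:
            break
        # Keep d only if it does not overlap an already-kept box too much.
        if all(iou(d, k) <= nms_iou_threshold for k in kept):
            kept.append(d)
    return kept


# Invented detections: a strong box, a near-duplicate, a distinct box, and noise.
dets = [
    (10, 10, 110, 110, 0.90),
    (12, 12, 112, 112, 0.80),
    (200, 200, 260, 260, 0.60),
    (300, 300, 320, 320, 0.20),
]

baseline = postprocess(dets, 0.25, 0.5, 100)      # duplicate and noise removed
loose_nms = postprocess(dets, 0.25, 0.95, 100)    # duplicate survives
low_score = postprocess(dets, 0.10, 0.5, 100)     # noise box survives
capped = postprocess(dets, 0.10, 0.5, 1)          # top_k bounds the output
```

Raising `nms_iou_threshold` lets the near-duplicate through, lowering `score_threshold` admits the noise box, and `top_k` bounds the result regardless of the other two.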

APIs introduced

pyneat.ModelOptions() with .decode_type, .score_threshold, .nms_iou_threshold, .top_k, .original_width/height.
pyneat.Tensor.from_numpy(array, image_format=...) — build the input tensor.
sample.tensor with dtype=UInt8 — the packed BBOX byte buffer (wire format documented below).

Prerequisites: Chapter 001. Chapter 004 for ModelOptions basics.

Learning Process

Configure model/postproc options for a detector-style pipeline.
Run deterministic preproc + inference + boxdecode flow.
Inspect decoded output signals (box count, output kind/fields).

Run

Python:

python3 share/sima-neat/tutorials/006_read_detection_boxes/read_detection_boxes.py \
  --model /tmp/yolo_v8s.tar.gz --width 640 --height 640

C++ (prebuilt):

./lib/sima-neat/tutorials/tutorial_006_read_detection_boxes \
  --model /tmp/yolo_v8s.tar.gz --image /path/to/frame.jpg

C++ (build from source):

./build.sh --target tutorial_006_read_detection_boxes
./build/tutorials-standalone/tutorial_006_read_detection_boxes \
  --model /tmp/yolo_v8s.tar.gz --image /path/to/frame.jpg

To integrate this chapter's C++ source into your own project with a custom CMakeLists.txt (no extras folder required), see How to Run Tutorials on the landing page.

In Practice

SimaBoxDecode emits a single output tensor tagged BBOX. The tensor carries a packed byte buffer that the runtime parser interprets into floating-point detections. Understanding that two-level contract (wire buffer vs. parsed Box records) is the key to reading the output from either Python or C++.

Sample wrapper

Regardless of language, the decode stage produces one Sample per input frame with:

Field	Value
`kind`	`SampleKind.Tensor` (single-tensor sample, not a bundle)
`payload_tag` / `format`	`"BBOX"`
`media_type`	`"application/vnd.simaai.tensor"`
`tensor.dtype`	`UInt8`
`tensor.shape`	rank-1: `[N_bytes]`, where `N_bytes` is the model archive-packed buffer capacity (for example `[20160]` on the stock YOLOv8 pack)
`fields`	empty (all payload lives in `tensor`)

The tensor shape is a byte count, not a detection count. The packed bytes hold both a small header and a contiguous array of fixed-size box records. N_bytes is determined by the model archive's buffers.input[0].size field (inside the boxdecode stage's config JSON) and bounds the maximum number of detections the decoder can emit in a single frame (see "Override contract" below for how runtime dims interact with packaged values).

Packed wire format

The uint8 buffer is laid out little-endian:

offset  size  content
------  ----  -------
  0      4    uint32  N = number of valid detections in this frame
  4     24    RawBox[0]
 28     24    RawBox[1]
  .      .      ...
  .      .    RawBox[N-1]
                   (trailing bytes up to buffer capacity are padding, ignored)

Each RawBox record is 24 bytes:

Offset in record	Size	Type	Field	Meaning
0	4	int32	`x`	top-left x, in source pixels
4	4	int32	`y`	top-left y, in source pixels
8	4	int32	`w`	width, in source pixels
12	4	int32	`h`	height, in source pixels
16	4	float32	`score`	post-NMS detection confidence in `[0.0, 1.0]` (the value `detection_threshold` gates on)
20	4	int32	`class_id`	predicted class id (model-defined; 0-indexed; class-name map lives in the model archive metadata)

The canonical Python struct format matching one record is "<iiiifi" (little-endian, 4 signed ints, one float, one signed int).
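
For testing a parser without hardware, a synthetic buffer in this layout can be built with the standard `struct` module. This is a sketch based only on the layout described above; `pack_bbox_buffer` is a hypothetical helper name, not part of pyneat.

```python
import struct

HEADER = struct.Struct("<I")        # uint32 detection count
RAW_BOX = struct.Struct("<iiiifi")  # x, y, w, h, score, class_id


def pack_bbox_buffer(boxes, capacity):
    """Pack (x, y, w, h, score, class_id) tuples into a zero-padded buffer."""
    needed = HEADER.size + RAW_BOX.size * len(boxes)
    if needed > capacity:
        raise ValueError(f"{len(boxes)} boxes need {needed} bytes, capacity is {capacity}")
    buf = bytearray(capacity)  # trailing bytes stay zero, as padding
    HEADER.pack_into(buf, 0, len(boxes))
    for i, box in enumerate(boxes):
        RAW_BOX.pack_into(buf, HEADER.size + RAW_BOX.size * i, *box)
    return bytes(buf)


synthetic = pack_bbox_buffer([(10, 20, 30, 40, 0.5, 3)], capacity=64)
```

Because the format string starts with `<`, `struct` applies no alignment padding, so `RAW_BOX.size` is exactly 24.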

The runtime's parsing helpers, parse_bbox_bytes and decode_bbox_tensor in include/pipeline/DetectionTypes.h, expand each RawBox into a Box struct for downstream code. The wire contract is pinned by tests/unit_testing/unit_detection_types_bbox_test.cpp.

struct Box {
  float x1, y1, x2, y2;  // x2 = x + w, y2 = y + h; clamped to [0, img_w|h]
  float score;
  int   class_id;
};

Coordinate space

Coordinates decoded from BBOX are in original-image pixels, the same coordinate system you passed as original_width / original_height (or that the model archive was packaged with). They are not normalized to [0, 1], and they are not expressed in the model's internal letterboxed input space. The parser clamps (x1, y1, x2, y2) to [0, original_width] / [0, original_height] so caller code can draw them directly on the source frame.
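
The RawBox-to-Box conversion and clamping described above can be sketched in Python as follows. The `Box` tuple and `raw_to_box` function are illustrative names that mirror the C++ struct comment (`x2 = x + w`, `y2 = y + h`, clamped to the image). The runtime helper may differ in details such as rounding.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    class_id: int


def raw_to_box(x, y, w, h, score, class_id, img_w, img_h):
    """Convert a top-left/size record to clamped corner coordinates."""
    def clamp(value, upper):
        return float(min(max(value, 0), upper))

    return Box(
        clamp(x, img_w), clamp(y, img_h),
        clamp(x + w, img_w), clamp(y + h, img_h),
        float(score), int(class_id),
    )


# A record that spills past the left and bottom edges of a 640x640 frame.
box = raw_to_box(-5, 600, 50, 80, 0.9, 2, img_w=640, img_h=640)
```

Here the box is pulled back to `x1 = 0` and `y2 = 640`, so it can be drawn on the source frame without further bounds checks.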

Worked example

With the tutorial's runtime configuration (original_width = 640, original_height = 640, top_k = 100) and the stock YOLOv8 pack (buffers.input[0].size = 20160 in the boxdecode config), a single decoded frame yields:

out.kind == SampleKind.Tensor
out.payload_tag == "BBOX"
out.tensor.dtype == UInt8, out.tensor.shape == [20160]
Bytes [0:4] give N in little-endian; 0 <= N <= 100 because top_k = 100. An N of 0 means "no detections above threshold this frame" — iterate zero times and emit nothing.
Bytes [4 : 4 + 24 * N] hold the valid detections; everything after that offset is zero/padding and must be ignored.
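
The byte arithmetic in this worked example can be checked with a few lines of Python. The constants come from the wire format above. The function names are illustrative only.

```python
HEADER_BYTES = 4    # uint32 count
RECORD_BYTES = 24   # one RawBox


def max_records(n_bytes: int) -> int:
    """Largest number of whole RawBox records that fit in the buffer."""
    return max(0, (n_bytes - HEADER_BYTES) // RECORD_BYTES)


def bytes_used(n_detections: int) -> int:
    """Bytes occupied by the header plus n_detections records."""
    return HEADER_BYTES + RECORD_BYTES * n_detections


capacity = max_records(20160)       # 839 records fit in the stock buffer
full_frame = bytes_used(100)        # 2404 bytes when top_k = 100 is reached
padding = 20160 - full_frame        # 17756 trailing bytes to ignore
```

With `top_k = 100`, even a full frame uses only the first 2404 bytes of the 20160-byte buffer, so most of the tensor is padding.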

Reading a box in Python is a struct.unpack_from:

import struct
payload = out.tensor.copy_payload_bytes()
count = struct.unpack_from("<I", payload, 0)[0]
for i in range(count):
    x, y, w, h, score, cls = struct.unpack_from("<iiiifi", payload, 4 + 24 * i)
    # (x, y, w, h) in source pixels; x2 = x + w, y2 = y + h
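
The snippet above trusts the count field. A more defensive variant, sketched below, validates the count against the buffer length before unpacking, which guards against a truncated or corrupted payload. `parse_bbox_payload` is a hypothetical helper, not the runtime's `parse_bbox_bytes`, and it takes plain `bytes` so it can be exercised without a device.

```python
import struct

_HEADER = struct.Struct("<I")
_RAW_BOX = struct.Struct("<iiiifi")


def parse_bbox_payload(payload: bytes):
    """Return a list of (x, y, w, h, score, class_id) tuples from a BBOX buffer."""
    if len(payload) < _HEADER.size:
        raise ValueError("payload shorter than the 4-byte count header")
    (count,) = _HEADER.unpack_from(payload, 0)
    capacity = (len(payload) - _HEADER.size) // _RAW_BOX.size
    if count > capacity:
        raise ValueError(f"count {count} exceeds buffer capacity {capacity}")
    # Only the first `count` records are valid; the rest is padding.
    return [
        _RAW_BOX.unpack_from(payload, _HEADER.size + _RAW_BOX.size * i)
        for i in range(count)
    ]


# Synthetic payload: two records followed by 20 bytes of padding.
demo = (
    struct.pack("<I", 2)
    + struct.pack("<iiiifi", 1, 2, 3, 4, 0.5, 7)
    + struct.pack("<iiiifi", 5, 6, 7, 8, 0.25, 1)
    + bytes(20)
)
records = parse_bbox_payload(demo)
```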

In C++ the stages::BoxDecode helper returns a BoxDecodeResult that has already done this unpack for you: result.boxes[i] is a Box with (x1, y1, x2, y2) already populated from (x, y, x+w, y+h) and clamped to the image.

Override contract: runtime dims vs. packaged model archive defaults

SimaBoxDecode is constructed from a trained model archive that ships with packaged defaults for decode_type, detection_threshold, nms_iou_threshold, top_k, original_width, and original_height. The public constructor

SimaBoxDecode(const Model& model,
              const std::string& decode_type = "",
              int original_width = 0, int original_height = 0,
              double detection_threshold = 0.0,
              double nms_iou_threshold = 0.0,
              int top_k = 0);

and its Python twin pyneat.nodes.sima_box_decode(model, ...) use a simple "positive overrides, zero/empty preserves" rule per field.

Naming note. detection_threshold is the name used by SimaBoxDecode's constructor. ModelOptions.score_threshold (used in the Python tutorial) is plumbed into that same argument. The two names refer to the same underlying control.

Runtime argument	Value passed	Behavior
`decode_type`	`""` (empty)	preserve model archive / model-path inference
`decode_type`	non-empty string	override the model archive value for this run
`original_width` / `original_height`	`0`	preserve model archive packaged dimension
`original_width` / `original_height`	positive int	rewrite `original_width` / `original_height` in the effective config
`detection_threshold`	`0.0`	preserve model archive packaged threshold
`detection_threshold`	`> 0.0`	override (also triggers the YOLOv8 cliff-warning; see the low-threshold caution under "Practical consequences" below)
`nms_iou_threshold`	`0.0`	preserve model archive packaged NMS IoU
`nms_iou_threshold`	`> 0.0`	override
`top_k`	`0`	preserve model archive packaged top-K
`top_k`	`> 0`	override

The rule is strictly per-field:

Python path: the tutorial overrides every field because ModelOptions sets positive values.
C++ path: read_detection_boxes.cpp sets detection_threshold = 0.55, nms_iou_threshold = 0.5, and top_k = 100 (so all three are overridden) plus bgr.cols, bgr.rows positively (so original_width / original_height are overridden too).
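
The "positive overrides, zero/empty preserves" rule can be modeled in a few lines of Python. This sketches the rule as stated in the table, not the runtime's code. `DecodeConfig`, `effective_config`, and the packaged values shown are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecodeConfig:
    decode_type: str
    original_width: int
    original_height: int
    detection_threshold: float
    nms_iou_threshold: float
    top_k: int


def effective_config(packaged, decode_type="", original_width=0, original_height=0,
                     detection_threshold=0.0, nms_iou_threshold=0.0, top_k=0):
    """Apply each runtime argument only if it is non-empty / positive."""
    overrides = {}
    if decode_type:
        overrides["decode_type"] = decode_type
    if original_width > 0:
        overrides["original_width"] = original_width
    if original_height > 0:
        overrides["original_height"] = original_height
    if detection_threshold > 0.0:
        overrides["detection_threshold"] = detection_threshold
    if nms_iou_threshold > 0.0:
        overrides["nms_iou_threshold"] = nms_iou_threshold
    if top_k > 0:
        overrides["top_k"] = top_k
    return replace(packaged, **overrides)


# Hypothetical packaged defaults, for illustration only.
packaged = DecodeConfig("yolov8", 1280, 720, 0.25, 0.45, 300)
effective = effective_config(packaged, original_width=640, original_height=640, top_k=100)
```

One consequence of this rule, assuming it holds exactly as tabulated, is that a field cannot be overridden to zero: passing `0` or `0.0` always means "keep the packaged value".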

Practical consequences:

If your model archive was packed for a different resolution than your source frames, pass original_width and original_height explicitly so coordinates land in source pixels.
Leaving detection_threshold and nms_iou_threshold at 0.0 is the safest way to get the model archive's validated defaults; only override when you are deliberately retuning.
Be deliberate with a low detection_threshold. The lower it is, the more candidate boxes survive thresholding, and NMS cost grows with the square of the surviving-box count — so a very low threshold can sharply increase postprocess compute and latency. Lower it only as far as you need to catch weak detections; pair it with top_k to cap the worst case.
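
To put the quadratic growth in concrete terms, the sketch below counts the box pairs a naive all-pairs NMS would compare. It is an upper-bound model for intuition only. The actual decoder may prune comparisons or suppress per class.

```python
def worst_case_pairs(survivors: int) -> int:
    """Number of unordered box pairs among the boxes that pass thresholding."""
    if survivors < 0:
        raise ValueError("survivors must be non-negative")
    return survivors * (survivors - 1) // 2


pairs_100 = worst_case_pairs(100)     # 4950
pairs_1000 = worst_case_pairs(1000)   # 499500
growth = pairs_1000 / pairs_100       # roughly 100x for 10x more survivors
```

A tenfold increase in surviving boxes raises the pair count about a hundredfold, which is why lowering `detection_threshold` is worth doing in small steps.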

Decode types and tensor contracts

BoxDecodeType is a typed API (simaai::neat::BoxDecodeType / neat.BoxDecodeType) and should always be set explicitly for decode stages. The runtime contract below comes from internals/gst_plugins/genericboxdecode_v2/gstneatboxdecode.cpp (infer_num_classes, infer_yolo_decoupled_classes, infer_yolo_packed_classes, compute_required_output_size).

Core tensor contract rules:

YOLO-family decode types (yolo, yolov5*, yolov7*, yolov8*, yolov9*, yolov10*):
- Decoupled heads: class-head depths must be repeatable and > 4.
- Packed heads: each head depth must satisfy depth = 3 * (num_classes + 5) and be consistent across heads.
yolo26: decoupled grouped heads with 4-channel raw l/t/r/b bbox tensors and repeatable class-head depths > 4.
detr: class channels are inferred from the maximum depth across heads, and must be > 4.
Other non-YOLO decode types (effdet, rcnn-stage1, centernet): fallback class inference uses max depth and requires > 4.
Segmentation decode tokens (*-seg) enable segmentation-like output sizing in v2 (adds mask payload per detection).
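
The packed-head rule, depth = 3 * (num_classes + 5) with consistency across heads, can be checked offline with a short sketch. It follows only the formula stated above. The function names are illustrative and are not the plugin's `infer_yolo_packed_classes`.

```python
PACKED_FACTOR = 3      # the constant 3 in depth = 3 * (num_classes + 5)
BOX_AND_OBJECTNESS = 5  # the constant 5 in the same formula


def classes_from_packed_depth(depth: int) -> int:
    """Invert depth = 3 * (num_classes + 5) for a single head."""
    if depth % PACKED_FACTOR != 0:
        raise ValueError(f"depth {depth} is not a multiple of {PACKED_FACTOR}")
    num_classes = depth // PACKED_FACTOR - BOX_AND_OBJECTNESS
    if num_classes <= 0:
        raise ValueError(f"depth {depth} leaves no room for class channels")
    return num_classes


def infer_packed_classes(head_depths):
    """Require every head to imply the same class count, then return it."""
    counts = {classes_from_packed_depth(d) for d in head_depths}
    if len(counts) != 1:
        raise ValueError(f"inconsistent class counts across heads: {sorted(counts)}")
    return counts.pop()


num_classes = infer_packed_classes([255, 255, 255])  # 255 = 3 * (80 + 5)
```

Three heads of depth 255 imply 80 classes, while a head of depth 254 or a mix such as 255 and 18 is rejected.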

API enum	Backend token	Expected contract
`BoxDecodeType::Yolo`	`yolo`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV5`	`yolov5`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV5Seg`	`yolov5-seg`	YOLO depth contract + segmentation path
`BoxDecodeType::YoloV7`	`yolov7`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV7Seg`	`yolov7-seg`	YOLO depth contract + segmentation path
`BoxDecodeType::YoloV8`	`yolov8`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV8Seg`	`yolov8-seg`	YOLO depth contract + segmentation path
`BoxDecodeType::YoloV8Pose`	`yolov8-pose`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV9`	`yolov9`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV9Seg`	`yolov9-seg`	YOLO depth contract + segmentation path
`BoxDecodeType::YoloV10`	`yolov10`	YOLO decoupled or packed depth contract
`BoxDecodeType::YoloV10Seg`	`yolov10-seg`	YOLO depth contract + segmentation path
`BoxDecodeType::YoloV26`	`yolo26`	YOLO26 grouped raw l/t/r/b bbox heads + class-score heads
`BoxDecodeType::Detr`	`detr`	`num_classes = max(depth)` (must be `> 4`)
`BoxDecodeType::EffDet`	`effdet`	fallback max-depth inference (`> 4`)
`BoxDecodeType::RcnnStage1`	`rcnn-stage1`	fallback max-depth inference (`> 4`)
`BoxDecodeType::Centernet`	`centernet`	fallback max-depth inference (`> 4`)

Fail-fast behavior:

stages::BoxDecodeOptions requires explicit construction with a decode type.
stages::BoxDecode(...) and nodes::SimaBoxDecode(...) fail fast on BoxDecodeType::Unspecified.

Setting the decode type explicitly:

C++:

simaai::neat::stages::BoxDecodeOptions opt(simaai::neat::BoxDecodeType::YoloV8);
opt.detection_threshold = 0.25;
opt.nms_iou_threshold = 0.5;
opt.top_k = 100;

Python:

opt = neat.ModelOptions()
opt.decode_type = neat.BoxDecodeType.YoloV8

Code

tutorials/006_read_detection_boxes/read_detection_boxes.cpp
// Decompose model execution into stages: Preproc -> Infer -> BoxDecode.
//
// Usage:
//   tutorial_006_read_detection_boxes --model /path/to/yolo_v8s.tar.gz --image /path/to.jpg

#include "neat.h"

#include "pipeline/StageRun.h"

#include <opencv2/imgcodecs.hpp>

#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace {

bool get_arg(int argc, char** argv, const std::string& key, std::string& out) {
  for (int i = 1; i + 1 < argc; ++i) {
    if (key == argv[i]) {
      out = argv[i + 1];
      return true;
    }
  }
  return false;
}

} // namespace

int main(int argc, char** argv) {
  try {
    std::string model_path, image;
    if (!get_arg(argc, argv, "--model", model_path) || !get_arg(argc, argv, "--image", image)) {
      std::cerr << "Usage: tutorial_006_read_detection_boxes --model <path> --image <path>\n";
      return 1;
    }

    cv::Mat bgr = cv::imread(image, cv::IMREAD_COLOR);
    if (bgr.empty())
      throw std::runtime_error("failed to load image: " + image);

    simaai::neat::Model::Options opt;
    opt.preprocess.color_convert.input_format = simaai::neat::PreprocessColorFormat::BGR;
    opt.preprocess.input_max_width = bgr.cols;
    opt.preprocess.input_max_height = bgr.rows;
    opt.preprocess.input_max_depth = bgr.channels();
    opt.decode_type = simaai::neat::BoxDecodeType::YoloV8;

    simaai::neat::Model model(model_path, opt);

    // CORE LOGIC
    // Stage-by-stage: each stages::* call runs one piece of the model pipeline.
    simaai::neat::TensorList pre = simaai::neat::stages::Preproc(std::vector<cv::Mat>{bgr}, model);
    simaai::neat::Sample infer_samples = simaai::neat::stages::Infer(
        simaai::neat::Sample{simaai::neat::sample_from_tensors(pre)}, model);
    if (infer_samples.empty())
      throw std::runtime_error("infer stage returned no samples");
    simaai::neat::Sample infer = infer_samples.front();

    simaai::neat::stages::BoxDecodeOptions box(simaai::neat::BoxDecodeType::YoloV8);
    (void)box.decode_type;
    (void)bgr.cols;
    (void)bgr.rows;
    box.detection_threshold = 0.55;
    box.nms_iou_threshold = 0.5;
    box.top_k = 100;

    // BoxDecode parses the "BBOX" tensor into {x1, y1, x2, y2, score, class_id}
    // entries clamped to original_width x original_height source pixels.
    simaai::neat::BoxDecodeResultList decoded_results =
        simaai::neat::stages::BoxDecodeResults(simaai::neat::Sample{infer}, model, box);
    if (decoded_results.empty())
      throw std::runtime_error("boxdecode result parser returned no results");
    const simaai::neat::BoxDecodeResult& decoded = decoded_results.front();

    std::cout << "boxes=" << decoded.boxes.size() << "\n";
    std::cout << "[OK] 006_read_detection_boxes\n";
    return 0;
  } catch (const std::exception& e) {
    std::cerr << "[FAIL] " << e.what() << "\n";
    return 1;
  }
}
