Prometheus event counters are exported with incorrect metric names

Issue Description

I noticed that some Prometheus metrics have values that I couldn’t explain, and after looking into it a bit, I suspect it’s because the metrics are mapped to the wrong counters.

For example, I noticed occasional increases in smtp_message_parse_failed with no other sign of any message failing to parse. If I trust the LLM analysis (see below), this metric is in fact reporting the smtp.dkim-pass counter instead.

Similarly, I noticed unexplained spikes in delivery_greeting_failed, which again may actually be counting delivery.completed.

Here’s the LLM analysis. I haven’t validated it, but at first glance it sounds like a plausible root cause. At the very least, the increases in these wrong metrics do seem to be correlated with Stalwart reporting smtp.dkim-pass or delivery.completed, which is why I thought it was worth sharing:

Prometheus event counters are exported with incorrect metric names because counter indexes are labeled using unsorted EventType::variants()

Stalwart’s metrics counter export path appears to mix two different event orderings:

  • Counters are incremented by numeric event id:
  EVENT_COUNTERS.add(event_id, 1);

crates/trc/src/ipc/metrics.rs#L72-L75

  • But counters are later exported by labeling counter index event_id as EVENT_TYPES[event_id]:
  id: EVENT_TYPES[event_id],

crates/trc/src/ipc/metrics.rs#L210-L220

  • EVENT_TYPES is defined as:
  pub(crate) static EVENT_TYPES: &[EventType] = EventType::variants();

crates/trc/src/ipc/collector.rs#L31

The problem is that EventType::variants() is not ordered by EventType::to_id(). Therefore the counter value stored at numeric event id N is exported with the name of EventType::variants()[N], which may be a different event.
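
To make the suspected mismatch concrete, here is a minimal, self-contained sketch. It does not use Stalwart's code: the `Event` enum, its three variants, the id assignments, and the helper functions are all hypothetical stand-ins, chosen only so that declaration order and numeric id disagree, which is what the analysis assumes about `EventType::variants()` and `EventType::to_id()`.

```rust
/// Hypothetical stand-in for `EventType`; the real enum has many more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Event {
    DkimPass,
    MessageParseFailed,
    DeliveryCompleted,
}

impl Event {
    /// Assumption: the variant list is in declaration order,
    /// which does not have to match the numeric ids below.
    const VARIANTS: &'static [Event] = &[
        Event::DkimPass,
        Event::MessageParseFailed,
        Event::DeliveryCompleted,
    ];

    /// Made-up ids, deliberately out of step with declaration order.
    fn to_id(self) -> usize {
        match self {
            Event::DkimPass => 1,
            Event::MessageParseFailed => 0,
            Event::DeliveryCompleted => 2,
        }
    }

    /// Inverse of `to_id`; returns `None` for an unknown id.
    fn from_id(id: usize) -> Option<Event> {
        Event::VARIANTS.iter().copied().find(|e| e.to_id() == id)
    }
}

/// Counters are stored at the slot given by the numeric event id.
fn record(counters: &mut [u64], event: Event) {
    counters[event.to_id()] += 1;
}

/// Suspected buggy export: slot `i` is labeled with `VARIANTS[i]`.
fn export_by_position(counters: &[u64]) -> Vec<(Event, u64)> {
    counters
        .iter()
        .enumerate()
        .filter(|(_, value)| **value > 0)
        .map(|(i, value)| (Event::VARIANTS[i], *value))
        .collect()
}

/// Export in the style of the suggested fix: slot `i` is labeled via `from_id(i)`.
fn export_by_id(counters: &[u64]) -> Vec<(Event, u64)> {
    counters
        .iter()
        .enumerate()
        .filter(|(_, value)| **value > 0)
        .filter_map(|(i, value)| Event::from_id(i).map(|e| (e, *value)))
        .collect()
}
```

In this toy setup, recording a single `Event::DkimPass` makes `export_by_position` report one `MessageParseFailed`, while `export_by_id` reports one `DkimPass`. That is the same shape as the symptom described at the top of this post.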

This affects Prometheus metrics because the Prometheus exporter uses Collector::collect_counters():

crates/common/src/telemetry/metrics/prometheus.rs#L30-L35

It also appears to affect OpenTelemetry counter export, which uses the same method:

crates/common/src/telemetry/metrics/otel.rs#L25-L36

Suggested fix

Use EventType::from_id(event_id as u16) when converting a counter index back to an event type:

  diff --git a/crates/trc/src/ipc/metrics.rs b/crates/trc/src/ipc/metrics.rs
  @@
   use ipc::{
  -    collector::{Collector, EVENT_TYPES, GlobalInterests},
  +    collector::{Collector, GlobalInterests},
       subscriber::Interests,
   };
  @@
                   if value > 0 {
                       Some(EventCounter {
  -                        id: EVENT_TYPES[event_id],
  +                        id: EventType::from_id(event_id as u16)?,
                           value,
                       })
                   } else {
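
Since the analysis above is unvalidated, one way to confirm or rule out its central claim would be to check whether each variant's position matches its id. The helper below is a hypothetical, generic sketch and not part of Stalwart; inside the `trc` crate it could presumably be called with `EventType::variants()` and a closure around `to_id()`, but that wiring is an assumption on my part.

```rust
/// Returns the first position whose element does not map back to that
/// position under `to_id`, or `None` if positions and ids agree everywhere.
fn first_misordered<T: Copy>(variants: &[T], to_id: impl Fn(T) -> usize) -> Option<usize> {
    variants
        .iter()
        .enumerate()
        .find(|(i, v)| to_id(**v) != *i)
        .map(|(i, _)| i)
}
```

If such a check returns `Some(_)` for the real variant list, indexing `EVENT_TYPES` by event id would mislabel at least one counter. If it returns `None`, the root cause would have to be somewhere else.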

Stalwart Version

v0.16.x

Installation Method

Docker

Database Backend

PostgreSQL

Blob Storage

S3-compatible

Search Engine

PostgreSQL

Directory Backend

Internal

I have reviewed the documentation and FAQ and confirm that my issue is NOT addressed there.

on

I have searched this support forum (open and closed topics) and confirm this is not a duplicate.

on

I understand that topics in this category are triaged by a bot first but a human reply will follow up. If I’d prefer a human-only reply, I’ll add the no-ai tag to my topic.

on

This is a well-researched bug report. Let me verify the relevant source code to understand the issue accurately.