Issue Description
Some Prometheus metrics report values that I couldn’t explain, and after looking into it, I suspect the metrics are mapped to the wrong counters.
For example, I noticed occasional increases in smtp_message_parse_failed with no other sign of any message failing to parse. If the LLM analysis (see below) is right, this metric actually reports the counter for smtp.dkim-pass.
Similarly, I noticed unexplained spikes in delivery_greeting_failed, which may actually be counting delivery.completed.
Here’s the LLM analysis. I haven’t validated it, but at first glance it sounds like a plausible root cause. At the very least, the increases in these metrics do seem to correlate with Stalwart reporting smtp.dkim-pass or delivery.completed, which is why I thought it was worth sharing:
Prometheus event counters are exported with incorrect metric names because counter indexes are labeled using unsorted EventType::variants()
Stalwart’s metrics counter export path appears to mix two different event orderings:
- Counters are incremented by numeric event id:
EVENT_COUNTERS.add(event_id, 1);
crates/trc/src/ipc/metrics.rs#L72-L75
- But counters are later exported by labeling counter index event_id as EVENT_TYPES[event_id]:
id: EVENT_TYPES[event_id],
crates/trc/src/ipc/metrics.rs#L210-L220
- EVENT_TYPES is defined as:
pub(crate) static EVENT_TYPES: &[EventType] = EventType::variants();
crates/trc/src/ipc/collector.rs#L31
The problem is that EventType::variants() is not ordered by EventType::to_id(). Therefore the counter value stored at numeric event id N is exported with the name of EventType::variants()[N], which may be a different event.
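To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It does not use Stalwart's code: the `Event` enum, its four variants, the numeric ids, and the helper functions are all hypothetical stand-ins, chosen so that declaration order and id order disagree in the way the analysis describes.

```rust
// Hypothetical stand-in for EventType; the real enum is far larger.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Event {
    DeliveryCompleted,
    GreetingFailed,
    DkimPass,
    MessageParseFailed,
}

// Declaration order, playing the role of EventType::variants().
static VARIANTS: [Event; 4] = [
    Event::DeliveryCompleted,
    Event::GreetingFailed,
    Event::DkimPass,
    Event::MessageParseFailed,
];

impl Event {
    // Assumed numeric ids that do NOT follow declaration order.
    fn to_id(self) -> usize {
        match self {
            Event::GreetingFailed => 0,
            Event::DeliveryCompleted => 1,
            Event::MessageParseFailed => 2,
            Event::DkimPass => 3,
        }
    }

    // Inverse of to_id, playing the role of EventType::from_id().
    fn from_id(id: usize) -> Option<Event> {
        VARIANTS.iter().copied().find(|event| event.to_id() == id)
    }
}

// Counters are incremented by numeric event id.
fn record(counters: &mut [u64; 4], event: Event) {
    counters[event.to_id()] += 1;
}

// Suspected buggy export: labels index N with VARIANTS[N].
fn export_by_variants(counters: &[u64; 4]) -> Vec<(Event, u64)> {
    counters
        .iter()
        .enumerate()
        .filter_map(|(id, &value)| if value > 0 { Some((VARIANTS[id], value)) } else { None })
        .collect()
}

// Proposed export: labels index N with from_id(N).
fn export_by_id(counters: &[u64; 4]) -> Vec<(Event, u64)> {
    counters
        .iter()
        .enumerate()
        .filter_map(|(id, &value)| if value > 0 { Some((Event::from_id(id)?, value)) } else { None })
        .collect()
}

fn main() {
    let mut counters = [0u64; 4];
    record(&mut counters, Event::DkimPass);
    record(&mut counters, Event::DkimPass);
    record(&mut counters, Event::DeliveryCompleted);

    println!("labeled via variants: {:?}", export_by_variants(&counters));
    println!("labeled via from_id:  {:?}", export_by_id(&counters));
}
```

With these made-up ids, the DKIM passes are exported under the parse-failure name and the completed delivery under the greeting-failure name, which matches the symptoms I see. Whether the real ids are laid out this way is exactly the part I have not verified.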
This affects Prometheus metrics because the Prometheus exporter uses Collector::collect_counters():
crates/common/src/telemetry/metrics/prometheus.rs#L30-L35
It also appears to affect OpenTelemetry counter export, which uses the same method:
crates/common/src/telemetry/metrics/otel.rs#L25-L36
Suggested fix
Use EventType::from_id(event_id as u16) when converting a counter index back to an event type:
diff --git a/crates/trc/src/ipc/metrics.rs b/crates/trc/src/ipc/metrics.rs
@@
 use ipc::{
-    collector::{Collector, EVENT_TYPES, GlobalInterests},
+    collector::{Collector, GlobalInterests},
     subscriber::Interests,
 };
@@
 if value > 0 {
     Some(EventCounter {
-        id: EVENT_TYPES[event_id],
+        id: EventType::from_id(event_id as u16)?,
         value,
     })
 } else {
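Since the analysis is unvalidated, a quick way to confirm or refute it would be to check whether every entry of the variants list maps back to its own index. The sketch below shows the shape of such a check. It is generic and uses hypothetical (name, id) pairs rather than the real EventType, so the helper `first_mismatch` and the sample data are assumptions for illustration only.

```rust
/// Returns the first index whose entry does not map back to that index,
/// or None if the list is ordered by id.
fn first_mismatch<T>(variants: &[T], to_id: impl Fn(&T) -> usize) -> Option<usize> {
    variants
        .iter()
        .enumerate()
        .position(|(index, variant)| to_id(variant) != index)
}

fn main() {
    // Hypothetical (name, id) pairs standing in for event types.
    let ordered = [("first", 0usize), ("second", 1), ("third", 2)];
    let shuffled = [("first", 0usize), ("third", 2), ("second", 1)];

    println!("ordered:  {:?}", first_mismatch(&ordered, |entry| entry.1));
    println!("shuffled: {:?}", first_mismatch(&shuffled, |entry| entry.1));
}
```

Run against the real list, a check along these lines returning anything other than None would support the root cause described above.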
Stalwart Version
v0.16.x
Installation Method
Docker
Database Backend
PostgreSQL
Blob Storage
S3-compatible
Search Engine
PostgreSQL
Directory Backend
Internal
I have reviewed the documentation and FAQ and confirm that my issue is NOT addressed there.
on
I have searched this support forum (open and closed topics) and confirm this is not a duplicate.
on
I understand that topics in this category are triaged by a bot first but a human reply will follow up. If I’d prefer a human-only reply, I’ll add the no-ai tag to my topic.
on