Stalwart does not respect MX priority (maybe reversed logic?)

Issue Description

The recipient’s DNS has four MX records: two with priority 10 and two with priority 20. The servers with priority 20 are offline.

The MX servers use self-signed certificates, so Stalwart's first attempt fails and a later one succeeds. In the meantime, however, it makes several attempts to connect to the priority 20 MX hosts, which are not available.

The problem is that Stalwart connects to the MX hosts with priority 20 before trying those with priority 10.

Expected Behavior

Try the priority 10 MX hosts before those with numerically higher values, which are actually lower priority.

Actual Behavior

Stalwart first connects to an MX with priority 20 (mail3.redacted.tld) instead of one with priority 10. Then on retry, it connects to another MX with priority 20 (mail4.redacted.tld).

Reproduction Steps

Send an email to a domain with MX records of different priorities.

Relevant Log Output

2026-05-11T08:49:48Z INFO Error fetching TLSA record (dane.tlsa-record-fetch-error) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail3.redacted.tld", causedBy = DNS error (mail-auth.dns-error) { details = "DNSSEC Negative Record Response for _25._tcp.mail3.redacted.tld. IN TLSA, Bogus" }, strict = false, elapsed = 106ms
2026-05-11T08:52:03Z INFO Connection error (delivery.connect-error) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail3.redacted.tld", localIp = 0.0.0.0, remoteIp = 163.159.68.233, remotePort = 25, causedBy = SMTP error occurred (smtp.error) { details = "I/O Error", reason = "Connection timed out (os error 110)" }, elapsed = 134879ms
2026-05-11T08:52:03Z INFO Error fetching TLSA record (dane.tlsa-record-fetch-error) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail4.redacted.tld", causedBy = DNS error (mail-auth.dns-error) { details = "DNSSEC Negative Record Response for _25._tcp.mail4.redacted.tld. IN TLSA, Bogus" }, strict = false, elapsed = 28ms
2026-05-11T08:54:18Z INFO Connection error (delivery.connect-error) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail4.redacted.tld", localIp = 0.0.0.0, remoteIp = 163.159.68.236, remotePort = 25, causedBy = SMTP error occurred (smtp.error) { details = "I/O Error", reason = "Connection timed out (os error 110)" }, elapsed = 135112ms
2026-05-11T08:56:18Z INFO Error fetching TLSA record (dane.tlsa-record-fetch-error) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail2.redacted.tld", causedBy = DNS error (mail-auth.dns-error) { details = "DNSSEC Negative Record Response for _25._tcp.mail2.redacted.tld. IN TLSA, Bogus" }, strict = false, elapsed = 29ms
2026-05-11T08:56:18Z INFO Connecting to remote server (delivery.connect) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail2.redacted.tld", localIp = 0.0.0.0, remoteIp = 84.39.211.236, remotePort = 25, elapsed = 26ms
2026-05-11T08:56:19Z INFO SMTP STARTTLS command (delivery.start-tls) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, domain = "redacted.tld", hostname = "mail2.redacted.tld", version = "TLSv1_3", details = "TLS13_AES_256_GCM_SHA384", elapsed = 83ms
2026-05-11T08:56:23Z INFO Message delivered (delivery.delivered) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, hostname = "mail2.redacted.tld", to = "[email protected]", code = 250, details = "Ok: queued as 4gDYVY3WMBz1C7k", elapsed = 2253ms
2026-05-11T08:56:23Z INFO DSN success notification (delivery.dsn-success) queueId = 306623311685615616, queueName = "remote", from = "[email protected]", to = ["[email protected]"], size = 1296, total = 1, to = "[email protected]", hostname = "mail2.redacted.tld", code = 250, details = "Ok"

Stalwart Version

v0.16.x

Installation Method

Docker

Database Backend

RocksDB

Blob Storage

RocksDB

Search Engine

Internal

Directory Backend

Internal

Let me look into how Stalwart handles MX record sorting and priority.

Now let me check how mail_auth::MX is sorted when returned from mx_lookup:

The MX sorting is handled upstream in the mail_auth library. Let me check how the MX struct is returned from that library and whether Stalwart sorts it.

I checked the mail_auth crate too and the logic there seems OK.

Maybe it tries the others due to invalid TLS certs?

Hello, this is very likely. How does it behave if you tell it to use the invalid-tls route from the very first attempt?

Instead of asking it to use a different strategy, I configured Stalwart to accept invalid certs on the default strategy (a couple of hours ago).

But it still chose mail4 after a while. I’m monitoring the logs and will provide more info if I find a pattern.

FWIW:

  • day started with mail1 and mail2 (correct, priority 10)
  • then, mail triggered by GitLab was attempted using mail4.redacted.tld for no apparent reason

Is there anything I should focus on in the log file?

The MX sort by preference is straightforward, and there is no per-message state that blocklists a host after a failure: each new delivery attempt re-resolves and starts from preference 10 again. So a clean “always tries 20 first” would be a real regression, but the log slice you posted is consistent with the prio-10 hosts having transiently failed DNS or MX resolution for that one attempt and Stalwart falling through.
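To make the expected ordering concrete, here is a minimal Python sketch of preference-based MX selection. It is not Stalwart's or mail_auth's actual code; the function name `order_mx` and the shuffling of equal-preference hosts are assumptions for illustration, following the usual convention that lower preference numbers are tried first.

```python
import random
from itertools import groupby


def order_mx(records, rng=random):
    """Return hostnames in the order a sender would try them.

    records: iterable of (preference, hostname) tuples.
    Lower preference numbers are tried first; hosts that share a
    preference are shuffled so load is spread among them.
    """
    ordered = []
    by_pref = sorted(records, key=lambda r: r[0])
    for _, group in groupby(by_pref, key=lambda r: r[0]):
        hosts = [host for _, host in group]
        rng.shuffle(hosts)
        ordered.extend(hosts)
    return ordered


# MX set as described in the issue: two hosts at 10, two at 20.
records = [
    (20, "mail3.redacted.tld"),
    (10, "mail1.redacted.tld"),
    (20, "mail4.redacted.tld"),
    (10, "mail2.redacted.tld"),
]
order = order_mx(records)
print(order)
```

With correct records, mail3 and mail4 can only appear after both priority 10 hosts. If every record carried the same preference, any of the four hosts could legitimately come first.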

The events to focus on for the next bad delivery: grep for the queueId of one bad message, and pull the full sequence from MxLookup (which logs the resolved MX list and order) through every delivery.connect and delivery.connect-error until either delivered or rescheduled. If MxLookup shows mail1/mail2 in the result but no delivery.connect for them, that’s a real ordering bug. If MxLookup returns only mail3/mail4 (DNS truncation, SERVFAIL on a particular resolver), the issue is upstream of Stalwart. The TLSA-bogus events you are already seeing for prio-20 don’t gate the connect attempt in non-strict mode, so they are noise here.
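A small script can pull that sequence out of the log. The sketch below assumes the line layout shown in the log output above (timestamp first, event id in parentheses, `queueId = ...` and `hostname = "..."` fields); the helper names `trace_queue_id` and `connect_order` are made up for this example, and the sample lines are abbreviated copies of the posted log.

```python
import re

# First parenthesised dotted token on a line is the event id,
# e.g. (delivery.connect-error).
EVENT_RE = re.compile(r"\(([a-z0-9-]+(?:\.[a-z0-9-]+)+)\)")
HOST_RE = re.compile(r'hostname = "([^"]+)"')
QUEUE_RE = re.compile(r"queueId = (\d+)")


def trace_queue_id(lines, queue_id):
    """Return (timestamp, event, hostname) for every line of one message."""
    trace = []
    for line in lines:
        match = QUEUE_RE.search(line)
        if not match or match.group(1) != str(queue_id):
            continue
        timestamp = line.split(" ", 1)[0]
        event = EVENT_RE.search(line)
        host = HOST_RE.search(line)
        trace.append((
            timestamp,
            event.group(1) if event else None,
            host.group(1) if host else None,
        ))
    return trace


def connect_order(trace):
    """Hosts in the order connections were attempted (ok or failed)."""
    hosts = []
    for _, event, host in trace:
        if event and event.startswith("delivery.connect") and host not in hosts:
            hosts.append(host)
    return hosts


sample = [
    '2026-05-11T08:52:03Z INFO Connection error (delivery.connect-error) '
    'queueId = 306623311685615616, hostname = "mail3.redacted.tld", remotePort = 25',
    '2026-05-11T08:54:18Z INFO Connection error (delivery.connect-error) '
    'queueId = 306623311685615616, hostname = "mail4.redacted.tld", remotePort = 25',
    '2026-05-11T08:56:18Z INFO Connecting to remote server (delivery.connect) '
    'queueId = 306623311685615616, hostname = "mail2.redacted.tld", remotePort = 25',
]
trace = trace_queue_id(sample, 306623311685615616)
print(connect_order(trace))
```

Comparing the connect order against the MX list reported at debug level for the same queueId shows whether hosts were skipped or simply never returned by the lookup.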

Should I enable more logging (e.g. debug level)?

Analyzing Friday’s log:

  • GitLab sends email to two recipients
  • Log shows two DNSSEC negative responses for mail1.redacted.tld (OK) and mail4.redacted.tld (???)
  • First email gets delivered through mail1.redacted.tld
  • The second email (to the second recipient) fails with connection timeout to mail4.redacted.tld
  • The second email is then retried through mail2.redacted.tld and succeeds.

I turned on debug-level logging to catch the MxLookup event you referenced in your post…

OK, so after enabling debug logging and restarting, I can see MX lookup logs, which look suspicious, but, read on…

I also found out the cause and where the misconfiguration lies.

The DNS MX records for the domain redacted.tld are all right. But the failed messages are all for another-redacted.tld, which is served by the same mail servers, and its MX records are incorrect: they are all listed at priority 10. I will contact the admin.
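For anyone hitting the same thing, a quick check of the recipient domain's MX set would have caught this. The sketch below parses text in the `preference hostname.` form that `dig +short MX <domain>` prints and groups hosts by preference; the function name `group_mx` and the sample hostnames for another-redacted.tld are assumptions for illustration only.

```python
from collections import defaultdict


def group_mx(dig_output):
    """Group MX hosts by preference from 'dig +short MX' style text."""
    groups = defaultdict(list)
    for line in dig_output.splitlines():
        parts = line.split()
        # Expect exactly "<preference> <hostname>"; skip anything else.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pref, host = parts
        groups[int(pref)].append(host.rstrip("."))
    return dict(sorted(groups.items()))


# Hypothetical output for the misconfigured domain: every host at 10.
sample = """\
10 mail1.redacted.tld.
10 mail2.redacted.tld.
10 mail3.redacted.tld.
10 mail4.redacted.tld.
"""
groups = group_mx(sample)
if len(groups) == 1:
    print("all MX hosts share one preference:", groups)
```

When all hosts share one preference, a sender is free to pick any of them first, so attempts against the offline backup servers are expected behaviour rather than an ordering bug.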

I’m sorry I failed to check and verify that from the start. Consider this one solved.