Issue Description
AcmeRenewal task places a full ACME order unconditionally (no freshness gate, no per-domain dedup), causing duplicate issuance until Let’s Encrypt’s duplicate-certificate limit trips
Version: Stalwart v0.16.5 (Docker, RocksDB store)
Challenge type: HTTP-01, Let’s Encrypt production
Summary
Executing an AcmeRenewal task always performs a complete ACME issuance — there is no check against the stored certificate’s validity, and externally-created tasks for the same domain are not coalesced. Any management automation (or operator) that enqueues AcmeRenewal tasks more than very occasionally will re-issue identical certificates until Let’s Encrypt’s duplicate-certificate limit (5 per exact SAN set per 168h) trips, after which the domain cannot obtain a certificate at all until the window slides — including legitimate first issuance after a rebuild.
Where (v0.16.5 source)
- `crates/common/src/network/acme/renew.rs`: `Server::acme_renew()` resolves the Domain/provider and calls straight into `AcmeRequestBuilder::renew()`. No comparison of any stored `Certificate` object’s `not_valid_after` against `renew_before` happens on this path.
- `crates/common/src/network/acme/order.rs`: `renew()` generates a key pair and calls `self.new_order(...)` unconditionally.
- Task layer: each `x:Task/set {"@type":"AcmeRenewal", domainId}` creates an independent task; a failed one goes into its own retry schedule. There is no “an AcmeRenewal for this domain is already pending” coalescing, so callers that fire periodically build up a queue of N tasks that all execute (or all retry) independently.
What we observed in production
Our platform’s reconciler had its own bug (it fired AcmeRenewal every 30 minutes believing the task was idempotent on certificate freshness — we’ve since fixed our side to gate on stored-cert presence + pending-task dedup). The Stalwart-side consequences were striking:
- 23 duplicate stored certificates for the same single-SAN hostname accumulated in the store (each task execution = one full successful issuance, ~5/week as LE allowed them).
- Once the duplicate limit tripped, every further execution failed with HTTP 429; because each enqueued task retries independently, we accumulated 97 queued `AcmeRenewal` tasks whose retry timestamps all landed in the same ~10-minute window: a thundering herd guaranteed to re-trip the limit the moment it opened.
- On a freshly rebuilt cluster the same hostname then could not get its first certificate at all (“too many certificates (5) already issued for this exact set of identifiers in the last 168h”), because the prior cluster had burned the week’s budget invisibly. A fresh ACME account doesn’t help; the limit is SAN-set-scoped.
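To illustrate why the herd re-trips the limit, here is a minimal Rust sketch of a sliding-window limit of 5 issuances per 168h for one exact SAN set. It is a toy model under our own assumptions, not Let’s Encrypt’s implementation; `DuplicateWindow`, `try_issue` and `replay_herd` are hypothetical names.

```rust
use std::collections::VecDeque;

// Toy model only: 5 issuances per exact SAN set per 168 hours.
const LIMIT: usize = 5;
const WINDOW_SECS: u64 = 168 * 3600;

/// Issuance timestamps (Unix seconds) that still count against the limit.
#[derive(Default)]
struct DuplicateWindow {
    issued_at: VecDeque<u64>,
}

impl DuplicateWindow {
    /// Attempt an issuance at `now`. Returns true on success and
    /// false for a simulated HTTP 429.
    fn try_issue(&mut self, now: u64) -> bool {
        // Drop issuances that have slid out of the 168h window.
        while let Some(&t) = self.issued_at.front() {
            if now >= t + WINDOW_SECS {
                self.issued_at.pop_front();
            } else {
                break;
            }
        }
        if self.issued_at.len() < LIMIT {
            self.issued_at.push_back(now);
            true
        } else {
            false
        }
    }
}

/// Replay `tasks` independent retries spread evenly over `spread_secs`,
/// starting at `start`, and count how many are accepted.
fn replay_herd(window: &mut DuplicateWindow, start: u64, tasks: u64, spread_secs: u64) -> usize {
    (0..tasks)
        .filter(|&i| window.try_issue(start + i * spread_secs / tasks))
        .count()
}
```

In this model, with five issuances one day apart already in the window, replaying 97 retries across 600 seconds at the moment the oldest issuance expires lets exactly one retry through. That retry consumes the only free slot, so a legitimate request an hour later is rejected again.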
Suggestions
- Freshness gate in `acme_renew()`: if a stored `Certificate` covers the requested SAN set and `not_valid_after - renew_before > now`, complete the task as a no-op (or with an explicit “not due” result). This makes the task safely idempotent, which is what callers naturally assume of a “renewal” primitive.
- Per-domain coalescing: creating an `AcmeRenewal` while one is already pending/retrying for the same `domainId` could return the existing task instead of enqueuing a duplicate.
- Rate-limit-aware backoff: on a 429 with `Retry-After`, collapsing the domain’s queued renewals into a single retry at the indicated time would avoid the herd.
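The three suggestions above could look roughly like the following Rust sketch. Everything here is illustrative: `StoredCert`, `renewal_due` and `RenewalQueue` are hypothetical types and functions, not Stalwart’s actual structures, and `Retry-After` is assumed to arrive as a number of seconds (it can also be an HTTP date).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of a stored certificate.
struct StoredCert {
    sans: Vec<String>,
    not_valid_after: u64, // Unix seconds
}

/// Freshness gate: renewal is due unless some stored certificate covers
/// every requested SAN and `not_valid_after - renew_before > now`.
fn renewal_due(certs: &[StoredCert], sans: &[String], renew_before: u64, now: u64) -> bool {
    !certs.iter().any(|c| {
        sans.iter().all(|s| c.sans.contains(s))
            && c.not_valid_after.saturating_sub(renew_before) > now
    })
}

/// Hypothetical queue holding at most one pending renewal per domain.
#[derive(Default)]
struct RenewalQueue {
    pending: HashMap<String, (u64, u64)>, // domain id -> (task id, next attempt at)
    next_id: u64,
}

impl RenewalQueue {
    /// Coalescing enqueue: return the existing task id if a renewal
    /// for this domain is already pending, otherwise create one.
    fn enqueue(&mut self, domain_id: &str, now: u64) -> u64 {
        if let Some(&(id, _)) = self.pending.get(domain_id) {
            return id;
        }
        self.next_id += 1;
        self.pending.insert(domain_id.to_string(), (self.next_id, now));
        self.next_id
    }

    /// Rate-limit-aware backoff: on a 429, reschedule the single pending
    /// task for this domain to `now + retry_after_secs`.
    fn on_rate_limited(&mut self, domain_id: &str, now: u64, retry_after_secs: u64) {
        if let Some(entry) = self.pending.get_mut(domain_id) {
            entry.1 = now + retry_after_secs;
        }
    }

    /// When the pending renewal for this domain will next be attempted.
    fn next_attempt(&self, domain_id: &str) -> Option<u64> {
        self.pending.get(domain_id).map(|&(_, at)| at)
    }
}
```

With a gate like `renewal_due` in front of the order, a caller that fires every 30 minutes would get a no-op until the certificate enters its renewal window. With a coalescing `enqueue`, the 97-task herd we saw would collapse to one task per domain.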
Stalwart Version
v0.16.x
Installation Method
Docker
Database Backend
RocksDB
Blob Storage
RocksDB
Search Engine
Internal
Directory Backend
Internal