AcmeRenewal task places a full ACME order unconditionally

Issue Description

AcmeRenewal task places a full ACME order unconditionally — no freshness gate, no per-domain dedup → duplicate issuance until Let’s Encrypt’s duplicate-certificate limit

Version: Stalwart v0.16.5 (Docker, RocksDB store)
Challenge type: HTTP-01, Let’s Encrypt production

Summary

Executing an AcmeRenewal task always performs a complete ACME issuance — there is no check against the stored certificate’s validity, and externally-created tasks for the same domain are not coalesced. Any management automation (or operator) that enqueues AcmeRenewal tasks more than very occasionally will re-issue identical certificates until Let’s Encrypt’s duplicate-certificate limit (5 per exact SAN set per 168h) trips, after which the domain cannot obtain a certificate at all until the window slides — including legitimate first issuance after a rebuild.

Where (v0.16.5 source)

  • crates/common/src/network/acme/renew.rs: Server::acme_renew() resolves the Domain/provider and calls straight into AcmeRequestBuilder::renew(). Nothing on this path compares a stored Certificate object’s not_valid_after against renew_before.
  • crates/common/src/network/acme/order.rs: renew() generates a key pair and calls self.new_order(...) unconditionally.
  • Task layer: each x:Task/set {"@type":"AcmeRenewal", domainId} creates an independent task; a failed one goes into its own retry schedule. There is no “an AcmeRenewal for this domain is already pending” coalescing, so callers that fire periodically build up a queue of N tasks that all execute (or all retry) independently.

What we observed in production

Our platform’s reconciler had its own bug (it fired AcmeRenewal every 30 minutes believing the task was idempotent on certificate freshness — we’ve since fixed our side to gate on stored-cert presence + pending-task dedup). The Stalwart-side consequences were striking:

  • 23 duplicate stored certificates for the same single-SAN hostname accumulated in the store (each task execution = one full successful issuance, ~5/week as LE allowed them).
  • Once the duplicate limit tripped, every further execution failed with HTTP 429; because each enqueued task retries independently, we accumulated 97 queued AcmeRenewal tasks whose retry timestamps all landed in the same ~10-minute window — a thundering herd guaranteed to re-trip the limit the moment it opened.
  • On a freshly rebuilt cluster the same hostname then could not get its first certificate at all (“too many certificates (5) already issued for this exact set of identifiers in the last 168h”) — the prior cluster had burned the week’s budget invisibly. A fresh ACME account doesn’t help; the limit is SAN-set-scoped.
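
To make the last point concrete, here is a simplified model of the limit as described in this report (5 issuances per exact SAN set per 168h). It is not Let’s Encrypt’s implementation; the constants and the may_issue function are illustrative names. The history is keyed by SAN set only, which is why neither a rebuilt cluster nor a fresh ACME account resets it.

```rust
/// Window length from the report: 168 hours, in seconds.
pub const WINDOW_SECS: u64 = 168 * 3600;
/// Issuances allowed per exact SAN set within the window (from the report).
pub const LIMIT: usize = 5;

/// Returns true if another certificate for the same SAN set may be issued at `now`.
/// `previous_issuances` holds Unix timestamps of earlier issuances for that SAN set,
/// regardless of which cluster or ACME account requested them.
pub fn may_issue(previous_issuances: &[u64], now: u64) -> bool {
    let in_window = previous_issuances
        .iter()
        .filter(|&&t| t <= now && now - t < WINDOW_SECS)
        .count();
    in_window < LIMIT
}
```

With five issuances inside the window, may_issue stays false until the oldest one ages out, no matter who asks.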

Suggestions

  1. Freshness gate in acme_renew(): if a stored Certificate covers the requested SAN set and not_valid_after − renew_before > now, complete the task as a no-op (or with an explicit “not due” result). This makes the task safely idempotent, which is what callers naturally assume of a “renewal” primitive.
  2. Per-domain coalescing: creating an AcmeRenewal while one is already pending/retrying for the same domainId could return the existing task instead of enqueuing a duplicate.
  3. Rate-limit-aware backoff: on a 429 with Retry-After, collapsing the domain’s queued renewals into a single retry at the indicated time would avoid the herd.
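
A minimal sketch of suggestions 1 and 2, to show the intended semantics rather than a patch. All types here (StoredCert, RenewalDecision, RenewalQueue) are hypothetical stand-ins, not Stalwart’s actual structures, and timestamps are plain Unix seconds to keep the example dependency-free.

```rust
use std::collections::{BTreeSet, HashMap};
use std::time::Duration;

/// Hypothetical, simplified view of a stored certificate (not Stalwart's type).
#[derive(Debug, Clone)]
pub struct StoredCert {
    pub sans: BTreeSet<String>,
    /// Expiry, in seconds since the Unix epoch.
    pub not_valid_after: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RenewalDecision {
    /// A stored certificate covers the SAN set and is not yet inside the renewal window.
    NotDue,
    /// No covering certificate exists, or it is inside the renewal window.
    PlaceOrder,
}

/// Suggestion 1: complete as a no-op while not_valid_after - renew_before > now.
pub fn renewal_decision(
    stored: &[StoredCert],
    requested_sans: &BTreeSet<String>,
    renew_before: Duration,
    now: u64,
) -> RenewalDecision {
    let fresh = stored.iter().any(|cert| {
        requested_sans.is_subset(&cert.sans)
            && cert.not_valid_after.saturating_sub(renew_before.as_secs()) > now
    });
    if fresh {
        RenewalDecision::NotDue
    } else {
        RenewalDecision::PlaceOrder
    }
}

/// Suggestion 2: hypothetical queue that keeps at most one pending renewal per domain.
#[derive(Default)]
pub struct RenewalQueue {
    pending: HashMap<String, u64>, // domainId -> task id
    next_id: u64,
}

impl RenewalQueue {
    /// Returns (task id, created). If a renewal is already pending for the
    /// domain, its id is returned and nothing new is enqueued.
    pub fn enqueue(&mut self, domain_id: &str) -> (u64, bool) {
        if let Some(&id) = self.pending.get(domain_id) {
            return (id, false);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.pending.insert(domain_id.to_string(), id);
        (id, true)
    }

    /// Marks the domain's task as finished so a later renewal can be queued.
    pub fn complete(&mut self, domain_id: &str) {
        self.pending.remove(domain_id);
    }
}
```

With both in place, a caller that fires every 30 minutes would get NotDue for a fresh certificate and the same task id for repeated enqueues, instead of one ACME order per call.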

Stalwart Version

v0.16.x

Installation Method

Docker

Database Backend

RocksDB

Blob Storage

RocksDB

Search Engine

Internal

Directory Backend

Internal

The source code confirms the behavior described. The acme_renew() function proceeds directly to AcmeRequestBuilder::renew() without checking whether the stored certificate is still valid, and task_set creates independent task entries without any coalescing for duplicate AcmeRenewal tasks against the same domainId. This is a genuine bug report with solid source-level evidence.


Thank you for the detailed and well-researched bug report, phoenixtechnam. The source code confirms all three of your findings:

  1. No freshness gate: acme_renew() in crates/common/src/network/acme/renew.rs resolves the domain/provider and immediately calls AcmeRequestBuilder::renew() – there is no comparison of the stored certificate’s not_valid_after against renew_before before placing the ACME order.

  2. No per-domain coalescing: task_set in crates/jmap/src/registry/mapping/task.rs validates foreign keys and schedules each AcmeRenewal task independently via batch.schedule_task_with_id(...) – there is no check for an already-pending AcmeRenewal for the same domainId.

  3. Rate-limit backoff handling: AcmeError::Backoff does carry a wait duration and will schedule a single retry at the correct time, but since each independently-enqueued task has its own retry schedule, a queue of N tasks will all converge on that same window regardless.
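
For illustration, collapsing a domain’s queued renewals on a backoff could look roughly like the following. QueuedRenewal and collapse_on_backoff are hypothetical names for this sketch and do not reflect how Stalwart stores tasks; the wait duration stands in for the value carried by the backoff error.

```rust
use std::time::Duration;

/// Hypothetical queued retry entry; not Stalwart's task representation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QueuedRenewal {
    pub domain_id: String,
    /// Next attempt, in seconds since the Unix epoch.
    pub retry_at: u64,
}

/// Replaces every queued renewal for `domain_id` with a single entry due at
/// `now + retry_after`. Entries for other domains are left untouched.
pub fn collapse_on_backoff(
    queue: &mut Vec<QueuedRenewal>,
    domain_id: &str,
    now: u64,
    retry_after: Duration,
) {
    queue.retain(|task| task.domain_id != domain_id);
    queue.push(QueuedRenewal {
        domain_id: domain_id.to_string(),
        retry_at: now.saturating_add(retry_after.as_secs()),
    });
}
```

Applied to the reported case, the 97 queued tasks for one hostname would shrink to a single retry at the indicated time rather than 97 retries in the same window.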

Your three suggested mitigations (freshness gate making the task idempotent, per-domain creation dedup, and collapsing queued retries on 429) are all technically sound. This is a confirmed upstream issue that needs to be addressed in the server code. A human staff member will follow up here.

As a short-term operational workaround while waiting for a fix: gate your reconciler (as you have already done) on both stored-cert presence and an explicit query for existing pending AcmeRenewal tasks for the domainId before enqueuing – the JMAP task query API can filter by @type and domainId to detect an already-pending task.
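
A rough sketch of that reconciler-side gate follows. The RenewalState trait and InMemoryState are hypothetical; a real reconciler would back the two lookups with whatever certificate and task queries it already uses, which are deliberately not shown here. The sketch enqueues only when no certificate is stored, which assumes renewal of existing certificates is handled elsewhere.

```rust
use std::collections::HashSet;

/// Hypothetical view of the server state a reconciler can query.
pub trait RenewalState {
    /// True if a certificate is already stored for the domain.
    fn has_stored_cert(&self, domain_id: &str) -> bool;
    /// True if an AcmeRenewal task is already pending or retrying for the domain.
    fn has_pending_renewal(&self, domain_id: &str) -> bool;
}

/// Enqueue only when there is no stored certificate and no pending renewal.
pub fn should_enqueue<S: RenewalState>(state: &S, domain_id: &str) -> bool {
    !state.has_stored_cert(domain_id) && !state.has_pending_renewal(domain_id)
}

/// In-memory stand-in used to exercise the gate; assumption for illustration only.
#[derive(Default)]
pub struct InMemoryState {
    pub stored_certs: HashSet<String>,
    pub pending_renewals: HashSet<String>,
}

impl RenewalState for InMemoryState {
    fn has_stored_cert(&self, domain_id: &str) -> bool {
        self.stored_certs.contains(domain_id)
    }

    fn has_pending_renewal(&self, domain_id: &str) -> bool {
        self.pending_renewals.contains(domain_id)
    }
}
```

Checking both conditions matters: the pending-task check alone still allows one full issuance per reconciler run once the previous task has completed.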

This is an automated reply from the Stalwart Help Bot. Other community members may follow up if this answer is incomplete or wrong.

Fixed. The fix will be included in v0.16.9.