Clustering - Web interface access via AWS NLB - Works at first, then resets all connections

Your question

@helpbot might just be able to resolve this for me.

I create a Network Load Balancer in AWS (because ALB operates at OSI layer 7 and does not support Proxy Protocol). I use an ACM-generated certificate for *.example.com.

I put my target group of Stalwart nodes behind it, using TLS 443, and I enable Proxy Protocol v2.

I put the appropriate IP CIDR in the Trusted Proxy box in Settings → Network.

I put the appropriate IP CIDR in the Allowed section of Settings → Security.

I create a CNAME for the NLB to mx.example.com in DNS.

I navigate to https://mx.example.com/admin and it works great! … for about one minute. Every attempt to connect after that, Stalwart simply resets the connection.

The health checks still resolve as green.

Running Qualys SSL test shows everything is responding fine, except the actual page request fails.

Restarting Stalwart has no effect.

Running tshark shows the requests arriving and preserving the client IP, plus all of the health check packets arriving from the proxy IP.

Is it possibly opening connections to multiple nodes and that’s when it falls apart?

I understand that topics in this category are triaged by a bot first but a human reply will follow up. If I’d prefer a human-only reply, I’ll add the no-ai tag to my topic.

The most likely culprit here is Stalwart’s loitering-connection auto-ban being triggered by the AWS NLB health check probes.

Here’s what is almost certainly happening:

  1. AWS NLB health checks repeatedly open TCP connections to port 443 and close them without sending any meaningful application-layer traffic (they just confirm the port is open/TLS handshakes).
  2. Stalwart’s auto-ban system tracks “loitering connections” (clients that connect but never send meaningful traffic) with a default threshold of 150 disconnections per day. NLB health checks poll frequently (often every 10-30 seconds, from each NLB node), so they can exceed this threshold within minutes to hours of the target being registered, which fits a setup that works at first and then fails.
  3. Once the NLB’s IP is banned, Stalwart drops all subsequent connections from it, which resets connections for all real users going through that NLB.
  4. A restart doesn’t help because the ban persists in storage (it’s a BlockedIp entry, not in-memory state).
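As a rough sanity check on the timing in step 2, the arithmetic can be sketched as follows. The threshold of 150 comes from the reply above; the probe intervals and the number of probing NLB nodes are assumptions for illustration, not values read from this deployment.

```python
def minutes_until_threshold(threshold: int, interval_seconds: float, probers: int = 1) -> float:
    """Minutes of probing before `threshold` idle disconnects accumulate.

    Assumes every probe ends in exactly one counted disconnect and that
    `probers` sources (for example, one NLB node per AZ) probe independently.
    """
    if threshold <= 0 or interval_seconds <= 0 or probers <= 0:
        raise ValueError("all arguments must be positive")
    disconnects_per_minute = probers * 60.0 / interval_seconds
    return threshold / disconnects_per_minute


if __name__ == "__main__":
    # Hypothetical combinations of probe interval and number of probing nodes.
    for interval in (10, 30):
        for probers in (1, 3):
            minutes = minutes_until_threshold(150, interval, probers)
            print(f"interval={interval}s probers={probers}: {minutes:.1f} min")
```

If the counter is keyed per source IP rather than summed across probers, use `probers=1` for each address.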

To confirm: Go to Settings > Security > Blocked IPs in the WebUI and look for the NLB’s IP address. If it’s there with reason: loitering, that’s your culprit.

To fix this, remove the ban entry from Blocked IPs, and then do one of the following:

  • Raise the loiterBanRate (Settings > Security > Settings > loiterBanRate) so that the health check frequency doesn’t trigger it, or
  • Add the NLB’s source IP CIDR to an allowlist so it’s never subject to the auto-ban. The NLB health checks originate from the NLB nodes themselves (the same IPs you already added to Trusted Proxy networks), so excluding those IPs from auto-ban enforcement is the cleanest solution.

The proxy trusted networks setting (which you already configured correctly) covers IP forwarding, but it does not automatically exempt those IPs from the auto-ban system. Those are two separate mechanisms.

This is an automated reply from the Stalwart Help Bot. Other community members may follow up if this answer is incomplete or wrong. If you found it useful, mention @helpbot in any reply to ask follow-up questions and the bot will respond again.

The most likely cause of your issue is Stalwart’s auto-ban mechanism triggering on the NLB health check traffic – and possibly also the NLB health check IPs not being in your AllowedIPs list, or the wrong CIDR being used.

Here is what to check:

1. NLB health checks are triggering a loitering or scan ban

AWS NLB health checks open TCP connections rapidly and repeatedly. Even with Proxy Protocol v2 enabled, NLB health checks may not send HTTP traffic – they just open and close TCP connections. Stalwart’s auto-ban classifies clients that “repeatedly connect but never send meaningful traffic” as loitering (default threshold: 150 disconnections/day), and will ban the originating IP. Since the NLB presents the health-check source IP (not the NLB’s forwarding CIDR) to Stalwart, if that IP is not in your AllowedIPs list, it will accumulate loitering events and get banned.

The fix: Go to Settings > Security > Allowed IPs and add the specific IP ranges used by the NLB health check probes (these are the NLB node IPs, not the NLB’s forwarding CIDR – they can be found in the EC2 console under the NLB’s network interfaces). The docs describe AllowedIp records as bypassing both rate limits and auto-banning.
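One way to double-check that every NLB node IP is actually covered by the CIDRs entered in the allow list is a quick offline comparison. This is a minimal sketch using Python's standard `ipaddress` module; the addresses and CIDRs below are made-up placeholders, so substitute the ENI addresses from the EC2 console and the ranges configured in Stalwart.

```python
import ipaddress


def uncovered_ips(ips, cidrs):
    """Return the IPs from `ips` that fall outside every network in `cidrs`."""
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]
    missing = []
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixed lists are handled without extra checks.
        if not any(addr in net for net in networks):
            missing.append(ip)
    return missing


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    nlb_node_ips = ["10.0.1.17", "10.0.2.44", "10.0.3.9"]
    allowed_cidrs = ["10.0.1.0/24", "10.0.2.0/24"]
    print(uncovered_ips(nlb_node_ips, allowed_cidrs))
```

Any address printed here would still be subject to rate limits and auto-banning.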

2. Check Settings > Security > Blocked IPs

After the connection resets begin, look in Settings > Security > Blocked IPs in the WebUI to see if any NLB or client IPs have been automatically banned. This will confirm whether auto-banning is the culprit.

3. Verify that the NLB CIDR is configured as the trusted proxy

The proxy-protocol docs note: “A mismatch (only one side sending or accepting the header) will break connections silently or expose the proxy’s address instead of the client’s.” If Stalwart is receiving a Proxy Protocol header but treating the source as untrusted (wrong CIDR), it would fall back to the NLB’s IP as the client address – which may hit rate limits faster than expected.

4. For NLB + TLS termination at the NLB

If TLS is being terminated at the NLB (using ACM), the NLB is establishing a plain TCP (or re-encrypted TLS) connection to your Stalwart nodes. Make sure the listener on Stalwart is configured to accept the appropriate protocol (TLS or plain HTTP) for what the NLB actually forwards. A TLS mismatch at the backend would cause resets after the handshake.

@helpbot I stated in the original topic that the CIDR for the proxies was already added to the Allowed list. They are not in the blocked list.

A mismatch of protocols would break the connections immediately, not allow them to work temporarily.

I am now also receiving this in my logs, most likely from the health checks, though “localIp = ::” would suggest IPv6, which is not in use.

listenerId = "https", localIp = ::, localPort = 443, tls = true, reason = "end of stream"

Any other suggestions?

The localIp = :: in the log entry is a strong diagnostic signal. That :: (the IPv6 unspecified address) is the local bind address of the listener – meaning the HTTPS listener is bound to [::]:443, which on Linux with dual-stack sockets is normal and means “listen on all interfaces, both IPv4 and IPv6.” The :: in localIp is just showing which end of the socket Stalwart is bound to, not the client’s IP. That part of the log is expected and not a problem.

The key detail in that log line is reason = "end of stream". This means the TCP connection was established (and TLS handshaked, since tls = true is logged) but the client closed the connection without sending any HTTP request data – the stream ended before any application data arrived. This is exactly what AWS NLB TLS health checks do: they complete the TLS handshake to confirm the backend is alive, then immediately close the connection without sending an HTTP request.
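To see how often these events occur and on which listener, the log lines can be tallied with a small script. This is a sketch that assumes the `key = value, key = "quoted value"` layout of the single line quoted above; real log lines may carry extra prefixes such as timestamps and levels, which the pattern simply skips.

```python
import re
from collections import Counter

# Matches `key = value`, where value is either a double-quoted string
# or a bare token that runs up to the next comma or whitespace.
FIELD = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|[^,\s]+)')


def parse_fields(line):
    """Extract key/value pairs from one log line into a dict of strings."""
    fields = {}
    for key, raw in FIELD.findall(line):
        fields[key] = raw[1:-1] if raw.startswith('"') else raw
    return fields


def count_by_reason(lines):
    """Count log lines per `reason` field, ignoring lines without one."""
    counts = Counter()
    for line in lines:
        reason = parse_fields(line).get("reason")
        if reason is not None:
            counts[reason] += 1
    return counts


if __name__ == "__main__":
    sample = 'listenerId = "https", localIp = ::, localPort = 443, tls = true, reason = "end of stream"'
    print(parse_fields(sample))
    print(count_by_reason([sample, sample]))
```

Comparing the count per minute against the configured health check interval gives a quick indication of whether the events are all health checks or include real client traffic.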

Since you’ve confirmed:

  • The NLB CIDR is already in Allowed IPs (bypasses auto-ban)
  • The NLB IPs are not appearing in Blocked IPs

…then the connection resets after ~1 minute are almost certainly not an auto-ban issue. A few other things to investigate:

1. NLB connection draining / idle timeout interaction

AWS NLBs have a connection idle timeout (350 seconds by default for TCP listeners, and a fixed 350 seconds for TLS listeners). If TLS termination is happening at the NLB and it re-encrypts to the backend, the NLB may be reusing or draining backend connections in a way that causes Stalwart’s HTTP/2 or keep-alive sessions to break when the NLB rotates or rebalances its backend pool. Check your NLB target group’s deregistration delay and connection draining settings.

2. Are sessions being routed to different cluster nodes?

The admin WebUI uses session tokens. If the NLB is round-robin routing and your cluster nodes don’t share session state for the HTTP admin interface, requests after the initial login may land on a different node that doesn’t recognize the session, causing it to reject the connection. Check whether you have sticky sessions (session affinity) enabled on the NLB target group – for the admin WebUI behind a multi-node cluster, this is typically required.

3. Verify the listener bind address and TLS configuration

In Settings > Network > Listeners, check what bind address your https listener uses. If it’s bound to a specific IPv4 address rather than [::]:443 or 0.0.0.0:443, and the NLB is connecting from an address that maps to a different interface, some connections may be reaching nodes that aren’t properly handling them.

The docs note that binding to [::]:port listens on all interfaces (both IPv4 and IPv6), which is the typical correct setting for a node behind a load balancer.

@helpbot Two issues with stickiness. To test the setup, only one node was added, so it’s not possible to hit a different node at the moment. Also, attempting to enable stickiness results in:

InvalidConfigurationRequestException: You cannot enable stickiness on target groups with the TLS protocol

Since you’ve confirmed only one node is in the target group and IPs are allowed (not banned), the issue is narrowed considerably. The "end of stream" log with a single node rules out session-affinity/multi-node problems. Let me check if there’s something specific to how the NLB TLS health check interacts with Stalwart’s HTTP/2 or request handling.

This is very helpful. Since it’s a single-node setup, session routing across nodes is ruled out. The LimiterResult concurrency check is interesting – if the concurrency limit is hit, Stalwart silently drops the connection. Let me check the loitering/limiter behavior more closely.

The source code shows a ConcurrencyLimiter – let me look at that to understand if exhausted connection slots could cause resets.

Good - now I have a clear picture of the concurrency limiter. The limiter is per-listener with a default of 8192. The InFlight counter only decrements when the connection drops. So NLB health checks that TLS-handshake and then immediately close do consume a slot briefly. But with a default of 8192, this shouldn’t be the bottleneck unless connections are somehow not being cleaned up.

Let me look at what happens specifically with the proxy protocol path when a connection closes early (end of stream during proxy header read).

@helpbot While you finish your thought process, I am deregistering the node from the balancer, and will give the server time to expire out connections. If I can start it back up later and resume normal functionality, it will prove we are hitting a limit.

The maxConnections limit on the HTTPS listener is the most likely culprit given your single-node setup and ~1-minute failure window.

Looking at the source code for the concurrency limiter: when maxConnections is reached, Stalwart silently drops new incoming connections (returns None from build_session, which means the TCP connection is accepted at the kernel level but immediately dropped at the application level – resulting in a connection reset). The default maxConnections for a listener is 8192, but NLB health checks establish and hold TLS connections. If health checks are opening connections faster than they’re closing them, and real connections are also being held open (HTTP keep-alive, browser sessions, etc.), you could exhaust the per-listener concurrency cap.
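The behavior described here can be modeled with a small in-flight counter. This is an illustrative sketch only, not Stalwart's implementation: the class and method names are invented, and the limit of 2 in the demo is chosen just to make the effect visible.

```python
import threading


class ConcurrencyLimiter:
    """Toy model of a per-listener cap on concurrent connections."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Claim a slot; return False (connection dropped) if the cap is reached."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free a slot when a connection closes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    @property
    def in_flight(self):
        with self._lock:
            return self._in_flight


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_concurrent=2)
    print([limiter.try_acquire() for _ in range(3)])  # third attempt is refused
    limiter.release()                                 # one connection closes
    print(limiter.try_acquire())                      # a slot is available again
```

The point of the model is that slots only come back on `release()`: connections that hang instead of closing keep their slot, and new connections are refused once the cap is reached.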

What to check:

  1. In the WebUI under Settings > Network > Listeners, look at the maxConnections value on your https listener.
  2. Also check the server-wide maxConnections on SystemSettings (Settings > Network > General).
  3. When the failure occurs, look in your logs for a Limit(ConcurrentConnection) event – this would confirm the concurrency cap is being hit.

What deregistering from the NLB will prove: If removing the node stops the health check floods and connections drain, and service resumes, it points to either the concurrency limiter being exhausted by health check connection accumulation, or a proxy protocol parse error on health check connections causing them to hang rather than close cleanly (consuming limiter slots indefinitely).

A key thing to verify about your NLB TLS setup: When AWS NLB terminates TLS using ACM and forwards to a backend on port 443, it re-encrypts the connection to the backend. The backend listener must be configured with useTls: true. But the NLB health checks for a TLS target group also do a full TLS handshake with the backend. If the Proxy Protocol v2 header is being sent on health check connections too (which it is, when enabled on the target group), Stalwart must successfully parse that header before the TLS handshake – confirm this is working by checking for Network(ProxyError) events in the logs.
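For reference when reading packet captures, the bytes that precede the TLS Client Hello look roughly like this. The sketch builds and parses a minimal Proxy Protocol v2 header for TCP over IPv4, following the published Proxy Protocol specification; it ignores TLVs and IPv6, the function names are invented here, and it is not Stalwart's parser.

```python
import socket
import struct

# Fixed 12-byte signature that starts every Proxy Protocol v2 header.
PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"


def build_v2_header_ipv4(src_ip, src_port, dst_ip, dst_port):
    """Build a v2 PROXY header for a TCP-over-IPv4 connection (no TLVs)."""
    addresses = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HH", src_port, dst_port)
    )
    # 0x21 = version 2 + PROXY command; 0x11 = AF_INET + STREAM.
    return PP2_SIGNATURE + struct.pack("!BBH", 0x21, 0x11, len(addresses)) + addresses


def parse_v2_header(data):
    """Parse a v2 header at the start of `data`; return command, length, addresses."""
    if len(data) < 16 or data[:12] != PP2_SIGNATURE:
        raise ValueError("not a Proxy Protocol v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd >> 4 != 2:
        raise ValueError("unsupported Proxy Protocol version")
    if len(data) < 16 + length:
        raise ValueError("truncated header")
    command = {0: "LOCAL", 1: "PROXY"}.get(ver_cmd & 0x0F)
    if command is None:
        raise ValueError("unknown command")
    info = {"command": command, "header_length": 16 + length}
    if command == "PROXY" and fam_proto == 0x11:
        if length < 12:
            raise ValueError("address block too short for IPv4")
        src, dst, sport, dport = struct.unpack("!4s4sHH", data[16:28])
        info["src"] = (socket.inet_ntoa(src), sport)
        info["dst"] = (socket.inet_ntoa(dst), dport)
    return info


if __name__ == "__main__":
    # Example addresses only; the TLS record bytes stand in for a Client Hello.
    header = build_v2_header_ipv4("1.2.3.4", 50000, "10.0.1.100", 443)
    print(parse_v2_header(header + b"\x16\x03\x01"))
```

Everything after `header_length` bytes belongs to the TLS stream, which is why a listener that does not consume the header first will see the handshake as malformed.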

@helpbot It’s been over an hour and I re-enabled the target group, and immediately went to resets. I will mention the imaps target group was active this whole time, so there’s a possibility of a global connection overload as well. The only events I see in the log are the “end of stream” warnings.

@helpbot Still experimenting to find the failure point.

I removed the entire NLB and recreated everything from scratch. This time I skipped doing TLS at the NLB and am simply passing TCP 443 to remove TLS as a manipulated variable.

If I turn up the NLB with Proxy Protocol OFF, everything works great, but of course the log fills up with the “end of stream” proxy errors. That much is understandable.

I then turn on Proxy Protocol on the NLB target group, and I can work OK until I stop interacting. When I return, I receive an SSL decode error in the browser. From that point on, I cannot start ANY new session from any PC or browser: Stalwart refuses to complete the TLS handshake until I issue a systemctl restart.

Capturing with Wireshark on the node I’m interacting with (stickiness is on, so the sessions are staying put) gives me this on the initial working connection. It starts with 1.2 and upgrades to 1.3, which seems odd.

0.012139551 1.2.3.4 → 10.0.1.100 TLSv1.2 1956 Client Hello (SNI=mx.example.com)
0.012509334 10.0.1.100 → 1.2.3.4 TLSv1.3 1514 Server Hello, Change Cipher Spec
0.012562282 10.0.1.100 → 1.2.3.4 TLSv1.3 579 Application Data

Returning to the session later results in

459.840623537 1.2.3.4 → 10.0.1.100 SSL 154 Continuation Data
459.840623584 1.2.3.4 → 10.0.1.100 TLSv1.2 1956 Client Hello (SNI=mx.example.com)
459.866610707 10.0.1.100 → 1.2.3.4 TLSv1.2 61 Alert (Level: Fatal, Description: Decode Error)

I don’t know if this is an AWS or Stalwart issue. It seems wrong to me that the NLB would start with 1.2 and upgrade to 1.3, but that is not my specialty.

The Wireshark trace you’ve provided is the key diagnostic clue here. What you’re seeing – “Continuation Data” followed by a new Client Hello (TLSv1.2) and a fatal Decode Error alert – is a classic TLS session resumption / NLB connection reuse issue, not a Stalwart bug per se.

Here’s what’s happening:

The “Continuation Data” about 459 seconds into the capture is the NLB reusing an existing TCP connection (or the client sending application data on what it thinks is an already-established TLS session). Then the client sends a new Client Hello on that same stream – which Stalwart’s TLS stack sees as garbage mid-stream and responds with a fatal Decode Error alert. From that point, Stalwart’s TLS state machine for that connection is broken, and since the AWS NLB is holding and reusing backend TCP connections (connection pooling), subsequent client requests get routed to that same broken backend connection.

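The TLS 1.2 initial / TLS 1.3 upgrade you see is normal. A TLS 1.3 Client Hello still carries the TLS 1.2 version number in its legacy version fields for compatibility and advertises TLS 1.3 in the supported_versions extension, so the capture labels it TLSv1.2; Stalwart then selects TLS 1.3 in the Server Hello. That’s correct behavior.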
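As a sketch of that version selection, following the RFC 8446 rule that the supported_versions extension takes precedence over the legacy version field: the function name and the simplified inputs below are invented for illustration and do not reflect how rustls or Stalwart implement it.

```python
TLS_VERSIONS = {
    0x0301: "TLS 1.0",
    0x0302: "TLS 1.1",
    0x0303: "TLS 1.2",
    0x0304: "TLS 1.3",
}


def select_version(legacy_version, supported_versions, server_versions):
    """Pick the highest TLS version both sides support, or None if there is none.

    `legacy_version` is the version field of the Client Hello,
    `supported_versions` is the list from the extension (empty if absent),
    `server_versions` is what the server is willing to speak.
    """
    if supported_versions:
        # Extension present: it is authoritative and legacy_version is ignored.
        offered = set(supported_versions)
    else:
        # No extension: the client supports everything up to legacy_version.
        offered = {v for v in TLS_VERSIONS if v <= legacy_version}
    common = offered & set(server_versions)
    return max(common) if common else None


if __name__ == "__main__":
    server = [0x0303, 0x0304]
    # Modern client: legacy field says TLS 1.2, extension offers 1.3 and 1.2.
    print(TLS_VERSIONS[select_version(0x0303, [0x0304, 0x0303], server)])
    # Older client without the extension.
    print(TLS_VERSIONS[select_version(0x0303, [], server)])
```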

The root cause is that AWS NLB in TCP passthrough mode (no TLS termination at the NLB) with Proxy Protocol enabled uses connection multiplexing or reuse at the NLB layer. The NLB can reuse a backend TCP connection for multiple client TLS sessions. Stalwart (backed by rustls) maintains TLS state per-connection, so when the NLB reuses a backend connection for a new client TLS session, Stalwart sees the new Client Hello as malformed application data mid-stream.

This is an architectural mismatch between AWS NLB TCP pass-through mode and Stalwart’s per-connection TLS state. The recommended fix is:

  1. Use TLS termination at the NLB (using ACM) and have Stalwart listen on plain HTTP (port 80 or a plain HTTP listener) on the backend, passing Proxy Protocol v2. This is the standard pattern for NLB + backend apps – the NLB handles TLS, forwarding plain HTTP with the proxy header.
  2. Alternatively, disable connection reuse/pooling at the NLB level – but AWS NLB TCP listeners do not expose a setting to disable backend connection reuse when Proxy Protocol is active.

The docs cover the Proxy Protocol configuration but do not specifically document the NLB TCP passthrough + Proxy Protocol interaction. This is an AWS NLB architectural constraint rather than a Stalwart configuration issue.

@helpbot Fair enough, so effectively then a full TLS entirely through the connection chain won’t be possible using AWS NLB, and as a workaround we’re simply going to move the TLS handling up one level. The only other choice then would be to run HAProxy or similar directly on another EC2 node and handle the load balancing ourselves if security policy dictated a fully TLS compliant path.

Does this connection reuse also explain why I cannot open a session from a completely different PC/IP, since the NLB is just ramming it back down a pipe it thinks it still has?

And given all the above, if connection reuse is the root cause, shouldn’t the connection eventually terminate if I wait long enough (within the 350 seconds, I believe), after which I could start a new session?

Yes, the connection reuse explains why a completely different PC/IP also fails. The NLB’s backend connection pool is shared – when a new client (any client) makes a request, the NLB picks an available backend TCP connection from its pool and routes the new client’s TLS Client Hello down it. If that backend connection is in a broken TLS state (because Stalwart already sent a fatal Decode Error alert on it), every client that gets routed to that poisoned connection will also fail. With only one backend node, the NLB’s entire pool eventually fills with broken connections.

And yes – you are correct that waiting for the NLB’s idle timeout (350 seconds for TCP) should allow the broken connections to drain and expire, after which the NLB would open fresh backend connections and things would work again until the next reuse event triggers it. The fact that a systemctl restart fixes it is consistent with this: restarting Stalwart closes all backend TCP connections, forcing the NLB to open fresh ones.

To summarize your options:

  1. TLS termination at the NLB (recommended for AWS): ACM cert on the NLB listener, plain HTTP (not HTTPS) listener on Stalwart backend, Proxy Protocol v2 for client IP. This avoids the connection reuse / TLS state mismatch entirely.
  2. Self-managed proxy on EC2 (e.g. HAProxy): Full TLS passthrough with proper per-connection handling. HAProxy’s TCP mode does not pool/reuse backend connections the way NLB does, so each client TLS session gets its own backend TCP connection. The Stalwart docs have an HAProxy configuration guide that covers this pattern including Proxy Protocol.

@helpbot

Is there a section in the documentation that outlines this recommendation? I did not come across it.

I will alter my security group to allow only my single IP through, to test the theory fully.

The docs do explicitly document this. From the Reverse Proxy overview:

HTTP upstream (recommended for simplicity). The proxy terminates TLS for clients and forwards plain HTTP to Stalwart’s HTTP listener (default port 8080). This is the conventional reverse-proxy pattern and the one used by every example in the proxy guides below. No backend TLS configuration is required.

So the “TLS at the NLB, plain HTTP to the backend” pattern is the documented recommendation. There is no Stalwart-specific guidance for AWS NLB specifically, but the reverse proxy overview covers both patterns (HTTP upstream vs. HTTPS/TCP-passthrough) and calls out exactly why TCP-passthrough has constraints – the TCP-passthrough path is described as appropriate for “end-to-end TLS as a deployment requirement” or when the proxy is on an untrusted network, but it carries the per-connection TLS state complexity you’ve run into.

For your setup: configure the NLB listener on port 443 with TLS termination (ACM cert), set the target group to plain HTTP (port 8080 on your Stalwart nodes), enable Proxy Protocol v2 on the target group, and add the NLB node IPs to Stalwart’s trusted proxy networks. Stalwart’s port 8080 listener (protocol: http, useTls: false) is already present by default for exactly this pattern.

Good luck with the 350-second drain test – if connections drain and the service recovers without a restart, that confirms the connection reuse root cause conclusively.

@helpbot What would be a recommended endpoint for the NLB to check for service status? Can we create one specific for this use, rather than banging on something like /admin all day long?

Stalwart has dedicated health check endpoints built in: /healthz/live and /healthz/ready.

  • /healthz/live – always returns HTTP 200 if the server process is running (liveness).
  • /healthz/ready – returns HTTP 200 if the data store is available, or HTTP 503 if not (readiness).

These are exactly what you should point your NLB health check at. Configure the NLB target group health check as:

  • Protocol: HTTP (since you’re now terminating TLS at the NLB and forwarding plain HTTP to port 8080)
  • Path: /healthz/live (or /healthz/ready if you want the health check to reflect store availability)

These endpoints are lightweight – they don’t trigger any authentication, session state, or application logic beyond a simple status check. The Kubernetes docs confirm their purpose: /healthz/live for liveness, /healthz/ready for readiness. They are served on the standard HTTP listener.
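Before pointing the NLB at these paths, it can help to probe them by hand from a host that can reach the node. The following is a minimal sketch using only the Python standard library; the endpoint paths and status codes are the ones described above, while the function names and the three state labels are invented for this example.

```python
import urllib.error
import urllib.request


def probe(base_url, path, timeout=3.0):
    """Return the HTTP status for GET base_url + path, or None if unreachable."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as 503 from /healthz/ready) arrive as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def classify(live_status, ready_status):
    """Map the two probe results onto a coarse node state."""
    if live_status != 200:
        return "down"
    return "ready" if ready_status == 200 else "live-only"


def node_state(base_url):
    """Probe both endpoints on one node and classify the result."""
    return classify(
        probe(base_url, "/healthz/live"),
        probe(base_url, "/healthz/ready"),
    )
```

For example, `node_state("http://10.0.1.100:8080")` would report the state of a node listening on the plain HTTP port; that address is a placeholder, so substitute a real node IP and listener port.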

Docs reference: Liveness and readiness endpoints

@helpbot So a reasonable approach would be to use “live” for the main GUI to allow for emergency repair while using “ready” for the IMAP/POP3 NLB endpoints, though I do recognize that doesn’t stop JMAP from attempting to access data then.

Using /healthz/live for the admin/HTTP target group and /healthz/ready for IMAP/POP3/JMAP target groups is a reasonable split. The logic holds: live confirms the process is up (good enough to reach the admin UI for emergency repair), while ready confirms the data store is reachable before routing mail protocol traffic to a node.

You’re right that ready returning 200 doesn’t guarantee every downstream operation will succeed – it only checks that the primary data store is non-null/initialized, not that every query will complete. But it’s the best available built-in signal short of a full synthetic transaction, and it will correctly mark a node as unhealthy if the store connection is completely down.

One practical note from the source code: both /healthz/live and /healthz/ready pass through is_http_anonymous_request_allowed, which is the anonymous rate-limit check. Since these are unauthenticated requests, make sure the NLB health check source IPs are in your AllowedIPs list (Settings > Security > Allowed IPs) so the health check probes are not subject to anonymous rate limiting. This is especially relevant now that you’re switching to the HTTP upstream pattern where the NLB health check IPs will be directly visible to Stalwart (not hidden behind the proxy header, since the health check itself won’t send a Proxy Protocol header with a client IP – or will send one with the NLB node IP as the source).

Docs reference: IP address allow list · Liveness and readiness endpoints
