Zentyal Forum, Linux Small Business Server

Zentyal Server => Directory and Authentication => Topic started by: AxxelH on July 28, 2021, 04:20:22 am

Title: Additional Domain Controller causing timeouts, other problems
Post by: AxxelH on July 28, 2021, 04:20:22 am
For some time (at least 9 months) I've been running Zentyal with two domain controllers in a homelab environment, one as "Domain Controller" (DC) and the other as "Additional Domain Controller" (ADC). The DCs have served a mix of Mac and Linux clients without issue. Both servers are running Zentyal 7.0 and are up to date.

Sometime in the last few weeks, I've noticed that login operations have become problematic. Examples:
- Some Mac clients will log in to a user session, but once the screensaver locks, the password is refused.
- File server operations from other Samba servers bound to the domain will sometimes hang for extended periods until some timeout expires, after which the operation completes.

During debugging I've tried shutting down each DC, with unexpected effects:

- If the main DC is on and the ADC is off everything seems to run reasonably:
  - Mac login operations succeed.
  - There are occasional delays in some file server operations, but they are rare.

My presumption is that in this state, while the DNS entries for the ADC are still present, attempts to use the ADC time out rapidly and clients switch to the main DC.
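
A minimal sketch of how that presumption could be checked, assuming 'dig' is installed on a client. The function names are made up for illustration, and 'example.lan' and the addresses in the usage note are placeholders, not my real domain or servers. It only lists the SRV records that clients commonly use to locate domain controllers and asks one specific DNS server for each of them, so the answers from the DC and the ADC can be compared.

```shell
#!/bin/sh
# Hypothetical helpers: names, domain and addresses are placeholders.

# Print the SRV record names commonly queried to locate AD domain controllers.
# $1 is the DNS domain of the directory, e.g. example.lan
dc_srv_names() {
    domain="$1"
    printf '%s\n' \
        "_ldap._tcp.${domain}" \
        "_ldap._tcp.dc._msdcs.${domain}" \
        "_kerberos._tcp.${domain}" \
        "_kerberos._udp.${domain}"
}

# Query every record against one specific DNS server.
# $1 is the DNS domain, $2 is the address of the DNS server to ask.
show_dc_srv() {
    for name in $(dc_srv_names "$1"); do
        echo "== ${name} @${2}"
        dig +short SRV "${name}" "@${2}"
    done
}
```

Running 'show_dc_srv example.lan 192.0.2.10' against the main DC and then 'show_dc_srv example.lan 192.0.2.11' against the ADC should show whether both servers still advertise both controllers. If the ADC is listed while it is powered off, clients may still try it first.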

- If the main DC is off and the ADC is on:
  - Mac logins fail, as does screensaver unlock.
  - File server operations fail, or prompt for passwords which are then rejected.
  - Direct SMB commands ('smbclient //server/netlogon -U diradmin -c "ls"') run on the ADC work, but take exceptionally long (40-60s).
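
To put numbers on those delays, a small timing wrapper along these lines could be used. This is only a sketch: 'elapsed_seconds' is a made-up helper name, it assumes a 'date' that supports '+%s', and it measures whole seconds only.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print the elapsed whole seconds.
# The command's own stdout is redirected to stderr, so prompts and results
# stay visible while stdout carries only the number of seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >&2
    status=$?
    end=$(date +%s)
    echo "$((end - start))"
    # Pass the command's exit status through to the caller.
    return "$status"
}
```

For example, 'elapsed_seconds getent hosts server' times name resolution on its own, and 'elapsed_seconds smbclient //server/netlogon -U diradmin -c "ls"' times the full SMB operation (including any time spent at the password prompt). Comparing the two, with the main DC on and then off, might show whether the 40-60s is spent in name lookups or later in the SMB session.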

This obviously means that any failover benefits I might get from the ADC aren't in effect.

I'm unsure how to debug this, as I let Zentyal set this up, and I don't really know the underlying Samba stack. What I know:

- 'samba-tool drs showrepl' shows replication is running without errors when both servers are up.

- 'samba-tool fsmo show' has all FSMO roles assigned to the DC (where I would expect them).

- There are no obvious errors in the ADC logs (but maybe I have a different expectation of "obvious").

Any suggestions? My current thinking is to just force-remove the offline ADC from Samba using something like 'samba-tool domain demote --remove-other-dead-server', but it's not clear to me whether that's safe in Zentyal or whether I'd be creating other problems.