Very little (meaning here "zero") chance that MSI replaces components on your mobo. Best case, they will replace the mobo itself.
I may have missed some parts of this thread, but I get the feeling that you are trying to achieve two different goals at the same time:
- repairing your server without the burden of migrating or copying TBs of data
- defining a more fault-tolerant implementation
My advice would be that you split it into two different threads:
- one to relaunch services asap (this should be easily done by replacing your mobo with no change at disk level)
- one to improve design
I have a couple of comments/questions:
- did you look at the swap figures? I would be surprised if such a huge server with 16GB of memory showed any swap activity, especially if it is used to run Zentyal only. It could be different if you were running an application server with lots of Java-based sessions, but infrastructure services + Samba...
How is your swappiness parameter tuned?
- I would rather dedicate the SSD to the system than to swap, moving /log elsewhere
- I hope you realize, thanks (kind of) to this hardware issue, that:
  - LVM "alone" is useless in case of hardware failure (if anything, the impact is wider when you have multiple machines running on the same hardware)
  - RTO (recovery time objective) is sometimes different for internet & mail vs. Samba: running everything on the same server is not always a good idea
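On the swap question above, here is a minimal sketch of how the figures could be checked on a Linux box. It assumes the standard /proc/vmstat counters (pswpin, pswpout) and /proc/sys/vm/swappiness are available; the function names are mine, just for illustration. Note that the counters are cumulative pages since boot, so you need two readings some time apart to see actual activity.

```python
# Sketch only: reads Linux /proc files; helper names are illustrative.

def parse_vmstat_swap(vmstat_text):
    """Return (pswpin, pswpout) page counters found in /proc/vmstat content.

    Missing counters are reported as 0.
    """
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def read_text(path):
    """Read a small text file such as a /proc entry."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    swappiness = int(read_text("/proc/sys/vm/swappiness").strip())
    pages_in, pages_out = parse_vmstat_swap(read_text("/proc/vmstat"))
    print("vm.swappiness:", swappiness)
    print("pages swapped in since boot:", pages_in)
    print("pages swapped out since boot:", pages_out)
```

If both counters stay flat between two runs, the server is not swapping and the SSD is probably wasted on swap.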
Moving to HA aspects (sorry for this long post), I would like to clarify some concepts, or at least explain how I perceive them:
- LVM is an efficient way of providing fault tolerance if VM files are not stored on one server only.
- SAN or NAS can be used to achieve this. DRBD is another approach. There are however significant differences between these designs:
  - SAN (and DRBD) work at block level while NAS works at file level. This means that data on a NAS can be accessed from different servers at the same time, while a SAN typically allocates data to one server only. This affects the way you switch from one server to another in case of failure.
  - DRBD can have a noticeable impact on performance, depending on the amount of data to be synchronized.
  - If you decide to go for DRBD without taking the RAID6 performance hit, you will have to build another dedicated file system, meaning more disks.
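To give an idea of what "amount of data to be synchronized" means in practice, here is a rough back-of-the-envelope sketch for a full initial sync over a dedicated replication link. Everything in it is an assumption of mine: the function name, the 0.8 link efficiency, and the example figures are illustrative and not taken from this thread. It also ignores disk throughput, any sync rate limit you configure, and ongoing writes.

```python
# Rough estimate only; all parameters are illustrative assumptions.

def resync_hours(data_tb, link_gbit_per_s, efficiency=0.8):
    """Estimate hours needed to copy data_tb terabytes (decimal TB)
    over a link of link_gbit_per_s gigabits per second.

    efficiency is the assumed usable fraction of the raw link speed.
    """
    if data_tb < 0 or link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    data_bits = data_tb * 1e12 * 8
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return data_bits / usable_bits_per_s / 3600


if __name__ == "__main__":
    # Hypothetical example: 4 TB over a 1 Gbit/s link.
    print(round(resync_hours(4, 1), 1), "hours")
```

Steady-state replication is a different story, since it depends on your write rate rather than on the total volume.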
Because of the above, I would investigate something based on:
- NAS for your data (fault-tolerant disks; in case of mobo failure, replace it if you can afford this RTO)
- LVM based on either SAN or, better, DRBD for infrastructure services, so that you keep a live copy of your VM and can restart quickly on available hardware in case of failure.
Unfortunately, this has a cost.