Author Topic: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update  (Read 23881 times)

pcready.cl

  • Zen Samurai
  • ****
  • Posts: 286
  • Karma: +13/-1
  • Zentyal Installer in Chile
    • PC Ready Chile SpA
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #90 on: April 11, 2016, 09:26:17 pm »
Quote
That's my point. I would like to find a way to reproduce the issue in order to be sure that it is gone from the installed kernel.
Waiting is not the right option, imo. It doesn't give you the assurance that the kernel is bug-free.
E.g. mine ran with .51 and .56 and had been fine for days... more than a week (and then one started to crash...)

Honestly, I'm still trying to figure out how to reproduce it. It looks tied to some concurrency in Samba's calls... but I'm not sure.

I'll update you all as soon as I have more info...

L

My servers are in production, so I cannot just go and change the kernel; I will leave them on these versions and hope for the best, lol.

I will report any errors here in the forum.
Email: contacto@pcready.cl
Teléfono: (+56 32) 314 0883
Skype: pcready.cl
Web: https://www.pcready.cl

jwilliams1976

  • Zen Apprentice
  • *
  • Posts: 23
  • Karma: +1/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #91 on: April 11, 2016, 10:35:07 pm »
The 'ip route ls' command I mentioned earlier has worked for me to test whether the bug exists in a given kernel. See this post for more info:
https://forum.zentyal.org/index.php/topic,26954.msg99367.html#msg99367

Quote
It stems from a bug in the kernel that makes the ip command output the first rule infinitely.  You can use this command to see if you're affected:
ip route ls

Broken Output:
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
<repeats indefinitely - ctrl+c to quit>

In Zentyal, this causes one of the network scripts to hang because it's waiting for that command to end.  This prevents loading of other services and resulted in my network being severely broken.

Besides the previously mentioned fix of rolling back the kernel, you can modify the script in question:
/usr/share/zentyal-network/flush-fwmarks
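
A looping command like the one described above can be checked without tying up a terminal by putting a time limit on it. The following is only a minimal sketch: the helper name `finishes_within` is made up for illustration, and it assumes `timeout` from GNU coreutils is installed. Note that the "Broken Output" quoted above has the format printed by `ip rule`, so the usage comment uses `ip rule show`; that choice is an assumption, not something confirmed in this thread.

```shell
# Hypothetical helper: run a command with a time limit and report whether it
# finished on its own. Relies on timeout(1) from GNU coreutils.
finishes_within() {
    limit="$1"; shift
    # The command's output is discarded; only termination matters here.
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    # timeout(1) exits with status 124 when it had to stop the command.
    [ "$rc" -ne 124 ]
}

# Usage (assumption: `ip rule show` is the command that loops):
#   if finishes_within 5 ip rule show; then
#       echo "command terminated normally"
#   else
#       echo "command was still running after 5 seconds"
#   fi
```

A command that terminates, whether successfully or with an error, counts as "finished"; only a command that had to be stopped by the time limit is reported as looping.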

LaM

  • Zen Apprentice
  • *
  • Posts: 41
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #92 on: April 12, 2016, 01:34:56 am »
Quote
That's my point. I would like to find a way to reproduce the issue in order to be sure that it is gone from the installed kernel.
Waiting is not the right option, imo. It doesn't give you the assurance that the kernel is bug-free.
E.g. mine ran with .51 and .56 and had been fine for days... more than a week (and then one started to crash...)

Honestly, I'm still trying to figure out how to reproduce it. It looks tied to some concurrency in Samba's calls... but I'm not sure.

I'll update you all as soon as I have more info...

L

My servers are in production, so I cannot just go and change the kernel; I will leave them on these versions and hope for the best, lol.

I will report any errors here in the forum.

Mine are production servers too, of course; that's why I don't want to wait for the bug to happen during the day (production time), but would rather stress the system at night (not-so-much production time) to try to find the solution. =)

Quote
The 'ip route ls' command I mentioned earlier has worked for me to test whether the bug exists in a given kernel. See this post for more info:
https://forum.zentyal.org/index.php/topic,26954.msg99367.html#msg99367

Quote
It stems from a bug in the kernel that makes the ip command output the first rule infinitely.  You can use this command to see if you're affected:
ip route ls

Broken Output:
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
<repeats indefinitely - ctrl+c to quit>

In Zentyal, this causes one of the network scripts to hang because it's waiting for that command to end.  This prevents loading of other services and resulted in my network being severely broken.

Besides the previously mentioned fix of rolling back the kernel, you can modify the script in question:
/usr/share/zentyal-network/flush-fwmarks

I'll test tomorrow night and check the script, THX FOR THE HINT!! =)

L

Andreas Wirth

  • Zen Apprentice
  • *
  • Posts: 8
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #93 on: April 12, 2016, 05:23:42 am »
@Carlos:
Thanks for running 3.19.0-58, keeping us up to date and giving feedback... I very much appreciate that.  ;)

You wrote:
Quote
.56 Kernel which has no active users samba, only used as Firewall, perhaps why it has not failed.
I suppose you are right with that suggestion: no active users = no crash.
If I remember correctly, my last system crash was with kernel 3.19.0-56; after that I had enough of the experiments and downgraded to 3.19.0-47, as suggested here.
I am running it virtualized on KVM as well, but comparing all the examples collected so far, this bug seems to affect virtualized and dedicated systems alike.
There is apparently no difference.

@jwilliams1976:
May I ask which Zentyal version you are running, and could you post the kernel version that is affected by the 'ip' bug?
(Just to make sure that the kernel version is not mistaken...)

For example:
hotsummer55 tested your command example with: linux-image-3.19.0-49-generic
Carlos (alias pcready.cl) tested your command example with: linux-image-3.19.0-56-generic
And their results were fine, so none of this:
Quote
Broken Output:
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
0:   from all lookup local
<repeats indefinitely - ctrl+c to quit>
But according to this forum thread, both kernel versions, 3.19.0-49 and 3.19.0-56, are confirmed to be affected by the CPU soft lockup bug, which is logged to syslog in connection with the smbd process.
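
To see which processes the kernel names in its soft lockup messages, the kernel log can be summarized. This is only a rough sketch: the helper name `lockups_by_process` is invented here, and it assumes the usual watchdog message shape, in which the line contains "BUG: soft lockup" and ends with the process name and PID in square brackets.

```shell
# Hypothetical helper: read kernel log text on stdin and print, per process
# name, how many soft lockup messages mention that process.
# Assumed message shape (illustrative):
#   ... BUG: soft lockup - CPU#1 stuck for 22s! [smbd:2345]
lockups_by_process() {
    grep 'BUG: soft lockup' \
        | sed -n 's/.*\[\(.*\):[0-9][0-9]*\][[:space:]]*$/\1/p' \
        | sort | uniq -c
}

# Usage:
#   dmesg | lockups_by_process
#   lockups_by_process < /var/log/syslog
```

If the lockups really are tied to Samba, smbd should dominate the counts; other process names would point in a different direction.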

E.g.:
I installed Zentyal 4.2 directly in December 2015, starting from kernel version 3.19.0-42 if I remember correctly, so I never upgraded from Zentyal 4.1 and its 3.16.0-xx kernel.
And in the Zentyal forum thread you posted yesterday ( https://forum.zentyal.org/index.php/topic,26954.15.html alias "high CPU usage + huge issues with network services"), users reported that upgrading to Zentyal 4.2 and running a kernel version > 3.19.0-37-generic fixed their 'ip' problem.
E.g. 3.19.0-39 did it.
Alternatively, as a solution in Zentyal 4.1, they purged (not just removed) the kernel and downgraded to a version < 3.16.0-52.
E.g. 3.16.0-51 is confirmed in that forum thread to work properly without the 'ip' issue.

And that again matches exactly your previously posted Ubuntu Launchpad bug report @ https://bugs.launchpad.net/ubuntu/+source/linux-lts-utopic/+bug/1514785
But it is therefore mostly valid for Zentyal version 4.1, and it was reported there for the following kernels:
Quote
PBR NOT working on:
3.13.0-69
3.13.0-70
3.16.0-52 <used by Zentyal 4.1>
3.16.0-53 <used by Zentyal 4.1>
3.19.0-37 <probably initially used by Zentyal 4.2>
Yes, there was obviously a problem with ip in the previous Zentyal/Ubuntu release.
The symptoms are often similar, but could it be that you are mixing up kernels 3.16.0-xx and 3.19.0-xx here?
(No offence, please. Anything that might help to identify the kernel versions affected by our problem is welcome.)

@LaM:
Yes, I'd love to have a test as well to quickly confirm whether a kernel is affected by this bug, rather than waiting until the system crashes.
It might be especially relevant for the future, because the other examples, e.g. in UCS (seems to be fixed) and in Fedora, seem to confirm that the newer 4.x kernels, e.g. 4.1.16-xx and 4.4.3-xx, are affected by this bug as well.

Cheers,
Andreas
« Last Edit: April 12, 2016, 07:44:44 am by Andreas Wirth »

BerT666

  • Zen Warrior
  • ***
  • Posts: 228
  • Karma: +17/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #94 on: April 12, 2016, 04:22:54 pm »
Howdy,

maybe a huge data transfer (~500GB to 1TB) will force this to occur.

In my setting I only get problems when it comes to transfers like this...

Regards

Thomas

LaM

  • Zen Apprentice
  • *
  • Posts: 41
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #95 on: April 12, 2016, 04:49:33 pm »
In our case we only have transfers which are less than 1 GB...   =(

I rather think it occurs when there are many concurrent calls... but I'm still not able to produce a test.

@BerT666, did You reproduce the issue?

L

jwilliams1976

  • Zen Apprentice
  • *
  • Posts: 23
  • Karma: +1/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #96 on: April 12, 2016, 05:54:31 pm »
@Andreas
I am currently using 3.19.0-49 on my server, and it has run without issue for 9 days. I ran into the bug when I upgraded the kernel to the 3.19.0-53 version. At the time I was having issues, I had found the 'ip route ls' command and tested it on these two versions: the -53 version kept printing the line over and over until CTRL-C, while the -49 exited on its own. That is pretty much all the testing I did. It's not clear if Carlos or anyone else has ever tried to run the 'ip route ls' command on the -56 or -58 kernels. It may not be an indicator at all, or it may be unrelated. Hopefully someone finds a reliable way to test it. I'm going to work on a test bed today, as I can't mess with my production server during the day.

Andreas Wirth

  • Zen Apprentice
  • *
  • Posts: 8
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #97 on: April 13, 2016, 08:02:57 am »
@LaM and @BerT666:
I don't think that can or should be the test to find out whether a kernel version is affected by this bug.  :-\
- a data transfer of 500 GB up to 1 TB...
- or instructing all users to put as much load on it as possible at the same time...
From an IT point of view that is destructive for one's own reputation: "Oh yeah, please help me to crash the server".

@Carlos:
You reached my magic uptime mark today, with more than 6 days of uptime.
I'll probably give 3.19.0-58 a try once you've got an uptime of more than 14 days without any issues...  ;)
Sorry that you are the guinea pig (in the laboratory sense).

@jwilliams1976:
hotsummer55 wrote in answer to your post @ April 09, 2016, 11:24:04 am that he/she tested 'ip route ls' with:
Quote
Not sure about this. I tested this against known bad kernel linux-image-3.19.0-49-generic, and it did not produce any problems when running ip rule show.
What kernel are you running now?

and Carlos (alias pcready.cl) tested it and wrote in answer to your post @ April 09, 2016, 04:47:11 pm:
Quote
I used the command on the kernel linux-image-3.19.0-56-generic and nothing happened, and that is an affected version according to the forums...  ???

And if my interpretation is right, then both kernels are affected by the bug of this forum thread, the 'CPU soft lockup', but, going by their replies, not by the 'ip' bug you posted. Sorry that I can't test it, but the Ubuntu Launchpad bug report and the Zentyal forum thread about the 'ip' bug that you posted match perfectly and lead to the same conclusion.

So the 'ip' bug occurs in these relevant kernels:

3.16.0-52, 3.16.0-53 (Zentyal 4.1)
and
3.19.0-37 (probably Zentyal 4.2)

and has been fixed in:
>= 3.16.0-55.74 (Zentyal 4.1)
and
>= 3.19.0-39.44 (Zentyal 4.2)
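
The thresholds listed above can be turned into a quick check of a package version string. This is only a sketch: the helper names `version_ge` and `ip_bug_fixed_in_319` are invented for illustration, GNU `sort -V` is used as an approximation of Debian version ordering, and the distribution suffix (such as `~14.04.1`) is stripped before comparing because a tilde sorts before the bare version.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# GNU sort -V only approximates Debian version ordering.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Succeed when a 3.19.0-xx package version is at or past the build listed
# above as containing the fix for the 'ip' bug (3.19.0-39.44).
ip_bug_fixed_in_319() {
    # Drop a distribution suffix such as "~14.04.1" before comparing.
    version_ge "${1%%"~"*}" "3.19.0-39.44"
}

# Usage:
#   ip_bug_fixed_in_319 "3.19.0-58.64~14.04.1" && echo "at or past the fixed build"
```

On a Debian or Ubuntu system, `dpkg --compare-versions` is the authoritative way to compare package versions; `sort -V` is used here only to keep the sketch free of dpkg.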

Don't get me wrong: a recurrence of this 'ip' bug is definitely possible, and so far only you have tested 3.19.0-53 for that 'ip' bug.
But when I simply type this into my command line:
Code: [Select]
██ root@dcrc-dcx1:/var/log
██ 13:31:03 ᛤ  dpkg --list | grep linux-image
rc  linux-image-3.19.0-25-generic         3.19.0-25.26~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-3.19.0-39-generic         3.19.0-39.44~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-3.19.0-41-generic         3.19.0-41.46~14.04.2                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-3.19.0-42-generic         3.19.0-42.48~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-3.19.0-43-generic         3.19.0-43.49~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-3.19.0-47-generic         3.19.0-47.53~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-3.19.0-49-generic         3.19.0-49.55~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-3.19.0-51-generic         3.19.0-51.58~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-3.19.0-56-generic         3.19.0-56.62~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-3.19.0-58-generic         3.19.0-58.64~14.04.1                amd64        Linux kernel image for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-extra-3.19.0-25-generic   3.19.0-25.26~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-extra-3.19.0-39-generic   3.19.0-39.44~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-extra-3.19.0-41-generic   3.19.0-41.46~14.04.2                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-extra-3.19.0-42-generic   3.19.0-42.48~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
rc  linux-image-extra-3.19.0-43-generic   3.19.0-43.49~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-extra-3.19.0-47-generic   3.19.0-47.53~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-extra-3.19.0-49-generic   3.19.0-49.55~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-extra-3.19.0-51-generic   3.19.0-51.58~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-extra-3.19.0-56-generic   3.19.0-56.62~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-extra-3.19.0-58-generic   3.19.0-58.64~14.04.1                amd64        Linux kernel extra modules for version 3.19.0 on 64 bit x86 SMP
ii  linux-image-generic-lts-vivid         3.19.0.58.41                        amd64        Generic Linux kernel image

██ root@dcrc-dcx1:/var/log
██ 13:31:11 ᛤ 

Then I don't even get offered a linux-image-3.19.0-53-generic. There's only a linux-image-3.19.0-51-generic and a linux-image-3.19.0-56-generic (I have both installed on my system, as you can see). You probably mean 3.19.0-51. I just want to make sure that we are all on the same page and not misled by a kernel version mix-up.
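
The listing above mixes fully installed packages (state "ii") with removed ones that only left configuration files behind (state "rc"). A small filter makes the installed kernel images easier to read off. This is a minimal sketch; the helper name `installed_kernel_images` is made up for illustration.

```shell
# Hypothetical helper: read `dpkg --list` output on stdin and print only the
# versioned kernel image packages that are fully installed (state "ii").
# The "-extra" and meta packages are skipped because the name must continue
# with a digit right after "linux-image-".
installed_kernel_images() {
    awk '$1 == "ii" && $2 ~ /^linux-image-[0-9]/ { print $2 }'
}

# Usage:
#   dpkg --list | installed_kernel_images
```

Whether a particular build such as linux-image-3.19.0-53-generic is offered by the configured repositories at all can be checked with `apt-cache policy linux-image-3.19.0-53-generic`.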

So now -56 and -49 have been tested for the 'ip' bug by hotsummer55 and Carlos, and they didn't find any issues on their systems.

But here, for once, are all versions affected by the 'CPU soft lockup' bug that we are dealing with in this forum thread, as far as I know and quickly gathered:

linux-image-3.19.0-49 (e.g. https://tracker.zentyal.org/issues/4977 , in this forum thread and by my own experience )
linux-image-3.19.0-51 (in this forum thread and by my own experience)
linux-image-3.19.0-56 (in this forum thread and by my own experience)
and Carlos is currently 'long-term' ;) testing linux-image-3.19.0-58, indirectly finding out for us whether it is safe to switch back to the main kernel upstream.

I don't know the background of your Zentyal system, i.e. whether you upgraded from Zentyal 4.1 with kernel 3.16.0-53 or ran it with 3.19.0-37 (which, by the way, is also not available among my installable kernels); but if so, then yes, it was definitely affected by the 'ip' bug.
So to quote you:
Quote
[..] but on any versions with the bug it just loops and never exits. Zentyal must use this command somewhere and after a while it eats up all CPU and memory resources and results in the CPU soft hang.

Quick way to test it instead of waiting a week for Zentyal to crap out.
and:
Quote
The 'ip route ls' command I mentioned earlier has worked for me to test that the bug exists or does not in a given kernel.

and to quote me:
Quote
Na' sorry, I don't think, that your mentioned bug https://bugs.launchpad.net/ubuntu/+source/linux-lts-utopic/+bug/1514785 has got anything to do with it.

So for me everything seems to speak against the theory that the 'CPU soft lockup' bug we all have issues with has anything to do with the 'ip' bug you experienced on your system(s). I'd just like to avoid going down the wrong path. That's all.

By the way, with 3.19.0-49 you are definitely running a kernel version affected by the bug. But 9 days is great; I never made it that long.

But I would love:
  • to have verified that the bug is gone
  • and to find a quick test to verify whether a system is affected by this bug or not

So at the moment the only plausible test procedure seems to be putting as much concurrent user activity on the system as possible, which is far from optimal.
And I am very sorry that I can't test it myself.  :-[
But if I risk any further inconvenience on that system, my head will probably get chopped off.

I mean, I have had 2x Zarafa (not to be confused with Zentyal) on CentOS servers running virtualized on KVM since the middle of 2011, ticking like Swiss clockwork, and never experienced any issues: no update issues, works like a charm. The only trouble came from the cat-and-mouse game caused by MS Office updates, which led to problems with the compatibility of the MS Outlook connector, but that was always fixed immediately by Zarafa with a client-connector update. And that is purely client-side and has nothing to do with the server itself.
Since Zarafa announced that it will no longer officially support MS Outlook as an email/groupware client, Zentyal seems to be the only alternative for setting up something besides the cost-intensive MS license trap.
I tested SOGo 3 ( http://sogo.nu/ ) as well at the beginning of the year... email was fine, but the groupware functionality was broken with MS Outlook, so I wouldn't really recommend using it in production.

For 2 months now it has been a nightmare with the Zentyal system: system checks every morning to evaluate whether the traffic still flows.
Zentyal is on a good path, that's for sure... But this bug is really nerve-racking.

It would be great if Zentyal gave away a free commercial test lab license to play around with for such purposes to those who are subscribed and maintaining a production system.
« Last Edit: April 13, 2016, 09:06:30 am by Andreas Wirth »

Andreas Wirth

  • Zen Apprentice
  • *
  • Posts: 8
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #98 on: April 14, 2016, 08:04:20 am »
Hi to Everybody,

so I did a bit more research into that bug and want to keep the forum up to date.

The bug was initially reported in the official Ubuntu 14.04 LTS kernel in version 3.13.0-77.
It was then officially fixed on 05 April 2016 in this build of the kernel: 3.13.0-85.129.
( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1543980 )

Other versions of Ubuntu's 3.x kernel line reported so far as containing this bug:
Quote
v3.14: 9d054f57adc981a5f503d5eb9b259aa450b90dc5
v3.12: 9964b4c4ee925b2910723e509abd7241cff1ef84
v3.10: da8db0830a2ce63f628150307a01a315f5081202
ckt/linux-3.13.y: 6505b15f7f7efde1853b5a7641e9ce675c2b1a96
v3.4: -
v3.2: a3b0f6e8a21ef02f69a15abac440572d8cde8c2a
I don't know whether these are fixed, but that is irrelevant for us.

The 4.x version of the kernel was reported to contain the bug as well, mainly by the maintainers of the Univention Corporate Server:
There, the upstream kernel 4.1.16 was affected by the bug.
The bug was then temporarily fixed by reverting 2 previous kernel commits, as early as 12 Feb. 2016, in this build of their kernel, 4.1.6-1.174.201602110938, to keep their customers up and running. (Univention has its own in-house Linux kernel developers, as shown here: http://errata.software-univention.de/ucs/4.1/114.html )

If you follow the developers' communication @ https://www.mail-archive.com/kernel-packages@lists.launchpad.net/msg159625.html
you will see that this temporary patch for the bug was provided by Philipp Hahn (working for Univention):
Quote
Reverting the patch "unix: avoid use-after-free in ep_remove_wait_queue"
in 4.1 fixes my problem (for now). The original patch went into 4.4, but
was back-ported to several stable trees:


v3.2: a3b0f6e8a21ef02f69a15abac440572d8cde8c2a
v3.18: 72032798034d921ed565e3bf8dfdc3098f6473e2
v4.1: 5c77e26862ce604edea05b3442ed765e9756fe0f
v4.2: bad967fdd8ecbdd171f5f243657be033d2d081a7
v4.3: 58a6a46a036ce81a2a8ecaa6fc1537c894349e3f
v4.4: 7d267278a9ece963d77eefec61630223fce08c6c

After this temporary patch by Philipp Hahn had been integrated to keep their customers' systems fully operational, the true official bug fix is mentioned for the first time on 23 Feb 2016:
Quote
Rainer Weikusat sent a patch named
 [PATCH net] af_unix: Guard against other == sk in unix_dgram_sendmsg
 < https://patchwork.ozlabs.org/patch/582017/ >
which fixes the problem.


For our distribution we released chose to revert the original patch as
we needed a working kernel as fast as possible, as several of our
customers were hit by that bug.

I tested the patch from Rainer and it also made the bug disappear.
David Miller also picked the patch for stable and we will do the same
when next be build a new kernel for our release.

Philipp


Rainer Weikusat's official fix for the bug was committed by David S. Miller to the newest mainline Linux kernel development version, 4.5, maintained by Linus Torvalds, on 16 Feb. 2016:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a5527dda344fff0514b7989ef7a755729769daa1

For the most recent 4.4.x stable kernel version it was announced to be in review:
Quote
For 4.4 it is in review right now for 4.4.4 as announced by greg k-h
yesterday: < https://lkml.org/lkml/2016/3/1/828 >
I didn't find any information on whether this fix has found its way into the 4.4.4 kernel version, but if it is included and Fedora 23 updates from kernel 4.4.3 to 4.4.4, that will make this Cubietruck user very happy again: http://www.cubieforums.com/index.php?topic=4076.0

As I mentioned at the beginning of this post, Ubuntu officially fixed the bug with LTS kernel build 3.13.0-85.129 on 05 April 2016, but it had already been reported to them on 10 Feb. 2016 by Karolin Seeger (probably also working for Univention), and in the kernel communication it says:
Quote
Thanks Philipp!

I just hope to trigger some reaction from the ubuntu maintainers
in order get a usable kernel more than two week after breaking it.
So it basically took the Ubuntu maintainers quite a while (almost 2 months) to officially fix it in their 14.04 LTS kernel, version 3.13.0-xx.

So now the big question: Where are we with Zentyal at the moment?

Zentyal 4.2 is not using the Ubuntu 14.04 LTS kernel 3.13.0-xx alias Trusty.
For some reason they chose to use the 3.19.0-xx kernel from the Vivid release alias Ubuntu 15.04.

And I couldn't find any hints or clues that this bug has been reported against the 3.19.0-xx Vivid kernel tree as well.
But we all experience the effect of this bug: our systems are crashing!!!
And no recognizable action has been taken yet on the reported bug: https://tracker.zentyal.org/issues/4977 (it has not been assigned, confirmed, marked as duplicate, rejected or anything else)

So I would now realistically assume that kernel version 3.19.0-58, which we received last week, won't contain any bug fix. (Please prove me wrong!!!)
What I love about open source is that it is truly open and transparent.
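
One way to look for evidence either way is to search a kernel package's changelog for the subject line of Rainer Weikusat's patch quoted above. This is only a sketch under assumptions: the helper name `changelog_mentions_fix` is invented, and it assumes that the Ubuntu kernel changelog lists the subjects of the patches it includes and that a back-port keeps the upstream subject line.

```shell
# Hypothetical helper: read a package changelog on stdin and succeed if it
# mentions the subject line of the af_unix fix quoted in this thread.
changelog_mentions_fix() {
    grep -q -F 'af_unix: Guard against other == sk in unix_dgram_sendmsg'
}

# Usage (needs network access to fetch the changelog):
#   if apt-get changelog linux-image-3.19.0-58-generic | changelog_mentions_fix; then
#       echo "changelog lists the fix"
#   else
#       echo "fix not found in changelog"
#   fi
```

A missing entry would not prove that the kernel is still affected, since the fix could have arrived under a different subject or by reverting the original patch instead.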

So basically, if someone from the Zentyal support or development team has a look at this forum thread: please help us, and give us a statement on the matter or a confirmation that you are taking it on.
We can't patch the kernel ourselves; that's a bit too deep.

The bug can be reproduced and confirmed, but apparently only by a developer on an affected system (to be honest, I don't know how to do it):
Quote
It's easily reproducible by running the following commands in the Samba master branch:
./configure.developer
TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite"

If I am right (and I really want to be wrong on this matter):
The only way for us in Zentyal 4.2 right now would be to install a kernel from another version tree that contains the fix, but I don't want to risk any incompatibilities with other installed Zentyal components that might come along with it.

Cheers,
Andreas
« Last Edit: April 14, 2016, 09:39:37 am by Andreas Wirth »

pcready.cl

  • Zen Samurai
  • ****
  • Posts: 286
  • Karma: +13/-1
  • Zentyal Installer in Chile
    • PC Ready Chile SpA
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #99 on: April 14, 2016, 05:19:59 pm »
SERVER 15 USERS SAMBA - ZENTYAL 4.2
Code: [Select]
root@servet:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML150 G6
        Version: 1.0
        Serial Number: MXS108003W
        UUID: 745FC10B-XXXX-DF11-XXXX-C192EAA48B93
        Wake-up Type: Power Switch
        SKU Number: 466132-001
        Family: ProLiant Server

root@servet:~# uname -a
Linux servet 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

root@servet:~# uptime
 12:18:20 up 6 days, 20:01,  1 user,  load average: 0,18, 0,25, 0,21

FIREWALL NON SAMBA USERS - ZENTYAL 4.2
Code: [Select]
root@servpcr-fw:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML110 G5
        Version:      NA
        Serial Number: MX2014011G
        UUID: 44F48208-XXXX-5606-XXXX-560649F92209
        Wake-up Type: Power Switch
        SKU Number: AT040A
        Family: 1234567890

root@servpcr-fw:~# uname -a
Linux servpcr-fw 3.19.0-56-generic #62~14.04.1-Ubuntu SMP Fri Mar 11 11:03:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

root@servpcr-fw:~# uptime
 12:19:19 up 8 days, 12:18,  1 user,  load average: 0,29, 0,36, 1,38

LaM

  • Zen Apprentice
  • *
  • Posts: 41
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #100 on: April 18, 2016, 07:19:36 am »
Hi everybody!

Sorry, I was traveling for work and have been away for almost a week.

Before quoting and reporting "news" let me show some new stats (I will call the 3 servers A, B and C):

Server|Kernel|Uptime|load average|Samba load
A|3.19.0-51-generic|06:14:14 up 6 days, 19:52|0.00, 0.01, 0.05|High
B|3.19.0-51-generic|06:14:16 up 5 days, 17:28|0.04, 0.07, 0.12|mid/low
C|3.19.0-56-generic|06:14:19 up 21 days,  8:10|0.09, 0.11, 0.10|mid/high

The 3 servers have exactly the same hardware and differ only in the installed kernel.

Ok. Now:

@Andreas
Quote
@LaM and @BerT666:
I don't think that can or should be the test to find out whether a kernel version is affected by this bug.  :-\
- a data transfer of 500 GB up to 1 TB...
- or instructing all users to put as much load on it as possible at the same time...
From an IT point of view that is destructive for one's own reputation: "Oh yeah, please help me to crash the server".

I wouldn't ask anybody to help stress test my systems; of course it isn't a clever idea even to think about something like that, and it never crossed my mind (nor did I come close to writing something like that here).
Being a system administrator with full control over the server and the clients in the network, I can use some scripts/jobs to stress test the system at night (i.e. outside production time).
Speaking of reputation, I think the real hit to our reputation would be sitting here waiting for the system to collapse (which luckily isn't what we're doing).

There's a slight difference between waiting for the issue to appear and reacting before it appears, and as you wrote later in your post:

Quote
I would love to have verified:
  • that the bug is gone
  • and finding a quick test to verify, if a system is affected by this bug or not

and of course I'm with you when you say

Quote
Where are we with Zentyal at the moment?
but even better:
Where are the Zentyal maintainers? Do they read this forum?!

Anyway

(Disclaimer: this is just a hint.) I've found out that the RecycleBin folders of the Samba shares (and some 'regular' folders are affected too) are being filled with .tmp files.
I think this started to occur after the first buggy kernel update. My bad that I ignored the situation before, thinking it was related to something else (and limited to just one directory), but after a quick search I discovered that my shares' RecycleBin folders are quite full. I have also been told that these files now show up in regular folders too and are sometimes deleted manually (sigh). Could it be related?
It looks like creating a tmp file is a normal procedure for Samba (in order to preserve files during write operations), but tmp files not always being deleted correctly after normal operations (they shouldn't be in RecycleBin) is quite strange.
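
To track whether the number of leftover .tmp files keeps growing, they can be counted per share directory. This is a minimal sketch: the helper name `count_tmp_files` is invented, the path in the usage comment is only an assumed location for the shares, and `-iname` (a case-insensitive match supported by GNU and BSD find) is used so that `.TMP` files are counted too.

```shell
# Hypothetical helper: print the number of regular files whose name ends in
# .tmp (case-insensitive) anywhere below the directory given as $1.
# File names containing newlines would be counted more than once.
count_tmp_files() {
    find "$1" -type f -iname '*.tmp' | wc -l | tr -d ' '
}

# Usage (the path is an assumption; use the real location of the shares):
#   count_tmp_files /home/samba/shares
```

Running this once per night and comparing the numbers would show whether the files accumulate steadily or only around the times the lockups occur.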

@Andreas
I'll dig into the temporary patch by Philipp and in this direction:
Quote
The bug can be reproduced and confirmed, but obviously only as developer on an affected system (To be honest I don't know how to do it):
Quote
It's easily reproducible by running the following commands in the Samba master branch:
./configure.developer
TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite"

Good call!

L
« Last Edit: April 18, 2016, 07:24:37 am by LaM »

Andreas Wirth

  • Zen Apprentice
  • *
  • Posts: 8
  • Karma: +0/-0
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #101 on: April 18, 2016, 10:39:38 am »
Hi LaM,

many thanks for your detailed post.
But I don't think you need to dig any further in Philipp's direction. The "lockup - CPU" bug in combination with Samba has already been officially fixed by Rainer Weikusat in the most recent official Linux kernel version (v4.5), and the fix will be back-ported in the kernel source to other kernel versions.

Just to put what really happened in order and summarize it a bit more, as far as I understood the whole communication in the mail archive correctly:
This bug was originally brought to us by 2 patches committed to the source code of Linux kernel version 4.4.
These 2 patches, which are responsible for our trouble, were then back-ported to these known kernel versions:
1.) mentioned in the communication of the Univention maintainers (-> their kernel obviously relies on the Ubuntu kernel as well):
Quote
v3.2: a3b0f6e8a21ef02f69a15abac440572d8cde8c2a
v3.18: 72032798034d921ed565e3bf8dfdc3098f6473e2
v4.1: 5c77e26862ce604edea05b3442ed765e9756fe0f
v4.2: bad967fdd8ecbdd171f5f243657be033d2d081a7
v4.3: 58a6a46a036ce81a2a8ecaa6fc1537c894349e3f
2.) but then also identified by and reported to the Ubuntu maintainers in Feb 2016:
Quote
v3.14: 9d054f57adc981a5f503d5eb9b259aa450b90dc5
v3.12: 9964b4c4ee925b2910723e509abd7241cff1ef84
v3.10: da8db0830a2ce63f628150307a01a315f5081202
ckt/linux-3.13.y: 6505b15f7f7efde1853b5a7641e9ce675c2b1a96
v3.4: -
v3.2: a3b0f6e8a21ef02f69a15abac440572d8cde8c2a
v3.19 is not mentioned or contained!!!

But in Zentyal 4.2 we have also been experiencing this bug since the beginning of Feb. in the Ubuntu Vivid kernel v3.19 that it uses:
  • affected builds seen so far: 3.19.0-49, 3.19.0-51, 3.19.0-56 (reported in this forum thread and also in the Zentyal bug tracker, at least for 3.19.0-49)
  • Whether it is now fixed in 3.19.0-58 is still a big '?'
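
To check quickly whether a machine is running one of the builds reported as affected above, the running release string can be compared against that list. This is only a minimal sketch: it assumes the usual Ubuntu release format 3.19.0-<build>-generic, the function name is_reported_affected is made up for illustration, and a build that is missing from the list is not thereby proven safe.

```shell
#!/bin/sh
# Builds of the Vivid 3.19.0 kernel reported as affected in this thread.
REPORTED_AFFECTED="49 51 56"

# Hypothetical helper: classify a release string such as the output of
# `uname -r`. Prints "reported-affected", "not-in-list" or "other-series".
is_reported_affected() {
    release="$1"
    case "$release" in
        3.19.0-*) ;;
        *) echo "other-series"; return 0 ;;
    esac
    # Strip the "3.19.0-" prefix, then everything after the build number.
    build="${release#3.19.0-}"
    build="${build%%-*}"
    for b in $REPORTED_AFFECTED; do
        if [ "$build" = "$b" ]; then
            echo "reported-affected"
            return 0
        fi
    done
    echo "not-in-list"
}
```

On a live system it would be called as: is_reported_affected "$(uname -r)"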

Philipp Hahn (working for Univention) then reverted these 2 official kernel patches (apparently back-ported from kernel 4.4 by the Ubuntu maintainers), which affected their 4.1.16 kernel. At Univention they lovingly call this bug the 'samba deadlock'. So basically he simply reverted these 2 patches coming from the official (Ubuntu) Linux kernel, which fixed the 'samba deadlock' bug in their Univention kernel, but this has to be seen as temporary only. They built kernel 4.1.6-1.174.201602110938 from the source code with Philipp's 2 reverts on 12 Feb. and delivered it immediately via updates to all their customers using Univention Corporate Server, as their Linux distribution is called.

They then got in touch with the Linux kernel developers (around Linus Torvalds) and the Ubuntu maintainers and reported the kernel bug to them.
On 23 Feb. Philipp found out that the bug had already been fixed by an official patch provided by Rainer Weikusat. Philipp tested this patch in their kernel and it fixed the 'samba deadlock' as well. This official patch had also already gone into the v4.5 Linux kernel (committed to Torvalds' kernel branch in the Git repo on 16 Feb.) and from there was apparently back-ported to other kernel versions, e.g. v4.4.
After that they were apparently just waiting for the Ubuntu maintainers to release their fixed kernel builds, which happened on 5 April, e.g. for Ubuntu 14.04 LTS (Trusty Tahr) with kernel version 3.13.0-85.

Now, as I mentioned, Zentyal 4.2 does not use this v3.13 kernel from Ubuntu 14.04 LTS (Trusty Tahr). It uses v3.19 from the Ubuntu 15.04 release (Vivid Vervet).
And as you can see at https://wiki.ubuntu.com/Releases, Ubuntu 15.04 with its kernel v3.19 has been at end of life since 4 Feb. 2016.
So basically this could mean that the Ubuntu maintainers don't even need to fix the 'samba deadlock' bug in kernel v3.19 anymore.

I have hope, because on 6 April we received a kernel update to version 3.19.0-58, but it is not yet clear whether it contains the 'samba deadlock' fix.
Especially because the official Ubuntu bug report ( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1543980 ) never mentions the v3.19 kernel from Vivid.

It is definitely awesome and admirably dedicated of you to want to write a script that tests for the bug.
But right now we don't even know what exact combination of circumstances breaks the kernel on our live systems.
As for the commands you mentioned:
Quote
It's easily reproducible by running the following commands in the Samba master branch:
./configure.developer
TDB_NO_FSYNC=1 make -j test FAIL_IMMEDIATELY=1 SOCKET_WRAPPER_KEEP_PCAP=1 TESTS="samba3.raw.composite"
To me, this doesn't seem like something you can bring to and run on a fully configured production system.
I think it's more something you test on a system with the relevant kernel and your own Samba build.
And producing many simultaneous Samba accesses from different sessions with a script is just a theory right now.
 
However thank you for posting your uptimes here.
(also to Carlos alias pcready.cl)

Update:
I compared the changelogs of both Ubuntu kernels.
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_3.13.0-85.129/changelog
That one is for kernel v3.13 (Ubuntu 14.04 LTS), which definitely contains the fix assigned to Launchpad ID 1543980 => https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1543980
The changelog contains:
Quote
  * af_unix: Guard against other == sk in unix_dgram_sendmsg
    - LP: #1543980, #1557191

And the one for kernel v3.19 (the kernel used in Zentyal 4.2), build 3.19.0-58.64~14.04.1:
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux-lts-vivid/linux-lts-vivid_3.19.0-58.64~14.04.1/changelog
contains it as well:
Quote
  * af_unix: Guard against other == sk in unix_dgram_sendmsg
    - LP: #1556297
So apparently the Ubuntu maintainers have back-ported the fix to kernel v3.19.

So kernel version 3.19.0-58 should fix the 'samba deadlock' a.k.a. 'soft lockup - CPU #1' bug and should be safe to use!!! :)  + ;D + 8)
(3.19.0-56 is not, because the fix was only integrated in Ubuntu's internal build 3.19.0-57, probably a test build)
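
The same changelog comparison can be repeated for any other build. Below is a rough sketch, assuming the changelog text has already been fetched (for example with curl or wget from the URLs above) and is fed on standard input. It relies on the standard Debian changelog layout, where each entry starts with a "package (version) ..." header line. The function name versions_listing_fix is invented here, and a match only shows that the patch title is listed in an entry, not that the bug is really gone.

```shell
#!/bin/sh
# Title of the fix as it appears in both Ubuntu changelogs quoted above.
FIX_TITLE='af_unix: Guard against other == sk in unix_dgram_sendmsg'

# Hypothetical helper: read a Debian-style changelog on stdin and print the
# version of every entry whose body lists the fix title.
versions_listing_fix() {
    awk -v title="$FIX_TITLE" '
        # Entry header lines start in column 0: "package (version) dist; ..."
        /^[^ \t]/ && NF >= 2 { ver = $2; gsub(/[()]/, "", ver) }
        # Change lines are indented; report the enclosing entry on a match.
        index($0, title) > 0 { print ver }
    '
}
```

Usage would look like: curl -s <changelog URL> | versions_listing_fix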

Cheers,
Andreas
« Last Edit: April 18, 2016, 11:50:47 am by Andreas Wirth »

LaM

  • Zen Apprentice
  • *
  • Posts: 41
  • Karma: +0/-0
    • View Profile
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #102 on: April 18, 2016, 03:54:04 pm »
OK, I'll take that as gold; I'll move B to .58 tonight (if nothing happens in the meantime).

Thanks Andreas for everything so far!

I'll keep you up to date.

L
« Last Edit: April 18, 2016, 04:04:37 pm by LaM »

pcready.cl

  • Zen Samurai
  • ****
  • Posts: 286
  • Karma: +13/-1
  • Zentyal Installer in Chile
    • View Profile
    • PC Ready Chile SpA
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #103 on: April 19, 2016, 04:12:37 am »
SERVER 15 USERS SAMBA - ZENTYAL 4.2
Code:
root@servet:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML150 G6
        Version: 1.0
        Serial Number: MXS108003W
        UUID: 745FC10B-XXXX-DF11-XXXX-C192EAA48B93
        Wake-up Type: Power Switch
        SKU Number: 466132-001
        Family: ProLiant Server

root@servet:~# uname -a
Linux servet 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

root@servet:~# uptime
 12:18:20 up 6 days, 20:01,  1 user,  load average: 0,18, 0,25, 0,21

FIREWALL NON SAMBA USERS - ZENTYAL 4.2
Code:
root@servpcr-fw:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML110 G5
        Version:      NA
        Serial Number: MX2014011G
        UUID: 44F48208-XXXX-5606-XXXX-560649F92209
        Wake-up Type: Power Switch
        SKU Number: AT040A
        Family: 1234567890

root@servpcr-fw:~# uname -a
Linux servpcr-fw 3.19.0-56-generic #62~14.04.1-Ubuntu SMP Fri Mar 11 11:03:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

root@servpcr-fw:~# uptime
 12:19:19 up 8 days, 12:18,  1 user,  load average: 0,29, 0,36, 1,38

SERVER 15 USERS SAMBA - ZENTYAL 4.2
Code:
root@servet:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML150 G6
        Version: 1.0
        Serial Number: MXS108003W
        UUID: 745FC10B-XXXX-DF11-XXXX-C192EAA48B93
        Wake-up Type: Power Switch
        SKU Number: 466132-001
        Family: ProLiant Server

root@servet:~# uname -a
Linux servet 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

root@servet:~# uptime
 23:10:45 up 11 days,  6:54,  1 user,  load average: 0,00, 0,01, 0,05

FIREWALL NON SAMBA USERS - ZENTYAL 4.2
Code:
root@servpcr-fw:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML110 G5
        Version:      NA
        Serial Number: MX2014011G
        UUID: 44F48208-XXXX-5606-XXXX-560649F92209
        Wake-up Type: Power Switch
        SKU Number: AT040A
        Family: 1234567890

root@servpcr-fw:~# uname -a
Linux servpcr-fw 3.19.0-56-generic #62~14.04.1-Ubuntu SMP Fri Mar 11 11:03:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

root@servpcr-fw:~# uptime
 23:12:13 up 12 days, 23:11,  1 user,  load average: 2,69, 2,14, 2,03

pcready.cl

  • Zen Samurai
  • ****
  • Posts: 286
  • Karma: +13/-1
  • Zentyal Installer in Chile
    • View Profile
    • PC Ready Chile SpA
Re: Zentyal 4.2 - BUG: soft lockup - CPU #1, after latest update
« Reply #104 on: April 21, 2016, 10:21:50 pm »
SERVER 15 USERS SAMBA - ZENTYAL 4.2
Code:
root@servet:~# dmidecode | grep "^System Information" -A8
System Information
        Manufacturer: HP
        Product Name: ProLiant ML150 G6
        Version: 1.0
        Serial Number: MXS108003W
        UUID: 745FC10B-XXXX-DF11-XXXX-C192EAA48B93
        Wake-up Type: Power Switch
        SKU Number: 466132-001
        Family: ProLiant Server

CRASH YESTERDAY!!!!! with kernel .58  :'(