Created attachment 578932 [details]
kernel panic

Description of problem:
Our NFS server runs fine on the latest kernel, 2.6.18-308.4.1.el5 x86_64. As soon as we boot the clients to a 5.8 kernel, the NFS server crashes. When the NFS clients run a 5.7 kernel, all is well. We have seen this since the initial 5.8 release.

Version-Release number of selected component (if applicable):
$ uname -a
Linux ng-bak1.xxx 2.6.18-308.4.1.el5 #1 SMP Wed Mar 28 01:54:56 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always; boot the client with:
$ uname -a
Linux ng-bak1.xxx 2.6.18-308.4.1.el5 #1 SMP Wed Mar 28 01:54:56 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Steps to Reproduce:
1. exports file on server:
   /srv/backup/orabak 192.168.100.0/24(rw,async,all_squash,anonuid=501,anongid=501)
2. fstab on clients:
   192.168.100.10:/srv/backup/orabak /srv/orabak nfs intr 0 0
3. Boot the clients with any 5.8 kernel and watch the NFS server crash.

Actual results:
Kernel panic, screenshot attached.

Expected results:
No kernel panic.

Additional info:
I attached a DRAC screenshot of the server; all hardware is Dell PE2950.
Tried reproducing it here, but to no avail:

1) On the server, add this to /etc/exports:
   /orabak 192.168.100.0/24(rw,async,all_squash,anonuid=501,anongid=501)

2) Mount the export on one RHEL 5.8 client:
   mount -t nfs -o intr,vers=3 nfsserver:/orabak /mnt

Nothing else to report, works fine. Anything I may have missed?

Vincent
@vincent: No, that's exactly how I can reproduce it. It's Dell PE2950 server hardware. Sadly, the Dell firmware repository is missing the latest Broadcom firmware for the NICs. I will update the firmware by hand tonight and see if this makes a difference.
Tried mounting by IP and via fstab as well; no panic. The server is a Dell Precision workstation, the client is a VM (VMware Workstation 8).
@Rainer: if I can reproduce the issue here, then I'll open a support ticket. Your issue looks bad enough that I should make sure it's being looked into.
The NFS server has now been up for 9 hours since I updated to Dell's latest firmware, 6.4.5, for the Broadcom cards. Until then it was running 6.2.6, which is the latest in the firmware repository. So I guess the problem is solved, because previously we saw the crashes instantly.

For the record, this is the problematic NIC:

07:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
        Subsystem: Dell Device 01b2
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 114
        Memory at d6000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Kernel driver in use: bnx2
        Kernel modules: bnx2
Interesting, on my T5400 workstation acting as a server, I also had a Broadcom card as eth0 (but a different type):

[root@palanthas ~]# lspci | grep -i broad
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754 Gigabit Ethernet PCI Express (rev 02)
[root@palanthas ~]# ethtool -i eth0
driver: tg3
version: 3.119
firmware-version: 5754-v3.24
bus-info: 0000:09:00.0

It would be interesting to investigate this issue a little further. Would you, by chance, have a vmcore of the previous crashes? The fact that a firmware issue can cause a kernel crash doesn't feel good. At most, it should (IMHO) cause timeouts and cut the server off from the network, but it shouldn't panic.

Regards,
As I mentioned on the rhel5 mailing list, I also found the workaround of using UDP instead of TCP to mount the NFS share. Using UDP showed exactly the behaviour you describe (on the client):

Apr 24 23:10:15 ng-db1 kernel: nfs: server 192.168.100.10 not responding, still trying
Apr 24 23:11:06 ng-db1 last message repeated 2 times
Apr 24 23:11:06 ng-db1 kernel: nfs: server 192.168.100.10 OK
Apr 24 23:11:06 ng-db1 last message repeated 2 times
Apr 25 00:21:03 ng-db1 kernel: nfs: server 192.168.100.10 not responding, still trying
Apr 25 00:21:03 ng-db1 kernel: nfs: server 192.168.100.10 OK

And in dmesg there were many "link up" and "link down" messages on the server. I asked our Oracle admin if rman was working without problems (despite these messages), which he affirmed. These messages made me look for the new firmware, which obviously cured the kernel panics. But I guess the real problem is that the kernel panics when TCP is used to mount NFS and the link goes away on the server. Sadly this is our production environment and I do not have a testing lab.
Well, it died down in the bowels of the TCP layer code (tcp_sendpage). It seems possible that there's a crashable bug in there, and adding the correct firmware somehow papered over it. I suppose it's also possible that there's a driver or firmware bug that just happened to cause a crash in this layer (memory corruption, maybe?).

Either way, it's doubtful we'd be able to make much progress without a vmcore or at least the entire oops message. Without those, we'll probably have to close this with a resolution of INSUFFICIENT_DATA. Any chance you have either of those things, or have a way to get them?
I will try to get this information over the weekend.
Either way, it would also help expedite things to open an RH support case.
I would open a support ticket under our contract (large bank), but alas I don't have Dell hardware, only HP ProLiants. If Rainer can get a vmcore and a sosreport of the crashed server, I could open a ticket for him.

Thanks,
Vincent
I have done so, but we only have basic support. Case # is 00633163
Good enough -- thanks.
Unable to access https://access.redhat.com/knowledge/solutions/109263, even with my support login. Is there really a solution?

Vincent
Ok, I have a vmcore. Maybe important: I only got it with this fstab line:

192.168.100.10:/srv/backup/orabak /srv/orabak nfs intr 0 0

This one was working (no panic):

192.168.100.10:/srv/backup/orabak /srv/orabak nfs rsize=32768,wsize=32768,timeo=14,hard,intr 0 0

I'm on vacation from Thursday on and will be back on May 21.
Created attachment 581493 [details]
kernel panic with old broadcom firmware
or just the link: http://awaro.com/de/download/vmcore
Ok, some notes...

Oops occurred in the compound_head() call here:

	if (unlikely(PageTail(page)))

...so, some disassembly around the crash area:

0xffffffff8025dbd2 <tcp_sendpage+0x388>:	jmp    0xffffffff8025dc29 <tcp_sendpage+0x3df>
0xffffffff8025dbd4 <tcp_sendpage+0x38a>:	mov    0x28(%rsp),%rdx
0xffffffff8025dbd9 <tcp_sendpage+0x38f>:	mov    (%rdx),%rax      <<< CRASH HERE
0xffffffff8025dbdc <tcp_sendpage+0x392>:	and    $0x24000,%eax
0xffffffff8025dbe1 <tcp_sendpage+0x397>:	cmp    $0x24000,%rax
0xffffffff8025dbe7 <tcp_sendpage+0x39d>:	jne    0xffffffff8025dbed <tcp_sendpage+0x3a3>
0xffffffff8025dbe9 <tcp_sendpage+0x39f>:	mov    0x10(%rdx),%rdx

So this appears to be in the midst of the PageTail macro, but it does some "funny stuff" with the stack here: it's expecting to pull the page pointer off the stack. tcp_sendpage is loaded with heavily inlined functions, but in any case we should be ending up with %rdx holding the address of the page. Unfortunately, that address is clearly bogus:

0x007808000001022c

...so apparently the page array that got passed into do_tcp_sendpages was corrupt.
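For context, here is a minimal sketch of what that inlined check looks like in kernels of this vintage (reconstructed from memory, so treat the exact definitions as assumptions; the flag bits do match the disassembly: PG_compound is bit 14 and PG_reclaim is bit 17, giving exactly the 0x24000 mask, and tail pages set both):

/*
 * Sketch of the 2.6.18-era PageTail/compound_head pair (assumed
 * definitions consistent with the disassembly above).
 */
struct page {
	unsigned long flags;		/* offset 0: the faulting load */
	/* ... */
	struct page *first_page;	/* on tail pages: the head page */
};

#define PG_compound		14
#define PG_reclaim		17
#define PG_head_tail_mask	((1UL << PG_compound) | (1UL << PG_reclaim))

static inline int PageTail(struct page *page)
{
	/* The very first access dereferences the page pointer to read
	 * page->flags -- the "mov (%rdx),%rax" that oopsed, since %rdx
	 * held the bogus address 0x007808000001022c. */
	return (page->flags & PG_head_tail_mask) == PG_head_tail_mask;
}

static inline struct page *compound_head(struct page *page)
{
	if (unlikely(PageTail(page)))
		return page->first_page;	/* the "mov 0x10(%rdx),%rdx" */
	return page;
}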
The kernel_sendpage request was this one in svc_sendto:

0x7e2b is in svc_sendto (net/sunrpc/svcsock.c:423).

	/* send head */
	if (slen == xdr->head[0].iov_len)
		flags = 0;
	len = kernel_sendpage(sock, rqstp->rq_respages[0], 0,    <<<<
			      xdr->head[0].iov_len, flags);
	if (len != xdr->head[0].iov_len)
		goto out;

Poking around on the stack tells me that the rqstp is at 0xffff810221dda000. That gives me:

crash> struct svc_rqst.rq_respages ffff810221dda000
  rq_respages = 0xffff810101b22840

...which shows:

crash> struct page.flags 0xffff810101b22840
  flags = 0x7808000001022c

...so either we have a bug where we passed in a pointer where it should have been a double pointer, or I've misinterpreted the assembly above...
Ooops, nm -- that is wrong... rq_respages is a **page, so that should be a pointer to an array of page pointers, and the first page pointer in that array is bad. Now to see if we can determine how it got there in the first place...
Well...no... rq_respages should contain a **page, but it seems to hold a *page instead:

crash> kmem 0xffff810101b22840
      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810101b22840  7c0b8000 ffff8101ff009220    100  3 7808000001022c

I'll go back through the code and see if we have a single/double pointer confusion somewhere...
Dump of the svc_rqst in memory

Here's the last bit of the rq_pages array:

ffff810221ddaa40:  ffff810106e39310 ffff810106d51270   ........p.......
ffff810221ddaa50:  ffff810106ea9c78 ffff810106d69798   x...............
ffff810221ddaa60:  ffff810106e50760 ffff810106e543c0   `........C......
ffff810221ddaa70:  ffff810104c06cd8 ffff810104c07250   .l......Pr......
ffff810221ddaa80:  ffff810104c06e98 ffff810104c07cd0   .n.......|......
ffff810221ddaa90:  ffff810104c26b30 ffff810104c26a88   0k.......j......

...and here's the rq_respages pointer, followed by the kvec array:

ffff810221ddaaa0:  ffff810101b22840 ffff81009553c000   @(........S.....
ffff810221ddaab0:  0000000000001000 ffff8101f246c000   ..........F.....

Interestingly, the "index" field in each page is increasing as it goes:

crash> kmem ffff810104c26b30
      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810104c26b30 15c1ea000 ffff8101ff009220     fe  3 15810000001020c

crash> kmem ffff810104c26a88
      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810104c26a88 15c1e7000 ffff8101ff009220     ff  3 15810000001020c

crash> kmem ffff810101b22840
      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810101b22840  7c0b8000 ffff8101ff009220    100  3 7808000001022c

...and it certainly seems like the page pointer in the rq_respages field fits that pattern. I think this might be an overrun of the rq_pages array.
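To make the suspected overrun concrete, here's a simplified sketch of the field layout the dump implies (surrounding fields omitted and the exact declarations are an assumption, but the dump above does show rq_respages sitting directly after rq_pages, with the kvec array following):

/*
 * Simplified svc_rqst layout consistent with the memory dump above.
 * With RPCSVC_MAXPAGES slots but RPCSVC_MAXPAGES + 1 pages allocated,
 * the final page pointer lands on top of rq_respages -- leaving it
 * holding a struct page * where a struct page ** belongs, which is
 * exactly the corruption seen in the dump.
 */
#define RPCSVC_MAXPAGES	258	/* 256 data pages + 2 extra */

struct svc_rqst {
	/* ... */
	struct page	*rq_pages[RPCSVC_MAXPAGES];
	struct page	**rq_respages;	/* clobbered by the overrun */
	struct kvec	rq_vec[RPCSVC_MAXPAGES];
	/* ... */
};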
Ok, playing with a little debug patch here...

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a0edff9..f477ba7 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1247,6 +1247,7 @@ svc_recv(struct svc_serv *serv, struct svc_rqst *rqstp, long timeout)
 	/* now allocate needed pages. If we get a failure, sleep briefly */
 	pages = 2 + (serv->sv_bufsz + PAGE_SIZE -1) / PAGE_SIZE;
+	dprintk("%s: allocating %d pages\n", __func__, pages);
 	for (i=0; i < pages ; i++)
 		while (rqstp->rq_pages[i] == NULL) {
 			struct page *p = alloc_page(GFP_KERNEL);

...when I bump nfsd_max_blksize to 1M (which is what it would be when you have >4GB of RAM), I see this on starting knfsd:

svc_recv: allocating 259 pages

...which is 1 more than it should be. So we probably have a situation where rq_respages doesn't get overwritten in certain cases, and in those cases, if we try to send the reply...kaboom. In cases where that pointer does get overwritten, we are still leaking a page, so this is a pretty nasty bug regardless...
The upstream and rhel6 code seems to be ok, since it has this sanity check:

	BUG_ON(pages >= RPCSVC_MAXPAGES);

It seems like that should be checked *before* we allocate all of the pages, but it should catch this regardless. So this seems to be a bug only in the backport of the code that allows larger r/wsizes.
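For illustration, the reordering suggested above would look something like this (a hypothetical sketch, not the actual upstream code):

	/*
	 * Sketch: validate the computed page count before the allocation
	 * loop, so a miscalculated count can't scribble past the end of
	 * rq_pages before the sanity check fires.
	 */
	pages = 2 + (serv->sv_bufsz + PAGE_SIZE - 1) / PAGE_SIZE;
	BUG_ON(pages > RPCSVC_MAXPAGES);	/* check first... */
	for (i = 0; i < pages; i++)
		while (rqstp->rq_pages[i] == NULL) {
			struct page *p = alloc_page(GFP_KERNEL);
			if (p == NULL)
				schedule_timeout_uninterruptible(msecs_to_jiffies(500));
			rqstp->rq_pages[i] = p;	/* retries while NULL */
		}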
The buffer size calculations are pretty convoluted, but here we go:

The second arg here is what becomes sv_bufsz:

	nfsd_serv = svc_create(&nfsd_program, NFSD_BUFSIZE - NFSSVC_MAXBLKSIZE +
				nfsd_max_blksize);

nfsd_max_blksize == NFSSVC_MAXBLKSIZE (when mem > 4G)
NFSSVC_MAXBLKSIZE == RPCSVC_MAXPAYLOAD == (1*1024*1024u)

...but in any case, those two should cancel each other out, so sv_bufsz should be NFSD_BUFSIZE, which is:

#define NFSD_BUFSIZE	((RPC_MAX_HEADER_WITH_AUTH+26)*XDR_UNIT + NFSSVC_MAXBLKSIZE)

...upstream has the same definition, but there it's only used to set pc_xdrressize for the NFSv4 compound RPC.
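To put numbers on it, here's a standalone back-of-the-envelope check of the arithmetic (the exact RPC_MAX_HEADER_WITH_AUTH value is an assumption; any nonzero header allowance produces the same overrun):

/* Standalone userspace check of the page-count overrun. */
#include <stdio.h>

#define PAGE_SIZE		4096
#define XDR_UNIT		4
#define RPC_MAX_HEADER_WITH_AUTH 103	/* assumed value, for illustration */
#define NFSSVC_MAXBLKSIZE	(1*1024*1024u)
#define NFSD_BUFSIZE	((RPC_MAX_HEADER_WITH_AUTH+26)*XDR_UNIT + NFSSVC_MAXBLKSIZE)
#define RPCSVC_MAXPAGES	(NFSSVC_MAXBLKSIZE/PAGE_SIZE + 2)	/* 258 slots */

int main(void)
{
	/* with mem > 4G, nfsd_max_blksize == NFSSVC_MAXBLKSIZE, so the
	 * svc_create() argument reduces to plain NFSD_BUFSIZE */
	unsigned sv_bufsz = NFSD_BUFSIZE;

	/* the svc_recv() calculation from the debug patch above */
	unsigned pages = 2 + (sv_bufsz + PAGE_SIZE - 1) / PAGE_SIZE;

	/* the header bytes push the rounded-up count to 257, +2 = 259 */
	printf("allocates %u pages into a %u-slot array\n",
	       pages, (unsigned)RPCSVC_MAXPAGES);	/* 259 vs 258 */
	return 0;
}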
Ok, reassigning to Bruce since he understands this code better than I do... Thanks, Bruce! Let me know if you need me to help out with testing or anything!
Very interesting read, thank you. I'll keep watching this thread.

Vincent
Looking carefully at rhel5 and upstream (assuming TCP, 4096-byte pages, NFSv3/v4, and >4 GB of RAM throughout):

In both cases, the maximum read/write size is 1024*1024 bytes = 1M. The number of pages required to hold that much data is 1M/PAGE_SIZE = 256. The rq_pages array has 256 + 2 elements (allowing an extra page to hold the reply (in the case of a write) or request (in the case of a read), and another extra page for headers, padding, etc.).

In rhel5, we call svc_create with a buffer size of 1MB + some extra (the NFSD_BUFSIZE Jeff mentions in comment 27). After rounding up, that adds an extra page. svc_recv then adds two more pages--so we're adding 3 extra pages, whereas rq_pages only got two extra elements.

Upstream doesn't have this problem: it removes the extra from the svc_create argument, and instead has svc_create add an extra page to account for headers, and then svc_recv add an extra page to account for the request or reply, for a total of 2 extra pages, as it should be.

So, where'd I screw up the backport? It appears that upstream actually had the same bug at some point, but c6b0a9f87b82 "knfsd: tidy up meaning of 'buffer size' in nfsd/sunrpc", while appearing to be pure cleanup, actually fixed the bug. What I'm unclear about is where exactly the bug was introduced upstream--possibly with 44524359484 "knfsd: Replace two page lists in struct svc_rqst with one".

In any case, c6b0a9f87b82 is probably more than we want to backport. (In theory I think it may change kabi, though I doubt it's kabi anyone uses.) So in rhel5 we'll probably want to do something simpler: maybe throw out the extra NFSD_BUFSIZE in the svc_create() argument? I'll take a look Monday.
(In reply to comment #30)
> In any case, c6b0a9f87b82 is probably more than we want to backport. (In
> theory I think it may change kabi, though I doubt it's kabi anyone uses.) So
> in rhel5 we'll probably want to do something simpler: maybe throw out the
> extra NFSD_BUFSIZE in the svc_create() argument?

The problem with doing that is that sv_bufsz is also used to estimate the sizes of socket buffers, for example, and for that purpose we want the size of the full rpc request, not just the read or write payload. So I think backporting that whole patch, with some kabi fixups, is probably the right thing to do.
Created attachment 582792 [details]
fix oops due to overrunning server's page array

Here's a backport of c6b0a9f87b82f25fa35206ec04b5160372eabab4.
I'm having the exact same issue on a Dell R610. I've opened case #639436. I will attach a vmcore to that ticket now.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
If I limit memory on the server to 3 gigs, I start receiving

RPC: bad TCP reclen 0x001000a4 (large)

on the console, but the system doesn't appear to kernel panic.
(In reply to comment #38)
> If I limit memory on the server to 3 gigs, I start receiving
>
> RPC: bad TCP reclen 0x001000a4 (large)
>
> on the console, but the system doesn't appear to kernel panic.

This is using which kernel exactly?

If you're able to reproduce this crash reliably, and if it were possible to apply the attached patch and retest, the results would be useful.
2.6.18-308.4.1.el5

I can do the testing, but I need to get my Oracle backups completed first, so it may be 24-48 hours before I can test. I can reproduce this fairly reliably, however. I'm poking my DBAs now to get things caught up so I can break the server again.
(In reply to comment #38)
> If I limit memory on the server to 3 gigs, I start receiving
>
> RPC: bad TCP reclen 0x001000a4 (large)
>
> on the console, but the system doesn't appear to kernel panic.

Thinking about that some more, I think that is all expected behavior:

The server by default sets a maximum IO size based on memory. After lowering the amount of memory, the server lowered that maximum IO size to something less than 1 megabyte, which prevents us from overrunning this array, preventing the panic.

However, I'm guessing you had at least one client with existing mounts when you restarted the server. That client was still using the old maximum IO size, and hence was sending write requests larger than the newly rebooted server expected.

If the client unmounts and remounts, you probably won't see those messages any more.
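For reference, the message comes from the record-length sanity check in the server's TCP receive path, which looks roughly like this in kernels of this era (a sketch from memory; names are approximate):

	/*
	 * Sketch of the check in net/sunrpc/svcsock.c that emits the
	 * message. A client that negotiated a 1MB wsize against the old
	 * server sends a record of 0x001000a4 bytes (1MB of data plus
	 * RPC overhead), which now exceeds the rebooted server's smaller
	 * sv_bufsz, so the connection is dropped instead of the request
	 * being processed (and overrunning the page array).
	 */
	if (svsk->sk_reclen > serv->sv_bufsz) {
		printk(KERN_NOTICE "RPC: bad TCP reclen 0x%08lx (large)\n",
		       (unsigned long)svsk->sk_reclen);
		goto err_delete;
	}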
Created attachment 583282 [details]
fix oops due to overrunning server's page array

Apologies--on review I found an error in the backported patch. If you're able to test, please test this version rather than the previous one. (The previous patch will probably fix the bug as well, though there's a small chance it could cause some other problems.)
(In reply to comment #41)
> (In reply to comment #38)
> > If I limit memory on the server to 3 gigs, I start receiving
> >
> > RPC: bad TCP reclen 0x001000a4 (large)
> >
> > on the console, but the system doesn't appear to kernel panic.
>
> Thinking about that some more, I think that is all expected behavior:
>
> The server by default sets a maximum IO size based on memory.
>
> After lowering the amount of memory, the server lowered that maximum IO size
> to something less than 1 megabyte, which prevents us from overrunning this
> array, preventing the panic.
>
> However, I'm guessing you had at least one client with existing mounts when
> you restarted the server. That client was still using the old maximum IO
> size, hence was sending write requests larger than the newly rebooted server
> expected.
>
> If the client unmounts and remounts, you probably won't see those messages
> any more.

Confirmed. When the server crashed originally, we went to CIFS mounts as temporary replacements. I unmounted the stale NFS filesystems with umount -l. It looks like one server was pinging continuously, trying to finish whatever it was doing.

tcpdump (10.150.50.104 is the client, SHSNS2 is the server):

10:26:58.709743 IP 10.150.50.104.0 > SHSNS2.nfs: 1448 null
10:26:58.709746 IP SHSNS2.nfs > 10.150.50.104.987: . ack 1233525619 win 159 <nop,nop,timestamp 8786454 3405911169>
10:26:58.709748 IP 10.150.50.104.0 > SHSNS2.nfs: 1448 null
10:26:58.709751 IP SHSNS2.nfs > 10.150.50.104.987: . ack 1233527067 win 181 <nop,nop,timestamp 8786454 3405911169>

A hard reboot of the client cleared the RPC messages, and forcing the max memory to 3 gigabytes is at least allowing us to serve via NFS until this is resolved.
Note: as a workaround, you can decrease the maximum IO size to something less than 1MB by writing to /proc/fs/nfsd/max_block_size, for example:

echo 524288 > /proc/fs/nfsd/max_block_size

Note this has to be done after mounting /proc/fs/nfsd, but before starting nfsd.
Created attachment 584796 [details]
fix oops due to overrunning server's page array

Further testing on ia64 saw another overrun of the array. I believe this may be caused by an off-by-one error in the read path, fixed upstream by 250f3915183d377d36e012bac9caa7345ce465b8 "[PATCH] knfsd: fix an NFSD bug with full sized, non-page-aligned reads", so I've added a backport of that patch to the attached. Not yet tested.
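A quick illustration of why a full-sized, non-page-aligned read touches one extra page (standalone arithmetic; the offset value is arbitrary):

/* Pages spanned by a full-sized read at a non-page-aligned offset. */
#include <stdio.h>

#define PAGE_SIZE 4096

int main(void)
{
	unsigned long count = 1024 * 1024;	/* full-sized 1MB read */
	unsigned long offset = 512;		/* any non-page-aligned start */

	/* pages covering the byte range [offset, offset + count) */
	unsigned long first = offset / PAGE_SIZE;
	unsigned long last = (offset + count - 1) / PAGE_SIZE;

	/* 257, not the 256 an aligned read would need */
	printf("pages spanned: %lu\n", last - first + 1);
	return 0;
}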
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The kernel version 2.6.18-308.4.1.el5 contained several bugs which led to an overrun of the NFS server page array. Consequently, any attempt to connect an NFS client running on Red Hat Enterprise Linux 5.8 to an NFS server running this kernel caused the NFS server to terminate unexpectedly and the kernel to panic. This update corrects the bugs causing the NFS page array overruns, and the kernel no longer crashes in this scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0006.html