Segfaults are Cool: March 2012

Wednesday, March 21, 2012

Debugging a Hard Lockup with RTAI

I have an embedded box running Linux 2.6.29/RTAI/RTnet. I hacked the e1000e driver for use with RTnet and also I eviscerated its ISR as (due to PCI limitations) its IRQ sits on the same line as an FGPA IRQ my client is using.

The FPGA RTAI ISR is calling the old e1000e ISR by
a) signalling a semaphore which
b) wakes an RTAI tasks which
c) calls the e1000e ISR.

Some times when using this hacked driver the machine would lock up hard and we had no clue.

I found out about the "nmi_watchdog=1" kernel option, recompiled the kernel with IO-APIC (fortunately this target has an APIC) and observed. After 5 seconds the answer came on a silver platter:


BUG: NMI Watchdog detected LOCKUP on CPU0, ip f8a4d620
Pid: 990, comm: sh Tainted: P           (2.6.29.6-apic-ipipe #37)
EIP: 0060:[<f8a4d620>] EFLAGS: 00000046 CPU: 0
EIP is at rt_sem_signal+0xf4/0x3d6 [rtai_sched]
EAX: 3fffffff EBX: 00000010 ECX: f8a5aae0 EDX: 3fffffff
ESI: f8a640e0 EDI: f8a63c60 EBP: f7b5be44 ESP: f7b5be30
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process sh (pid: 990, ti=f7b5a000 task=f78f2330 task.ti=f7b5a000)
I-pipe domain Linux
Stack:
 00000246 00000000 00000010 00000000 00000003 f7b5be58 f8a630d5 f8a640e0
 00000011 fa63e4c4 f7b5be70 f8e9b694 000001b8 c0297874 00000000 f7819580
 f7b5be88 c01dc543 f74e0748 00000000 f781de40 00000000 f7b5bea8 c01537d1
Call Trace:
 [<f8a630d5>] ? irqregistrar_switch_irq+0x59/0x61 [irqregistrar]
 [<f8e9b694>] ? proprietarydriver_open+0x3b0/0x3ce [proprietarymodule]
 [<c01dc543>] ? misc_open+0xf4/0x150
 [<c01537d1>] ? chrdev_open+0xef/0x106
 [<c015045c>] ? __dentry_open+0xfe/0x1eb
 [<c01505e2>] ? nameidata_to_filp+0x2b/0x42
 [<c01536e2>] ? chrdev_open+0x0/0x106
 [<c015a146>] ? do_filp_open+0x337/0x648
 [<c013595b>] ? __ipipe_restore_root+0x16/0x18
 [<c014f060>] ? kmem_cache_alloc+0x79/0xb0
 [<c0160a26>] ? alloc_fd+0x4c/0xb2
 [<c0150258>] ? do_sys_open+0x46/0xed
 [<c0150341>] ? sys_open+0x1e/0x26
 [<c0102c07>] ? syscall_call+0x7/0xb
Code: 0f 84 16 01 00 00 8b 47 0c 83 e0 f3 89 47 0c 8b 47 0c 48 0f 85 03 01 00 00 8b 87 04 04 00 00 85 c0 74 42 8b 0d f4 ad a5 f8 eb 0d <8b> 89 14 03 00 00 8b 41 1c 85 c0 78 0a 8b 57 1c 8b 41 1c 39 c2
---[ end trace 4a04e77971ee2cfd ]---

Weird: my code was actually calling rt_mutex_unlock but that's aliased in RTAI to rt_sem_signal.

Decoding this was a bit of a puzzle as ksymoops is deprecated for 2.6 The offending instruction was

mov 0x314(%ecx),%ecx

which after a bit of disassembly turned out to be in enq_ready_task() whose code is a bit counterintuitive (it looks at the semaphore's task queue).

Divining that my semaphore may be borked I looked at the code (a kernel module) that I wrote 2 years ago and it read

SEM pollSem;

which should be always initalised to zero at insmod time it just that it isn't.

So I changed the module_init code to
memset(&pollSem,0,sizeof(&pollSem));
rt_typed_sem_init(&pollSem, 1, RES_SEM | PRIO_Q);
which seems to have done the trick.

Apr 10 update: Nope, did not work. The mutex was shared between RTAI and Linux tasks and releasing it from Linux confused the RTAI code -- maybe they've only expected RTAI mutexes to be used solely in RTAI context. My problem was that I was trying to protect a resource (a callback registry) that was changed from Linux but used from RTAI. In the end I used it lockless in RTAI context (Linux is not running when RTAI is executing) and protect it with cli/sti in Linux context when it is manipulated to make sure RTAI does not barge in.

I would like to commend the RTAI people for the utter lack of documentation on mutexes or anything else and sparse sample code that they abundantly put forward for people to learn their wares. If it was hard to write then it must be hard to use seems to be their groupthink.

-ulianov

Thursday, March 15, 2012

Missed interrupts in Hard RT context (Linux/RTAI)

My client has an embedded box (a Core2 Duo SOM) which runs a realtime application under RTAI. The meat of it is in an ISR/IRQ which fires every 100 uS and which is triggered by an FPGA dangling off PCIe.

For the past year they have seen and characterised this "missed interrupt" problem:
a) after boot-up IRQs are missed for 12.5 minutes;
b) interrupts misses occur every 2.9999 seconds.

The working hypothesis was that this happens when SMI interrupts are happening but we looked thru the Phoenix BIOS and did not find any setting for turning SMIs off. We turned off everything we can, including power management to no avail.

The maker of the SOM was of no help.

The missed interrupts happened while a huge software stack was running: Linux, RTAI, a Java/AWT+X11 app. Hard to narrow it.

We took a few steps:
1. boot to LILO prompt and wait for 13 minutes => still happened;
2. boot bare Linux and pause /sbin/init for 13 minutes => did not happen.

The LILO thing was interesting. So I reckoned I might do some stuff under good old DOS and see what happens (no drivers loaded).

I wrote this C (djgpp) program to observe. Basically I am reading a Pentium performance counter in a tight loop and do a bit of math with interrupts off:

#include <stdio.h>

// COMPILER-DEPENDENT CODE GOES HERE

#define SAMPLE_COUNT 1000

typedef struct {
    unsigned long long tsc;
    unsigned long long pass;
    unsigned long long dT;
} fault_t;

int faultc = -1;
fault_t faults[SAMPLE_COUNT] = {0};     

int main (int argc, char* argv[])
{
    int i, count=0;
    int max_samples = SAMPLE_COUNT;
    unsigned long long pass = 0;
    unsigned long long last_ts = 0;   

    if(argc > 1)  {
        if(sscanf(argv[1], "%d", &max_samples) != 1) exit(1);
        if(max_samples > SAMPLE_COUNT) max_samples = SAMPLE_COUNT;
    }

    if(max_samples < 0) return 0;

    cli();

    last_ts = RDTSC();
    for(pass=0; faultc < max_samples; pass++) {
        unsigned long long dT = 0, tmp = RDTSC();
        if(last_ts > tmp) { last_ts = tmp; continue; } // overflow
        dT = tmp - last_ts;
        last_ts = tmp;
        if(dT > 350000) {
            ++faultc;
            faults[faultc].tsc = tmp;
            faults[faultc].pass= pass;
            faults[faultc].dT  = dT;
            if((faultc + 1) % 5 == 0) { fprintf(stderr, "!"); fflush(stderr); }
        }

        if(faultc >= 0 && (tmp - faults[faultc].tsc) > 10000000000LL) {
            fprintf(stderr, "TIMEOUT"); fflush(stderr);
            break;
        }
    }

    sti();

    for(i = 0; i < faultc; i++)
        printf("pass=%012lld T=%016lld dT=%016lld\n", faults[i].pass, faults[i].tsc, faults[i].dT); 

    return 0;
}

For djgpp the low-level code goes:

static inline unsigned long long RDTSC(void)
{
    unsigned long lo, hi;

    __asm__ __volatile__ ( 
 "xorl %%eax,%%eax \n        cpuid"
 ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));

    return (uint64_t)hi << 32 | lo;
}

void cli() { __asm__( "cli" ); }
void sti() { __asm__( "sti" ); }

And bingo! I reproduced the observed behaviour down to the 12.5 min runtime. So this is not Linux's fault at all.

The djgpp code runs under DOS but in Protected Mode. The LILO test prompted me to run this code in Real Mode but djgpp cannot generate this code (easily).

So I fetched OpenWatcom and modified the low-level code like so:

unsigned long long RDTSC()
{
    unsigned long long ret = 0;
    unsigned long hi = 0, lo = 0;
    __asm {
        rdtsc
        mov dword ptr lo, eax
        mov dword ptr hi, edx
    };
    ret = ((unsigned long long)hi << 32) | lo;
    return ret;
}
void cli() { _asm cli; }
void sti() { _asm sti; }

The Watcom C's assembler syntax gave me some headache.

This code ran in pure Real Mode with the same result. So it must be SMI.

I g00gled for RTAI+SMI and found out about the smi-module. Tried that but my hardware went undetected. Hmm. This module has a list of "known" hardware by PCI id but that told me not much.

So I fetched the PCI ids database from Sourceforge and had a look and interestingly some of the known hardware were ISA bridge controllers. Hmm. Then I looked at my hardware and noticed one of those as 0x8086:0x27bd. I added this to the lookup table and the module worked.

Indeed with the SMI interrupts turned off we had no more mysterious missed interrupts and the saga was closed. I guess I was lucky this time.

-ulianov

Thursday, March 1, 2012

On Doing Ugly Linux Kernel Hacks for Need & Profit

A client has two targets ("new" and "legacy") and they insist they run the same system image on both. The two have subtle differences, the most annoying being that the legacy one does not support (U)DMA for CompactFlash cards.

Linux can be told to disable (U)DMA support by passing "ide-core.nodma=0.0" in the command line but there's no way to re-enable later on, performance is dead-slow and these folks have tons of code that has to be loaded from the CF and they like fast boot times.

I cut some of their load time by upx'ing their XFree86 binary and proprietary app and that helped somewhat on both platforms. And it shrank their installer by 1M which is always a bonus.

I tried to muck with LILO but second.S is written in assembler (mine is quite rusty) and getting to the actual command line sector is yucky.

The fall back was to butcher the kernel. I have simple target detection routine which uses cpuid, let's call that chkLegacyCPU().

I changed start_kernel() in init/main.c to append " ide-core.nodma=0.0" to the command line (which is now a global in main.c) if the legacy platform is detected.

Also the legacy target does not support booting in graphics mode (vga=8) so I modified main() in arch/x86/boot/main.c to force the VGA mode to NORMAL_VGA if legacy platform.

Interestingly the two modifications are in two different parts of the kernel -- the one in x86/boot executes in x86 real mode and has a totally different software stack so I had to duplicate the code of chkLegacyCPU() for it.

-ulianov