Segfaults are Cool: 2012

Monday, October 15, 2012

SH-2 Aquarius Kernel Debugging Done in Unusual Ways

I had problems and lockups (mostly caused by me) while developing device drivers for Linux kernel 3.4.

However this needed debugging and sometimes psychic debugging and Jedi mind tricks do not suffice. SH-2 has no protection between user/kernel mode code, memory access; the only thing it cares about is unaligned access to memory.

JTAG was not implemented in RTL on this hacked instance of the CPU core (which ran in a Spartan-6 FPGA). GDB/serial was working fine but only for the boot loader as the kernel took over the interrupt vectors used by GDB. The core ran at 25 MHz which is a bit too fast for printf debugging.

So I did the reasonable thing: I applied the thumb screws to the powers-that-be and the RTL engineer that was supporting me to give me hacked up serial port:
a) on the SH-2 side it looked like a memory-mapped device (8 bit) with an enable/disable bit -- a value put there by the SH-2 instruction set would be emitted continuously as serial output at 9600 baud -- at no cost/overhead for the CPU;
b) on the RS-232 side it looked as a transmit-only device (TX, GND) using only 1 I/O line of the FPGA.

The RTL code had 2 complete instances of RS-232 [umm, no flow control tho] already so it was just a matter of adding another instance and hacking it.

So debugging boiled down to checkpointing the code: writing some ASCII character in one place of the code and one in another, plus enabling/disabling the port as needed. I would monitor the output with Minicom and have an idea of what the kernel was doing and roughly for how long.

Yes, barbaric but one must make do with what he can scrounge.

-ulianov

Wednesday, March 21, 2012

Debugging a Hard Lockup with RTAI


BUG: NMI Watchdog detected LOCKUP on CPU0, ip f8a4d620
Pid: 990, comm: sh Tainted: P           (2.6.29.6-apic-ipipe #37)
EIP: 0060:[<f8a4d620>] EFLAGS: 00000046 CPU: 0
EIP is at rt_sem_signal+0xf4/0x3d6 [rtai_sched]
EAX: 3fffffff EBX: 00000010 ECX: f8a5aae0 EDX: 3fffffff
ESI: f8a640e0 EDI: f8a63c60 EBP: f7b5be44 ESP: f7b5be30
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process sh (pid: 990, ti=f7b5a000 task=f78f2330 task.ti=f7b5a000)
I-pipe domain Linux
Stack:
 00000246 00000000 00000010 00000000 00000003 f7b5be58 f8a630d5 f8a640e0
 00000011 fa63e4c4 f7b5be70 f8e9b694 000001b8 c0297874 00000000 f7819580
 f7b5be88 c01dc543 f74e0748 00000000 f781de40 00000000 f7b5bea8 c01537d1
Call Trace:
 [<f8a630d5>] ? irqregistrar_switch_irq+0x59/0x61 [irqregistrar]
 [<f8e9b694>] ? proprietarydriver_open+0x3b0/0x3ce [proprietarymodule]
 [<c01dc543>] ? misc_open+0xf4/0x150
 [<c01537d1>] ? chrdev_open+0xef/0x106
 [<c015045c>] ? __dentry_open+0xfe/0x1eb
 [<c01505e2>] ? nameidata_to_filp+0x2b/0x42
 [<c01536e2>] ? chrdev_open+0x0/0x106
 [<c015a146>] ? do_filp_open+0x337/0x648
 [<c013595b>] ? __ipipe_restore_root+0x16/0x18
 [<c014f060>] ? kmem_cache_alloc+0x79/0xb0
 [<c0160a26>] ? alloc_fd+0x4c/0xb2
 [<c0150258>] ? do_sys_open+0x46/0xed
 [<c0150341>] ? sys_open+0x1e/0x26
 [<c0102c07>] ? syscall_call+0x7/0xb
Code: 0f 84 16 01 00 00 8b 47 0c 83 e0 f3 89 47 0c 8b 47 0c 48 0f 85 03 01 00 00 8b 87 04 04 00 00 85 c0 74 42 8b 0d f4 ad a5 f8 eb 0d <8b> 89 14 03 00 00 8b 41 1c 85 c0 78 0a 8b 57 1c 8b 41 1c 39 c2
---[ end trace 4a04e77971ee2cfd ]---

Weird: my code was actually calling rt_mutex_unlock but that's aliased in RTAI to rt_sem_signal.

Decoding this was a bit of a puzzle as ksymoops is deprecated for 2.6 The offending instruction was

mov 0x314(%ecx),%ecx

which after a bit of disassembly turned out to be in enq_ready_task() whose code is a bit counterintuitive (it looks at the semaphore's task queue).

Divining that my semaphore may be borked I looked at the code (a kernel module) that I wrote 2 years ago and it read

SEM pollSem;

which should be always initalised to zero at insmod time it just that it isn't.

So I changed the module_init code to
memset(&pollSem,0,sizeof(&pollSem));
rt_typed_sem_init(&pollSem, 1, RES_SEM | PRIO_Q);
which seems to have done the trick.

Apr 10 update: Nope, did not work. The mutex was shared between RTAI and Linux tasks and releasing it from Linux confused the RTAI code -- maybe they've only expected RTAI mutexes to be used solely in RTAI context. My problem was that I was trying to protect a resource (a callback registry) that was changed from Linux but used from RTAI. In the end I used it lockless in RTAI context (Linux is not running when RTAI is executing) and protect it with cli/sti in Linux context when it is manipulated to make sure RTAI does not barge in.

I would like to commend the RTAI people for the utter lack of documentation on mutexes or anything else and sparse sample code that they abundantly put forward for people to learn their wares. If it was hard to write then it must be hard to use seems to be their groupthink.

-ulianov

Thursday, March 15, 2012

Missed interrupts in Hard RT context (Linux/RTAI)

My client has an embedded box (a Core2 Duo SOM) which runs a realtime application under RTAI. The meat of it is in an ISR/IRQ which fires every 100 uS and which is triggered by an FPGA dangling off PCIe.

For the past year they have seen and characterised this "missed interrupt" problem:
a) after boot-up IRQs are missed for 12.5 minutes;
b) interrupts misses occur every 2.9999 seconds.

The working hypothesis was that this happens when SMI interrupts are happening but we looked thru the Phoenix BIOS and did not find any setting for turning SMIs off. We turned off everything we can, including power management to no avail.

The maker of the SOM was of no help.

The missed interrupts happened while a huge software stack was running: Linux, RTAI, a Java/AWT+X11 app. Hard to narrow it.

We took a few steps:
1. boot to LILO prompt and wait for 13 minutes => still happened;
2. boot bare Linux and pause /sbin/init for 13 minutes => did not happen.

The LILO thing was interesting. So I reckoned I might do some stuff under good old DOS and see what happens (no drivers loaded).

I wrote this C (djgpp) program to observe. Basically I am reading a Pentium performance counter in a tight loop and do a bit of math with interrupts off:

#include <stdio.h>

// COMPILER-DEPENDENT CODE GOES HERE

#define SAMPLE_COUNT 1000

typedef struct {
    unsigned long long tsc;
    unsigned long long pass;
    unsigned long long dT;
} fault_t;

int faultc = -1;
fault_t faults[SAMPLE_COUNT] = {0};     

int main (int argc, char* argv[])
{
    int i, count=0;
    int max_samples = SAMPLE_COUNT;
    unsigned long long pass = 0;
    unsigned long long last_ts = 0;   

    if(argc > 1)  {
        if(sscanf(argv[1], "%d", &max_samples) != 1) exit(1);
        if(max_samples > SAMPLE_COUNT) max_samples = SAMPLE_COUNT;
    }

    if(max_samples < 0) return 0;

    cli();

    last_ts = RDTSC();
    for(pass=0; faultc < max_samples; pass++) {
        unsigned long long dT = 0, tmp = RDTSC();
        if(last_ts > tmp) { last_ts = tmp; continue; } // overflow
        dT = tmp - last_ts;
        last_ts = tmp;
        if(dT > 350000) {
            ++faultc;
            faults[faultc].tsc = tmp;
            faults[faultc].pass= pass;
            faults[faultc].dT  = dT;
            if((faultc + 1) % 5 == 0) { fprintf(stderr, "!"); fflush(stderr); }
        }

        if(faultc >= 0 && (tmp - faults[faultc].tsc) > 10000000000LL) {
            fprintf(stderr, "TIMEOUT"); fflush(stderr);
            break;
        }
    }

    sti();

    for(i = 0; i < faultc; i++)
        printf("pass=%012lld T=%016lld dT=%016lld\n", faults[i].pass, faults[i].tsc, faults[i].dT); 

    return 0;
}

For djgpp the low-level code goes:

static inline unsigned long long RDTSC(void)
{
    unsigned long lo, hi;

    __asm__ __volatile__ ( 
 "xorl %%eax,%%eax \n        cpuid"
 ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));

    return (uint64_t)hi << 32 | lo;
}

void cli() { __asm__( "cli" ); }
void sti() { __asm__( "sti" ); }

And bingo! I reproduced the observed behaviour down to the 12.5 min runtime. So this is not Linux's fault at all.

The djgpp code runs under DOS but in Protected Mode. The LILO test prompted me to run this code in Real Mode but djgpp cannot generate this code (easily).

So I fetched OpenWatcom and modified the low-level code like so:

unsigned long long RDTSC()
{
    unsigned long long ret = 0;
    unsigned long hi = 0, lo = 0;
    __asm {
        rdtsc
        mov dword ptr lo, eax
        mov dword ptr hi, edx
    };
    ret = ((unsigned long long)hi << 32) | lo;
    return ret;
}
void cli() { _asm cli; }
void sti() { _asm sti; }

The Watcom C's assembler syntax gave me some headache.

This code ran in pure Real Mode with the same result. So it must be SMI.

I g00gled for RTAI+SMI and found out about the smi-module. Tried that but my hardware went undetected. Hmm. This module has a list of "known" hardware by PCI id but that told me not much.

So I fetched the PCI ids database from Sourceforge and had a look and interestingly some of the known hardware were ISA bridge controllers. Hmm. Then I looked at my hardware and noticed one of those as 0x8086:0x27bd. I added this to the lookup table and the module worked.

Indeed with the SMI interrupts turned off we had no more mysterious missed interrupts and the saga was closed. I guess I was lucky this time.

-ulianov

Thursday, March 1, 2012

On Doing Ugly Linux Kernel Hacks for Need & Profit

A client has two targets ("new" and "legacy") and they insist they run the same system image on both. The two have subtle differences, the most annoying being that the legacy one does not support (U)DMA for CompactFlash cards.

Linux can be told to disable (U)DMA support by passing "ide-core.nodma=0.0" in the command line but there's no way to re-enable later on, performance is dead-slow and these folks have tons of code that has to be loaded from the CF and they like fast boot times.

I cut some of their load time by upx'ing their XFree86 binary and proprietary app and that helped somewhat on both platforms. And it shrank their installer by 1M which is always a bonus.

I tried to muck with LILO but second.S is written in assembler (mine is quite rusty) and getting to the actual command line sector is yucky.

The fall back was to butcher the kernel. I have simple target detection routine which uses cpuid, let's call that chkLegacyCPU().

I changed start_kernel() in init/main.c to append " ide-core.nodma=0.0" to the command line (which is now a global in main.c) if the legacy platform is detected.

Also the legacy target does not support booting in graphics mode (vga=8) so I modified main() in arch/x86/boot/main.c to force the VGA mode to NORMAL_VGA if legacy platform.

Interestingly the two modifications are in two different parts of the kernel -- the one in x86/boot executes in x86 real mode and has a totally different software stack so I had to duplicate the code of chkLegacyCPU() for it.

-ulianov

Thursday, February 23, 2012

IEC GOOSE Application -- Network Latency On a RTAI/RTnet Box

I have an embedded box running Linux 2.6.29/RTAI/RTnet. I hacked the e1000e driver for use with RTnet and also I eviscerated its ISR as (due to PCI limitations) its IRQ sits on the same line as an FGPA IRQ my client is using.

The FPGA RTAI ISR is calling the old e1000e ISR by
a) signalling a semaphore which
b) wakes an RTAI tasks which
c) calls the e1000e ISR.

On top of this there is an LXRT hard RT user-mode application (proprietary IEC GOOSE stack adapted by me to RTnet) which scoops all GOOSE packets from the wire.

Alas rteth0 must be put in promiscuous mode as Ethernet cards generally only allow 4 or 8 MACs to be programmed in their multicast list and GOOSE frames are emitted with the destination MAC to be the same as the source MAC plus the multicast bit turned on. So we can easily shoot past 4/8.

The LXRT application also knows how to send GOOSE packets.

The application is hooked to a Linux/RTAI driver which executes every 100 uS and does stuff. The comms between the app and driver are
a) RX -- RTAI shared memory for the incoming data and a volatile change counter
b) TX -- an RTAI semaphore signalled by the driver and RTAI shared memory for the outgoing data.

So the app stuffs data (GOOSE points) into the driver and sends driver points packaged as GOOSE frames.

One of the things the driver can be programmed to do is to respond to incoming point changes by reflecting them into outgoing points.

So the data flow is

e1000e->RTnet->LXRT app->driver->LXRT app->RTnet->e1000e

We timed using wireshark running on an external box the response time to a change in a GOOSE point and the answer came to ~1ms ±10% which is not bad at all considering the PCIe latency, scheduling latencies and FPGA ISR -- the latter can account to up to 200 uS.

-ulianov

Friday, January 6, 2012

Simple/Elegant way to determine uptime on XP

This has been eluding me for a while... I've tried "net stats srv", WBEM, uptime.exe from M$ all of them sucking for use as CGIs under Xitami/WinXP.

Then I learned this trick on a Perl forum:

<?php
$DAY = 3600 * 24;

$stat = stat('c:\WINDOWS\bootstat.dat');
$mtime = $stat['mtime'];

$dT = time() - $mtime;

$d = intval($dT / $DAY); $dT -= $d * $DAY;
$h = intval($dT / 3600); $dT -= $h * 3600;
$m = intval($dT / 60);   $dT -= $m * 60;
$s = $dT;

$pl = "s"; if($d == 1) { $pl = ""; }
printf("up %d day%s %d:%02d:%02d\n", $d,$pl, $h,$m,$s);
?>

-ulianov