Thursday, March 15, 2012

Missed interrupts in Hard RT context (Linux/RTAI)

My client has an embedded box (a Core2 Duo SOM) which runs a realtime application under RTAI. The meat of it is in an ISR/IRQ which fires every 100 uS and which is triggered by an FPGA dangling off PCIe.

For the past year they have seen and characterised this "missed interrupt" problem:
a) after boot-up IRQs are missed for 12.5 minutes;
b) interrupts misses occur every 2.9999 seconds.

The working hypothesis was that this happens when SMI interrupts are happening but we looked thru the Phoenix BIOS and did not find any setting for turning SMIs off. We turned off everything we can, including power management to no avail.

The maker of the SOM was of no help.

The missed interrupts happened while a huge software stack was running: Linux, RTAI, a Java/AWT+X11 app. Hard to narrow it.

We took a few steps:
1. boot to LILO prompt and wait for 13 minutes => still happened;
2. boot bare Linux and pause /sbin/init for 13 minutes => did not happen.

The LILO thing was interesting. So I reckoned I might do some stuff under good old DOS and see what happens (no drivers loaded).

I wrote this C (djgpp) program to observe. Basically I am reading a Pentium performance counter in a tight loop and do a bit of math with interrupts off:
#include <stdio.h>

// COMPILER-DEPENDENT CODE GOES HERE

#define SAMPLE_COUNT 1000

typedef struct {
unsigned long long tsc;
unsigned long long pass;
unsigned long long dT;
} fault_t;

int faultc = -1;
fault_t faults[SAMPLE_COUNT] = {0};

int main (int argc, char* argv[])
{
int i, count=0;
int max_samples = SAMPLE_COUNT;
unsigned long long pass = 0;
unsigned long long last_ts = 0;

if(argc > 1) {
if(sscanf(argv[1], "%d", &max_samples) != 1) exit(1);
if(max_samples > SAMPLE_COUNT) max_samples = SAMPLE_COUNT;
}

if(max_samples < 0) return 0;

cli();

last_ts = RDTSC();
for(pass=0; faultc < max_samples; pass++) {
unsigned long long dT = 0, tmp = RDTSC();
if(last_ts > tmp) { last_ts = tmp; continue; } // overflow
dT = tmp - last_ts;
last_ts = tmp;
if(dT > 350000) {
++faultc;
faults[faultc].tsc = tmp;
faults[faultc].pass= pass;
faults[faultc].dT = dT;
if((faultc + 1) % 5 == 0) { fprintf(stderr, "!"); fflush(stderr); }
}

if(faultc >= 0 && (tmp - faults[faultc].tsc) > 10000000000LL) {
fprintf(stderr, "TIMEOUT"); fflush(stderr);
break;
}
}

sti();

for(i = 0; i < faultc; i++)
printf("pass=%012lld T=%016lld dT=%016lld\n", faults[i].pass, faults[i].tsc, faults[i].dT);

return 0;
}
For djgpp the low-level code goes:
static inline unsigned long long RDTSC(void)
{
unsigned long lo, hi;

__asm__ __volatile__ (
"xorl %%eax,%%eax \n cpuid"
::: "%rax", "%rbx", "%rcx", "%rdx");
/* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));

return (uint64_t)hi << 32 | lo;
}

void cli() { __asm__( "cli" ); }
void sti() { __asm__( "sti" ); }
And bingo! I reproduced the observed behaviour down to the 12.5 min runtime. So this is not Linux's fault at all.

The djgpp code runs under DOS but in Protected Mode. The LILO test prompted me to run this code in Real Mode but djgpp cannot generate this code (easily).

So I fetched OpenWatcom and modified the low-level code like so:
unsigned long long RDTSC()
{
unsigned long long ret = 0;
unsigned long hi = 0, lo = 0;
__asm {
rdtsc
mov dword ptr lo, eax
mov dword ptr hi, edx
};
ret = ((unsigned long long)hi << 32) | lo;
return ret;
}
void cli() { _asm cli; }
void sti() { _asm sti; }
The Watcom C's assembler syntax gave me some headache.

This code ran in pure Real Mode with the same result. So it must be SMI.

I g00gled for RTAI+SMI and found out about the smi-module. Tried that but my hardware went undetected. Hmm. This module has a list of "known" hardware by PCI id but that told me not much.

So I fetched the PCI ids database from Sourceforge and had a look and interestingly some of the known hardware were ISA bridge controllers. Hmm. Then I looked at my hardware and noticed one of those as 0x8086:0x27bd. I added this to the lookup table and the module worked.

Indeed with the SMI interrupts turned off we had no more mysterious missed interrupts and the saga was closed. I guess I was lucky this time.

-ulianov