Tuesday, June 24, 2008

Cargo-Cult Programming

Here is an example I encountered of a chronic Cargo-Cult Programming affliction in a Junior engineer -- all code he wrote read:
if(CONSTANT == var) {
// something
} else {
// something else
}
The reason behind this is a fake "defensive programming" strategy as he protected himself against
if(var = CONSTANT) { ... }
which is does not do a comparison, it does an assignment which sometimes fails (when CONSTANT=0).

This is just bad coding. First, it's an eye-sore, second this programmer is not aware that new GCC iterations warn about this when using -Wall and even halt compilation when used in conjunction with -Werror

The fact that he was very stubborn did not help either.

-ulianov

Monday, June 23, 2008

The Forgotten Junk in libc

libc is a patchwork of stuff: syscalls and helper functions that got set in stone aeons ago, warts and all. During the bad times unthinking people stuck in it junk such as:
  • atoi() -- does not return errors;
  • gets() -- has no way of knowing the length of the buffer it populates;
  • strcpy() -- the all time favourite way of causing a crash: if the src pointer is junk and has no '\0' then the buffer pointed by dst will be filled beyond its boundaries with garbage;
  • ditto strcat();
  • sprintf() -- what if the stuff you wish to print into the buffer exceeds the buffer's length?
  • heck, even strlen(NULL) will crash;
  • strtok() -- this one is just evil.
DO NOT USE THESE CALLS! I cannot count how many time I had to look for crashes and refactor code using this garbage.

-ulianov

Friday, June 20, 2008

Why atoi() is NOT Your Friend

atoi() is a hold-over of the bad times when people put junk such as strcpy() and gets() into libc.

It is unsuitable as it has no way of returning and error, e.g.
atoi("adhgsjgdas") = 0
Alas many people still use it.

Please, please use instead:
if(sscanf(str, "%d", &int_var) != 1) {
// handle the error!!!
}
-ulianov

Thursday, June 19, 2008

Passing Params thru Registers Woes on x86

One year ago I was asked to look into a problem some colleagues had with a networking framework that lived as a set of Linux kernel modules.

The problem they saw was that when certain framework functions were called a parameter [which was a pointer] contained garbage.

They had the hairy idea to start using "-mregparm=3" to compile the kernel altho until them we lived happily with stack-based calls.

I looked at the code and the makefile and here is how gcc was invoked:
gcc -ggdb -O2 -pipe \
-mregparm=3 \
-Wall -Wstrict-prototypes -Wno-trigraphs \
-fno-strict-aliasing -fno-common \
-fomit-frame-pointer \
-mpreferred-stack-boundary=2 \
-march=pentium code.c
and here is how the offending code looked like [not a verbatim copy]:
#include <stdio.h>
#include <stdlib.h>
#ifdef FORCE_STDARG
#include <stdarg.h>
#endif

#define ENOMEM -2
#define EBADSLT -3
#define ENODEV 0

typedef int (*func_t)(void*, ...);

struct _S2; // forward declaration
struct _S1 {
char filler1[3];
func_t f1;
int (*f2)(long, long, struct _S2*, long);
};
struct _S2 {
char filler1[13];
long filler2;
struct _S1* s;
char filler3[7];
};

struct _S1 g_S1;
struct _S2 g_S2;

#ifdef FORCE_STDARG
int f1(struct _S1* s, ...)
#else
int f1(struct _S1* s)
#endif
{
#ifdef FORCE_STDARG
va_list ap;
va_start(ap, s);
#endif
if(s != &g_S1)
return -1;
#ifdef FORCE_STDARG
va_end(ap);
#endif
return 1;
}
int f2(long i1, long i2, struct _S2* s, long i3)
{
if(s != &g_S2)
return -1;
return 1;
}
int main()
{
g_S1.filler1[0] = 'A';
g_S1.f1 = (func_t)f1;
g_S1.f2 = f2;

g_S2.filler2 = 13;
g_S2.s = &g_S1;

if(g_S1.f1(&g_S1) < 0)
return -ENOMEM;
if(g_S2.s->f2(1, 2, &g_S2, 3) < 0)
return -EBADSLT;

return -ENODEV;
}
I noticed comparing disassembled code (objdump -dSl code.o) that the call setup for
g_S1.f1(&g_S1);
was the same regardless of "-mregparm" -- this is because the compiler applies the template
typedef int (*func_t)(void*, ...);
which forces it to put the arguments on the stack.

However the declaration
int f1(struct _S1* s);
in connection with "-mregparm=3" has f1() looking for its first argument in %eax which contains some garbage!!

Hint: compile the code with -DFORCE_STDARG and without and see the difference in execution.

The moral is two-fold:
1. if you use "..." in a function prototype then also use it in the function implementation!!;
2. in the kernel passing function arguments thru registers will yield little or no gain in execution speed (as only the leaf functions will fully benefit from the stack-free operation).

-ulianov

Tuesday, June 17, 2008

Bad Idiom: Using Strings on Stack

Here's what I've seen recently:
#include <stdexcept>

Class::Class(int arg)
{
...
if(/* error */) {
// original code
//throw std::runtime_error("Class::Class: Wrong argument!");

// later addition
char str[100];
sprintf(str, "Class::Class: argument %d is too big", arg);
throw std::runtime_error(str);
}
...
}
int main()
{
try { c = new Class(100000); } // say it's allocating memory
catch(std::runtime_error& e) {
printf("Got an error: %s\n", e.what());
}
}
The code sins in many ways:
1. the catch() is wrong as the code does not throw a reference but an object!
2. the throw() is poisoned: it throws a string from the stack which will go out of scope when the constructor is exited; the application won't crash immediately [the stack pages are not unmapped];
3. text of the exception will be junk (and may not have a \0 terminator) and attempting to print it may crash the app in funny and random ways.

The person who butchered the code and used the class hierarchy suffers from the cargo cult programming syndrome because:
1. did not pay attention to original code;
2. did not bother to read the Doxygen documentation of the original classes;
3. does not understand the life cycle of an auto var allocated on stack;
4. does not analyse all the implications of using and changing other people's code.

-ulianov

Monday, June 16, 2008

Compressed Output from Perl CGI Scripts

You know how you say in PHP
<?php ob_start("ob_gzhandler"); ?>
I wanted the same for a Perl CGI script so I dug a bit on the Net and here's the code I came up with:
use strict;
use Compress::Zlib qw(gzopen);

print <<EOF;
Content-Type: text/html
Content-Encoding: gzip

EOF

my $print;
binmode STDOUT;
{ # closure
my $gz = gzopen(\*STDOUT, "wb");
$print = sub { $gz->gzwrite(@_) };
}

$print->("Some content....");
Mind you I am not using CGI.pm as it's a memory hog and in this particular script I did not have to parse form variables.

On another note importing just what you need from a Perl module [i.e. qw(gzopen)] is a great way to cut down on memory consumption.

-ulianov

Saturday, June 14, 2008

Smashed Stack in a Multithreaded Application

One of the most annoying thing in debugging an app is to have it crash and GDB show garbage for the topmost 3-4 frames.

There is not much to do other than to do a diff between your current code (that is crap!) and the last known working version that you have in revision control (I hope you do have that!).

-ulianov

Friday, June 13, 2008

The Fine Distinction Between a Pointer and an Array

I keep finding crashes cause by people not comprehending what C pointers and arrays really are.

Definitions:
1. array = some auto variable that was declared as a collection of objects:
char x[100];
be it as a global or even worse on stack;
2. pointer = something that has been malloc'ed or something that holds the address of another object:
char* x1 = (char*)malloc(100);
or
char c;
char* x2 = &c;
A pointer may live as a global or on stack, regardless of this its memory footprint is 4 bytes (on a 32-bit architecture).

To many people pointers are arrays and arrays are pointers. True, but not all the time.

To simplify the discussion let us assume that a segfault occurs when we go beyond the boundaries of a buffer regardless of access type (read/write), its size and alignment.

Let's assume you populated x and x1 with a 99-byte message and you say:
write(STDOUT_FILENO, &x, 99); // mind the \0!
This is correct.

If you say:
write(STDOUT_FILENO, &x[0], 99); // mind the \0!
This is also correct.

If you say:
write(STDOUT_FILENO, x1, 99); // mind the \0!
This is correct.

If you say:
write(STDOUT_FILENO, &x1, 99); // mind the \0!
This is a SEGFAULT because you will try to access (99-4) bytes on the stack that follow x1 and there is nothing mapped there and you program will crash and burn!

The explanation is that
x == &x == &x[0]
but
x1 != &x1
as x1 means the allocated memory buffer to which x1 points to whereas &x1 means the address in memory where the pointer x1 lives!!!

In real life it gets even worse if x1 lives on the stack and you write to &x1 -- here you put garbage on stack and you may go past the red page. If there are some other auto vars on stack after it and they happen to be pointers then you have a recipe for a clusterfuck.

If x1 is a global then you will corrupt other globals that live after x1. This is even more fun!

-ulianov

Unlock of Mutex Failed

This was the one of the most puzzling problems I faced. The Teja application we had was failing once in a blue moon printing the message "unlock of mutex failed".

Looking at he code I saw that pthread_unlock_mutex returned an error which should never happen.

I spent a month writing a wrapper for the pthread/mutex library (a mutex is in this particular case a 64-byte memory area, futex was not called).

I put signatures before and after the mutex, checked signatures & various mutex fields before entering the true mutex call to no avail.

It was a random memory corruption that I solved elsewhere so this error ceased to happen.

-ulianov

Thursday, June 12, 2008

Re-Inventing the Wheel is Evil

Once I had to debug a program that ran in the Teja application framework for IXP 1200. We experienced random crashes at sites that were using a Walled Garden wireless authentication method.

By another name the Captive Portal consist in a TCP stream hijacker. The intercepted stream is HTTP. The component that handles this functionality simply pretends to be any HTTP destination, analyses the request and sends a crafted HTTP/307 redirection message in which the original URL is encoded as a GET parameter.

The Teja module that implemented the redirection got as an input an Ethernet frame (TCP reassembly was not performed but that was not a practical problem as most people simply go to http://yahoo.com, i.e. use short URLs that fit within one frame).

The data structure that enveloped the Ethernet frame was called Packet and it looked like:

struct Packet {
char buffer[1500];
short start;
short end;
};
To an astute reader this resembles an skbuff.

The code was parsing the input request and sending back a 307 that looked like "Location: http://auth.host/login.php?dest=OriginalURL"

The response was kept in a 6000-byte buffer which was large enough but here's what they were doing:

memcpy(packet.buffer+packet.end, response, response_len);
packet.end += response_len;
See anything wrong? If response_len>1500 then the memcpy thrashes packet.end, then one adds something to it resulting in a bigger junk. How about the junk in packet.start?

The Packets were malloc'ed [thus living in a pool of objects and having neighbours] so this had the potential (if len>1508) to corrupt the "secret stuff" the GNU allocator puts before each malloc'ed pointer. This will cause a happy crash when you want to free that pointer!

Later on the packet was copied into a hardware buffer that lived in IXP DRAM mapped in the Arm CPU address space like so:

memcpy(pHwBuf->buffer, \
packet.buffer+packet.start, \
(packet.end-packet.start));
Here you could copy garbage from a garbage address (packet.buffer+packet.start) [no crash if the memory was mapped] that had a garbage length (packet.end-packet.start) thus thrashing memory left and right in unexpected places.

Murphy's Laws dictate that the memory belonged to other threads of execution and that these threads will crash at another time in a such a way to turn you green.

In order to track that I had to change the Packet like so:

struct Packet {
unsigned sig1 = 0xdeadbeef; // initialised when
// allocating a Packet
char buffer[1500];
unsigned sig2 = 0xfeedabba;
short start;
short end;
unsigned sig3 = 0xdeafabba;
};
and check the signatures after each Packet operation. Took me only a month and one very pissed customer to fix this (not my code).

-ulianov

P.S. The Linux skbuff infrastructure comes with a lot of accessory functions that prevent this kind of crap from happening by Oops'ing.

Tuesday, June 3, 2008

Shooting your Own Foot and Jumping on it Afterwards

While I was developing my anti-recruiter Sendmail milter I was relying on nohup(1) to collect stderr output. Each time I would change code I would re-start the milter and wipe nohup.out like so "> nohup.out".

Then I changed code to redirect stderr to "milter.err" and at every restart I would do "> milter.err". One night at midnight I did not pay attention to the BASH-completion and wiped the code itself!

Now if you delete a file on an ext2 filesystem you can recover it using some utilities. But wiping the content of the file severs the relation between the i-node and its data blocks. debugfs ain't no help.

Moral: I had to "dd if=/dev/hda1 of=/storage/hda1.saved" and find the source code block by block by searching code patterns I remembered in hda1.saved. Trust me, file blocks are not contiguous on-disk!

I managed to piece together everything but the top of the file where the Perl use statements live. No biggie but I also skipped a small function and the code only complained at run time.

Took me three hours in the middle of the night and a huge head ache.

-ulianov

Thrashing the Kernel (Linux 2.4.x)

Today a co-worker told me that a library I've been writing is causing a Segfault but he could not give me an IP. Hmm. The library calls a custom driver and tries to pull data from the driver via ioctl's.

Then I used the lib in a utility and presto! I hosed the PPC 450. Funny is that the utility executes picture-perfect, does not crash, the kernel does not panic(), but everything executed after that segfaults.

Now that's artful.

Jun 4 update: I traced the code by #if 0'ing code and the problem is in the kernel driver.

Jun 10 update: The root cause was that they changed the way they were DMA-ing data into my user buffer that got mapped into the kernel space. As I was malloc-ing the buffer and not touching it the associated hardware pages were not actually mapped and that was confusing the PPC DMA engine.

memset-ing the buffer to zero in user space to fault the pages in or forcing the mapping of pages in kernel mode fixed the problem.

-ulianov