El bug

Martin Guy

I just found a nasty bug that I've been chasing since December 15th
(bang in the middle of the run-up to a product launch - just what we
didn't need!)

The symptom is that, depending on the exact combination of eLua
microversion, toolchain, compiler flags, scons options, exact program
text and whether or not you press "Enter" once before running your
program, with a probability of about one in twenty the program will
run for a few seconds, then the interpreter just stops dead.

Long story short: It's a memory corruption bug that bites when you are
using the newlib memory allocator, have several discontiguous areas of
RAM (like the Mizar32 and EVK1100, with 64KB internal SRAM at 0x0 and
32MB SDRAM at 0xD0000000) and the version of newlib in your toolchain
is older than 1.19.0.

All the avr32 toolchains I've been using (Atmel GNU toolchain,
jsnyder-avr32-toolchain and the recommended avr32 combination in
crosstool-ng) contain newlib-1.{16,17}.0, which contain
dlmalloc 2.6.4 (from 1996!), whereas newlib-1.19.0 onwards contain
dlmalloc 2.6.5, which, according to its commentary, "differs from
2.6.4 only by correcting a statement ordering error that could cause
failures only when calls to this malloc are interposed with calls to
other memory allocators."

And when the sbrk() clone in eLua's newlib/stubs.c starts using the
second memory area, this looks the same, to dlmalloc, as if someone
else had called sbrk() between one malloc call and the next. Result:
the boundary-stomping bug corrupts memory at the edges of its allocated
regions.
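
To make the "second memory area" effect concrete, here is a minimal sketch of an sbrk()-style function that hands out memory from a table of regions. It is not eLua's actual newlib/stubs.c; sketch_sbrk, regions, pool_a and pool_b are names invented for illustration, and the two small arrays merely stand in for SRAM and SDRAM. The point is only that consecutive return values are contiguous within a region and then jump, which is what an allocator would also see if some other code had called sbrk() in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two discontiguous RAM areas. */
static unsigned char pool_a[64];
static unsigned char pool_b[256];

struct region { unsigned char *start; size_t size; };
static struct region regions[] = {
    { pool_a, sizeof pool_a },
    { pool_b, sizeof pool_b },
};
#define NUM_REGIONS (sizeof regions / sizeof regions[0])

static size_t cur_region = 0;   /* region currently being consumed */
static size_t cur_offset = 0;   /* bytes already handed out from it */

/* Grow-only sketch: returns the start of a fresh block of incr bytes,
   or (void *)-1 once every region is exhausted. */
void *sketch_sbrk(size_t incr)
{
    while (cur_region < NUM_REGIONS) {
        struct region *r = &regions[cur_region];
        assert(cur_offset <= r->size);
        if (incr <= r->size - cur_offset) {
            void *p = r->start + cur_offset;
            cur_offset += incr;
            return p;
        }
        /* Not enough room left: abandon the tail of this region and
           continue in the next one, at an unrelated address. */
        cur_region++;
        cur_offset = 0;
    }
    return (void *)-1;
}
```

With these sizes, three calls of sketch_sbrk(32) return pool_a, pool_a + 32 and then pool_b: the third result does not follow on from the second.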

Fixes (any one of these):
- use a toolchain with newlib >= 1.19.0
- use the simple or multiple allocators
- only use one memory region

I dunno how you would code that in the validation thing.

// newlib before 1.19.0 (dlmalloc 2.6.4) has a boundary-stomping bug when
// sbrk() returns non-contiguous memory regions
#if !defined(USE_SIMPLE_ALLOCATOR) && !defined(USE_MULTIPLE_ALLOCATOR)
if ( sizeof(MEM_START_ADDRESS) / sizeof(MEM_START_ADDRESS[0]) > 1 ) {
    complain();
}
#endif

I don't think you can use sizeof() in a #if test, even though it is
constant at compile-time, since the preprocessor doesn't have enough
knowledge about struct and data sizes.
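
One possible way around the sizeof() limitation, offered as a sketch rather than a patch for eLua's validation code, is to leave the allocator test to the preprocessor and the region count to the compiler, using the old negative-array-size trick (which also works on compilers that predate C11's _Static_assert). The names mem_start, NUM_MEM_REGIONS, COMPILE_TIME_CHECK and mem_region_start are hypothetical, and USE_MULTIPLE_ALLOCATOR is defined here only so that the sketch compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region table; the two addresses are the SRAM and SDRAM
   bases mentioned above, everything else here is made up. */
static const uintptr_t mem_start[] = { 0x00000000u, 0xD0000000u };
#define NUM_MEM_REGIONS (sizeof mem_start / sizeof mem_start[0])

/* Pre-C11 static assertion: a false condition gives the array type a
   negative size, which the compiler (not the preprocessor) rejects. */
#define COMPILE_TIME_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

/* Defined only so that this sketch compiles; remove this line and the
   check below stops the build, because the table has two entries. */
#define USE_MULTIPLE_ALLOCATOR 1

#if !defined(USE_SIMPLE_ALLOCATOR) && !defined(USE_MULTIPLE_ALLOCATOR)
COMPILE_TIME_CHECK(newlib_allocator_needs_one_region, NUM_MEM_REGIONS == 1);
#endif

/* Accessor so that the table is actually used at run time. */
uintptr_t mem_region_start(size_t i)
{
    assert(i < NUM_MEM_REGIONS);
    return mem_start[i];
}
```

The error message from a failed check is ugly (something about a negative array size), so the typedef name has to carry the explanation.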

I assume this would affect platforms other than AVR32 too, if any of
them have multiple memory regions and ancient versions of newlib in
their toolchains.

Sigh

Now, where was I on the 14th...?

    M
_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
BogdanM

Re: El bug

Hi Martin,

On Fri, Jan 13, 2012 at 4:19 AM, Martin Guy <[hidden email]> wrote:

> I just found a nasty bug that I've been chasing since December 15th
> (bang in the middle of the run-up to a product launch - just what we
> didn't need!)

First of all, congratulations for tracking this bug, it must've been a
nightmare. Second, you are going to hate me in about a couple of
minutes. The thing is that 'multiple' SHOULD be used for any board
with non-contiguous RAM areas. 'multiple' is just a version of
dlmalloc (the allocator also used by Newlib) specifically compiled
with support for multiple memory spaces. Using the allocator from
Newlib for multiple memory spaces might lead to two main problems:
Newlib's version might be too old to support multiple memory spaces or
dlmalloc might not be compiled with the proper options for multiple
memory spaces. To address this issue, the eLua build system
automatically sets the 'multiple' allocator for specific boards (which
are known to have non-contiguous RAM areas). See this part of
SConstruct:

  # CPU/allocator mapping (if allocator not specified)
  if comp['allocator'] == 'auto':
    if comp['board'] in ['LPC-H2888', 'ATEVK1100', 'MBED']:
      comp['allocator'] = 'multiple'
    else:
      comp['allocator'] = 'newlib'

(there is a similar construct in build_elua.lua).
You can probably see the problem now: somebody (most likely me) forgot
to add Mizar32 to the list of boards that need the 'multiple'
allocator. Doing so should fix the problem. Again, I am very sorry
about this. Things like this happen, unfortunately.

Best,
Bogdan

>
>    M
Martin Guy

Re: El bug

On 16 January 2012 14:45, Bogdan Marinescu <[hidden email]> wrote:
> First of all, congratulations for tracking this bug, it must've been a
> nightmare.

Like most of these things, three weeks' sweat following dozens of
things that it wasn't.. and in the end the change was to just move one
line of code down by four lines.

> 'multiple' is just a version of
> dlmalloc (the allocator also used by Newlib) specifically compiled
> with support for multiple memory spaces. Using the allocator from
> Newlib for multiple memory spaces might lead to two main problems:
> Newlib's version might be too old to support multiple memory spaces or
> dlmalloc might not be compiled with the proper options for multiple
> memory spaces.

Thanks for the extra info.

Even the most recent newlib (with the bug fixed) only uses dlmalloc
2.6.5 from 2007, whereas eLua has 2.8.3 (most recent is 2.8.5).  I
think the reason for them sticking to that is that dlmalloc has
doubled in size over the years. Code sizes are:
allocator   code   data   bss
simple       832      -     1 (!)
newlib      2312   1040    52
multiple    7092    480     -
(all sizes in bytes)

They all seem to handle multiple memory spaces OK in eLua.

In the tests I've run, the overall speed difference has been
negligible between the different dlmallocs:
simple: 22.0 seconds
newlib: 13.5 seconds
multiple: 13.5 seconds

But yes, avoiding newlib's malloc might be a good default strategy, in
case anyone is using a toolchain with newlib<1.19.0

Incidentally, there is also the more recent TLSF allocator, which is
guaranteed to run in constant time for every malloc and each free(),
however much/little/fragmented RAM you are using, as well as having
the usual good fragmentation, overhead and code size properties.
Furthermore, instead of blindly calling sbrk() for "More!" when it
runs out of memory, as the other three do, you pass it the available
memory regions at program startup and it uses those.  For our
platforms, which have a fixed amount of RAM that is known in advance,
that seems a more effective strategy.
It's a bit hard to find, but someone is keeping v2.0 available at
http://tlsf.baisoku.org/ which has a compiled code size of 5k and no
data.
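
The "pass it the available memory regions at startup" strategy can be sketched as below. To be clear, this is not TLSF and not its API: it is a trivial first-fit bump allocator with no free() at all, and pools_add, pools_alloc, MAX_POOLS and ALIGNMENT are names made up for the illustration. It only shows the shape of the interface: the platform registers its regions once, and the allocator never has to ask sbrk() for more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POOLS 4    /* assumed upper bound on the number of regions */
#define ALIGNMENT 8u   /* assumed block granularity, a power of two */

struct pool { unsigned char *next; unsigned char *end; };
static struct pool pools[MAX_POOLS];
static size_t num_pools = 0;

/* Register one memory region; meant to be called once per region at
   startup. The start address is assumed to be suitably aligned.
   Returns 0 on success, -1 if the region cannot be registered. */
int pools_add(void *start, size_t size)
{
    if (num_pools == MAX_POOLS || start == NULL)
        return -1;
    pools[num_pools].next = start;
    pools[num_pools].end = (unsigned char *)start + size;
    num_pools++;
    return 0;
}

/* Hand out size bytes (rounded up to ALIGNMENT) from the first
   registered region that still has room, or NULL if none does. */
void *pools_alloc(size_t size)
{
    size_t i;

    if (size == 0 || size > SIZE_MAX - (ALIGNMENT - 1))
        return NULL;
    size = (size + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
    for (i = 0; i < num_pools; i++) {
        assert(pools[i].next <= pools[i].end);
        if ((size_t)(pools[i].end - pools[i].next) >= size) {
            void *p = pools[i].next;
            pools[i].next += size;
            return p;
        }
    }
    return NULL;
}
```

On a board like the ones discussed here, startup code would call pools_add() once for the internal SRAM and once for the SDRAM before the first allocation.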

Though I've sometimes grumbled about having three memory allocators,
two build systems and so on, noticing that this bug only happened when
using one of them was the key to finding it.

Thanks again for your suggestions along the way, which certainly helped

    M


Attachment: TLSF-2.2.1.tbz2 (16K)
BogdanM

Re: El bug

On Mon, Jan 16, 2012 at 5:14 PM, Martin Guy <[hidden email]> wrote:
> On 16 January 2012 14:45, Bogdan Marinescu <[hidden email]> wrote:
>> First of all, congratulations for tracking this bug, it must've been a
>> nightmare.
>
> Like most of these things, three weeks' sweat following dozens of
> things that it wasn't.. and in the end the change was to just move one
> line of code down by four lines.

This kind of thing tends to happen a lot to me. And it's extremely frustrating.

>
>> 'multiple' is just a version of
>> dlmalloc (the allocator also used by Newlib) specifically compiled
>> with support for multiple memory spaces. Using the allocator from
>> Newlib for multiple memory spaces might lead to two main problems:
>> Newlib's version might be too old to support multiple memory spaces or
>> dlmalloc might not be compiled with the proper options for multiple
>> memory spaces.
>
> Thanks for the extra info.
>
> Even the most recent newlib (with the bug fixed) only uses dlmalloc
> 2.6.5 from 2007, whereas eLua has 2.8.3 (most recent is 2.8.5).  I
> think the reason for them sticking to that is that dlmalloc has
> doubled in size over the years. Code sizes are:
> simple    832 bytes + 1 bss (!)
> newlib   2312 + 1040 data + 52 bss
> multiple 7092 + 480 data

I think I tried older versions of dlmalloc too and dismissed them
because of some missing features. Can't remember the details though,
this happened a few years ago.

>
> They all seem to handle multiple memory spaces OK in eLua.
>
> In the tests I've run, the overall speed difference has been
> negligible between the different dlmallocs:
> simple: 22.0 seconds
> newlib: 13.5 seconds
> multiple: 13.5 seconds
>
> But yes, avoiding newlib's malloc might be a good default strategy, in
> case anyone is using a toolchain with newlib<1.19.0
>
> incidentally, there is also the more recent TLSF allocator, which is

I tried to integrate TLSF quite a while ago and couldn't make it
work for the life of me. I got so frustrated that I gave up entirely.
In any case, it does have a penalty: it has a two-level zone size
directory, as opposed to dl, which keeps all of its zone sizes on a
single level, so TLSF spends precious RAM on the directory (while
gaining speed, of course). If you want to give it a spin, be my guest :)
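
For readers who haven't met the two-level directory: the size-to-index mapping from TLSF's published description can be sketched roughly as follows. This is a simplified sketch, not code from any TLSF release; size_to_index, floor_log2 and the choice of 16 second-level slots are assumptions for illustration, and real implementations treat sizes below the second-level resolution separately. Each (first-level, second-level) pair needs its own free-list head, which is where the extra RAM goes.

```c
#include <assert.h>
#include <stddef.h>

#define SL_INDEX_LOG2 4u   /* assumed: 2^4 = 16 subdivisions per power of two */

/* Position of the highest set bit, i.e. floor(log2(x)) for x > 0. */
static unsigned floor_log2(size_t x)
{
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
}

/* Map a block size to its (first-level, second-level) indices:
   fl selects the power-of-two range [2^fl, 2^(fl+1)),
   sl selects one of 16 equal slices inside that range. */
void size_to_index(size_t size, unsigned *fl, unsigned *sl)
{
    assert(size >= ((size_t)1 << SL_INDEX_LOG2));
    *fl = floor_log2(size);
    *sl = (unsigned)(size >> (*fl - SL_INDEX_LOG2)) - (1u << SL_INDEX_LOG2);
}
```

For example, a size of 460 falls in the range starting at 256 (fl = 8) and in slice 12 of that range.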

> guaranteed to run in constant time for every malloc and each free(),
> however much/little/fragmented RAM you are using, as well as having
> the usual good fragmentation, overhead and code size properties.
> Furthermore, instead of blindly calling sbrk() for "More!" when it
> runs out of memory, as the other three do, you pass it the available
> memory regions at program startup and it uses those.  For our
> platforms, which have a fixed amount of RAM that is known in advance,
> that seems a more effective strategy.
> It's a bit hard to find but someone is conserving v2.0 at
> http://tlsf.baisoku.org/ which has compiled code size of 5k and no
> data.
>
> Though I've sometimes grumbled about having three memory allocators,
> two build systems and so on, in finding this bug and noticing that it
> only happened when using one of them was the key to finding it.
>
> Thanks again for your suggestions along the way, which certainly helped

Thanks again for being patient enough to track this bastard :)

Best,
Bogdan

>
>    M
jbsnyder

Re: El bug

On Mon, Jan 16, 2012 at 9:36 AM, Bogdan Marinescu
<[hidden email]> wrote:

> On Mon, Jan 16, 2012 at 5:14 PM, Martin Guy <[hidden email]> wrote:
>> On 16 January 2012 14:45, Bogdan Marinescu <[hidden email]> wrote:
>>> First of all, congratulations for tracking this bug, it must've been a
>>> nightmare.
>>
>> Like most of these things, three weeks' sweat following dozens of
>> things that it wasn't.. and in the end the change was to just move one
>> line of code down by four lines.
>
> This kind of thing tends to happen a lot to me. And it's extremely frustrating.

Actually, almost always :-P

>
>>
>>> 'multiple' is just a version of
>>> dlmalloc (the allocator also used by Newlib) specifically compiled
>>> with support for multiple memory spaces. Using the allocator from
>>> Newlib for multiple memory spaces might lead to two main problems:
>>> Newlib's version might be too old to support multiple memory spaces or
>>> dlmalloc might not be compiled with the proper options for multiple
>>> memory spaces.
>>
>> Thanks for the extra info.
>>
>> Even the most recent newlib (with the bug fixed) only uses dlmalloc
>> 2.6.5 from 2007, whereas eLua has 2.8.3 (most recent is 2.8.5).  I
>> think the reason for them sticking to that is that dlmalloc has
>> doubled in size over the years. Code sizes are:
>> simple    832 bytes + 1 bss (!)
>> newlib   2312 + 1040 data + 52 bss
>> multiple 7092 + 480 data
>
> I think I tried older versions of dlmalloc too and dismissed them
> because of some missing features. Can't remember the details though,
> this happened a few years ago.

Interesting point on the size analysis, I hadn't checked the code size
difference, but that is a rather significant change and likely would
explain why they haven't adopted more recent versions.

>
>>
>> They all seem to handle multiple memory spaces OK in eLua.
>>
>> In the tests I've run, the overall speed difference has been
>> negligible between the different dlmallocs:
>> simple: 22.0 seconds
>> newlib: 13.5 seconds
>> multiple: 13.5 seconds
>>
>> But yes, avoiding newlib's malloc might be a good default strategy, in
>> case anyone is using a toolchain with newlib<1.19.0

So the patched/fixed version is included in newlib 1.19.  I'll see if
there's an easy way to bring my toolchain builder up to that rev or at
least include patched newlib with it.

>>
>> incidentally, there is also the more recent TLSF allocator, which is
>
> I tried to integrate TLSF quite a while ago and couldn't make it
> work for the life of me. I got so frustrated that I gave up entirely.
> In any case, it does have a penalty: it has a two level zone size
> directory, as opposed to dl which keeps all of its zone sizes on a
> single level, thus wasting precious RAM (while increasing speed, of
> course). If you want to give it a spin, be my guest :)

Yeah, in fact, there's a really old branch for this (which you may
have already noticed):
https://github.com/elua/elua/tree/tlsf_from_rtportal

If you'd like to tinker with it, I'd be happy to do some testing of it
on all the various platforms I've got on hand (mostly ARM).

>
>> guaranteed to run in constant time for every malloc and each free(),
>> however much/little/fragmented RAM you are using, as well as having
>> the usual good fragmentation, overhead and code size properties.
>> Furthermore, instead of blindly calling sbrk() for "More!" when it
>> runs out of memory, as the other three do, you pass it the available
>> memory regions at program startup and it uses those.  For our
>> platforms, which have a fixed amount of RAM that is known in advance,
>> that seems a more effective strategy.
>> It's a bit hard to find but someone is conserving v2.0 at
>> http://tlsf.baisoku.org/ which has compiled code size of 5k and no
>> data.
>>
>> Though I've sometimes grumbled about having three memory allocators,
>> two build systems and so on, in finding this bug and noticing that it
>> only happened when using one of them was the key to finding it.
>>
>> Thanks again for your suggestions along the way, which certainly helped
>
> Thanks again for being patient enough to track this bastard :)

Sorry we weren't of more help, I'm glad you were able to track this
one down.  Some of the initial behavior you described with the stack
trace and parameters seemed to point either in the direction of the
allocator or compiler.  I'm glad it turned out to be the former, since
we can work around that more easily: in the worst case, we use the
working version of the allocator from our own sources instead of the
troublesome one, rather than having to patch and rebuild the whole
toolchain and get the fix integrated upstream.

It is rather frustrating though that it was one of those bugs that
was, er, fixed 5 years ago :-)




>
> Best,
> Bogdan
>
>>
>>    M
Martin Guy

Re: El bug

On 16 January 2012 19:35, James Snyder <[hidden email]> wrote:
>>> in the end the change was to just move one
>>> line of code down by four lines.
>>
>> This kind of thing tends to happen a lot to me. And it's extremely frustrating.
>
> Actually, almost always :-P

It's like life (at least, the way I live it): actually doing things is
not what takes the time, it's understanding what to do.
I sometimes wish I had a nice simple job, like threading beads on
pieces of string or something :)


>>> But yes, avoiding newlib's malloc might be a good default strategy, in
>>> case anyone is using a toolchain with newlib<1.19.0
>
> So the patched/fixed version is included in newlib 1.19.

Yes

>  I'll see if
> there's an easy way to bring my toolchain builder up to that rev or at
> least include patched newlib with it.

I found (with crosstool-ng) that gcc-4.2.2 (the only avr32 config they
have) would not compile newlib >1.17 because there is some GCC flag
in the configure script that gcc-4.2.2 does not understand.  You may
have more success with 4.[34]

>>> there is also the more recent TLSF allocator

>> I tried to integrate TLSF quite a while ago and couldn't make it
>> work for the life of me. I got so frustrated that I gave up entirely.

Right. Thanks for the warning!

> Yeah, in fact, there's a really old branch for this (which you may
> have already noticed):
> https://github.com/elua/elua/tree/tlsf_from_rtportal

Aha! No, I hadn't noticed that. Thanks.

>>> Thanks again for your suggestions along the way, which certainly helped
>>
>> Thanks again for being patient enough to track this bastard :)
>
> Sorry we weren't of more help

You were, you were.

    M