I just found a nasty bug that I've been chasing since December 15th
(bang in the middle of the run-up to a product launch - just what we
didn't need!)

The symptom is that, depending on the exact combination of eLua
microversion, toolchain, compiler flags, scons options, exact program
text and whether or not you press "Enter" once before running your
program, with a probability of about one in twenty the program will
run for a few seconds, then the interpreter just stops dead.

Long story short: It's a memory corruption bug that bites when you are
using the newlib memory allocator, have several discontiguous areas of
RAM (like the Mizar32 and EVK1100, with 64KB internal SRAM at 0x0 and
32MB SDRAM at 0xD0000000) and the version of newlib in your toolchain
is older than 1.19.0.

All the avr32 toolchains I've been using (Atmel GNU toolchain,
jsnyder-avr32-toolchain and the recommended avr32 combination in
crosstool-ng) contain newlib-1.{16,17}.0, which contains
dlmalloc-2.6.4 (from 1996!), whereas newlib-1.19.0 onwards contain
dlmalloc 2.6.5, which, according to its commentary, "differs from
2.6.4 only by correcting a statement ordering error that could cause
failures only when calls to this malloc are interposed with calls to
other memory allocators."

And when eLua's newlib/stubs.c's sbrk() clone starts using the second
memory area, this looks the same, to dlmalloc, as if someone else had
called sbrk() between one malloc call and the next. Result: the
boundary-stomping bug corrupts memory at the edges of its allocated
regions.

Fixes: use one of:
- use a toolchain with newlib >= 1.19.0
- use the simple or multiple allocators
- only use one memory region

I dunno how you would code that in the validation thing.

  // newlib before 1.19.0 (dlmalloc 2.6.4) has a boundary-stomping bug
  // when sbrk() returns non-contiguous memory regions
  #if !defined(USE_SIMPLE_ALLOCATOR) && !defined(USE_MULTIPLE_ALLOCATOR)
    if ( sizeof(MEM_START_ADDRESS) / sizeof(MEM_START_ADDRESS[0]) > 1 ) {
      complain();
    }
  #endif

I don't think you can use sizeof() in a define test, even though it is
constant at compile-time, since the preprocessor doesn't have enough
knowledge about struct and data sizes.

I assume this would affect other platforms than AVR32, if any of them
have multiple memory regions and ancient versions of newlib in their
toolchains.

Sigh

Now, where was I on the 14th...?

    M

_______________________________________________
eLua-dev mailing list
[hidden email]
https://lists.berlios.de/mailman/listinfo/elua-dev
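On the sizeof() question: the preprocessor indeed cannot evaluate sizeof, but a C11 compiler can check it at compile time with static_assert. The following is only a sketch under assumptions, not eLua's validation code: DEMO_MEM_START_ADDRESS, demo_mem_start and the two helper functions are hypothetical names, and the region list is assumed to be a brace-enclosed initialiser that can be materialised as an array.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a platform's region list: internal SRAM at 0x0
   plus external SDRAM at 0xD0000000, as on the boards described above. */
#define DEMO_MEM_START_ADDRESS \
    { (void *)(uintptr_t)0x00000000u, (void *)(uintptr_t)0xD0000000u }

/* Give the initialiser list a type so that sizeof can be applied to it. */
static void *const demo_mem_start[] = DEMO_MEM_START_ADDRESS;

#define DEMO_NUM_REGIONS (sizeof(demo_mem_start) / sizeof(demo_mem_start[0]))

/* Assumed for this demo so that the file compiles; remove this line to see
   the check reject the two-region configuration. */
#define USE_MULTIPLE_ALLOCATOR

#if !defined(USE_SIMPLE_ALLOCATOR) && !defined(USE_MULTIPLE_ALLOCATOR)
static_assert(DEMO_NUM_REGIONS == 1,
              "newlib allocator with more than one RAM region: "
              "needs newlib >= 1.19.0, or the simple/multiple allocator");
#endif

/* Small accessors so the table can be inspected from other code. */
size_t demo_num_regions(void) { return DEMO_NUM_REGIONS; }
void *demo_region_start(size_t i) { return demo_mem_start[i]; }
```

The #if still handles the allocator switches, which the preprocessor can see, while the region count is left to the compiler. With a pre-C11 compiler the same effect is usually obtained with a typedef of an array whose size goes negative when the condition fails.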
Hi Martin,
On Fri, Jan 13, 2012 at 4:19 AM, Martin Guy <[hidden email]> wrote:
> Long story short: It's a memory corruption bug that bites when you are
> using the newlib memory allocator, have several discontiguous areas of
> RAM (like the Mizar32 and EVK1100, with 64KB internal SRAM at 0x0 and
> 32MB SDRAM at 0xD0000000) and the version of newlib in your toolchain
> is older than 1.19.0.
>
> fixes: use one of:
> - use a toolchain with newlib >= 1.19.0
> - use the simple or multiple allocators
> - only use one memory region

First of all, congratulations on tracking down this bug; it must've
been a nightmare. Second, you are going to hate me in about a couple
of minutes.

The thing is that 'multiple' SHOULD be used for any board with
non-contiguous RAM areas. 'multiple' is just a version of dlmalloc
(the allocator also used by Newlib) specifically compiled with support
for multiple memory spaces. Using the allocator from Newlib for
multiple memory spaces might lead to two main problems: Newlib's
version might be too old to support multiple memory spaces, or
dlmalloc might not be compiled with the proper options for multiple
memory spaces.

To address this issue, the eLua build system automatically selects the
'multiple' allocator for specific boards (which are known to have
non-contiguous RAM areas). See this part of SConstruct:

  # CPU/allocator mapping (if allocator not specified)
  if comp['allocator'] == 'auto':
    if comp['board'] in ['LPC-H2888', 'ATEVK1100', 'MBED']:
      comp['allocator'] = 'multiple'
    else:
      comp['allocator'] = 'newlib'

(There is a similar construct in build_elua.lua.) You can probably see
the problem now: somebody (most likely me) forgot to add Mizar32 to
the list of boards that need the 'multiple' allocator. Doing so should
fix the problem.

Again, I am very sorry about this. Things like this happen,
unfortunately.

Best,
Bogdan
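To make the non-contiguity concrete, here is a minimal sketch of an sbrk()-style function that walks a table of RAM regions. It is an illustration under assumptions, not the code in eLua's newlib/stubs.c: every name is hypothetical, one static buffer with a gap stands in for two separate memories, and failure is reported with NULL where a real sbrk() uses (void *)-1.

```c
#include <assert.h>
#include <stddef.h>

/* One backing buffer; the two "regions" are separated by a 64-byte hole so
   that they can never be adjacent (sizes are arbitrary demo values). */
static unsigned char demo_ram[512];

struct demo_region { unsigned char *start; size_t size; };

static const struct demo_region demo_regions[] = {
    { demo_ram,        64 },   /* stands in for small internal SRAM */
    { demo_ram + 128, 256 },   /* stands in for large external SDRAM */
};

#define DEMO_REGION_COUNT (sizeof demo_regions / sizeof demo_regions[0])

static size_t demo_cur_region = 0;  /* region currently being consumed */
static size_t demo_cur_used = 0;    /* bytes already handed out from it */

/* Hand out `incr` fresh bytes, or NULL when every region is exhausted.
   When the current region cannot satisfy the request, its remainder is
   abandoned and the break jumps to the start of the next region. */
void *demo_sbrk(size_t incr)
{
    while (demo_cur_region < DEMO_REGION_COUNT) {
        const struct demo_region *r = &demo_regions[demo_cur_region];
        if (incr <= r->size - demo_cur_used) {
            void *p = r->start + demo_cur_used;
            demo_cur_used += incr;
            return p;
        }
        demo_cur_region++;
        demo_cur_used = 0;
    }
    return NULL;
}

/* True when `next` begins exactly where the previous block ended, which is
   what an allocator may assume about successive sbrk() results. */
int demo_is_contiguous(const void *prev, size_t prev_size, const void *next)
{
    return (const unsigned char *)prev + prev_size
           == (const unsigned char *)next;
}
```

Requests served from the same region come back contiguous; the first request that spills into the second region does not. An allocator has to notice that jump and stop treating the new block as an extension of its old top chunk, which is the case the old newlib dlmalloc is reported above to mishandle.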
On 16 January 2012 14:45, Bogdan Marinescu <[hidden email]> wrote:
> First of all, congratulations for tracking this bug, it must've been a
> nightmare.

Like most of these things, three weeks' sweat following dozens of
things that it wasn't... and in the end the change was to just move one
line of code down by four lines.

> 'multiple' is just a version of
> dlmalloc (the allocator also used by Newlib) specifically compiled
> with support for multiple memory spaces. Using the allocator from
> Newlib for multiple memory spaces might lead to two main problems:
> Newlib's version might be too old to support multiple memory spaces or
> dlmalloc might not be compiled with the proper options for multiple
> memory spaces.

Thanks for the extra info.

Even the most recent newlib (with the bug fixed) only uses dlmalloc
2.6.5 from 2007, whereas eLua has 2.8.3 (most recent is 2.8.5). I
think the reason for them sticking to that is that dlmalloc has
doubled in size over the years. Code sizes are:

  simple     832 bytes + 1 bss (!)
  newlib    2312 + 1040 data + 52 bss
  multiple  7092 + 480 data

They all seem to handle multiple memory spaces OK in eLua.

In the tests I've run, the overall speed difference has been
negligible between the different dlmallocs:

  simple:   22.0 seconds
  newlib:   13.5 seconds
  multiple: 13.5 seconds

But yes, avoiding newlib's malloc might be a good default strategy, in
case anyone is using a toolchain with newlib < 1.19.0.

Incidentally, there is also the more recent TLSF allocator, which is
guaranteed to run in constant time for every malloc() and every
free(), however much/little/fragmented RAM you are using, as well as
having the usual good fragmentation, overhead and code size
properties. Furthermore, instead of blindly calling sbrk() for "More!"
when it runs out of memory, as the other three do, you pass it the
available memory regions at program startup and it uses those. For our
platforms, which have a fixed amount of RAM that is known in advance,
that seems a more effective strategy.

It's a bit hard to find but someone is conserving v2.0 at
http://tlsf.baisoku.org/ which has a compiled code size of 5k and no
data.

Though I've sometimes grumbled about having three memory allocators,
two build systems and so on, noticing that the bug only happened when
using one of them was the key to finding it.

Thanks again for your suggestions along the way, which certainly helped

    M

Attachment: TLSF-2.2.1.tbz2 (16K)
On Mon, Jan 16, 2012 at 5:14 PM, Martin Guy <[hidden email]> wrote:
> Like most of these things, three weeks' sweat following dozens of
> things that it wasn't.. and in the end the change was to just move one
> line of code down by four lines.

This kind of thing tends to happen a lot to me. And it's extremely
frustrating.

> Even the most recent newlib (with the bug fixed) only uses dlmalloc
> 2.6.5 from 2007, whereas eLua has 2.8.3 (most recent is 2.8.5). I
> think the reason for them sticking to that is that dlmalloc has
> doubled in size over the years. Code sizes are:
> simple 832 bytes + 1 bss (!)
> newlib 2312 + 1040 data + 52 bss
> multiple 7092 + 480 data

I think I tried older versions of dlmalloc too and dismissed them
because of some missing features. Can't remember the details though,
this happened a few years ago.

> incidentally, there is also the more recent TLSF allocator, which is

I tried to integrate TLSF quite a while ago and couldn't make it work
for the life of me. I got so frustrated that I gave up entirely. In
any case, it does have a penalty: it has a two-level zone size
directory, as opposed to dl, which keeps all of its zone sizes on a
single level, so TLSF spends more precious RAM (while increasing
speed, of course). If you want to give it a spin, be my guest :)

> Thanks again for your suggestions along the way, which certainly helped

Thanks again for being patient enough to track this bastard :)

Best,
Bogdan
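The two-level directory mentioned above can be illustrated with the size-to-slot mapping described in the published TLSF design: the first level is the position of the highest set bit of the size, and the second level splits each power-of-two range into equal slices. This is a sketch of that technique only, not code from any TLSF release; the names and the choice of 16 second-level slots are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SL_LOG2  4u                      /* log2 of second-level slots */
#define DEMO_SL_COUNT (1u << DEMO_SL_LOG2)    /* 16 slots per first level */

/* Index of the highest set bit of x (x must be non-zero). A loop keeps the
   sketch portable; real code would use one instruction or a builtin. */
static unsigned demo_fls(size_t x)
{
    unsigned bit = 0;
    while (x >>= 1)
        bit++;
    return bit;
}

/* Map a block size to its (first-level, second-level) free-list slot.
   Sizes below DEMO_SL_COUNT need separate handling and are rejected here. */
void demo_mapping(size_t size, unsigned *fl, unsigned *sl)
{
    assert(size >= DEMO_SL_COUNT);
    *fl = demo_fls(size);
    *sl = (unsigned)((size >> (*fl - DEMO_SL_LOG2)) - DEMO_SL_COUNT);
}

/* RAM taken by the table of free-list heads alone, for a given number of
   first-level rows. */
size_t demo_directory_bytes(unsigned fl_count)
{
    return (size_t)fl_count * DEMO_SL_COUNT * sizeof(void *);
}
```

For example, a size of 460 lies between 256 and 512, so it maps to first level 8, and it falls in the thirteenth of the sixteen slices of that range, so second level 12. Because the slot is computed with shifts rather than a search, lookup cost does not depend on how full or fragmented the heap is; the price is the table itself, which grows as first-level rows times second-level slots times the pointer size.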
On Mon, Jan 16, 2012 at 9:36 AM, Bogdan Marinescu <[hidden email]> wrote:
> On Mon, Jan 16, 2012 at 5:14 PM, Martin Guy <[hidden email]> wrote:
>> Like most of these things, three weeks' sweat following dozens of
>> things that it wasn't.. and in the end the change was to just move one
>> line of code down by four lines.
>
> This kind of thing tends to happen a lot to me. And it's extremely
> frustrating.

Actually, almost always :-P

>> Code sizes are:
>> simple 832 bytes + 1 bss (!)
>> newlib 2312 + 1040 data + 52 bss
>> multiple 7092 + 480 data
>
> I think I tried older versions of dlmalloc too and dismissed them
> because of some missing features. Can't remember the details though,
> this happened a few years ago.

Interesting point on the size analysis. I hadn't checked the code size
difference, but that is a rather significant change and would likely
explain why they haven't adopted more recent versions.

>> But yes, avoiding newlib's malloc might be a good default strategy, in
>> case anyone is using a toolchain with newlib<1.19.0

So the patched/fixed version is included in newlib 1.19. I'll see if
there's an easy way to bring my toolchain builder up to that rev or at
least include a patched newlib with it.

> I tried to integrate TLSF quite a while ago and couldn't make it to
> work for the life of me. I got so frustrated that I gave up entirely.
> In any case, it does have a penalty: it has a two level zone size
> directory, as opposed to dl which keeps of all its zone sizes on a
> single level, thus wasting precious RAM (while increasing speed, of
> course). If you want to give it a spin, be my guest :)

Yeah, in fact, there's a really old branch for this (which you may
have already noticed):

https://github.com/elua/elua/tree/tlsf_from_rtportal

If you'd like to tinker with it, I'd be happy to do some testing of it
on all the various platforms I've got on hand (mostly ARM).

>> Thanks again for your suggestions along the way, which certainly helped
>
> Thanks again for being patient enough to track this bastard :)

Sorry we weren't of more help; I'm glad you were able to track this
one down. Some of the initial behavior you described with the stack
trace and parameters seemed to point in the direction of either the
allocator or the compiler. I'm glad it turned out to be the former,
since we can work around that more easily: in the worst case we use
the working version of the allocator from our own sources instead of
the troublesome one, rather than having to patch and rebuild the whole
toolchain and get the fix integrated upstream.

It is rather frustrating though that it was one of those bugs that
was, er, fixed 5 years ago :-)
On 16 January 2012 19:35, James Snyder <[hidden email]> wrote:
>>> in the end the change was to just move one
>>> line of code down by four lines.
>>
>> This kind of thing tends to happen a lot to me. And it's extremely
>> frustrating.
>
> Actually, almost always :-P

It's like life (at least, the way I live it): actually doing things is
not what takes the time, it's understanding what to do. I sometimes
wish I had a nice simple job, like threading beads on pieces of string
or something :)

>>> But yes, avoiding newlib's malloc might be a good default strategy, in
>>> case anyone is using a toolchain with newlib<1.19.0
>
> So the patched/fixed version is included in newlib 1.19.

Yes

> I'll see if
> there's an easy way to bring my toolchain builder up to that rev or at
> least include patched newlib with it.

I found (with crosstool-ng) that gcc-4.2.2 (the only avr32 config they
have) would not compile newlib > 1.17 because there is some GCC flag in
the configure script that gcc-4.2.2 does not understand. You may have
more success with 4.[34]

>>> there is also the more recent TLSF allocator
>> I tried to integrate TLSF quite a while ago and couldn't make it to
>> work for the life of me. I got so frustrated that I gave up entirely.

Right. Thanks for the warning!

> Yeah, in fact, there's a really old branch for this (which you may
> have already noticed):
> https://github.com/elua/elua/tree/tlsf_from_rtportal

Aha! No, I hadn't noticed that. Thanks.

>>> Thanks again for your suggestions along the way, which certainly helped
>>
>> Thanks again for being patient enough to track this bastard :)
>
> Sorry we weren't of more help

You were, you were.

    M