Writing for Underpowered Processors

Background

In 2020 I was exceptionally bored at home, and of course did what any bored engineer does: spent all my money on projects. The biggest of these was a guidance and control computer that I made for high-powered consumer and sounding rocketry (which I actually got private funding for, but that’s a story for another day). There are plenty of processors around that would have been great choices for this, but being a high schooler and still learning such things, I decided an ATmega32U4 was the one to use. It was not. This is the story of how I got that potato-grade thing to run 1-2-3 Euler angle guidance, and how you can too.

Main issues

Turns out vector calculus actually takes quite a lot of memory to execute properly. I simply didn’t have the memory space for quaternions (2.5KB of RAM is not exactly a lot, and only 2KB were left after bootloader and libraries), so those were effectively out immediately. On top of that, the thing is an 8-bit processor with no FPU, and it runs mind-bogglingly slow at only 8MHz. The entire compiled firmware binary also has to fit in 32KB. This whole thing was, in hindsight, unnecessarily restrictive, but 16-year-old me didn’t realize that, so here we are.

Solutions

For a start, heavily limiting the libraries I used knocked a lot of the memory usage off the top. A lot of the sensors I was using at the time had libraries written for them by various open-source hardware companies like Adafruit, but those are notoriously inefficient and do way too much error checking to run quickly, so they were immediately ditched in favor of smaller versions that I wrote myself, most of which involved bit-banging pin registers to send I2C commands and such. There are certainly more elegant ways to implement this, but not really any that take less processing power or program space than the 10 or so assembly instructions that my versions compiled to.
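To give a feel for what bit-banging an I2C write looks like, here is a minimal sketch in C. It is not the firmware from this project: the pin assignments, the function names, and the `PORT_REG` variable standing in for a real AVR port register are all assumptions made for illustration. It also drives the lines push-pull, has no timing delays, and never reads the ACK bit, all of which a real bus would need handled.

```c
#include <stdint.h>

/* ASSUMPTION: on an AVR this would be a real I/O register (for example
   PORTD). A plain variable stands in for it so the sketch also builds
   and runs on a desktop machine. */
static volatile uint8_t PORT_REG = 0x03; /* both lines idle high */

#define SCL_BIT 0u /* hypothetical pin assignment */
#define SDA_BIT 1u /* hypothetical pin assignment */

/* Test aid only: SDA level captured on every rising SCL edge,
   which is when an I2C device samples the data line. */
static uint8_t clocked_bits[32];
static uint8_t clocked_count = 0;

static void sda_write(uint8_t level)
{
    if (level) PORT_REG |= (uint8_t)(1u << SDA_BIT);
    else       PORT_REG &= (uint8_t)~(1u << SDA_BIT);
}

static void scl_write(uint8_t level)
{
    uint8_t was_low = (uint8_t)!(PORT_REG & (1u << SCL_BIT));
    if (level) PORT_REG |= (uint8_t)(1u << SCL_BIT);
    else       PORT_REG &= (uint8_t)~(1u << SCL_BIT);
    if (level && was_low && clocked_count < sizeof clocked_bits)
        clocked_bits[clocked_count++] = (PORT_REG >> SDA_BIT) & 1u;
}

static void i2c_start(void)
{
    sda_write(1);
    scl_write(1);
    sda_write(0); /* SDA falls while SCL is high: START */
    scl_write(0);
}

static void i2c_write_byte(uint8_t byte)
{
    /* Shift the byte out MSB first, one clock pulse per bit. */
    for (uint8_t mask = 0x80; mask != 0; mask >>= 1) {
        sda_write((byte & mask) != 0);
        scl_write(1);
        scl_write(0);
    }
    sda_write(1); /* release SDA for the ACK slot */
    scl_write(1); /* ninth clock; the ACK bit is not read here */
    scl_write(0);
}

static void i2c_stop(void)
{
    sda_write(0);
    scl_write(1);
    sda_write(1); /* SDA rises while SCL is high: STOP */
}
```

A full register write would be `i2c_start()`, one `i2c_write_byte()` each for the device address, register, and value, then `i2c_stop()`. Skipping the ACK check is exactly the kind of error checking the text describes trading away for speed.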

Secondly, I effectively had to limit myself to C features only, no fancy C++ stuff like classes or vectors. Generally speaking this is a bad idea because of memory safety and readability concerns, but given the constraints I was under it pretty much had to be done. Explicit casts, unions, direct allocation and deallocation, and a whole bunch else effectively made the entire thing run 10x faster than it otherwise would have. Also, frankly, std::vector is a pig in terms of memory usage even on more capable systems, so it wasn’t like I was planning on using it anyway.
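As an illustration of the C-only style described above, here is a small sketch that uses a union to view a float as raw bytes and fixed-size static arrays in place of a growable container. The names (`float_bytes_t`, `attitude_rad`, `tx_buffer`, `pack_attitude`) are hypothetical, and the use of purely static storage is a choice made for this example, not a description of the original firmware.

```c
#include <stdint.h>

/* View the same four bytes either as a float or as raw bytes.
   Reading the inactive member like this is allowed in C, not in C++. */
typedef union {
    float   value;
    uint8_t bytes[sizeof(float)];
} float_bytes_t;

#define N_AXES 3u

/* Fixed-size storage reserved at link time: no heap, no std::vector. */
static float   attitude_rad[N_AXES];
static uint8_t tx_buffer[N_AXES * sizeof(float)];

/* Copy the attitude angles into the transmit buffer byte by byte,
   in the machine's native byte order. Returns the bytes written. */
static uint8_t pack_attitude(void)
{
    uint8_t n = 0;
    for (uint8_t axis = 0; axis < N_AXES; axis++) {
        float_bytes_t fb;
        fb.value = attitude_rad[axis];
        for (uint8_t i = 0; i < sizeof(float); i++) {
            tx_buffer[n++] = fb.bytes[i];
        }
    }
    return n;
}
```

Because every buffer has a size known at compile time, the linker can report exactly how much of the 2.5KB of RAM is spoken for, which is hard to know in advance once a heap is involved.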

The most major thing I did here, though, was effectively writing an entirely novel way to perform the matrix operations I needed for attitude determination. Any existing libraries that would do matrix multiplication for me would have been far too heavy to run on such a system, so actually storing the Euler matrices as matrices wasn’t going to work. I was able to get a lot of the processing time down by just solving the equations analytically on paper, but that still produced some exceptionally long expressions that were taking a few hundred milliseconds to run (largely because an 8-bit processor with no FPU does 32-bit float math in software, which takes at minimum 4 clock cycles per operation and usually far more).

So what’s the solution here? Examine everything at the assembly level. Sine and cosine operations, it turns out, take a really long time to run, especially on a system as underpowered as this one. About a third of my control loop program size and execution time was literally just the instructions to do trigonometric functions. Clearly that was going to bottleneck things horribly, so I had to come up with a way to get refresh rates up and memory usage down.

I tried a variety of things, from optimizing other areas of the code to overclocking the processor, but none of them really worked as I had hoped. I was effectively still limited to around 5-7 Hz data sampling rates, which just wasn’t going to cut it for active control. There were two main things that actually did work, and together they got me up to 33 Hz, which isn’t great but definitely isn’t terrible considering where we started.

The first thing I did was optimize the hell out of the underlying assembly. Compiler optimization wasn’t really working properly on this code, so I decided to get in there and edit the generated assembly by hand before it was assembled into a binary file. That alone managed to shave my runtime down so I could hit 10 Hz or so pretty reliably, but I wanted to go further.
The second thing was a bit of logic in the program that uses small-angle approximations for angular rates under 1 degree per sample. This definitely introduced a bit of inaccuracy, but in all my testing it seemed to be minor enough not to matter, and it let me bypass those inefficient trig functions entirely for most of the flight (at least in stabilized situations). That on its own roughly tripled my refresh rate and got it up to the 33 Hz I mentioned earlier, all while sampling more than 16 data axes and handling parachute deployment and radio communications. Obviously that refresh rate would be disgustingly low by my current standards, but given when I built it and the power that was available to me, safe to say I was pretty happy with it.

Takeaways

Let me be so clear: you should not do this. This was an ill-conceived way to construct a guidance system and I should definitely have used a more powerful processor in this situation. I’m really glad past me didn’t see that solution though, because this project taught me more than anything else about the importance of optimization and basically forced me to learn assembly, which made my life a hell of a lot easier later on when classes required me to write it. So no, this is not a good way to do this project, but it’s an amazing way to learn in a constrained environment.

I’m working on redoing this project now, with a more efficient processor and the cheaper and more accurate MEMS sensors that I can actually buy now that we’re not in as much of a chip shortage. The current plan is one of the higher-spec STM32 models, which is going to feel downright blazing compared to what I used before. With that extra power I’m hoping to make something with proper active control using fin tabs, and hopefully (eventually, when I have motor money) send it supersonic and beyond. I’ll have more posts out later as that develops, but for now I hope this was an informative read. In summary, do as I say, not as I do, but if you do emulate this, you might learn a thing or two about low-level optimization.

Published: 2024-10-08

By Allison Byrnes
