For my master’s thesis I’m working on computational models of the inferior olivary nucleus, a region in the brain involved in motor control and learning. The lab already produced multi-CPU, multi-GPU and multi-FPGA brain simulators, and another student and I thought it would be cute to add a multi-MCU simulator as well. That way we would be able to simulate scalable real-time inferior olive networks where rewiring the neural connections is just reconnecting wires in an electrical circuit, and an oscilloscope is the main tool to analyze the network behaviour instead of Python scripts (yes, it’s a toy project :)).

We tried all the microcontrollers we had lying around. The ATtiny did not have enough program memory, the ATmega was over 140x slower than real time (even after conversion to fixed point), and the NodeMCUv3 did quite well at just under 30x slower than biology. The Cortex-M7 (Teensy 4.1) and iMX6ULL outperformed all other candidates, but those are costly in large quantities and the latter is not an actual microcontroller. Luckily, the FPU-equipped ESP32 hinted at being able to compete with biology, while also being low-cost and low-power. By coding the software division and reciprocal operations in assembly and building a lookup table for the exponential function, we could finally execute the model at biological speeds. However, we thought we could do better in terms of cost and power use.

After looking for better options we found the one microcontroller that should be perfect for this task: the Espressif ESP32-S2. The faster LX7 architecture, lower power usage, lower price and not having one core sitting idle were convincing. More importantly, every article online about the ESP32-S2 likes to highlight that it is faster at floating point operations, which for us translates to even more detailed neural models.

One author, posting two very similar articles on hackster.io and Medium, writes (twice):

> [The LX7 core] should be capable of more floating point operations per cycle […]

A different author in Elektor magazine also writes that

> The LX7 core is capable of performing many more floating point operations per cycle

And even on Hackaday the statement is repeated that

> […] it appears the LX7 core is capable of many more floating point operations per cycle: apparently 2 FLOPS / cycle for the LX6, but 64 FLOPS / cycle for the LX7. This is fantastic for DSP and other computationally heavy applications. […]

Enthusiastically, we bought a few development boards. After waiting some days for the delivery, we started prototyping and quickly found out that the code (which contains a lot of floating point operations) executed at the same speed as the slow initial ESP32 version. Next we tried enabling our custom floating point division/reciprocal assembly implementations, but the compiler marked all floating point instructions as illegal instructions. It turns out the ESP32-S2 does not have a floating point unit at all, so it could never be faster than the regular ESP32!

Yes, it’s clear if you compare the datasheets. The ESP32 one says “Support for Floating Point Unit”, while in the ESP32-S2 datasheet that line is removed. It’s subtle, but I guess Espressif is not to blame here. However, the only place online where this fact is stated explicitly is this GitHub issue, which I found after writing this post. I’m still a huge fan of the ESP32-S2; it does have a better instruction architecture, and the built-in USB HID and Wi-Fi Time of Flight are really nice. It’s just not as well suited for floating point applications, but that has never been a good idea for microcontrollers :)
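If you want the build itself to tell you whether the target has an FPU, a compile-time check along these lines might help. This is a sketch under assumptions: I’m assuming the toolchain defines `__XTENSA__` and ships the Xtensa HAL header `xtensa/config/core-isa.h` with an `XCHAL_HAVE_FP` macro, and the names `HAVE_XTENSA_FPU` and `float_backend` are invented for this example.

```c
/* Assumption: Xtensa toolchains define __XTENSA__ and provide this header,
 * which describes the configured core (including XCHAL_HAVE_FP). */
#if defined(__XTENSA__)
#include <xtensa/config/core-isa.h>
#endif

/* Hypothetical convenience macro: 1 only when the core reports an FPU.
 * On any other target (or when the macro is missing) it falls back to 0. */
#if defined(XCHAL_HAVE_FP) && XCHAL_HAVE_FP
#define HAVE_XTENSA_FPU 1
#else
#define HAVE_XTENSA_FPU 0
#endif

/* Report which floating point path this build is expected to use. */
static const char *float_backend(void) {
    return HAVE_XTENSA_FPU ? "hardware FPU" : "software floating point";
}
```

Guarding FPU-specific inline assembly with such a macro would at least turn the confusing “illegal instruction” errors into an explicit choice between a hardware and a software path.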

## So how do you speed up floating point operations on the ESP32?

A reason for this strange reporting might be that the ESP32 development environment (especially Arduino) does have some problems with floating point code by default, which make floating point division slow. The last link there gives a hint on why (but beware: while it solves one problem, it introduces another by not declaring the right clobber registers in the asm volatile statement, which leads to strange behaviour if you enable inlining). The ESP32 FPU only supports addition and multiplication; division still needs to happen (partially) in software. For some reason gcc likes to link to its own software floating point routines instead of using the optimized LX6 assembly version (or something like that). So a simple uninformed comparison between the two chips might lead to the ESP32-S2 looking a bit faster.

If you want actual fast divisions on the ESP32, look at the floating point routines from the xtensa libgcc library (GPLv3 licensed). Depending on your setup and/or esp-idf version, you might already be using their division function. But while that division function led to a nice 60% speed-up for us, the biggest performance gain came from replacing division by multiplication with the reciprocal (taken from the same library):

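```c
static __attribute__((always_inline)) inline
float recipsf2(float a) {
    float result;
    asm volatile (
        "wfr f1, %1\n"             /* move a into FPU register f1 */

        "recip0.s f0, f1\n"        /* f0 = initial approximation of 1/a */
        "const.s f2, 1\n"
        "msub.s f2, f1, f0\n"      /* f2 = 1 - a*f0 (residual) */
        "maddn.s f0, f0, f2\n"     /* f0 = f0 + f0*f2 (Newton-Raphson step) */
        "const.s f2, 1\n"
        "msub.s f2, f1, f0\n"
        "maddn.s f0, f0, f2\n"     /* second refinement step */

        "rfr %0, f0\n"             /* move the result back */
        :"=r"(result):"r"(a):"f0","f1","f2"
    );
    return result;
}
#define DIV(a, b) ((a) * recipsf2(b))
```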
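For readers who don’t speak Xtensa assembly: the routine above starts from a rough hardware estimate of 1/a and refines it with Newton-Raphson steps. The same idea in portable C looks roughly like the sketch below. It is only an illustration, not a replacement: the seed here is a crude bit trick (valid for positive, normal floats) standing in for `recip0.s`, so it needs three refinement steps, and the names `recip_seed` and `recip_newton` are made up.

```c
#include <stdint.h>
#include <string.h>

/* Crude first guess for 1/a, for positive normal floats only.
 * Subtracting the bit pattern from 0x7F000000 negates the exponent and
 * gives a result that is at most about 12.5% too large. This stands in
 * for the hardware seed instruction and is an assumption of this sketch. */
static float recip_seed(float a) {
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits = 0x7F000000u - bits;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Newton-Raphson refinement of the reciprocal: x <- x + x * (1 - a*x).
 * Every step roughly squares the relative error, so a 12.5% seed error
 * drops to about 1.6%, then 0.02%, then to single-precision level. */
static float recip_newton(float a) {
    float x = recip_seed(a);
    for (int i = 0; i < 3; i++) {
        float residual = 1.0f - a * x;  /* what msub.s computes */
        x = x + x * residual;           /* a multiply-add onto x */
    }
    return x;
}
```

The nice property is that each step only needs multiplications and additions, which is exactly what an FPU without a divider is good at. When the same divisor is used many times (for example a fixed time constant in a neuron model), computing its reciprocal once and multiplying by it afterwards saves even more.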


A simple old trick. At a time when the cheapest available MCU with a floating point unit has Wi-Fi and Bluetooth built in, it’s sometimes nice to optimize code without having to worry about L1/L2 cache invalidation or frequency scaling :)

(Added a few days later): By the way: 64 FLOPS/cycle! That’s what you get with AVX-512 on the newest Intel processors, which, when used, throttles the CPU frequency to a crawl to prevent overheating. Absolutely not something a single-core microprocessor without an FPU could ever have. As one of the comments on Hackaday says (should have read that earlier :)), it’s probably a misinterpretation of the Cadence Xtensa LX7 release statement. Yes, the LX7 architecture has an optional vectorized floating point processor, but the ESP32-S2 does not ship with that module. Sad that all tech sites decided to copy each other without verification.
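To see how extreme that number is, here is a back-of-the-envelope count of peak FLOPs per cycle. It uses the usual convention that one fused multiply-add counts as two operations, and assumes a core with two 512-bit FMA units for the AVX-512 case; the function name `peak_flops_per_cycle` is just for this example.

```c
/* Peak floating point operations per cycle, counting one fused
 * multiply-add as two operations (the usual marketing convention). */
static unsigned peak_flops_per_cycle(unsigned vector_bits,
                                     unsigned element_bits,
                                     unsigned fma_units) {
    unsigned lanes = vector_bits / element_bits;  /* elements per vector */
    return lanes * 2u * fma_units;
}
```

Under these assumptions `peak_flops_per_cycle(512, 32, 2)` gives 16 lanes x 2 x 2 = 64 for single precision, while a scalar unit issuing one multiply-add per cycle, `peak_flops_per_cycle(32, 32, 1)`, gives 2. The 2 would be consistent with the figure quoted for the LX6, but the 64 needs very wide vector hardware.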

From the LX7 datasheet: Configurable ISA options include:

• IEEE 754-compliant single-/double-precision scalar/vector floating-point units
• Double-precision scalar floating-point acceleration
• Single-precision vector floating-point (VFPU) option

Some pretty cool stuff coming up in the future! Let’s hope that the upcoming ESP32-S3 with “additional vector instructions” has some of those enabled.