"No, the ESP32-S2 is not faster at floating point operations (and how do you actually speed up division on the ESP32?)"

08 apr 2021

For my master's thesis I'm working on computational models of the inferior olivary nucleus, a region in the brain involved in motor control and learning. The lab already produced multi-CPU, multi-GPU and multi-FPGA brain simulators, and another student and me thought it would be a cute to add a multi-MCU simulator as well. In this way we would be able to simulate a scalable real time inferior olive networks where rewiring the neural connections is just reconnecting wires in an electrical circuit and an oscilloscope is the main tool to analyze the network behaviour (instead of python scripts) (yes it's a toy project :)).

We tried all microcontrollers we had lying around; The ATtiny did not have enough program memory, the ATmega was over 140x slower than real-time (even after conversion to fixed point) and the NodeMCUv3 scored quite good with being just under 30x slower than biology. The iMX6ULL outperformed all candidates, but that's not an actual microprocessor. Luckily the ESP32, equiped with a FPU, hinted at being able to compete with biology, while also being low-cost and low-power. By hand-coding the ESP32 software division and reciprocal operations in assembly and building a look up table for the exponential function, we could finally execute the model at biological speeds. However, we thought we could do better in terms of cost and power use.

After looking for better options we found the one microcontroller that should be perfect for this task: the Expressif ESP32-S2. The faster LX7 architecture, lower power usage, cheaper price and not having one core sitting idle doing nothing where convincing. More importantly, every article online writing about the ESP32-S2 likes to highlight that it is faster for floating point operations, which translates to even more detailed neural models for us.

One author, writing two very similar articles on hackster.io: and Medium, writes (twice):

[The LX7 core] should be capable of more floating point operations per cycle [...]

A different author on Elektor magazine also writes that

The LX7 core is capable of performing many more floating point operations per cycle

And even on Hackaday the statement is repeated that

[...] it appears the LX7 core is capable of many more floating point operations per cycle

Enthusiastically, we bought a few development boards and started prototyping. After waiting some days for the delivery, we quickly found out that the code (which contains a lot of floating point operations) executes at the same speed as the slow initial ESP32 version. Next we tried enabling our custom floating point division/reciprocal assembly implementations, but the compiler marked all floating point instructions as illegal instructions. The ESP32-S2 does not have a floating point unit at all, so it could never be faster than the regular ESP32!

Yes, it's clear if you compare the datasheets. The ESP32 one says "Support for Floating Point Unit", while in the ESP32-S2 datasheet that line is removed. Its subtle, but I guess Expressif is not to blame here.

So how do you speed up floating point operations on the ESP32?

One reason for this strange reporting might be that the ESP32 development environment (especially Arduino) by default does have some problems with floating point code and will make floating point division slow. The last link there gives a hint on why (but beware as it solves one problem, it introduces another by not setting the right clobber registers in the asm volatile statement which will lead to strange behaviour if you enable inlining). The ESP32 FPU only supports addition and multiplication, division still needs to happen (partially) in software. For some reason gcc likes to link to its own software floating point routines instead of using the optimized LX6 assembly version (or something like that). So a simple uninformed comparison between the two chips might lead to the ESP32-S2 looking a bit faster.

If you want actual fast divisions on the ESP32, look at the floating point routines from the xtensa libgcc library (GPLv3 licensed). Depending on your setup and/or esp-idf version, you might already be using their division function. But while that division function led to a nice 60% speed-up for us, our biggest performance gain came from replacing division by multiplication with the reciprocal (taken from the same library):

static __attribute__((always_inline)) inline
float recipsf2(float a) {
    float result;
    asm volatile (
        "wfr f1, %1\n"

        "recip0.s f0, f1\n"
        "const.s f2, 1\n"
        "msub.s f2, f1, f0\n"
        "maddn.s f0, f0, f2\n"
        "const.s f2, 1\n"
        "msub.s f2, f1, f0\n"
        "maddn.s f0, f0, f2\n"

        "rfr %0, f0\n"
        :"=r"(result):"r"(a):"f0","f1","f2"
    );
    return result;
}
#define DIV(a, b) (a)*recipsf2(b)

A simple old trick. In a time where the cheapest available MCU with floating point unit has Wi-Fi and Bluetooth builtin, it's sometimes nice to optimize code while you don't have to worry about cache invalidation, instruction pipelining or frequency scaling :)