Issue
English is not my first language, so sorry for my bad writing.
I need to optimize an algorithm that is written in Python and runs on a Raspberry Pi. The catch is that I need to write the optimized code as a C program running on an STM32F4.
It is an image-processing algorithm (I know, image processing in C on a microcontroller sounds fun ...) and the functionality must stay the same (so the same output, within tolerance). Of course I need a method of benchmarking the two programs.
In my case, "optimization" means that the program should run faster (which it automatically will, but I need to show that it is faster because of optimized code and not just because it is written in C and running on a bare-metal system).
I know that, for example, I can compare the number of lines of code, because fewer lines mean a faster program. Are there more "factors" that are system-independent, which I can compare to explain why the optimized code is faster?
Kind regards, Dan
PS: I thought about converting the Python code to C code with Cython. Then I could compile it and compare the assembly or machine code. But I am not sure if that is the right way, because I don't know what exactly Cython is doing.
Solution
Of course I need a method of benchmarking the two programs.
For embedded systems, this is usually done by toggling a GPIO pin at the start and end of the algorithm, then measuring the time with an oscilloscope. This should be possible on both the Raspberry Pi and the STM32 target. But you'll be measuring raw execution speed, not just the algorithm - the Raspberry Pi will be busy with context switches and the like.
I know that, for example, I can compare the number of lines of code, because fewer lines mean a faster program.
No, that's nonsense. The number of lines does not necessarily have any relation to execution speed. If you believe that, then I'd say you are still far too inexperienced to do manual code optimization for a specific target.
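A tiny counterexample, independent of any target: the shorter function below is drastically slower than the longer one, because line count says nothing about how much work the code does.

```c
#include <stdint.h>

/* One short line of logic, but exponential time. */
static uint64_t fib_short(unsigned n)
{
    return n < 2 ? n : fib_short(n - 1) + fib_short(n - 2);
}

/* More lines, linear time: far faster for any non-trivial n. */
static uint64_t fib_long(unsigned n)
{
    uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; i++) {
        uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}
```

For n around 40, fib_short takes seconds while fib_long is effectively instant, despite being twice as many lines.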
As for specific performance improvements to look for: dropping Linux in favour of bare metal will give a huge performance boost. On the other hand, you will at the same time downsize from some Cortex-A to an M4, which runs at a much lower clock and lacks a cache. But this also means that if you get it running faster on the M4, that's mission accomplished, since it is a less powerful target. (And outperforming a Linux system should be a walk in the park for a bare-metal Cortex-M4.)
I suspect that merely converting from Python to C will improve performance quite a bit, since all manner of type-generic goo and implicit/hidden function calls performed "behind the scenes" in Python will simply be removed.
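As a sketch of what falls away: a per-pixel loop in Python dereferences an object, checks its type and boxes a new integer for every single element, while the C equivalent compiles down to a compare and a store on raw bytes. A hypothetical thresholding step, for illustration:

```c
#include <stddef.h>

/* The Python equivalent, `out[i] = 255 if buf[i] > t else 0`,
   performs dynamic dispatch and object boxing per pixel.
   Here the whole loop operates on raw bytes with fixed types. */
static void threshold(const unsigned char *in, unsigned char *out,
                      size_t n, unsigned char t)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] > t) ? 255 : 0;
}
```

This overhead removal is exactly why the comparison must control for it: a faster C program proves little by itself, which is the point about benchmarking the algorithm and not the language.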
Other than that, the STM32F4 is advanced enough to have a form of branch prediction, and it also has an FPU. So you can still look at reducing the number of branches and floating-point operations. You can also look at the CPU clock used versus flash wait states and see if there are possible improvements. As far as I know, this MCU doesn't have a data cache, meaning it can't compensate for flash wait states. So maybe consider executing code from RAM if wait states are a bottleneck, or simply clock it up as much as possible.
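On the branch-reduction side, a per-pixel conditional can often be replaced with branch-free arithmetic so the pipeline never has to predict anything. A sketch of the idiom (whether it actually wins on a given M4 build is an assumption - verify it with the GPIO/scope measurement):

```c
#include <stdint.h>
#include <stddef.h>

/* Branchy version: one conditional per pixel. */
static void clamp_branchy(uint8_t *buf, size_t n, uint8_t lo)
{
    for (size_t i = 0; i < n; i++) {
        if (buf[i] < lo)
            buf[i] = lo;
    }
}

/* Branch-free version: (buf[i] < lo) evaluates to 0 or 1,
   so negating it yields an all-zeros or all-ones mask, which
   selects between the original value and the clamp value. */
static void clamp_branchless(uint8_t *buf, size_t n, uint8_t lo)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t m = (uint8_t)-(buf[i] < lo); /* 0xFF if below lo */
        buf[i] = (uint8_t)((buf[i] & (uint8_t)~m) | (lo & m));
    }
}
```

Compilers will sometimes produce this transformation on their own (or a conditional-move equivalent), so compare the generated assembly of both versions before concluding anything.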
Answered By - Lundin Answer Checked By - David Marino (WPSolving Volunteer)