Bit banging (how fast can you go in code)

One of the first tests I did was to bitbang the hell out of the IO and see how fast I could get it. What I wanted to know is how fast you can flip a GPIO from 0 to 1 and back. I repeated that a few times inside the loop to prevent the while from interfering with the timing; the while itself will be measured separately.

#include <stdio.h>
#include "pico/stdlib.h"
#include "hardware/gpio.h"
#include "pico/binary_info.h"

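// GPIO16: the pin measured with the oscilloscope
const uint LED1_PIN = 16;

// Toggle LED1_PIN as fast as possible. The 10 unrolled gpio_put() calls
// produce 5 pulses per loop iteration, so the overhead of the while loop
// only shows up once every 5 pulses and can be measured separately.
void onebitspeedtest()
{
  while (true) {
    gpio_put(LED1_PIN, 0);
    gpio_put(LED1_PIN, 1);
    gpio_put(LED1_PIN, 0);
    gpio_put(LED1_PIN, 1);
    gpio_put(LED1_PIN, 0);
    gpio_put(LED1_PIN, 1);
    gpio_put(LED1_PIN, 0);
    gpio_put(LED1_PIN, 1);
    gpio_put(LED1_PIN, 0);
    gpio_put(LED1_PIN, 1);
  }
}

// Configure LED1_PIN as a software-controlled (SIO) output
void init_io()
{
  gpio_init(LED1_PIN);
  gpio_set_dir(LED1_PIN, GPIO_OUT);
}

int main() {
  stdio_init_all();
  init_io();
  onebitspeedtest();  // never returns
}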

Source code converted to HTML with http://hilite.me/. I have put the source code of the project (including the .vscode directory) in this zipfile.

For the test setup I used a Pico installed on a breadboard and programmed it with a PicoProbe. For measuring the output signal I used my Rigol DS1054Z oscilloscope and a frequency counter.

[Image: bitbang_01]

Expecting a little delay because of the while, I used a series of 10 gpio_put(xx, yy) statements to generate 5 pulses, and I measured the period and frequency with my Rigol DS1054Z oscilloscope.

The image below shows that with the Pico running the test program (Debug build / no optimization) at 133 MHz, I get a nice 16nSec period, which is 62.5MHz. Remember that measuring period and frequency with this scope can be a bit off (note to myself: build a stable 10MHz GPS-locked frequency reference source). The large positive pulse is the expected elapsed time taken by the while loop.

[Image: bitbang_debug_0_1_16nSec]

The image below shows that with the Pico running the test program (Release build / optimized for speed) at 133 MHz, I again get a nice 16nSec period, which is 62.5MHz. The large positive pulse caused by the elapsed time of the while loop is now shorter. We will take a look into that in a minute.

[Image: bitbang_release_0_1_16nSec]

So it looks like Debug or Release doesn't affect the gpio_put(xx, yy) statements, but does affect the timing of the while loop. Let's dive deeper into that.
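As a quick sanity check on the 16nSec reading: the frequency is simply f = 1/T, and multiplying the period by the clock frequency gives the number of CPU clock cycles in one 0 -> 1 pulse. A small host-side sketch in C (the helper names are made up for illustration; this does not run on the Pico):

```c
#include <assert.h>

/* f = 1 / T. With T in nanoseconds, f in MHz is 1000 / T. */
double period_ns_to_mhz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1000.0 / period_ns;
}

/* How many system clock cycles fit in a period: T * f_clk.
   With T in nSec and f_clk in MHz that is T * f_clk / 1000. */
double cycles_in_period(double period_ns, double clock_mhz)
{
    assert(period_ns > 0.0 && clock_mhz > 0.0);
    return period_ns * clock_mhz / 1000.0;
}
```

period_ns_to_mhz(16.0) returns 62.5, and cycles_in_period(16.0, 125.0) returns exactly 2.0, whereas at 133 MHz it would be about 2.13. Because the output register is clocked by the system clock, the period should be a whole number of cycles, so the 16nSec reading fits a 125 MHz clock slightly better than 133 MHz. That is only an inference from the numbers, and it is within the scope tolerance mentioned above.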

Diving deeper. 

When we zoom into the signals, we can measure the elapsed time of the while loop.

Looking at the source code, the first statement in the while is gpio_put(xx, 0), which sets the GPIO to "0", and the last statement in the while is gpio_put(xx, 1), which sets the GPIO to "1". During the while statement the GPIO stays "1", so it's safe to say that the total timespan shown below consists of:

elapsed time gpio_put(xx, 1)
elapsed time while(true)
elapsed time gpio_put(xx, 0)

When we look at the image below, which is taken from an execution of the Debug build / no optimization, we can see that the total elapsed time of those 3 statements is 64nSec.

Because we know from the first measurement that a 0 -> 1 period is 16nSec, we can calculate that the elapsed time of the while is 48nSec (64 - 16 nSec). #1

[Image: bitbang_debug_while_64nSec]

When we look at the image below, which is taken from an execution of the Release build / optimization, we can see that the total elapsed time of those 3 statements is 32nSec.

Knowing that a 0 -> 1 period is 16nSec, we can calculate that the elapsed time of the while is 16nSec (32 - 16 nSec). #2

[Image: bitbang_release_while_32nSec]

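Conclusion: in our test program, the while statement in the Debug build takes three times as long as in the Release build (48nSec versus 16nSec); the complete window including the two gpio_put() calls is twice as long (64nSec versus 32nSec).
Let's find out what the reason for that is.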
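The arithmetic behind #1, #2 and the conclusion fits in a few lines of C. The inputs are the measured values from the scope images; the function name is hypothetical:

```c
#include <assert.h>

/* Time spent in the while itself.
   window_ns:       measured gpio_put(xx, 1) + while + gpio_put(xx, 0)
   pulse_period_ns: measured 0 -> 1 period, i.e. two gpio_put() calls
   Subtracting the two gpio_put() calls leaves only the while. */
double while_overhead_ns(double window_ns, double pulse_period_ns)
{
    assert(window_ns >= pulse_period_ns);
    return window_ns - pulse_period_ns;
}
```

Calling it with the Debug numbers (64, 16) and the Release numbers (32, 16) gives the 48nSec and 16nSec of #1 and #2. Note that the ratio of the while times (48 / 16) is not the same as the ratio of the full windows (64 / 32), because the two gpio_put() calls are part of both windows.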

For that we take a look at another file.

When you compile the source code, the build also produces a file with the extension .dis (meaning disassembled). In that file you will find the assembly output of your program, and when we look at it for this program it becomes clear why the while statement in the Debug build is longer than in the Release build.

The listing below is the assembly output of the program in the Debug build. Let's take a look.

10000370 <onebitspeedtest>:
10000370: 23d0 movs r3, #208 ; 0xd0
10000372: 061b lsls r3, r3, #24
10000374: 2280 movs r2, #128 ; 0x80
10000376: 0252 lsls r2, r2, #9
10000378: 619a str r2, [r3, #24]
1000037a: 615a str r2, [r3, #20]
1000037c: 619a str r2, [r3, #24]
1000037e: 615a str r2, [r3, #20]
10000380: 619a str r2, [r3, #24]
10000382: 615a str r2, [r3, #20]
10000384: 619a str r2, [r3, #24]
10000386: 615a str r2, [r3, #20]
10000388: 619a str r2, [r3, #24]
1000038a: 615a str r2, [r3, #20]
1000038c: e7f0 b.n 10000370 <onebitspeedtest>

We see that the onebitspeedtest function starts at address 10000370.

After four instructions that set up the registers R2 and R3 (MOVS and LSLS), we see the code for the actual GPIO bit flipping.

str r2, [r3, #24] makes GPIO16 go low.
str r2, [r3, #20] makes GPIO16 go high.

This low/high pair is repeated 4 more times (5 pulses in total), and then the code jumps back to the beginning of the function at address 10000370. Looking back at our earlier finding (see #1), we see that jumping back and setting up the R2 and R3 registers takes 48nSec.

Note * : I started to write down the exact content and addressing of these functions, but got stuck (the offsets for the GPIO16_CTRL register didn't match) and will try it again later.

Now we take a look at the assembly output of the program in the Release build.
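The constants in the Debug listing can be decoded by hand: `movs r3, #208` plus `lsls r3, r3, #24` gives 0xd0 << 24 = 0xd0000000, and `movs r2, #128` plus `lsls r2, r2, #9` gives 0x80 << 9 = 0x10000, which is the bit mask 1 << 16 for GPIO16. According to the RP2040 datasheet, 0xd0000000 is the base of the SIO (single-cycle IO) block, with GPIO_OUT_SET at offset 0x014 (#20) and GPIO_OUT_CLR at offset 0x018 (#24). So the stores go to SIO rather than to GPIO16_CTRL in IO_BANK0, which may explain why the offsets in the note above did not match. The sketch below just replays that arithmetic on a PC; the macro and function names are my own:

```c
#include <assert.h>
#include <stdint.h>

/* RP2040 SIO register block (values from the RP2040 datasheet). */
#define SIO_BASE_ADDR     0xd0000000u
#define GPIO_OUT_SET_OFS  0x014u
#define GPIO_OUT_CLR_OFS  0x018u

/* Tie the decimal immediates in the listing to the register offsets. */
static_assert(GPIO_OUT_SET_OFS == 20u, "str rX, [rY, #20] targets GPIO_OUT_SET");
static_assert(GPIO_OUT_CLR_OFS == 24u, "str rX, [rY, #24] targets GPIO_OUT_CLR");

/* Replays "movs r3, #208 ; lsls r3, r3, #24" from the Debug listing. */
uint32_t decode_base(void)
{
    uint32_t r3 = 208u;   /* movs r3, #208 (0xd0) */
    r3 <<= 24;            /* lsls r3, r3, #24 -> 0xd0000000 */
    return r3;
}

/* Replays "movs r2, #128 ; lsls r2, r2, #9" from the Debug listing. */
uint32_t decode_mask(void)
{
    uint32_t r2 = 128u;   /* movs r2, #128 (0x80) */
    r2 <<= 9;             /* lsls r2, r2, #9 -> 0x00010000 = 1 << 16 */
    return r2;
}
```

decode_base() + GPIO_OUT_CLR_OFS is 0xd0000018 and decode_mask() is 1u << 16, so each `str` writes the GPIO16 bit to either GPIO_OUT_CLR or GPIO_OUT_SET. That is consistent with pico-sdk's inline `gpio_put()` writing the pin mask to `sio_hw->gpio_clr` or `sio_hw->gpio_set`, which is why each call costs a single store here.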

10000370 <onebitspeedtest>:
10000370: 22d0 movs r2, #208 ; 0xd0
10000372: 2380 movs r3, #128 ; 0x80
10000374: 0612 lsls r2, r2, #24
10000376: 025b lsls r3, r3, #9
10000378: 6193 str r3, [r2, #24]
1000037a: 6153 str r3, [r2, #20]
1000037c: 6193 str r3, [r2, #24]
1000037e: 6153 str r3, [r2, #20]
10000380: 6193 str r3, [r2, #24]
10000382: 6153 str r3, [r2, #20]
10000384: 6193 str r3, [r2, #24]
10000386: 6153 str r3, [r2, #20]
10000388: 6193 str r3, [r2, #24]
1000038a: 6153 str r3, [r2, #20]
1000038c: e7f4 b.n 10000378 <onebitspeedtest+0x8>
1000038e: 46c0 nop ; (mov r8, r8)

It looks almost the same: the MOVS and LSLS instructions are in a slightly different order and R2 and R3 have swapped roles, but the rest looks OK. The exception is the jump back after the bit flipping: it doesn't go back to 10000370 as it did in the Debug build, but to 10000378, which skips setting up R2 and R3. The nop at 1000038e after the branch is never executed (most likely it is just padding).

Looking back at our earlier finding (see #2), we now see that the Release build doesn't make the while itself faster; it shortens the loop by jumping back to the point after the setup of the R2 and R3 registers. Now we also know that setting up R2 and R3 took 48 - 16 nSec = 32 nSec, which is 8nSec per instruction for the 4 MOVS/LSLS instructions.

Note: Optimization is of course very code dependent, and it's one of the reasons why embedded software developers sometimes use assembly for very critical or fast code.
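The measurements can also be predicted from the two listings with a simple cycle-count model. This is a sketch under my own assumptions, none of which were measured here: a `str` to SIO takes 1 cycle (SIO sits on the Cortex-M0+ single-cycle IO port), `movs` and `lsls` take 1 cycle each, a taken `b.n` takes 2 cycles, and one cycle is 8nSec as derived above. The names are hypothetical:

```c
#include <assert.h>

#define CYCLE_NS   8.0  /* assumed: one clock cycle, as derived in the text */
#define STR_CYCLES 1    /* assumed: store to SIO via the single-cycle IO port */
#define B_CYCLES   2    /* assumed: taken branch on the Cortex-M0+ */

/* Predicted window gpio_put(xx, 1) + while + gpio_put(xx, 0) in nSec, for a
   loop whose branch target re-runs `setup_instr` one-cycle MOVS/LSLS
   instructions before the first store. */
double predicted_window_ns(int setup_instr)
{
    assert(setup_instr >= 0);
    int cycles = STR_CYCLES      /* last store: GPIO16 high         */
               + B_CYCLES        /* b.n back to the branch target   */
               + setup_instr     /* MOVS/LSLS executed again        */
               + STR_CYCLES;     /* first store: GPIO16 low         */
    return cycles * CYCLE_NS;
}

/* One 0 -> 1 period inside the unrolled block: two back-to-back stores. */
double predicted_pulse_period_ns(void)
{
    return 2 * STR_CYCLES * CYCLE_NS;
}
```

predicted_window_ns(4) models the Debug loop (the branch lands before the four MOVS/LSLS instructions) and predicted_window_ns(0) the Release loop (the branch lands on the first `str`). They come out at 64 and 32 nSec, and predicted_pulse_period_ns() at 16 nSec, matching the scope readings. Under this model the 16nSec that the Release while still takes is exactly the 2-cycle branch.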


Post-experiment remark

I also tested it by copying the Release UF2 file to the mounted drive. There was no difference in timing compared with loading the Release build via SWD.

[Image: bitbang_02]