Yes, you are absolutely right. I did the calculation that way because I counted only the times the loop executes beyond the first pass, since it must run at least once to check HBlank. In short, the original code is faster if the loop executes only once, so I calculated the number of iterations at which my code becomes faster.
The proper, exact calculation is yours, which implies that the loop executes about 11 times on average.
Looking at the original code statistically, changing the multiply code might make a difference, because you save a lot of cycles. That said, I don't think those tasks are very repetitive, since the multiply is used to access name tables to print items, weapons, armor and such...
Multiplication Function
You could axe that last NOP and it would take 34 cycles, as opposed to the original 39, or your 45.
Multiplies low byte of A * high byte of A. Stores result in 16-bit A.
C2/4781: 08 PHP
C2/4782: C2 20 REP #$20
C2/4784: 8F 02 42 00 STA $004202
C2/4788: EA NOP
C2/4789: EA NOP
C2/478A: EA NOP
C2/478B: EA NOP
C2/478C: AF 16 42 00 LDA $004216
C2/4790: 28 PLP
C2/4791: 60 RTS
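For anyone following along, here is a rough Python model of what the routine above computes. It is only a sketch of the arithmetic, not of the timing: with A in 16-bit mode, the STA to $004202 puts the low byte of A in $4202 and the high byte in $4203, and the 16-bit product is read back from $4216. The function name `multiply_bytes_of_a` is made up for illustration.

```python
def multiply_bytes_of_a(a: int) -> int:
    """Model of the routine: low byte of A times high byte of A, as a 16-bit result.

    Hypothetical helper for illustration only; it ignores the NOP wait
    and models just the value that ends up in A.
    """
    a &= 0xFFFF                   # A is 16 bits wide after REP #$20
    multiplicand = a & 0xFF       # low byte, written to $4202
    multiplier = (a >> 8) & 0xFF  # high byte, written to $4203
    # The product of two 8-bit values always fits in 16 bits ($4216/$4217).
    return (multiplicand * multiplier) & 0xFFFF


if __name__ == "__main__":
    # A = $0304: low byte $04 times high byte $03
    print(hex(multiply_bytes_of_a(0x0304)))
```

So the caller has to pack both 8-bit operands into A before calling, and gets the full 16-bit product back in A.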
Why could I axe the last NOP? Isn't the hardware multiplication supposed to take 8 machine cycles? I just checked, and all the $4202 multiplications wait only 6 machine cycles, not the 8 I had always thought was the correct number.
By the way, I just discovered some 24-bit multiplications in bank $C3 in routines:
I'll post my full relocatable disassembly when it's done.