King M, whether or not that's a bottleneck depends on how often the decompression routine is called. If it runs, say, once per scene change or something like that, then despite being inefficient I doubt it's really causing a problem. If it's being called regularly, however, it very well could be.
I don't have the data in front of me, but those three instructions probably average somewhere on the order of 4 CPU cycles each, and each CPU cycle would be about 6 master cycles. So we have maybe 12x6=72 master cycles per loop iteration. Take 72 master cycles multiplied by ~65000 iterations and divide by the ~21MHz master clock. We're talking somewhere in the neighborhood of just under a quarter of a second, I think. Feel free to grab Anomie's timing docs and a 65816 doc to do more precise math.
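For anyone who wants to plug in their own numbers, here's a rough sketch of that back-of-envelope math in Python. The defaults are just my guesses from above (3 instructions, ~4 CPU cycles each, 6 master cycles per CPU cycle, ~65000 iterations), and the function name and parameters are made up for illustration. The NTSC master clock of about 21.477 MHz is the only hard number in here; swap in real cycle counts from the timing docs for anything precise.

```python
# Back-of-envelope timing for the clear loop. All defaults are rough
# guesses from the estimate in this post, not measured values.
MASTER_CLOCK_HZ = 21_477_272  # NTSC SNES master clock, about 21.477 MHz


def loop_seconds(instructions=3,
                 cpu_cycles_per_instruction=4,
                 master_cycles_per_cpu_cycle=6,
                 iterations=65_000):
    """Estimate how long the loop takes, in seconds."""
    # Master cycles spent on one pass through the loop body.
    master_cycles_per_iteration = (instructions
                                   * cpu_cycles_per_instruction
                                   * master_cycles_per_cpu_cycle)
    # Total master cycles for the whole loop, converted to seconds.
    total_master_cycles = master_cycles_per_iteration * iterations
    return total_master_cycles / MASTER_CLOCK_HZ


# Default guess: 72 master cycles per iteration.
print(f"{loop_seconds():.3f} s")
# Pessimistic guess: every CPU cycle is a slow 8-master-cycle access.
print(f"{loop_seconds(master_cycles_per_cpu_cycle=8):.3f} s")
```

With the default guesses this lands a bit over 0.2 seconds; assuming slower 8-master-cycle accesses pushes it closer to 0.3 seconds. Either way, that's many frames' worth of time for a single pass.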
How and when is this particular loop being called? If my estimate is correct, it can't be called too terribly often or you'd have massive delay.
The simple fix, as MoN said, is to do DMA to the WRAM registers. Alternatively, just doing a manual 16-bit copy or using MVN will cut it down significantly.
Also, it's probably not even needed. When you decompress a stream, you're typically not reading ahead in the output buffer, so what purpose would clearing it serve? I don't see why you'd need to clear it out every time the routine runs. Did you check whether the game fails without it and, if so, why?