There are several changes here but only three really big wins. The
biggest comes from restructuring the 2-bit RLE decode loop to avoid the
inner function (~20%), but the switch to 16-bit writes in _fill() and
the adoption of quick_write (i.e. no CS toggling) are also noteworthy
(about 5% each).
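
For illustration only, here is a minimal sketch of what expanding runs
inline (rather than through a per-run helper call) can look like. The
byte layout (top 2 bits are a palette index, low 6 bits are the run
length minus one) and the names decode_rle2, palette and out are
assumptions made for this example, not the format or code used by this
change.

```python
def decode_rle2(data, palette, out):
    """Expand hypothetical 2-bit RLE bytes into 16-bit pixel values.

    Assumed layout (illustrative only): bits 7..6 = palette index,
    bits 5..0 = run length minus one. Returns the pixel count written.
    """
    pos = 0
    for b in data:
        color = palette[b >> 6]   # top 2 bits select the colour
        end = pos + (b & 0x3F) + 1  # low 6 bits give the run length
        # The run is expanded right here instead of calling a nested
        # helper once per run, which avoids the per-call overhead.
        while pos < end:
            out[pos] = color
            pos += 1
    return pos


palette = (0x0000, 0xFFFF, 0xF800, 0x001F)
out = [0] * 8
count = decode_rle2(bytes([0x42, 0xC1, 0x02]), palette, out)
```

With the assumed layout, the three input bytes decode to runs of 3, 2
and 3 pixels, so count is 8 and out is completely filled.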

Migrate the filling of the line buffer into a separate function.
This naturally reduces the cost of loop management but, much more
importantly, allows us to use the viper native code generator.
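
A rough sketch of the shape such a standalone fill function might take
is shown below. It is written to run under CPython, so it uses a
memoryview cast to 16-bit items; under MicroPython the same loop would
presumably be decorated with @micropython.viper and take a ptr16
argument instead. The name fill and its parameters are hypothetical and
are not taken from the actual driver.

```python
def fill(mv, color, count, offset):
    """Write `count` 16-bit pixels of `color` into `mv` from `offset`.

    `mv` is a memoryview of unsigned 16-bit items, so each store writes
    a whole pixel rather than two separate bytes. (Hypothetical helper;
    a viper version would use ptr16 and int type hints.)
    """
    for i in range(offset, offset + count):
        mv[i] = color


linebuf = bytearray(2 * 8)            # room for 8 RGB565 pixels
mv = memoryview(linebuf).cast('H')    # view the buffer as 16-bit items
fill(mv, 0xF800, 3, 2)                # three pixels starting at index 2
```

Keeping the loop in its own small function with plain integer
arguments is what makes it a candidate for viper, since viper works
best on simple, type-annotated code.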