Quote for those too lazy to click on links.
Microcode Optimization: IMEM (Part8)(Editor's Note: Part 8?!?!?)
Today I am very happy to give you finally news on the recent developments on the microcode. In the last months, as previously underlined, I tackled lighting and reflection mapping. I am not yet totally done with it but with all the changes already carried out, it is worthwhile to share the current status.
I) Change the way to load in RSP the inverse of transpose of the 3x3 part of the model matrix
In order to compute lighting, you have to use the inverse of transpose of the 3x3 part of the model matrix (which I will call 3x3 matrix as from now) to get the light direction in model coordinates.
a) In Fast3D, the model matrix is first loaded and then data are changed by using some special RSP (LTV, STV) instructions which should help to do so. It requires some scratch DMEM space to do so and about 16 instructions to have in the RSP one 3x3 matrix.
b) In F3DEX2, the 3x3 matrix is loaded directly from DMEM where the model matrix is stored. There is no need to load the model matrix and then transpose it and no scratch DMEM space is necessary. It provides again one 3x3 matrix.
My implementation is different as my goal is also to implement point lighting which is totally missing in the original F3D. In order to do point lighting, it is required to have the model matrix and the 3x3 matrix and thus for each vertex to be processed.
I do underline “for each vertex” as with directional lighting, till you got a new matrix or new lights, you may use the same computed light directions in model coordinates for each vertex to be processed, meaning that the 3x3 matrix would have to be used only once for doing such computations.
So point lighting requires to have the model matrix and the 3x3 matrix at the same time. On top of it, the RSP vector instructions (VU) can hold 8 16 bits lanes, which is used to processed two vertex at once. If for F3D, one 3x3 matrix was sufficient to compute a limited number of light directions in model coordinates, it cannot be the case when processing point lighting per vertex. So it does mean that the VU must hold two model matrix and 3x3 matrix, duplicated on the same vectors.
It took me a significant time to find a good solution for those requirements. After loading the model matrix twice (as for Fast3D), I take me only 22 instructions to get the two 3x3 matrix as desired. In this respect I used a RSP instruction in a different way than usually intended, VMRG and register VCC.
Additionally I have replicated the way the normalization of the light direction in model coordinates as in F3DEX2, more efficient than Fast3D.
II) Separate lighting computations from vertex processing
As explained, lighting/reflection mapping computations is done per vertex. If it may be fine with directional light, with point lighting it would mean reloading over and over, for each vertex, the model and 3x3 matrix as the RSP is unable to hold all those data and performing the rest of vertex processing tasks (i.e. multiply vertices with ModelViewProjection (MVP) matrix, clipping, fog, etc.). In order to do so, lighting computations has to be done separately. It does mean to break the Vtx structure as per gbi.h where vertex coordinates, colors/normals and textures coordinates were tied together.
There is many advantages to do so:
- You may load with flexibility vertices, colors/normals and texture coordinates, in separated buffers.
- You may avoid duplications of data (same color, same normal, same texture coordinates)
- You can do computations more efficiently for/on those buffers.
There is of course some inconveniences:
- You need to have new structures for colors/normals and texture coordinates. As the vectors in RSP are 8 lanes of 16 bits, structures will have to hold 2 colors/normals or textures coordinates. It is also required to have those structure 64 bits aligned due to the RSP DMA engine.
- GBI compatibility is broken. N64 programmers have to rewrite a portion of their code.
- Flexibility comes with more complexity for the programmers (multiplication of the GBI macros)
Many customized microcodes has done, at least partially, such a change (i.e. Perfect Dark or Indiana Jones)
So I got to create a new command to compute lighting and texture coordinates, along with multiple GBI macros.
I took advantages of having two 3x3 matrix (instead of 1) to compute lighting directions in model coordinates nearly 2 times faster. I separated directional lighting computation from texture coordinates computation entirely in order to avoid many unnecessary computations done by F3D. If the computation speed on lighting should remain more or less the same, the one on S&T for reflection mapping has been improved.
Of course I have considered as well to optimize the DMEM space used for those computations.
In the original Fast3D, each light structure used 32 bits. In F3DX2, it uses 24 bits. In my implementation, it uses only 16 bits. In this respect I got to change the way some libultra functions were returning data (guLookAtHilite, guLookAtReflect, guPosLight, guPosLightHilite). I could notice something which could be optimized in those functions (and some others as well) and which I have implemented. According to the MIPS 4200 documentation, SQRT and DIV CPU instructions are slow so where possible it is better to use an inverse fast square root instead. (Thinking about it, it is noticeable that reflection mapping was not used in a lot of games, likely because reflection mapping was actually too demanding).
Now obviously you have to link vertex, output colors and output texture coordinates. I could have decided to use the 2 remaining byte of the vertex structure, currently padding, to do so. However I took the approach to do so at triangle level as it is at this level that the control should be given to the graphic designer. It also helps to avoid unnecessary duplications of data in buffers. However a careful approach have to be taken when some advanced lighting features will be used, as reflection mapping or point lighting, where all is interdependent at vertex level and not triangle level.
It means that I got to rewrite entirely the G_TRI command. Additionally G_VTX commands got to be adapted as lighting and texture coordinates computations are not done anymore with this command.
Now I will focus on point lighting and what I wanted to create from the very beginning of this project: implement a circular buffer to store transformed vertex and generate strip triangles.
Side note: I will update soon my previous article as I continued to do some interesting changes for G_BRANCH_Z and G_CULLDL.
HELP NEEDED: After implementation of point lighting, triangle strips and optimizing DMEM, it will be time to test the microcode. In this respect I am calling for people (with sufficient programming skills of course) to help out in this respect. It can be done in different ways: adapt SDK N64 demos, adapt already written homebrew games, write new test demos using the new features of the microcode, etc. Contact me on N64brew discord channel in this respect. Thanks!