Viewport FX Development Diary
Monday, May 21st
My first targets are the deprecated geometry functions. These include Begin, Vertex, End, and Rect. I am finding and classifying each "doodle", which is what I'm calling any reasonably self-contained chunk of OpenGL drawing code.
The first major classification distinguishes "heavy doodles" from "light doodles". A heavy doodle may contain millions of vertexes, while it would be very surprising to find a light doodle with more than a couple hundred (and most contain only a handful). Examples of heavy doodles are mesh objects and the mesh used in the UV editor. Light doodles include key frame markers on the timeline and lots of other simple jots and scribbles used to indicate things in the UI.
Some light doodles, such as text rendering, may require special attention because in some situations they do generate a fairly large amount of drawing (imagine a screen full of text with drop shadows). However, most light doodles are never going to be a bottleneck. For that reason, over the next couple of days and for light doodles only, I'm going to replace Begin, Vertex, and End with a limited but source-compatible version. The replacements will function similarly to Begin, Vertex, and End, and although they will have all the functionality that Blender needs, they will not be identical to their original OpenGL counterparts.
The justification is that much of the code for drawing even light doodles is complicated and it is probably not worth it to rewrite it to use vertex arrays directly for what will be a negligible performance gain.
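As a rough sketch of what such a replacement could look like: vertices are staged in a small buffer between begin and end, and the whole batch is submitted once at the end. Every name here (ImmBuffer, imm_begin, imm_vertex3f, imm_end, IMM_MAX_VERTS) is hypothetical and is not the interface I committed. The draw call is only counted rather than issued, so the sketch runs without an OpenGL context.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not Blender's actual interface. */
#define IMM_MAX_VERTS 256 /* light doodles rarely need more than a few hundred */

typedef struct ImmBuffer {
    float xyz[IMM_MAX_VERTS][3]; /* staged vertex positions */
    size_t count;                /* vertices staged since imm_begin */
    int mode;                    /* primitive type passed to imm_begin */
    int in_begin;                /* nonzero between imm_begin and imm_end */
    size_t draw_calls;           /* number of batches submitted so far */
    size_t last_draw_count;      /* vertex count of the most recent batch */
} ImmBuffer;

static void imm_begin(ImmBuffer *imm, int mode)
{
    imm->mode = mode;
    imm->count = 0;
    imm->in_begin = 1;
}

/* Stages one vertex. Returns 1 on success, 0 if called outside
 * begin/end or if the staging buffer is full. */
static int imm_vertex3f(ImmBuffer *imm, float x, float y, float z)
{
    if (!imm->in_begin || imm->count >= IMM_MAX_VERTS)
        return 0;
    imm->xyz[imm->count][0] = x;
    imm->xyz[imm->count][1] = y;
    imm->xyz[imm->count][2] = z;
    imm->count++;
    return 1;
}

static void imm_end(ImmBuffer *imm)
{
    /* A real version would point a vertex array at imm->xyz and issue
     * one draw call here, e.g. glDrawArrays(mode, 0, count). This
     * sketch only records that a batch would have been drawn. */
    if (imm->in_begin && imm->count > 0) {
        imm->draw_calls++;
        imm->last_draw_count = imm->count;
    }
    imm->in_begin = 0;
}
```

Calling code keeps the familiar begin, vertex, end shape, which is the point: existing light doodles would need little more than a rename.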
Here is my summary of User:Jwilkins/GSoC2012/Blender Doodles.
Tuesday, May 22nd
The notion that an OpenGL application is “wrong” to ever use immediate mode is overzealous. The OpenGL 3.0 specification has even gone so far as to mark immediate mode in OpenGL for “deprecation” (whatever that means!); such extremism is counter-productive and foolish.
Mark J. Kilgard
Graphics Software Engineer
Committed a preliminary set of functions to replace the deprecated immediate mode functions. These functions are for light doodles where the trouble of rewriting routines to use vertex arrays would be a waste of programmer effort. You can take a look at the interface here.
Several things to note. The first is that it does not copy every single function. Only GLfloat and GLubyte types are supported and only 2, 3, and occasionally 4 dimensional vectors (and certainly not obscure things like edge flags!). Even this list may be whittled down further as code is actually ported to use this interface. These functions were chosen because they are what Blender currently actually uses (minus functions that use GLdouble or other types). The second thing to note is the set of functions that tell the immediate mode exactly what kind of vertex data you will be sending. This keeps the implementation from having to be prepared for anything.
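To illustrate why declaring the format up front helps, here is a minimal sketch of a format description and the per-vertex stride that follows from it. The struct layout and the names ImmFormat and imm_format_stride are assumptions made for illustration, and the sketch ignores any alignment padding a real implementation might add.

```c
#include <stddef.h>

/* Hypothetical format description: each field is a component count,
 * with 0 meaning the attribute is not sent at all. */
typedef struct ImmFormat {
    unsigned char vertex_size;   /* 2, 3, or 4 floats per position */
    unsigned char normal_size;   /* 0 or 3 floats */
    unsigned char texcoord_size; /* 0, or 2 to 4 floats */
    unsigned char color_size;    /* 0, or 3 to 4 unsigned bytes */
} ImmFormat;

/* Bytes needed per vertex once the format is known. With this fixed,
 * the staging buffer can be laid out once instead of being ready for
 * any combination of attributes. */
static size_t imm_format_stride(const ImmFormat *f)
{
    size_t floats = (size_t)f->vertex_size + f->normal_size + f->texcoord_size;
    return floats * sizeof(float) + f->color_size * sizeof(unsigned char);
}
```

For example, a 3-float position with a 2-float texture coordinate and a 4-byte color needs five floats plus four bytes per vertex.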
As I have a chance to profile this code, if it turns out to actually matter, I may reduce the flexibility even further by choosing a limited set of vertex formats that are the fastest and then rewrite any code that does not quite fit. Hopefully though this code will be good enough and will just work as a replacement so I won't have to spend a great deal of time on it after this week.
Wednesday, May 23rd
Towards More Efficient Font Drawing
Swiss-cheese now renders whole strings as single batches instead of single letters.
Blender used one Begin/End call for each letter. Granted, a letter is sometimes anti-aliased or has a drop shadow, and this adds more geometry to each letter, but that is still at most about 50 quads per letter. Now, in swiss-cheese, as long as all the letters in a string are located in a single texture map, entire strings can be sent to OpenGL at once. In some cases, such as the Python Console or the Text Editor, this could provide a significant speed up.
There is a terrible inefficiency in the drawing of syntax highlighted text. When in highlighting mode the text editor only sends one character at a time to blf. The result is that, even with efforts to minimize the overhead of using vertex arrays or VBOs, all of that overhead will have to be paid once per character! I cannot see a way out of rewriting that part of the text editor to send whole strings.
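The batching rule can be sketched as follows: walk the string and start a new batch only when the next glyph lives in a different texture. The lookup glyph_texture_page is a made-up stand-in (here ASCII is simply assumed to sit on one page and everything else on another); the real glyph cache is not shown in this diary.

```c
#include <stddef.h>

/* Hypothetical lookup: which texture page holds the glyph for c.
 * Purely for illustration, ASCII is on page 0 and the rest on page 1. */
static int glyph_texture_page(unsigned char c)
{
    return c < 128 ? 0 : 1;
}

/* Counts how many draw batches a string needs when consecutive glyphs
 * that share a texture page are submitted together. */
static size_t count_batches(const unsigned char *text, size_t len)
{
    size_t batches = 0;
    int current_page = -1; /* no page bound yet */
    size_t i;

    for (i = 0; i < len; i++) {
        int page = glyph_texture_page(text[i]);
        if (page != current_page) {
            batches++; /* texture changes, so the previous batch must end */
            current_page = page;
        }
    }
    return batches;
}
```

Under this assumption a plain ASCII string is one batch regardless of length, where the per-letter approach would have needed one batch per character.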
OpenGL Usage Policy
I've modified how Blender handles changing the PixelStore state.
Going to start a User:Jwilkins/GSoC2012/Blender OpenGL Usage Policy. The first thing to go on the list is that PixelStore values should be restored to OpenGL defaults after you use them. This seems like a sane policy because in most places Blender already assumes that any values it has not changed are still at their defaults. Saving and restoring takes two API calls: one to get the original value and one to restore whatever was there before. Restoring the default takes only one call. PushClientAttrib requires two calls and is almost certainly more heavyweight, although it could be a win if multiple values are changed (never mind all that, PushClientAttrib is deprecated). The verdict is to just change the value and then change it back.
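The call counts can be compared with a small stand-in for the GL state. The functions fake_pixel_store and fake_get_pixel_store are mock replacements for glPixelStorei and glGetIntegerv so the sketch runs without a context; the value 4 is OpenGL's documented default for GL_UNPACK_ALIGNMENT.

```c
/* Mock state and entry points, standing in for GL_UNPACK_ALIGNMENT,
 * glPixelStorei, and glGetIntegerv. Not real OpenGL calls. */
static int unpack_alignment = 4; /* OpenGL's default unpack alignment */
static int api_calls = 0;

static void fake_pixel_store(int value)
{
    unpack_alignment = value;
    api_calls++;
}

static int fake_get_pixel_store(void)
{
    api_calls++;
    return unpack_alignment;
}

/* Save and restore: a query and a restore on top of the set itself. */
static void upload_with_save_restore(void)
{
    int saved = fake_get_pixel_store();
    fake_pixel_store(1);
    /* ... upload tightly packed pixels here ... */
    fake_pixel_store(saved);
}

/* The policy: set the value, use it, then put the default back. */
static void upload_with_default_restore(void)
{
    fake_pixel_store(1);
    /* ... upload tightly packed pixels here ... */
    fake_pixel_store(4);
}
```

Both versions leave the alignment at 4 when the caller started from defaults, but the second does it with one call fewer and no round trip to query state.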
I tested the User:Jwilkins/GSoC2012/Immediate Mode Replacement interface, which I'm calling gpuImmediate for short. I also made some key decisions about how gpuImmediate will work.
Decided today that it should be OK to use state setting functions between gpuBegin and gpuEnd. This should be OK as long as one does not mess with any geometry rendering state of any kind from any version of OpenGL. In OpenGL it is illegal to set state (except the "current" vertex values) between Begin and End. This is probably just to give the driver programmer one less thing to worry about. In the case of gpuImmediate however as long as one does not change geometry state it should be OK to change other state. The key difference is that whatever state is in effect when gpuEnd is called will be what is used. This means that if one wants state changes to affect geometry then gpuEnd needs to be called first.
To use the compatibility layer, include "GPU_compatibility.h". This will include the gpuImmediate interface as well as define a bunch of macros to catch use of the replaced functionality. The result of using glBegin, for example, should be a compiler error stating that the symbol DO_NOT_USE_glBegin was not found.
To test gpuImmediate I replaced font rendering. Now all text in swiss-cheese is rendered using the new code. This is a very visible portion of the interface, so I figured that if something was wrong it would reveal itself quickly. Now, except for matrix transformations, the blenfont module uses no deprecated functions.
It occurred to me that gpuBegin/gpuEnd is doing too much work in situations where one may need to make many state changes. To make these functions lighter I think something like gpuImmediateLock/gpuImmediateUnlock may be needed. These functions would set up and tear down vertex array state just once per vertex format. Right now gpuBegin/gpuEnd sets up and tears down vertex arrays every time they are called. I tried to fix this by using a flag to indicate the state had changed, but that still needed some way to indicate drawing was completed. If gpuImmediate could completely monopolize the geometry state this may not be such a bad thing, but since it cannot at this point (or ever) a function would be needed to indicate completion. It just seems much simpler to have bookend functions that handle this task.
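The contents of the header are not reproduced here, so the following is only a guess at the shape of one such macro. It redirects the old name to an undeclared symbol, and the two stringify helpers (added for this sketch) show what the preprocessor turns the name into without triggering the error.

```c
/* Assumed shape of one redirecting macro; the real header may differ. */
#define glBegin DO_NOT_USE_glBegin

/* Two-level stringification so the argument is macro-expanded first. */
#define STRINGIFY_RAW(x) #x
#define STRINGIFY(x) STRINGIFY_RAW(x)

/* Returns the text the preprocessor substitutes for the old name. */
static const char *what_glBegin_became(void)
{
    return STRINGIFY(glBegin);
}
```

Any remaining call site written as glBegin(...) is then compiled as a call to DO_NOT_USE_glBegin(...), which is declared nowhere, so the build fails and names the culprit.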
So, tomorrow I'll implement gpuImmediateLock/gpuImmediateUnlock.
Here are some things to watch out for because they could be caused by my changes to Blender.
Thursday, May 24th
Text Editor /facepalm
"I don't think you should spend too much time on the text editor." - general mood of the responses when I asked about this on IRC.
I ended up spending way more time fiddling with the text editor than I wanted to. The good news is that without such a "pathological case" I might not have been driven to profile and improve the performance of gpuImmediate. However there isn't much that can be done when over half of CPU time is being spent in the kernel handling OpenGL calls, so I ended up having to rewrite some parts of the text editor to send fewer batches containing more characters. I'll have to revisit it later because there are 3 different places I had to rework and they are all almost identical. But as long as it works I'm going to move on.
Instead of trying to be clever, let the programmer say when things are ready and when they are completed.
Calling gpuImmediateLock means that the vertex format has been set and the buffer can be set up, while gpuImmediateUnlock means that there is nothing left to draw and the OpenGL buffer state can be returned to defaults. That doesn't mean to free buffers; those are allocated when gpuNewImmediate is called and are kept until gpuDeleteImmediate is used to destroy them.
This is meant to improve the situation where gpuBegin had to set up the buffer state every time it was called. You can lock multiple times with little overhead, but you must unlock the same number of times before the state is freed. I did this because I did not want to have to put gpuImmediateLock/gpuImmediateUnlock around every single call to BLF_draw, but putting it inside that function did not solve the problem of there being too much overhead in drawing a single character. So I created BLF_draw_begin/BLF_draw_end, which can be placed around large blocks of text rendering, but if you don't have a large block you can still just use BLF_draw and it will lock things for you.
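The counting behaviour described above can be sketched like this. The names ImmLockState, imm_lock, imm_unlock, and draw_text are hypothetical stand-ins for gpuImmediateLock, gpuImmediateUnlock, and BLF_draw, and the setup and teardown work is only counted, not performed.

```c
#include <assert.h>

/* Hypothetical stand-in for the lock bookkeeping. */
typedef struct ImmLockState {
    int lock_count; /* nesting depth of lock calls */
    int setups;     /* times the buffer state was set up */
    int teardowns;  /* times it was returned to defaults */
} ImmLockState;

static void imm_lock(ImmLockState *s)
{
    if (s->lock_count == 0)
        s->setups++; /* a real version would bind vertex arrays here */
    s->lock_count++;
}

static void imm_unlock(ImmLockState *s)
{
    assert(s->lock_count > 0); /* unlock without a matching lock */
    s->lock_count--;
    if (s->lock_count == 0)
        s->teardowns++; /* a real version would restore defaults here */
}

/* A draw helper that locks for itself, so callers with only one
 * string to draw do not need to bracket it. */
static void draw_text(ImmLockState *s)
{
    imm_lock(s);
    /* ... stage and submit the glyphs ... */
    imm_unlock(s);
}
```

Three unbracketed draw_text calls pay for setup three times, while the same three calls inside one outer imm_lock/imm_unlock pair pay for it once, since the inner locks only bump the counter.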
Turns out I cannot really improve upon memcpy.
I was not entirely convinced that memcpy was the best way to transfer vertexes from the staging area to the vertex buffer, so I broke out AMD CodeAnalyst and did some profiling. I ended up streamlining the code quite a bit, but was not able to come up with something faster than memcpy. The only thing I can think of to improve gpu_copy_vertex would be to make it polymorphic and specialize it for each case that exists in Blender. It isn't really that important however at this point. I gotta remember this is for light doodles and that bad performance should only come from Begin/End code that was badly written in the first place.
My AMD CodeAnalyst works but is old, while my gDEBugger is completely broken but I really need it.
I had to go all the way back to the version of AMD CodeAnalyst that I used when optimizing sculpt in the Summer of 2010 to find a version that did not crash. I'm going to have to do a binary search on the versions if I want the latest version that works. But what I have works, so I'll probably just leave it.
I was having no problems debugging Blender OpenGL using gDEBugger last week, but now that I need to use it for real it seems to cause OpenGL to fail with GL_OUT_OF_MEMORY errors on most of the drawing calls. I had planned on using gDEBugger to find deprecated GLenum usage as well as performance problems, so I have to find a solution soon.
Friday, May 25th
Did not feel very productive today. I did get gDEBugger working. It seemed to have something to do with multi-monitor compatibility settings. This also fixed my problem with Blender only wanting to work on my primary monitor. I ended up using gDEBugger to discover and remove some redundant state changes and fix some bugs.
Next week I will try to completely replace all uses of the GL immediate mode with gpuImmediate and start implementation of gpuRetained to handle situations where geometry could be reused or partially modified. If it starts to look like too much work then I'm going to shelve the porting work and move on to other interfaces. My goal is to port enough code to test the new library, and if it works well then porting the rest of Blender to use it can be done in the Fall. It is more important to cover all the use cases than it is to fix everything now.
I'm so ready for the weekend. My head hurts from all the coding.
Sunday, May 27th
Not working today, but I did notice that Blender slows down a lot when displaying edge lengths in the viewport. Others reported that the slowdown is noticeable with up to 10,000 edges, while 50k edges gives 1 fps or is not usable at all. However, I was running swiss-cheese and 3k edges were starting to chug pretty badly. I took it as a lesson that even "light doodles" can become surprisingly heavy at times. Drawing an edge label is actually more work than drawing an edge.
Later today, had a clever idea for drawing edge lengths really fast. Pass in the edge lengths as a VBO and have the shader figure out the texture coordinates from the digits. It may even be possible to render them as point sprites instead of quads.
However, more realistically it is probably better to just limit how many edge lengths will be drawn. Who is going to get any use out of 50,000 numeric labels??
Monday, May 28th - Memorial Day
Tomorrow I'm going to try replacing all of Begin/End with gpuBegin/gpuEnd. I'm not sure this is going to work very well. If it doesn't then I'm going to just make a patch and save it for later. Unfortunately I expect Blender to be slower after this change, but I do not know this for sure, so I just need to do it and try. One problem may be all the checks and safety I've written into the gpuImmediate functions. I may need to be able to turn these off separately using a WITH_GPU_SAFETY flag or something.
There are a couple of changes I realize I need to make to gpuImmediateLock/gpuImmediateUnlock to make things easier.
1. Instead of banning vertex format changes inside the scope of gpuImmediateLock, just assert that the format does not require more space. This will allow for nesting of compatible drawing commands more easily. Not sure how badly this will be needed.
2. When the lock count falls back to zero gpuImmediateUnlock should zero out the vertex format automatically. Otherwise the code will be littered with commands to set the vertex format back to defaults.
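Both changes can be sketched together. All names here are hypothetical, and one deliberate difference from the plan above is that the format check returns 0 instead of asserting, purely so the rejection can be observed in a small example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified format: component counts, 0 = not sent. */
typedef struct ImmFormat {
    unsigned char vertex_size;   /* floats per position */
    unsigned char texcoord_size; /* floats per texture coordinate */
    unsigned char color_size;    /* unsigned bytes per color */
} ImmFormat;

typedef struct Imm {
    ImmFormat format;
    size_t locked_stride; /* bytes per vertex reserved at first lock */
    int lock_count;
} Imm;

static size_t imm_stride(const ImmFormat *f)
{
    return ((size_t)f->vertex_size + f->texcoord_size) * sizeof(float)
           + f->color_size;
}

static void imm_lock(Imm *imm)
{
    if (imm->lock_count == 0)
        imm->locked_stride = imm_stride(&imm->format);
    imm->lock_count++;
}

/* Change 1: inside a lock, accept any format that fits in the space
 * already reserved. Returns 1 if accepted, 0 if it needs more room. */
static int imm_set_format(Imm *imm, ImmFormat f)
{
    if (imm->lock_count > 0 && imm_stride(&f) > imm->locked_stride)
        return 0;
    imm->format = f;
    return 1;
}

/* Change 2: the final unlock zeroes the format, so callers do not
 * have to reset it by hand. */
static void imm_unlock(Imm *imm)
{
    assert(imm->lock_count > 0);
    if (--imm->lock_count == 0) {
        memset(&imm->format, 0, sizeof imm->format);
        imm->locked_stride = 0;
    }
}
```

With this arrangement a nested drawing routine that only needs 2D positions can run inside a lock taken for textured, colored 3D vertices, but not the other way around.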
Tuesday, May 29th
Ankh SVN versus Tortoise SVN
I'm using both of these tools for managing SVN. I've made a few mistakes because I'm learning Ankh, but it provides integration with MSVC, which so far has been really helpful. The first mistake was accidentally committing more files than I intended because I forgot to de-select them. The second is more serious: I did a merge and then committed it using Ankh, but Ankh only sees the files I've configured to build in MSVC using CMake, so it leaves half of the merge on the disk. The first time, I naively thought that Tortoise had gone crazy when it was telling me there were more files to commit and I just reverted them (dumb). I paid the price this time by having to fix up the next merge, which had not gone very well. It's all good now, I just need to remember to use Tortoise for merges.
For a lark I converted a bunch of heavy doodles to use gpuImmediate. At first it was disheartening because performance dropped drastically, but then I noticed a few mistakes I had made and after those were fixed up things ran OK unless you applied a sub-surface modifier. There was a big difference between the performance of these three sets of steps:
- Make a subsurf out of a high poly object:
  - Add cube
  - Add subsurf
  - Set view to 5
  - Add another subsurf
- Make a subsurf out of a low poly object:
  - Add cube
  - Add subsurf
  - Set view to 6
- Just have a plain high poly object:
  - Add cube
  - Add subsurf
  - Set view to 6
Theoretically, these should all perform the same because they all have the same number of polygons. However the first case had horrendous performance before I adjusted the gpuImmediate version of the rendering code to perform better. Keep in mind I could have made these same adjustments to the glBegin/glEnd code, it just turns out that gpuImmediate is more sensitive to bad usage (trunk shows the same pattern, it is just not as pronounced). The last two cases performed almost identically. Unfortunately, even after my optimizations the first case still has problems on my machine (but no worse than trunk), but I cannot track it down to a problem with OpenGL, so I believe it may be a CPU bottleneck.
To try this same test on your own computer you may need to start with a subdivided cube. This test only makes about 150k polygons, so it may not tax your CPU or GPU at all until you go higher.
Wednesday, May 30th
An advanced build option that enables a bunch of checks that would ordinarily be too expensive or that should never fire in release code.
Spent the day formalizing the GPU safety option. Right now it just checks that gpuImmediate functions are properly nested and initialized, but I can imagine it being used for more. I really like to put paranoid checks everywhere in my code, but they can be detrimental to performance and also to aesthetics. Hopefully the work put in today will pay off by letting me add as many checks as I need without a performance penalty, and without obscuring what the code is supposed to be doing by distracting the reader with a lot of apparently superfluous code.
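One common way to get checks that cost nothing when disabled is a macro that compiles away. The sketch below assumes a flag called WITH_GPU_SAFETY and a made-up GPU_CHECK macro; neither is necessarily what the real build option looks like. A failed check is counted and reported rather than aborting, so the behaviour is easy to observe.

```c
#include <stdio.h>

#ifndef WITH_GPU_SAFETY
#define WITH_GPU_SAFETY 1 /* assumed flag name; enabled for this sketch */
#endif

static int gpu_safety_failures = 0;

#if WITH_GPU_SAFETY
/* Report and count a failed check. */
#define GPU_CHECK(cond)                                               \
    do {                                                              \
        if (!(cond)) {                                                \
            gpu_safety_failures++;                                    \
            fprintf(stderr, "GPU safety: %s failed at %s:%d\n",       \
                    #cond, __FILE__, __LINE__);                       \
        }                                                             \
    } while (0)
#else
/* Compiles to nothing, so release builds pay no cost. */
#define GPU_CHECK(cond) ((void)0)
#endif

static int in_begin = 0;

/* Hypothetical begin/end pair with nesting checks. */
static void imm_begin(void)
{
    GPU_CHECK(!in_begin); /* begin inside begin is a nesting error */
    in_begin = 1;
}

static void imm_end(void)
{
    GPU_CHECK(in_begin); /* end without a matching begin */
    in_begin = 0;
}
```

Because each check is a single macro line, the normal flow of the function stays readable, which is the aesthetic half of the goal.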
Sunday, June 3rd
I wrote so much on Thursday that I didn't see much need to update this diary. Had a discussion with Brecht that helped me refine my ideas on Friday. User:Jwilkins/GSoC2012/Viewport FX Design
Finalizing Replacement of Immediate Mode and Future Plans
The past couple of days have been spent actually replacing glBegin/glEnd with gpuBegin/gpuEnd. This is a rather tedious and time-consuming process! I'm making notes of patterns and common cases so I can factor them out. My original plan was to have all of the light doodles fixed up by the end of next week and then spend 3 weeks after that on the heavy doodles. The main focus during this time will be on geometry throughput. The second half of the summer of code will then be spent on other state (starting with deprecated state) and raising the level of abstraction.
Monday, June 4th
Utility Drawing Functions