Improvements to Blueprint Load Performance
Hello there :)
This isn't the usual feedback request, but I wasn't quite sure about how to reach out appropriately--personally, I feel like this doesn't belong here, but it is what it is. I've spent the past hour thoroughly testing to make sure I've got a good baseline to compare against.
While briefly profiling the game just to get an insight as to why blueprints take so long to load on my hardware, I found some interesting results. I'll provide insights, my thought process as well as a trivial solution to the problem.
While loading a 45k block blueprint, approx. 31 seconds of CPU time is spent in the "Keen::VRage::Library::Memory::buffer`1::get_DataStart" procedure with a lesser 10 seconds of CPU time being consumed by "get_Data".
They have a baseline usage of 500ms each over the entire profiling session!!!
After viewing the respective procedures in both a disassembler and a decompiler, some causes were immediately apparent:
get_DataStart:
1. The procedure is massive; none of the debug assertion code should be here.
2. The procedure is not inlined, and the JIT compiler will therefore generate useless complexity.
get_Data:
1. While not immediately obvious, a disassembler will reveal that the procedure is never inlined, despite the fact that the only call in the procedure is marked with the "AggressiveInlining" attribute.
After marking both procedures with the "AggressiveInlining" attribute and changing the code with those simple observations in mind (removing the assertion-related code), the overall performance in this regard improved drastically.
That's a massive improvement under the same conditions!
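For anyone who wants to see the shape of the change, here is a minimal sketch. It is not the game's actual code: "SketchBuffer", its fields, and the array-backed layout are all made up for illustration. The only real APIs used are MethodImplOptions.AggressiveInlining, Debug.Assert, and AsSpan.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the engine's buffer type. The name, fields and
// array-backed storage are assumptions, not the real implementation.
public readonly struct SketchBuffer<T>
{
    private readonly T[] _storage;
    private readonly int _offset;

    public SketchBuffer(T[] storage, int offset)
    {
        _storage = storage;
        _offset = offset;
    }

    public int DataStart
    {
        // Ask the JIT to inline this accessor at its call sites.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            // Debug.Assert is marked [Conditional("DEBUG")], so this call is
            // omitted entirely when DEBUG is not defined (release builds).
            Debug.Assert(_storage != null, "buffer is not initialized");
            return _offset;
        }
    }

    public Span<T> Data
    {
        // Small body plus the inlining hint keeps the accessor cheap on hot paths.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _storage.AsSpan(_offset);
    }
}
```

The idea is simply that the accessor body stays tiny in release builds, which gives the JIT a realistic chance of honoring the inlining hint. AggressiveInlining is a request, not a guarantee, so it is still worth checking the disassembly afterwards.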
NOTE: The information below is likely overkill given the minimal performance gain, but I've decided to include it regardless.
Here we see the new "get_DataStart" implementation in all of its... "glory."
Considering that this is a hot spot, telling the CLR to try its best to emit code marginally better than what a toddler could write (via AggressiveOptimization) is probably a good idea...
Here, without "AggressiveOptimization," the compiler seems to struggle to find places to inline the procedure, which makes sense, as the procedure's body is quite large.
After applying "AggressiveOptimization," the runtime pulled itself together and actually emitted code that doesn't entirely suck.
The performance gain here seems to be about half a second saved on each of "ValidateOverlappingBlocks" and "Contains."
It really is quite unfortunate that, in a lot of cases, you actually need to tell the JIT compiler to perform basic optimizations.
If anything needs to be discussed further, you can reach me on Discord @hoshikonnu
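As a rough illustration of how the attribute is applied, here is a minimal sketch. "HotPathSketch" and this "Contains" are hypothetical and have nothing to do with the game's own "Contains"; the only real API shown is MethodImplOptions.AggressiveOptimization.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical hot-path helper, for illustration only.
public static class HotPathSketch
{
    // AggressiveOptimization asks the runtime to compile this method with
    // full optimizations right away instead of starting at a lower tier.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static bool Contains(ReadOnlySpan<int> cells, int value)
    {
        for (int i = 0; i < cells.Length; i++)
        {
            if (cells[i] == value)
            {
                return true;
            }
        }
        return false;
    }
}
```

Because the method skips the lower compilation tiers, it is best reserved for methods the profiler has already shown to be hot, and the effect should be measured rather than assumed.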
Thanks!