Bjoern Knafla, Paralellisation PhD. Researcher and additionally Julien Hamaide, Markus Mohr on the panel.
Bjoern will provide an overview of the concepts and techniques that are the most commonly used for parallelizing code in the games industry, and present some of his own results applying multi-threading to a large crowd simulation.
This talk either seriously lacked presentation time or timing, or maybe both. It kind of just stops 🙁 I didn’t see any mention of large crowd simulation sadly, since a practical example of some of it would have been a good idea, especially since the subject really wasn’t made very interesting to me (obviously I need to do it first I guess 🙂 ).
There is a cap on the speed of a single processor core, so there is a need for parallel programming with AI.
However it is not easy. Errors occur from race conditions (where one CPU has different data then another) and deadlocks (a waiting on b waiting on a). Performance can also go down the toilet if there are contention point problems.
The architecture is one core, one L1 cache for each core, 2 cores sharing L2 cache, and then access to memory. Can be many cores – Larabee has 32 cores, while graphics cards have a massive amount of parallel processing pipes.
If you have a lot of cores, you have the memory wall – latency of accessing memory with each processor, and bandwidth of throughput from the memory. RAM is lacking behind CPU’s in the speed increases in both.
Combined problem if you have to share data – need to synchronise data between core’s own data caches, which can be very slow.
Need to try and maximise the throughput. To better use bandwidth is to both reuse the CPU cache as much as possible and have regular accesses to memory. To better use cache storage, you want to keep things per-core, minimising sharing, have a minimum amount of times to synchronise so can reuse cache.
Looking at examples: First, Asynchronous calls to something (such as calling pathfinding) can use a lot of cores, when every agent calls pathfinding at different times with a new thread, with no requirement for when it ends. The problem is you don’t know when the task finishes, which is an issue. There is also the problem if there are not many, or only one core – so there is a lot of software threads.
An example of how to deal with asynchronous calls is to use a task pool. This library tool gives the tasks to the task queue. This monitors its queues and decides to take a task and schedule it to one of the threads. This generally makes as many software threads as hardware threads. This is all managed automatically, and not dealt with directly.
This means you don’t need to describe raw threads, but instead tasks. The tasks still need synchronisation so a task can be sorted when finished. One method is checking at the originator of the task every time to see “are you done yet”.
Asynchronous has pros – it scales with the task pool, and potentially better memory locality. However it doesn’t get syncing right, and determinism needs care, since there can be side effects (as always, but in this case especially).
Second example; Parallel agents – each agent is ordered into the task pool separately. However, there is problems accessing the other agents since they are not sequential – lots of synchronisation and checks of world data. There is no order anymore, such as game logic problems like “Who gets the rocket launcher”. Need to arbitrate the calls to an entity (which are all stored from the agents saying what they want to do with it) and resolve conflicts. You also lose determinism – important for game replays for instance.
You split the update into two stages – collection stage which agents can write private data, but only request to change public data. The Instance Update Stage runs after which updates what should be changed from their inputs. Very clear synchronisation.
This method has the pros of guiding parallelisation and adding scale, with implicit scaling. However it has problems with getting determinism, and syncing right, bad locality, and hard to implement.
Can also split up the systems that agent use in parallel – pathing, sensor system, AI logic system into different threads, similar to how the agents are split up.
You have all pros: lean data structures, locality, scaling, deterministic, implicit synchronisation, explicit context, possibilities: “shaders”. Along with cons: Implementation effort is high, internal low-level parallel, syncing problems and deferring when necessary – no “immediately”.
There is also stream processing – not enough time to look at in the session however.
Julien Hamaide, Bjoern Knafla, Markus Mohr
This panel repeated a lot of need for simplicity, that low level systems were vastly different and that designing for it in mind early is important. I didn’t note who answered what, and really only took down two questions.
How about different platforms?
There are different optimisations for different systems so the tools need the low level optimisations that are needed.
Shouldn’t have to deal with platform specific parallel issues.
Going from a single thread system, into at the end a multi threaded system is difficult. Going from just early levels having multiple threads in mind is better since you can always scale it back to a single core later anyway.
Plan that you can scale but make it error free first.
Valve worked from a single threaded engine. So firstly, everything shared was locked. Then they started removing locks as they looked at systems they can make multithreaded. First question you should ask is why is this memory shared? Put all the readable data in one part and writeable in another one (for instance, pathfinding with things on top).
Refactoring code to be cleaner is important for this, so it is a good thing overall right?
Always easier to start from a clean sheet to work on.
Good idea to have simple code, since it’s easier to understand.