Parallelism in AI: Multithreading Strategies and Opportunities for Multi-core Architectures

Julien Hamaide

Julien Hamaide presented a set of notes, recommendations and optimisations for multithreaded code – AI-specific in some cases. Generally SPU-specific, but applicable to other architectures too. This was above my level of expertise with C++ (so my notes might be off in places – check the slides), but interesting still.


Currently the AI runs in one thread. With lots of cores coming in the future, is it going to be the AI programmer’s job to split the task into smaller pieces? Yes.

Current consoles give us both homogeneous and heterogeneous architectures – the Xbox 360 (cores sharing one memory) versus the PS3 (separate local memory per core, accessed by DMA).

Splitting pathfinding and line-of-sight (LOS) checks into pieces is a good way to start.

Cache coherence is key – make sure there are no cache problems. Traffic on the memory bus between CPUs can cause clashes: if one core writes even a single byte, the entire cache line has to be invalidated and re-read by the other cores, otherwise there will be misses.

Separate read-only data from written data, keep the data as small as possible, and separate frequently read data from rarely read data – don’t put the name of an entity in the same cache line as its location, for example.
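As a minimal sketch of that hot/cold layout advice (the struct names and fields are my own illustration, not from the talk):

```cpp
#include <cstdint>

// Data read every frame (position, state) lives in its own tightly
// packed struct, away from data touched rarely (debug name, spawn info).
struct EntityHot {           // scanned by many systems each frame
    float x, y, z;           // position
    std::uint32_t state;     // current AI state id
};

struct EntityCold {          // touched rarely (tools, logging)
    char name[64];           // kept off the hot cache lines
    std::uint32_t spawn_frame;
};

// With a 64-byte cache line, four EntityHot records fit per line, so a
// position scan touches far fewer lines than an interleaved layout would.
static_assert(sizeof(EntityHot) == 16, "hot data stays tightly packed");
```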


When sharing data via DMA between CPUs, make sure it contains no pointers – an address cannot be guaranteed to mean the same thing when it reaches the second CPU. All shared data therefore needs to be independent of pointers.

Use an ID instead – it can be transformed back into a pointer using helper classes. Both types of architecture, homogeneous and heterogeneous, profit from this.
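A rough sketch of the “store an ID, not a pointer” idea: each core resolves the same ID against its own local table, so no raw pointer ever crosses a DMA boundary. The `EntityTable` class and its API here are my own assumptions, not the talk’s actual helpers.

```cpp
#include <cstdint>
#include <vector>

using EntityId = std::uint32_t;

struct Entity { float x = 0.f, y = 0.f; };

class EntityTable {
public:
    // Store an entity and hand back a stable, pointer-free id.
    EntityId Add(const Entity& e) {
        entities_.push_back(e);
        return static_cast<EntityId>(entities_.size() - 1);
    }
    // Resolve an id to a pointer valid only in THIS address space.
    Entity* Resolve(EntityId id) {
        return id < entities_.size() ? &entities_[id] : nullptr;
    }
private:
    std::vector<Entity> entities_;
};
```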

Avoid polymorphic classes – a virtual table pointer won’t be valid after a DMA transfer and would need to be patched on the other side, which is avoidable by keeping virtual functions out of transferred data.

Avoid ::malloc and ::new – they are typically implemented with locks, so heavy allocation from many threads loses performance to contention. You can still allocate memory in threads, just not through the system functions: a stack allocator is easy to implement lock-free, _alloca gives stack-based allocation, or pre-allocate result memory with reserve.
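A minimal sketch of the lock-free stack (bump) allocator mentioned: a single atomic offset into a pre-allocated buffer, so allocation is one fetch_add with no lock. Individual frees aren’t supported; the whole thing is reset at a known sync point such as end of frame. The details here are my assumptions, not Hamaide’s implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

class StackAllocator {
public:
    StackAllocator(void* buffer, std::size_t size)
        : base_(static_cast<std::uint8_t*>(buffer)), size_(size), offset_(0) {}

    void* Allocate(std::size_t bytes) {
        // Round up to 16 bytes so returned blocks stay aligned.
        bytes = (bytes + 15) & ~std::size_t{15};
        // One atomic add claims the range; no lock, no contention stalls.
        std::size_t old = offset_.fetch_add(bytes, std::memory_order_relaxed);
        if (old + bytes > size_) return nullptr;  // buffer exhausted
        return base_ + old;
    }

    // Reset everything at once, e.g. at a frame boundary.
    void Reset() { offset_.store(0, std::memory_order_relaxed); }

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::atomic<std::size_t> offset_;
};

// Small demo: two allocations succeed and don't overlap; an oversized
// request fails cleanly with nullptr.
inline bool StackAllocatorDemo() {
    static std::uint8_t buffer[64];
    StackAllocator a(buffer, sizeof(buffer));
    void* p1 = a.Allocate(10);                 // rounds up to 16
    void* p2 = a.Allocate(10);                 // next 16-byte block
    return p1 != nullptr && p2 != nullptr && p1 != p2 &&
           a.Allocate(64) == nullptr;          // only 32 bytes remain
}
```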

Lock-free or not? No – don’t go lock-free the first time. It is not easy to implement or debug, may blow up on ship day, doesn’t work well on x86, and doesn’t always give better performance anyway. Start with a lock-based system, then remove the locks where possible.

Techniques for multithreading:

  • Double buffering – keep two copies of (cache-friendly) task-specific data, with rules on when each can be read. This gives linearisability: the buffer swap is a single point where all effects appear to take place at once.
  • Message system – collect messages during a frame and dispatch them at the beginning of the next. When conflicting objects try to manipulate the same target, the queued messages are processed together: a response is posted to the one that succeeds and a failure noted to the other. A lock-free linked list works for sent messages; they are sent before objects update, so it is thread safe.
  • Job scheduling system – the Command pattern: sending a “pathfinding” command might create a “future” object, a response to a query that may only become available later – a task like pathfinding can take multiple frames and deliver its result in a later one. Execution of waiting threads can be blocked on an IsAvailable() check, which is useful for debugging.
  • Asynchronous request system – lots of requests happen at a constant frequency, such as checking enemies’ state. If queries are multithreaded, a job is pushed for each query executed and the code must wait for the answer. Use a (templated) request object that returns the last completed update if the newest one isn’t done yet (i.e. you are in between two checks).
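The double-buffering technique above can be sketched roughly like this (the class and method names are illustrative assumptions, not from the slides):

```cpp
#include <array>
#include <cstddef>

// Readers always see last frame's snapshot while the owning task writes
// the other copy; SwapBuffers() at the frame boundary is the single
// linearisation point where the new data becomes visible.
template <typename T>
class DoubleBuffered {
public:
    const T& Read() const { return buffers_[read_]; }      // other tasks
    T&       Write()      { return buffers_[1 - read_]; }  // owning task
    void SwapBuffers()    { read_ = 1 - read_; }           // frame boundary only
private:
    std::array<T, 2> buffers_{};
    std::size_t read_ = 0;
};
```

Readers never observe a half-written value because the write copy is private until the swap.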
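The “future” handed back by the job system could look something like the sketch below: a worker thread fills in the result, and the AI polls IsAvailable() each frame instead of blocking. IsAvailable() is the name the talk mentions; everything else here is my assumption.

```cpp
#include <atomic>

// Result of a long-running query (e.g. pathfinding) that may complete
// several frames after it was requested.
template <typename T>
class Future {
public:
    // Polled by the requesting AI each frame.
    bool IsAvailable() const { return ready_.load(std::memory_order_acquire); }

    // Called once by the worker thread when the job finishes.
    void Set(const T& value) {
        result_ = value;
        ready_.store(true, std::memory_order_release);
    }

    // Only valid once IsAvailable() returns true.
    const T& Get() const { return result_; }

private:
    T result_{};
    std::atomic<bool> ready_{false};
};
```

The release/acquire pair ensures the result write is visible before the ready flag is.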

Use different methods for different tasks.

In the future there may be architecture changes down the road to make this easier – transactional memory, for instance. There is also the issue that tasks are not infinitely divisible, so it is up to us to find more work to spend the cores on improving the AI: better animation (subtle movements and motion control), speech synthesis and recognition (for better immersion), or simply many more entities.

He pointed to Herb Sutter, a C++ person working on multithreading. Work with the old C++ for now, but look to the future.

Q – Lock free experience?
You need to do one implementation per platform, else it means kernel debugging!

Q – Do we really need that last 20% performance?
Consumers want more and more when they buy new consoles. It might just be better frame rates.
Q – Based on that assumption, how best do we utilise the 100%? Example: 50,000 jobs to do but only room for 10,000 in a frame?
One way is to prioritise based on distance. Also, if you need CPU for AI, you can shut down effects.

Q – What about GPU?
They’ve had that solved for years – you push a list of commands and let the GPU take care of it. If the GPU can handle split screen or extra effects, it can do it. You can even accept one frame of delay and add more effects. There is a lot the GPU could do for AI – pathfinding, maybe.

Q – Do you do a separate solution for each problem or do one for the entire project?
On one project we had a single primitive used across the whole project, with some additional bits for special cases.

Q – what do you think about the shared memory bottleneck?
While we have shared memory we will have to live with these workarounds, until an alternative is thought of.

Q – How do you test to get knowledge about getting 100% usage?
Xbox consoles have tools for it; on the other systems we use profilers. You need to look at it in detail, since attaching a debugger can change the timings.

Q – How about single core machines, do you have waste from having this?
Yes, creating extra threads is a waste there. Either create two versions of the game, or build for one CPU – it’s not a good idea to create threads when there is only a single-core CPU.


Website and journal of Andrew Armstrong