Welcome to my brand new Developer Journal! If you're looking for an update on CoE/KoE's development status, please check out the CoE Development Update: May 2023, which was posted on Tuesday, May 18th.
In this developer journal, I'll discuss various historical and trending architectures used in game development. Specifically, I'll explain how I've combined multiple architectures in the Soulborn Engine (used in CoE & KoE) to enable distributed, parallel processing of the game loop. This is the first in my new series of developer journals, so I'll also discuss the target audience and goals for these journals. Let's begin! To jump to specific topics, please use the navigation tree provided below.
First and foremost, let me start by saying that these new developer journals aren't for everyone. They are not, strictly speaking, CoE/KoE development updates. Instead, based on the diverse interests within our community, I've created a new series of developer journals that focus on the highly technical aspects of game development and the anecdotal lessons of running a game company. That said, some specific audiences who might find this content exciting are:
These developer journals aim to provide insight, education, and encouragement to all readers seeking a deeper understanding of game development or those interested in the development process and challenges faced by another indie studio. However, they are not entirely altruistic. These developer journals also serve as an opportunity for me to achieve catharsis. They allow me to share the more human side of game development and express my thoughts beyond what is suitable for a development update.
Readers can freely choose which blogs to read; the Development Updates, the new Developer Journals, or neither. It's entirely up to you.
This developer journal will cover my architectural work on our existing ECS over the last month and the challenges I faced. To ensure that most readers can follow along, I'll first lay a significant foundation. I apologize if some find these sections too verbose or simplistic. I'm trying to find a balance. Readers are free to skip ahead if they wish.
The traditional game loop in game development consists of a few crucial steps. First, the game is initialized with all necessary data structures and resources loaded and set up based on the current level or stage. This is followed by the game loop itself, which repeats the same actions on each iteration or tick. On each tick, the game goes through roughly the same three steps:
This continuous loop runs for as long as the game is being played.
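To make the shape of that loop concrete, here is a minimal sketch in Python. It assumes the three per-tick steps are the conventional ones (process input, update the game state, render); the `GameState` fields, the fake input source, and the text-based "renderer" are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    tick: int = 0
    player_x: float = 0.0
    frames: list = field(default_factory=list)


def process_input(tick):
    # Hypothetical input source: the player holds "move_right" for the first 3 ticks.
    return {"move_right"} if tick < 3 else set()


def update(state, inputs, dt):
    # Advance the game state by one tick based on the inputs gathered this tick.
    if "move_right" in inputs:
        state.player_x += 1.0 * dt
    state.tick += 1


def render(state):
    # Stand-in for drawing: record a textual "frame" instead of touching a GPU.
    state.frames.append(f"tick={state.tick} x={state.player_x:.1f}")


def run(max_ticks, dt=1.0):
    state = GameState()  # initialization: set up data before the loop starts
    while state.tick < max_ticks:  # a real game loops until the player quits
        inputs = process_input(state.tick)
        update(state, inputs, dt)
        render(state)
    return state


final_state = run(max_ticks=5)
```

Everything here runs on one thread, one step after another, which is exactly the synchronous, single-threaded model described next.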
Historically, the game loop described above was both synchronous and single-threaded. This meant that the amount of computation included in the update loop was limited by what the CPU could process and deliver to the video card, which rendered the images 30-60 times per second. As video cards became faster and GPGPUs more prevalent, we started offloading more computations to the graphics card, freeing up valuable CPU cycles for game logic and AI.
Over time, both CPUs and GPUs have become multi-core, allowing for improved performance on both sides of the CPU/GPU divide and enabling more complex game logic and rendering techniques. However, utilizing parallel processing on the CPU requires safer development practices and greater care. More information can be found in the "Actor Model" section below.
After discussing the main game loop, let's briefly touch on the various programming paradigms and architectures that have emerged over time. In particular, we'll start with object-oriented programming (OOP) and its limitations in game development.
Before we dive into the limitations of object-oriented programming (OOP) for game development, it's worth noting that games can be written in various programming paradigms, such as procedural, object-oriented, functional, declarative, and event-driven. While some are more suitable than others, it's possible to use virtually any paradigm to write games.
For the past 30-40 years, object-oriented programming (OOP) has been the dominant paradigm in game development. OOP involves dividing the application space or "domain model" into objects, which combine the application data with the encapsulated operations that can be performed on that data.
For example, in an RPG, characters could be objects, as could items such as consumables, crafting resources, weapons, and armor. Anything described in natural language as a noun could be considered a class of objects. The operations that can be performed on these objects are called methods. For example, you can drink a potion, fire a bow, or close a door. Using objects to represent game elements is an intuitive analogy, which makes programming games using OOP easier than with other paradigms.
At the same time, the encapsulation aspect of OOP allows programmers to treat an object as a "black box" and focus solely on the operations that can be performed without worrying about how those operations are carried out. This makes it easier for teams to work together and reuse libraries they've written without relearning implementation details.
But all that glitters is not gold. There are limitations to OOP. What happens if an object is both a weapon and a consumable (I've had some questionable whiskey. Don't ask.), or if it's both armor and a container (plate mail with pockets)? In traditional programming languages, these combinations would require developing different classifications of objects called "classes." To define a new class of objects, you can inherit the functionality of another class of objects and make the necessary changes to define a unique object class. The problem with this approach is that it can quickly lead to many unique class combinations. For example, consider just the following:
In the above listing, the : indicates inheritance. For instance, the above example states that a weapon is a type of item, and a tool is a type of item. Sometimes, a tool can also be a weapon. Additionally, consumables are considered items. If a consumable can be thrown to deal damage (like a Molotov cocktail), it's both a consumable and a weapon.
Programming languages such as C# and Java only allow for multiple interface inheritance and not multiple implementation inheritance, which makes it necessary to create class definitions for various object combinations like the ones mentioned above. This can be time-consuming and difficult to maintain when new features are added. However, an alternative architectural pattern is available - the Entity Component System (ECS).
Instead of using inheritance to create new object classes, alternative patterns use object composition. One such pattern is the Entity Component System (ECS). This architecture has few traditional objects and instead has "entities" defined by a unique identifier such as a name or number. For example, entities A and B can be identified by storing "A" and "B" as identifiers and referencing them later.
The real value of ECS comes when defining data for an entity. Instead of putting all the data in a single class, we can split it up into separate data blocks called "components." These components don't have any functionality or operations but rather store specific subsets of data for a given entity identified by its name or other identifiers. We can "compose" new object classifications by providing an entity with more than one component.
Using the previous example, we might have the following components:
With an ECS, there are only base classes and no more inheritance. I can easily change an object's type by simply adding or removing components. To define an object that is an item, a weapon, and a tool all at once, I simply assign each associated component to the entity. If I also want to make it consumable (such as a chocolate-covered hammer), I add the "consumable" component as well.
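The chocolate-covered hammer can be sketched in a few lines of Python. This is only an illustration of composition, not the Soulborn Engine's code: the `World` class, its method names, and the fields on each component are assumptions made up for the example.

```python
from dataclasses import dataclass
from itertools import count


# Components are plain data with no behavior. The fields are hypothetical.
@dataclass
class Item:
    name: str
    weight: float


@dataclass
class Weapon:
    damage: int


@dataclass
class Tool:
    durability: int


@dataclass
class Consumable:
    uses: int


class World:
    def __init__(self):
        self._next_id = count(1)
        self._components = {}  # component type -> {entity id -> component}

    def create_entity(self):
        # An entity is nothing more than a unique identifier.
        return next(self._next_id)

    def add(self, entity, component):
        self._components.setdefault(type(component), {})[entity] = component

    def remove(self, entity, component_type):
        self._components.get(component_type, {}).pop(entity, None)

    def has(self, entity, component_type):
        return entity in self._components.get(component_type, {})


world = World()
hammer = world.create_entity()
world.add(hammer, Item(name="hammer", weight=2.5))
world.add(hammer, Weapon(damage=6))
world.add(hammer, Tool(durability=100))
world.add(hammer, Consumable(uses=1))  # chocolate-covered
world.remove(hammer, Consumable)       # someone ate the chocolate
```

No new class had to be declared for "item that is a weapon, a tool, and a consumable"; the combination exists only as the set of components attached to the entity at that moment.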
It's worth noting that there are different ways to implement component storage in an ECS. The most common methods include:
All three of these methods have their benefits and drawbacks, with Archetype-based being the most commonly used due to its strong overall performance and memory footprint. Both Unity and Unreal use this method. The primary disadvantage of the Archetype model is that all components of an entity must be accessible within the same process.
While components contain data, behaviors and operations are handled by systems.
A system is defined as an operation or path of execution done repeatedly for every entity that shares a predefined set of required components, irrespective of its complete set of components (archetype). For instance, a gravity system can make every entity in the world "fall" if it has a position component and a component describing its physical properties, such as mass and shape (known as a rigid body). It's important to note that the gravity system doesn't distinguish between items, weapons, tools, or consumables. As long as an entity has a rigid body component and a position in the world, the gravity system will check if the entity is falling and handle it accordingly.
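Here is a minimal sketch of such a gravity system in Python. The component fields, the flat ground at `y = 0`, and the dictionary-per-component storage are simplifying assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Position:
    y: float


@dataclass
class RigidBody:
    mass: float
    velocity_y: float = 0.0


GRAVITY = -9.81  # metres per second squared

# One dictionary per component type, keyed by entity id.
positions = {1: Position(10.0), 2: Position(5.0), 3: Position(0.0)}
rigid_bodies = {1: RigidBody(mass=2.0), 3: RigidBody(mass=1.0)}  # entity 2 has none


def gravity_system(positions, rigid_bodies, dt):
    # Visit only entities that have BOTH required components, whatever else they have.
    for entity in positions.keys() & rigid_bodies.keys():
        body = rigid_bodies[entity]
        pos = positions[entity]
        body.velocity_y += GRAVITY * dt
        pos.y = max(0.0, pos.y + body.velocity_y * dt)  # clamp at the ground
        if pos.y == 0.0:
            body.velocity_y = 0.0  # resting on the ground


gravity_system(positions, rigid_bodies, dt=1.0)
```

Entity 2 has a position but no rigid body, so the system never touches it. Whether the entity is also a weapon or a consumable never enters the picture.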
With all that out of the way, the ECS pattern is becoming increasingly popular in the game industry. Since the mid-2000s, various commercial and non-commercial engines have added ECSs as a primary or secondary execution model. With virtually all PCs being multi-core now, there can be significant performance benefits to using an ECS.
When executing the main game loop with an ECS, the initial input stage generally remains the same. However, there are small differences in how the input is provided to the various systems that may require the input data.
The most significant changes arise during the update stage. Instead of updating the game's state by iterating over all objects in the world and performing various operations exposed by the object class, we now iterate over each system one after another. For each system, we iterate over each entity, executing the system's update (tick) code. This approach has the benefit of better cache locality, because each system walks through the same kind of component data over and over. And if you divide the systems so that different ones read from or write to different components at any given time, an ECS can be made highly parallelizable, with systems running safely in different threads simultaneously.
The Soulborn Engine, the backbone for all Soulbound Studios games, uses its own proprietary ECS. Unlike other ECSs, it doesn't assume the component data for an entity exists within a single process. Instead, it relies on the primary advantage of a Sparse ECS (remember the three different component storage types): while each process participating in the ECS needs to know which components an entity has, it doesn't need the component data itself. Consequently, a system only needs access to the component data that the system is concerned about.
When a system is included in the "world," it registers the necessary component data. As part of the implementation, no system in that process can access, query, or update components not registered by one of the systems. However, because component data is stored in shared repositories, we can group processes by related systems and minimize the memory for storing the game state.
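The registration rule can be sketched roughly as follows. This is a guess at the general shape, not the Soulborn Engine's API: `ProcessWorld`, `register_system`, `note_component`, and `repository` are hypothetical names, and repositories are reduced to plain dictionaries.

```python
class System:
    required = frozenset()  # names of the components this system needs


class MovementSystem(System):
    required = frozenset({"Position", "Velocity"})


class ProcessWorld:
    """One process's view of the world (hypothetical sketch)."""

    def __init__(self):
        self.entity_components = {}  # entity id -> names of ALL its components
        self._repositories = {}      # component name -> {entity id -> data}

    def register_system(self, system):
        # Registering a system makes its required component data available here.
        for name in system.required:
            self._repositories.setdefault(name, {})

    def note_component(self, entity, name):
        # Every process knows WHICH components an entity has, even without the data.
        self.entity_components.setdefault(entity, set()).add(name)

    def repository(self, name):
        if name not in self._repositories:
            raise LookupError(f"{name} is not registered in this process")
        return self._repositories[name]


world = ProcessWorld()
world.register_system(MovementSystem())
world.note_component(1, "Position")
world.note_component(1, "Inventory")
world.repository("Position")[1] = (0.0, 0.0)

try:
    world.repository("Inventory")
    inventory_accessible = True
except LookupError:
    inventory_accessible = False
```

The process knows entity 1 has an Inventory component, but because no system in this process registered it, the data itself is neither stored nor reachable here.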
The above architecture conveniently aligns with the microservices concept. The Soulborn Engine permits splitting the game's component data into various services with minimal duplication and latency. The data a system service needs is readily available, at least in a read-only mode.
This is all great in theory. In practice, it's more complex than my oversimplification. When we divide the ECS into different processes, we introduce asynchronicity, something that hasn't existed in the game engine until now. When the update loop of one system needs to communicate with another system, it cannot assume that the other is running in the same process or is even responsive.
Similarly, a system may be in the midst of its update loop when the process receives a notification that alters the component data of one of the components the system is presently updating. Suddenly, the "main game loop," which has historically run synchronously, must manage asynchronous events while maintaining thread-safe access to repositories and other resources.
Over the last month, I did a lot of research on various strategies for handling parallel processing. I say parallel rather than concurrent because concurrent execution can technically be handled using a single thread with context switches. But when you're running different processes, potentially on other machines, synchronicity is no longer "free." If you want synchronicity, you have to work for it.
There are two primary methods of enforcing synchronization: synchronization primitives like mutex locks and semaphores, and messaging.
Synchronization primitives involve identifying the code segments where several threads could simultaneously access shared data (called critical sections) and securing them with an OS-supplied primitive, such as mutual exclusion locks (mutexes) or semaphores. The notion is that these primitives will obstruct any efforts to access the section while another thread executes that portion of the code. However, synchronization primitives depend on the programmer correctly placing the locks in all the right places, which can otherwise result in issues like deadlocks and livelocks.
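As a small illustration of a critical section guarded by a mutex, here is a sketch using Python's standard `threading.Lock`. The `HealthRepository` class and the numbers are made up for the example.

```python
import threading


class HealthRepository:
    def __init__(self, health):
        self.health = dict(health)  # entity id -> hit points (shared state)
        self._lock = threading.Lock()

    def damage(self, entity, amount):
        # Critical section: only one thread at a time may read-modify-write.
        with self._lock:
            self.health[entity] -= amount


repo = HealthRepository({1: 100_000})


def attacker():
    for _ in range(10_000):
        repo.damage(1, 1)


threads = [threading.Thread(target=attacker) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The lock keeps the result correct, but every call to `damage` from every thread now funnels through the same gate. Forget the lock in just one method that touches `health` and the protection is gone, which is the "all the right places" problem.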
Moreover, ECSs are designed for high-speed execution. If you're rapidly iterating through all the entities in a system, while notifications that change state are being processed and other systems are trying to access the same component repositories, the probability of contending for a critical section is high. In such situations, all other threads attempting to enter the critical section must stop and wait for the current one to finish, resulting in poor performance.
As an alternative to synchronization primitives, it's possible to instead use one of the popular messaging-based architectures, such as the Actor Model. As I mentioned before, in OOP, each object has data encapsulated inside and protected from the outside except by its methods. But that doesn't prevent multiple threads from calling the same or different functions simultaneously and corrupting the object's internal state.
The Actor Model is sometimes referred to as the purest form of object-oriented programming because, in contrast to plain OOP, it enforces data safety and thread safety by requiring all interactions with an object to go through a message queue. As an analogy, while your mailbox (or digital inbox) can receive multiple letters (or messages) simultaneously, you only read them one at a time.
To operate on an Actor, a thread must send it a message instead of calling a function. These messages are queued within the Actor and processed one at a time, in order, when directed to by a dispatcher. This execution method is highly efficient as it uses a "lockless" approach to synchronization. Instead of using locks that cause other threads to block, we depend on message passing and thread pools to create inherent synchronicity.
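A bare-bones Actor might look like the sketch below. The class and method names are assumptions for illustration, and the "dispatcher" is reduced to the main thread calling `drain()`. One caveat: Python's `queue.Queue` uses a lock internally, so this shows the programming model rather than a truly lock-free mailbox. The point is that the Actor's own state is only ever touched by whichever thread is draining its mailbox.

```python
import queue
import threading


class Actor:
    def __init__(self):
        self._mailbox = queue.Queue()  # thread-safe inbox

    def send(self, message):
        # Any thread may call this; it never touches the actor's state.
        self._mailbox.put(message)

    def receive(self, message):
        raise NotImplementedError

    def drain(self):
        # Called by the dispatcher: handle queued messages one at a time.
        processed = 0
        while True:
            try:
                message = self._mailbox.get_nowait()
            except queue.Empty:
                return processed
            self.receive(message)
            processed += 1


class Monster(Actor):
    def __init__(self, health):
        super().__init__()
        self.health = health

    def receive(self, message):
        kind, amount = message
        if kind == "damage":
            self.health -= amount
        elif kind == "heal":
            self.health += amount


monster = Monster(health=1000)


def attacker():
    for _ in range(100):
        monster.send(("damage", 1))


threads = [threading.Thread(target=attacker) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

processed = monster.drain()  # the "dispatcher" runs the mailbox
```

Four threads attacked at once, yet `health` was only ever modified inside `drain()`, one message after another.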
Regarding how an Actor Model might be integrated into the main game loop, each game object in the world could be an Actor, allowing for the asynchronous processing of multiple objects simultaneously across numerous threads. If an object needs to call another object as part of its update, it can simply send a message. If a response is necessary, it can be processed later as part of the message handler or in the next update. Alternatively, if the language allows for the async/await pattern, processing can move on to other objects while awaiting a response before continuing.
On an 8-core machine, up to 8 Actors could simultaneously process their update queues, resulting, in the ideal case, in a total execution time approaching 1/8th of the original time.
Given the above, we have a problem. An ECS fundamentally differs from the "purest form of object-oriented programming" because it does not rely on encapsulated objects. Instead, the internal state of an ECS is intentionally distributed externally, which is not typical of OOP.
Now that we've laid a solid foundation, let's dive into my architectural work over the past month. I have been working on an architecture that integrates the Actor Model with our ECS implementation to increase the ECS's efficiency and performance while also preparing it for execution in an asynchronous environment, such as being hosted in different processes.
The first architectural decision is what should and should not be an Actor, as the Actor Model allows for a flexible implementation where not every object needs to be an Actor. In the context of an ECS, we can implement the Actor Model in various places, such as Worlds, Entities, Components, Systems, and in sparse ECSs, Repositories.
A World serves as the top-level object in an ECS, whose primary purpose is to create and manage entities. In our DECS, the World is also used to keep track of all the components an Entity has, whether they're in the process or not. Finally, the World registers the systems and repositories and calls tick on each system at the appropriate time. Given that the World is responsible for adding and removing entities to the world, it is an obvious class to make an Actor in our implementation.
Our previous discussion established that an Entity in the ECS is essentially just an ID. However, the Soulborn Engine provides an abstraction through the Entity object, making it easier to perform operations such as adding & removing components. So I could reasonably make that Entity structure an Actor to limit access to the abstraction.
Components are the real foundation of an ECS as they define the data that determines an entity's behavior. If we make each component an Actor, we could modify all components in the world at the same time, and any simultaneous attempts to change a single component would simply be queued and applied one after another. So it's pretty reasonable to make each component an Actor.
Systems act as the primary means of updating components in an ECS. They iterate through every entity with a required set of components and make changes to them, updating the repositories as necessary. Systems can also subscribe to notifications from one another and publish events when significant state changes occur that other systems might not otherwise be able to see. Making Systems Actors would enable orderly and secure handling of these notifications without interfering with update loops, which could themselves be initiated by a message.
As previously mentioned, Repositories in our ECS are shared resources that multiple systems can access concurrently. The architecture explicitly encourages grouping systems that use the same component repositories to reduce redundant storage and component synchronization across processes. By making the repositories Actors, you can asynchronously access them in multiple systems simultaneously, each running in a separate thread, without worrying about race conditions within the repositories themselves.
All said, virtually everything in an ECS could be made an Actor. But not everything should. After working on various prototypes and evaluating different networking libraries, I came to the following conclusions.
To handle potentially having multiple Actor types, it's best to implement a new base or intermediate class that can be inherited instead of modifying each selected candidate individually. This way, we can centralize the Actor functionality in a single location, including message queueing, dispatching, and message processing. So let's look at our candidates and decide which ones deserve to be Actors.
By making the World an Actor, we guarantee that the creation and deletion of entities occur in a controlled manner and are processed in between ticks. This is a definite win.
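The "processed in between ticks" guarantee can be sketched like this. It's a single-threaded simplification with hypothetical names (`WorldActor`, `send`, the `("create", ...)` and `("delete", ...)` message shapes), meant only to show the ordering, not how the engine does it.

```python
from collections import deque


class WorldActor:
    def __init__(self):
        self._mailbox = deque()
        self.entities = set()
        self._next_id = 1

    def send(self, message):
        # Requests are only queued here; nothing changes immediately.
        self._mailbox.append(message)

    def _drain(self):
        while self._mailbox:
            kind, payload = self._mailbox.popleft()
            if kind == "create":
                self.entities.add(self._next_id)
                self._next_id += 1
            elif kind == "delete":
                self.entities.discard(payload)

    def tick(self, systems):
        self._drain()  # structural changes are applied between ticks
        snapshot = frozenset(self.entities)  # stable view for this tick
        for system in systems:
            system(self, snapshot)


world = WorldActor()
world.send(("create", None))
world.send(("create", None))

seen = []


def reaper(world, entities):
    # A system that asks for entity 1 to be deleted while a tick is running.
    seen.append(sorted(entities))
    if 1 in entities:
        world.send(("delete", 1))


world.tick([reaper])  # sees entities 1 and 2, queues the delete
world.tick([reaper])  # the delete was applied before this tick began
```

During the first tick the entity list never changes under the system's feet, even though the system itself asked for a deletion. The change lands at the start of the next tick.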
Second, the Entity object in the Soulborn Engine is essentially a utility object; however, since the Entity accesses the repositories directly, one of them must become an Actor to ensure thread safety.
However, as you'll see later, I can't reasonably make the Repository an Actor. Meanwhile, the Entity has no functionality of its own, and everything is forwarded to the repositories. The solution is to use an intermediary. Although the Entity doesn't currently have access to the World object it is a part of, by changing the Entity implementation to make calls to the World rather than directly to the repositories, we can bubble the problem up. This means neither the Entity nor the Repository needs to be an Actor! However, all utility functions on the Entity must be rewritten to call functions on the World object, which, as we've already stated, will itself be made an Actor.
Moving on to components, they are designed to be local to a system, accessed frequently, and as close to real-time as possible. Although not explicitly mentioned earlier, the Actor Model requires all messages to be asynchronous. While it's possible to send a message to a component to make changes, we'd also have to send a message to retrieve data and then await its response. This is impractical as we already have the component, and we would want to avoid blocking the execution or moving on to another entity while awaiting a response from a local object. We cannot make Component data Actors for performance reasons and basic sanity.
Systems can be thought of as Services that also have an Update function. It's common to send commands to services in response to user input. For instance, when a player hits the "W" key, a command is sent to a service that updates the velocity vector of that entity's Movement component. This makes services a nice place to implement the Actor Model, not least because another tenet of the Actor Model is not caring where the Actor is located. Since we're sending Actors messages, it's technically possible to communicate via any networking transport, pipe, or other cross-process method, or even in-proc. And as the goal is for Services to be spread across multiple processes, this aligns nicely with the Actor Model.
Lastly, repositories cannot be Actors for the same reason as components. They are local and must immediately return the component data for the requested entity when needed, despite being a shared resource.
However, we can get around the fact that repositories are shared resources and cannot be Actors in two ways. Firstly, we can implement a scheduler to ensure that no two systems executing simultaneously access the same components, especially if they require write access. This way, any number of systems can read the same data in parallel, as long as the components they write to don't overlap with anything another running system reads or writes.
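One simple way to build such a scheduler is greedy batching by declared read and write sets, sketched below. This is a swapped-in textbook technique rather than the engine's scheduler, the system names and component names are invented, and the sketch ignores any ordering dependencies between systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemSpec:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def conflicts(a, b):
    # Two systems conflict if either writes something the other reads or writes.
    return bool(
        a.writes & (b.reads | b.writes) or b.writes & (a.reads | a.writes)
    )


def schedule(systems):
    # Greedy batching: put each system in the first batch it doesn't conflict with.
    batches = []
    for system in systems:
        for batch in batches:
            if not any(conflicts(system, other) for other in batch):
                batch.append(system)
                break
        else:
            batches.append([system])
    return batches


systems = [
    SystemSpec("movement", reads=frozenset({"Velocity"}), writes=frozenset({"Position"})),
    SystemSpec("gravity", reads=frozenset({"RigidBody"}), writes=frozenset({"Velocity"})),
    SystemSpec("rendering", reads=frozenset({"Position"})),
    SystemSpec("hunger", reads=frozenset({"Hunger"}), writes=frozenset({"Hunger"})),
]

batches = schedule(systems)
batch_names = [[system.name for system in batch] for batch in batches]
```

Systems within a batch can run on separate threads at the same time, and the batches run one after another. Here movement and hunger share a batch because they touch unrelated components, while rendering must wait for movement to finish writing positions.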
Secondly, to allow individual systems to update their entities in parallel, we need to protect a single repository in a thread-safe manner without blocking. Because of our internal implementation, it's possible to update the data of every component in a repository simultaneously. All an update does is replace the values at different indices of an array. However, we cannot Add or Remove components in parallel, since that modifies shared internal data. But, by changing the Systems to require going through the World object to Add or Remove components, we can maintain lightning-fast updates while ensuring that adds and removes are done in a thread-safe manner.
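The split between in-place updates and structural changes can be sketched as follows. The `Repository` and `World` classes here are hypothetical, and because of CPython's global interpreter lock the threads won't truly run in parallel, so read this as an illustration of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


class Repository:
    def __init__(self, values):
        self.values = list(values)  # dense array: slot index -> component value

    def update_slot(self, index, fn):
        # Replaces one slot in place; never resizes the array.
        self.values[index] = fn(self.values[index])


class World:
    def __init__(self, repo):
        self.repo = repo
        self._pending = []

    def request_add(self, value):
        self._pending.append(("add", value))

    def request_remove(self, index):
        self._pending.append(("remove", index))

    def apply_structural_changes(self):
        # Runs on one thread, between parallel passes. Removes go highest index first
        # so that earlier removals don't shift the slots still waiting to be removed.
        removes = sorted((p for kind, p in self._pending if kind == "remove"), reverse=True)
        adds = [p for kind, p in self._pending if kind == "add"]
        for index in removes:
            del self.repo.values[index]
        self.repo.values.extend(adds)
        self._pending.clear()


repo = Repository([1, 2, 3, 4, 5, 6, 7, 8])
world = World(repo)


def double_range(start, stop):
    # Each worker owns a disjoint range of slots, so no two touch the same index.
    for index in range(start, stop):
        repo.update_slot(index, lambda value: value * 2)


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(double_range, start, start + 2) for start in range(0, 8, 2)]
    for future in futures:
        future.result()

world.request_remove(0)
world.request_add(100)
world.apply_structural_changes()
```

The parallel pass only overwrites existing slots, so it needs no lock. Anything that would resize or reorder the array is deferred to the World and applied on a single thread.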
"The DECS is dead, long live the DECAS!" I have started refactoring the previous work and introduced the DECAS (Distributed, Entity, Component, Actor System) to replace the existing DECS. This involves making significant changes to the core components of the Soulborn Engine's ECS, including the World, Entity, and System classes, and adding a Dispatcher to handle the execution of the message queues in each new Actor. Once these changes are complete, the ECS will be fully suited for running in an asynchronous environment at high speeds and can be distributed across different processes.
This is no small thing.
Thanks for reading this developer journal! It was my first attempt at writing a technical journal, so your feedback is greatly appreciated. You can contact me through Discord or the Soulbound Studios contact email address. I hope to improve each journal by balancing the information and depth provided, making it accessible to most readers while still being informative and beneficial to other software engineers and programmers.
Thanks for Reading!
Caspian