Achieving hard real-time performance – less than 1 µs of jitter with microsecond-level response times – on such a system requires careful trade-off analysis and system partitioning. It is also essential to plan for the ever-increasing complexity of future SoCs. There are three main approaches to such a system design – Asymmetric Multi-Processing (AMP), hypervisors, and Symmetric Multi-Processing (SMP) with core isolation (Figure 1) – from which system designers can choose to optimise hybrid SoC systems.
AMP is fundamentally a port of multiple Operating Systems (OSs) onto physically different processor cores. An example would be to run a bare-metal OS dedicated to handling real-time tasks on the first core and a full-featured OS such as embedded Linux on the remaining cores. Most of the time, the initial porting of the OSs onto the cores is straightforward. However, the start-up code and the management of shared resources such as memory, caches and peripherals are very error-prone. When multiple OSs access the same peripheral, their behaviour becomes non-deterministic and the system can become extremely time-consuming to debug. Hence, careful protection using an architecture such as ARM TrustZone often needs to be in place.
To add more complexity, message passing between the OSs requires shared memory, which must be managed alongside the other protection measures. Because caches are usually not shareable between different OSs, message passing needs to happen through non-cached memory regions, which adds latency and jitter to the overall performance. AMP is also a poor software architecture from a scalability viewpoint, as it requires significant re-porting whenever the number of cores increases.
A hypervisor is a low-level software layer that runs directly on the hardware and manages multiple independent OSs on top of it. Though the initial porting effort is similar to AMP, the benefit is that the hypervisor hides the non-trivial details of resource management and message passing. The drawback is that the extra software layer incurs a performance overhead, degrading both throughput and real-time performance.
SMP with core isolation runs a single OS on multiple cores with internal core partitioning. An example is to instruct an SMP OS to assign a real-time application to the first core and the remaining, non-real-time applications to the other cores. This approach is very scalable, as an SMP OS is designed to scale seamlessly to an increasing number of cores. Because all cores are managed by a single OS, message passing between cores can happen at the L1 data cache level, resulting in faster communication with less jitter.
Core isolation can reserve a core for the hard real-time application, shielding it from interference by the other, high-throughput cores and preserving the low-jitter real-time response. This is generally a good software architecture decision because it lets designers choose which OS to use instead of re-inventing error-prone, low-level software to manage multiple OSs. The initial porting may require some effort if starting from multiple OSs; starting from an SMP architecture, however, requires much less effort.
Figure 1 Comparison of AMP, hypervisor and SMP with core isolation.
Optimising a high-throughput, real-time SoC with SMP
Based on this analysis, SMP with core isolation offers the best architecture for optimising high-throughput, real-time SoC systems. The architecture we consider is a system similar to Figure 3, in which an I/O data stream enters an SoC, undergoes some form of processing in the cores, and is returned to the I/O with a low-jitter, low-latency real-time response. In addition, the SoC consists of multiple cores that simultaneously run other throughput-intensive applications.
First, it is essential to understand what a real-time response (loop time) consists of:
- Transfer new data to the system memory from an I/O (DMA)
- Processor detects the new data in the system memory (Core Isolation)
- Copy the data to a private memory (memcpy)
- Compute on the data
- Copy the result back to the system memory (memcpy)
- Transfer the result back to an I/O (DMA)
Because the jitter and latency accumulate across the six steps, it is essential to optimise each one. With an RTOS such as VxWorks with core isolation, the polling/interrupt response can be bounded in the nanosecond range (Step 2). Data computation is application-specific and fairly predictable (Step 4). We therefore focus on the trade-offs of the Direct Memory Access (DMA) transfers and the memcpy operations (Steps 1, 3, 5 and 6). There are two major ways to transfer data: with or without cache coherency. The two methods have very different consequences for DMA and memcpy performance.
As Figure 2 shows, although cache coherency (using the ARM Accelerator Coherency Port (ACP)) results in a longer path for the DMA to complete, the processor then only needs to access the L1 cache to obtain the transferred data. The significantly lower memcpy time therefore outweighs the small degradation in DMA performance. For designers, this means the cache-coherent transfer yields much shorter latency and, thanks to the direct cache access, lower jitter.
Figure 2 Memcpy and DMA performance with/without cache coherency.
Next: a case study.