EXPLORING MEMORY ARCHITECTURES: PILLARS OF PROCESSING PERFORMANCE

PROCESSOR-BASED SYSTEMS RELY ON MULTIPLE, HETEROGENEOUS MEMORY SUBSYSTEMS TO DELIVER BETTER SYSTEM PERFORMANCE, COST EFFICIENCIES AND LESS POWER.

BY ROBERT CRAVOTTA • TECHNICAL EDITOR -- EDN Europe, 01 Aug 2007

A key performance characteristic of processor architectures is how much application-specific work they can perform per unit of time. The EEMBC (Embedded Microprocessor Benchmark Consortium) benchmark, unlike Dhrystone MIPS (millions-of-instructions-per-second) scores, describes the performance of processors executing tasks in embedded-system applications. Version 1.0 of the EEMBC benchmarks does not capture the system-level influences on processing performance, such as the memory subsystems, because the benchmarks can often run from within the processor’s L1 cache. However, EEMBC’s second-generation, system-level benchmarks for networking and digital entertainment more realistically stress even those processors with large cache memories.

It is increasingly important to consider the system-level impact of the memory subsystems of a processor because the types and sizes of the memories and access methods in the system define the upper limit of a processor core’s performance. According to Gerard Williams III, a Fellow at the processor division of ARM, a processor with an ideal memory system never misses in the cache and has ideal access to the bus. Chip designers must first understand the processor’s IPC (instructions-per-cycle) capability and then try to implement a memory architecture that minimizes the performance loss. This performance loss can result from caching or memory-access effects, such as miss rate due to capacity misses, cache size, or conflict misses.

A well-matched memory subsystem can, in the best case, merely preserve the processor’s maximum IPC rate, whereas a mismatched memory subsystem can drastically reduce the processor’s performance by failing to keep the core’s execution units supplied with instructions and data, leaving the core idle. Building and implementing memory subsystems that do not adversely affect the processor core’s performance continues to be challenging because the performance gap between processor logic and the memories is widening with each process-technology reduction. In essence, at each process-technology step, the improvement in memory-access latency (the time it takes to receive the first bit of a memory request) is smaller than the commensurate clock-rate improvement of the processor’s core logic.
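The loss of IPC to memory stalls can be approximated with the common textbook stall-cycle model, in which every cache miss adds its full penalty to the cycles per instruction. The sketch below is only an illustration: the function name, the parameter names, and the example numbers are assumptions, not figures from ARM or from this article.

```python
def effective_ipc(peak_ipc, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Estimate sustained IPC once cache-miss stalls are included.

    Simplifying assumption: each miss stalls the core for the whole
    penalty, with no overlap between misses and useful execution.
    """
    base_cpi = 1.0 / peak_ipc
    stall_cpi = mem_refs_per_instr * miss_rate * miss_penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


# Hypothetical core: peak IPC of 2, 0.3 memory references per instruction,
# 50-cycle miss penalty.
ideal = effective_ipc(2.0, 0.3, 0.0, 50)       # memory system never misses
realistic = effective_ipc(2.0, 0.3, 0.02, 50)  # 2% of references miss
```

Under these assumed numbers, a 2% miss rate is enough to pull the sustained rate from 2 IPC down to about 1.25 IPC, which is why the memory subsystem can at best preserve the core's peak rate.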

Likewise, the best performance impact software developers can accomplish with insightful placement of program instructions and data within the memory subsystem is the preservation of the processor’s maximum IPC rate. However, mismatching the placement of program instructions and data in the memory subsystem with the application’s usage scenario can significantly degrade the processor’s performance. Freescale’s application note on preventing M1 memory contention provides an example that exhibits a worst-case 54% processor-performance degradation due to memory contention that the developer can avoid with better placement of the data buffers (Reference 1).

In general, compilers and profiling tools can provide limited assistance with global optimizations for placing instructions and data in memory. Green Hills’ optimizing compilers support the reordering of functions in memory to optimize cache hits. Texas Instruments’ CodeSizeTune profile-based-compilation tool assists a developer in exploring configurations by automating the building and profiling of different versions of the software with different compiler settings that affect code size and execution speed (Figure 1). In general, though, for many high-efficiency and real-time-constrained systems, the burden falls on the software developer to understand the memory subsystem. Optimizing software performance will avoid unnecessary BOM (bill-of-materials) costs incurred to compensate for the system’s inefficient use of processing and memory resources.

TOLERANCE OF LATENCY

A primary concern when implementing memory architectures is making the processor tolerant of the access latencies of the memories the system uses. A properly designed memory subsystem can mask much of the system’s memory-access latency and provide a sufficient read/write throughput rate (that is, the memory-access time for subsequent data in the same block of data) to support continuous access. This scenario avoids starving the processor’s execution units of instructions and data. Memory designers must also balance masking the memory’s access latency against the silicon area of the memory, the total power the memory consumes, and the ease of use of the memory by software developers and tools (see sidebar “Ease of use”).
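The distinction between first-access latency and throughput for subsequent data can be made concrete with a simple burst model. This is a minimal sketch under assumed numbers; the function names and the 20-cycle and 1-cycle figures are hypothetical and do not describe any particular memory.

```python
def burst_transfer_cycles(first_access_latency, cycles_per_subsequent_word, words):
    """Cycles to read `words` consecutive words from one block.

    The first word pays the full access latency; each later word in the
    same block pays only the (shorter) streaming time.
    """
    if words < 1:
        raise ValueError("a transfer needs at least one word")
    return first_access_latency + (words - 1) * cycles_per_subsequent_word


def average_cycles_per_word(first_access_latency, cycles_per_subsequent_word, words):
    """Average cost per word, showing how bursts amortize the latency."""
    total = burst_transfer_cycles(first_access_latency, cycles_per_subsequent_word, words)
    return total / words


single = burst_transfer_cycles(20, 1, 1)        # isolated access: full latency
burst_of_8 = burst_transfer_cycles(20, 1, 8)    # one latency, then 7 streamed words
per_word = average_cycles_per_word(20, 1, 8)    # amortized cost per word
```

With these assumed values, an isolated access costs 20 cycles, whereas an eight-word burst costs 27 cycles in total, or roughly 3.4 cycles per word.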

AT A GLANCE
  • Memory subsystems and software can best impact a processor by merely preserving its theoretical maximum performance.
  • The processor’s architecture is the main factor that determines the range of options open to the designer of the memory system.
  • A processor’s tolerance of latency results from design decisions balancing speed, cost and power, as applied to a hierarchical structure that uses both fast and expensive, and slower but cheaper memory.
  • Preserving the mechanisms that provide tolerance of memory-access latency is still a mostly manual exercise for developers.

Direct drivers of memory-access latency are the times it takes to decode the address, activate the appropriate word line, sense the bit line, and drive the output from the sense amps. The address-decoding latency is the time it takes to latch the address and decide which word line requires activation; this process takes O(n log n) time as a function of the size of the memory’s row and column addresses. Consequently, as the memory structure gets bigger, so does the time interval needed to decode the addressing. The word-line-activation latency is the time it takes to raise the word line; it is primarily an RC delay related to the length of the line, with longer lines driving longer delays. The bit-line-sensing latency is the time it takes for the sense amplifiers to detect the cell contents. The bit-line architecture, the RC of the sense-drive line, the cell-to-bit capacitance ratio, and the sense-amplifier topology all affect bit-line-sensing latency. The output-driving latency, an RC delay, determines the time it takes to propagate the data from the sense amps to the output.

Memories and the logic to manage them dominate the silicon area of many processor-based devices. As a result, memories can be the largest components of the device’s silicon cost and the largest consumer of both dynamic and static power in the system. The many types of volatile and nonvolatile memory available involve numerous trade-offs, and the system designer must balance and manage the key parameters to deliver good enough memory performance at lower cost and power consumption.

To balance masking memory-access latency, silicon cost, and power consumption, processor-based devices usually rely on a hierarchical memory structure that places smaller amounts of faster yet more expensive memories closer to the processor core and larger amounts of slower and less expensive memories farther from the processor core (Figure 2). After the processor registers, which are the fastest and scarcest memory resources in the system, memory hierarchies may use local memories or TCMs (tightly coupled memories), multiple layers of caches, and volatile and nonvolatile on- and off-chip memories.

Modern optimizing compilers are competent at managing the use of the processor registers, but they are weaker at managing and optimizing the other memories. This situation is due partly to the fact that optimizing the use of the registers works well as a tactical exercise with a local view of the program code. Optimizing the use of the other memory structures, such as the TCMs, in a processor-based system requires a more global view of the system, and this capability is still emerging in most compilers.

Local memories or TCMs connect to the processor core through local- or dedicated-memory buses for access performance similar to that of cache memories. Memory-access determinism is a key difference between TCMs and caches. Cache-line locking manually and temporarily enables a cache at the line level to act as a TCM. Program-instruction and code access through a TCM is deterministic, but, with a cache, the designer must consider the worst-case scenario of cache misses. “A typical rule of thumb for the penalty of a cache miss is an order of magnitude longer access latency than the previous level,” says David Fisch, director of architecture at Innovative Silicon. “An L2 memory access has a 10 times longer latency than an L1 cache access, and it also has a 10 times shorter latency than an L3 memory access.” However, using TCMs puts the burden on the software developer to manually manage that memory space, usually with a DMA controller, so that the necessary code and data are in the TCM when the processor needs them.
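The order-of-magnitude rule of thumb lends itself to a quick expected-latency estimate across a hierarchy. The following sketch assumes latencies of 1, 10, and 100 cycles to follow the 10-times rule; the hit rates and the function name are illustrative assumptions rather than measured values.

```python
def average_access_time(level_latencies, hit_rates):
    """Expected cycles per access in a multilevel memory hierarchy.

    level_latencies[i]: total cycles when the access is served by level i.
    hit_rates[i]: fraction of the accesses reaching level i that hit there.
    The last level must always hit (rate 1.0), like main memory or a TCM.
    """
    if hit_rates[-1] != 1.0:
        raise ValueError("the last level must have a hit rate of 1.0")
    expected = 0.0
    reach_probability = 1.0  # chance that an access gets this far down
    for latency, hit_rate in zip(level_latencies, hit_rates):
        expected += reach_probability * hit_rate * latency
        reach_probability *= 1.0 - hit_rate
    return expected


# Assumed hierarchy: L1 = 1 cycle, L2 = 10 cycles, L3 = 100 cycles.
cached = average_access_time([1, 10, 100], [0.95, 0.80, 1.0])
# A TCM behaves like a single level that never misses, so its latency
# is the same on every access.
tcm = average_access_time([1], [1.0])
```

Under these assumptions the cached hierarchy averages about 2.35 cycles per access, but any individual access may take 1, 10, or 100 cycles, which is the determinism difference the text describes; the TCM case is always 1 cycle.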

A cache uses a small amount of faster memory to mask the latency of a larger amount of slower memory, which is denser and, hence, cheaper. Caches rely on the premise of temporal and spatial locality to mask the memory-access latency of the slower memory. “Temporal locality” describes the premise that, if the processor requests some data, then the processor will soon need that same data again. By keeping a copy of the data in its storage, the cache can avoid going to the slower memory. “Spatial locality” describes the premise that, if the processor requests code at a memory location, then it is highly likely that the next processor request will be the code at the next memory location or close to that location. By prefetching some amount of the data near the currently requested data at the same time as the original fetch, the cache can have the next few data locations in its store without incurring the latency of another fetch from the slower memory.
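A toy direct-mapped cache model shows both kinds of locality at work. This is a minimal sketch, not a model of any real cache: the 64-line, 16-byte-line geometry, the function name, and the synthetic address traces are all assumptions chosen for illustration.

```python
def count_hits(addresses, num_lines=64, line_bytes=16):
    """Count hits in a direct-mapped cache for a list of byte addresses.

    Fetching a whole line on each miss is what lets neighboring
    addresses hit later (spatial locality); keeping the line resident
    is what lets repeated addresses hit (temporal locality).
    """
    resident_block = [None] * num_lines  # block number held by each line
    hits = 0
    for address in addresses:
        block = address // line_bytes
        index = block % num_lines
        if resident_block[index] == block:
            hits += 1
        else:
            resident_block[index] = block  # miss: fetch the whole line
    return hits


sequential = list(range(0, 1024, 4))     # 256 consecutive 4-byte words
strided = list(range(0, 256 * 64, 64))   # 256 accesses, 64 bytes apart

spatial_hits = count_hits(sequential)       # 3 of every 4 words hit
no_locality_hits = count_hits(strided)      # every access touches a new line
temporal_hits = count_hits(sequential * 2)  # second pass hits entirely
```

In this synthetic case the sequential walk hits on 192 of 256 accesses, the 64-byte stride never hits, and repeating the sequential walk raises the total to 448 hits out of 512 because the 1-kbyte working set stays resident.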

EASE OF USE

Ease of programming is a feature that is important to software developers. A flat address space that hides the memory hierarchy makes it easier for the developers to program. Brian Boles, digital-signal-controller-division technical fellow at Microchip Technology, says that, “Generally, it is easier for a compiler to target an application to a generalized memory structure.” It is more difficult for compilers to optimally allocate the code and data to application-specific memory structures without visibility to the global and dynamic characteristics of the application code.

For sophisticated applications that require operating systems, such as Linux, the memory architecture may need to support virtual addressing. However, developers using heavy operating systems to meet time-to-market schedule pressures need to consider the potential loss of insight into how to partition the software to take advantage of on-chip resources to save power and cost. Part of the challenge is balancing and determining how much of the on-chip memory the operating system requires to operate mainly out of on-chip memory and how much of the memory this approach leaves for the application. “To date, general-purpose operating systems do not have hooks to specify the complete physical-to-memory-system mapping so as to facilitate the most optimum use of the underlying memory system,” says Phil Ames, segment-marketing manager for the Embedded and Communications Group at Intel. “However, it is common in embedded designs to hand-tune the software to make best use of the memory system.”

Managing each different class of memory may require specialized software. For example, small-block NAND flash (528 bytes per page) usually requires different flash-management software from large-block NAND flash (2112 bytes per page). One approach to manage this situation is to modularize the software into layers so that the software developer has to rewrite as little as possible when changes are necessary. According to Doug Wong, member of the technical staff at Toshiba’s memory-products group, “NAND flash appears to be the first commodity memory to add significant intelligence to the memory device itself in order to make it easier to use.” Toshiba’s LBA-NAND and eMMC-compliant embedded NAND both contain built-in controllers that perform NAND-management functions, such as block management, wear-leveling, logical-to-physical-block translation, and automatic error correction. This approach significantly reduces the burden on the system architect or software engineer in managing the NAND-flash device for an FFS (flash file system) or for an FTL (flash translation layer).
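To give a feel for what logical-to-physical-block translation and wear-leveling involve, here is a deliberately tiny sketch. It is not Toshiba's implementation or any real FTL: the class, its methods, and the whole-block write model are hypothetical, and it ignores pages, garbage collection, bad blocks, and error correction.

```python
class TinyFlashTranslationLayer:
    """Toy FTL: remaps each logical block to the least-worn free block."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.free_blocks = set(range(physical_blocks))
        self.logical_to_physical = {}
        self.contents = {}  # physical block -> stored payload

    def write(self, logical_block, payload):
        if not self.free_blocks:
            raise RuntimeError("no free physical block available")
        # Wear-leveling: choose the free block erased the fewest times.
        target = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(target)
        self.contents[target] = payload
        previous = self.logical_to_physical.get(logical_block)
        self.logical_to_physical[logical_block] = target
        if previous is not None:
            # The stale copy is erased and returned to the free pool.
            del self.contents[previous]
            self.erase_counts[previous] += 1
            self.free_blocks.add(previous)

    def read(self, logical_block):
        return self.contents[self.logical_to_physical[logical_block]]


ftl = TinyFlashTranslationLayer(physical_blocks=4)
for version in range(5):
    ftl.write(0, f"payload-{version}")  # rewrite the same logical block
latest = ftl.read(0)
wear_spread = max(ftl.erase_counts) - min(ftl.erase_counts)
```

Even though the host rewrites the same logical block five times, the erases in this sketch are spread evenly over all four physical blocks, and the host never sees the physical location change. Moving this bookkeeping into the memory device is what relieves the system software of the task.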

Larger caches usually mean fewer cache misses at the cost of more silicon area. Increasing the cache-set associativity, which refers to the number of locations where a given memory location can reside in the cache, almost always reduces cache misses. The effect of the cache line’s length can be positive or negative, depending on an application’s behavior. According to Bill Huffman, chief architect at Tensilica, “Configuring caches is an iterative task that is highly dependent on the application set that will execute on the processor.”

Balancing the various cache parameters can be a complex process that involves trade-offs between silicon area and miss rates (Figure 3). In the figure, the explored cache configurations span from a 4-kbyte, direct-mapped, 16-byte-line configuration with a load-miss rate of 13.4% to a 32-kbyte, four-way-set-associative, 64-byte-line configuration with a load-miss rate of 1.9% for a JPEG-encoding application (Reference 2). Even though the larger cache is better, there is a diminishing return of benefits for the 32-kbyte cache. There is a larger performance benefit from increasing the cache-line size than from doubling the size of the cache; the longer cache lines reduce silicon cost. Also, although higher cache-set associativity is better, in this example, going from two-way to four-way-set associativity yields fewer benefits. In short, no clear rule of thumb exists for configuring caches.
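The iterative exploration of cache parameters can be imitated in miniature with a set-associative cache model that uses LRU replacement. The sketch below is an assumption-laden illustration: the function name and the synthetic two-buffer trace are hypothetical, and the resulting numbers have no relation to the JPEG results in Figure 3.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, ways, line_bytes):
    """Miss rate of a set-associative cache with LRU replacement."""
    num_sets = cache_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        block = address // line_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[block] = True
    return misses / len(addresses)


# Synthetic trace: alternate word reads from two buffers placed 4 kbytes
# apart, so that they compete for the same sets in a 4-kbyte cache.
trace = []
for offset in range(0, 1024, 4):
    trace.append(offset)
    trace.append(4096 + offset)

direct_mapped = miss_rate(trace, cache_bytes=4096, ways=1, line_bytes=16)
two_way = miss_rate(trace, cache_bytes=4096, ways=2, line_bytes=16)
two_way_long_lines = miss_rate(trace, cache_bytes=4096, ways=2, line_bytes=64)
```

For this particular trace, the direct-mapped cache misses on every access because the two buffers evict each other, two-way associativity cuts the miss rate to 25%, and lengthening the lines to 64 bytes cuts it to 6.25%. A different access pattern would rank the configurations differently, which is the reason the tuning has to be repeated per application.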

DECISION DRIVERS

The processor-core architecture is the first-order driver of the memory-architecture options that a designer has. The reason is that the designer builds the core with assumptions about how the memory components interface with and complement the core. Von Neumann and Harvard architectures are two common processor architectures that model and implement different ways to view and access memory. Processors based on a von Neumann architecture model the system memory as a single storage structure that hosts both the program instructions and data; a single bus interface services all program and data accesses. Processors based on a Harvard architecture model the system memory for program instructions and data as physically and logically separate storage structures with separate bus interfaces, one for instructions and the other for data. The Harvard architecture supports simultaneous access for program instructions and data, whereas the von Neumann architecture does not.

To choose an optimized memory design, a designer must also understand the application’s behavior and requirements. Considerations for the memory design are: How will application data enter and exit the system, and will the processor directly load the data or will an external agent, such as a DMA controller, load the data into the processor’s local RAM? You must ask similar questions about outputs: Will the processor directly drive the output ports, or will an external agent, such as a DMA controller, transfer the data from the processor’s local RAM to an I/O interface? Other questions include: What is the application’s start-up scenario; can the system make efficient use of special memory interfaces; and can the on-chip-memory resources accommodate all or even just the performance-sensitive code and data of the application?

The application start-up requirements affect where you can store the initialization code and through what interfaces the system can retrieve it. On-chip OTP (one-time-programmable) ROM is useful for storing boot code because it is small with high silicon density. It supports fast start-up because it needs no wait time after start-up to begin executing. The initialization code could reside in and execute in place from flash memory; it could also reside in off-chip memory and be shadowed into on-chip instruction RAM, which can result in longer system start-up. If the application code and data can reside in on-chip memories, it might be unnecessary to support off-chip-memory interfaces. If the performance-sensitive program code can fit into local memories, the designer may not need to implement caches.

Designers can tailor the processor to the known constraints of the application they are targeting to include only the amount of random and nonvolatile memory the application requires. The sizing and parameters of TCMs, caches, or special memories target the application. Processors targeting a wider set of applications typically implement a generalized memory architecture that includes the maximum resource requirements of the set of applications, with variants of the device offering fewer resources to meet lower costs. For systems with similar processor-core architectures, the memory subsystem becomes a higher-order driver for differentiating the system’s deliverable processing performance, power consumption, and price (see sidebar “Multiple options” in the Web version of this article at www.edn.com/070621cs).

Memory controllers abstract the implementation of the memory block they service so that it appears as a data pipe to the processor system. They contain the logic necessary to read the memory block and, as appropriate to the type of memory they service, write, refresh, test, and correct errors. For on-chip memories, the memory controller can manifest the company’s proprietary innovations, which differentiate its processor device from a similar device by its competitor. As a result, most processor vendors are unwilling to discuss the specifics of their memory controllers. They hint at techniques for use in memory controllers, including using wide data buses, multiplexed or staggered access of banks, buffering, pipelining, and transaction reordering, as well as specialized and speculative access patterns.

In addition to the characteristics of the implemented memory, system-level factors that affect the design and efficiency of memory controllers include how physical addresses map to the internal representation of the memory system; the type of addressing patterns, such as burst, random, and concurrent-access patterns; the mix of reads and writes; and how unused memory enters low-power modes. A memory controller’s primary usage model normally dictates its architecture, such that a graphics or a multimedia memory controller might optimize for sequential accesses, whereas a memory controller for an embedded-communication-system application might optimize random accesses over a large memory span. For those embedded memories with system-level reliability requirements, the memory controller can provide ECC (error-correction-code) protection at the cost of additional complexity.
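One common way to map physical addresses onto a banked memory is to interleave consecutive bursts across banks so that sequential traffic keeps every bank busy. The sketch below illustrates that idea only; the function name, the four-bank geometry, and the burst and row sizes are assumptions and do not describe any vendor's controller.

```python
def map_address(physical_address, num_banks=4, burst_bytes=32, row_bytes=1024):
    """Split a physical byte address into (bank, row, column).

    Consecutive bursts rotate through the banks, so a sequential stream
    can overlap the access latency of one bank with transfers from
    the others.
    """
    burst_index = physical_address // burst_bytes
    bank = burst_index % num_banks
    burst_in_bank = burst_index // num_banks
    bursts_per_row = row_bytes // burst_bytes
    row = burst_in_bank // bursts_per_row
    column = (burst_in_bank % bursts_per_row) * burst_bytes \
        + physical_address % burst_bytes
    return bank, row, column


# Banks touched by a sequential stream of 32-byte bursts.
sequential_banks = [map_address(a)[0] for a in range(0, 256, 32)]
# Banks touched by accesses that are num_banks * burst_bytes apart.
strided_banks = [map_address(a)[0] for a in range(0, 1024, 128)]
```

With this mapping a sequential stream cycles through banks 0, 1, 2, 3 repeatedly, whereas a 128-byte stride lands on bank 0 every time. The same mapping therefore suits one traffic pattern and penalizes another, which is why the usage model drives the controller's design.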

The traffic pattern at the memory controller differs significantly between single- and multiple-processor-core systems. A memory controller for a single-core system may serve a single stream, but the memory controller for a shared memory in a multicore system might need to handle multiple streams and random traffic. For multicore designs, the memory architecture must enable fast and efficient message passing and data sharing between processors. Although different approaches exist for accomplishing these goals, no single configuration is efficient for all types of communications. Fast, point-to-point channels and queues are essential to exchange short and critical messages, whereas shared memory is better for sharing larger data structures. When using shared memories, users need programming support for synchronization and memory management.
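The division of labor between short messages and shared data can be sketched in software, using threads as stand-ins for processor cores. This is only an analogy under stated assumptions: the function and variable names are hypothetical, a Python queue stands in for a hardware point-to-point channel, and a lock stands in for whatever synchronization support the platform provides.

```python
import queue
import threading


def run_demo(frame_bytes=1024):
    """Pass bulk data through shared memory and a short notice via a queue."""
    mailbox = queue.Queue(maxsize=4)       # point-to-point channel
    shared_frame = bytearray(frame_bytes)  # shared memory region
    frame_lock = threading.Lock()          # synchronization for the region
    results = []

    def producer():
        with frame_lock:
            for i in range(frame_bytes):
                shared_frame[i] = 0xAB
        # Only a small descriptor travels through the queue.
        mailbox.put(("frame_ready", frame_bytes))

    def consumer():
        message, length = mailbox.get()    # blocks until the producer signals
        with frame_lock:
            checksum = sum(shared_frame[:length])
        results.append((message, checksum))

    workers = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results[0]


outcome = run_demo()
```

The queue carries only a two-field descriptor, while the kilobyte of payload stays in place in the shared buffer. The consumer cannot read a half-written frame because it waits for the message and then takes the lock, which is the kind of synchronization support the text says shared memories require.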

MORE AT EDN.COM
  • For an article by Senior Technical Editor Brian Dipert about memory, go to www.edn.com/article/CA181882.
  • For a related article about allocating memory and dealing with fragmentation, visit www.edn.com/article/CA421501.
  • For an article on handset-memory considerations, visit www.edn.com/article/CA498769.
  • Go to www.edn.com/070621cs and click on Feedback Loop to post a comment on this article.

As more embedded systems incorporate multiple cores, especially heterogeneous cores, as part of the design, development tools will most likely evolve to better assist developers with the spatial and temporal placement of code and data to sustain better latency tolerance and squeeze out better performance in increasingly complex designs. The development tools must assist developers in better understanding the global behavior of the system and matching that behavior with memory subsystems available in the system. Otherwise, memory and chip designers must continue to incorporate ever more complex control algorithms in their memory controllers to invisibly compensate for software designers’ and development tools’ lack of visibility into the behavior of the memory system.

REFERENCES
  1. Schuchmann, David, “Tuning an Application to Prevent M1 Memory Contention,” Application Note AN3076, Freescale Semiconductor, May 2006, www.freescale.com/files/32bit/doc/app_note/AN3076.pdf?fsrch=1.
  2. “How to Optimize SoC Performance Through Memory Tuning,” White Paper, Tensilica, www.tensilica.com/products/WP_optimize_memory-tuning.htm.

You can reach Technical Editor Robert Cravotta at 1-661-296-5096 and rcravotta@edn.com.


 
