Data-plane-processor IP quadruples data bandwidth, doubles instruction size to 128 bits
by Mike Demler -- EDN Europe, 12 May 2011
Tensilica’s new Xtensa LX4 DPU (dataplane processor) for SOCs (systems on chips) has four times the local data-memory bandwidth of the previous-generation LX3 DPU. The LX4 supports as many as two 512-bit load/store operations per cycle and doubles VLIW (very-long-instruction-word) instruction width from 64 bits to 128 bits for increased parallel processing. By applying the cache-memory-prefetch feature in the LX4, you can increase performance in systems with long off-chip latency by fetching data from system memory before its use. Tensilica has applied the capabilities of the LX4 DPU in the recently introduced ConnX BBE (baseband engine) 64 DSP for LTE (long-term-evolution) Advanced communications (see “High-performance DSP-IP cores are ready for LTE-Advanced,” EDN, March 3, 2011, pg 12).

With the Xtensa LX4 DPU, you can create wide SIMD (single-instruction/multiple-data) DSPs that send more data per clock cycle to MAC (multiply/accumulate) units. You can apply the Xtensa LX4 in applications such as wired and wireless baseband processing, video preprocessing and postprocessing, image-signal processing, and network-packet-processing functions. With Tensilica’s customizable local port and queue interfaces, you can also make connections between Xtensa DPUs and other system blocks in the same manner as with traditional RTL (register-transfer-level)-block interconnections.
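
To make the wide-SIMD idea concrete, the following is a minimal sketch in portable C, not Tensilica code. It assumes 16-bit samples, so that one 512-bit load carries 32 of them, and models a SIMD MAC array as 32 independent accumulators that each take one multiply/accumulate per block. The function name dot_i16 and the constant names are hypothetical and appear nowhere in Tensilica's tools.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: 16-bit samples packed into one 512-bit
   vector, which gives 512 / 16 = 32 lanes per load. */
enum { VECTOR_BITS = 512, SAMPLE_BITS = 16, LANES = VECTOR_BITS / SAMPLE_BITS };

/* Dot product arranged as LANES independent accumulators. Each pass of
   the outer loop corresponds to one vector-wide multiply/accumulate. */
int64_t dot_i16(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc[LANES] = {0};
    size_t i = 0;

    /* Full blocks: every lane performs one MAC on its own element. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            acc[lane] += (int32_t)x[i + lane] * h[i + lane];
        }
    }

    /* Reduce the per-lane partial sums to a single scalar. */
    int64_t sum = 0;
    for (size_t lane = 0; lane < LANES; ++lane) {
        sum += acc[lane];
    }

    /* Scalar tail for the elements that do not fill a whole block. */
    for (; i < n; ++i) {
        sum += (int32_t)x[i] * h[i];
    }
    return sum;
}
```

On scalar hardware the inner loop runs 32 times per block; the point of a wide SIMD datapath is that those 32 multiply/accumulates can be issued together, provided the memory system can deliver 32 samples of each operand per cycle.
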
The LX4 DPU increases the size of Tensilica’s FLIX (flexible-length-instruction extensions) instructions from 64 to 128 bits, so that you can execute twice as many independent operations per clock cycle. You can intermix FLIX instructions with Tensilica’s shorter Xtensa base instructions to achieve smaller code size than that of other VLIW DSPs. Tensilica’s Xtensa C/C++ compiler automatically extracts parallelism from source code and bundles multiple operations into single FLIX instructions.
The DPU also comes with the company’s Vectorization Assistant tool, which offers suggestions to developers on how to improve compiler vectorization of their C code when running on SIMD DSPs. The tool provides explanations of which operations are preventing further vectorization so that you can improve your C code to take advantage of the DPU’s parallel execution units. The base Xtensa LX4 DPU can achieve speeds greater than 1 GHz. Implemented in a 45-nm 45GS process technology, the IP (intellectual property) occupies an area of 0.044 mm².
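
As an illustration of the kind of obstacle a vectorization report can point to, the sketch below shows possible pointer aliasing, a common blocker for vectorizing compilers in general. The article does not say which diagnostics the Vectorization Assistant actually emits, so treat this as a generic C99 example; the function name scale_add is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Without the restrict qualifiers, the compiler has to assume that out
   may overlap a or b, so a store to out[i] could change a later load.
   That possible dependence can keep the loop scalar. The qualifiers
   promise that the three arrays do not overlap, which leaves every
   iteration independent and the loop open to vectorization. */
void scale_add(int32_t *restrict out,
               const int16_t *restrict a,
               const int16_t *restrict b,
               int16_t gain,
               size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Widen before multiplying so the product fits in 32 bits. */
        out[i] = (int32_t)a[i] * gain + b[i];
    }
}
```

The caller must honor the promise: passing overlapping buffers to a function declared this way is undefined behavior in C.
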