Understanding Embedded Memory: ITCM, DTCM, and DDR for Optimal Performance


Why Memory Architecture Matters in Embedded Systems

Embedded engineers often encounter a puzzling scenario: the same code running on the same processor performs swiftly in one context but sluggishly in another. The root cause frequently lies in where code and data reside in memory. Desktop and server CPUs rely on multi-level hardware caches to hide memory latency. Many embedded processors, especially those based on ARM Cortex-M and Cortex-R cores, instead give the developer direct control over distinct memory regions, each with very different performance characteristics.

Understanding Embedded Memory: ITCM, DTCM, and DDR for Optimal Performance — Source: www.freecodecamp.org

This article explores the three primary memory types in such systems: ITCM, DTCM, and DDR. You’ll learn how they differ, how to strategically place code and data, and how to profile firmware memory usage over time.

What Is ITCM (Instruction Tightly-Coupled Memory)?

ITCM, or Instruction Tightly-Coupled Memory, is a dedicated memory region attached directly to the processor’s instruction fetch unit. It provides single-cycle access for executable code, making execution from it deterministic and extremely fast. Sizes are device-specific but typically range from tens of kilobytes to a few hundred kilobytes; many STM32H7 parts, for example, have 64 KB of ITCM. ITCM is ideal for time-critical functions, such as interrupt service routines, real-time control loops, and frequently executed algorithms, where every cycle counts.

Because it is tightly coupled, ITCM bypasses the complexities of caching and bus arbitration. The processor fetches instructions from ITCM without any stall cycles, ensuring predictable performance. However, its limited capacity means you must carefully select which code to place there.
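As a minimal sketch of marking one such time-critical function for ITCM, the block below wraps GCC’s section attribute in a macro. It assumes a GCC/Clang ELF toolchain and a linker script that maps a section named .itcm_text into the ITCM region; the macro name ITCM_FUNC and the pi_step controller (with illustrative gains) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical macro: on GCC/Clang ELF targets, place the function in the
 * ".itcm_text" section, which the linker script is assumed to map to ITCM.
 * noinline keeps a separate copy of the body so it actually lands there
 * instead of being inlined into callers that live in slower memory. */
#if defined(__GNUC__) && defined(__ELF__)
#define ITCM_FUNC __attribute__((section(".itcm_text"), noinline))
#else
#define ITCM_FUNC /* other toolchains: no placement, plain function */
#endif

/* Illustrative time-critical routine: one step of an integer PI controller
 * with Kp = 2 and Ki = 1/4 (gains chosen only for the example). */
ITCM_FUNC int32_t pi_step(int32_t error, int32_t *integral)
{
    *integral += error;                 /* accumulate the integral term */
    return error * 2 + *integral / 4;   /* proportional + integral output */
}
```

On the host this compiles and runs normally; the attribute only changes where the linker puts the code, not what it computes.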

What Is DTCM (Data Tightly-Coupled Memory)?

DTCM, or Data Tightly-Coupled Memory, mirrors ITCM but for data. It offers single-cycle, deterministic access for variables, stacks, and buffers. Sizes are likewise device-specific, typically tens to a few hundred kilobytes (many STM32H7 parts have 128 KB of DTCM). DTCM is well suited to latency-sensitive data structures, such as real-time state variables, the stack of a high-priority task, or DMA buffers on devices whose DMA controllers can reach DTCM (not all can).

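Like ITCM, DTCM avoids cache misses and contention on the main system bus. Because the two TCMs sit on separate ports, code should be directed to ITCM and data to DTCM explicitly; whether a given core also permits data accesses to ITCM or instruction fetches from DTCM differs between cores, so check its technical reference manual. Proper allocation in the linker script is crucial to realize these benefits.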
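The data side works the same way. The sketch below keeps a small ring buffer of recent samples, the kind of state a real-time loop touches on every iteration, in a section assumed to be named .dtcm_data and mapped to DTCM by the linker script. The DTCM_DATA macro, the sample_window_t type, and window_push are hypothetical names for illustration.

```c
#include <stdint.h>

/* Hypothetical macro: place a variable in ".dtcm_data", assumed to be mapped
 * to DTCM (with its initial values copied there at startup). */
#if defined(__GNUC__) && defined(__ELF__)
#define DTCM_DATA __attribute__((section(".dtcm_data")))
#else
#define DTCM_DATA /* other toolchains: default placement */
#endif

#define SAMPLE_COUNT 64u

/* Latency-sensitive state: a ring buffer of the most recent samples. */
typedef struct {
    int16_t  samples[SAMPLE_COUNT];
    uint32_t head;  /* index of the next slot to overwrite */
    int32_t  sum;   /* running sum of all buffered samples */
} sample_window_t;

static sample_window_t g_window DTCM_DATA = { {0}, 0u, 0 };

/* Insert a sample, evicting the oldest, and return the window average. */
int32_t window_push(int16_t sample)
{
    g_window.sum -= g_window.samples[g_window.head];
    g_window.samples[g_window.head] = sample;
    g_window.sum += sample;
    g_window.head = (g_window.head + 1u) % SAMPLE_COUNT;
    return g_window.sum / (int32_t)SAMPLE_COUNT;
}
```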

What Is DDR (Double Data Rate) Memory?

DDR (Double Data Rate) memory is the main system memory in many embedded designs. It is typically larger—ranging from 4 MB to several GB—but slower and non-deterministic. Access times involve multiple cycles and can vary due to bus arbitration, refresh cycles, and external memory controller overhead. DDR is suitable for bulk storage, less time-critical code, large data arrays, and non-real-time tasks.

Some designs use external DDR chips, while others integrate DDR on-package. In either case, the latency is orders of magnitude higher than TCM, making it essential to keep performance-critical code and data off DDR.

Comparing ITCM, DTCM, and DDR

The table below summarizes the key differences:

Memory | Contents | Access | Typical size
ITCM | Instructions | Single-cycle, deterministic | Tens to a few hundred KB
DTCM | Data (variables, stacks, buffers) | Single-cycle, deterministic | Tens to a few hundred KB
DDR | Everything else | Multi-cycle, variable latency | 4 MB to several GB

ITCM and DTCM are on-chip, tightly coupled to the CPU core, while DDR is typically off-chip or on a separate bus. The speed gap can be tens or hundreds of cycles for a single access.
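When verifying that a buffer or function really ended up where you intended, it helps to check its address against the memory map. The sketch below does that lookup. The ITCM and DTCM base addresses follow the Cortex-M7 defaults (0x00000000 and 0x20000000), but the sizes and the DDR window are placeholder assumptions; take the real values from your device’s reference manual.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry in a (hypothetical) memory map. */
typedef struct {
    const char *name;
    uintptr_t   base;
    uintptr_t   size;
} mem_region_t;

/* Placeholder map: Cortex-M7 default TCM bases, assumed sizes and DDR base. */
static const mem_region_t k_regions[] = {
    { "ITCM", 0x00000000u, 64u * 1024u },
    { "DTCM", 0x20000000u, 128u * 1024u },
    { "DDR",  0x80000000u, 32u * 1024u * 1024u },
};

/* Return the name of the region containing addr, or "unmapped".
 * The unsigned subtraction wraps to a huge value when addr < base,
 * so a single comparison checks both ends of the range. */
const char *region_of(uintptr_t addr)
{
    for (size_t i = 0; i < sizeof k_regions / sizeof k_regions[0]; ++i) {
        if (addr - k_regions[i].base < k_regions[i].size) {
            return k_regions[i].name;
        }
    }
    return "unmapped";
}
```

On a target you could call region_of((uintptr_t)&some_variable) from a debug console to confirm placement after a linker-script change.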

How to Place Code and Data in the Correct Memory Region

Placement is controlled via the linker script (e.g., GNU ld linker script). You define memory regions and then assign sections (like .text for code, .data for initialized data, .bss for uninitialized data) to those regions. For example:

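Define the memory regions: ITCM (rx) : ORIGIN = ..., DTCM (rw) : ORIGIN = ..., DDR (rwx) : ORIGIN = ....
Map sections: place time-critical functions into .itcm_text and critical data into .dtcm_data, giving each a separate load address (for example with GNU ld’s AT> directive) where its initial contents are stored.
Use __attribute__((section(".itcm_text"))) in C code to mark functions for ITCM placement, and a data section such as .dtcm_data for variables.
Copy at startup: TCM is volatile RAM, so the reset handler must copy .itcm_text and .dtcm_data from their load addresses, and zero any zero-initialized DTCM data, before main() runs.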
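Because TCM is volatile, whatever the linker places there must be copied in from its load address at boot. The sketch below shows that copy and the matching zero-fill as plain functions. On a real target the pointers would come from linker-script symbols; the names in the comment are hypothetical and must match whatever your linker script defines.

```c
#include <stdint.h>

/* Copy a section image from its load address (e.g. flash) to its run
 * address in ITCM or DTCM, one 32-bit word at a time.
 * Example call from a reset handler (symbol names are hypothetical):
 *   extern uint32_t __itcm_start[], __itcm_end[];
 *   extern const uint32_t __itcm_load_start[];
 *   tcm_init_section(__itcm_start, __itcm_end, __itcm_load_start);
 * This routine must not itself be placed in ITCM, since ITCM is not yet
 * populated when it runs. */
void tcm_init_section(uint32_t *run_start, const uint32_t *run_end,
                      const uint32_t *load_start)
{
    while (run_start < run_end) {
        *run_start++ = *load_start++;
    }
}

/* Zero a range, as needed for zero-initialized variables placed in DTCM. */
void tcm_zero_section(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}
```

Taking plain pointers rather than hard-coding symbols keeps the routine testable on a host, where ordinary arrays can stand in for flash and TCM.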

Profiling tools, such as perf on Linux or the DWT cycle counter and other hardware performance counters available on many Cortex-M cores, can help identify hotspots that should be moved to TCM. Over time, you may adjust the linker script based on real usage patterns.
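Once a profile is available, deciding what to move into a small ITCM is a budgeting problem. The sketch below uses a simple greedy heuristic: rank functions by profiler samples per byte of code and take the densest ones that still fit. It is an illustration of one reasonable approach, not an optimal solver (the exact problem is a 0/1 knapsack), and the func_profile_t layout and the profile numbers in the test are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One profiled function: code size and profiler samples attributed to it. */
typedef struct {
    const char *name;
    uint32_t    size_bytes;
    uint32_t    samples;
    int         selected;  /* set to 1 if chosen for ITCM */
} func_profile_t;

/* qsort comparator: higher samples-per-byte first.
 * Cross-multiplies instead of dividing to stay in integer arithmetic. */
static int by_density_desc(const void *a, const void *b)
{
    const func_profile_t *fa = a, *fb = b;
    uint64_t lhs = (uint64_t)fa->samples * fb->size_bytes;
    uint64_t rhs = (uint64_t)fb->samples * fa->size_bytes;
    return (lhs < rhs) - (lhs > rhs);
}

/* Greedily mark the densest functions that fit within budget bytes.
 * Reorders the array by density and returns the number of bytes used. */
uint32_t select_for_itcm(func_profile_t *funcs, size_t count, uint32_t budget)
{
    uint32_t used = 0;
    qsort(funcs, count, sizeof funcs[0], by_density_desc);
    for (size_t i = 0; i < count; ++i) {
        funcs[i].selected = 0;
        if (funcs[i].size_bytes <= budget - used) {  /* used never exceeds budget */
            funcs[i].selected = 1;
            used += funcs[i].size_bytes;
        }
    }
    return used;
}
```

Density rather than raw sample count is the ranking key because a large, moderately hot function can crowd out several small, very hot ones such as ISRs.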

Common Pitfalls to Avoid

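Overloading TCM: Assigning more code or data to ITCM/DTCM than fits makes the link fail (GNU ld reports that the region has overflowed). Resolving this by pushing sections back to DDR without profiling data can move exactly the hot code you meant to accelerate, defeating the purpose.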
Ignoring alignment: Some TCM implementations require specific alignment for optimal performance; misaligned accesses may stall.
Assuming uniform speed: Not all TCM regions are equal; check the system reference manual for actual latency.
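Neglecting cache coherence: TCM is not cached, but DDR usually is. When a DMA engine reads or writes a buffer in cacheable DDR, the CPU’s cached copy and main memory can diverge, so the buffer needs explicit cache clean or invalidate operations (or an uncached mapping), or it should live in TCM if the DMA controller can reach it.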
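For a DMA buffer that stays in cacheable memory, a common precaution is to align it to a cache-line boundary and pad it to whole lines, so that clean and invalidate operations on the buffer never touch a neighbouring variable. The sketch below assumes 32-byte cache lines, as on the Cortex-M7; the macro and variable names are hypothetical, and the cache maintenance calls themselves (for example CMSIS SCB_InvalidateDCache_by_Addr) are only mentioned in a comment because they exist only on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte data cache lines (true for the Cortex-M7; check your core). */
#define CACHE_LINE_BYTES 32u

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_ALIGN_UP(n) \
    (((n) + CACHE_LINE_BYTES - 1u) & ~(size_t)(CACHE_LINE_BYTES - 1u))

#define RX_PAYLOAD_BYTES 100u

/* DMA receive buffer: starts on a cache-line boundary and occupies whole lines.
 * On the target, invalidate its lines after the DMA transfer completes and
 * before the CPU reads it (e.g. CMSIS SCB_InvalidateDCache_by_Addr). */
static _Alignas(CACHE_LINE_BYTES) uint8_t g_rx_buf[CACHE_ALIGN_UP(RX_PAYLOAD_BYTES)];

/* Expose the buffer and its padded length to a (hypothetical) DMA driver. */
uint8_t *rx_buffer(size_t *len)
{
    *len = sizeof g_rx_buf;
    return g_rx_buf;
}
```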

Performance and Power Considerations

Using TCM can drastically reduce execution time for tight loops compared with running from DDR, sometimes by a large factor; the actual gain depends on how often DDR accesses miss in the cache and on the memory controller’s latency. TCM also lowers power consumption, because off-chip memory accesses consume significant energy per transfer. However, TCM SRAM draws static (leakage) power even when unused, so sizing it appropriately matters.
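A back-of-the-envelope model shows why the speedup depends so heavily on the miss rate. The sketch below charges every access a hit latency and every miss an extra penalty. This simple additive model ignores overlapping misses, prefetching, and write buffers, and the figures in the test (10,000 accesses, a 5% miss rate, a 100-cycle penalty) are illustrative assumptions, not measurements.

```c
#include <stdint.h>

/* Estimated cycles spent on memory accesses:
 *   cycles = accesses * hit_cycles + misses * miss_penalty_cycles
 * For TCM, misses is zero and hit_cycles is 1. */
uint64_t memory_cycles(uint64_t accesses, uint64_t misses,
                       uint32_t hit_cycles, uint32_t miss_penalty_cycles)
{
    return accesses * hit_cycles + misses * miss_penalty_cycles;
}
```

With those assumed numbers, a loop running from TCM costs 10,000 cycles while the same loop running from cached DDR costs 60,000, a 6x difference; halving the miss rate or the penalty changes that ratio substantially, which is why measured speedups vary so widely between systems.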

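For battery-powered devices, moving critical code to ITCM can extend runtime by shortening wake periods and reducing active power. DDR, in turn, can be placed in a low-power self-refresh mode while the system is idle, which works best when the code running during those periods does not execute from DDR.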

Conclusion

Mastering ITCM, DTCM, and DDR is essential for high-performance embedded firmware. By understanding the trade-offs in speed, size, and determinism, you can place code and data where it runs best. Start by profiling your firmware’s memory accesses, then adjust your linker script accordingly. With careful planning, you’ll avoid the performance pitfalls that plague many embedded projects.