Convolutional Neural Network
It is challenging to deploy compute-intensive and parameter-rich Convolutional Neural Networks (CNNs) on mobile devices because of limited hardware resources and low power budgets. To support multiple concurrently running applications, one mobile device needs to perform multiple CNN tests simultaneously in real time. Previous solutions cannot guarantee a high enough frame rate when serving multiple applications at a reasonable hardware and power cost. We presented a novel process-in-memory architecture to process emerging binary CNN tests in Wide-IO2 DRAMs. We proposed XNOR-POP to process CNN tests with competitive accuracy for mobile devices by XNORing and popcounting in Wide-IO2 DRAMs. XNOR-POP conducts XNORs inside DRAM arrays, transfers the XNOR results over through-silicon vias (TSVs), and completes popcounts on the logic die. We also presented XNOR Loop Unrolling, which takes full advantage of the large DRAM row buffers to increase XNOR operation parallelism in DRAM arrays. XNOR Loop Unrolling supports a broad range of convolutional layer parameters, e.g., kernel size. Finally, we evaluated XNOR-POP and compared it against state-of-the-art designs. Our experimental results show that it improves CNN test performance by 4× to 11× and reduces test energy consumption by more than 90% on average.
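The XNOR-and-popcount arithmetic that XNOR-POP relies on can be illustrated in software. The following is a minimal sketch, not the hardware design: it assumes that +1/-1 values are packed into the bits of an integer (bit 1 for +1, bit 0 for -1), and the function names `binary_dot` and `reference_dot` are hypothetical.

```python
def binary_dot(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors packed as n-bit integers.

    Assumed encoding (for illustration only): bit 1 means +1, bit 0 means -1.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask      # 1 wherever the two signs agree
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n                # agreements minus disagreements


def reference_dot(a_bits, w_bits, n):
    """Slow reference: unpack each bit to -1/+1 and multiply element-wise."""
    total = 0
    for i in range(n):
        a = 1 if (a_bits >> i) & 1 else -1
        w = 1 if (w_bits >> i) & 1 else -1
        total += a * w
    return total


print(binary_dot(0b1011, 0b1101, 4))  # prints 0
```

In this example the two vectors agree in two positions and disagree in two, so the result is 0. The XNOR step corresponds to the work done inside the DRAM arrays, and the popcount step to the work done on the logic die.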
Phase Change Memories
Compared to DRAM, PCM suffers from short cell endurance: a typical PCM cell can usually tolerate only 10^8 writes, and the lifetime of a chip is determined by a few weak cells that endure only 10^6 writes. Most applications tend to write heavily to a small address region in main memory. Wear leveling is used to distribute write traffic across the entire PCM memory space. Without wear leveling, a PCM chip may fail within several minutes. Wear leveling in a PCM main memory system uses a static or dynamic memory-line-level shuffling scheme to randomize the mapping between physical and device memory addresses in a continuous memory space. Any dead memory block that fragments the memory space makes wear leveling schemes ineffective. We proposed an operating-system-transparent, hardware-based salvaging scheme that maintains a continuous memory space for the PCM wear leveling scheme and retires broken memory lines, so that main memory capacity degrades gracefully. A large PCM RESET current also significantly shortens cell endurance. Due to process variation, different cells require different optimal RESET current amplitudes to achieve the maximum cell endurance. However, one region, e.g., a line, a page, or an array of PCM cells, can only adopt the largest RESET current so that all cells in the region are written correctly. In this case, most cells in the region are over-RESET and fail to achieve their maximum lifetime. We proposed fine-grained current regulation and voltage up-scaling to cut down the RESET current, leaving a small number of difficult-to-reset cells unused. An error correction code (ECC) is used to cover these cells. Multi-level cell (MLC) PCM generally relies on a large RESET current to achieve the full resistance spectrum; therefore, the typical endurance of an MLC PCM cell is only 10^5-10^6 writes. We presented elastic RESET, which constructs non-2^n-state (3-state) MLC PCM cells to reduce the RESET current and the maximum resistance in one cell.
Instead of storing two bits in one cell, our technique combines two cells to store three bits. The data of most applications can be compressed and stored without enlarging the memory storage requirement. In summary, with our techniques, wear leveling remains effective when broken lines appear in the PCM-based main memory system. The over-RESET problem is mitigated in both single-level cell (SLC) and MLC PCM-based main memory systems. Thus, the lifetime of PCM-based main memory is substantially prolonged.
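Two 3-state cells offer 3 × 3 = 9 combined states, which is enough to hold the 8 values of three bits, leaving one state unused. The sketch below shows one possible mapping; the base-3 assignment and the names `encode3` and `decode3` are assumptions made for illustration, not the mapping used in the actual design.

```python
def encode3(value):
    """Map a 3-bit value (0..7) onto two 3-state cells, each holding 0, 1, or 2.

    Hypothetical mapping: the value is written as a two-digit base-3 number.
    """
    if not 0 <= value < 8:
        raise ValueError("value must fit in three bits")
    return divmod(value, 3)  # (high cell, low cell)


def decode3(cell_hi, cell_lo):
    """Recover the 3-bit value from a pair of 3-state cells."""
    if cell_hi not in (0, 1, 2) or cell_lo not in (0, 1, 2):
        raise ValueError("each cell must hold 0, 1, or 2")
    value = cell_hi * 3 + cell_lo
    if value >= 8:
        raise ValueError("cell pair (2, 2) is unused in this mapping")
    return value


print(encode3(7))  # prints (2, 1)
```

Under this mapping each cell stores 1.5 bits on average instead of 2, which is the capacity gap that compression is meant to close.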
Compared to DRAM, another major disadvantage of PCM is its slow write operation (~32× slower than DRAM). Because of cell process variation, composition fluctuation, and the relatively small differences among resistance levels, MLC PCM typically employs an iterative write scheme to achieve precise control, which leads to long write latency. The first technique we proposed is Write Truncation (WT), which dynamically identifies the cells that require more iterations to write and truncates their last several iterations to finish a PCM write earlier. An extra ECC is introduced to cover the erroneous states of those cells. Through truncation, WT significantly reduces the number of iterations of a write operation. To mitigate the storage overhead of the ECC, we also presented Form Switch (FS), which uses frequent pattern compression to compress a line and free up storage space. Unlike SLC PCM, which has a read latency similar to that of DRAM, MLC PCM suffers from a doubled read latency. Therefore, if a PCM line can be compressed to less than half of its size, it can be stored in SLC form rather than two-bit MLC form. Since SLC PCM has shorter access latency and better write endurance than MLC PCM, accessing the line in SLC form accelerates performance-critical read operations. To improve write performance on SLC PCM, we presented WoM-SET, a low-power proactive-SET-based write strategy. The WoM (write-once memory) code guarantees that only RESETs are required during every two writes. Because PCM writes are asymmetric (SET is much slower than RESET), the PCM main memory system gains performance from the short RESET latency. By applying WoM-SET only to frequently written pages, the extra space requirement of the WoM code is restricted.
Moreover, the non-volatility of MLC PCM can be traded for better performance and lower write energy: a write with short latency and small energy results in a short retention time and requires further refresh operations, whereas a write with long latency and large energy enjoys a long retention time. We designed a compiler-directed dual-write selection scheme for embedded systems. Based on static analysis of memory write instructions, our technique estimates the worst-case lifetime of the written data, which guides the selection of the best write mode for each memory write instruction. In short, with our techniques, the write performance of both SLC and MLC PCM main memory can be significantly improved. With shorter write latency, the bank busy periods are substantially shortened, so read operations on the critical path are blocked less often by PCM writes. Furthermore, MLC PCM read performance can also be boosted by transforming MLC lines into SLC lines.
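To make the write-once memory idea behind WoM-SET concrete, the sketch below implements the classic Rivest-Shamir WoM code, which stores a 2-bit value twice in 3 cells while changing cells in one direction only. The text above does not state which WoM code WoM-SET uses, so this particular code is an assumption. The sketch also assumes that a 0-to-1 transition stands for the single fast operation (a RESET on a line that has been proactively SET); the names `wom_write` and `wom_decode` are hypothetical.

```python
# First-generation codewords of the Rivest-Shamir <2 bits, 3 cells> WoM code.
FIRST = {0b00: 0b000, 0b01: 0b100, 0b10: 0b010, 0b11: 0b001}
DECODE_FIRST = {code: data for data, code in FIRST.items()}


def wom_decode(cells):
    """Read the 2-bit value stored in a 3-cell pattern."""
    if bin(cells).count("1") <= 1:
        return DECODE_FIRST[cells]           # first-generation codeword
    return DECODE_FIRST[cells ^ 0b111]       # second generation: complement


def wom_write(cells, data):
    """Return a new pattern that stores `data` using only 0-to-1 transitions."""
    if wom_decode(cells) == data:
        return cells                          # value already stored
    for candidate in (FIRST[data], FIRST[data] ^ 0b111):
        if candidate & cells == cells:        # no cell has to go from 1 to 0
            return candidate
    raise RuntimeError("cells exhausted: the line must be erased first")


cells = wom_write(0b000, 0b01)   # first write:  000 -> 100
cells = wom_write(cells, 0b10)   # second write: 100 -> 101
print(format(cells, "03b"), wom_decode(cells))  # prints: 101 2
```

A third write of a different value raises `RuntimeError` here, which corresponds to the point where the slow operation can no longer be avoided. The code spends 3 cells on 2 bits, which is the space overhead that restricting WoM-SET to frequently written pages is meant to limit.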
An MLC PCM write has one RESET pulse and a varying number of SET pulses. The RESET pulse is short and of large magnitude, while the SET pulse is long and of low magnitude. In addition, when writing one PCM line, most cells in the line require only a small number of SET pulses. Allocating power according to the RESET power request and for the duration of the longest cell write is therefore power inefficient. Moreover, one heavily written (hot) PCM chip may block the memory subsystem even though most memory chips are idle. This phenomenon arises because the power that each chip can provide is restricted by the area of its charge pump. When multiple writes compete for a single chip, some writes have to wait to avoid exceeding the capability of the charge pump; otherwise, cell writes become unreliable. Therefore, we proposed two new fine-grained power budgeting (FPB) schemes to address these problems: 1) FPB-IPM is a scheme that regulates write power on each write iteration in MLC PCM. Since writing one MLC line requires multiple iterations with step-down power requirements, FPB-IPM aims to reclaim any unused write power after each iteration and to reduce the maximum power requested in a write operation by splitting the first RESET iteration into several RESET iterations. By enabling more MLC line writes in parallel, FPB-IPM improves memory throughput. 2) FPB-GCP is a scheme that mitigates power restrictions at the chip level. Instead of enlarging the charge pump in each individual PCM chip, FPB-GCP integrates a single global charge pump (GCP) on a DIMM. It dynamically pumps (boosts) extra power to hot chips in the DIMM. Since the GCP has a lower effective power efficiency (i.e., the percentage of power that can be utilized to write cells), we also considered different cell layout optimizations to maximize throughput. Our techniques achieve significant improvements in write throughput and system performance.
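The benefit of reclaiming unused power after each iteration can be seen in a toy scheduling model. The sketch below is an illustration under simplifying assumptions, not the FPB-IPM design: power is expressed in arbitrary units, every iteration takes one time step, each write has a non-increasing (step-down) power profile, writes are admitted in order, and RESET splitting is not modeled. The function name `total_steps` and the numbers are hypothetical.

```python
def total_steps(profiles, budget, per_iteration):
    """Count the time steps needed to finish all writes under a power budget.

    profiles: list of non-empty, non-increasing per-iteration power demands.
    per_iteration: if True, a write holds only the power of its current
        iteration; if False, it holds its peak power until it completes.
    """
    pending = list(profiles)
    active = []  # entries are [profile, index of the current iteration]
    steps = 0

    def held(entry):
        profile, i = entry
        return profile[i] if per_iteration else profile[0]

    while pending or active:
        used = sum(held(e) for e in active)
        # Admit queued writes in order while their first iteration fits.
        # Because profiles never increase, fitting now means fitting later.
        while pending and used + pending[0][0] <= budget:
            profile = pending.pop(0)
            active.append([profile, 0])
            used += profile[0]
        if not active:
            raise ValueError("a single write exceeds the power budget")
        steps += 1
        for entry in active:
            entry[1] += 1
        active = [e for e in active if e[1] < len(e[0])]
    return steps


writes = [[8, 4, 2, 1]] * 4
print(total_steps(writes, 16, per_iteration=False))  # prints 8
print(total_steps(writes, 16, per_iteration=True))   # prints 6
```

With the peak power held for the whole write, only two writes fit in the budget at a time. Releasing power after each iteration lets the third and fourth writes start while the first two are still finishing their low-power iterations.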
MLC STT-MRAM has become a promising candidate for constructing last-level caches in high-end embedded processors. However, long write latency and large write energy limit the performance and energy efficiency of MLC STT-MRAM-based caches. We addressed these limitations with two novel designs: line pairing (LP) and line swapping (LS). LP forms fast, low-power cache lines by reorganizing MLC soft bits, which are faster to write and have a lower write power. LS dynamically stores frequently written data (write-hot lines) in these fast, low-power cache lines. With our techniques, the MLC STT-MRAM-based cache achieves better performance and lower power consumption.