TDT01
Server CPU Microarchitecture
- 1 Profiling a warehouse-scale computer
- 2 Weeding out Front-End Stalls with Uneven Block Size Instruction Cache
- 3 ACIC: Admission-Controlled Instruction Cache
Sustainability and Performance Analysis
- 4 FOCAL: A First-Order Carbon Model to Assess Processor Sustainability
- 5 Per-Instruction Cycle Stacks Through Time-Proportional Event Analysis
- 6 AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures
Hardware generation and compilers
- 7 RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture
- 8 R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges
- 9 Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations
1 Profiling a warehouse-scale computer
Motivation
- Increasing prevalence of warehouse-scale (WSC) and cloud computing
Findings
- WSC workloads are extremely diverse (there is no "killer application" to optimize for.)
- "Datacenter tax" can comprise nearly 30% of cycles
- Common traits of WSC applications – low IPC, large instruction footprints, bimodal ILP, and a preference for low memory latency over high bandwidth
Keywords
- Microarchitecture analysis
- Datacenter tax: common low-level functions which show potential for specialized hardware
- "wimpy" cores
- Top-Down performance analysis
- CPI stacks
- SMT
- Datacenter-specific SoCs
- I-prefetchers, i/d-cache partitioning
Alternative Approaches
- CloudSuite
- DCBench
- Kozyrakis et al. present data on internet-scale workloads from Microsoft, but their study focuses more on system-level Amdahl ratios rather than microarchitectural implications
2 Weeding out Front-End Stalls with Uneven Block Size Instruction Cache
Motivation
- The core front-end remains a critical bottleneck in modern server workloads owing to their multi-MB instruction footprints stemming from deep software stacks.
Findings
- About 60% of the bytes in a cache block are never accessed before the block is evicted
- Large variability in spatial locality across the instruction stream demands variable cache block sizes
- Present Uneven Block Size (UBS) cache
- simple and highly storage-efficient cache organization with unevenly sized ways that gracefully accommodates the varying spatial locality.
- improves the storage efficiency by 32 percentage points over the baseline instruction cache.
- accommodates more than twice as many blocks as a conventional cache within a given storage budget.
- approaches the performance of a conventional 64KB cache with only a 32KB footprint.
- Present a useful-byte predictor that weeds out cold code and caches only the hot code
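The storage-efficiency metric behind these findings (the fraction of cached bytes actually touched before eviction) can be sketched as follows; the trace format and the toy numbers are made up, not from the paper:

```python
# Sketch: storage efficiency = bytes touched before eviction / total bytes
# held in resident cache blocks.
def storage_efficiency(blocks):
    """blocks: list of (block_size_bytes, set_of_accessed_byte_offsets),
    one entry per block residency in the cache (hypothetical format)."""
    used = sum(len(accessed) for _, accessed in blocks)
    total = sum(size for size, _ in blocks)
    return used / total

# Toy example: two 64 B blocks, one fully used, one with only 16 B touched.
eff = storage_efficiency([(64, set(range(64))), (64, set(range(16)))])
print(f"{eff:.2%}")  # 62.50%
```

In this picture, UBS raises efficiency by sizing ways so that mostly-cold blocks do not occupy full 64 B slots.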
Keywords
- Core front-end bottleneck
- Instruction cache
- Stalling
- Cache ways
- Spatial locality & cache block sizing
- Predictor
- Cache hit, cache partial miss
- MPKI (misses per kilo instructions)
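As a reminder of the metric named above, MPKI normalizes miss counts to every thousand instructions executed, making caches of different workloads comparable. A one-line sketch with made-up numbers:

```python
# MPKI: misses per kilo (thousand) instructions.
def mpki(misses, instructions):
    return misses * 1000 / instructions

# Toy example: 4,500 i-cache misses over 1M retired instructions.
print(mpki(4500, 1_000_000))  # 4.5
```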
Alternative Approaches
- Prior work has mainly investigated instruction prefetching and cache replacement policies to mitigate this bottleneck.
- ACIC (3)
- GHRP
- Line Distillation
- UBS can work in concert with ACIC and GHRP, since insertion policy, replacement policy, and block size are complementary aspects of cache design.
3 ACIC: Admission-Controlled Instruction Cache
Motivation
- Similar to UBS (2)
Findings
- Burstiness in accesses to instruction blocks
- Admission-Controlled Instruction Cache: i-Filter + temporal locality predictor
- ACIC provides a 1.0223× speedup over an LRU-managed i-cache
Keywords
- ACIC: Admission-Controlled Instruction Cache
- Uses cache burst history rather than cache access history to predict whether to keep a block or not (temporal locality).
- i-Filter: separate spatial from temporal accesses
- victim blocks
- Predictor
- History register table (HRT)
- Pattern table (PT)
- Admission control
- CSHR (Comparison Status Holding Register)
Alternative Approaches
- LRU (default)
- Replacement algorithms
- Bypassing mechanisms
- Victim caches
- UBS (2)
4 FOCAL: A First-Order Carbon Model to Assess Processor Sustainability
Motivation
- Sustainability and global warming
- Environmental impact of computing
Findings
- The paper distills its results into 15 numbered findings (#1–#15)
- A case study illustrates how FOCAL can guide the design of future processors that deliver higher performance while incurring a smaller environmental impact, by leveraging technology innovation judiciously.
Keywords
- Normalized carbon footprint (NCF) metric
- Strongly, weakly or less sustainable design choices
- Embodied carbon footprint: emissions from hardware manufacturing and infrastructure
- Operational footprint: emissions from the energy usage during a device's lifetime
- E2O: the ratio of embodied to operational footprint, used as a weighting parameter
- Jevons' paradox: an improvement in efficiency leads to an increase in demand (embodied) and/or usage (operational)
- Fixed-work vs. fixed-time
- Both better: strongly sustainable
- Only fixed work: weakly sustainable
- Both worse: less sustainable
- Amdahl's law
- Inherent data uncertainty
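A minimal sketch of how these pieces fit together, using assumed definitions (total carbon as embodied plus operational, normalized to a baseline) and made-up numbers rather than the paper's exact formulas:

```python
# Sketch (assumed definitions, hypothetical numbers — not FOCAL's exact model).
def ncf(design, baseline):
    """Normalized carbon footprint: design carbon / baseline carbon.
    Each argument: (embodied_kgCO2, operational_kgCO2)."""
    return sum(design) / sum(baseline)

def classify(ncf_fixed_work, ncf_fixed_time):
    """Sustainability classification from the two evaluation scenarios."""
    if ncf_fixed_work < 1 and ncf_fixed_time < 1:
        return "strongly sustainable"   # better in both scenarios
    if ncf_fixed_work < 1:
        return "weakly sustainable"     # better only under fixed work
    return "less sustainable"           # worse overall

baseline = (30.0, 70.0)        # E2O weight = 30/70 here
new_chip = (40.0, 45.0)        # more embodied, less operational carbon
print(ncf(new_chip, baseline))           # 0.85
print(classify(0.85, 1.1))               # weakly sustainable
```

The E2O split matters because a design that trades extra embodied carbon for operational savings only pays off if the device is used long enough.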
Alternative Approaches
- ACT model: the key difference is that FOCAL is a top-down, parameterized model in contrast to ACT which is a bottom-up, data-driven approach.
5 Per-Instruction Cycle Stacks Through Time-Proportional Event Analysis
Motivation
- Understanding what applications spend time on and why is critical for effective performance optimization. State-of-the-art performance analysis tools are generally unable to provide this information.
Findings
- State-of-the-art tools attribute execution time to the instructions and performance events whose latency the architecture happens to expose, not necessarily to the true culprits.
- TEA makes manual performance optimization more effective
- TEA provides the foundation for a new class of automatic performance optimization approaches
Keywords
- TIP: Time-Proportional Instruction Profiling
- TEA: Time-Proportional Event Analysis
- TP vs. NTP profiling
- Moore's law
- Amdahl's law
- PICS: per-instruction cycle stacks
- Performance profiling vs. performance event analysis
- Profile-guided optimization (PGO)
- Commit stage
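The per-instruction cycle stack (PICS) representation can be sketched as a simple aggregation of attributed samples; the sample format, addresses, and numbers here are invented for illustration, not TEA's actual output:

```python
from collections import defaultdict

# Hypothetical sample format: (instruction address, attributed event, cycles).
samples = [
    (0x400a10, "L2-miss", 40),
    (0x400a10, "base", 1),
    (0x400a14, "branch-mispredict", 12),
    (0x400a10, "L2-miss", 38),
]

# Aggregate into per-instruction cycle stacks: address -> {event: cycles}.
pics = defaultdict(lambda: defaultdict(int))
for pc, event, cycles in samples:
    pics[pc][event] += cycles

for pc, stack in sorted(pics.items()):
    print(f"{pc:#x}: {sum(stack.values())} cycles -> {dict(stack)}")
```

The point of TIP/TEA is that the cycles fed into such a stack are time-proportional, so each instruction's total reflects its actual share of execution time.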
Alternative Approaches
- Current NTP profilers, e.g., Intel PEBS, AMD IBS, Arm SPE, and IBM RIS, all have situations in which they attribute samples correctly and other situations where they misattribute samples.
- Performance profiling (e.g., Arm, Intel, Drongowski, Anderson et al., Dean et al., Gottschall et al., and IBM) and performance event analysis (e.g., Intel, Anderson et al., Yasin, and Eyerman et al.).
6 AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures
Motivation
- Specialization is the key approach for continued performance growth beyond the end of Dennard scaling.
- We are fast approaching an era in which early-stage accelerator analysis is critical for maintaining the productivity of software developers, system software designers, and computer architects
- We are missing an abstraction level that compares the same work on different accelerators.
Findings
- Existing approaches fall short because they either adopt a level of abstraction that is too low or too high.
- AccMe yields an average error of 5.6%, a significant improvement over the 20.6% average error of a curve-fitted Roofline model
- Three possible areas of usage: early-stage accelerator selection, scheduling compute jobs, and architectural exploration.
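The curve-fitted Roofline baseline that AccMe is compared against builds on the standard Roofline bound (attainable performance = min(peak compute, bandwidth × arithmetic intensity)); a minimal sketch with invented accelerator parameters:

```python
# Standard Roofline bound (toy parameters, not from the paper).
def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Attainable GFLOP/s: compute-bound or memory-bound, whichever is lower."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# Toy accelerator: 100 GFLOP/s peak, 25 GB/s memory bandwidth.
print(roofline(100.0, 25.0, 1.0))   # 25.0  -> memory-bound kernel
print(roofline(100.0, 25.0, 8.0))   # 100.0 -> compute-bound kernel
```

AccMe's claim is that feeding in the AIO plus kernel/accelerator parameters captures behavior this two-parameter bound misses.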
Keywords
- Architecture-Independent Operation (AIO)
- the key algorithmic operation used to exploit data-level parallelism.
- AccMe: a performance model that takes the AIO as the input and combines it with kernel and accelerator parameters to predict performance
- Dennard scaling
- Domain-Specific Accelerators (DSAs)
- Processing in Memory (PIM) accelerators
- Accelerator design space
- DSA: Conventional NPM.
- SCM: Logic-side PUM.
- Ambit: Memory-side PUM.
- UPMEM: Memory-side NDP.
Alternative Approaches
- Performance modeling of accelerators
- Accelerator offloading
- Accelerator architecture
7 RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture
Motivation
- Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors.
- To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally.
Findings
- RipTide compiles applications written in C
- saves 25% energy v. the state-of-the-art energy-minimal CGRA and 6.6× energy v. a von Neumann core.
- runs C programs with near-ASIC efficiency
- is faster than prior energy-minimal CGRAs
Keywords
- RipTide is a co-designed compiler and CGRA architecture
- Coarse-grained reconfigurable arrays (CGRAs). Both programmable and efficient
- PE and NoC
- Amdahl efficiency bottleneck
- Integer linear programming (ILP) vs. SAT solvers: mapping the optimized DFG to hardware
- ACIC
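The "Amdahl efficiency bottleneck" above (and the Amdahl's law entries under papers 4 and 5) boil down to the standard formula: overall speedup is capped by the fraction of execution that is not accelerated. A one-function sketch with made-up numbers:

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the fraction
# of execution that benefits and s is the local speedup factor.
def amdahl_speedup(accelerated_fraction, factor):
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# Accelerating 90% of execution by 10x yields only ~5.26x overall.
print(amdahl_speedup(0.9, 10))
```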
Alternative Approaches
- Alternatives to RipTide
- SNAFU forces the programmer to program in vector assembly.
- Alternatives to CGRA
- ASIC (Application-specific integrated circuit)
- FPGA
8 R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges
Motivation
- Static HLS struggles with irregular application code.
- Since dynamically scheduled hardware is inherently data flow based, it is beneficial to have an IR that captures the global data flow to enable easier transformations.
Findings
- R-HLS enables fine-grained analyses, optimizations, and the creation of distributed, resource-efficient memory disambiguation.
- R-HLS consistently produces circuits with reduced cycle counts and significantly lower resource usage.
Keywords
- High-level synthesis (HLS)
- Dynamically scheduled hardware
- Intermediate representation (IR)
- Regionalized Value State Dependence Graph (RVSDG)
- R-HLS: IR for HLS, and a dialect of RVSDG
- Intra block vs. inter block dependencies
- Out-of-order (OoO) scheduled core vs. in-order (InO)
- Very long instruction word (VLIW) cores
- Distributed memory disambiguation with ADDR-Q
- Load-store queue (LSQ)
- FF and DSP utilization
Alternative Approaches
- State-of-the-art dynamic HLS utilize control flow based IRs, which model data flow only at the basic block level, requiring the rediscovery of inter-block parallelism.
- Straight to the Queue (StoQ)
- Dynamatic
9 Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations
Motivation
- Hardware development practices lag far behind software development practices
- Why don’t hardware engineers write reusable libraries?
- HCLs and HCFs can enable new hardware libraries to be independent of underlying process technologies
Findings
- FIRRTL (Flexible Intermediate Representation for RTL) transforms target-independent RTL into technology-specific RTL.
- To enable hardware libraries, this paper contributes the following:
- (1) a reemphasis on how HCLs provide language expressivity to enable reusability
- (2) how our hardware compiler framework, FIRRTL, allows for generating target-specific RTL
- (3) the wide-ranging applications of a hardware compiler framework
- Case study: 94% of this design was reused
- Chisel or Verilog front ends translate designs into FIRRTL (the IR); transformation passes optimize it, and the resulting FIRRTL can be tailored to different simulators, FPGAs, or ASICs.
Keywords
- Hardware description languages (HDL)
- Register-transfer level (RTL): a way of describing a circuit (less generic than HDL)
- Hardware construction languages (HCL)
- Hardware compiler frameworks (HCF)
- FIRRTL (Flexible Intermediate Representation for RTL)
Alternative Approaches
- Chisel