TDT01
Server CPU Microarchitecture
- 1 Profiling a warehouse-scale computer
- 2 Weeding out Front-End Stalls with Uneven Block Size Instruction Cache
- 3 ACIC: Admission-Controlled Instruction Cache
Sustainability and Performance Analysis
- 4 FOCAL: A First-Order Carbon Model to Assess Processor Sustainability
- 5 Per-Instruction Cycle Stacks Through Time-Proportional Event Analysis
- 6 AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures
Hardware generation and compilers
- 7 RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture
- 8 R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges
- 9 Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations
1 Profiling a warehouse-scale computer
Motivation
- Increasing prevalence of warehouse-scale (WSC) and cloud computing
Findings
- WSC workloads are extremely diverse (there is no "killer application" to optimize for.)
- "Datacenter tax" can comprise nearly 30% of cycles
- Common traits of WSC applications – low IPC, large instruction footprints, bimodal ILP, and a preference for low memory latency over high bandwidth
Keywords
- Microarchitecture analysis
- Datacenter tax: common low-level functions which show potential for specialized hardware
- "wimpy" cores
- Top-Down performance analysis
- CPI stacks
- SMT
- Datacenter-specific SoCs
- I-prefetchers, i/d-cache partitioning
Alternative Approaches
- CloudSuite
- DCBench
- Kozyrakis et al. present data on internet-scale workloads from Microsoft, but their study focuses more on system-level Amdahl ratios rather than microarchitectural implications
2 Weeding out Front-End Stalls with Uneven Block Size Instruction Cache
Motivation
- The core front-end remains a critical bottleneck in modern server workloads owing to their multi-MB instruction footprints stemming from deep software stacks.
Findings
- About 60% of the bytes in a cache block are never accessed before the block is evicted
- Large variability in spatial locality across the instruction stream demands variable cache block sizes
- Present Uneven Block Size (UBS) cache
- simple and highly storage-efficient cache organization with unevenly sized ways that gracefully accommodates the varying spatial locality.
- improves the storage efficiency by 32 percentage points over the baseline instruction cache.
- accommodates more than twice as many blocks as a conventional cache within a given storage budget.
- approaches the performance of a conventional 64KB cache with only a 32KB footprint.
- Present a useful-byte predictor that weeds out cold code and caches only the hot code
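The storage-efficiency metric behind these findings (the fraction of cached bytes actually touched before eviction) can be sketched as follows; the trace format and the toy numbers are made up, not from the paper:

```python
# Sketch: storage efficiency = bytes touched before eviction / total bytes
# held in resident cache blocks.
def storage_efficiency(blocks):
    """blocks: list of (block_size_bytes, set_of_accessed_byte_offsets),
    one entry per block residency in the cache (hypothetical format)."""
    used = sum(len(accessed) for _, accessed in blocks)
    total = sum(size for size, _ in blocks)
    return used / total

# Toy example: two 64 B blocks, one fully used, one with only 16 B touched.
eff = storage_efficiency([(64, set(range(64))), (64, set(range(16)))])
print(f"{eff:.2%}")  # 62.50%
```

In this picture, UBS raises efficiency by sizing ways so that mostly-cold blocks do not occupy full 64 B slots.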
Keywords
- Core front-end bottleneck
- Instruction cache
- Stalling
- Cache ways
- Spatial locality & cache block sizing
- Predictor
- Cache hit, cache partial miss
- MPKI (misses per kilo instructions)
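As a reminder of the metric named above, MPKI normalizes miss counts to every thousand instructions executed, making caches of different workloads comparable. A one-line sketch with made-up numbers:

```python
# MPKI: misses per kilo (thousand) instructions.
def mpki(misses, instructions):
    return misses * 1000 / instructions

# Toy example: 4,500 i-cache misses over 1M retired instructions.
print(mpki(4500, 1_000_000))  # 4.5
```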
Alternative Approaches
- Prior work has mainly investigated instruction prefetching and cache replacement policies to mitigate this bottleneck.
- ACIC (3)
- GHRP
- Line Distillation
- UBS can work in concert with ACIC and GHRP, since insertion policy, replacement policy, and block size are complementary aspects of cache design.
3 ACIC: Admission-Controlled Instruction Cache
Motivation
- Similar to UBS (2)
Findings
- Burstiness in accesses to instruction blocks
- Admission-Controlled Instruction Cache: i-Filter + temporal locality predictor
- ACIC provides a 1.0223× speedup over an LRU-managed i-cache
Keywords
- ACIC: Admission-Controlled Instruction Cache
- Uses cache burst history rather than cache access history to predict whether to keep a block or not (temporal locality).
- i-Filter: separate spatial from temporal accesses
- victim blocks
- Predictor
- History register table (HRT)
- Pattern table (PT)
- Admission control
- CSHR (Comparison Status Holding Register)
Alternative Approaches
- LRU (default)
- Replacement algorithms
- Bypassing mechanisms
- Victim caches
- UBS (2)
4 FOCAL: A First-Order Carbon Model to Assess Processor Sustainability
Motivation
- Sustainability and global warming
- Environmental impact of computing
Findings
- The paper distills its results into 15 numbered findings (#1–#15)
- A case study illustrates how FOCAL can guide the design of future processors that deliver higher performance while incurring a smaller environmental impact, by leveraging technology innovation judiciously.
Keywords
- Normalized carbon footprint (NCF) metric
- Strongly, weakly or less sustainable design choices
- Embodied carbon footprint: emissions from hardware manufacturing and infrastructure
- Operational footprint: emissions from the energy usage during a device's lifetime
- E2O: the ratio of embodied to operational footprint, used as a weighting parameter
- Jevons' paradox: an improvement in efficiency leads to an increase in demand (embodied) and/or usage (operational)
- Fixed-work vs. fixed-time
- Both better: strongly sustainable
- Only fixed work: weakly sustainable
- Both worse: less sustainable
- Amdahl's law
- Inherent data uncertainty
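A minimal sketch of how these pieces fit together, using assumed definitions (total carbon as embodied plus operational, normalized to a baseline) and made-up numbers rather than the paper's exact formulas:

```python
# Sketch (assumed definitions, hypothetical numbers — not FOCAL's exact model).
def ncf(design, baseline):
    """Normalized carbon footprint: design carbon / baseline carbon.
    Each argument: (embodied_kgCO2, operational_kgCO2)."""
    return sum(design) / sum(baseline)

def classify(ncf_fixed_work, ncf_fixed_time):
    """Sustainability classification from the two evaluation scenarios."""
    if ncf_fixed_work < 1 and ncf_fixed_time < 1:
        return "strongly sustainable"   # better in both scenarios
    if ncf_fixed_work < 1:
        return "weakly sustainable"     # better only under fixed work
    return "less sustainable"           # worse overall

baseline = (30.0, 70.0)        # E2O weight = 30/70 here
new_chip = (40.0, 45.0)        # more embodied, less operational carbon
print(ncf(new_chip, baseline))           # 0.85
print(classify(0.85, 1.1))               # weakly sustainable
```

The E2O split matters because a design that trades extra embodied carbon for operational savings only pays off if the device is used long enough.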
Alternative Approaches
- ACT model: the key difference is that FOCAL is a top-down, parameterized model in contrast to ACT which is a bottom-up, data-driven approach.
5 Per-Instruction Cycle Stacks Through Time-Proportional Event Analysis
Motivation
- Understanding what applications spend time on and why is critical for effective performance optimization. State-of-the-art performance analysis tools are generally unable to provide this information.
Findings
- State-of-the-art tools attribute execution time to the instructions and performance events whose latency the architecture happens to expose, not necessarily to the true culprits.
- TEA makes manual performance optimization more effective
- TEA provides the foundation for a new class of automatic performance optimization approaches
Keywords
- TIP: Time-Proportional Instruction Profiling
- TEA: Time-Proportional Event Analysis
- TP vs. NTP profiling
- Moore's law
- Amdahl's law
- PICS: per-instruction cycle stacks
- Performance profiling vs. performance event analysis
- Profile-guided optimization (PGO)
- Commit stage
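The per-instruction cycle stack (PICS) representation can be sketched as a simple aggregation of attributed samples; the sample format, addresses, and numbers here are invented for illustration, not TEA's actual output:

```python
from collections import defaultdict

# Hypothetical sample format: (instruction address, attributed event, cycles).
samples = [
    (0x400a10, "L2-miss", 40),
    (0x400a10, "base", 1),
    (0x400a14, "branch-mispredict", 12),
    (0x400a10, "L2-miss", 38),
]

# Aggregate into per-instruction cycle stacks: address -> {event: cycles}.
pics = defaultdict(lambda: defaultdict(int))
for pc, event, cycles in samples:
    pics[pc][event] += cycles

for pc, stack in sorted(pics.items()):
    print(f"{pc:#x}: {sum(stack.values())} cycles -> {dict(stack)}")
```

The point of TIP/TEA is that the cycles fed into such a stack are time-proportional, so each instruction's total reflects its actual share of execution time.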
Alternative Approaches
- Current NTP profilers, e.g., Intel PEBS, AMD IBS, Arm SPE, and IBM RIS, all have situations in which they attribute samples correctly and other situations where they misattribute samples.
- Performance profiling (e.g., Arm, Intel, Drongowski, Anderson et al., Dean et al., Gottschall et al., and IBM) and performance event analysis (e.g., Intel, Anderson et al., Yasin, and Eyerman et al.).
6 AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures
Motivation
- Specialization is the key approach for continued performance growth beyond the end of Dennard scaling.
- We are fast approaching an era in which early-stage accelerator analysis is critical for maintaining the productivity of software developers, system software designers, and computer architects
- We are missing an abstraction level that compares the same work on different accelerators.
Findings
- Existing approaches fall short because they either adopt a level of abstraction that is too low or too high.
- AccMe yields an average error of 5.6%, a significant improvement over the 20.6% average error of a curve-fitted Roofline model
- Three possible areas of usage: early-stage accelerator selection, scheduling compute jobs, and architectural exploration.
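The curve-fitted Roofline baseline that AccMe is compared against builds on the standard Roofline bound (attainable performance = min(peak compute, bandwidth × arithmetic intensity)); a minimal sketch with invented accelerator parameters:

```python
# Standard Roofline bound (toy parameters, not from the paper).
def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Attainable GFLOP/s: compute-bound or memory-bound, whichever is lower."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# Toy accelerator: 100 GFLOP/s peak, 25 GB/s memory bandwidth.
print(roofline(100.0, 25.0, 1.0))   # 25.0  -> memory-bound kernel
print(roofline(100.0, 25.0, 8.0))   # 100.0 -> compute-bound kernel
```

AccMe's claim is that feeding in the AIO plus kernel/accelerator parameters captures behavior this two-parameter bound misses.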
Keywords
- Architecture-Independent Operation (AIO)
- the key algorithmic operation used to exploit data-level parallelism.
- AccMe: a performance model that takes the AIO as the input and combines it with kernel and accelerator parameters to predict performance
- Dennard scaling
- Domain-Specific Accelerators (DSAs)
- Processing in Memory (PIM) accelerators
- Accelerator design space
- DSA: Conventional NPM.
- SCM: Logic-side PUM.
- Ambit: Memory-side PUM.
- UPMEM: Memory-side NDP.
Alternative Approaches
- Performance modeling of accelerators
- Accelerator offloading
- Accelerator architecture
7 RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture
Motivation
- Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors.
- To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally.
Findings
- RipTide compiles applications written in C
- saves 25% energy v. the state-of-the-art energy-minimal CGRA and 6.6× energy v. a von Neumann core.
- runs C programs with near-ASIC efficiency
- is faster than prior energy-minimal CGRAs
Keywords
- RipTide is a co-designed compiler and CGRA architecture
- Coarse-grained reconfigurable arrays (CGRAs). Both programmable and efficient
- PE and NoC
- Amdahl efficiency bottleneck
- Integer linear programming (ILP) vs. SAT solvers: mapping the optimized DFG to hardware
- ACIC
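The "Amdahl efficiency bottleneck" above (and the Amdahl's law entries under papers 4 and 5) boil down to the standard formula: overall speedup is capped by the fraction of execution that is not accelerated. A one-function sketch with made-up numbers:

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the fraction
# of execution that benefits and s is the local speedup factor.
def amdahl_speedup(accelerated_fraction, factor):
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# Accelerating 90% of execution by 10x yields only ~5.26x overall.
print(amdahl_speedup(0.9, 10))
```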
Alternative Approaches
- Alternatives to RipTide
- SNAFU forces the programmer to program in vector assembly.
- Alternatives to CGRA
- ASIC (Application-specific integrated circuit)
- FPGA
8 R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges
Motivation
- Static HLS struggles with irregular application code.
- Since dynamically scheduled hardware is inherently data flow based, it is beneficial to have an IR that captures the global data flow to enable easier transformations.
Findings
- R-HLS enables fine-grained analyses, optimizations, and the creation of distributed, resource-efficient memory disambiguation.
- R-HLS consistently produces circuits with reduced cycle counts and significantly lower resource usage.
Keywords
- High-level synthesis (HLS)
- Dynamically scheduled hardware
- Intermediate representation (IR)
- Regionalized Value State Dependence Graph (RVSDG)
- R-HLS: IR for HLS, and a dialect of RVSDG
- Intra block vs. inter block dependencies
- Out-of-order (OoO) scheduled core vs. in-order (InO)
- Very long instruction word (VLIW) cores
- Distributed memory disambiguation with ADDR-Q
- Load-store queue (LSQ)
- FF and DSP utilization
Alternative Approaches
- State-of-the-art dynamic HLS utilize control flow based IRs, which model data flow only at the basic block level, requiring the rediscovery of inter-block parallelism.
- Straight to the Queue (StoQ)
- Dynamatic
9 Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations
Motivation
- Hardware development practices lag far behind software development practices
- Why don’t hardware engineers write reusable libraries?
- HCLs and HCFs can enable new hardware libraries to be independent of underlying process technologies
Findings
- FIRRTL (Flexible Intermediate Representation for RTL) transforms target-independent RTL into technology-specific RTL.
- To enable hardware libraries, this paper contributes the following:
- (1) a reemphasis on how HCLs provide language expressivity to enable reusability
- (2) how our hardware compiler framework, FIRRTL, allows for generating target-specific RTL
- (3) the wide-ranging applications of a hardware compiler framework
- Case study: 94% of this design was reused
- Chisel or Verilog front ends translate designs into FIRRTL (the IR); transformation passes optimize it, and the resulting FIRRTL can be tailored to different simulators, FPGAs, or ASICs.
Keywords
- Hardware description languages (HDL)
- Register-transfer level (RTL): a way of describing a circuit (less generic than HDL)
- Hardware construction languages (HCL)
- Hardware compiler frameworks (HCF)
- FIRRTL (Flexible Intermediate Representation for RTL)
Alternative Approaches
- Chisel