Lecture 2: Resources and computer architecture
Interaction of computer architecture and the OS, resources and their management
Important questions:
- What are the building blocks of a computer system?
- Which resources are represented by these building blocks?
- How does code (in the OS) interact with hardware resources?
- What are the most relevant developments in computer architecture of the last decades and which problems/benefits are related to these developments?
Computers as they are no more
- The typical von Neumann-style computer
- Addressable unified memory for code and data
- I/O devices in the same or a different address range
- Optional: Interrupts notify CPU of the completion of an I/O operation
- Optional: I/O devices can use DMA (direct memory access) to transfer data to memory without CPU interaction
Asynchronous execution: interrupts
- Access to I/O devices is often slow
- With polling, the CPU sends a command and then repeatedly checks until the device returns data
- With interrupts, the device notifies the program when data is ready (see the sketch after this list)
- This changes the control flow the CPU executes!
- More complex to develop software for
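To make the contrast concrete, here is a minimal C sketch of both styles. The device register addresses, the status bit, and the way the handler is registered are hypothetical, purely for illustration.

```c
#include <stdint.h>

/* Hypothetical memory-mapped device registers; the addresses and the
 * READY bit are made up for illustration. */
#define DEV_STATUS (*(volatile uint32_t *)0x40000000u)
#define DEV_DATA   (*(volatile uint32_t *)0x40000004u)
#define STATUS_READY 0x1u

/* Polling: the CPU busy-waits until the device signals completion. */
uint32_t read_polling(void)
{
    while (!(DEV_STATUS & STATUS_READY))
        ;                       /* CPU does nothing useful here */
    return DEV_DATA;
}

/* Interrupt-driven: the CPU runs other code in the meantime; when the
 * device is ready, it interrupts the CPU, which invokes this handler -
 * the control flow changes asynchronously. Registration with the
 * interrupt controller is platform-specific and omitted here. */
volatile uint32_t latest_data;

void device_interrupt_handler(void)
{
    latest_data = DEV_DATA;     /* fetch the data the device produced */
}
```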
Buses
- Components of the computer are connected by buses
- Address bus: identifies the addressed component
- Data bus: transfers the actual information
- Control bus: carries metainformation (read/write, interrupt)
- CPU has control over the bus
- Exception: DMA
Getting a bit more real
- Simple model of execution only works efficiently if the speed of memory is on par with the speed of the CPU
- This was the case until ca. 1980
- Today: 'memory gap':
- CPU speed ~ 10 000x faster, but memory speed only ~ 10x faster
Introducing a memory hierarchy
- Idea: introduce caches
- Small, but fast intermediate levels of memory
- Caches can only hold a partial copy of the whole memory
- Unified caches vs. separate instruction and data caches
- Expensive to manufacture
- Later: introduction of multiple levels of cache (L1, L2, L3, ...)
- Each one bigger but slower than the previous one
- Caches work efficiently due to two locality principles:
- Temporal locality: a program accessing some part of memory is likely to access the same memory again soon
- Spatial locality: a program accessing some part of memory is likely to access nearby memory next (see the traversal example after this list)
- The further from the CPU:
- Increasing size
- Decreasing speed
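Both principles can be shown with a short C sketch: summing a matrix row by row follows the memory layout, while column-by-column traversal jumps through memory. The matrix size is an arbitrary illustrative choice.

```c
#include <stddef.h>

#define N 1024
static double m[N][N];

/* Good spatial locality: C stores rows contiguously, so consecutive
 * iterations touch neighbouring addresses in the same cache lines.
 * Good temporal locality too: s and the indices are reused constantly. */
double sum_row_major(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor spatial locality: each access jumps N * sizeof(double) bytes
 * ahead, so most accesses miss the cache. */
double sum_col_major(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```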
Memory impact: non-functional properties
- Memory has a large influence on non-functional properties of a system
- Average, best and worst case performance, throughput and latencies
- Power and energy consumption
- Reliability and security
- Non-functional properties depend on many parameters of memory, e.g.:
- Cache architecture
- Memory type
- Alignment and aliasing of data (see the struct layout example below)
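To illustrate the alignment point: in the C sketch below, member order alone changes a struct's size and thus its cache footprint. The sizes in the comments assume a typical 64-bit ABI.

```c
#include <stdio.h>
#include <stdint.h>

/* The compiler inserts padding so each member is naturally aligned;
 * member order therefore affects the size of the struct. */
struct scattered {          /* typically 24 bytes */
    char     a;             /* 1 byte + 7 bytes padding */
    uint64_t b;             /* 8 bytes, needs 8-byte alignment */
    char     c;             /* 1 byte + 7 bytes tail padding */
};

struct reordered {          /* typically 16 bytes */
    uint64_t b;             /* 8 bytes */
    char     a;             /* 1 byte */
    char     c;             /* 1 byte + 6 bytes tail padding */
};

int main(void)
{
    printf("scattered: %zu bytes\n", sizeof(struct scattered));
    printf("reordered: %zu bytes\n", sizeof(struct reordered));
    return 0;
}
```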
When one processor is not enough
- Moore's Law (1965)
- Observation that the number of transistors in a dense integrated circuit (IC) doubles about every two years
- Accordingly, CPU speed increased thanks to ever smaller semiconductor structures
- This development is hitting physical limitations
- CPU frequencies 'stuck' at ~3 GHz
- Energy consumption is additional limiting factor
- What can we do with all these transistors?
- Bigger caches - energy hungry and prone to faults!
- Put more processors on a chip!
- Earlier high-end systems already used multiple separate processor chips
- Old as well as new problems (with multiple cores):
- Memory throughput now has to satisfy demands of n processors
- Software now has to support execution on multiple processors
- Caches need to be coherent so they hold the same copies of main memory data
More processors, more memories
- Problem: Memory throughput now has to satisfy demands of n processors
- Provide each processor with its own main memory!
- NUMA (non-uniform memory access)
- And new problems show up:
- How to access data in another CPU's memory?
- Who decides which CPU is allowed to use the bus?
- Is a common bus still efficient?
On-chip communication
- Use high-speed networks instead of conventional buses
- Using ideas from computer networking
- On-chip network can achieve high throughput and low latencies
Heterogeneous systems: GPGPUs
- In modern computers, not only CPUs can execute code
- GPGPUs (general purpose graphics processing units)
- Massively parallel processors for typical parallel tasks
- 3D graphics, signal processing, machine learning, bitcoin mining etc.
- Few features for protection, security
- Traditionally, only a single program at a time could access the GPU, and only for drawing
- In modern systems, multiple programs want direct access to the GPGPU
- How can the OS multiplex the GPGPU safely and securely?
Security
- There's another important non-functional property!
- Multiple programs running simultaneously
- e.g. an online banking application and a video player
- How can we prevent the video player from accessing memory of the banking app?
- Prevent access to non-permitted memory ranges
- The MMU makes only the memory 'belonging' to the running program visible to it
The MMU
- Idea: intercept 'virtual' addresses generated by the CPU
- MMU checks for 'allowed' addresses
- Translates allowed addresses to 'physical' addresses in main memory using a translation table
- Problem: translation table for each single address would be large
- Split memory into pages of identical size (power of 2)
- Apply the same translation to all addresses in the page: page table
- MMUs were originally separate ICs sitting between CPU and RAM; today they are integrated into the CPU
Page table structure
- Find a compromise page size allowing both flexibility and efficiency
- Smaller pages mean less internal fragmentation but larger page tables; larger pages mean the opposite
The memory translation process
- The MMU splits the 32-bit virtual address coming from the CPU into three parts:
- 10 bits (31-22) page directory entry (PDE) number
- 10 bits (21-12) page table entry (PTE) number
- 12 bits (11-0) page offset inside the referenced page (untranslated)
- Translation process
- Read PDE entry from directory -> address of one page table
- Read PTE entry from table -> physical base address of memory page
- Add the offset from the original virtual address to obtain the complete physical memory address (see the sketch after this list)
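Concretely, the split described above can be written down in a few lines of C; the example address is arbitrary and the table lookups are only indicated in comments.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t vaddr = 0xC0FFEE42u;              /* arbitrary example address */

    uint32_t pde_idx = (vaddr >> 22) & 0x3FFu; /* bits 31-22: 10 bits */
    uint32_t pte_idx = (vaddr >> 12) & 0x3FFu; /* bits 21-12: 10 bits */
    uint32_t offset  =  vaddr        & 0xFFFu; /* bits 11-0: untranslated */

    printf("PDE index %u, PTE index %u, offset 0x%03x\n",
           (unsigned)pde_idx, (unsigned)pte_idx, (unsigned)offset);

    /* The MMU would now (1) read page_directory[pde_idx] to find the
     * page table, (2) read page_table[pte_idx] to find the physical
     * page base, and (3) compute paddr = base + offset. */
    return 0;
}
```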
Speeding up the translation
- Where is the page table stored?
- Can be several MB in size -> doesn't fit on the CPU
- Page dir and page table are in main memory!
- With address translation, each memory access by a program requires three main memory accesses: one for the PDE, one for the PTE, one for the actual data!
- Solution: use a cache!
- The MMU uses a special cache on the CPU - the translation lookaside buffer (TLB); a small software model follows below
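As a purely conceptual illustration, here is a small software model of a direct-mapped TLB in C; real TLBs are hardware structures and are often fully associative.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12          /* 4 KiB pages */

struct tlb_entry {
    uint32_t vpn;               /* virtual page number */
    uint32_t pfn;               /* physical frame number */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a hit and stores the frame number in *pfn.
 * On a miss the MMU walks the page tables in main memory and
 * then caches the result via tlb_insert(). */
bool tlb_lookup(uint32_t vaddr, uint32_t *pfn)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {
        *pfn = e->pfn;          /* hit: no main memory access needed */
        return true;
    }
    return false;               /* miss: fall back to the page-table walk */
}

void tlb_insert(uint32_t vpn, uint32_t pfn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn   = vpn;
    e->pfn   = pfn;
    e->valid = true;
}
```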
What about the operating system?
- New hardware capabilities have to be used efficiently
- The OS has to manage and multiplex the related resources
- Has to provide code for all new capabilities
- These often interact with other parts of the system, making the overall OS more complex
- A modern OS also has to ensure adherence to non-functional requirements (security, energy, real-time, etc.)
- OS has to do more bookkeeping and statistics
- Some of the non-functional properties contradict each other
- Finally, the OS itself has to be efficient!