This post contains my reading notes on Operating Systems: Three Easy Pieces. The book is excellent and can be read online for free; I paid a few dollars for the PDF version. "Abstractions are fundamental to all aspects of Computer Science." This sentence was an epiphany for me: the essence of computer science is abstraction.

CPU

Introduction

Essence of OS

The key role of OS is virtualizing physical resources and managing them.

OS takes physical resources, such as a CPU, memory, or disk, and virtualizes them. It handles tough and tricky issues related to concurrency. And it stores files persistently, thus making them safe over the long-term.

Design Goals

  • Abstractions
  • High Performance
  • Protection (Isolation)
  • Reliability
  • Other Goals
    • Energy-Efficiency
    • Security
    • Mobility

Virtualization

Process

A process is the abstraction of the CPU

  • The OS creates this illusion by virtualizing the CPU. By running one process, then stopping it and running another, and so forth, the OS can promote the illusion that many virtual CPUs exist when in fact there is only one physical CPU (or a few). This basic technique is known as time sharing of the CPU.

Machine State of Process

  • Memory: Address Space.
  • Registers.
  • Program Counter (PC) or Instruction Pointer (IP): Tells us which instruction of the program will execute next.
  • A stack pointer and associated frame pointer are used to manage the stack for function parameters, local variables, and return addresses.
  • File Descriptors: Include a list of the files the process currently has open.

Process States

  • Ready
  • Running
  • Blocked

Low-Level Mechanisms

Limited Direct Execution
Process (user mode) → OS (kernel mode) flow:

  • The process issues a system call with a system-call number; the trap instruction:
    • saves regs (to the kernel stack)
    • moves to kernel mode
    • jumps to the system call trap handler in the trap table
  • The OS does the work of the system call; the return-from-trap instruction:
    • restores regs (from the kernel stack)
    • moves to user mode
    • jumps to the PC after the trap

The kernel does so by setting up a trap table at boot time. When the machine boots up, it does so in privileged (kernel) mode, and thus is free to configure machine hardware as need be. One of the first things the OS thus does is to tell the hardware what code to run when certain exceptional events occur. For example, what code should run when a hard-disk interrupt takes place, when a keyboard interrupt occurs, or when a program makes a system call? The OS informs the hardware of the locations of these trap handlers.

The CPU provides two modes:

  • User mode: the mode a process runs in. Code that runs in user mode is restricted in what it can do.
  • Kernel mode: the mode the operating system (or kernel) runs in. Code that runs in kernel mode can do what it likes, including privileged operations such as issuing I/O requests and executing all types of restricted instructions.

Timer interrupt:

  • A timer device can be programmed to raise an interrupt every so many milliseconds; when the interrupt is raised, the currently running process is halted, and a pre-configured interrupt handler in the OS runs. At this point, the OS has regained control of the CPU, and thus can do what it pleases: stop the current process, and start a different one.
  • During the boot sequence, the OS must start the timer, which is of course a privileged operation.

Higher-Level Policies

Workload Assumptions
  1. Each job runs for the same amount of time.
  2. All jobs arrive at the same time.
  3. Once started, each job runs to completion.
  4. All jobs only use the CPU (i.e., they perform no I/O).
  5. The run-time of each job is known.
Scheduling Metrics
  • Turnaround time: The turnaround time of a job is defined as the time at which the job completes minus the time at which the job arrived in the system.
  • Response time: We define response time as the time from when the job arrives in a system to the first time it is scheduled.
Basic Scheduling Policy
  • First In, First Out (FIFO)
  • Consider turnaround time:
    • Shortest Job First (SJF): It runs the shortest job first, then the next shortest, and so on.
    • Shortest Time-to-Completion First (STCF): Any time a new job enters the system, the STCF scheduler determines which of the remaining jobs (including the new job) has the least time left, and schedules that one.
  • Consider response time:
    • Round-Robin (RR): Runs a job for a time slice and then switches to the next job in the run queue. Time slice must be a multiple of the timer-interrupt period.
Multi-level Feedback Queue (MLFQ)

It has multiple levels of queues, and uses feedback to determine the priority of a given job:

  • Rule 1: If Priority(A) > Priority(B), A runs (B doesn’t).
  • Rule 2: If Priority(A) = Priority(B), A & B run in round-robin fashion using the time slice (quantum length) of the given queue.
  • Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
  • Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
  • Rule 5: After some time period S, move all the jobs in the system to the topmost queue.
Proportional Share

Consider this metric: guarantee that each job obtains a certain percentage of CPU time.

  • Lottery Scheduling.
  • Stride Scheduling.
  • Linux Completely Fair Scheduler (CFS): Weighted round-robin with dynamic time slices. Uses a red-black tree to organize running processes.
Multiprocessor Scheduling
  • Issues:
    • Cache Coherence.
    • Cache Affinity.
  • Approaches:
    • Single-Queue Multiprocessor Scheduling (SQMS).
    • Multi-Queue Multiprocessor Scheduling (MQMS).

Address Spaces

Virtual memory is the abstraction of physical memory

  • The virtual memory system is responsible for providing the illusion of a large, sparse, private address space to programs.
  • The OS, with some serious hardware help, will take each of these virtual memory references, and turn them into physical addresses.

Low-Level Mechanisms

Address Translation

physical address = virtual address + base

  • memory management unit (MMU) of CPU:
    • base register.
    • bounds register: If a process generates a virtual address that is greater than the bounds, or one that is negative, the CPU will raise an exception.
  • privileged instruction: update base and bounds registers.
Segmentation

Segmentation allows the OS to place each one of those segments in a different part of physical memory, and thus avoid filling physical memory with unused virtual address space.

Three base and bounds register pairs - code, heap, stack:

Segment Register

As you can see from the following picture, the top two bits (01) tell the hardware which segment we are referring to. The bottom 12 bits are the offset into the segment:

Segment Offset

Free-Space Management

A free list contains a set of elements that describe the free space still remaining in the heap:

Free List

Splitting: The allocator finds a free chunk of memory that can satisfy the request and splits it into two. The first chunk is returned to the caller; the second chunk remains on the list.

Coalescing: If the newly-freed space sits right next to an existing free chunk, merge them into a single larger free chunk.

Tracking The Size Of Allocated Regions:

Specific Contents Of The Header

Strategies:

  • Best Fit
  • Worst Fit
  • First Fit
  • Next Fit
  • Segregated Lists
  • Buddy Allocation
Paging

Page and Page Frame:

  • Page: Chop up virtual address space into fixed-sized pieces.
  • Page Frame: View physical memory as an array of fixed-sized slots called page frames.

Page Table:

  • The operating system usually keeps a per-process data structure known as a page table.
  • The OS indexes the array by the virtual page number (VPN), and looks up the page-table entry (PTE) at that index in order to find the desired physical frame number (PFN).
  • PTE = PFN + Other Bits

Translation-Lookaside Buffer (TLB):

  • A TLB is part of the chip’s memory-management unit (MMU), and is simply a hardware cache of popular virtual-to-physical address translations; thus, a better name would be an address-translation cache.
  • TLB Entry = VPN + PFN + Other Bits

Page tables are too big and thus consume too much memory; approaches to make them smaller:

  • Linear Page Table: the simple array, which has the memory problem.
  • Hybrid of Paging and Segments: divide the address space by segmentation, then page each segment.
  • Multi-Level Page Tables.
  • Inverted Page Tables.

Beyond Physical Memory:

  • Swap Space: Reserve some space on the disk for moving pages back and forth.
  • Page-Fault Handler: If a page is not present and has been swapped to disk, the OS will need to swap the page into memory in order to service the page fault.

Complete Address Translation:

Address Translation

Higher-Level Policies

Replacement Algorithm

Deciding which page (or pages) to evict is encapsulated within the replacement policy of the OS.

Given that main memory holds some subset of all the pages in the system, it can rightly be viewed as a cache for virtual memory pages in the system. Thus, our goal in picking a replacement policy for this cache is to minimize the number of cache misses. Alternately, one can view our goal as maximizing the number of cache hits.

Average Memory Access Time (AMAT):

AMAT = T_M + (P_Miss × T_D)

where T_M is the cost of accessing memory, T_D is the cost of accessing disk, and P_Miss is the probability of a cache miss.

Replacement Policy:

  • The Optimal Replacement Policy: Evict the page that will be accessed furthest in the future (requires knowledge of future accesses).
  • FIFO.
  • Random.
  • Least-Recently-Used (LRU): Use history as our guide.
  • Approximating LRU: Clock Algorithm.
  • Considering Dirty Pages.

Techniques and Approaches from Real VMS

  • Virtual memory is divided into process space and kernel space.
  • Clustering
  • Demand zero
  • Copy on write
  • Large page
  • 2Q replacement algorithm
  • Security

Concurrency

Thread

How to use CPU

  • A single thread has a program counter (PC) that tracks where the program is fetching instructions from.
  • Each thread has its own private set of registers it uses for computation.
  • If there are two threads that are running on a single processor, when switching from running one (T1) to running the other (T2), a context switch must take place. Use thread control blocks (TCBs) to store the state of each thread of a process.

How to use memory

Multi-Threaded Address Spaces

  • Multiple threads in one process share the same address space.
  • One stack in the address space per thread; the stack is sometimes called thread-local storage.

Thread States

  • Ready
  • Running
  • Blocked
  • Sleep

See Advanced Programming in the UNIX Environment: Threads; Blog - Posix Thread

The Heart Of The Problem

Uncontrolled Scheduling

The code sequence for counter = counter + 1 in x86; a thread context switch can happen anywhere in this three-instruction sequence:

Step 1: copy the value at memory address 0x8049a1c (the address of the counter variable in memory) into register eax

mov 0x8049a1c, %eax

Step 2: add 1 to the value in register eax

add $0x1, %eax

Step 3: copy the value in register eax back to memory address 0x8049a1c (the address of the counter variable in memory)

mov %eax, 0x8049a1c

Waiting For Another

One thread must wait for another to complete some action before it continues.

Locks

Evaluating Locks

  • Mutual Exclusion: Preventing multiple threads from entering a critical section.
  • Fairness: Each thread contending for the lock gets a fair shot at acquiring it once it is free; avoid starvation.
  • Performance: The time overheads added by using the lock.

Spin Locks

Spin waiting for the lock:

void lock(lock_t *mutex) {
  while (mutex->flag == 1) // TEST the flag
    ; // spin-wait (do nothing)
  mutex->flag = 1; // now SET it!
}

Note: testing and setting the flag as two separate steps is not atomic, so two threads can both acquire the lock; hardware support is needed to combine them.

Hardware supported instructions:

  • test-and-set instruction
  • compare-and-swap instruction
  • load-linked and store-conditional instructions
  • fetch-and-add instruction

Instead of spin waiting:

  • Yield: A system call that moves the caller from the running state to the ready state, and thus promotes another thread to running.
  • Queues: park() and unpark(threadID) in Solaris can be used in tandem to build a lock that puts a caller to sleep if it tries to acquire a held lock and wakes it when the lock is free.

See UNIX Network Programming - Interprocess Communication: Synchronization; Blog - Posix Thread

Condition Variables

In particular, there are many cases where a thread wishes to check whether a condition is true before continuing its execution. For example, a parent thread might wish to check whether a child thread has completed before continuing.

The Producer/Consumer (Bounded Buffer) Problem

Imagine one or more producer threads and one or more consumer threads. Producers generate data items and place them in a buffer; consumers grab said items from the buffer and consume them in some way.

Check the condition with while, not if:

void thr_join() {
  Pthread_mutex_lock(&m);
  while (done == 0)
    Pthread_cond_wait(&c, &m);
  Pthread_mutex_unlock(&m);
}

Covering Conditions

pthread_cond_broadcast() wakes up all waiting threads. Those threads will simply wake up, re-check the condition, and then go immediately back to sleep.

See UNIX Network Programming - Interprocess Communication: Synchronization; Blog - Posix Thread

Semaphores

Dijkstra and colleagues invented the semaphore as a single primitive for all things related to synchronization; as you will see, one can use semaphores as both locks and condition variables.

A Definition

Decrement the value of semaphore s by one. Wait if value of semaphore s is negative:

int sem_wait(sem_t *s) {
  // decrement the value of semaphore s by one
  // wait if value of semaphore s is negative
}

Increment the value of semaphore s by one. If there are one or more threads waiting, wake one:

int sem_post(sem_t *s) {
  // increment the value of semaphore s by one
  // if there are one or more threads waiting, wake one
}

Use semaphores to solve following problems

  • Binary Semaphores (Locks)
  • Semaphores For Ordering
  • The Producer/Consumer (Bounded Buffer) Problem
  • Reader-Writer Locks

How To Implement Semaphores

One way is to use a lock and a condition variable.

See UNIX Network Programming - Interprocess Communication: Synchronization; Blog - Read-Write Locks

Event-based Concurrency

An Event Loop

Pseudocode for an event loop looks like this:

while (1) {
  events = getEvents();
  for (e in events)
    processEvent(e);
}

Problems caused by an event loop

  • Problem: Blocking System Calls.
  • Solution: Asynchronous I/O. Poll to check whether an I/O has completed, or have the OS deliver an interrupt (signal) to inform the program when an asynchronous I/O completes.

  • Problem: State Management.
  • Solution: When an event handler issues an asynchronous I/O, it must package up some program state for the next event handler to use when the I/O finally completes.

See Blog - iOS Basics - RunLoop and Timers; Blog - iOS Basics - GCD and Operation

Persistence

I/O Devices

OS Persistence

How the OS interacts with devices:
  • I/O Instructions.
  • Memory-Mapped I/O.
  • Polling: If a device is fast.
  • Interrupt: If a device is slow. The OS can issue a request, put the calling process to sleep, and context switch to another task. When the device is finally finished with the operation, it will raise a hardware interrupt, causing the CPU to jump into the OS at a predetermined interrupt service routine (ISR) or more simply an interrupt handler. The handler is just a piece of operating system code that will finish the request (for example, by reading data and perhaps an error code from the device) and wake the process waiting for the I/O, which can then proceed as desired.
  • Direct Memory Access (DMA).
A Canonical Device

(figure: a canonical device, showing its interface registers and internal structure)

Hard Disk Drives

I/O Time
  • First a seek,
  • Then waiting for the rotational delay,
  • And finally the transfer.
Disk Scheduling
  • What: Given a set of I/O requests, the disk scheduler examines the requests and decides which one to schedule next.
  • How: Consider time.
  • Where: OS and disk.

Redundant Arrays of Inexpensive Disks (RAID)

RAID transforms a number of independent disks into a larger, more capacious, and more reliable single entity; importantly, it does so transparently, so the hardware and software above it are relatively oblivious to the change.

RAID design along three important axes
  • Capacity
  • Reliability
    • Redundancy
  • Performance
    • Throughput
      • Sequential Read/Write
      • Random Read/Write
    • Latency

Flash-based Solid-state Storage Device (SSD)

Basic Flash Operations

SSD Pages with Blocks

  • Read (a page): Being able to access any location uniformly quickly means the device is a random access device.
  • Erase (a block): Before writing to a page within a flash, the nature of the device requires that you first erase the entire block the page lies within.
  • Program (a page): Once a block has been erased, the program command can be used to change the content of a page.
Organization of SSD

SSD Logical Diagram

  • Flash Translation Layer (FTL)
    • Direct Mapped.
    • A Log-Structured FTL: Upon a write to logical block N, the device appends the write to the next free spot in the currently-being-written-to block; we call this style of writing logging. To allow for subsequent reads of block N , the device keeps a mapping table (in its memory, and persistent, in some form, on the device); this table stores the physical address of each logical block in the system.
  • Garbage Collection.
  • Mapping Table Size.

File System

The file system is pure software:

Data Structures

What types of on-disk structures are utilized by the file system to organize its data and metadata:

  • In-memory open file table for file descriptors.
  • Inode bitmap and data bitmap: A file system must track which inodes and data blocks are free, and which are not. Free-space management uses these two bitmaps.
  • Inode (short for index node; stores information about a file as metadata):
    • For file:
      • Direct Pointers.
      • The Multi-Level Index: Indirect Pointers.
    • For directory: Just a specific type of file that store name to inode-number mappings.
  • File block.
Access Methods
  • Reading and writing files can be expensive, incurring many I/Os to the (slow) disk.
  • Use read cache and write buffer to increase performance.
  • To bypass the read cache and write buffer, use fsync() or the raw disk interface.

See Advanced Programming in the UNIX Environment: I/O

Locality and The Fast File System

Fast File System (FFS)
  • Organize the drive into block groups, each of which is just a consecutive portion of the disk’s address space.
Locality

Keep related stuff together in one group:

  • First, it makes sure (in the general case) to allocate the data blocks of a file in the same group as its inode, thus preventing long seeks between inode and data.
  • Second, it places all files that are in the same directory in the cylinder group of the directory they are in.

Crash Consistency

File System Checker (FSCK)

Scan the entire disk to find inconsistencies and make sure the file system metadata is internally consistent.

Journaling

By writing the note to disk, you are guaranteeing that if a crash takes place during the update (overwrite) of the structures you are updating, you can go back and look at the note you made and try again:

  • Data Journaling.
  • Metadata Journaling.

Log-structured File System (LFS)

  • LFS can gather all updates into an in-memory segment and then write them out together sequentially.
  • LFS always writes to an unused portion of the disk, and then later reclaims that old space through cleaning.

Data Integrity and Protection

Techniques used to ensure that the data you put into your storage system is the same when the storage system returns it to you.

  • Latent-Sector Errors (LSEs): One or more blocks become inaccessible.
    • Solution: Redundancy mechanisms.
  • Block Corruption: A block holds the wrong contents.
    • Solution: Checksums to detect corruption.
      • Different functions vary in strength (i.e., how good they are at protecting data integrity) and speed (i.e., how quickly they can be computed):
        • XOR.
        • Fletcher checksum.
        • Cyclic Redundancy Check (CRC).
  • Misdirected Writes.
  • Lost Writes.

Distributed Systems

Distributed File System Architecture

  • The key to any distributed system: how to deal with failure.
  • Communication forms the heart of any distributed system.
  • Cache Consistency problem:
    • Update Visibility and Cache Staleness.
    • Read Cache and Write Buffer.

Remote Procedure Call (RPC)

  • Stub Generator (Protocol Compiler): Takes an interface description and generates client and server stub code.
  • Run-Time Library:
    • Built upon UDP.
    • Timeout/Retry and Acknowledgment.

See UNIX Network Programming - Interprocess Communication: Remote Procedure Calls

Sun’s Network File System (NFS)

  • The main goal: simple and fast server crash recovery.
  • A block-based protocol.
  • A stateless approach: each client operation contains all the information needed to complete the request.

The Andrew File System (AFS)

  • The main goal: scale.
  • A file-based protocol.
  • Reaps the benefits of the client-side OS memory caching infrastructure.

Other Topics of OS

Essence of Computer Science

Abstractions are fundamental to all aspects of Computer Science.