NUMA
Andrea Arcangeli
andrea@suse.de
SuSE Kernel Developer
http://www.suse.com
Copyright (C) 2001 Andrea Arcangeli SuSE
UKUUG Manchester, 29 June - 1 July 2001
(page 1)
Wildfire on GS alpha-servers
`wildfire' is the name of the alpha platform used in the alpha-server GS series.
Feature                   GS80    GS160   GS320
QBBs                      2       4       8
CPUs                      8       16      32
Partitions Supported      2       4       8
Memory (GB)               64      128     256
Bandwidth (GB/sec)        12.8    25.6    51.2
PCI boxes/slots (64bit)   4/56    8/112   16/224
Bandwidth (GB/sec)        3.2     6.4     12.8
Most importantly, the wildfire is a NUMA architecture: each partition is a single node and the whole machine is the interconnection of those nodes through high-speed switches.
(page 2)
Wildfire platform
Two QBBs communicate directly through their Global Ports.
When connecting more than two QBBs, a high-speed, low-latency Hierarchical Switch is used.
This switch provides many critical functions, including maintaining a cache coherency directory and enforcing hardware partitions.
The bandwidth between two QBBs is 1.6 GB/sec in each direction.
(page 3)
Wildfire design
The two-level switch hierarchy is responsible for the linear scalability of per-CPU bandwidth, as compared to standard bus-based systems where the more CPUs that are added, the less memory bandwidth each has available to it.
This switch hierarchy is then matched with a highly scalable memory interleave strategy, aggressive memory resource scheduling and aggressive data link bandwidth management to provide a huge memory system bandwidth with a capacity to handle hundreds of outstanding references.
The impact of remote memory accesses is lower than in many other ccNUMA products: just under a factor of 3:1 over local memory access (330 ns local access latency vs. 960 ns remote latency).
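As a rough illustration of what these two latencies mean for a workload, the sketch below computes the average access latency for a given fraction of remote references. It is only a back-of-the-envelope model: it ignores caches, contention and the interconnect topology, and the function names are made up for this example.

```c
#include <assert.h>

/* Latencies quoted on the slide, in nanoseconds. */
#define LOCAL_NS  330.0
#define REMOTE_NS 960.0

/* Remote/local latency ratio (about 2.9, i.e. "just under 3:1"). */
double remote_penalty(void)
{
    return REMOTE_NS / LOCAL_NS;
}

/*
 * Average latency when `remote_fraction` (0.0 .. 1.0) of the memory
 * references go to a remote node. Hypothetical helper, linear model only.
 */
double avg_latency_ns(double remote_fraction)
{
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS;
}
```

With half of the references going remote, this model gives 645 ns on average, which is why keeping allocations on the local node matters.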
(page 4)
history of linux on wildfire
linux successfully booted and ran for the first time on a wildfire machine in March 2000, one week after the start of the linux-wildfire port.
The early wildfire port was missing many features: it wasn't able to use all the ram and PCI buses in the machine.
linux ran for the first time on a GS320 production system using 256G of ram and 32 CPUs at the end of 2000.
This basic wildfire support, which allows linux to use all ram/cpus/pci-buses in the machine, is now included in the mainstream 2.4 kernel. 2.4.5 includes the very latest wildfire numa support.
(page 5)
wildfire NUMA memory management
The wildfire memory management NUMA support was developed in Jan 2001 and it has been integrated in the 2.4.5 kernel.
The current mm numa support consists of:
- all userspace and kernel page allocations first try to get memory from the local node to exploit memory locality
- if there's no local memory free, we try to allocate from another node before starting the memory balancing
- we don't waste tons of ram by allocating "page structures" for inter-node memory holes in the physical address space
- we can now allocate per-node kernel data structures (like spinlocks) to exploit memory locality and to avoid false sharing of cachelines across remote nodes
This gives a visible improvement in the speccpu benchmarks, and the discontigmem part was necessary to support all the memory configurations.
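The local-first allocation order described in the list above can be sketched with a toy model. This is not the kernel allocator: the per-node free-page counters and the function name are assumptions made for this example. The round-robin order used for the remote nodes is an arbitrary choice as well.

```c
#include <assert.h>

/*
 * Toy model of local-first page allocation.
 * free_pages[n] is the number of free pages on node n (hypothetical
 * bookkeeping, the real kernel uses per-node zone structures).
 *
 * Returns the node the page was taken from, or -1 if no node has free
 * memory, in which case the caller would start memory balancing.
 */
int alloc_page_from(unsigned long *free_pages, int nr_nodes, int local_node)
{
    assert(nr_nodes > 0 && local_node >= 0 && local_node < nr_nodes);

    /* i == 0 is the local node, then the remote nodes in turn. */
    for (int i = 0; i < nr_nodes; i++) {
        int node = (local_node + i) % nr_nodes;

        if (free_pages[node] > 0) {
            free_pages[node]--;
            return node;
        }
    }
    return -1;
}
```

The point of the ordering is that a remote page (slower, but available at once) is preferred over reclaiming local memory.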
(page 6)
NUMA scheduler
To better exploit the NUMA mm heuristic the scheduler needs to be aware of the penalty of migrating a task to a remote node.
However, the NUMA scheduler should continue to keep all the CPUs busy, as the current linux SMP scheduler already does, in order to fully utilize all the available CPU computing power.
After a task is migrated to a remote node, it is worth keeping it in the remote node so that it can stay on the same CPU and make better use of the cache (future memory allocations will then happen in the remote node too).
(page 7)
NUMA scheduler implementation
There is a per-node run-queue (the spin-lock is still global to simplify the implementation, but it can be made per-node or per-cpu later, using spin_trylock if the task migrates to a remote node).
When a wakeup of a not-running task happens, we first try to reschedule the task on an idle CPU in the current node. If there is none, we try to find an idle cpu in the remote nodes, and we migrate the task there if there is at least one.
If no CPU is idle in the whole system, then we see if it's worthwhile to reschedule the woken-up task on one of the cpus of the current node with the usual linux goodness() and reschedule_idle() heuristic.
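The wakeup policy above can be sketched as a small selection function. This is a toy model, not the scheduler code: the `cpu_idle` array, the fixed number of cpus per node and the function name are assumptions made for this example.

```c
#include <assert.h>

/*
 * Pick a CPU for a task woken up on `this_cpu`.
 * cpu_idle[c] is non-zero if CPU c is idle; CPUs are assumed to be
 * numbered so that each node owns `cpus_per_node` consecutive CPUs.
 *
 * Returns an idle CPU of the local node if there is one, otherwise an
 * idle CPU of a remote node, otherwise -1. On -1 the caller would fall
 * back to the goodness()-based preemption check inside the local node.
 */
int pick_idle_cpu(const int *cpu_idle, int nr_cpus, int cpus_per_node,
                  int this_cpu)
{
    assert(cpus_per_node > 0 && this_cpu >= 0 && this_cpu < nr_cpus);

    int first = (this_cpu / cpus_per_node) * cpus_per_node;
    int last = first + cpus_per_node;        /* exclusive bound */
    if (last > nr_cpus)
        last = nr_cpus;

    /* 1. idle CPU in the current node: no migration penalty */
    for (int cpu = first; cpu < last; cpu++)
        if (cpu_idle[cpu])
            return cpu;

    /* 2. idle CPU in a remote node: migrate rather than leave it idle */
    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        if (cpu >= first && cpu < last)
            continue;
        if (cpu_idle[cpu])
            return cpu;
    }

    /* 3. nothing idle anywhere: never migrate, stay in the local node */
    return -1;
}
```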
(page 8)
NUMA scheduler
By never migrating a task to a remote node when no CPU is idle, we try to increase performance by exploiting memory locality.
This can lead to some unfairness between tasks running in different nodes (one node could be more loaded than the others, so that the tasks running in that node receive less CPU time than the ones running in the other nodes).
It will be more like having separate machines.
If smp_num_cpus tasks are running, they are guaranteed to all run on different cpus and no CPU in the system will be idle.
(page 9)
NUMA scheduler numbers
The numa scheduler has been benchmarked using a modified AIM VII workload.
Without the numa scheduler:
Tasks   jobs/min   jti   j/m/task   real      cpu
1       277.55     100   277.5488   21.83     11.47
500     22754.92   98    45.5098    133.16    959.78
1000    23218.03   98    23.2180    261.00    1957.36
1500    22702.30   98    15.1349    400.40    3027.26
2000    22143.90   98    11.0720    547.33    4060.09
2500    21749.05   98    8.6996     696.58    5165.97
3000    21255.60   98    7.0852     855.30    6414.07
3500    20292.69   98    5.7979     1045.20   8192.61
4000    12903.73   97    3.2259     1878.53   11864.56
With the numa scheduler:
Tasks   jobs/min   jti   j/m/task   real      cpu
1       274.11     100   274.1089   22.11     11.73
500     23020.82   61    46.0416    131.62    949.12
1000    23633.84   58    23.6338    256.41    1923.77
1500    23747.01   58    15.8313    382.79    2908.32
2000    24099.98   75    12.0500    502.90    3846.92
2500    23962.99   95    9.5852     632.23    4883.96
3000    23863.53   64    7.9545     761.83    6055.48
3500    23252.12   57    6.6435     912.17    7667.49
4000    15715.41   76    3.9289     1542.43   13216.40
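To read the two tables side by side, the relative change in jobs/min can be computed per row. The snippet below is just arithmetic on the numbers shown above; the array and function names are made up for this example.

```c
#include <assert.h>

#define NR_ROWS 9

/* jobs/min columns copied from the two tables above, same row order
 * (1, 500, 1000, ..., 4000 tasks). */
static const double jobs_without[NR_ROWS] = {
    277.55, 22754.92, 23218.03, 22702.30, 22143.90,
    21749.05, 21255.60, 20292.69, 12903.73
};
static const double jobs_with[NR_ROWS] = {
    274.11, 23020.82, 23633.84, 23747.01, 24099.98,
    23962.99, 23863.53, 23252.12, 15715.41
};

/* Throughput change of the numa scheduler for a table row, in percent. */
double gain_percent(int row)
{
    assert(row >= 0 && row < NR_ROWS);
    return (jobs_with[row] / jobs_without[row] - 1.0) * 100.0;
}
```

The gain grows with the load: roughly 15% at 3500 tasks and roughly 22% at 4000 tasks, against a loss of about 1% for a single task.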
(page 10)
NUMA kernel hot spot
The kernel code is loaded as a single copy near the start of the physical ram.
For a NUMA machine like the wildfire this means all nodes will fetch the kernel code from the first node.
This doesn't scale well: without a local copy of the kernel on each node, we don't exploit the total bandwidth of the wildfire NUMA architecture.
(page 11)
NUMA kernel replication
The simplest way to generate a local copy of the kernel is to run the kernel mapped by pagetables (instead of in the kseg), and to have a per-node PGD that points to a physical local copy of the kernel code.
This has the downside of requiring the CPU to walk pagetables and to use tlbs for kernel code too.
To reduce the overhead of running the kernel paged, we can use the granularity hint for the tlb.
(page 12)
spin-lock starvation
Another issue that we face with the NUMA wildfire architecture is that the cpus in remote nodes may starve in the slow path of the spin-lock if the cpus in the local node keep spinning on it and acquiring it all the time.
Excessive hammering in a loop by all cpus in all nodes on the same shared memory location may also cause a slowdown.
So for those numa machines it is possible to use a specialized spin-lock implementation that guarantees fairness and that avoids excessive thrashing of cachelines across the cpus.
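One well-known way to get fairness is a ticket lock, sketched below with C11 atomics. The slides do not say which design was meant for wildfire, so this is only an example of the general idea, and the names are made up. Waiters are served strictly in arrival order, so a cpu in a remote node cannot be overtaken forever by cpus close to the lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal ticket lock: FIFO order guarantees fairness between cpus. */
struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently allowed to hold the lock */
};

void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket: a single atomic write per acquisition attempt. */
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    /* Wait for our turn with plain loads only. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;
}

void ticket_lock_release(struct ticket_lock *l)
{
    /* Only the lock holder writes `owner`, so a load + store is enough. */
    unsigned int cur = atomic_load_explicit(&l->owner,
                                            memory_order_relaxed);
    atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

This sketch addresses fairness only. All waiters still poll the same cacheline, so reducing cross-node cacheline traffic further would need something like a queue-based lock where each waiter spins on its own location.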
(page 13)
per-cpu/per-node data structures
To optimize the kernel performance on those machines we'll also need to allocate the data structures belonging to the local cpus in the memory of the local node.
The SMP scalability of the kernel is of course extremely important for the NUMA platforms as well, in order to fully utilize the CPU power.
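A minimal sketch of the per-cpu layout idea follows, assuming a 64-byte cacheline and made-up names. Portable C cannot choose which node backs a given address, so the sketch only shows the padding that keeps each cpu's data on its own cacheline. In the kernel, each slot would additionally be allocated from the memory of the node that owns the cpu.

```c
#include <assert.h>
#include <stdalign.h>

#define CACHELINE_BYTES 64   /* assumed cacheline size */

/* One counter per cpu, each alone on its cacheline so that updates from
 * different cpus never bounce the same line between nodes. */
struct percpu_counter {
    alignas(CACHELINE_BYTES) unsigned long count;
};

/* Fast path: a cpu only ever touches its own slot. */
void counter_inc(struct percpu_counter *counters, int cpu)
{
    counters[cpu].count++;
}

/* Slow path: reading the total has to visit every cpu's slot. */
unsigned long counter_sum(const struct percpu_counter *counters, int nr_cpus)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < nr_cpus; cpu++)
        sum += counters[cpu].count;
    return sum;
}
```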