NUMA





Andrea Arcangeli
andrea@suse.de
SuSE Kernel Developer
http://www.suse.com







Copyright (C) 2001 Andrea Arcangeli SuSE

UKUUG Manchester, 29 June - 1 July 2001


(page 1)


Wildfire on GS alpha-servers

`wildfire' is the name of the alpha platform used in the alpha-server GS series.
Feature                     GS80    GS160   GS320
QBBs                        2       4       8
CPUs                        8       16      32
Partitions supported        2       4       8
Memory (GB)                 64      128     256
Memory bandwidth (GB/sec)   12.8    25.6    51.2
PCI boxes/slots (64bit)     4/56    8/112   16/224
I/O bandwidth (GB/sec)      3.2     6.4     12.8
And most important, the wildfire is a NUMA architecture: each QBB is a single node, and the whole machine is the interconnect of those nodes through high speed switches.


(page 2)


Wildfire platform

Two QBBs communicate directly through their Global Ports.

When connecting more than two QBBs a high-speed, low-latency Hierarchical Switch is used.

This switch provides many critical functions, including maintaining a cache coherency directory and enforcing hardware partitions.

The bandwidth between two QBBs is 1.6 GB/sec in each direction.


(page 3)


Wildfire design

The two-level switch hierarchy is responsible for the linear scalability of per-CPU bandwidth, as compared to standard bus-based systems where the more CPUs that are added, the less memory bandwidth each has available to it.
This switch hierarchy is then matched with a highly scalable memory interleave strategy, aggressive memory resource scheduling and aggressive data link bandwidth management to provide a huge memory system bandwidth with a capacity to handle hundreds of outstanding references.
The impact of remote memory accesses is lower than in many other ccNUMA products: just under a factor of 3:1 over local memory access (330 ns local access latency vs. 960 ns remote latency).


(page 4)


history of linux on wildfire

linux successfully booted and ran for the first time on a wildfire machine in March 2000, one week after the start of the linux-wildfire port.

The early wildfire port was missing many features: it wasn't able to use all the ram and PCI buses in the machine.

linux ran for the first time on a GS320 production system, using 256G of ram and 32 CPUs, at the end of 2000.

This basic wildfire support, which allows linux to use all the ram/cpus/pci-buses in the machine, is now included in the mainstream 2.4 kernel; 2.4.5 includes the very latest wildfire NUMA support.


(page 5)


wildfire NUMA memory management

The wildfire memory management NUMA support was developed in Jan 2001 and it has been integrated in the 2.4.5 kernel.
The current mm NUMA support consists of:

- discontigmem support, so that the kernel can handle the memory of every node
- a heuristic that allocates memory from the node local to the allocating CPU whenever possible, falling back to the remote nodes only when the local node runs out of memory

This is a visible improvement in the speccpu benches, and the discontigmem part was necessary to support all the memory configurations.
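
As an illustration of the allocation heuristic, here is a minimal userspace sketch of the policy (not the 2.4 kernel code): the node layout, the free-page counters and the helper name are made up for this example, only the local-node-first, then-fall-back ordering reflects the real support.

#include <stdio.h>

#define NR_NODES      8          /* a GS320 has 8 QBBs/nodes */
#define CPUS_PER_NODE 4          /* 4 CPUs per QBB */

/* toy model of a node: just a count of free pages */
static long free_pages[NR_NODES];

/*
 * Allocation policy sketched above: prefer the node local to the
 * allocating CPU and fall back to the remote nodes only when the
 * local node has no free memory left.  Returns the node the page
 * was taken from, or -1 when the whole machine is out of memory.
 */
static int alloc_page_from(int cpu)
{
    int local = cpu / CPUS_PER_NODE;   /* which node owns this CPU */
    int node;

    if (free_pages[local] > 0) {
        free_pages[local]--;
        return local;                  /* fast path: local memory */
    }
    for (node = 0; node < NR_NODES; node++) {
        if (node != local && free_pages[node] > 0) {
            free_pages[node]--;
            return node;               /* slow path: remote memory */
        }
    }
    return -1;
}

int main(void)
{
    int node;

    for (node = 0; node < NR_NODES; node++)
        free_pages[node] = 1;

    /* CPU 5 lives in node 1: the first page comes from local memory ... */
    printf("first  page for cpu 5 from node %d\n", alloc_page_from(5));
    /* ... the second one has to come from a remote node */
    printf("second page for cpu 5 from node %d\n", alloc_page_from(5));
    return 0;
}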


(page 6)


NUMA scheduler

To better exploit the NUMA mm heuristic the scheduler needs to be aware of the penalty of migrating a task to a remote node.

However, the NUMA scheduler should continue to keep all the CPUs busy, as the current linux SMP scheduler does, in order to fully utilize all the available CPU computing power.

After a task is migrated to a remote node, it is worth keeping it in the remote node, to let it stay on the same CPU and to better optimize the cache utilization (future memory allocations will then happen in the remote node too).
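
As a purely hypothetical illustration of what being "aware of the penalty" can mean, the fragment below adds a node-affinity term to a goodness-style score; the struct fields, the helper and the penalty value are invented for this sketch and are not the actual scheduler code.

#define REMOTE_NODE_PENALTY 10   /* arbitrary weight for this sketch */

struct task {
    int home_node;   /* node where the task last ran and allocated memory */
    int counter;     /* remaining timeslice, as in the 2.4 goodness() */
};

/*
 * Score a candidate CPU for the task: start from the plain timeslice
 * based weight and subtract a penalty when choosing this CPU would
 * move the task away from the node that holds its memory.
 */
static int numa_goodness(const struct task *p, int cpu_node)
{
    int weight = p->counter;

    if (cpu_node != p->home_node)
        weight -= REMOTE_NODE_PENALTY;
    return weight;
}

A higher score wins, so with otherwise equal candidates the CPUs of the task's home node are always preferred.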


(page 7)


NUMA scheduler implementation

There is a per-node run-queue (the spin-lock is still global to simplify the implementation, but it can later be made per-node or per-cpu, using spin_trylock when a task migrates to a remote node).

When a wakeup of a not-running task happens, we first try to reschedule the task on an idle CPU in the current node; if there is none, we try to find an idle CPU in the remote nodes and we migrate the task there if there is at least one.

If no CPU is idle in the whole system, we then check whether it is worthwhile to reschedule the woken-up task on one of the CPUs of the current node, with the usual linux goodness() and reschedule_idle() heuristic.
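
Below is a schematic sketch of that wakeup path in C. The helpers (cpu_is_idle(), preemption_goodness(), send_task_to_cpu()) and the linear cpu numbering are invented for the illustration; only the ordering of the three steps reflects the implementation described above.

#define NR_NODES       8
#define CPUS_PER_NODE  4

struct task;                                /* opaque in this sketch */

/* invented helpers standing in for the real scheduler internals */
extern int  cpu_is_idle(int cpu);
extern int  preemption_goodness(struct task *p, int cpu);
extern void send_task_to_cpu(struct task *p, int cpu);

/*
 * Decision order: an idle CPU in the local node first, then an idle
 * CPU in any remote node, and only if the whole machine is busy
 * consider preempting one of the CPUs of the local node.
 */
static void numa_reschedule_idle(struct task *p, int this_node)
{
    int node, i, cpu, best_cpu = -1, best = 0;

    for (i = 0; i < CPUS_PER_NODE; i++) {          /* 1: local idle CPU */
        cpu = this_node * CPUS_PER_NODE + i;
        if (cpu_is_idle(cpu)) {
            send_task_to_cpu(p, cpu);
            return;
        }
    }

    for (node = 0; node < NR_NODES; node++) {      /* 2: remote idle CPU */
        if (node == this_node)
            continue;
        for (i = 0; i < CPUS_PER_NODE; i++) {
            cpu = node * CPUS_PER_NODE + i;
            if (cpu_is_idle(cpu)) {
                send_task_to_cpu(p, cpu);          /* migrate to remote node */
                return;
            }
        }
    }

    for (i = 0; i < CPUS_PER_NODE; i++) {          /* 3: preempt locally? */
        int g;

        cpu = this_node * CPUS_PER_NODE + i;
        g = preemption_goodness(p, cpu);           /* > 0 when worthwhile */
        if (g > best) {
            best = g;
            best_cpu = cpu;
        }
    }
    if (best_cpu >= 0)
        send_task_to_cpu(p, best_cpu);
}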


(page 8)


NUMA scheduler

By never migrating a task to a remote node when no CPU is idle, we try to increase performance by exploiting memory locality.

This can lead to some unfairness between tasks running in different nodes (one node could be more loaded, so the tasks running in that node receive less CPU time than the ones running in the other nodes).

It will be more like having separate machines.

If smp_num_cpus tasks are running, they are guaranteed to all run on different CPUs and no CPU in the system will be idle.


(page 9)


NUMA scheduler numbers

The numa scheduler has been benchmarked using a modified AIM VII workload.
Without the numa scheduler:
Tasks    jobs/min  jti  j/m/task     real      cpu
    1      277.55  100  277.5488    21.83    11.47
  500    22754.92   98   45.5098   133.16   959.78
 1000    23218.03   98   23.2180   261.00  1957.36
 1500    22702.30   98   15.1349   400.40  3027.26
 2000    22143.90   98   11.0720   547.33  4060.09
 2500    21749.05   98    8.6996   696.58  5165.97
 3000    21255.60   98    7.0852   855.30  6414.07
 3500    20292.69   98    5.7979  1045.20  8192.61
 4000    12903.73   97    3.2259  1878.53 11864.56

With the numa scheduler:
Tasks    jobs/min  jti  j/m/task     real      cpu
    1      274.11  100  274.1089    22.11    11.73
  500    23020.82   61   46.0416   131.62   949.12
 1000    23633.84   58   23.6338   256.41  1923.77
 1500    23747.01   58   15.8313   382.79  2908.32
 2000    24099.98   75   12.0500   502.90  3846.92
 2500    23962.99   95    9.5852   632.23  4883.96
 3000    23863.53   64    7.9545   761.83  6055.48
 3500    23252.12   57    6.6435   912.17  7667.49
 4000    15715.41   76    3.9289  1542.43 13216.40


(page 10)


NUMA kernel hot spot

The kernel code is loaded in a single copy near the start of the physical ram.

For a NUMA machine like the wildfire this means all nodes will fetch the kernel code from the first node.

This doesn't scale well: by not having a local copy of the kernel we don't exploit the total bandwidth of the wildfire NUMA architecture.


(page 11)


NUMA kernel replication

The simplest way to generate a local copy of the kernel is to run the kernel mapped by pagetables (instead of in the kseg), and to have a per-node PGD that points to a physically local copy of the kernel code.

This has the downside of requiring the CPU to walk the pagetables and to spend TLB entries on the kernel code too.

To reduce the overhead of running the kernel paged, we can use the granularity hint for the TLB: on alpha the granularity hint allows a single TLB entry to map a large, aligned group of pages, so the kernel text needs only a few entries.
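
To make the replication idea concrete, here is a small userspace analogue written with today's libnuma API (link with -lnuma); it is not kernel code, and the per-node pointer below only plays the role that the kernel-text entries of a per-node PGD would play, but it does place one replica of a read-only region in each node's local memory.

/* cc -o replicate replicate.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for the kernel text: a read-only blob to be replicated */
static const char fake_text[] = "kernel text ...";

int main(void)
{
    int node, max_node;
    void **local_text;    /* per-node pointer: the toy "per-node PGD" */

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    max_node = numa_max_node();
    local_text = calloc(max_node + 1, sizeof(*local_text));
    if (!local_text)
        return 1;

    for (node = 0; node <= max_node; node++) {
        /* place the replica in memory belonging to this node */
        void *copy = numa_alloc_onnode(sizeof(fake_text), node);
        if (!copy) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            return 1;
        }
        memcpy(copy, fake_text, sizeof(fake_text));
        local_text[node] = copy;
        printf("node %d: local replica at %p\n", node, local_text[node]);
    }
    /* in the real thing the CPUs of node N would resolve kernel-text
     * addresses through their own node's PGD and so fetch instructions
     * from local_text[N] instead of from node 0's single copy */
    return 0;
}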


(page 12)


spin-lock starvation

Another issue that we face with the NUMA wildfire architecture is that the cpus in the remote nodes may starve in the slow path of the spin-lock if the cpus in the local node keep spinning on it and reacquiring it all the time.

Having all the cpus in all the nodes banging in a loop on the same shared ram may also cause a slowdown.

So for those NUMA machines it is possible to use a specialized spin-lock implementation that guarantees fairness and avoids excessive thrashing of cachelines across the cpus.
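
One classic lock with both properties is the MCS queue lock: waiters line up in FIFO order (so nobody starves) and each waiter spins on a flag in its own queue node, which can live in node-local memory, instead of everybody hammering the same shared cacheline. The sketch below uses C11 atomics and is only an illustration of the general technique, not the spin-lock that was actually used on the wildfire.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;              /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail; /* last waiter in the queue, or NULL */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *self)
{
    struct mcs_node *prev;

    atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&self->locked, true, memory_order_relaxed);

    /* append ourselves at the tail of the waiters queue */
    prev = atomic_exchange_explicit(&lock->tail, self, memory_order_acq_rel);
    if (prev == NULL)
        return;                      /* the lock was free: we own it now */

    /* link behind the previous waiter and spin on our own flag only */
    atomic_store_explicit(&prev->next, self, memory_order_release);
    while (atomic_load_explicit(&self->locked, memory_order_acquire))
        ;                            /* local spinning, no cacheline ping-pong */
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *self)
{
    struct mcs_node *next =
        atomic_load_explicit(&self->next, memory_order_acquire);

    if (next == NULL) {
        /* nobody visibly queued: try to swing the tail back to NULL */
        struct mcs_node *expected = self;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a new waiter is between its exchange and its link: wait for it */
        while ((next = atomic_load_explicit(&self->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* hand the lock to the next waiter in strict FIFO order */
    atomic_store_explicit(&next->locked, false, memory_order_release);
}

Each CPU passes its own mcs_node to acquire/release, so the memory it spins on can be allocated from its local node, which is exactly what the remote CPUs need in order not to starve.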


(page 13)


per-cpu/per-node data structures

To optimize the kernel performance on those machines we'll also need to allocate the data structures relative to the local cpus in the memory of the local node.

The SMP scalability of the kernel is of course extremely important for the NUMA platforms too, in order to fully utilize the CPU power.
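
As a userspace illustration of the same principle, using today's libnuma API rather than 2.4 kernel code: the per-CPU structure below is hypothetical, but each instance is allocated from the node that the CPU belongs to, so the data a CPU touches most often is always local.

/* cc -o percpu percpu.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* hypothetical per-CPU state, e.g. counters the CPU updates constantly */
struct percpu_data {
    long runqueue_len;
    long local_interrupts;
};

int main(void)
{
    int cpu, ncpus;
    struct percpu_data **percpu;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    ncpus = numa_num_configured_cpus();
    percpu = calloc(ncpus, sizeof(*percpu));
    if (!percpu)
        return 1;

    for (cpu = 0; cpu < ncpus; cpu++) {
        int node = numa_node_of_cpu(cpu);  /* the node owning this CPU */

        /* keep this CPU's private data in its own node's memory */
        percpu[cpu] = numa_alloc_onnode(sizeof(**percpu), node);
        if (percpu[cpu])
            printf("cpu %2d -> per-cpu data on node %d\n", cpu, node);
    }
    return 0;
}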