PCI: past, present and future


Introduction

PCI is probably the most successful I/O architecture in history. Introduced in 1992, it has enjoyed great success over the last 12 years. Incremental changes have brought it from a theoretical bandwidth of 133MB/s to over 1GB/s. It has replaced many proprietary I/O architectures and is supported by almost every system architecture. It has spawned a number of related technologies including PCI-X, AGP and CompactPCI, and several form factors including Mini PCI, Low Profile PCI and CardBus. It has been flexible enough to meet the needs of everything from low-margin sound cards to high-end Ultra320 SCSI cards.

Standards

The PCI Special Interest Group (PCI-SIG) is responsible for publishing the standards that define PCI. It was formed in 1992 and has more than 800 member companies.

A family of standards defines the different parts of what is now referred to as conventional PCI. The PCI Local Bus specification has been through six revisions since its first release in 1992. It is augmented by the PCI-to-PCI Bridge specification, the PCI Power Management specification, the Mini PCI specification and the PCI-X Addendum.

PCI Express is likewise defined by a family of standards: the Base specification, the Card specification and the Mini PCI Express specification.

Other groups also publish specifications that build upon the specifications released by the PCI-SIG. The PCMCIA released the CardBus and ExpressCard standards for laptop expansion ports. The PICMG released the CompactPCI and AdvancedTCA standards for industrial and telecom computers.

AGP is a variant of the PCI protocol. It includes some additional features to enhance graphics performance and removes or restricts some features from PCI that aren't needed for graphics cards.

Conventional PCI

PCI is a bus-based system. There can be up to 32 devices on a bus (though electrical considerations limit this to around 5 slots). The bus is either 32 or 64 bits wide. You can use 32-bit devices on a 64-bit bus and vice versa.

Signalling on the PCI bus can be done at 3.3 volts or 5 volts. PCI slots are keyed to prevent you plugging a card into a slot that uses the wrong signalling voltage. If you look at a PCI slot, there's a short run of pins and a long run of pins (on a 64-bit bus, there's a third run of pins for the additional signals). If the long run of pins is nearest to the case of the machine, it's a 5 volt slot. If the short run is nearest, it's a 3.3 volt slot. Many PCI cards have two notches in the connector so they can fit into either kind of slot. These are known as universal cards.

The pins on the PCI bus can be divided into six functional groups, plus power. The Clock and Reset pins are referred to as the \emph{System Pins}. The \emph{Address and Data Pins} include the 32 (or 64) pins that are used for both addresses and data, 4 (or 8) Command and Byte Enable pins and a Parity pin. The seven \emph{Interface Control Pins} are used to indicate what stage of the protocol is active. The Request and Grant \emph{Arbitration Pins} are used to negotiate which device gets access to the bus next. The Parity and System \emph{Error Reporting Pins} indicate that a device has found a problem on the bus. The four \emph{Interrupt Pins} can be used to signal interrupts.

Conventional PCI describes operations in terms of transactions. A transaction starts when a device uses the Request pin to request time on the bus and the arbiter uses the Grant pin to tell it the bus is clear. The first phase that the initiator drives on the bus is an address phase. Then either the initiator device or the target device will drive a number of data phases.

If a device isn't ready to accept a phase, it inserts delay cycles (wait states) on the bus. PCI-X devices have the option to terminate the transaction and retry it later, allowing other devices an opportunity to use the bus. PCI-X also tweaks the bus protocol a little by inserting an additional attribute phase between the address and data phases. By providing additional information in the attribute phase, such as the length of the transaction, devices can make more efficient use of the bus.

PCI Express

PCI Express is completely different at the hardware level. Instead of being a bus, it's a collection of point-to-point serial links. It still uses transactions, but rather than specifying them in terms of which bus pins are in what state, the transaction is encoded into a packet. That packet is then routed through a fabric to the destination device, which will send another packet in response.

A link is composed of between 1 and 32 lanes. Each lane has 4 pins, one pair of pins to receive data and one pair to transmit. There are no interrupt or control pins as the equivalent functionality is provided by special purpose messages or header fields in the packet.

In order to make PCI Express a suitable replacement for AGP as well as PCI, it had to support isochronous operation. Isochronous communication guarantees a certain minimum amount of bandwidth to the device. PCI Express does this by specifying one of 8 Traffic Classes in each packet. A device can support up to 8 Virtual Channels, reserving a certain amount of bandwidth for higher-priority traffic classes.

Bandwidth

The primary motivation for moving to PCI Express is higher bandwidth. The original 32-bit, 33MHz busses had a raw bandwidth of 133MB/s. Due to various overheads imposed by the PCI bus protocol, the effective data bandwidth can be anywhere from 60 to 100MB/s, depending on the types of transfers being used.

Several complementary approaches were taken to increase the effective data bandwidth available. The first approach doubles the bus width to 64 bits, increasing the amount of data per cycle. The second approach (introduced in the PCI 2.1 specification) doubles the maximum bus speed to 66MHz. A more subtle approach is to provide multiple independent PCI busses. Since devices on different busses don't interfere with each other, not only does each card have more bandwidth available to it directly, but the cards interrupt each other's transfers less often. Several manufacturers took this approach to the extreme, with each slot being on its own PCI bus. Combining all three approaches allows the bus data bandwidth to approach 400MB/s.
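The effect of the first two approaches on raw bandwidth is simple arithmetic: bus width in bytes multiplied by the clock rate. The following is a minimal sketch of that calculation; the function name and the table of configurations are illustrative, and the clocks are taken as the nominal 33.33MHz and 66.66MHz.

```python
def raw_bandwidth_mb_s(width_bits, clock_mhz):
    """Peak transfer rate in MB/s, assuming one full-width transfer per clock."""
    return width_bits / 8 * clock_mhz


# Nominal PCI clocks are 33.33 MHz and 66.66 MHz (written here as 100/3 and 200/3).
CONFIGS = [
    ("32-bit, 33MHz", 32, 100 / 3),
    ("64-bit, 33MHz", 64, 100 / 3),
    ("32-bit, 66MHz", 32, 200 / 3),
    ("64-bit, 66MHz", 64, 200 / 3),
]

for label, width, clock in CONFIGS:
    print(f"{label}: {raw_bandwidth_mb_s(width, clock):.0f} MB/s raw")
```

These are raw peak figures. The effective data bandwidth is lower because of the protocol overheads described above.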

Most desktop systems have not evolved beyond the original PCI-1x specification. Server and workstation chipsets have support for PCI-2x, PCI-4x and PCI-X 133, but these are a much smaller market. The reason for this is simply a matter of demand. Except for graphics, there are no devices that demand anything even close to PCI's bandwidth. Graphics cards are almost exclusively handled through the AGP bus.

Keeping graphics on a separate bus is quite an astute decision: if you place a 133MHz PCI-X card and a 33MHz PCI card on the same bus, the bus is configured to the lowest common denominator. To avoid this, machines need multiple busses, one for high-performance cards and another for low-performance cards. But the average desktop has only one high-performance card in it. By using a different bus for that card, you prevent customers from putting their cards in the wrong slots and getting abysmal performance.

PCI Express is completely different. Each lane can simultaneously transmit and receive 2.5 gigabits per second. The encoding used on the links reduces this to 2 gigabits per second which is 250MB/s \emph{in each direction}. An x4 link has approximately the same bandwidth as a 133MHz 64-bit PCI-X bus. The x16 slots currently being deployed for graphics cards have double the bandwidth of an AGP 8x slot.
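The per-lane and per-link figures can be checked with a few lines of arithmetic. This sketch assumes the link encoding is 8b/10b (10 bits on the wire for every 8 bits of data), which is consistent with the drop from 2.5 to 2 gigabits per second; the names used are illustrative.

```python
RAW_MBIT_PER_LANE = 2500  # 2.5 gigabits per second per lane, per direction


def lane_data_mbit():
    """Usable megabits per second on one lane, in one direction."""
    # Assumption: 8b/10b encoding, so only 8 of every 10 bits carry data.
    return RAW_MBIT_PER_LANE * 8 // 10


def link_bandwidth_mb_s(lanes):
    """Usable MB/s in each direction for a link of the given width."""
    return lanes * lane_data_mbit() // 8


for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: {link_bandwidth_mb_s(lanes)} MB/s in each direction")
```

The result for an x4 link lands close to the roughly 1GB/s of a 133MHz 64-bit PCI-X bus, which is the comparison made above.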

The 1GB/s offered by PCI-X 133 isn't enough for the top-end cards. AGP has already moved beyond it to the 2GB/s AGP 8x. 10Gbps Ethernet cards require 2GB/s of data bandwidth. Serial ATA and Serial Attached SCSI will both require upwards of 2GB/s bandwidth per card in the next few years.

PCI-X has faster speeds specified, taking it to 4GB/s and probably beyond, but PCI Express aims to be the technology for the future. Since Intel has announced plans to kill future AGP development in favour of PCI Express, and graphics cards tend to lead the market in terms of bandwidth consumption, it seems like a pretty safe bet that PCI Express will become prevalent.

Address Spaces

Depending on the transaction type, the address may refer to one of three different address spaces -- Configuration, Memory or I/O.

I/O space is also commonly known as port space. Under Linux, you can examine how it is assigned by looking at /proc/ioports. It's a 16-bit address space which exists to provide compatibility with ISA cards. Linux provides the functions $inb()$, $inw()$, $inl()$, $outb()$, $outw()$ and $outl()$ for reading and writing byte, 2-byte and 4-byte quantities.
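As a rough illustration of what /proc/ioports contains, the sketch below parses text in that format into a list of regions. The sample text is made up for the example rather than taken from a real machine, and the parser assumes that nested regions are indented by two spaces per level.

```python
import re

# One region per line: "<start>-<end> : <owner>", addresses in hex.
LINE_RE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def parse_ioports(text):
    """Return a list of dicts describing each port range in the text."""
    regions = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that doesn't look like a region line
        indent, start, end, name = m.groups()
        regions.append({
            "depth": len(indent) // 2,  # assumption: two spaces per nesting level
            "start": int(start, 16),
            "end": int(end, 16),
            "name": name,
        })
    return regions


# Hypothetical sample in the /proc/ioports format, not real output.
SAMPLE = """\
0000-001f : dma1
0020-0021 : pic1
0cf8-0cff : PCI conf1
d000-dfff : PCI Bus 0000:01
  d000-d0ff : 0000:01:00.0
"""

for region in parse_ioports(SAMPLE):
    print("  " * region["depth"] + f"{region['start']:#06x}-{region['end']:#06x} {region['name']}")
```

On a real system the same function could be fed the contents of /proc/ioports; every address it returns should fit in 16 bits.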

Memory space is memory-mapped I/O. It's up to 64 bits in size and offers significant performance benefits over port I/O space. On the i386 architecture, you really can access it as if it were ordinary memory, but this doesn't work on all architectures. To access it in a portable manner, first $ioremap()$ it, then call $readb()$, $readw()$, $readl()$, $readq()$, $writeb()$, $writew()$, $writel()$ or $writeq()$ to access 1, 2, 4 or 8-byte quantities. You can also use functions like $memcpy\_fromio()$ and $memset\_io()$ to deal with larger chunks of memory-mapped I/O. Once you're done with it, call $iounmap()$ to release it.

Configuration space is a mere 256 bytes in size. It is the mechanism for plug-and-play configuration and reports much useful information about the device. Linux provides the functions $pci\_read\_config\_byte()$, $pci\_read\_config\_word()$, $pci\_read\_config\_dword()$, $pci\_write\_config\_byte()$, $pci\_write\_config\_word()$ and $pci\_write\_config\_dword()$ to access this space. It's not normally necessary to do this as Linux caches much of the useful information from this space in the $pci\_dev$ structure.
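To give a feel for the kind of information configuration space reports, here is a small sketch that decodes the first few fields of the standard header from a raw byte buffer. It runs in user space on a synthetic buffer rather than using the kernel functions named above; the function name and the device ID are made up for the example. On Linux, the same bytes can usually be read from the per-device config file in sysfs.

```python
import struct


def parse_config_header(cfg):
    """Decode the common fields at the start of PCI configuration space."""
    if len(cfg) < 16:
        raise ValueError("need at least the first 16 bytes of configuration space")
    # Configuration space is little-endian.
    vendor, device, command, status = struct.unpack_from("<HHHH", cfg, 0x00)
    revision, prog_if, subclass, base_class = struct.unpack_from("<BBBB", cfg, 0x08)
    header_type = cfg[0x0E]
    return {
        "vendor_id": vendor,
        "device_id": device,
        "command": command,
        "status": status,
        "revision": revision,
        "class": (base_class, subclass, prog_if),
        "header_layout": header_type & 0x7F,          # 0 = ordinary device, 1 = PCI-to-PCI bridge
        "multi_function": bool(header_type & 0x80),   # top bit flags a multi-function device
    }


# Synthetic 256-byte configuration space; the device ID is hypothetical.
buf = bytearray(256)
struct.pack_into("<HH", buf, 0x00, 0x8086, 0x1234)          # vendor ID, device ID
struct.pack_into("<BBBB", buf, 0x08, 0x01, 0x00, 0x00, 0x02)  # revision, prog IF, subclass, base class
buf[0x0E] = 0x80                                             # layout 0, multi-function

info = parse_config_header(bytes(buf))
print(f"vendor {info['vendor_id']:04x} device {info['device_id']:04x} class {info['class']}")
```

Base class 0x02 with subclass 0x00 denotes an Ethernet controller, so the synthetic device above would show up as a multi-function network card.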

PCI Express retains the three address spaces and adds a new one: the Message space. The Message space is used for various control messages. There's no need for the operating system to generate message packets explicitly; instead, a message of the appropriate type will be generated in response to an event. For example, if an error is detected, an Error Signalling Message will be sent.

Configuring Devices

Before a PCI device can be accessed in any other way, it must be configured. Platform-dependent code tells the Linux PCI code which root busses exist. The PCI code scans each bus for devices then configures the ones it finds.

Each device on a bus is uniquely identified by its device and function number. There can be up to 32 devices on each bus, though physical constraints normally limit the number of devices to around 5. Each device has a function 0 and may have up to 7 additional functions, though it is rare to see more than 2.
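The limits of 32 devices and 8 functions mean that a device and function number together fit in a single byte: 5 bits of device and 3 bits of function. Linux packs them this way in its PCI_DEVFN(), PCI_SLOT() and PCI_FUNC() macros. The sketch below shows the same packing in Python; the function names are illustrative.

```python
def devfn(device, function):
    """Pack a device number (0-31) and function number (0-7) into one byte."""
    if not 0 <= device < 32:
        raise ValueError("device number must be in the range 0-31")
    if not 0 <= function < 8:
        raise ValueError("function number must be in the range 0-7")
    return (device << 3) | function


def split_devfn(value):
    """Unpack a devfn byte into (device, function)."""
    return value >> 3, value & 0x07


packed = devfn(9, 0)
print(f"device 9, function 0 packs to {packed:#04x}")
print("unpacked again:", split_devfn(packed))
```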

Each PCI bus has a number, ranging from 0 to 255. Some of the devices on a bus may be PCI-to-PCI bridges, allowing for expansion to many secondary busses. To uniquely identify a device, you must know the number of the bus it is on, its device number and its function number. This is normally written out (for example by lspci) as 01:09.0. That represents bus 1, device 9, function 0.

Even this is not sufficient for some manufacturers so Linux 2.6 supports PCI domains (aka PCI segments). This adds yet another layer of hierarchy so a configuration address is now written in the form 0003:01:09.0 for domain 3, bus 1, device 9, function 0.
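A configuration address in either form can be split into its components with a short parser. This is a minimal sketch: the class and function names are illustrative, the fields are treated as hexadecimal (which is how lspci prints them), and a missing domain is assumed to mean domain 0.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# Optional 4-digit domain, then bus:device.function, all in hex.
ADDR_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(text):
    """Parse 'DDDD:BB:dd.f' or 'BB:dd.f' into a PciAddress."""
    m = ADDR_RE.match(text)
    if not m:
        raise ValueError(f"not a PCI address: {text!r}")
    domain, bus, device, function = m.groups()
    addr = PciAddress(
        int(domain or "0", 16),  # assumption: no domain given means domain 0
        int(bus, 16),
        int(device, 16),
        int(function),
    )
    if addr.device > 31:
        raise ValueError(f"device number out of range in {text!r}")
    return addr


def format_pci_address(addr):
    """Write a PciAddress back out in the long form."""
    return f"{addr.domain:04x}:{addr.bus:02x}:{addr.device:02x}.{addr.function}"


print(parse_pci_address("0003:01:09.0"))
print(format_pci_address(parse_pci_address("01:09.0")))
```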

Once a device has been found, it can be configured. This involves assigning ranges of port and memory space to it, setting up interrupts, configuring DMA, error reporting and so on.

From a software point of view, PCI Express changes very little. The configuration space is expanded from 256 bytes to 4096. Because PCI Express is point-to-point, the bus hierarchy tends to be deeper than on conventional PCI, but this doesn't pose a problem.

Credits