68040 Info:
----------------------------
This new CISC microprocessor offers RISC performance
----------------------------

Motorola has officially unwrapped its newest 32-bit microprocessor, the 68040. Manufactured with 0.8-micron high-speed CMOS technology, the 68040 packs 1.2 million transistors onto a single silicon die. With 900,000 extra transistors to work with over the roughly 300,000 in a 68030, the 68040's designers added new features and boosted performance. The new features include the following:

-- Optimised 68030 integer unit. While retaining object-code compatibility with previous 68000-family processors, the IU has been optimised to execute instructions in fewer clock cycles (i.e., run faster). The claimed boost in performance is three times that of a 68030.

-- Integral FPU. The 68020 and 68030 require external FPU coprocessor chips to handle floating-point math. The 68040, however, has an FPU built in, giving it the power to do serious number crunching. The FPU's data types are compatible with the ANSI/IEEE 754 standard for binary floating-point math, and its instruction set is object-code-compatible with Motorola's 68881/68882 FPUs. Like the IU, the 68040's on-chip FPU has been optimised to execute frequently used instructions in fewer clock cycles. The claimed performance boost is 10 times that of a 68882.

-- Large caches. Processor accesses to the system bus are minimised by storing the most recently used instructions and data in on-chip, 4K-byte caches. The two caches operate independently and can be accessed at the same time. Bus snoop logic maintains cache coherence (i.e., it ensures that the cache's contents match the parts of memory they correspond to). The bus snooper's design is fine-tuned to support multiprocessor systems in which one or more bus masters or 68040s might share the same section of memory.

-- Separate memory units for instructions and data. Each memory unit consists of a memory management unit, a cache controller, and bus snoop logic. The MMUs use a subset of the 68030's MMU instruction set. Both memory units function independently of each other to improve processor throughput.

The 68040 ships with an initial clock speed of 25MHz; higher speeds are to be available in the future, Motorola says. The 68040 comes in a 179-pin grid-array package. With the elimination of coprocessor function lines (now that the MMU and FPU are consolidated onto the processor) and the addition of snoop control lines, the 68040 is not pin-compatible with the 68030.

Because of the 68040's software compatibility with its predecessors, it can tap into the existing base of 680x0 applications. It does this while eliminating a component (the FPU) from a computer's design and improving performance at the same time. In fact, the 68040 executes instructions at an average rate of nearly one per clock cycle -- the same as a RISC processor.

Fine-Tuned for Performance

The 68040 was built on the firm foundation of its predecessors. The design team used the experience garnered from developing earlier processors to aid in optimising the throughput of the 040. The 040 was designed from the ground up, Motorola engineers said. It incorporates a high degree of parallelism using a number of internal buses. An internal Harvard architecture gives the processor simultaneous access to both instructions and data (a rough structural sketch appears below). Both the IU and FPU have separate pipelines and can operate concurrently. For example, the FPU can perform floating-point instructions independently of the IU.
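To make the dual-stream organisation easier to picture, here is a minimal structural sketch in C. It is not Motorola's definition of anything; the type and field names are invented for illustration, and the sizes simply echo the organisation figures quoted later in this article (64-entry ATCs, 64-set caches, 16-byte lines).

    /* A minimal structural sketch of the 68040's two independent memory
       units, one per stream.  Invented names; sizes from the article. */

    struct atc_entry {                 /* one address-translation-cache entry */
        unsigned long logical_page;    /* logical page number                 */
        unsigned long physical_page;   /* physical page number                */
        unsigned char cache_mode;      /* write-through, copyback, etc.       */
        unsigned char valid;
    };

    struct cache_line {                /* one line: 4 longwords = 16 bytes    */
        unsigned long tag;
        unsigned long data[4];
        unsigned char valid, dirty;
    };

    struct memory_unit {               /* one of the two independent streams  */
        struct atc_entry  atc[16][4];  /* 64-entry, 4-way set-associative ATC */
        struct cache_line cache[64][4];/* 4K bytes: 64 sets x 4 lines x 16    */
        /* plus a cache controller and bus snoop logic, not modelled here     */
    };

    struct m68040_memory {
        struct memory_unit instruction_stream;  /* feeds the IU pipeline      */
        struct memory_unit data_stream;         /* handles operand accesses   */
    };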
Each stream (instructions or data) has its own dedicated cache and memory unit that function independently of each other, and a smart bus controller assigns priorities to bus traffic to and from the caches.

There were several key areas where Motorola was able to boost performance. The first was reducing the clock cycles needed to execute certain instructions. The next was ensuring that the processor funnels instructions and data into itself quickly and constantly, lest it stall while waiting for information. The processor must also get its results back out to the system without interfering with incoming information. Finally, as if this weren't enough, the processor stays off the system bus to a greater extent than other processor designs do, leaving the bus free for DMA transfers and other bus masters.

CISC with the Speed of RISC

The IU was optimised so that high-usage instructions, particularly branch instructions, execute in fewer clock cycles. Motorola said it performed thousands of code traces on real-world applications to determine which instructions were used most often. The IU consists of six stages: instruction prefetch, decode, effective-address calculation, operand fetch, execution, and writeback (i.e., the result is written either to a register or to memory). Each stage works concurrently on the instruction pipeline. Dual prefetch and decode units deal with branch instructions: One set processes the instruction stream for the branch taken, and the other processes the stream for the branch not taken. In this way, no matter what the outcome, the IU has the next instruction decoded and ready to go without seriously disrupting the pipeline. This complex design has a big pay-off: Motorola has determined that the average instruction takes 1.3 clock cycles to execute. The ability to execute an instruction once per clock cycle is the performance edge of RISC processors -- yet the 68040's IU accomplishes nearly the same goal while executing complex-instruction-set-computer (CISC) instructions.

The FPU adds 11 registers to the 68040 register set: Eight of them are 80-bit floating-point data registers, and three are status, control, and instruction address registers. The FPU has a three-stage execution unit, and, as in the IU, each stage operates concurrently. Load and store instructions (FMOVE) can be performed during other arithmetic operations, and a 64- by 8-bit hardware multiplication unit speeds many calculations. However, the FPU implements only a subset of the 68882 instructions on-chip. The transcendental (trigonometric and exponential) functions are emulated in software via a trap. But Motorola claims that even these instructions should execute 25% to 100% faster on a 25MHz 68040 than on a 33MHz 68882 FPU.

Boosting Throughput

In the area of throughput, each stream is managed by a separate memory unit that uses an MMU for logical-to-physical address translations during bus accesses. These MMUs support demand-paged virtual memory. Both MMUs have a four-way set-associative address translation cache (ATC) with 64 entries (versus 22 entries for the 68030). The ATCs reduce processor overhead by storing the most recent address translations. When an address translation is required, the ATC is searched; if it contains the translation, it is used immediately. Otherwise, a combination of high-speed hardware logic and microcode searches the translation tables located in main memory (a simplified sketch of this lookup follows).
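The sketch below makes that lookup order concrete, assuming 4K-byte pages and reusing the invented atc_entry and memory_unit structures from the earlier sketch. The table walk is reduced to a stub, and the replacement policy is ignored; none of this is Motorola's code.

    /* Simplified ATC lookup for one memory unit, assuming 4K-byte pages. */

    #define PAGE_SHIFT 12              /* 4K pages: low 12 bits are the offset */
    #define ATC_SETS   16              /* 64 entries, 4-way set-associative    */
    #define ATC_WAYS   4

    unsigned long walk_translation_tables(unsigned long logical_page); /* stub */

    unsigned long translate(struct memory_unit *mu, unsigned long logical_addr)
    {
        unsigned long page   = logical_addr >> PAGE_SHIFT;
        unsigned long offset = logical_addr & ((1UL << PAGE_SHIFT) - 1);
        unsigned int  set    = page % ATC_SETS;
        unsigned int  way;

        /* 1. Search the ATC; a hit supplies the physical page immediately. */
        for (way = 0; way < ATC_WAYS; way++) {
            struct atc_entry *e = &mu->atc[set][way];
            if (e->valid && e->logical_page == page)
                return (e->physical_page << PAGE_SHIFT) | offset;
        }

        /* 2. Otherwise walk the translation tables in main memory and
              load the result into the ATC for next time.                */
        {
            struct atc_entry *e = &mu->atc[set][0];
            e->logical_page  = page;
            e->physical_page = walk_translation_tables(page);
            e->valid         = 1;
            return (e->physical_page << PAGE_SHIFT) | offset;
        }
    }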
Like the FPU, these MMUs implement a subset of their predecessor's instruction set, in this case the 68030's MMU instructions. Gone are the PLOAD and PMOVE instructions, because enhanced existing instructions made them superfluous. Also, only two memory page sizes are supported, 4K and 8K bytes, whereas the 68030 MMU supported eight page sizes ranging from 256 bytes to 32K bytes. A design tradeoff was made here: A performance gain was possible by supporting only the two most common page sizes. In any case, this change affects only operating-system code, since MMU instructions aren't normally used by applications.

The two on-chip 4K-byte caches improve processor throughput in two ways: They keep the pipelines filled, and they minimise system bus accesses. To see how this is done, you must examine the structure of the cache (a sketch of how an address maps into it appears at the end of this section). Each is a four-way set-associative cache composed of 64 sets of four lines, where a line consists of 4 longwords, or 16 bytes. Cache lines are read or written rapidly using burst-mode access (a type of bus transfer that moves 16 bytes in a minimum of clock cycles). For read operations, this fills the cache efficiently and, at the same time, loads adjacent instructions or data that may be needed in the near future.

Zen and the Art of Cache Maintenance

As the cache is accessed and data is modified, cache-mode bits in the ATC determine, on a page-by-page basis, how the information is handled. That is, the ATC entry corresponding to the main-memory address whose contents were copied into the cache decides how the data will be updated. The modes are cacheable write-through, cacheable copyback, noncacheable, and noncacheable I/O. In cacheable write-through mode, an update to the data cache forces a write to main memory. While this generates additional bus activity, this mode is required when working with a portion of memory that other processors share. Copyback mode updates the cache line without updating main memory; the modified (or "dirty") cache line is copied back to main memory only when absolutely necessary. Noncacheable indicates that the data shouldn't be cached, which is typically the situation for shared data structures or for locked accesses (e.g., an operand access or a translation-table-entry update). Noncacheable I/O indicates that the data can't be cached and must be read or written in the exact order of instruction execution. This mode is for memory-mapped I/O devices (typically a serial device) where the order of the information is crucial.

The bus snooper is used in multiple-bus-master situations where a noncaching bus master, such as a DMA controller, might modify memory that is mapped into the 68040's cache. The bus snooper monitors the external bus and updates the cache as required. Cache validity is handled on a line-by-line basis (i.e., a cache miss triggers a burst-mode access that updates 16 bytes either in the cache or in main memory). The copyback mode minimises writes to main memory, and the bus controller prioritises each cache's external memory requests; read requests take priority over writes to ensure that the pipelines remain filled.

The caches are critical to the 040's overall throughput. They keep instructions and data moving into the processor while satisfying the apparently contradictory role of minimising system bus accesses. Motorola estimates that the cache hit rate is about 93 percent for instruction and data reads and about 94 percent for data writes.
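The sketch referenced above shows how a 32-bit address might be carved up for a cache of this shape: 64 sets of four 16-byte lines. It is a minimal illustration with invented names, not Motorola's implementation, and it ignores the distinction between logical and physical indexing.

    /* How an address maps into a 4K-byte, four-way set-associative cache
       built from 64 sets of four 16-byte lines.  A sketch only. */

    #define LINE_SIZE   16   /* 4 longwords per line                        */
    #define CACHE_SETS  64
    #define CACHE_WAYS  4    /* 64 sets x 4 lines x 16 bytes = 4096 bytes   */

    struct line_address {
        unsigned long tag;      /* identifies which block occupies a line   */
        unsigned int  set;      /* which of the 64 sets to search           */
        unsigned int  offset;   /* byte offset within the 16-byte line      */
    };

    struct line_address split_address(unsigned long addr)
    {
        struct line_address a;
        a.offset = addr % LINE_SIZE;                 /* bits 0-3   */
        a.set    = (addr / LINE_SIZE) % CACHE_SETS;  /* bits 4-9   */
        a.tag    = addr / (LINE_SIZE * CACHE_SETS);  /* bits 10-31 */
        return a;
    }

    /* On an access, the four lines of set a.set are checked for a valid,
       matching tag; a miss triggers a 16-byte burst-mode fill of one line. */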
A Processor for the 1990s

It is perhaps appropriate that Motorola has introduced the 68040 in the first month of the 1990s. The 040 has the power to tackle the information-heavy jobs we will be dealing with regularly over the next ten or so years. Preliminary results have a 68040 weighing in at 20 million instructions per second, versus the SPARC's 18 MIPS and the 80486's 15 MIPS, all clocked at 25MHz. On floating-point operations, the 68040 antes up 3.5 million floating-point operations per second, versus the SPARC's 2.6 MFLOPS and the 80486's 1 MFLOPS. If these numbers are accurate, then the 68040 already outperforms one RISC processor.

But the computer industry doesn't stand still. As we move into the new decade, we can expect new RISC processors to once again take the lead in performance. Still, the 68040 shows that owners of CISC systems can have their cake and eat it, too: They don't have to forsake their software base or settle for mediocre performance. And Motorola is already working on the 68050.
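As a rough sanity check, the preliminary 20-MIPS figure lines up with the 1.3 clocks-per-instruction average quoted earlier; the snippet below simply does that arithmetic and is not a Motorola benchmark.

    /* Consistency check: 25MHz divided by 1.3 clocks per instruction
       gives roughly 19 MIPS, in line with the 20-MIPS figure above. */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 25e6;                 /* 25MHz 68040            */
        double clocks_per_instruction = 1.3;    /* Motorola's average CPI */
        double mips = clock_hz / clocks_per_instruction / 1e6;

        printf("Estimated throughput: %.1f MIPS\n", mips);   /* ~19.2 */
        return 0;
    }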