Technical analysis of CPU performance flaw

TECHNICAL ANALYSIS

Let's talk about transfer rate. Current chipset data sheets specify 5-2-2-2 burst timing with 60 nsec EDO memory on a 66.66 MHz bus (15 nsec per cycle): 75 nsec for the first transfer (on a page hit, of course) and 30 nsec for each of the three that follow, a total of 165 nsec for the whole burst. So we would expect this EDO memory to deliver about 1,000,000,000/165 = 6,060,606 bursts/sec of 4*8 = 32 bytes each, which is 185.0 Mbytes/sec (1 Mbyte = 1,048,576 bytes). Does it?
Alternatively, counting cycles: 5+2+2+2 = 11 cycles per burst gives 66.66 MHz/11 = 6,060,606 bursts/sec of 32 bytes each, the same 185.0 Mbytes/sec. But when we actually measure a 200 MHz Pentium MMX's main memory read rate in the best possible way (a large, unrolled, 2-burst loop) we get only 116.95 Mbytes/sec, which is 17.4 cycles or 261 nsec per burst. Why? What are the additional 6.4 cycles, or 96 nsec, for?
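The conversion between cycles per burst and Mbytes/sec used above can be checked mechanically. The following C sketch is only an illustration: the function names are ours, and it assumes a 66.66 MHz bus, 32-byte bursts and 1 Mbyte = 1,048,576 bytes (the convention that reproduces the cycle counts quoted in this analysis).

```c
#define BUS_HZ      66.66e6    /* memory bus clock, as used in the text */
#define BURST_BYTES 32.0       /* 4 transfers of 8 bytes each */
#define MBYTE       1048576.0  /* assumption: 1 Mbyte = 2^20 bytes */

/* Read rate in Mbytes/sec for a given number of bus cycles per burst. */
double rate_mbytes(double cycles_per_burst)
{
    return BUS_HZ / cycles_per_burst * BURST_BYTES / MBYTE;
}

/* Bus cycles per burst implied by a measured rate in Mbytes/sec. */
double cycles_per_burst(double mbytes_per_sec)
{
    return BUS_HZ * BURST_BYTES / (mbytes_per_sec * MBYTE);
}
```

Under these assumptions cycles_per_burst(116.95) comes to about 17.4, the figure quoted for the measured read rate, against the 11 cycles that 5-2-2-2 timing should need.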

Workaround:
After a long time of experimentation we discovered that Intel has a specific flaw or imperfection in the read buffer of the 486, Pentium and Pentium MMX. Intel's data sheets tell us that on a cache-miss read the bus controller returns the requested word first (on a page-hit EDO access, after 5 cycles) and then reads the other 3 words of the burst in parallel; these become available when the whole burst finishes, 2+2+2 = 6 cycles later. What Intel does not tell us is that if another byte/word/dword in the same burst line is requested while the burst is still loading (on non-MMX Pentiums, even the very same one), there is a considerable time penalty on top of the normal burst time of 5+2+2+2 = 11 cycles. (Our guess is that the processor stalls while the read buffer transfers its contents into the data cache and the execution unit then reads from the data cache.)

If, however, the read request made before the burst has finished targets a different burst line, there is no penalty: immediately after the current burst a new one is started (while the data of the first is copied to the L1 cache), and it again returns the requested byte/word/dword first. Consequently, the workaround that gets rid of the penalty is simply to rearrange the order of the read requests.
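To make the reordering concrete, here is a minimal C sketch of one possible access order. It is not taken from the original source code: the function name, the one-line lookahead scheme and the use of volatile (to stop a compiler from reordering the loads) are our assumptions, and a real implementation would more likely be hand-written assembly. It sums a buffer while touching the first dword of the next burst line before reading the rest of the current one, so a second read never goes to a line whose burst has only just been started.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_DWORDS 8  /* one 32-byte burst line = 8 dwords */

/* Sums n_lines * 8 dwords starting at buf (assumed 32-byte aligned). */
uint32_t sum_reordered(const volatile uint32_t *buf, size_t n_lines)
{
    uint32_t sum = 0;
    if (n_lines == 0)
        return 0;

    uint32_t first = buf[0];  /* cache miss: starts the burst for line 0 */
    for (size_t line = 0; line < n_lines; line++) {
        const volatile uint32_t *cur = buf + line * LINE_DWORDS;
        uint32_t next_first = 0;

        /* Request a dword in a DIFFERENT line first: per the text this
         * queues the next burst without penalty, and by the time it
         * returns, the burst for the current line has completed. */
        if (line + 1 < n_lines)
            next_first = cur[LINE_DWORDS];

        sum += first;
        for (size_t i = 1; i < LINE_DWORDS; i++)
            sum += cur[i];  /* rest of the current line */

        first = next_first;
    }
    return sum;
}
```

The result is the same as a plain sequential sum; only the order of the reads differs. Whether the order pays off on a given machine depends on timing that this sketch does not control.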

We even managed to make read transfers better than 5-2-2-2: by reading 2 bursts per loop iteration we got 5-2-2-2-3-2-2-2 = 20 cycles for 2 bursts, or 10 cycles/burst. That is 200 Mbytes/sec, an increase of 71% in main memory read/search speed! (Actually 10.17 cycles/burst, the 0.17 being the page miss delay.)

Similar rules also apply to the secondary (L2) cache. Modern L2 cache is 3-1-1-1 = 6 cycles, so the read rate should be 339 Mbytes/sec. Instead it is only 224.71 Mbytes/sec, that is, 9 cycles/burst. Exactly the same loop that we used for main memory brings it to 338.98 Mbytes/sec for L2!

Main memory read rate: 116.95 => 200 Mbytes/sec (+71%)
Secondary cache read rate: 224.71 => 338.98 Mbytes/sec (+50.8%)

To increase the write rate dramatically, we used a simple trick: floating point writes, which are the only way to do 64 bit writes on a non-MMX Pentium. On MMX Pentiums we of course use MMX instructions.

Main memory write rate: 85.10 => 169.49 Mbytes/sec (+99.2%)
Secondary cache write rate: 84.38 => 169.49 Mbytes/sec (+100.9%)

(P5’s cache has no write allocation).
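The 64 bit write trick can be sketched in C as a fill loop that stores one double per iteration. This is an illustration under assumptions, not the original code: the function name is ours, double is assumed to be 8 bytes, and whether each assignment really becomes a single 64 bit floating point store depends on the compiler and target.

```c
#include <stddef.h>

/* Stores `value` into n consecutive 8-byte slots, one double per store.
 * Assumes sizeof(double) == 8 and dst aligned to 8 bytes. */
void fill_qwords(double *dst, double value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;  /* intended to compile to one 64-bit FP store */
}
```

One caveat when moving arbitrary data this way: a bit pattern that happens to encode a signalling NaN may be altered when it passes through the floating point unit, so the fill value here should be an ordinary number.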

By combining these 2 techniques, we increased the main memory transfer rate (read & write) from a mediocre 49.50 Mbytes/sec (41.1 cycles/burst using the supposedly ‘perfect’ REP MOVSD) to 95.23 Mbytes/sec (21.36 cycles/burst; 3-2-2-2 read + 3-3-3-3 write!).

Main memory transfer rate: 49.50 => 95.23 Mbytes/sec (+92%)
Secondary cache transfer rate: 70.42 => 119.04 Mbytes/sec (+69%)
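A copy loop that combines both ideas might look like the following C sketch. It is hypothetical and not the original routine: the name copy_lines, the one-line lookahead on the source and the use of double as the 8-byte element type are our assumptions, and the same signalling-NaN caveat applies as for any floating point move of raw data.

```c
#include <stddef.h>

#define LINE_QWORDS 4  /* one 32-byte burst line = 4 qwords */

/* Copies n_lines 32-byte lines from src to dst with 8-byte loads and
 * stores. Buffers must not overlap. volatile on src pins the order of
 * the reads so the compiler cannot rearrange them. */
void copy_lines(double *dst, const volatile double *src, size_t n_lines)
{
    if (n_lines == 0)
        return;

    double first = src[0];  /* starts the burst for line 0 */
    for (size_t line = 0; line < n_lines; line++) {
        const volatile double *s = src + line * LINE_QWORDS;
        double *d = dst + line * LINE_QWORDS;
        double next_first = 0.0;

        /* Touch the next source line before finishing this one, so no
         * read goes to a line whose burst has only just been started. */
        if (line + 1 < n_lines)
            next_first = s[LINE_QWORDS];

        d[0] = first;  /* 64-bit writes */
        for (size_t i = 1; i < LINE_QWORDS; i++)
            d[i] = s[i];

        first = next_first;
    }
}
```

The destination ends up byte-for-byte equal to the source for ordinary values; the sketch makes no claim about reaching the cycle counts quoted above.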

Finally, by using the 64 bit write loop (FP or MMX) on video memory (Matrox Mystique 4 MB), we got an increase of 30.8% (from 81.30 Mbytes/sec, 3.13 PCI cycles/qword, to 106.38 Mbytes/sec, 2.39 PCI cycles/qword). This shows that the 33 MHz 32 bit PCI bus and mid-range PCI video cards can accept data faster than the Pentium MMX CPU can supply it.

Video memory write rate: 81.30 => 106.38 Mbytes/sec (+30.8%)
Video memory transfer rate: 57.47 => 68.49 Mbytes/sec (+19.2%)

(The video transfer rate measures reading from main memory and writing to video memory, which is the most common frame-update technique used by games.)
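The PCI cycles/qword figures follow from the write rates by simple arithmetic, sketched below in C. The function name is ours, and the sketch assumes a 33.33 MHz PCI clock (half of the 66.66 MHz bus clock) and 1 Mbyte = 1,048,576 bytes.

```c
#define PCI_HZ      33.33e6    /* assumption: PCI clock = bus clock / 2 */
#define QWORD_BYTES 8.0
#define MBYTE       1048576.0  /* assumption: 1 Mbyte = 2^20 bytes */

/* PCI clock cycles spent per 8-byte write at a given rate in Mbytes/sec. */
double pci_cycles_per_qword(double mbytes_per_sec)
{
    return PCI_HZ * QWORD_BYTES / (mbytes_per_sec * MBYTE);
}
```

With these assumptions, 81.30 Mbytes/sec works out to about 3.13 cycles/qword and 106.38 Mbytes/sec to about 2.39. For comparison, a 32 bit bus needs at least 2 data cycles to move one qword.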

All the above measurements were made on a 200 MHz Pentium MMX with Intel's HX chipset and EDO memory. (The TX chipset gives slightly different values.)

Schematic demonstration of the flaw and our workaround.

Everything at this web site is the property of Intelligent Firmware Ltd. You may not repost/publish this information without our explicit permission.