Let's talk about transfer rate. Current chipset data sheets state that
they achieve 5-2-2-2 transfer rates using 60 nsec EDO memory; that is, at
15 nsec per 66.66 MHz bus cycle, EDO takes 75 nsec for the initial transfer
(on a page hit, of course) and 30 nsec for each of the three remaining
transfers, a total of 165 nsec for the whole burst. So we would expect this
EDO memory to manage about 1,000,000,000/165 = 6060606 bursts/sec of
4*8=32 bytes each, or about 185 Mbytes/sec (with 1 Mbyte = 1,048,576
bytes). Does it?
Alternatively, with the 5-2-2-2 = 11 cycle burst in mind, we would expect
a memory read rate of about 66.66 MHz/11 cycles = 6060606 bursts/sec of
4*8=32 bytes each, again about 185 Mbytes/sec.
But when we actually measure a Pentium MMX-200MHz's main memory read rate
in the best possible way (a large, 2-burst unrolled loop) we get only
116.95 Mbytes/sec, which is 17.4 cycles or 261 nsec per burst. Why? What
are the additional 6.4 cycles or 96 nsec for?
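The conversion between bus cycles per burst and Mbytes/sec can be sketched
in C as follows. The 66.66 MHz bus clock and the 32-byte burst come from the
text; taking 1 Mbyte = 1,048,576 bytes is an assumption of this sketch,
chosen because it reproduces the measured pairing of 116.95 Mbytes/sec with
17.4 cycles. The function names are illustrative only.

```c
#include <assert.h>

#define BUS_HZ       66.66e6     /* bus clock, from the text */
#define BURST_BYTES  32.0        /* 4 transfers of 8 bytes, from the text */
#define MBYTE        1048576.0   /* assumption: 1 Mbyte = 2^20 bytes */

/* Read rate in Mbytes/sec for a given number of bus cycles per burst. */
double rate_mbytes(double cycles)
{
    assert(cycles > 0.0);
    return BUS_HZ / cycles * BURST_BYTES / MBYTE;
}

/* Bus cycles per burst implied by a measured rate in Mbytes/sec. */
double cycles_per_burst(double mbytes_per_sec)
{
    assert(mbytes_per_sec > 0.0);
    return BUS_HZ * BURST_BYTES / (mbytes_per_sec * MBYTE);
}
```

Calling rate_mbytes(11.0) gives the ideal 5-2-2-2 figure, and
cycles_per_burst(116.95) gives the cycle count behind the measured rate.
With a different definition of a megabyte the results shift by a few percent.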
Workaround:
After a long time of experimentation we discovered that Intel has a
specific flaw or imperfection in their 486/Pentium/Pentium MMX’s read buffer. In
Intel’s data sheets we are told that on a cache miss read, the bus controller
will return the requested word first (that is, on page hit EDO in 5 cycles)
and will proceed afterwards to read in parallel the other 3 words (of the
burst) which will be available after the whole burst finishes (after
2+2+2=6 cycles). What Intel does not tell us is that if one requests
another byte/word/dword in the same burst line (even the same one, on
non-MMX Pentiums) while the burst is still loading, there is a
considerable time penalty, on top of the normal whole-burst time of
5+2+2+2=11 cycles. (Our
guess is that the processor stalls for the read buffer to transfer its
contents into the data cache and the execution unit to read from the data
cache.) But if, before the burst has finished, another read request is
made in a different burst line, there is no penalty; immediately after
the current burst a new one is generated (while the data of the first
are copied to the L1 cache), and it returns the requested
byte/word/dword first. Consequently, the workaround that gets rid of
this penalty is simply to rearrange the order of read requests.
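As an illustration only, and not the authors' actual loop (which is not
shown here), one possible rearrangement can be sketched in C: within a small
block, first read one dword from every 32-byte line, so that each new request
falls in a different burst line, and only afterwards read the remaining
dwords of each line. The function name, the 4 KB block size and the use of
`volatile` to keep the compiler from reordering the loads are assumptions of
this sketch. C gives no timing guarantees, so a real version of the trick
would be written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_DWORDS   8      /* 32-byte burst line = 8 dwords (from the text) */
#define BLOCK_DWORDS  1024   /* assumption: 4 KB block, small enough for L1 */

/* Sum n dwords (n must be a multiple of LINE_DWORDS) in "touch first" order. */
uint32_t sum_touch_first(const volatile uint32_t *buf, size_t n)
{
    uint32_t sum = 0;

    assert(n % LINE_DWORDS == 0);

    for (size_t base = 0; base < n; base += BLOCK_DWORDS) {
        size_t end = (n - base > BLOCK_DWORDS) ? base + BLOCK_DWORDS : n;

        /* Pass 1: one read per line; consecutive requests always hit
           different burst lines. */
        for (size_t i = base; i < end; i += LINE_DWORDS)
            sum += buf[i];

        /* Pass 2: the other seven dwords of each line, read only after
           the touch pass over the whole block. */
        for (size_t i = base; i < end; i += LINE_DWORDS)
            for (size_t j = 1; j < LINE_DWORDS; j++)
                sum += buf[i + j];
    }
    return sum;
}
```

The result is the same as a plain sequential sum; only the order in which
the dwords are requested changes.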
We even managed to make read transfers better than 5-2-2-2: by reading 2 bursts per loop iteration we got 5-2-2-2-3-2-2-2 = 20 cycles for 2 bursts, or 10 cycles/burst, that is 200 Mbytes/sec, an increase of 71% in main memory read/search speed! (Actually 10.17 cycles/burst, the 0.17 being the page miss delay.)
Similar rules apply to the secondary (L2) cache. Modern L2 cache is 3-1-1-1 = 6 cycles, so the read rate should be 339 Mbytes/sec. Instead it is only 224.71 Mbytes/sec, that is, 9 cycles/burst. Exactly the same loop that we used for main memory gives 338.98 Mbytes/sec for L2!
Main memory read rate: 116.95 => 200 Mbytes/sec (+71%)
Secondary cache read rate: 224.71 => 338.98 Mbytes/sec (+50.8%)
In order to increase the write rate dramatically, we used the simple trick
of floating point writes, which are the only way 64 bit writes can be done
on a non-MMX Pentium. On the MMX Pentiums, of course, we use MMX
instructions.
Main memory write rate: 85.10 => 169.49 Mbytes/sec (+99.1%)
Secondary cache write rate: 84.38 => 169.49 Mbytes/sec (+100%)
(P5’s cache has no write allocation).
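The floating point write trick can be sketched in C as a fill loop whose
stores have type double, so that each store is 64 bits wide. This is a
minimal sketch under stated assumptions: the function name is hypothetical,
and whether the compiler really emits a single 64-bit FP store per element
depends on the compiler and target. Bit patterns that encode a signaling NaN
may be altered by an x87 load, so the sketch assumes the pattern is an
ordinary number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n qwords at dst with a 64-bit pattern using double-typed stores. */
void fill_qwords_fp(double *dst, uint64_t pattern, size_t n)
{
    double value;

    assert(sizeof(double) == sizeof(uint64_t));   /* 64-bit doubles assumed */

    /* Reinterpret the pattern's bits as a double without changing them. */
    memcpy(&value, &pattern, sizeof value);

    for (size_t i = 0; i < n; i++)
        dst[i] = value;   /* intended to compile to one 64-bit store each */
}
```

On an MMX Pentium the same loop would use MMX stores instead, which C cannot
express portably.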
By combining these 2 techniques, we increased the main memory transfer rate (read & write) from a mediocre 49.50 Mbytes/sec (41.1 cycles/burst using the supposedly ‘perfect’ REP MOVSD) to 95.23 Mbytes/sec (21.36 cycles/burst; 3-2-2-2 read + 3-3-3-3 write!).
Main memory transfer rate: 49.50 => 95.23 Mbytes/sec (+92%)
Secondary cache transfer rate: 70.42 => 119.04 Mbytes/sec (+69%)
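A copy loop that combines both ideas might look like the following C sketch.
It is not the authors' routine: the block-wise arrangement, the function
name and the 4 KB block size are assumptions. The 64-bit moves are written
with uint64_t so that every bit pattern is copied exactly; on a 32-bit
Pentium target the 64-bit stores would have to come from FP or MMX
instructions, which this sketch only stands in for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_QWORDS   4     /* 32-byte burst line = 4 qwords (from the text) */
#define BLOCK_QWORDS  512   /* assumption: 4 KB block that fits in L1 */

/* Copy n qwords: touch each source line once, then move the block. */
void copy_touch_then_qwords(uint64_t *dst, const uint64_t *src, size_t n)
{
    volatile uint64_t sink = 0;   /* keeps the touch reads from being dropped */

    assert(dst != NULL && src != NULL);

    for (size_t base = 0; base < n; base += BLOCK_QWORDS) {
        size_t end = (n - base > BLOCK_QWORDS) ? base + BLOCK_QWORDS : n;

        /* Read side: one request per source line, each in a new burst line. */
        for (size_t i = base; i < end; i += LINE_QWORDS)
            sink = src[i];

        /* Write side: 64-bit moves from the lines touched above. */
        for (size_t i = base; i < end; i++)
            dst[i] = src[i];
    }
}
```

Because the P5 cache has no write allocation, the stores in this sketch do
not pull destination lines into the cache, so the touched source block is
not displaced by the writes.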
Finally, by using the 64 bit writing loop (FP or MMX) on the video memory (Matrox Mystique 4 MB), we got an increase of 30.8% (from 81.30 Mbytes/sec, 3.13 PCI cycles/qword, to 106.38 Mbytes/sec, 2.39 PCI cycles/qword). This shows that the 33 MHz 32 bit PCI bus and mid-range PCI video cards are faster than what the Pentium MMX CPU can handle.
Video memory write rate: 81.30 => 106.38 Mbytes/sec (+30.8%)
Video memory transfer rate: 57.47 => 68.49 Mbytes/sec (+19.2%)
(Video transfer rate is reading from main memory and writing to video
memory, which is the most common technique for frame updating used by games).
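The PCI cycles/qword figures quoted for video memory can be related to the
write rates with a short C helper. This is a sketch under two assumptions:
that the nominal 33 MHz PCI clock is 33.33 MHz, and that 1 Mbyte = 1,048,576
bytes. The function name is illustrative.

```c
#include <assert.h>

#define PCI_HZ       33.33e6     /* assumption: "33 MHz" PCI runs at 33.33 MHz */
#define QWORD_BYTES  8.0         /* one 64-bit write */
#define MBYTE        1048576.0   /* assumption: 1 Mbyte = 2^20 bytes */

/* PCI clock cycles spent per qword at a given write rate in Mbytes/sec. */
double pci_cycles_per_qword(double mbytes_per_sec)
{
    assert(mbytes_per_sec > 0.0);
    return PCI_HZ * QWORD_BYTES / (mbytes_per_sec * MBYTE);
}
```

Under these assumptions, the two measured write rates, 81.30 and 106.38
Mbytes/sec, come out close to the 3.13 and 2.39 cycles/qword given in the
text.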
All the above measurements were made on a 200 MHz Pentium MMX with Intel's
HX chipset and EDO memory. (TX chipset has slightly different values).
Schematic demonstration of the flaw and our workaround.
Click here if you are interested in the source code.
Everything at this web site is the
property of Intelligent Firmware Ltd. You may not repost/publish this information
without our explicit permission.