Modern processors read from memory or secondary cache in bursts. A burst (also known as a line fill) is a series of 4 successive, contiguous reads. Processors read in bursts rather than in single accesses because bursts are quicker: when a single read takes 5 cycles, a burst read takes 5+2+2+2 = 11 cycles, that is 11/4 = 2.75 cycles per read. On this page we assume that a burst takes 5+2+2+2 = 11 cycles.
The size of each burst is fixed at 4 transfers of 8 bytes each, 32
bytes in total.
All cycle counts on this page are for a Pentium with a VX/HX/TX chipset and
EDO memory. Slightly different timings apply to other configurations.
(The read accesses on this page have been simplified to 64-bit ones.)
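The burst arithmetic above can be checked with a short sketch. The function name and its parameters are ours, chosen for illustration; only the numbers 5, 2, 4 and 8 come from the text.

```python
def burst_cycles(lead_off=5, follow_up=2, transfers=4):
    """Cycles for one burst: one lead-off transfer plus the remaining ones."""
    return lead_off + follow_up * (transfers - 1)

cycles = burst_cycles()               # 5 + 2 + 2 + 2
per_read = cycles / 4                 # average cycles per 8-byte read
burst_bytes = 4 * 8                   # 4 transfers of 8 bytes each
print(cycles, per_read, burst_bytes)  # 11 2.75 32
```

Passing lead_off=3 gives the 9-cycle figure used later on this page for a burst that directly follows another one.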
A' Conventional reading:
The conventional (normal) method of reading/searching/transferring
is the sequential one, which is also recommended
by Intel.
1' First read:
The first read is performed. It takes 5 cycles.
Conclusion:
The total time taken by the conventional method is 5+2+2+2+penalty
cycles, plus a few cycles for the program's instructions to execute. This
results in a total of about 17 cycles per burst, which is about 117
Mbytes per second.
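A cycles-per-burst figure can be turned into a transfer rate as sketched below. The bus clock is not stated on this page, so BUS_HZ is purely an assumed value for illustration, and the result is not expected to reproduce the figure quoted above exactly. The function name is ours.

```python
BURST_BYTES = 32        # 4 transfers of 8 bytes (from the text)
BUS_HZ = 66_000_000     # ASSUMPTION: bus clock, not given on this page

def bytes_per_second(cycles_per_burst, bus_hz=BUS_HZ, burst_bytes=BURST_BYTES):
    """Sustained transfer rate when every burst costs cycles_per_burst cycles."""
    bursts_per_second = bus_hz / cycles_per_burst
    return bursts_per_second * burst_bytes

rate = bytes_per_second(17)   # about 17 cycles per burst (conventional method)
print(f"{rate / 2**20:.1f} Mbytes/s (counting 2**20 bytes per Mbyte)")
```

Whatever the clock, the rate is inversely proportional to the cycles spent per burst, so shaving cycles off each burst raises the rate by the same factor.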
B' Innovative reading:
The unconventional method we discovered
is non-sequential.
1' First read:
As in the conventional method, the first read is performed. It
takes 5 cycles.
2' Second read:
The second read is not done at address 8 but at address 32, that
is, at the start of the next burst! Of course, this second read can
only be served after the whole current burst has finished, that is, after
2+2+2 = 6 cycles. It then takes 3 cycles instead of 5, because Intel's
chipsets have a special case: when a burst is initiated immediately after
another one has finished, it is considered an extension of the previous
one and takes only 3 cycles! With the conventional method this cannot
happen because of the processor's penalty. Here there is no penalty,
because no access is made to the current burst.
Total delay: 6+3 = 9 cycles.
3' Third, fourth and fifth reads:
Because the whole first burst is already loaded (and stored in primary
cache), the remaining 3 reads incur no delay. Note that while the
processor makes these 3 reads, it continues to load the 2nd burst, so no
bus time is lost. This method exploits the processor's parallelism to its
maximum: all the instructions of the program run while the processor
loads bursts from outside, so the bus operates at its
maximum bandwidth!
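The read order described in steps 1 to 3 can be written out as a small sketch. It only generates addresses; it does not touch memory or measure anything. The function name is ours, and we assume that the pattern shown for the first burst (0, 32, 8, 16, 24) simply repeats for every later burst, which the text does not spell out.

```python
BURST_BYTES = 32   # one burst: 4 transfers of 8 bytes (from the text)
READ_BYTES = 8     # each read simplified to 64 bits (from the text)

def innovative_order(num_bursts):
    """Return byte addresses in the non-sequential order described above.

    ASSUMPTION: the order given for the first burst repeats unchanged
    for every following burst.
    """
    order = [0]  # first read: start of burst 0
    for b in range(num_bursts):
        base = b * BURST_BYTES
        if b + 1 < num_bursts:
            # touch the start of the next burst before finishing this one
            order.append(base + BURST_BYTES)
        # then the remaining three 8-byte reads of the current burst
        order.extend(base + off
                     for off in range(READ_BYTES, BURST_BYTES, READ_BYTES))
    return order

print(innovative_order(2))  # [0, 32, 8, 16, 24, 40, 48, 56]
```

Every 8-byte location is still read exactly once; only the order differs from the sequential 0, 8, 16, 24, 32, ... of the conventional method.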
Conclusion:
The total time taken by the innovative method is 5+2+2+2+3
= 14 cycles for one full burst and the start of the next one. Burst 1
takes 5+2+2+2 = 11 cycles and burst 2 takes 3+2+2+2 = 9 cycles;
that is 10 cycles per burst on average,
which is exactly 200
Mbytes per second!
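The averaging above can be extended to longer runs with a short sketch. It assumes that every burst after the first one also qualifies for the 3-cycle start, so that each costs 9 cycles; the page only works this out for burst 2. The function name is ours.

```python
def average_cycles_per_burst(num_bursts, first=11, chained=9):
    """Average burst cost when every burst after the first is chained.

    ASSUMPTION: all later bursts get the shortened 3+2+2+2 = 9 cycle timing.
    """
    total = first + chained * (num_bursts - 1)
    return total / num_bursts

print(average_cycles_per_burst(2))    # 10.0
print(average_cycles_per_burst(100))  # 9.02
```

Under that assumption the 10-cycle average is the two-burst case, and longer runs approach 9 cycles per burst.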
Everything at this web site is the
property of Intelligent Firmware Ltd. You may not repost/publish this information
without our explicit permission.