Table 1: Speed improvements over memcpy


For entries under the header "Self-warming," the optimized copy code
included a pre-warming section that read every 32nd byte of the source
buffer before entering the copy loop.  The column labeled "Bytes/Iter"
indicates how many bytes were moved per iteration of the copy loop.  The
column labeled "Warm Cache Results" shows the cycle count per byte
obtained when the optimized copy code ran with the entire source and
destination buffers already in the cache.  The column labeled
"Cycles Saved" indicates how many cycles per byte the optimized code
saved over memcpy.  The column labeled "%Improvement" indicates the
percentage performance improvement obtained by running the optimized
code instead of memcpy.  This figure is calculated by the formula:

%improvement = 100 * (1 - (optimized cycle count / memcpy cycle count))
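
The formula above can be sketched in C as follows.  The function names
are hypothetical and chosen only to illustrate the arithmetic; both
arguments are cycle counts per byte.

```c
#include <assert.h>

/* Cycles per byte saved by the optimized copy relative to memcpy. */
double cycles_saved(double optimized_cycles, double memcpy_cycles)
{
    return memcpy_cycles - optimized_cycles;
}

/* Percentage improvement: 100 * (1 - optimized / memcpy). */
double percent_improvement(double optimized_cycles, double memcpy_cycles)
{
    assert(memcpy_cycles > 0.0);  /* avoid dividing by zero */
    return 100.0 * (1.0 - optimized_cycles / memcpy_cycles);
}
```

With illustrative values that are not taken from the table, an optimized
copy at 2.00 cycles per byte against a memcpy at 3.00 cycles per byte
saves 1.00 cycle per byte, an improvement of about 33 percent.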

The last two columns ("Cycles Saved" and "%Improvement") were measured
by applying both memcpy and the optimized copy to a cold cache.  For
comparison, memcpy applied to a warm cache took 0.27 cycles per byte
transferred on an MMX machine and 0.31 on a non-MMX Pentium 90.
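
The pre-warming step described for the "Self-warming" entries can be
sketched in C as below.  This is a minimal illustration, not the code
that was measured: the names prewarm and warm_copy are hypothetical, and
the 8-byte inner loop is only a portable stand-in for the MMX or float
register copy.  The 32-byte stride is taken from the text; it matches
the 32-byte cache line of the Pentium, so each read presumably pulls one
line of the source into the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WARM_STRIDE 32  /* every 32nd byte, as described in the text */

/* Touch one byte per stride so the source is in cache before the copy.
   The volatile pointer keeps the compiler from dropping the reads.
   Returns the number of bytes touched. */
size_t prewarm(const void *src, size_t len)
{
    const volatile unsigned char *p = (const volatile unsigned char *)src;
    size_t touched = 0;
    for (size_t i = 0; i < len; i += WARM_STRIDE) {
        (void)p[i];
        touched++;
    }
    return touched;
}

/* Pre-warm the source, then copy 8 bytes per iteration (stand-in loop). */
void *warm_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    assert(dst != NULL && src != NULL);
    prewarm(src, len);
    for (; i + 8 <= len; i += 8)
        memcpy(d + i, s + i, 8);   /* one 8-byte move per iteration */
    for (; i < len; i++)
        d[i] = s[i];               /* copy any remaining tail bytes */
    return dst;
}
```

The "Bytes/Iter" column corresponds to the amount moved by one pass of
the inner loop; the sketch fixes it at 8, the smallest value in the
table.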
 
No Self-warming, MMX register copy, MMX machine
Bytes/Iter  Warm Cache Results  Cycles Saved    %Improvement
8           0.65                1.01            33
16          0.41                0.91            30
32          0.34                0.74            24
64          0.31                0.72            23
256         0.29                0.62            20

Self-warming, MMX register copy, MMX machine
Bytes/Iter  Warm Cache Results  Cycles Saved    %Improvement
8           0.70                1.02            33
16          0.44                0.99            32
32          0.38                0.93            30
64          0.35                0.95            31
256         0.32                0.72            23

No Self-warming, float register copy, MMX machine
Bytes/Iter  Warm Cache Results  Cycles Saved    %Improvement
8           1.15                1.25            36
16          1.10                1.53            44
32          1.03                1.35            39
64          0.97                1.54            43

Self-warming, float register copy, MMX machine
Bytes/Iter  Warm Cache Results  Cycles Saved    %Improvement
8           1.19                1.51            45
16          1.13                1.60            47
32          1.06                1.64            48
64          1.00                1.71            48

No Self-warming, float register copy, Pentium 90
(All cases were worse than memcpy)

Self-warming, float register copy, Pentium 90
Bytes/Iter  Warm Cache Results  Cycles Saved    %Improvement
8           1.21                0.41            16
16          1.15                0.41            16
32          1.09                0.49            20
64          1.02                0.59            24