For entries under the header "Self-warming," the optimized copy code included a pre-warming section that read every 32nd byte of the read buffer before entering the copy loop.

The column labeled "Bytes/Iter" gives the number of bytes moved per iteration of the copy loop. The column labeled "Warm Cache Results" gives the cycle count per byte obtained when the optimized copy code runs with a cache that already contains the entire source and destination buffers. The column labeled "Cycles Saved" gives the number of cycles per byte that the optimized code saved over memcpy. The column labeled "% Improvement" gives the percentage performance improvement obtained by running the optimized code instead of memcpy. This figure is calculated by the formula:

    % improvement = 100 * (1 - optimized cycle count / memcpy cycle count)

The last two columns were measured by applying both memcpy and the optimized copy to a cold cache. When memcpy was applied to a warm cache, the cycle count per byte transferred was 0.27 on an MMX machine and 0.31 on a non-MMX Pentium 90.
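The pre-warming pass and the improvement formula described above can be sketched in C. This is only an illustration, not the code that was measured: the function names prewarm and percent_improvement are hypothetical, and the sketch assumes the 32-byte stride is meant to touch each cache line of the read buffer once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read every 32nd byte of the source buffer so that
   it is pulled into the cache before the copy loop starts. The volatile
   qualifier keeps the compiler from discarding the reads; the returned
   sum is only a sink for the loaded values. */
unsigned prewarm(const unsigned char *src, size_t len)
{
    const volatile unsigned char *p = src;
    unsigned sink = 0;
    for (size_t i = 0; i < len; i += 32)
        sink += p[i];
    return sink;
}

/* % improvement = 100 * (1 - optimized cycle count / memcpy cycle count) */
double percent_improvement(double optimized_cycles, double memcpy_cycles)
{
    assert(memcpy_cycles > 0.0);
    return 100.0 * (1.0 - optimized_cycles / memcpy_cycles);
}
```

As a made-up example, an optimized copy costing 2.0 cycles per byte against a memcpy costing 4.0 cycles per byte gives percent_improvement(2.0, 4.0) = 50.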
No Self-warming, MMX register copy, MMX machine

Bytes/Iter   Warm Cache Results   Cycles Saved   % Improvement
         8                 0.65           1.01              33
        16                 0.41           0.91              30
        32                 0.34           0.74              24
        64                 0.31           0.72              23
       256                 0.29           0.62              20

Self-warming, MMX register copy, MMX machine

Bytes/Iter   Warm Cache Results   Cycles Saved   % Improvement
         8                 0.70           1.02              33
        16                 0.44           0.99              32
        32                 0.38           0.93              30
        64                 0.35           0.95              31
       256                 0.32           0.72              23

No Self-warming, float register copy, MMX machine

Bytes/Iter   Warm Cache Results   Cycles Saved   % Improvement
         8                 1.15           1.25              36
        16                 1.10           1.53              44
        32                 1.03           1.35              39
        64                 0.97           1.54              43

Self-warming, float register copy, MMX machine

Bytes/Iter   Warm Cache Results   Cycles Saved   % Improvement
         8                 1.19           1.51              45
        16                 1.13           1.60              47
        32                 1.06           1.64              48
        64                 1.00           1.71              48

No Self-warming, float register copy, Pentium 90

(All cases were worse than memcpy)

Self-warming, float register copy, Pentium 90

Bytes/Iter   Warm Cache Results   Cycles Saved   % Improvement
         8                 1.21           0.41              16
        16                 1.15           0.41              16
        32                 1.09           0.49              20
        64                 1.02           0.59              24
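The tables report only the cycles saved and the percentage improvement on a cold cache, not the underlying cycle counts. Assuming "Cycles Saved" is the memcpy count minus the optimized count, the improvement formula rearranges to memcpy = 100 * saved / % improvement, so the cold-cache counts can be estimated from any row. The sketch below shows that rearrangement; the function names are hypothetical, and because the published figures are rounded, the results are estimates only.

```c
#include <assert.h>

/* Hypothetical helper: estimate memcpy's cold-cache cycles per byte.
   With saved = memcpy - optimized and
   pct = 100 * (1 - optimized / memcpy) = 100 * saved / memcpy,
   it follows that memcpy = 100 * saved / pct. */
double memcpy_cold_cycles(double cycles_saved, double pct_improvement)
{
    assert(pct_improvement > 0.0);
    return 100.0 * cycles_saved / pct_improvement;
}

/* Hypothetical helper: the optimized copy's cold-cache cycles per byte
   is the memcpy estimate minus the cycles saved. */
double optimized_cold_cycles(double cycles_saved, double pct_improvement)
{
    return memcpy_cold_cycles(cycles_saved, pct_improvement) - cycles_saved;
}
```

Applied to the first row of the first table (1.01 cycles saved, 33% improvement), this puts memcpy at roughly 3 cycles per byte on a cold cache, subject to the rounding of the published figures.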