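Today I continued optimizing the earlier memory-copy code:
1. Reworked the small-copy path: the range it covers grew from 64 bytes to 128 bytes, and the copies switched from int to xmm registers, making small copies about 80% faster.
2. Reworked the medium-copy path: instead of copying through 4 xmm registers in parallel, it now uses 8 in parallel plus prefetch, a gain of roughly 20%.
3. Removed the branch that checked alignment at the head of the destination address; a single xmm copy now brings the destination to alignment, improving performance by about 10%.
4. Added a test case: to be closer to real workloads, there is now a random-access test that copies at random positions with random lengths inside a 10 MB region (definitely larger than L2).
To keep random number generation from skewing the results, the random numbers are generated in advance. Average performance ends up at more than 2x that of the standard library shipped with gcc 4.9: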
https://github.com/skywind3000/FastMemcpy
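To make items 2 and 3 concrete, here is a minimal sketch written with SSE2 intrinsics. It is not the FastMemcpy source: the name `copy_medium_sketch`, the 256-byte prefetch distance and the `_MM_HINT_NTA` hint are assumptions chosen for illustration, and the buffers are assumed not to overlap.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_store_si128 */
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from FastMemcpy): copies n >= 16 bytes
 * between non-overlapping buffers. */
static void *copy_medium_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t head;

    assert(n >= 16);

    /* Head: one unaligned 16-byte copy, then advance to the next 16-byte
     * boundary of dst. There is no "is dst already aligned?" branch; an
     * aligned dst simply advances by a full 16 bytes. */
    _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
    head = 16 - ((uintptr_t)d & 15);          /* 1..16, never more than n */
    d += head; s += head; n -= head;

    /* Body: 8 xmm registers per iteration, unaligned loads, aligned stores. */
    while (n >= 128) {
        __m128i c0, c1, c2, c3, c4, c5, c6, c7;
        /* Prefetch is only a hint and does not fault on invalid addresses.
         * The distance (256 bytes) is an assumed value, not a tuned one. */
        _mm_prefetch((const char *)((uintptr_t)s + 256), _MM_HINT_NTA);
        c0 = _mm_loadu_si128((const __m128i *)s + 0);
        c1 = _mm_loadu_si128((const __m128i *)s + 1);
        c2 = _mm_loadu_si128((const __m128i *)s + 2);
        c3 = _mm_loadu_si128((const __m128i *)s + 3);
        c4 = _mm_loadu_si128((const __m128i *)s + 4);
        c5 = _mm_loadu_si128((const __m128i *)s + 5);
        c6 = _mm_loadu_si128((const __m128i *)s + 6);
        c7 = _mm_loadu_si128((const __m128i *)s + 7);
        _mm_store_si128((__m128i *)d + 0, c0);
        _mm_store_si128((__m128i *)d + 1, c1);
        _mm_store_si128((__m128i *)d + 2, c2);
        _mm_store_si128((__m128i *)d + 3, c3);
        _mm_store_si128((__m128i *)d + 4, c4);
        _mm_store_si128((__m128i *)d + 5, c5);
        _mm_store_si128((__m128i *)d + 6, c6);
        _mm_store_si128((__m128i *)d + 7, c7);
        s += 128; d += 128; n -= 128;
    }

    /* Remainder: single 16-byte steps (dst is still aligned), then bytes. */
    while (n >= 16) {
        _mm_store_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        s += 16; d += 16; n -= 16;
    }
    if (n > 0)
        memcpy(d, s, n);
    return dst;
}
```

The head copy may write bytes that the aligned loop writes again a moment later. That redundancy is the price of dropping the branch, and the rewritten bytes have the same values both times.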
Results for the latest code (compare with the older table to see whether the new version improved):
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms
benchmark random access:
memcpy_fast=515ms memcpy=1014ms
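For readers who want to reproduce a random-access measurement like the one above, the idea of generating all random positions and lengths before the timed loop can be sketched as follows. This is not the repository's benchmark code: `copy_job`, `make_jobs`, `time_jobs` and `rand30` are hypothetical names, and using separate source and destination buffers is an assumption made to keep the copies from overlapping.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* One pre-generated copy request (hypothetical layout). */
typedef struct {
    size_t dst_off;
    size_t src_off;
    size_t len;
} copy_job;

/* rand() may return as few as 15 random bits, so combine two calls. */
static size_t rand30(void)
{
    return (((size_t)rand() & 0x7fff) << 15) | ((size_t)rand() & 0x7fff);
}

/* Fill jobs[] BEFORE timing so that rand() never runs inside the timed loop.
 * Requires 1 <= max_len <= buf_size. */
static void make_jobs(copy_job *jobs, size_t count,
                      size_t buf_size, size_t max_len, unsigned seed)
{
    size_t i;
    srand(seed);
    for (i = 0; i < count; i++) {
        size_t len = 1 + rand30() % max_len;   /* 1..max_len */
        size_t room = buf_size - len;          /* valid start offsets: 0..room */
        jobs[i].len = len;
        jobs[i].dst_off = rand30() % (room + 1);
        jobs[i].src_off = rand30() % (room + 1);
    }
}

/* Run every job through fn and return elapsed CPU time in milliseconds. */
static double time_jobs(copy_fn fn, unsigned char *dst,
                        const unsigned char *src,
                        const copy_job *jobs, size_t count)
{
    size_t i;
    clock_t t0 = clock();
    for (i = 0; i < count; i++)
        fn(dst + jobs[i].dst_off, src + jobs[i].src_off, jobs[i].len);
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `make_jobs` once and then passing the same `jobs` array to `time_jobs` for both `memcpy` and the custom routine gives each implementation the identical sequence of positions and lengths.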
The old results were:
result: gcc4.9 (msvc 2012 got a similar result):
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms
benchmark random access:
memcpy_fast=594ms memcpy=1161ms
VS2015 has improved a lot over earlier versions; in a few cases it even shows signs of pulling ahead.
====================================
E:\FastMemcpy-master>cl -nologo -O2 FastMemcpy.c
FastMemcpy.c
c:\program files (x86)\microsoft sdks\windows\v7.1a\include\sal_supp.h(57): warning C4005: '__useHeader': macro redefinition
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\sal.h(2886): note: see previous definition of '__useHeader'
c:\program files (x86)\microsoft sdks\windows\v7.1a\include\specstrings_supp.h(77): warning C4005: '__on_failure': macro redefinition
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\sal.h(2896): note: see previous definition of '__on_failure'
E:\FastMemcpy-master>FastMemcpy.exe
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=58ms memcpy=75 ms
result(dst aligned, src unalign): memcpy_fast=58ms memcpy=77 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=76 ms
result(dst unalign, src unalign): memcpy_fast=54ms memcpy=76 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=66ms memcpy=81 ms
result(dst aligned, src unalign): memcpy_fast=66ms memcpy=88 ms
result(dst unalign, src aligned): memcpy_fast=66ms memcpy=87 ms
result(dst unalign, src unalign): memcpy_fast=67ms memcpy=88 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=139ms memcpy=203 ms
result(dst aligned, src unalign): memcpy_fast=145ms memcpy=221 ms
result(dst unalign, src aligned): memcpy_fast=160ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=162ms memcpy=207 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=146 ms
result(dst aligned, src unalign): memcpy_fast=113ms memcpy=157 ms
result(dst unalign, src aligned): memcpy_fast=126ms memcpy=156 ms
result(dst unalign, src unalign): memcpy_fast=126ms memcpy=157 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=51ms memcpy=50 ms
result(dst aligned, src unalign): memcpy_fast=53ms memcpy=58 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=68 ms
result(dst unalign, src unalign): memcpy_fast=59ms memcpy=77 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=53 ms
result(dst aligned, src unalign): memcpy_fast=53ms memcpy=63 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=67 ms
result(dst unalign, src unalign): memcpy_fast=54ms memcpy=73 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=193ms memcpy=133 ms
result(dst aligned, src unalign): memcpy_fast=193ms memcpy=141 ms
result(dst unalign, src aligned): memcpy_fast=196ms memcpy=138 ms
result(dst unalign, src unalign): memcpy_fast=192ms memcpy=146 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=238ms memcpy=262 ms
result(dst aligned, src unalign): memcpy_fast=259ms memcpy=271 ms
result(dst unalign, src aligned): memcpy_fast=288ms memcpy=319 ms
result(dst unalign, src unalign): memcpy_fast=265ms memcpy=280 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=258ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=262ms memcpy=291 ms
result(dst unalign, src aligned): memcpy_fast=272ms memcpy=296 ms
result(dst unalign, src unalign): memcpy_fast=268ms memcpy=302 ms
benchmark random access:
memcpy_fast=484ms memcpy=433ms
@MuYu
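That's right, the standard library keeps improving too, and hopefully it keeps getting better so that we no longer need to hand-optimize these things ourselves. My point is only that we shouldn't "unconditionally trust" the off-the-shelf pieces we use. Years ago, reading the CPU instruction set, you would have assumed rep movsb was the fastest way to copy, but it actually wasn't. Likewise, gcc's list.size() used to be O(n) rather than O(1). There are too many examples to list, so even base libraries are not perfect and it is still fair to question them, right?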
@skywind
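Look at how many open-source libraries reimplement parts of the base library. SDL 1/2, for example, still carries its own memcpy implementation (although it is not as fast as FastMemcpy). What this reflects is a general distrust of early base-library versions, plus a programming preference for precise control over every step.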
@MuYu
I've pushed another update, haha, and it is quite a bit faster. Could you help test it on VS2015? I don't have it at hand.
@MuYu
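If you read the source code carefully, you'll find that the memcpy in msvc 14 really is far more refined than the one in msvc 12.
msvc 14 already splits the work into multiple cases much like skywind does, and the approach is basically the same; only some thresholds are set differently.
It is written in asm, of course. As for the tiny case, doing it the way this post does would be quite painful to write in asm...
That said, judging from what I've seen of glibc and msvc, my personal impression is that they don't pay particular attention to the tiny case.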
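For context on what "tiny" handling can look like in C rather than asm, here is one well-known branch-light pattern: copy the first and last chunk of the range with two possibly overlapping unaligned accesses. This is only an illustration under assumptions; `copy_tiny_sketch` is a hypothetical name, the 32-byte limit is arbitrary, and it is not claimed to be what FastMemcpy, glibc or msvc actually do.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes, 0 <= n <= 32. */
static void copy_tiny_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(n <= 32);

    if (n >= 16) {
        /* 16..32 bytes: first 16 and last 16; they overlap when n < 32. */
        __m128i head = _mm_loadu_si128((const __m128i *)s);
        __m128i tail = _mm_loadu_si128((const __m128i *)(s + n - 16));
        _mm_storeu_si128((__m128i *)d, head);
        _mm_storeu_si128((__m128i *)(d + n - 16), tail);
    } else if (n >= 8) {
        /* 8..15 bytes: same idea with two 64-bit words. Fixed-size memcpy
         * calls like these are normally compiled to plain loads/stores. */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {
        /* 4..7 bytes: two 32-bit words. */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else {
        /* 0..3 bytes: plain byte loop. */
        while (n-- > 0)
            *d++ = *s++;
    }
}
```

Each size class loads both chunks before storing either, so the number of branches depends only on the size class, not on the exact length.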