ruby -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-linux]
python3 --version
Python 3.3.2
Benchmarking with the classic recursive fibonacci:
#ruby
require 'benchmark'

def fib n
  if n < 2
    1
  else
    fib(n - 1) + fib(n - 2)
  end
end

Benchmark.bm{|bm| bm.report{puts fib(35)}}
#python
import timeit

def fib(n):
    if n < 2:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)

if __name__=='__main__':
    from timeit import Timer
    t = Timer("print(fib(35))","from __main__ import fib")
    print(t.timeit(1))
#ruby
ruby fib.rb
user system total real
14930352
2.600000 0.000000 2.600000 ( 2.601639)
#python3.3.2
python3 fib.py
14930352
9.720147652000378
The Ruby version is about 3.7× faster. With Python 2.7.5 it's somewhat quicker, about 6.3 s, though still slower than Ruby.
Ruby 2.0 really has gotten a lot faster — or is something wrong with my Python code?
That's the result of all the specialized instructions that were added... You can turn off the trace compile option and make it even faster:
require 'benchmark'

iseq = RubyVM::InstructionSequence.compile \
  <<-RUBY, __FILE__, '.', 0, trace_instruction: false
  def fib n
    if n < 2
      1
    else
      fib(n - 1) + fib(n - 2)
    end
  end
RUBY
iseq.eval

Benchmark.bm{|bm| bm.report{puts fib(35)}}
$ ruby fib.rb
user system total real
14930352
2.240000 0.000000 2.240000 ( 2.240812)
$ python2 fib.py
14930352
4.98553395271
$ python3 fib.py
14930352
7.37453122099987
$ pypy fib.py
14930352
1.59528398514
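As an aside, `trace_instruction` is just one of the VM's compile options; you can inspect the defaults yourself. A quick sketch — the exact set of keys varies across Ruby versions (trace instructions themselves were later removed in favor of TracePoint instrumentation):

```ruby
# Print the default options RubyVM::InstructionSequence.compile uses.
# On 2.0 the hash includes :trace_instruction; newer Rubies may not,
# since trace instructions were eventually dropped.
p RubyVM::InstructionSequence.compile_option
```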
if ruby
  puts "u love"
elsif python
  puts 'u love too'
...
end
puts 'good programmers'
pypy is worth comparing against jruby — both are BDD (benchmark driven development) products...
Look at the instructions:
puts iseq.disasm
The left column is the offset, measured in qwords.
opt_plus is a specialized instruction for common methods, added in 1.9 — with it, integer arithmetic runs at roughly half the speed of a static language. opt_send_simple and the *_OP_XXX instructions are specializations added in 2.0.
Unbelievable — replacing send with opt_lt, opt_minus and opt_plus took it from the same speed as Python to twice as fast.
But the bytecode Python generates by default is similar these days: the corresponding instructions are COMPARE_OP, BINARY_ADD and BINARY_SUBTRACT, not a load of the operator followed by CALL_FUNCTION. I think the cause lies elsewhere.
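You can see the specialized instructions for yourself by disassembling a tiny snippet (a sketch; exact opcode names and output layout vary by Ruby version):

```ruby
# Compile a trivial expression and dump its instruction sequence.
# The addition shows up as opt_plus rather than a generic send.
iseq = RubyVM::InstructionSequence.compile("a = 3; b = 4; a + b")
puts iseq.disasm
```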
That depends on how the corresponding instructions are implemented. Python's versions may just tweak the calling convention, while Ruby's opt_plus inlines the fixnum / float / string / array addition paths.
No, exactly those three — COMPARE_OP, BINARY_ADD and BINARY_SUBTRACT — do inline the int case, so for this workload they should behave the same.
Python not using threaded dispatch would make it a bit slower, but Ruby is twice as fast; swapping dispatch techniques alone can't account for that.
case BINARY_ADD:
    w = POP();
    v = TOP();
    if (PyInt_CheckExact(v) && PyInt_CheckExact(w)) {
        /* INLINE: int + int */
        register long a, b, i;
        a = PyInt_AS_LONG(v);
        b = PyInt_AS_LONG(w);
        /* cast to avoid undefined behaviour
           on overflow */
        i = (long)((unsigned long)a + b);
        if ((i^a) < 0 && (i^b) < 0)
            goto slow_add;
        x = PyInt_FromLong(i);
    }
    else if (PyString_CheckExact(v) &&
             PyString_CheckExact(w)) {
        x = string_concatenate(v, w, f, next_instr);
        /* string_concatenate consumed the ref to v */
        goto skip_decref_vx;
    }
    else {
      slow_add:
        x = PyNumber_Add(v, w);
    }
    Py_DECREF(v);
  skip_decref_vx:
    Py_DECREF(w);
    SET_TOP(x);
    if (x != NULL) continue;
    break;
#16 楼 @bhuztez OK, fair enough — Ruby's implementation pops too.
Are you sure Python isn't threaded? It's pretty basic for interpreters... even switch..case dispatch can be threaded.
Run it under cachegrind and compare the cache miss rates — that roughly tells you whether dispatch is threaded:
valgrind --tool=cachegrind --cachegrind-out-file=/tmp/rb ruby fib.rb
==49280==
==49280== I refs: 10,237,483,293
==49280== I1 misses: 397,866
==49280== LLi misses: 9,631
==49280== I1 miss rate: 0.00%
==49280== LLi miss rate: 0.00%
==49280==
==49280== D refs: 4,846,225,267 (2,980,702,999 rd + 1,865,522,268 wr)
==49280== D1 misses: 456,388 ( 344,239 rd + 112,149 wr)
==49280== LLd misses: 66,417 ( 11,651 rd + 54,766 wr)
==49280== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==49280== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==49280==
==49280== LL refs: 854,254 ( 742,105 rd + 112,149 wr)
==49280== LL misses: 76,048 ( 21,282 rd + 54,766 wr)
==49280== LL miss rate: 0.0% ( 0.0% + 0.0% )
Look mainly at the LL misses.
Python
==8232== I refs: 24,435,810,159
==8232== I1 misses: 424,549
==8232== LLi misses: 7,356
==8232== I1 miss rate: 0.00%
==8232== LLi miss rate: 0.00%
==8232==
==8232== D refs: 11,447,475,486 (8,210,939,243 rd + 3,236,536,243 wr)
==8232== D1 misses: 505,807 ( 410,181 rd + 95,626 wr)
==8232== LLd misses: 57,991 ( 23,760 rd + 34,231 wr)
==8232== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==8232== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==8232==
==8232== LL refs: 930,356 ( 834,730 rd + 95,626 wr)
==8232== LL misses: 65,347 ( 31,116 rd + 34,231 wr)
==8232== LL miss rate: 0.0% ( 0.0% + 0.0% )
Ruby
==8464== I refs: 14,783,720,041
==8464== I1 misses: 137,316
==8464== LLi misses: 9,038
==8464== I1 miss rate: 0.00%
==8464== LLi miss rate: 0.00%
==8464==
==8464== D refs: 6,208,884,291 (3,666,216,719 rd + 2,542,667,572 wr)
==8464== D1 misses: 205,003 ( 150,900 rd + 54,103 wr)
==8464== LLd misses: 57,501 ( 17,151 rd + 40,350 wr)
==8464== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==8464== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==8464==
==8464== LL refs: 342,319 ( 288,216 rd + 54,103 wr)
==8464== LL misses: 66,539 ( 26,189 rd + 40,350 wr)
==8464== LL miss rate: 0.0% ( 0.0% + 0.0% )
Python gets this close using switch/case? There must be some black magic in there...
Just downloaded the py3 source... Also, does Python do inline caching? That matters a lot too.
Direct threading is definitely enabled in python3.
Because of
if (x != NULL) DISPATCH();
break;
and that DISPATCH() can effectively be goto *ip++ (increment the instruction pointer, jump to the next instruction's handler). Python looks up the attribute and then applies it, and caching non-method attributes is pointless, so the difference is probably the inline method cache...
If you open vm_opts.h in the Ruby source and set these optimization options to 0 to turn them off:
#define OPT_INLINE_METHOD_CACHE 0
#define OPT_GLOBAL_METHOD_CACHE 0
then make install, it should get a bit closer to python3...
The one I pasted was Python 2.
So 3 has it on and 2 doesn't, yet 2 is still quite a bit faster than 3... that makes no sense.
http://bugs.python.org/issue4753
Ruby must be using black magic...
Python
def fib(n):
    if n < 2:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)
  2           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               2 (1)
             15 RETURN_VALUE

  5     >>   16 LOAD_GLOBAL              0 (fib)
             19 LOAD_FAST                0 (n)
             22 LOAD_CONST               2 (1)
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1
             29 LOAD_GLOBAL              0 (fib)
             32 LOAD_FAST                0 (n)
             35 LOAD_CONST               1 (2)
             38 BINARY_SUBTRACT
             39 CALL_FUNCTION            1
             42 BINARY_ADD
             43 RETURN_VALUE
             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE
def fib n
  if n < 2
    1
  else
    fib(n - 1) + fib(n - 2)
  end
end
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 2] n<Arg>
0000 getlocal n ( 1)
0002 putobject 2
0004 opt_lt <ic:6>
0006 branchunless 12
0008 putobject 1 ( 2)
0010 leave ( 1)
0011 pop
0012 putself ( 4)
0013 getlocal n
0015 putobject 1
0017 opt_minus <ic:7>
0019 send :fib, 1, nil, 8, <ic:2>
0025 putself
0026 getlocal n
0028 putobject 2
0030 opt_minus <ic:8>
0032 send :fib, 1, nil, 8, <ic:4>
0038 opt_plus <ic:9>
0040 leave
function fib(n)
  if (n<2) then
    return 1
  end
  return fib(n-1)+fib(n-2)
end
function <fib.lua:1,7> (13 instructions, 52 bytes at 0x1120710)
1 param, 4 slots, 0 upvalues, 1 local, 3 constants, 0 functions
1 [2] LT 0 0 -1 ; - 2
2 [2] JMP 2 ; to 5
3 [3] LOADK 1 -2 ; 1
4 [3] RETURN 1 2
5 [6] GETGLOBAL 1 -3 ; fib
6 [6] SUB 2 0 -2 ; - 1
7 [6] CALL 1 2 2
8 [6] GETGLOBAL 2 -3 ; fib
9 [6] SUB 3 0 -1 ; - 2
10 [6] CALL 2 2 2
11 [6] ADD 1 1 2
12 [6] RETURN 1 2
13 [7] RETURN 0 1
And that's one more instruction than Ruby...
Erlang is absurdly fast — it doesn't even need a full second...
-module(fib).
-export([test/0, fib/1]).

fib(0) ->
    1;
fib(1) ->
    1;
fib(N) ->
    fib(N-1)+fib(N-2).

test() ->
    {T, V} = timer:tc(fib, fib, [35]),
    io:format("~w:~w~n", [T,V]).
➜ demo python fib.py
14930352
4.9382610321
➜ demo python3 fib.py
14930352
4.950316907998058
➜ demo pypy fib.py
14930352
0.762974023819
#30 楼 @luikore By definition, every function automatically gets those appended at the end...
I only counted the instructions that actually execute.
I just tried deleting those two instructions and apparently broke the Python interpreter...
You can generate the bytecode with my orz first: http://xiazheteng.github.io/orz/
from orz.lua.asm import Assembly, Label
from orz.asm import Opcode, StringTable
from orz.symbol import Global, Local, Name, Attribute, calculate_slots

import StringIO
import marshal


def compile_fib(parent, stringtable):
    filename = Name(__file__)
    func_name = Name("fib")

    n = Local("n")
    fib = Global("fib")
    nnames = calculate_slots([n, fib])

    asm = Assembly(
        func_name, *nnames,
        argcount = 1,
        parent = parent)

    asm.set_lineno(1)
    asm.emit(Opcode.LOAD_FAST, n.slot)
    asm.load_const(2)
    asm.emit(Opcode.COMPARE_OP, 0)
    l = Label()
    asm.emit(Opcode.POP_JUMP_IF_FALSE, l)
    asm.load_const(1)
    asm.emit(Opcode.RETURN_VALUE)
    asm.emit(Opcode.LABEL, l)
    asm.emit(Opcode.LOAD_GLOBAL, fib.slot)
    asm.emit(Opcode.LOAD_FAST, n.slot)
    asm.load_const(2)
    asm.emit(Opcode.BINARY_SUBTRACT)
    asm.emit(Opcode.CALL_FUNCTION, 1)
    asm.emit(Opcode.LOAD_GLOBAL, fib.slot)
    asm.emit(Opcode.LOAD_FAST, n.slot)
    asm.load_const(1)
    asm.emit(Opcode.BINARY_SUBTRACT)
    asm.emit(Opcode.CALL_FUNCTION, 1)
    asm.emit(Opcode.BINARY_ADD)
    asm.emit(Opcode.RETURN_VALUE)
    # asm.load_const(None)
    # asm.emit(Opcode.RETURN_VALUE)

    for names in nnames:
        for name in names:
            name.s = stringtable.add(name.name, interned=True)
    func_name.s = stringtable.add(func_name.name, interned=True)

    return asm


def compile_mod():
    stringtable = StringTable()

    filename = Name(__file__)
    fname = Name(__name__)

    fib = Global("fib")
    nnames = calculate_slots([fib])

    asm = Assembly(fname, *nnames)
    asm.set_lineno(1)
    asm.load_const(compile_fib(asm, stringtable))
    asm.emit(Opcode.MAKE_FUNCTION, 0)
    asm.emit(Opcode.STORE_GLOBAL, fib.slot)
    asm.load_const(None)
    asm.emit(Opcode.RETURN_VALUE)

    for names in nnames:
        for name in names:
            name.s = stringtable.add(name.name, interned=True)
    fname.s = stringtable.add(fname.name, interned=True)

    stringtable.close()

    buf = StringIO.StringIO()
    asm.serialize(buf, __file__)
    return marshal.loads(buf.getvalue())


mod = compile_mod()
d = {}
exec mod in d
fib = d['fib']

if __name__=='__main__':
    from timeit import Timer
    t = Timer("print(fib(35))","from __main__ import fib")
    print(t.timeit(1))
The code was copied from http://rosettacode.org/wiki/Fibonacci_sequence
AWK and SWI-Prolog are about on par with Python
Perl is slower... Tcl is about the same as Perl
bc is slower still than Perl
Octave is at the bottom so far
Lua is a touch faster than Ruby
boo is about on par with Erlang
Mozilla's JS beats Erlang with the right flags — with the JIT on, of course:
js -b -m fib.js beats Erlang, while js -b -j fib.js
is only in Ruby's league
Of everything tried so far, Erlang is still the fastest pure interpreter
Honestly, instead of all this fuss, just add memoize...
require 'benchmark'
require "memoize"
include Memoize

def fib n
  if n < 2
    1
  else
    fib(n - 1) + fib(n - 2)
  end
end
memoize :fib

Benchmark.bm{|bm| bm.report{puts fib(35)}}
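If the memoize gem isn't installed, the same trick works with a plain Hash — a minimal dependency-free sketch (`fib_memo` and `FIB_CACHE` are names made up here):

```ruby
# Cache each result so fib(n) is computed once: O(n) instead of O(2^n).
FIB_CACHE = {}

def fib_memo(n)
  FIB_CACHE[n] ||= n < 2 ? 1 : fib_memo(n - 1) + fib_memo(n - 2)
end

puts fib_memo(35)  # => 14930352, effectively instant
```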
#35 楼 @rasefon IO has nothing to do with the language... isn't that the OS or the network? Do you have concrete code or an example?
#36 楼 @bhuztez No idea — maybe register-qualified variables in the Python VM stall the pipeline for a few idle cycles or something... you'd have to micro-benchmark individual instructions to tell...
It's no surprise for a JIT to be slower than interpretation; when Java 1.1 first added a JIT it was off by default, because enabling it made things slower...
Who knows what bytecode Erlang generates — maybe it notices the function is purely recursive and applies memoization...
{function, fib, 1, 2}.
{label,1}.
{line,[{location,"fib.erl",5}]}.
{func_info,{atom,fib},{atom,fib},1}.
{label,2}.
{test,is_integer,{f,4},[{x,0}]}.
{select_val,{x,0},{f,4},{list,[{integer,1},{f,3},{integer,0},{f,3}]}}.
{label,3}.
{move,{integer,1},{x,0}}.
return.
{label,4}.
{allocate_zero,1,1}.
{line,[{location,"fib.erl",10}]}.
{gc_bif,'-',{f,0},1,[{x,0},{integer,1}],{x,1}}.
{move,{x,0},{y,0}}.
{move,{x,1},{x,0}}.
{line,[{location,"fib.erl",10}]}.
{call,1,{f,2}}.
{line,[{location,"fib.erl",10}]}.
{gc_bif,'-',{f,0},1,[{y,0},{integer,2}],{x,1}}.
{move,{x,0},{y,0}}.
{move,{x,1},{x,0}}.
{line,[{location,"fib.erl",10}]}.
{call,1,{f,2}}.
{line,[{location,"fib.erl",10}]}.
{gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.
{deallocate,1}.
return.
Say, could you try this problem instead? http://blog.dhananjaynene.com/2008/07/performance-comparison-c-java-python-ruby-jython-jruby-groovy/ — it's an OO benchmark, so it's more relevant to object-oriented languages.
Optimizing Fibonacci computation is a classic problem — OP, try running fib 10000.
Bonus: a Matrix implementation
require 'matrix'

def fib n
  m = Matrix[[1,1],[1,0]]
  (m ** n)[0,0]
end
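Self-contained with the stdlib require, and pushed to fib 10000 as suggested above — `Matrix#**` squares repeatedly, so this takes O(log n) matrix multiplications (`fib_matrix` is a name made up here to avoid clashing with the recursive fib):

```ruby
require 'matrix'  # Matrix lives in the stdlib but must be required

def fib_matrix(n)
  m = Matrix[[1, 1], [1, 0]]
  (m ** n)[0, 0]  # with this convention fib_matrix(0) == 1, matching the thread's fib
end

puts fib_matrix(35)      # => 14930352, same as the recursive version
puts fib_matrix(10_000)  # a ~2090-digit number, still returns immediately
```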
#54 楼 @luikore Erlang just benchmarks better than Ruby 2.0... and Ruby 2.0 better than JRuby
Erlang's scores are on par with PHP now
http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?test=all&lang=erlang&lang2=yarv&data=u64q
Just remembered — Erlang can also be compiled natively. Quite a bit faster after compilation.
1> c(fib).
{ok,fib}
2> fib:test().
730082:14930352
ok
3> c(fib,[native]).
{ok,fib}
4> fib:test().
272043:14930352
ok
5>
Seeing the PyPy results puts my mind at ease (a 2014 MacBook at 2.6 GHz — it would run even faster on an i7-4790). Maybe Ruby should ship an interpreter called RubyRuby.
mic@mbp: ~/work/$ ruby2.0 foo.rb
user system total real
14930352
2.140000 0.010000 2.150000 ( 2.146835)
mic@mbp: ~/work/$ pypy bar.py
14930352
0.649578094482
mic@mbp: ~/work/$ python3.4 bar.py
14930352
3.95859599113