Some thoughts on Ruby's speed
Posted by matijs 02/03/2013 at 16h42
Yesterday, I read Alex Gaynor’s slides on dynamic language speed. It’s an interesting argument, but I’m not totally convinced.
At a high level, the argument seems to be as follows:
- For a comparable algorithm, Ruby et al. do much more work behind the scenes than ‘fast’ languages such as C.
- In particular, they do a lot of memory allocation.
- Therefore, we should add tools to those languages that allow us to do memory allocation more efficiently.
Allocation
As an example, the creation of an array of squared integers is presented, where the runtime needs to dynamically allocate ever larger arrays to hold its result.
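To make that growth cost concrete, here is a small pure-Ruby simulation (my own sketch; the name and the doubling strategy are assumptions for illustration, not MRI's actual allocator) that counts how many element copies an array performs when it grows by capacity doubling:

```ruby
# Hypothetical sketch: count the element copies incurred when an
# array grows by capacity doubling, as many runtimes do.
def growth_copies(n)
  capacity = 1
  copies = 0
  n.times do |size|
    if size == capacity
      copies += size    # reallocation copies every existing element
      capacity *= 2
    end
  end
  copies
end

growth_copies(1000) # => 1023: roughly one extra copy per element
```

Preallocating the full capacity up front makes those copies disappear, which is the saving the slides are after.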
I can see the argument there, but I’m not convinced the solution is correct, for the example given. In particular, how much time is spent allocating and re-allocating the array of object references, and how much is spent elsewhere?
<typo:code lang="ruby">
require 'benchmark'

def doubles_unallocated n
  buf = []
  i = 1; j = 0
  while j < n do
    buf[i - 1] = i * i
    i += 1; j += 1
  end
  buf
end

def doubles_preallocated n
  buf = Array.new n
  i = 1; j = 0
  while j < n do
    buf[j] = i * i
    i += 1; j += 1
  end
  buf
end

n = 10000

Benchmark.bmbm do |x|
  x.report("unallocated") { n.times { doubles_unallocated 1000 } }
  x.report("preallocated") { n.times { doubles_preallocated 1000 } }
end
</typo:code>
A typical result on Ruby 1.9.3 would be (I’m not showing the rehearsal section here):
                   user     system      total        real
unallocated    1.480000   0.020000   1.500000 (  1.493959)
preallocated   1.370000   0.030000   1.400000 (  1.404668)
Taking several runs into account, I conclude that pre-allocating the array shaves off 7 to 13 percent of the runtime.
Algorithms
The algorithms used above are obviously not idiomatic Ruby, but they have the advantage of allowing a comparison of pre-allocated and unallocated arrays. Normally, I would probably use map with a range. Similarly, the Python code presented in the slides looks odd to me. I would have expected something using list comprehensions.
Here is the idiomatic Ruby implementation:
<typo:code lang="ruby">
def doubles_map n
  (1..n).map {|i| i * i }
end
</typo:code>
And here is the Ruby equivalent of the Python example given (using each versus for gives no difference in speed):
<typo:code lang="ruby">
def doubles_each n
  buf = []
  (1..n).each do |i|
    buf << i * i
  end
  buf
end
</typo:code>
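For reference, the for variant would look like this (doubles_for is my name for it, not from the slides; Ruby’s for loop is itself defined in terms of each, which is why the timings match):

```ruby
# Same computation written with for; Ruby evaluates this by
# calling each on the range, just like the block version.
def doubles_for(n)
  buf = []
  for i in 1..n
    buf << i * i
  end
  buf
end

doubles_for(4) # => [1, 4, 9, 16]
```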
For Ruby, doubles_map is the idiomatic implementation. It also has the advantage that a dedicated implementation of Range#map would be able to allocate the result array in one go, transparently avoiding the array growing overhead. Using list comprehensions in Python would probably have the same advantage.
Unfortunately, doubles_map is also a lot slower than doubles_each, except on MRI 1.8. Here is the benchmark I ran, with a fixed-array version thrown in for good measure:
<typo:code lang="ruby">
ARR = (1..1000).to_a

def doubles_fixed_array
  ARR.map {|i| i * i }
end

n = 10000

Benchmark.bmbm do |x|
  x.report("using map") { n.times { doubles_map 1000 } }
  x.report("using each") { n.times { doubles_each 1000 } }
  x.report("using a fixed array") { n.times { doubles_fixed_array } }
end
</typo:code>
Here is the result for MRI 1.9.3:
                          user     system      total        real
using map             1.220000   0.000000   1.220000 (  1.217607)
using each            1.070000   0.020000   1.090000 (  1.091302)
using a fixed array   0.940000   0.020000   0.960000 (  0.969685)
And for MRI 1.8.7:
                          user     system      total        real
using map             2.790000   0.000000   2.790000 (  2.795956)
using each            3.630000   0.010000   3.640000 (  3.644272)
using a fixed array   2.120000   0.010000   2.130000 (  2.133636)
A better Range#map
These benchmark results made me wonder how Range#map is implemented. Since it is written in Ruby, I checked the Rubinius implementation, and there at least the problem seems to be due to the lack of a dedicated Range#map implementation: the generic Enumerable#map is used, effectively turning it into doubles_each. I’m not entirely sure why it is actually slower; perhaps that’s due to the double yield involved and/or argument splatting.
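To see where that extra work comes from, here is a simplified generic map built on top of each, in the spirit of Enumerable#map (NaiveMap and naive_map are my own illustrative names; the real Rubinius version also deals with argument splatting):

```ruby
module NaiveMap
  # Generic map: each yields every element to our block, which then
  # yields it on to the caller's block -- two yields per element.
  def naive_map
    result = []    # grows dynamically; the final size is not known up front
    each { |item| result << yield(item) }
    result
  end
end

class Range
  include NaiveMap
end

(1..4).naive_map { |i| i * i } # => [1, 4, 9, 16]
```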
Here are the results of the previous benchmark, run on Rubinius 2.0.0-rc1:
                          user     system      total        real
using map             1.312082   0.004000   1.316082 (  1.316717)
using each            0.996062   0.000000   0.996062 (  0.997418)
using a fixed array   0.720045   0.000000   0.720045 (  0.724615)
I made the following simple dedicated implementation for Range#map. It falls back to the default Enumerable#map in most cases, handling only ranges of Fixnums specially:
<typo:code lang="ruby">
class Range
  def map
    if block_given? && Fixnum === @begin
      first, last = @begin, @end
      last -= 1 if @excl
      size = last - first + 1

      out = Array.new size
      out_tuple = out.tuple

      i = first
      j = 0
      while i <= last
        out_tuple[j] = yield i
        i += 1
        j += 1
      end

      out
    else
      super
    end
  end
end
</typo:code>
This makes doubles_map run slightly faster than the fixed array version:
                          user     system      total        real
using map             0.648040   0.000000   0.648040 (  0.650103)
using each            0.928058   0.000000   0.928058 (  0.930109)
using a fixed array   0.680042   0.004001   0.684043 (  0.686719)
Benchmarking Still Needed
While examining these options, I found that there are many ways to write this code, each subtly different, each with a different performance on different Rubies (the examples using while are the slowest on MRI 1.9.3, but the fastest on Rubinius). So, if you really want your code to be fast, you still have to benchmark, you still have to know which Ruby implementation your code will run on, and you still have to try all the options.