[Rails] find_by* efficiency - blocks?

Ara.T.Howard Ara.T.Howard at noaa.gov
Sat Feb 26 03:25:23 GMT 2005


On Fri, 25 Feb 2005, hadley wickham wrote:

> Why wouldn't you just do find_by_sql(sql).filter{|r| some_cond(r)} ?

this starts by allocating the entire result set - what if this contains
1,000,000 records and you have 128MB of ram.  what if 20 of these request come
into your webserver at once?

it's an extremely common ruby idiom to provide an iterator method when the
returned value could potentiall be a memory killer.  all the database apis
sqlite uses already provide this ability - it's simply not currently exposed.

> Or do you mean you want to retrieve one record at a time and filter them as
> you go?

no.  here is an example using sqlite:

this technique uses sqlite's ability to pass a block with which to handle each
tuple as it is returned from the db.

   jib:~ > cat a.rb
   #!/usr/bin/env ruby
   require 'sqlite'
   db = SQLite::Database::new(ARGV.shift, 0)
   db.use_array = true
   db.execute('select * from jobs'){|tuple| p tuple}

   jib:~ > time a.rb db > /dev/null

   real    0m1.510s
   user    0m1.420s
   sys     0m0.080s

this technique returns all tuples at once (the way rails does) and  then
iterates over this set.

   jib:~ > cat b.rb
   #!/usr/bin/env ruby
   require 'sqlite'
   db = SQLite::Database::new(ARGV.shift, 0)
   db.use_array = true
   tuples = db.execute('select * from jobs')
   tuples.each{|tuple| p tuple}


   jib:~ > time b.rb db > /dev/null

   real    0m2.506s
   user    0m2.370s
   sys     0m0.140s

note that the block method is nearly twice as fast and, much, much more
importantly only allocated memory for ONE tuple at a time.  if your query
returns ALL tuples at a time imagine what will happen to your webserver if you
run a query that returns 50,000 tuples and you get 10 request for this page at
once...  now, it's REQUIRED by rails that, at some point, you have a set of
tuples to render via eruby - but by no means will this alway be the exact set
of tuples returned by your sql (though for obvious reasons you should try to
keep your sql and (de)normalization to allow this).  for example, say you are
using a db with lame regex facilities and want to do some pattern matching on
your result set to further filter them: something like

   matching = []

   tuples = recipes.find_by_sql sql

   tuples.each do |tuple|
     matching << tuple if tuple.description =~ /saucy/
   end

now, further imagine that the set of tuples will be 50,000,000 big.  your code
will explode with an Errno::ENOMEM exception being thrown.  if rails exposed
the (already existing) iterator methods of all the database apis one could

   matching = []

   recipes.find_by_sql(sql) do |tuple|
     matching << tuple if tuple.description =~ /saucy/
   end

and ONLY those tuples matching 'saucy' would be allocated.  (of course this is
not strictly true since ruby is garbage allocated - it's more appropriate to
say that all but those tuples not saved would be candiate for subsequent gc
runs which, of course, get triggered if malloc ever fails)

> Wouldn't that be rather expensive in terms of queries?  Although I suppose
> you could buffer by fetching, say, 20 at a time.

again.  the whole reason to do it is that is more efficient both in space and
time.  play with the builtin 'find' module of ruby using the following two
techniques in a directory hierarchy with several million files and you'll see
what i mean.  compare

   require 'find'
   entries = Find::find('.')
   entries.each{|entry| p entry}

with

   require 'find'
   Find::find('.'){|entry| p entry}

have top running beforehand for the first one because it'll grind your
terminal to a halt pretty quickly unless you have tons of RAM.

kind regards.

-a
-- 
===============================================================================
| EMAIL   :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE   :: 303.497.6469
| When you do something, you should burn yourself completely, like a good
| bonfire, leaving no trace of yourself.  --Shunryu Suzuki
===============================================================================


More information about the Rails mailing list