Access denied, Please sign in and make sure you have proper permission.

数据库 Use MySQL stream for large datasets

hooopo for Shopper+ · October 26, 2015 · Last by hooopo replied at December 12, 2015 · 6009 hits

Topic has been selected as the excellent topic by the admin.

find_each

在 ETL 过程中，经常需要处较大的表，较复杂的查询，通常会涉及到 JOIN 几张表和 SUM/COUNT/AVG 等聚合计算。

之前介绍过MySQL 怎样插的最快，但在 ETL 实践过程中，我发现其实大数据集读取和转换才是最耗时的。

我们知道，当数据量稍微大一点的时候，在 Rails 里使用简单的User.all这样的查询进程都会直接死掉，原因主要是结果集量太大，导致memory float。Rails 里的一个流行解决方案是使用find_each，原理是把一个查询拆成多条查询，每个只返回固定条数的记录。由于实现机制，find_each 对带有 order 和 limit 的查询及其不友好。最大的问题其实是拆成多条查询之后性能其实降低了很多，尤其是需要 JOIN 很多表的情况。

Person.find_each(start: 2000, batch_size: 2000) do |person|
  person.party_all_night!
end

mysql2 adapter streaming

Mysql2 Adapter 有一个 stream 选项：

Mysql2::Client can optionally only fetch rows from the server on demand by setting :stream => true. This is handy when handling very large result sets which might not fit in memory on the client.

使用也很简单：

require 'mysql2'

client = Mysql2::Client.new(:host => "localhost",
                            :username => "root",
                            :password => "xxx",
                            :database => "crm_dev"
                            )
result = client.query('SELECT id, email FROM shopperplus_customers', :stream => true)
result.each do |row|
  p row
end

mysql client --quick option

另一种方法是使用 mysql 的命令行客户端，并且带 --quick 参数。

这个思路来自 activewarehouse-etl 的 MySQLStreamer:

The MySQL streamer is a helper with works with the database_source in order to allow you to use the --quick option (which stops MySQL) from building a full result set, also we don't build a full resultset in Ruby - instead we yield a row at a time

我把它简化一下，是这样子的：

require 'open3'

mysql_command = %Q{mysql --quick -h localhost -u root -e "SELECT id, email FROM shopperplus_customers" -D crm_dev --password="xxx" -B}

Open3.popen3(mysql_command) do |stdin, out, err, external|
  while line = out.gets do
    columns = line.strip.split("\t")
    keys ||= columns
    result = keys.zip(columns).to_h

    p result
  end
end

14 likes

ibugs #0 October 27, 2015

测试了一下，总共 5w 条数据。效果还是很明显的


def test1
  client = Mysql2::Client.new(:host => "localhost",
                            :username => "root",
                            :password => "",
                            :database => "happy_for_ni"
                            )
  result = client.query('SELECT * FROM videos', :stream => true)
  result.each do |row|
    p row
  end
end

def test2
  Videos.find_each do |row|
    p row
  end
end

Benchmark.bm do |x|

  x.report {  test2 }
  x.report {  test1 }

end


[#<Benchmark::Tms:0x0000010cdc9788 @cstime=0.0, @cutime=0.0, @label="", @real=18.736821, @stime=0.8600000000000003, @total=11.649999999999999, @utime=10.79>,
#<Benchmark::Tms:0x000001084c0d70 @cstime=0.0, @cutime=0.0, @label="", @real=14.197525, @stime=0.79, @total=6.850000000000002, @utime=6.060000000000002>]

hooopo #1 October 28, 2015

#1 楼 @ibugs 简单的单表查可能没什么优势，join 一些表或数据量再大些会更明显。

xieyunzi #2 October 28, 2015

stream 是什么意思，是直接从硬盘读取数据吗？ postgresql 有没有类似的概念？

luikore #3 October 28, 2015

#3 楼 @xieyunzi pg 的 sql 能力比较强，这种情况直接用 pl/v8, pl/ruby 或者 pl/mruby 做就可以，不用把数据 stream 到另一个地方又 stream 回去那么麻烦...

rubyu2 #4 October 28, 2015

#3 楼 @xieyunzi http://dev.mysql.com/doc/refman/5.1/en/mysql-use-result.html

After invoking mysql_query() or mysql_real_query(), you must call mysql_store_result() or mysql_use_result() for every statement that successfully produces a result set (SELECT, SHOW, DESCRIBE, EXPLAIN, CHECK TABLE, and so forth). You must also call mysql_free_result() after you are done with the result set.

MySql 在的mysql_query 或mysql_real_query时候，会调用的mysql_store_result，会将结果保存到内存中。

mysql_use_result() initiates a result set retrieval but does not actually read the result set into the client like mysql_store_result() does. Instead, each row must be retrieved individually by making calls to mysql_fetch_row(). This reads the result of a query directly from the server without storing it in a temporary table or local buffer, which is somewhat faster and uses much less memory than mysql_store_result().

而mysql_use_result跳过了这步mysql_store_result，直接是mysql_fetch_row，所以会节省很多内存，也不会出现内存不足的情况。

另外一篇文章： http://marksverbiage.blogspot.com/2010/04/streaming-data-from-mysql.html

Unfortunately, this does not do streaming in most cases. Normally most client libraries (which call mysql_store_result) will read the entire result into memory, and you're just going through an already-in-memory data set. This will fail if it doesn't fit in memory. Enter mysql_use_result - if your library can use this instead of mysql_store_result, you can then skip through the records without needing to keep them all in ram at once.

hooopo #5 October 28, 2015

#4 楼 @luikore plruby 没有安装成功过，求个能用的教程

harris_rubychina #6 October 28, 2015

以前用 jasperreport + mysql 做报表的时候 mysql 的大数据量查询性能问题。深有体会。当时更多的是把逻辑放在程序解决 . 减少 join 和聚合计算。

xieyunzi #7 October 28, 2015

了解了

nxbtch #8 November 03, 2015

结果

测试代码

require 'mysql2'
require 'active_record'
require 'benchmark'


ActiveRecord::Base.establish_connection(
  adapter:  'mysql2',
  host:     'localhost',
  username: 'root',
  database: 'test'
)

class Person < ActiveRecord::Base 
  self.table_name =  'persons'
end

if Person.count < 10000
  100000.times do |n|
    Person.create
  end
end

def test1
  client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test')
  result = client.query('select * from persons', stream: true)
  result.each do |row|
    a = row
  end
end

def test2
  Person.find_each do |row|
    a = row.id
  end
end

Benchmark.bm do |x|

  x.report { test1 }
  x.report { test2 }
end

luikore #9 November 03, 2015

#6 楼 @hooopo 最早的作者去世了...

csg #10 December 10, 2015

关于 mysql2 的 Streaming，实际上就是 mysql_use_result() 和 mysql_store_result() 的差别是一种牺牲服务器成全客户端的做法，有些时候会对服务器以及其他客户端造成影响，需要谨慎使用。至于目的，主要是为了减轻客户端的内存使用量，而不是加快速度。那么童鞋们就要问了，我们测试了呀，速度明显快了很多，怎么还说不是为了加快速度呢，你这个骗纸！这部分时间实际是差在了是否生成 AR 的 obj 上，注意这些测试，查询返回的类型是不同的撒 w|;ﾟﾛﾟ|w 感兴趣的童鞋，不妨试一下 stream: false ～～ヾ (@´▽`@) ﾉ～～～～

hooopo #11 December 12, 2015

#11 楼 @csg 任何 stream 都是这个意思吧... 牺牲 server 端成全客户端。另外一般这种应用都是跑从库上的...

You need to Sign in before reply, if you don't have an account, please Sign up first.

14 likes

Total 12 replies

Reward

New Reply comming, click to load.