Rails Rails 中的表联结与预加载

lanzhiheng · 2021年12月27日 · 590 次阅读

这篇文章会重点分析 Rails 数据库查询中最为常用的表联结以及预加载,并用实际案例展示他们的用途。原文连接: https://lanzhiheng.com/posts/preload-and-join-in-rails


在我发表了简析 Rails 查询中的 includes 与 joins这篇文章之后,便有网友建议我写一下left_outer_joins, preload, eagerload的区别。说起来惭愧,前段的心态不是很好,也就不怎么执笔写文,便一直拖到现在。而更大的问题在于这几个玩意的区别我自己也不太清楚(赶紧翻阅文档去)。

然而这几者的区别,文档写得还算比较清楚的,考虑把他们分成两大模块。一个是Joining Tables(表联结),另一个是Eager Loading Associations(预加载关联)。

Joining Tables(表联结)

在关系型数据库里面,表的联结应该是最为常见的操作。假设我要查找一张表的数据,而这个搜索结果要依赖于另外一张表的时候,往往就需要联结两个表进行查询。这么说可能有些抽象。我们还是用博客系统来看看。假设每篇博客文章都关联了一个分类,而我需要查询出分类名为“Notes”的博客文章,以下是系统里面包含的所有分类

> Category.pluck(:name)
   (0.7ms)  SELECT "categories"."name" FROM "categories"
=> ["Huiliu", "Notes", "Psychology", "Translations", "Blogs"]

joins

根据分类名来查找博客文章的 Rails 代码大概就是

> Post.where("categories.name = 'Notes'")
  Post Load (2.2ms)  SELECT "posts".* FROM "posts" WHERE (categories.name = 'Notes') LIMIT $1  [["LIMIT", 11]]
Traceback (most recent call last):
       16: from activerecord (6.0.3.7) lib/active_record/connection_adapters/postgresql_adapter.rb:673:in `exec_no_cache'

可见这样写 SQL 语句的话会报错。因为我们只SELECTposts表,而条件查询里面却依赖了categories表的字段。这种场景需要用joins对两表进行联结。于是乎

> Post.joins(:category).where("categories.name = 'Notes'")
  Post Load (0.8ms)  SELECT "posts".* FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Notes') LIMIT $1  [["LIMIT", 11]]
=> #<ActiveRecord::Relation....

从 Rails 生成的 SQL 语句可以看出,joins方法默认生成的数据库联结语句是INNER JOIN。如果你很闲也可以自己去写联结的细节

> Post.joins('INNER JOIN categories ON categories.id = posts.category_id').where("categories.name = 'Notes'").to_sql

=> "SELECT \"posts\".* FROM \"posts\" INNER JOIN categories ON categories.id = posts.category_id WHERE (categories.name = 'Notes')"

效果是一样的。

left_outer_join

前面例子中的INNER JOIN会自动筛选掉那些没有绑定分类的文章。举个例子,假设我们把几篇文章的分类给取消掉

> post_ids = Post.first(10).pluck(:id)
  Post Load (1.3ms)  SELECT "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT $1  [["LIMIT", 10]]
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

> Post.where(id: post_ids).each {|p| p.update_column('category_id', nil)}

> Post.where(id: post_ids).pluck(:category_id)
   (0.9ms)  SELECT "posts"."category_id" FROM "posts" WHERE "posts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)  [["id", 1], ["id", 2], ["id", 3], ["id", 4], ["id", 5], ["id", 6], ["id", 7], ["id", 8], ["id", 9], ["id", 10]]
=> [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]

对这些文章进行INNER JOIN之后就相当于加了层是否有绑定分类的过滤器

> Post.where(id: post_ids).count
   (1.1ms)  SELECT COUNT(*) FROM "posts" WHERE "posts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)  [["id", 1], ["id", 2], ["id", 3], ["id", 4], ["id", 5], ["id", 6], ["id", 7], ["id", 8], ["id", 9], ["id", 10]]
=> 10

> Post.where(id: post_ids).joins(:category).count
   (0.8ms)  SELECT COUNT(*) FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "posts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)  [["id", 1], ["id", 2], ["id", 3], ["id", 4], ["id", 5], ["id", 6], ["id", 7], ["id", 8], ["id", 9], ["id", 10]]
=> 0

也是容易出 Bug 的地方。假设我们要搜索出分类名不为“Notes”的文章,包括那些没有绑定分类的文章。如果是这样写

> Post.joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id).count
   (0.7ms)  SELECT "posts"."id" FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 65

> (Post.joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id) & post_ids).empty?
   (0.7ms)  SELECT "posts"."id" FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> true

搜索结果就不会包含最早取消了分类的几篇文章,容易导致 Bug,要解决这种问题可以考虑用left_outer_joins

> Post.left_outer_joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id).count
   (0.7ms)  SELECT "posts"."id" FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 75

> (Post.left_outer_joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id)  & post_ids).empty?
   (0.7ms)  SELECT "posts"."id" FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> false

同样,如果你很闲,也可以自己编写LEFT OUTER JOIN相关细节

> Post.joins('LEFT OUTER JOIN categories ON posts.category_id = categories.id').where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id).count
   (0.8ms)  SELECT "posts"."id" FROM "posts" LEFT OUTER JOIN categories ON posts.category_id = categories.id WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 75

Rails 给予你便捷的同时也保留了很大的自由度。除了系统默认的联表方法,我们还可以使用其他的表联结语句,比如RIGHT OUTER JOIN

> Category.joins('RIGHT OUTER JOIN posts ON posts.category_id = categories.id').where("category_id IS NULL OR categories.name <> 'Notes'").pluck('posts.id').count
   (0.7ms)  SELECT posts.id FROM "categories" RIGHT OUTER JOIN posts ON posts.category_id = categories.id WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 75

查出来的结果是一样的。

Eager Loading Associations(预加载关联)

预加载主要用来解决业务过程中经常出现的 N+1 问题。假设你需要获取最新的 5 篇博客文章的分类。那么你可能会

> Post.order('created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (1.0ms)  SELECT "posts".* FROM "posts" ORDER BY created_at DESC LIMIT $1  [["LIMIT", 5]]
  Category Load (0.2ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
Blogs
  Category Load (0.2ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
Blogs
  Category Load (0.2ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 6], ["LIMIT", 1]]
Huiliu
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
Blogs
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
Blogs

这里面就有5 + 1次的查询。首先获取目标的文章数据,接下来根据文章中的caregory_id查询分类表 5 次,以获取文章所对应的分类名。在数据量比较小的时候,这种查询影响不大,然而当数据量十分庞大,这种卡顿将会是致命的。Rails 官方提供了三个方法来解决这个问题,它们分别是

  • includes
  • preload
  • eager_load

接下来笔者会分别尝试一下他们,并在最后总结一下他们的区别

includes

includes方法是较为常用的解决 N + 1 问题的手段。基于上面的例子,只需要像下面这样处理一下就能修复 N + 1 问题。

> Post.includes(:category).order('created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (1.0ms)  SELECT "posts".* FROM "posts" ORDER BY created_at DESC LIMIT $1  [["LIMIT", 5]]
  Category Load (0.6ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" IN ($1, $2)  [["id", 1], ["id", 6]]
Blogs
Blogs
Huiliu
Blogs
Blogs

原来的5 + 1条查询,优化过后只剩下两条。第一条依旧是获取文章数据。而第二条查询则是根据第一条的查询结果中的category_id数组直接从分类表中一次性获取分类数据。数据量越大,这种优化所带来的效用便越明显。

includes方法同时也支持预加载多个关联,甚至是嵌套的关联。具体例子可以参考 Rails 的官方文档。一般来说includes生成的 SQL 语句比较简洁,然而如果是依赖关联表进行条件筛选的话就会有些复杂了。

> Post.includes(:category).where("categories.name = 'Blogs'").order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (2.0ms)  SELECT "posts".* FROM "posts" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1  [["LIMIT", 5]]
Traceback (most recent call last):
       16: from activerecord (6.0.3.7) lib/active_record/connection_adapters/postgresql_adapter.rb:673:in `exec_no_cache'

哦?看来一般情况下includes并不能在条件查询的过程中默认联结对应的表。我们需要手动去做这个事情,官方提供了一个专用的方法references,显式去关联那个表,而这里的关联字段是category。所以

>  Post.includes(:category).where("categories.name = 'Blogs'").references(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  SQL (2.0ms)  SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1  [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs

这生成的 SQL 语句有点长,它会一次性查询出我们需要的结果,另外还可以看到它默认是采用了LEFT OUTER JOIN。如果想要INNER JOIN的效果,我们其实也可以这样去写

>  Post.includes(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  SQL (1.2ms)  SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1  [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs

同样是一次查询出所有结果,SQL 语句依旧是有些“繁琐”。可见通过includes来解决 N + 1 查询,并且查询过程中需要涉及到联表筛选的时候则需要显式通过joins, references来关联对应的表(有些情况会例外,后面会说)。不过每次都要手动去关联表字段就比较烦,想省事一点点可以考虑eager_load

eager_load

同样是上面的筛选条件,使用eager_load实现的效果就是

> Post.eager_load(:category).where("categories.name = 'Blogs'").order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  SQL (1.6ms)  SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1  [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs

可见不需要显式使用references进行表关联,它会自动完成这部分工作。从查询语句可以看出它也是一次性查出所有东西,并且默认使用了LEFT OUTER JOIN,如果想要使用INNER JOIN的话可以使用joins方法来完成。比如

> Post.eager_load(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  SQL (2.1ms)  SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1  [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs

这样跟前面includes方法搭配joins方法所生成的 SQL 语句是等价的。

> Post.includes(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').to_sql  == Post.eager_load(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').to_sql
=> true

preload

通过includes或者eager_load配合相关联表语句进行条件查询的做法十分省事。生成的 SQL 查询也算高效,只是生成的 SQL 语句让人感觉有些一言难尽。正常来说我们还是希望能够有

  1. 先查出符合要求的文章。
  2. 通过文章对应的分类 id 再查出对应的分类。

其实用includes配合left_outer_joins的方法可以做到

> Post.includes(:category).where("categories.name = 'Blogs'").left_outer_joins(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (1.1ms)  SELECT "posts".* FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1  [["LIMIT", 5]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1  [["id", 1]]
Blogs
Blogs
Blogs
Blogs
Blogs

是不是有点意思includes配合references或者joins生成的查询语句只有一句,而且看起来是“一团乱麻”,配合left_outer_join生成了两条查询,虽然同样LEFT OUTER JOIN,但看起来却是如此“正常”。作为程序员,看到底层生成的查询语句,我还是更乐意使用这种模式,毕竟看起来更容易理解。只是includes这种摇摆不定的结果着实让人难受。这个时候考虑用preload

preload就完全被设定成,如果要进行联表条件查询必须自行设定关联方法joins或者left_outer_joins否则必定报错。(includes使用 Hash 模式进行查询的时候也可以不手动指定连表方法,详情可见这个案例

> Post.includes(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  SQL (1.2ms)  SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2  [["name", "Blogs"], ["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs

preload哪怕是 Hash 模式也会报错

> Post.preload(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (0.6ms)  SELECT "posts".* FROM "posts" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2  [["name", "Blogs"], ["LIMIT", 5]]
Traceback (most recent call last):...

必须手动去指定关联关系

> Post.preload(:category).joins(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (1.3ms)  SELECT "posts".* FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2  [["name", "Blogs"], ["LIMIT", 5]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1  [["id", 1]]
Blogs
Blogs
Blogs
Blogs
Blogs
> Post.preload(:category).left_outer_joins(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
  Post Load (1.1ms)  SELECT "posts".* FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2  [["name", "Blogs"], ["LIMIT", 5]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1  [["id", 1]]
Blogs
Blogs
Blogs
Blogs
Blogs

preload要进行联表条件查询必须搭配joins或者left_outer_joins,否则会报错。而且生成的 SQL 语句始终是两条。可能执行效率会差一些,但是胜在看起来更加直观。有洁癖的同学推荐使用preload

includeseager_loadpreload都是比较常用的用来解决 N + 1 查询的手段,只是他们之间有微妙的差别

  1. includes笔者平时用得比较多,当追加联表条件时如果条件语句where里面是 Hash 的格式,那么它会隐式进行表联结,否则的话则需要显式(使用joinsreferences等手段)指定关联关系,算是比较综合的一种选择。
  2. eager_load就相当于一步到位,当where条件查询语句里面有依赖其他的表,它会自动通过LEFT OUT JOIN的方式进行表联结。如果更倾向于JOIN的形式,则可以与其他方式一样通过joins来指定对应的关联关系。
  3. preload则是另一个极端,where条件查询语句里面如果有依赖其他表,它则直接报错。你必须显式通过joins, left_outer_joins去指定关联关系。

总的来说如果涉及到联表的条件查询eager_load是最省事的,它会默认帮你通过LEFT OUTER JOIN进行联结。而preload则是最费事的,你必须显式指定关联关系。includes则是介于两者之间,有些场景它会自动通过LEFT OUTER JOIN进行联表,而有些场景则必须显示指定关联关系。如果对 SQL 语句有“洁癖”的同学推荐使用preload,虽说它要显式指定联表关系,而且生成的 SQL 语句还会多查询一次数据库,但是看起来不会那么地“反人类”,相对比较容易理解。

总结

这篇文章的脉络其实完全承自于 Rails 官方文档对应的章节,因为笔者对这几个方法也不是很理解,就只能一个个去试了,并提供了实际的案例。也不知道对joinsleft_outer_joinsincludeseager_loadpreload这几个方法说清楚了没有。希望这篇文章对一些读者而言多少有些帮助。

暂无回复。
需要 登录 后方可回复, 如果你还没有账号请 注册新账号