这篇文章会重点分析 Rails 数据库查询中最为常用的表联结以及预加载,并用实际案例展示他们的用途。原文连接: https://lanzhiheng.com/posts/preload-and-join-in-rails
在我发表了简析 Rails 查询中的 includes 与 joins这篇文章之后,便有网友建议我写一下left_outer_joins
, preload
, eagerload
的区别。说起来惭愧,前段的心态不是很好,也就不怎么执笔写文,便一直拖到现在。而更大的问题在于这几个玩意的区别我自己也不太清楚(赶紧翻阅文档去)。
然而这几者的区别,文档写得还算比较清楚的,考虑把他们分成两大模块。一个是Joining Tables
(表联结),另一个是Eager Loading Associations
(预加载关联)。
在关系型数据库里面,表的联结应该是最为常见的操作。假设我要查找一张表的数据,而这个搜索结果要依赖于另外一张表的时候,往往就需要联结两个表进行查询。这么说可能有些抽象。我们还是用博客系统来看看。假设每篇博客文章都关联了一个分类,而我需要查询出分类名为“Notes”的博客文章,以下是系统里面包含的所有分类
> Category.pluck(:name)
(0.7ms) SELECT "categories"."name" FROM "categories"
=> ["Huiliu", "Notes", "Psychology", "Translations", "Blogs"]
joins
根据分类名来查找博客文章的 Rails 代码大概就是
> Post.where("categories.name = 'Notes'")
Post Load (2.2ms) SELECT "posts".* FROM "posts" WHERE (categories.name = 'Notes') LIMIT $1 [["LIMIT", 11]]
Traceback (most recent call last):
16: from activerecord (6.0.3.7) lib/active_record/connection_adapters/postgresql_adapter.rb:673:in `exec_no_cache'
可见这样写 SQL 语句的话会报错。因为我们只SELECT
了posts
表,而条件查询里面却依赖了categories
表的字段。这种场景需要用joins
对两表进行联结。于是乎
> Post.joins(:category).where("categories.name = 'Notes'")
Post Load (0.8ms) SELECT "posts".* FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Notes') LIMIT $1 [["LIMIT", 11]]
=> #<ActiveRecord::Relation....
从 Rails 生成的 SQL 语句可以看出,joins
方法默认生成的数据库联结语句是INNER JOIN
。如果你很闲也可以自己去写联结的细节
> Post.joins('INNER JOIN categories ON categories.id = posts.category_id').where("categories.name = 'Notes'").to_sql
=> "SELECT \"posts\".* FROM \"posts\" INNER JOIN categories ON categories.id = posts.category_id WHERE (categories.name = 'Notes')"
效果是一样的。
left_outer_join
前面例子中的INNER JOIN
会自动筛选掉那些没有绑定分类的文章。举个例子,假设我们把几篇文章的分类给取消掉
> post_ids = Post.first(10).pluck(:id)
Post Load (1.3ms) SELECT "posts".* FROM "posts" ORDER BY "posts"."id" ASC LIMIT $1 [["LIMIT", 10]]
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
> Post.where(id: post_ids).each {|p| p.update_column('category_id', nil)}
> Post.where(id: post_ids).pluck(:category_id)
(0.9ms) SELECT "posts"."category_id" FROM "posts" WHERE "posts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) [["id", 1], ["id", 2], ["id", 3], ["id", 4], ["id", 5], ["id", 6], ["id", 7], ["id", 8], ["id", 9], ["id", 10]]
=> [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]
对这些文章进行INNER JOIN
之后就相当于加了层是否有绑定分类的过滤器
> Post.where(id: post_ids).count
(1.1ms) SELECT COUNT(*) FROM "posts" WHERE "posts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) [["id", 1], ["id", 2], ["id", 3], ["id", 4], ["id", 5], ["id", 6], ["id", 7], ["id", 8], ["id", 9], ["id", 10]]
=> 10
> Post.where(id: post_ids).joins(:category).count
(0.8ms) SELECT COUNT(*) FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "posts"."id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) [["id", 1], ["id", 2], ["id", 3], ["id", 4], ["id", 5], ["id", 6], ["id", 7], ["id", 8], ["id", 9], ["id", 10]]
=> 0
也是容易出 Bug 的地方。假设我们要搜索出分类名不为“Notes”的文章,包括那些没有绑定分类的文章。如果是这样写
> Post.joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id).count
(0.7ms) SELECT "posts"."id" FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 65
> (Post.joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id) & post_ids).empty?
(0.7ms) SELECT "posts"."id" FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> true
搜索结果就不会包含最早取消了分类的几篇文章,容易导致 Bug,要解决这种问题可以考虑用left_outer_joins
。
> Post.left_outer_joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id).count
(0.7ms) SELECT "posts"."id" FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 75
> (Post.left_outer_joins(:category).where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id) & post_ids).empty?
(0.7ms) SELECT "posts"."id" FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> false
同样,如果你很闲,也可以自己编写LEFT OUTER JOIN
相关细节
> Post.joins('LEFT OUTER JOIN categories ON posts.category_id = categories.id').where("category_id IS NULL OR categories.name <> 'Notes'").pluck(:id).count
(0.8ms) SELECT "posts"."id" FROM "posts" LEFT OUTER JOIN categories ON posts.category_id = categories.id WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 75
Rails 给予你便捷的同时也保留了很大的自由度。除了系统默认的联表方法,我们还可以使用其他的表联结语句,比如RIGHT OUTER JOIN
> Category.joins('RIGHT OUTER JOIN posts ON posts.category_id = categories.id').where("category_id IS NULL OR categories.name <> 'Notes'").pluck('posts.id').count
(0.7ms) SELECT posts.id FROM "categories" RIGHT OUTER JOIN posts ON posts.category_id = categories.id WHERE (category_id IS NULL OR categories.name <> 'Notes')
=> 75
查出来的结果是一样的。
预加载主要用来解决业务过程中经常出现的 N+1 问题。假设你需要获取最新的 5 篇博客文章的分类。那么你可能会
> Post.order('created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (1.0ms) SELECT "posts".* FROM "posts" ORDER BY created_at DESC LIMIT $1 [["LIMIT", 5]]
Category Load (0.2ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Blogs
Category Load (0.2ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Blogs
Category Load (0.2ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 6], ["LIMIT", 1]]
Huiliu
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Blogs
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Blogs
这里面就有5 + 1
次的查询。首先获取目标的文章数据,接下来根据文章中的caregory_id
查询分类表 5 次,以获取文章所对应的分类名。在数据量比较小的时候,这种查询影响不大,然而当数据量十分庞大,这种卡顿将会是致命的。Rails 官方提供了三个方法来解决这个问题,它们分别是
接下来笔者会分别尝试一下他们,并在最后总结一下他们的区别
includes
includes
方法是较为常用的解决 N + 1 问题的手段。基于上面的例子,只需要像下面这样处理一下就能修复 N + 1 问题。
> Post.includes(:category).order('created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (1.0ms) SELECT "posts".* FROM "posts" ORDER BY created_at DESC LIMIT $1 [["LIMIT", 5]]
Category Load (0.6ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" IN ($1, $2) [["id", 1], ["id", 6]]
Blogs
Blogs
Huiliu
Blogs
Blogs
原来的5 + 1
条查询,优化过后只剩下两条。第一条依旧是获取文章数据。而第二条查询则是根据第一条的查询结果中的category_id
数组直接从分类表中一次性获取分类数据。数据量越大,这种优化所带来的效用便越明显。
includes
方法同时也支持预加载多个关联,甚至是嵌套的关联。具体例子可以参考 Rails 的官方文档。一般来说includes
生成的 SQL 语句比较简洁,然而如果是依赖关联表进行条件筛选的话就会有些复杂了。
> Post.includes(:category).where("categories.name = 'Blogs'").order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (2.0ms) SELECT "posts".* FROM "posts" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1 [["LIMIT", 5]]
Traceback (most recent call last):
16: from activerecord (6.0.3.7) lib/active_record/connection_adapters/postgresql_adapter.rb:673:in `exec_no_cache'
哦?看来一般情况下includes
并不能在条件查询的过程中默认联结对应的表。我们需要手动去做这个事情,官方提供了一个专用的方法references
,显式去关联那个表,而这里的关联字段是category
。所以
> Post.includes(:category).where("categories.name = 'Blogs'").references(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
SQL (2.0ms) SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1 [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs
这生成的 SQL 语句有点长,它会一次性查询出我们需要的结果,另外还可以看到它默认是采用了LEFT OUTER JOIN
。如果想要INNER JOIN
的效果,我们其实也可以这样去写
> Post.includes(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
SQL (1.2ms) SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1 [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs
同样是一次查询出所有结果,SQL 语句依旧是有些“繁琐”。可见通过includes
来解决 N + 1 查询,并且查询过程中需要涉及到联表筛选的时候则需要显式通过joins
, references
来关联对应的表(有些情况会例外,后面会说)。不过每次都要手动去关联表字段就比较烦,想省事一点点可以考虑eager_load
eager_load
同样是上面的筛选条件,使用eager_load
实现的效果就是
> Post.eager_load(:category).where("categories.name = 'Blogs'").order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
SQL (1.6ms) SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1 [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs
可见不需要显式使用references
进行表关联,它会自动完成这部分工作。从查询语句可以看出它也是一次性查出所有东西,并且默认使用了LEFT OUTER JOIN
,如果想要使用INNER JOIN
的话可以使用joins
方法来完成。比如
> Post.eager_load(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
SQL (2.1ms) SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1 [["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs
这样跟前面includes
方法搭配joins
方法所生成的 SQL 语句是等价的。
> Post.includes(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').to_sql == Post.eager_load(:category).where("categories.name = 'Blogs'").joins(:category).order('posts.created_at DESC').to_sql
=> true
preload
通过includes
或者eager_load
配合相关联表语句进行条件查询的做法十分省事。生成的 SQL 查询也算高效,只是生成的 SQL 语句让人感觉有些一言难尽。正常来说我们还是希望能够有
其实用includes
配合left_outer_joins
的方法可以做到
> Post.includes(:category).where("categories.name = 'Blogs'").left_outer_joins(:category).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (1.1ms) SELECT "posts".* FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blogs') ORDER BY posts.created_at DESC LIMIT $1 [["LIMIT", 5]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 [["id", 1]]
Blogs
Blogs
Blogs
Blogs
Blogs
是不是有点意思includes
配合references
或者joins
生成的查询语句只有一句,而且看起来是“一团乱麻”,配合left_outer_join
生成了两条查询,虽然同样LEFT OUTER JOIN
,但看起来却是如此“正常”。作为程序员,看到底层生成的查询语句,我还是更乐意使用这种模式,毕竟看起来更容易理解。只是includes
这种摇摆不定的结果着实让人难受。这个时候考虑用preload
。
preload
就完全被设定成,如果要进行联表条件查询必须自行设定关联方法joins
或者left_outer_joins
否则必定报错。(includes
使用 Hash 模式进行查询的时候也可以不手动指定连表方法,详情可见这个案例)
> Post.includes(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
SQL (1.2ms) SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2 [["name", "Blogs"], ["LIMIT", 5]]
Blogs
Blogs
Blogs
Blogs
Blogs
preload
哪怕是 Hash 模式也会报错
> Post.preload(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (0.6ms) SELECT "posts".* FROM "posts" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2 [["name", "Blogs"], ["LIMIT", 5]]
Traceback (most recent call last):...
必须手动去指定关联关系
> Post.preload(:category).joins(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (1.3ms) SELECT "posts".* FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2 [["name", "Blogs"], ["LIMIT", 5]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 [["id", 1]]
Blogs
Blogs
Blogs
Blogs
Blogs
> Post.preload(:category).left_outer_joins(:category).where({categories: { name: 'Blogs'}}).order('posts.created_at DESC').limit(5).each {|m| puts m.category.name }
Post Load (1.1ms) SELECT "posts".* FROM "posts" LEFT OUTER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE "categories"."name" = $1 ORDER BY posts.created_at DESC LIMIT $2 [["name", "Blogs"], ["LIMIT", 5]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 [["id", 1]]
Blogs
Blogs
Blogs
Blogs
Blogs
preload
要进行联表条件查询必须搭配joins
或者left_outer_joins
,否则会报错。而且生成的 SQL 语句始终是两条。可能执行效率会差一些,但是胜在看起来更加直观。有洁癖的同学推荐使用preload
。
includes
,eager_load
,preload
都是比较常用的用来解决 N + 1 查询的手段,只是他们之间有微妙的差别
includes
笔者平时用得比较多,当追加联表条件时如果条件语句where
里面是 Hash 的格式,那么它会隐式进行表联结,否则的话则需要显式(使用joins
,references
等手段)指定关联关系,算是比较综合的一种选择。eager_load
就相当于一步到位,当where
条件查询语句里面有依赖其他的表,它会自动通过LEFT OUT JOIN
的方式进行表联结。如果更倾向于JOIN
的形式,则可以与其他方式一样通过joins
来指定对应的关联关系。preload
则是另一个极端,where
条件查询语句里面如果有依赖其他表,它则直接报错。你必须显式通过joins
, left_outer_joins
去指定关联关系。总的来说如果涉及到联表的条件查询eager_load
是最省事的,它会默认帮你通过LEFT OUTER JOIN
进行联结。而preload
则是最费事的,你必须显式指定关联关系。includes
则是介于两者之间,有些场景它会自动通过LEFT OUTER JOIN
进行联表,而有些场景则必须显示指定关联关系。如果对 SQL 语句有“洁癖”的同学推荐使用preload
,虽说它要显式指定联表关系,而且生成的 SQL 语句还会多查询一次数据库,但是看起来不会那么地“反人类”,相对比较容易理解。
这篇文章的脉络其实完全承自于 Rails 官方文档对应的章节,因为笔者对这几个方法也不是很理解,就只能一个个去试了,并提供了实际的案例。也不知道对joins
,left_outer_joins
,includes
,eager_load
,preload
这几个方法说清楚了没有。希望这篇文章对一些读者而言多少有些帮助。