简单总结一下在 Rails 的查询里面 includes 以及 joins 的用法。原文: https://lanzhiheng.com/posts/joins-and-includes-in-rails-query
记得去年面试广州的 Ruby 公司的时候,面试官问了我一个问题,在 ActiveRecord 数据库查询里面joins
与includes
有什么区别,当时的我哑口无言。估计面试官会琢磨,这货,怎么编译器都开发出来了,这么简单的问题都不懂。肯定是抄别人的编译器。实则真不是,只是 Rails 相关的工作做得少,做过的都是页面层,数据库层面的东西更是接触得少,连这种基础问题也回答不出来,实在尴尬。
还是用博客系统的表结构来说明,假设博客系统的文章(Post)都分属于某一个分类(Category),那么如果我们想获取某个分类下的所有文章,那么其实可以这样去做
> Category.first
Category Load (0.4ms) SELECT "categories".* FROM "categories" ORDER BY "categories"."id" ASC LIMIT $1 [["LIMIT", 1]]
=> #<Category id: 1, key: "blogs", name: "Blog", created_at: "2020-02-26 04:36:42", updated_at: "2020-02-26 04:36:42">
Post.where("category_id = 1")
Post Load (1.9ms) SELECT "posts".* FROM "posts" WHERE (category_id = 1) LIMIT $1 [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<Post id: 42, title: "事务与锁在Rails中的体现", body: "在用Rails写真实业务逻辑 ...
我们也可以通过分类名来进行查询
Post.where("categories.name = 'Blog'")
Post Load (1.7ms) SELECT "posts".* FROM "posts" WHERE (categories.name = 'Blog') LIMIT $1 [["LIMIT", 11]]
Traceback (most recent call last): ...
哦,报错了。因为这条语句并没有categories
表的信息。posts
表跟categories
是相关联的,于是要做这种联表的查询需要使用joins
语句
Post.joins(:category).where("categories.name = 'Blog'")
Post Load (1.9ms) SELECT "posts".* FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blog') LIMIT $1 [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<Post id: 42, title: "事务与锁在Rails中的体现", body: "在用Rai...
关键在于
... INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" ...
两个表连结的纽带是categories
表的id
列与posts
表的category_id
列。
只要编写 Rails 代码一段时间,难免遇到 N+1 查询问题。一般来说 N+1 查询问题是这样的
irb(main):036:0> Post.limit(10).each {|p| p.category.name }
Post Load (1.5ms) SELECT "posts".* FROM "posts" LIMIT $1 [["LIMIT", 10]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 3], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
Category Load (0.1ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2 [["id", 2], ["LIMIT", 1]]
=> [#<Post id: 5, title: "【译】这福利是给谁的?", body: "这福利是给谁的?翻译....
我们首先查询出文章数据,然后通过each
语句去访问每篇文章对应的分类数据,然而分类数据是在另一个表里面,因此需要查询分类表。于是乎系统在每次我们使用分类信息的时候都用category_id
查询文章对应的分类信息。假设文章数据有 10 条。第一条查询SELECT "posts" xxxx
就是查询文章列表的。后面还有 10 条查询是单独查询分类信息的。可怕的是分类的查询语句都是类似的....这就是所谓的 N+1 查询,其实个人感觉称之为 1+N 查询会更好一些,毕竟我们还是先查询那个 1,然后再进行剩余的 N 次查询。
如何规避这种问题呢?既然后面的 N 条语句查询结构都是一样的,那么我们没有必要查询 N 次,查询一次并且复用即可。我们可以使用 Rails 的includes
语句
irb(main):038:0> Post.includes(:category).limit(10).each {|p| p.category.name }
Post Load (1.1ms) SELECT "posts".* FROM "posts" LIMIT $1 [["LIMIT", 10]]
Category Load (0.3ms) SELECT "categories".* FROM "categories" WHERE "categories"."id" IN ($1, $2, $3) [["id", 2], ["id", 1], ["id", 3]]
=> [#<Post id: 5, title: "【译】这福利是给谁的?", body: "这福利是给谁的?翻译....
可见原来的 11 条语句现在简化成 2 条。有时候还真别小看这种优化,对于一个列表而言,当这种关联数据很多的时候,处理得好,往往能给列表性能带来质的飞跃,而这种飞跃所需要投入的努力并不算大。特别是在 Rails 里面恰当使用includes
语句就能解决很多问题。
个人感觉includes
跟joins
并不是对立的东西,而是可以相互协助的工具。比方说写出这种查询:
irb(main):041:0> Post.includes(:category).joins(:category).where("categories.name = 'Blog'").each {|p| p.category.name }
SQL (1.9ms) SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blog') LIMIT $1 [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<Post id: 42, title: "事务 ....
虽然说构造出来的 SQL 语句会复杂些,不过这些都是 Rails 系统的事情,咱们不用理会太多。我们只需要把重心放在如何把代码写得清晰可维护,容易理解即可。另外还要注意在联表查询的时候最好养成加上表前缀的习惯,比方说:
Post.includes(:category).joins(:category).where("categories.id = 1").each {|p| p.category.name }
而不是
Post.includes(:category).joins(:category).where("id = 1").each {|p| p.category.name }
ActiveRecord::StatementInvalid (PG::AmbiguousColumn: ERROR: column reference "id" is ambiguous)
LINE 1: ... ON "categories"."id" = "posts"."category_id" WHERE (id = 1)
两个表都有名为id
列,这个基于id
的查询,数据库系统无法定位是哪一个表。要是联表查询的时候关联的表比较少,那这个问题还好说,比较容易发现。然而随着业务越来越复杂,数据表肯定会越来越多,联接多个表又加上各种复杂的查询封装,很容易就出现PG::AmbiguousColumn
这种异常,到时候不得不在多个表之间一一排查问题工作量也不少。还不如一开始就习惯性加上前缀。要是担心自己的表名会变(其实一般也不会),这样写弹性不足的话则可以考虑,这样
Post.includes(:category).joins(:category).where("#{Category.table_name}.id = 1").each {|p| p.category.name }
效果是一样的,只不过这样写代码稍微健壮一丢丢。
简单总结一下在 Rails 的查询里面includes
以及joins
的用法。,也提到一些使用的注意事项。也不知道这么粗浅的回答当年那面试官是否能够接受。