# 28 Big Table Join Big Table (Part 1): What Is the "Divide and Conquer" Tuning Approach?

Hi, I'm Wu Lei.

## How to Understand "Divide and Conquer"?

The "divide and conquer" tuning approach downgrades a "big table join big table" into a series of "big table join small table" problems, which can then be solved with the "big table join small table" tuning methods from the previous lecture. Its core idea is to first break one complex task into multiple simple tasks, and then merge the results of those simple tasks. So how does the "big table join big table" scenario apply this "divide and conquer" idea?
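To make the split-then-merge structure concrete, here is a toy sketch in plain Python (not Spark; the function name and data layout are made up for illustration): each date becomes an independent sub-task over a small slice of orders, and the per-date partial results are merged into the final answer.

```python
# Toy illustration of "divide and conquer" on plain Python collections:
# split the quarter's join by date, solve each one-day piece on its own,
# then merge the partial results. All names here are hypothetical.

def revenue_by_order(orders, transactions, dates):
    """Compute per-order revenue, one date at a time.

    orders: dicts with keys orderId, date, status
    transactions: dicts with keys orderId, price, quantity
    """
    merged = {}
    for date in dates:
        # Sub-task: only one day's completed orders take part in this "join",
        # so each piece stays small even if the whole quarter is huge
        day_orders = {o["orderId"] for o in orders
                      if o["date"] == date and o["status"] == "COMPLETE"}
        for tx in transactions:
            if tx["orderId"] in day_orders:
                merged[tx["orderId"]] = (merged.get(tx["orderId"], 0.0)
                                         + tx["price"] * tx["quantity"])
    return merged
```

The key property is that each loop iteration only ever touches one day's worth of orders, which is exactly what makes the per-date sub-joins cheap in the Spark version below.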

## The "Divide and Conquer" Approach in Practice

orders and transactions are both fact tables, each at the TB scale. Based on these two fact tables, the e-commerce company periodically computes the total transaction revenue of all orders in the previous quarter. The business code is shown below.

```scala
// Compute total order revenue for the quarter
val txFile: String = _
val orderFile: String = _

val transactions: DataFrame = spark.read.parquet(txFile)
val orders: DataFrame = spark.read.parquet(orderFile)

transactions.createOrReplaceTempView("transactions")
orders.createOrReplaceTempView("orders")

val query: String = """
  select sum(tx.price * tx.quantity) as revenue, o.orderId
  from transactions as tx
  inner join orders as o on tx.orderId = o.orderId
  where o.status = 'COMPLETE'
  and o.date between '2020-01-01' and '2020-03-31'
  group by o.orderId
"""

val outFile: String = _
spark.sql(query).write.parquet(outFile)
```

With the quarter split by date, each iteration below handles a single day's orders (a much smaller join) and writes its partial result to a per-date subdirectory:

```scala
// Loop over dates to carry out the "divide and conquer" computation
val dates: Seq[String] = Seq("2020-01-01", "2020-01-02", …, "2020-03-31")

for (date <- dates) {
  val query: String = s"""
    select sum(tx.price * tx.quantity) as revenue, o.orderId
    from transactions as tx
    inner join orders as o on tx.orderId = o.orderId
    where o.status = 'COMPLETE'
    and o.date = '${date}'
    group by o.orderId
  """
  val file: String = s"${outFile}/${date}"
  spark.sql(query).write.parquet(file)
}
```
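One detail the loop leaves implicit is the merge step: each iteration persists its result under its own per-date subdirectory, so "merging" amounts to reading every subdirectory back and concatenating. The sketch below simulates this in plain Python with made-up CSV files and paths; in Spark you would instead read the per-date Parquet paths back and union them.

```python
# Toy simulation of the "merge" step: each sub-task writes its partial result
# under <out_dir>/<date>/, and the final answer is assembled by reading every
# subdirectory. File format and path layout are invented for illustration.
import csv
import os
import tempfile

def write_partial(out_dir, date, rows):
    """Simulate one loop iteration persisting its per-date result."""
    d = os.path.join(out_dir, date)
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-0.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

def merge_partials(out_dir):
    """Collect every per-date partial result into one combined list."""
    merged = []
    for date in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, date, "part-0.csv"), newline="") as f:
            merged.extend((int(r[0]), float(r[1])) for r in csv.reader(f))
    return merged
```

Because every order belongs to exactly one date, the partial results never overlap, and the merge is a simple concatenation rather than a re-aggregation.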

## Summary

The first tuning approach for "big table join big table" is "divide and conquer". Focus on mastering the overall idea, as well as how to optimize its two key steps: splitting the work and merging the results.

# References

https://learn.lianglianglee.com/%e4%b8%93%e6%a0%8f/Spark%e6%80%a7%e8%83%bd%e8%b0%83%e4%bc%98%e5%ae%9e%e6%88%98/28%20%e5%a4%a7%e8%a1%a8Join%e5%a4%a7%e8%a1%a8%ef%bc%88%e4%b8%80%ef%bc%89%ef%bc%9a%e4%bb%80%e4%b9%88%e6%98%af%e2%80%9c%e5%88%86%e8%80%8c%e6%b2%bb%e4%b9%8b%e2%80%9d%e7%9a%84%e8%b0%83%e4%bc%98%e6%80%9d%e8%b7%af%ef%bc%9f.md