1. 数据聚合的核心场景

在商场做促销活动时(假设我们有个sales表),当老板问:

  • 每个商品分类的总销售额是多少?
  • 哪些商品单日销售额超过5000?
  • 客单价超过300的用户有什么特征?

这些问题都涉及到GROUP BY和HAVING的组合使用。先看个典型示例:

-- SQLite 3.38语法
SELECT 
    category_id,
    SUM(amount) AS total_sales,
    AVG(unit_price) AS avg_price
FROM sales
WHERE sale_date >= '2023-06-01'  -- 条件过滤
GROUP BY category_id
HAVING total_sales > 100000      -- 聚合后过滤
ORDER BY total_sales DESC;

2. 过滤操作的黄金分割点

2.1 WHERE与HAVING的差异

某电商平台的支付系统(payment表)需要统计支付成功的订单:

-- 错误做法(将成功状态过滤放在HAVING)
SELECT 
    user_id,
    COUNT(*) AS order_count 
FROM payment 
GROUP BY user_id
HAVING status = 'success';   -- HAVING无法过滤原始记录

-- 正确做法(用WHERE先过滤)
SELECT 
    user_id,
    COUNT(*) AS successful_orders
FROM payment
WHERE status = 'success'    -- 提前过滤无效数据
GROUP BY user_id;

2.2 条件前置的奇效

医院挂号系统(appointments表)优化案例:

-- 未优化的写法(超过200条记录才统计)
SELECT 
    department,
    COUNT(*) AS total 
FROM appointments
GROUP BY department
HAVING total > 200;      -- 聚合后过滤所有分组

-- 优化后的写法(查询条件前置)
SELECT 
    department,
    COUNT(*) AS total 
FROM (
    SELECT * 
    FROM appointments
    WHERE date >= '2023-07-01'   -- 先按日期过滤
)
GROUP BY department
HAVING total > 200;              -- 处理后的数据量减少80%

3. 索引的正确打开方式

3.1 单列索引的适配场景

物流系统的包裹跟踪表(packages表)优化:

-- 现有查询(速度慢)
SELECT 
    province,
    COUNT(*) AS package_count
FROM packages
WHERE weight > 5.0        -- 高频过滤条件
GROUP BY province;

-- 创建组合索引
CREATE INDEX idx_province_weight 
ON packages(province, weight);

-- 优化后执行计划
EXPLAIN QUERY PLAN
SELECT province, COUNT(*)
FROM packages
WHERE weight > 5.0
GROUP BY province;
/* 输出结果应显示使用索引扫描 */

3.2 覆盖索引的妙用

图书馆管理系统(books表)案例:

-- 需要作者和出版社的统计
SELECT 
    author,
    publisher,
    COUNT(*) AS total
FROM books
GROUP BY author, publisher;

-- 创建覆盖索引(包含所有查询字段)
CREATE INDEX idx_author_pub 
ON books(author, publisher, id);  -- 包含主键避免二次查找

-- 验证是否走索引
EXPLAIN 
SELECT author, publisher, COUNT(*) 
FROM books 
GROUP BY author, publisher;

4. 临时表的艺术

4.1 中间结果复用的智慧

股票交易系统(trades表)复杂查询优化:

-- 原始嵌套查询
SELECT 
    stock_code,
    AVG(daily_volume) AS avg_vol
FROM (
    SELECT 
        stock_code,
        SUM(quantity) AS daily_volume
    FROM trades
    WHERE trade_date BETWEEN '2023-01-01' AND '2023-06-30'
    GROUP BY stock_code, trade_date
) 
GROUP BY stock_code;

-- 优化为分步执行
-- 步骤1:创建临时表
CREATE TEMP TABLE temp_daily_vol AS
SELECT 
    stock_code,
    trade_date,
    SUM(quantity) AS daily_volume
FROM trades
WHERE trade_date BETWEEN '2023-01-01' AND '2023-06-30'
GROUP BY stock_code, trade_date;

-- 步骤2:二次聚合
SELECT 
    stock_code,
    AVG(daily_volume) AS avg_vol
FROM temp_daily_vol
GROUP BY stock_code;

4.2 内存利用的平衡术

当处理千万级数据时,需要调整临时存储设置:

-- 查看当前内存配置
PRAGMA temp_store;       -- 默认值为0(自动选择)

-- 设置为内存模式(查询完成后自动释放)
PRAGMA temp_store = 2;   -- 2代表强制使用内存

-- 重要提示:在事务中谨慎使用内存模式
BEGIN TRANSACTION;
-- 执行需要大量中间存储的操作
COMMIT;

5. 优化陷阱全知道

5.1 HAVING的滥用风险

订单系统中的经典错误案例:

-- 低效写法(混合使用WHERE和HAVING)
SELECT 
    product_id,
    SUM(quantity) AS total_qty
FROM order_details
WHERE unit_price > 50         -- 有效的行过滤
GROUP BY product_id
HAVING product_id LIKE 'A%'; -- 本该在WHERE处理的过滤

-- 优化后的正确写法
SELECT 
    product_id,
    SUM(quantity) AS total_qty
FROM order_details
WHERE unit_price > 50
  AND product_id LIKE 'A%'    -- 提前过滤无效产品
GROUP BY product_id;

5.2 统计失真危机

用户行为分析系统中的统计漏洞:

-- 错误统计活跃用户的方法
SELECT 
    DATE(login_time),
    COUNT(user_id) AS active_users
FROM user_logins
GROUP BY DATE(login_time)
HAVING active_users > 100;  -- 未去重导致统计错误

-- 正确使用COUNT(DISTINCT)
SELECT 
    DATE(login_time),
    COUNT(DISTINCT user_id) AS active_users
FROM user_logins
GROUP BY DATE(login_time)
HAVING active_users > 100;

6. 性能优化三板斧

6.1 EXPLAIN QUERY PLAN解析实战

分析一个复杂查询的执行计划:

EXPLAIN QUERY PLAN
SELECT 
    c.category_name,
    COUNT(p.product_id) AS product_count,
    AVG(p.unit_price) AS avg_price
FROM products p
JOIN categories c ON p.category_id = c.id
WHERE p.stock_quantity > 0
GROUP BY c.category_name
HAVING avg_price > 100.0;

/* 典型输出解析:
SCAN TABLE products AS p
SEARCH TABLE categories AS c USING INTEGER PRIMARY KEY (id=?)
USE TEMP B-TREE FOR GROUP BY
*/

6.2 统计信息更新机制

保持统计信息的准确性:

-- 手动更新统计信息(在数据变更后执行)
ANALYZE;

-- 查看索引统计情况
SELECT * FROM sqlite_stat1;

-- 定期维护脚本示例
BEGIN;
DELETE FROM expired_records WHERE expiration_date < date('now');
ANALYZE;
COMMIT;

7. 不同量级的数据优化策略

7.1 小型数据集(<10万行)的优化重点

-- 重点优化可读性而非执行速度
SELECT 
    substr(phone, 1, 3) AS area_code,
    COUNT(*) AS user_count
FROM customers
GROUP BY area_code
HAVING user_count > 50;

7.2 中型数据集(10万-500万行)的优化方案

-- 需要建立合适的索引
CREATE INDEX idx_customers_phone 
ON customers(substr(phone, 1, 3));

-- 分页聚合查询优化
SELECT 
    area_code,
    user_count 
FROM (
    SELECT 
        substr(phone,1,3) AS area_code,
        COUNT(*) AS user_count
    FROM customers
    GROUP BY area_code
) 
WHERE user_count > 50
ORDER BY user_count DESC
LIMIT 10 OFFSET 20;

7.3 大型数据集(>500万行)的高级技巧

-- 分区统计法(按时间分片)
SELECT 
    strftime('%Y-%m', order_date) AS month,
    product_category,
    SUM(amount) AS total_sales
FROM orders
WHERE order_date BETWEEN '2020-01-01' AND '2023-12-31'
GROUP BY month, product_category
HAVING total_sales > 1000000;

-- 创建复合索引
CREATE INDEX idx_orders_date_category 
ON orders(order_date, product_category);

8. 应用场景与技术选型

在物联网设备数据统计、金融交易流水分析、用户行为日志处理等场景中,经过优化的聚合查询可以实现:

  • 实时报表响应时间缩短40%-70%
  • 聚合计算内存消耗降低30%
  • 复杂条件筛选效率提升3倍以上

9. 注意事项备忘录

  1. GROUP BY字段顺序影响索引命中率
  2. HAVING条件避免使用复杂表达式
  3. 在事务中谨慎处理临时表
  4. 定期更新统计信息
  5. 监控temp_store的内存使用

10. 文章总结

通过七个章节的系统性讲解,我们深入探讨了SQLite聚合查询优化的核心方法。从基础的条件过滤到索引策略,从临时表技巧到统计信息维护,每个优化手段都结合实际业务场景进行验证。特别要注意HAVING的误用问题和统计失真风险,这些细节往往成为系统瓶颈的元凶。掌握这些优化方法论,可以让您的SQLite数据库在处理复杂聚合时游刃有余。