聚合框架

2.1 新版功能.

概述

聚合框架提供一种方法来计算汇总值,而无需使用 :term:`map-reduce`。虽然映射化简是强大的,但它往往比简单的汇总任务更困难,如字段值总和或平均值。

如果你熟悉 SQL`,聚合框架提供了类似的功能, ``GROUP BY` 和相关的SQL运算符以及简单形式的“self joins”。此外,聚合框架提供了投影功能,以重塑返回的数据。使用的聚合框架中的投影,你可以添加计算字段,创建新的虚拟子对象,并提取子对象到结果的顶级。

也可以参考

一个来自 MongoSV 2011 介绍: MongoDB的新的集合框架.

另外, 更多的文档请看 聚合示例 和 :doc:`/reference/aggregation`。

组件

本节将介绍两个概念,聚合框架的基础: 管道表达式.

管道

从概念上,集合的文档通过一个聚合管道,当他们通过的时候,转换这些对象。对于那些熟悉的类UNIX壳(如:bash),这概念同用于字符串过滤的管道(即 |)是类似的。

在壳环境里,管道重定向一个进程的输出到另一个进程的输入。集合管道从一个 管道运算符 流运文档到下一个的方式来处理文档。在管道中,管道运算符可以重复。

所有的管道操作符都是处理文档流的,如果操作符扫描 collection 管道将起作用,并且将所有匹配的文档插入管道的”顶部”,管道里操作符转换通过管道的每个文档。

注解

管道操作符不需要为每一个输入文档生成一个输出文档:操作符也可能会产生新的文档或过滤文档。

警告

管道不能运算以下类型的值: Binary, Symbol, MinKey, MaxKey, DBRef, Code, 以及 CodeWScope.

也可以参考

聚合参考” 包含了以下管道运算符的文档:

表达式

表达式 基于对输入文件进行计算而生成输出文件. 聚合框架定义表达式中使用的一种文件格式,使用前缀。

表达式是无状态的,并只在被集合进程看到时才计算。管道里的表达式只能操作当前文档,无法整合其他文档中的数据。

$group 操作符使用 累加器 表达式,通过 管道 的文档进程保持状态 (e.g. totals, maximums, minimums, and related data).

也可以参考

框架表达式 聚合框架提供额外的表达式的例子.

使用

调用

mongo 壳或者 数据库命令 aggregate 里使用 aggregate() 包装器调用 聚合`。通常在集合对象里调用 :method:`~db.collection.aggregate() 来决定集合 管道 输入文档。:method:~db.collection.aggregate() 方法的参数指定一系列 管道算子,其中每个操作符可以有许多操作数。

第一步, 假设一个名为 articles 文档 :term:`集合`,格式如下:

{
 title : "this is my title" ,
 author : "bob" ,
 posted : new Date () ,
 pageViews : 5 ,
 tags : [ "fun" , "good" , "fun" ] ,
 comments : [
             { author :"joe" , text : "this is cool" } ,
             { author :"sam" , text : "this is bad" }
 ],
 other : { foo : 5 }
}

The following example aggregation operation pivots data to create a set of author names grouped by tags applied to an article. 通过发出以下命令唤出聚合框架:

db.articles.aggregate(
  { $project : {
     author : 1,
     tags : 1,
  } },
  { $unwind : "$tags" },
  { $group : {
     _id : { tags : "$tags" },
     authors : { $addToSet : "$author" }
  } }
);

The aggregation pipeline begins with the collection articles and selects the author and tags fields using the $project aggregation operator. The $unwind operator produces one output document per tag. Finally, the $group operator pivots these fields.

结果

上一节集合操作返回带两个字段的 document:

  • result pipeline 返回的 ``结果``,含有数组文档
  • ok 持有值 1, 表示成功, 或其它值,如果有一个错误

作为一个文档,结果受 BSON文档大小 的限制,目前16兆字节。

优化性能

Because you will always call aggregate on a collection object, which logically inserts the entire collection into the aggregation pipeline, you may want to optimize the operation by avoiding scanning the entire collection whenever possible.

管道算子和索引

Depending on the order in which they appear in the pipeline, aggregation operators can take advantage of indexes.

The following pipeline operators take advantage of an index when they occur at the beginning of the pipeline:

The above operators can also use an index when placed before the following aggregation operators:

早期过滤

If your aggregation operation requires only a subset of the data in a collection, use the $match operator to restrict which items go in to the top of the pipeline, as in a query. When placed early in a pipeline, these $match operations use suitable indexes to scan only the matching documents in a collection.

Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and can use an index.

In future versions there may be an optimization phase in the pipeline that reorders the operations to increase performance without affecting the result. However, at this time place $match operators at the beginning of the pipeline when possible.

累加操作的内存

Certain pipeline operators require access to the entire input set before they can produce any output. For example, $sort must receive all of the input from the preceding pipeline operator before it can produce its first output document. The current implementation of $sort does not go to disk in these cases: in order to sort the contents of the pipeline, the entire input must fit in memory.

$group has similar characteristics: Before any $group passes its output along the pipeline, it must receive the entirety of its input. For the $group operator, this frequently does not require as much memory as $sort, because it only needs to retain one record for each unique key in the grouping specification.

The current implementation of the aggregation framework logs a warning if a cumulative operator consumes 5% or more of the physical memory on the host. Cumulative operators produce an error if they consume 10% or more of the physical memory on the host.

分片操作

注解

在 2.1 版更改.

Some aggregation operations using aggregate will cause mongos instances to require more CPU resources than in previous versions. This modified performance profile may dictate alternate architectural decisions if you use the aggregation framework extensively in a sharded environment.

The aggregation framework is compatible with sharded collections.

When operating on a sharded collection, the aggregation pipeline is split into two parts. The aggregation framework pushes all of the operators up to the first $group or $sort operation to each shard. [1] Then, a second pipeline on the mongos runs. This pipeline consists of the first $group or $sort and any remaining pipeline operators, and runs on the results received from the shards.

The $group operator brings in any “sub-totals” from the shards and combines them: in some cases these may be structures. For example, the $avg expression maintains a total and count for each shard; mongos combines these values and then divides.

[1]If an early $match can exclude shards through the use of the shard key in the predicate, then these operators are only pushed to the relevant shards.

限制

使用 aggregate 命令的聚合操作有以下限制:

  • 管道不能操作一下类型的值: Binary, Symbol, MinKey, MaxKey, DBRef, Code, ``CodeWScope``。
  • pipeline 输出只能容纳16兆字节. 如果你的结果集超过此限制, aggregate 命令将生成一个错误。
  • 如果任何一个聚合操作消耗超过10%的系统RAM,运行中会产生一个错误。