Elasticsearch tutorial: quick-start examples for beginners

1.1 Basic concepts

Elasticsearch is a full-text search engine built on Lucene. Because it also stores data, many of its concepts have counterparts in MySQL.

How the concepts map to MySQL:

  • Index (indices): database
  • Type (type): table
  • Document (document): row
  • Field (field): column

Detailed description:

  • Index (indices): "indices" is the plural of "index" and refers to multiple indexes.
  • Type: simulates the concept of a table in MySQL. One index can hold several types, such as a product type and an order type, whose data formats differ. Because this tends to make the index chaotic, types were deprecated in Elasticsearch 6.x and removed in later releases. The examples in this tutorial use the older, typed API.
  • Document: the original data saved in the index. For example, each product record is a document.
  • Field: a property of a document.
  • Mapping (mappings): the definition of each field: its data type, attributes, whether it is indexed, whether it is stored, and so on.

Related terms:

  • Indices (plural of index): a logically complete index
  • Shard: one of the parts the data is split into
  • Replica: a copy of a shard

1.2. Create Index

1.2.1. Syntax

Elasticsearch exposes a REST-style API, so every operation is an HTTP request and you can use any HTTP client to send it.

Request format for creating an index:

  • Request method: PUT
  • Request path: /index_name
  • Request body (JSON):

{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    }
}

  • settings: the index settings
  • number_of_shards: number of shards
  • number_of_replicas: number of replicas of each shard
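The request format above can also be built from code. The following is a minimal sketch in Python using only the standard library; the node address http://localhost:9200 and the helper name build_create_index_request are assumptions made for illustration, not part of the original tutorial. The function only builds the request, so it can be inspected without a running cluster.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, shards=3, replicas=2):
    """Build (but do not send) the PUT request that creates an index."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumed local node address; adjust it for your own cluster.
    request = build_create_index_request("http://localhost:9200", "testindex")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to check the method, path, and body before touching a real index.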

1.2.2. Test

We try it in Kibana:

[Screenshot: creating the index in Kibana]

You can see that the index was created successfully.

1.2.3. Create with Postman

[Screenshot: creating the index in Postman]

This also succeeds, but it is less convenient than Kibana.

1.3. View index settings

A GET request shows the index information. Format:

GET /index_name

[Screenshot: viewing the index settings]

Alternatively, use * in place of the index name to view the settings of all indexes.

1.4. Delete the index

Use a DELETE request to delete an index:

DELETE /index_name

1.5. Mapping configuration

With the index in place, the next step is to add data. Before adding data, however, we should define the mapping.

What is mapping?

A mapping defines the structure of a document: which fields it contains, whether each field is stored, whether it is indexed, whether it is analyzed (split into terms), and so on.

Only once this configuration is clear can Elasticsearch build the index the way we want.

1.5.1. Create mapping fields

The request method is still PUT:

PUT /index_name/_mapping/type_name
{
  "properties": {
    "field_name": {
      "type": "field_type",
      "index": true,
      "store": true,
      "analyzer": "analyzer_name"
    }
  }
}

  • type_name: the type described earlier, similar to a table in a database.
  • field_name: any name you like. Each field can specify many attributes, for example:
    • type: the data type, which can be text, long, short, date, integer, object, etc.
    • index: whether the field is indexed; the default is true
    • store: whether the field is stored separately; the default is false
    • analyzer: the analyzer (tokenizer) to use; for example, ik_max_word uses the IK analyzer

Example

Send the request:

PUT /testindex/_mapping/goods
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word"
    },
    "images": {
      "type": "keyword",
      "index": false
    },
    "price": {
      "type": "float"
    }
  }
}

Response:

{
  "acknowledged": true
}

1.5.2. View the mapping relationship

GET /testindex/_mapping

Response:

{
  "testindex": {
    "mappings": {
      "goods": {
        "properties": {
          "images": {
            "type": "keyword",
            "index": false
          },
          "price": {
            "type": "float"
          },
          "title": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    }
  }
}

1.5.3. Detailed field attributes

1.5.3.1. type

Elasticsearch supports a rich set of data types. The key ones are:

  • String: two types
    • text: analyzed (split into terms); cannot take part in aggregations
    • keyword: not analyzed; the value is matched as a whole and can take part in aggregations
  • Numeric: two categories
    • Basic data types: long, integer, short, byte, double, float, half_float
    • High-precision floating point: scaled_float. You specify a scaling factor, such as 10 or 100. Elasticsearch multiplies the real value by this factor, stores the result, and divides by the factor again when reading it back.
  • Date: the date type

Elasticsearch can store a date as a formatted string, but to save space it is recommended to store it as a long holding the millisecond value.
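To make the scaled_float description concrete, here is a small Python sketch of the idea: multiply by the scaling factor, round to a whole number for storage, and divide again on the way out. The function names are hypothetical, and this only illustrates the arithmetic; it is not Elasticsearch's actual implementation.

```python
def to_scaled(value, scaling_factor):
    """Multiply by the factor and round to a whole number for storage."""
    return round(value * scaling_factor)


def from_scaled(stored, scaling_factor):
    """Divide the stored whole number by the factor to restore the value."""
    return stored / scaling_factor


stored = to_scaled(19.99, 100)
print(stored)                    # 1999
print(from_scaled(stored, 100))  # 19.99

# Digits beyond the factor's precision are lost: 19.996 is stored as 2000.
print(from_scaled(to_scaled(19.996, 100), 100))  # 20.0
```

The last line shows the trade-off: a factor of 100 keeps two decimal places, so anything finer is rounded away.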

1.5.3.2. index

index controls whether the field is indexed.

  • true: the field is indexed and can be searched. This is the default
  • false: the field is not indexed and cannot be searched

Because the default is true, every field is indexed unless you configure otherwise.

For fields that should not be indexed, such as a product's image information, set index to false explicitly.

1.5.3.3. store

store controls whether the field value is stored separately, as an extra copy.

In Lucene and Solr, if store is set to false for a field, its value is not kept with the document and does not appear in the user's search results.

In Elasticsearch, however, the value still appears in search results even when store is false.

The reason is that when Elasticsearch indexes a document, it keeps a copy of the original document in a property called _source, and we can filter _source to choose which fields to display.

Setting store to true saves an extra copy of the value outside _source, which is redundant, so store is generally left at false. In fact, false is the default.

1.6. Add data

1.6.1. Randomly generated id

A POST request adds a document to an existing index. If no id is given, Elasticsearch generates one.

Example:

POST /testindex/goods/
{
    "title": "iphoneX",
    "images": "1.jpg",
    "price": 111.00
}

Response:

{
  "_index": "testindex",
  "_type": "goods",
  "_id": "AWsS5Neq-k3yg4WVTNnG",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

View the data in Kibana:

GET _search
{
    "query": {
        "match_all": {}
    }
}

{
  "_index": "testindex",
  "_type": "goods",
  "_id": "AWsS5Neq-k3yg4WVTNnG",
  "_version": 1,
  "_score": 1,
  "_source": {
    "title": "iphoneX",
    "images": "1.jpg",
    "price": 111
  }
}

  • _source: the source document; all of the data is in it.
  • _id: the unique identifier of this document. It is unrelated to any id field inside the document itself.
1.6.2. Custom id

To specify the id ourselves when adding a document, append it to the request path:

Example:

POST /testindex/goods/2
{
    "title": "IphoneX",
    "images": "2.jpg",
    "price": 222
}

The resulting document:

{
  "_index": "testindex",
  "_type": "goods",
  "_id": "2",
  "_score": 1,
  "_source": {
    "title": "IphoneX",
    "images": "2.jpg",
    "price": 222
  }
}

1.6.3. Intelligent judgment (dynamic mapping)

In Solr, new data can only use fields whose mappings were configured in advance; otherwise an error is reported.

Elasticsearch has no such requirement.

Elasticsearch can work without any mapping defined for the index: it infers each field's type from the data you submit and adds the mapping dynamically.

Let's test it:

POST /testindex/goods/3
{
    "title": "IphoneX",
    "images": "3.jpg",
    "price": 333,
    "stock": 200
}

We added an extra stock field.

Look at the result:

{
  "_index": "testindex",
  "_type": "goods",
  "_id": "3",
  "_version": 1,
  "_score": 1,
  "_source": {
    "title": "IphoneX",
    "images": "3.jpg",
    "price": 333,
    "stock": 200
  }
}

Look at the mapping of the index:

{
  "testindex": {
    "mappings": {
      "goods": {
        "properties": {
          "images": {
            "type": "keyword",
            "index": false
          },
          "price": {
            "type": "float"
          },
          "stock": {
            "type": "long"
          },
          "title": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    }
  }
}

The new stock field was mapped automatically, as type long.

1.7. Modify data

Changing the request method of the add request above to PUT turns it into a modification. A modification must specify the id:

  • if a document with that id exists, it is replaced
  • if no document with that id exists, it is added
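The two cases above can be pictured with a tiny in-memory model. This is only an illustrative Python sketch: the dictionary store and the function put_document are hypothetical stand-ins, not Elasticsearch code. It mirrors the behaviour that a PUT with an id either creates the document or replaces the whole document and bumps its version.

```python
def put_document(store, doc_id, source):
    """Replace the whole document if the id exists, otherwise create it."""
    existing = store.get(doc_id)
    version = existing["_version"] + 1 if existing else 1
    # The new body replaces the old one entirely; it is not merged with it.
    store[doc_id] = {"_version": version, "_source": dict(source)}
    return "updated" if existing else "created"


store = {}
print(put_document(store, "3", {"title": "IphoneX", "stock": 200}))  # created
print(put_document(store, "3", {"title": "IphoneX", "stock": 100}))  # updated
print(store["3"])
```

Because the body is replaced rather than merged, any field left out of the second request would disappear from the document.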

For example, we modify the document with id 3:

PUT /testindex/goods/3
{
    "title": "IphoneX",
    "images": "3.jpg",
    "price": 333,
    "stock": 100
}

Result (querying the document afterwards):

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 9,
    "successful": 9,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "IphoneX",
          "images": "3.jpg",
          "price": 333,
          "stock": 100
        }
      }
    ]
  }
}

1.8. Delete data

Use a DELETE request. As with modification, the document is identified by its id:

DELETE /index_name/type_name/id

3. Query

We cover five topics:

  • Basic query
  • Result filtering (_source)
  • Advanced query
  • Filter
  • Sort

3.1. Basic query:

Here query represents a query object, which can take different query types:

  • Query type: for example match_all, match, term, range, etc.
  • Query conditions: these differ by query type, and so does the way they are written.

3.1.1 Query all (match_all)

Example:

GET /testindex/_search
{
    "query": {
        "match_all": {}
    }
}

  • query: the query object
  • match_all: match all documents

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "AWsS5Neq-k3yg4WVTNnG",
        "_score": 1,
        "_source": {
          "title": "iphoneX",
          "images": "1.jpg",
          "price": 111
        }
      }
    ]
  }
}

  • took: how long the query took, in milliseconds
  • timed_out: whether the query timed out
  • _shards: shard information
  • hits: the search results overview object
    • total: the total number of matching documents
    • max_score: the highest score among all results
    • hits: an array of the matching documents; each element is one document
      • _index: the index
      • _type: the document type
      • _id: the document id
      • _score: the document's relevance score
      • _source: the document's source data

3.1.2 Match query (match)

  • or relationship

A match query analyzes the query text into terms before searching. Multiple terms are combined with or:

GET /testindex/_search
{
    "query": {
        "match": {
            "title": "iphoneX"
        }
    }
}

Result:

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.51623213,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "AWsS5Neq-k3yg4WVTNnG",
        "_score": 0.51623213,
        "_source": {
          "title": "iphoneX",
          "images": "1.jpg",
          "price": 111
        }
      },
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "3",
        "_score": 0.25811607,
        "_source": {
          "title": "iMac",
          "images": "4.jp",
          "price": 444
        }
      }
    ]
  }
}

With a match query, documents that match any of the terms are returned, because multiple terms are combined with or. For example, a query for "Xiaomi mobile phone" is split into two terms, "Xiaomi" and "mobile phone"; since the relationship is or, any document containing either term is returned.

  • and relationship

Sometimes we need a more precise search and want the relationship between terms to be and. We can do this:

GET /testindex/_search
{
    "query": {
        "match": {
            "title": {
                "query": "iphoneX",
                "operator": "and"
            }
        }
    }
}

Result:

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.51623213,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "AWsS5Neq-k3yg4WVTNnG",
        "_score": 0.51623213,
        "_source": {
          "title": "iphoneX",
          "images": "1.jpg",
          "price": 111
        }
      }
    ]
  }
}

With operator set to and, only documents that contain all of the query terms are returned.

  • Between or and and

Choosing between or and and is rather black and white. If the user's query is split into 5 terms and we want to find documents that contain only 4 of them, what should we do? Setting operator to and would exclude such documents.

Sometimes this is what we want, but in most full-text search scenarios we want to include documents that may be relevant while excluding those that are less relevant. In other words, we want something in between.

The match query supports the minimum_should_match parameter, which specifies how many terms must match for a document to be considered relevant. It can be set to a specific number, but more commonly it is set to a percentage, because we cannot control how many words the user enters:

GET /testindex/_search
{
    "query": {
        "match": {
            "title": {
                "query": "iWatch",
                "minimum_should_match": "75%"
            }
        }
    }
}

Suppose the query text is split into 3 terms. With the and relationship, a document must contain all 3 terms to match. Here minimum_should_match is 75%, so a document only needs to match 75% of the terms: 3 * 75% = 2.25, which is rounded down to 2. A document that contains any 2 of the terms therefore satisfies the condition.
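The three behaviours (or, and, and a minimum percentage) can be compared side by side with a short Python sketch. It is a simplified model under stated assumptions: the function names are hypothetical, terms are plain strings that have already been analyzed, and only a positive integer or a positive percentage is handled for minimum_should_match.

```python
def required_matches(term_count, operator="or", minimum_should_match=None):
    """Return how many of the query terms a document must contain."""
    if operator == "and":
        return term_count
    if minimum_should_match is None:
        return 1  # plain "or": any single term is enough
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        # Integer arithmetic rounds down: 3 terms at 75% gives 2.
        # At least one term is still required, hence the max().
        return max(1, term_count * percent // 100)
    return int(minimum_should_match)


def is_match(query_terms, doc_terms, operator="or", minimum_should_match=None):
    """Check whether a document's terms satisfy the query."""
    wanted = set(query_terms)
    found = len(wanted & set(doc_terms))
    return found >= required_matches(len(wanted), operator, minimum_should_match)


query = ["xiaomi", "curved", "tv"]
document = ["xiaomi", "tv", "stand"]  # contains 2 of the 3 query terms

print(is_match(query, document))                               # True
print(is_match(query, document, operator="and"))               # False
print(is_match(query, document, minimum_should_match="75%"))   # True
```

The document with 2 of 3 terms is rejected by and but accepted at 75%, which is the middle ground described above.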

3.1.3 Multi-field query (multi_match)

multi_match is similar to match, except that it searches in several fields at once:

GET /testindex/_search
{
    "query": {
        "multi_match": {
            "query": "IPhone",
            "fields": [ "title", "image" ]
        }
    }
}

This searches for the query text in both the title and image fields.

3.1.4 term matching (term)

The term query is used for exact-value matching. The exact values may be numbers, dates, booleans, or strings that are not analyzed:

GET /testindex/_search
{
    "query": {
        "term": {
            "price": 111
        }
    }
}

Result:

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "AWsS5Neq-k3yg4WVTNnG",
        "_score": 1,
        "_source": {
          "title": "iphoneX",
          "images": "1.jpg",
          "price": 111
        }
      }
    ]
  }
}

3.1.5 Multi-term exact matching (terms)

The terms query works like the term query, but lets you specify several values to match. A document satisfies the condition if the field contains any one of the specified values:

GET /testindex/_search
{
    "query": {
        "terms": {
            "price": [111, 222]
        }
    }
}

3.2. Results filtering

By default, Elasticsearch returns every field stored in the document's _source in the search results.

If we only want some of the fields, we can add a _source filter.

3.2.1. Directly specify fields

Example:

GET /testindex/_search
{
  "_source": ["title", "price"],
  "query": {
    "term": {
      "price": 111
    }
  }
}

Result:

{
  "took": 28,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "AWsS5Neq-k3yg4WVTNnG",
        "_score": 1,
        "_source": {
          "price": 111,
          "title": "iphoneX"
        }
      }
    ]
  }
}

_source now contains only the title and price fields.

3.2.2. Specify includes and excludes

We can also use:

  • includes: the fields to display
  • excludes: the fields not to display

Both are optional.

Example:

GET /testindex/_search
{
  "_source": {
    "includes": ["title", "price"]
  },
  "query": {
    "term": {
      "price": 111
    }
  }
}

This returns the same result as the following:

GET /testindex/_search
{
  "_source": {
    "excludes": ["images"]
  },
  "query": {
    "term": {
      "price": 111
    }
  }
}
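Why the two requests return the same fields can be seen from a small Python sketch of include/exclude filtering. The function filter_source is a hypothetical helper that handles exact field names only; it is a simplified model of the idea, not how Elasticsearch implements _source filtering.

```python
def filter_source(source, includes=None, excludes=None):
    """Keep only the included fields, then drop the excluded ones."""
    kept = dict(source)
    if includes:
        kept = {name: value for name, value in kept.items() if name in includes}
    if excludes:
        kept = {name: value for name, value in kept.items() if name not in excludes}
    return kept


document = {"title": "iphoneX", "images": "1.jpg", "price": 111}

by_includes = filter_source(document, includes=["title", "price"])
by_excludes = filter_source(document, excludes=["images"])

print(by_includes)                 # {'title': 'iphoneX', 'price': 111}
print(by_includes == by_excludes)  # True
```

For a document with exactly these three fields, listing the two wanted fields and excluding the one unwanted field are equivalent.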

3.3 Advanced query

3.3.1 Boolean combination (bool)

The bool query combines other queries using must (AND), must_not (NOT), and should (OR):

GET /testindex/_search
{
    "query": {
        "bool": {
            "must":     { "match": { "title": "IPhone" }},
            "must_not": { "match": { "title": "TV" }},
            "should":   { "match": { "title": "Phone" }}
        }
    }
}

Result:

{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.51623213,
    "hits": [
      {
        "_index": "testindex",
        "_type": "goods",
        "_id": "AWsS5Neq-k3yg4WVTNnG",
        "_score": 0.51623213,
        "_source": {
          "title": "iphoneX",
          "images": "1.jpg",
          "price": 111
        }
      }
    ]
  }
}

3.3.2 Range query (range)

The range query matches values that fall within a given interval. It accepts the following operators:

  • gt: greater than
  • gte: greater than or equal to
  • lt: less than
  • lte: less than or equal to
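As an illustration of how the four operators combine, here is a Python sketch that builds a range query body and applies the same conditions to a few local values. The bounds 100 and 300 and the helper in_range are assumptions chosen for the example; the prices are the ones used for the sample documents earlier.

```python
import operator

# Map each range operator to the comparison it stands for.
RANGE_OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, conditions):
    """Return True when the value satisfies every condition in the range."""
    return all(RANGE_OPERATORS[name](value, bound) for name, bound in conditions.items())


# The request body a range query on price would carry.
range_query = {"query": {"range": {"price": {"gte": 100, "lt": 300}}}}
conditions = range_query["query"]["range"]["price"]

prices = [111, 222, 333]
print([price for price in prices if in_range(price, conditions)])  # [111, 222]
```

All listed conditions must hold at once, so gte and lt together describe a half-open interval.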

3.3.3 Fuzzy query (fuzzy)

We add a new product:

POST /testindex/goods/4
{
    "title": "applePhone",
    "images": "apple.jpg",
    "price": 6899.00
}

The fuzzy query is the fuzzy equivalent of the term query. It tolerates a difference between the spelling of the search term and the actual term, as long as the edit distance does not exceed 2:

GET /testindex/_search
{
  "query": {
    "fuzzy": {
      "title": "appla"
    }
  }
}

The query above can still find the Apple phone.

We can specify the allowed edit distance with fuzziness:

GET /testindex/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "appla",
        "fuzziness": 1
      }
    }
  }
}
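The edit distance that fuzziness limits can be computed by hand. The following Python sketch implements the classic Levenshtein distance (insertions, deletions, substitutions). It is an illustration of the concept only: Elasticsearch by default also counts a swap of two adjacent characters as a single edit, which this plain version does not, and which terms actually exist in the index depends on the analyzer.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(edit_distance("appla", "apple"))     # 1
print(edit_distance("kitten", "sitting"))  # 3
```

A search term one substitution away from an indexed term, such as "appla" against "apple", falls within a fuzziness of 1.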

3.4 filter

Filtering within a query

Every query clause affects the score and ranking of documents. To filter the results without letting the filter conditions affect the score, do not write them as query conditions; put them in filter instead:

GET /testindex/_search
{
    "query": {
        "bool": {
            "must": { "match": { "title": "iphoneX" }},
            "filter": {
                "range": { "price": { "gt": 2000.00, "lt": 3800.00 }}
            }
        }
    }
}

Note: inside filter, you can again use bool to combine several filter conditions.

Filtering without query conditions

If a query only filters, with no query conditions and no scoring, we can use a constant_score query in place of a bool query that contains only a filter clause. The performance is the same, but the query is simpler and clearer.

GET /testindex/_search
{
    "query": {
        "constant_score": {
            "filter": {
                "range": { "price": { "gt": 2000.00, "lt": 3000.00 }}
            }
        }
    }
}

3.5 Sort

3.5.1 Single field sorting

sort lets us sort by different fields, and order specifies the sort direction:

GET /testindex/_search
{
  "query": {
    "match": {
      "title": "iphoneX"
    }
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

3.5.2 Sorting multiple fields

Suppose we want to sort by a combination of price and _score (the relevance score), so that matching results are ordered first by price and then by score:

GET /testindex/_search
{
    "query": {
        "bool": {
            "must": { "match": { "title": "iphoneX" }},
            "filter": {
                "range": { "price": { "gt": 200000, "lt": 300000 }}
            }
        }
    },
    "sort": [
      { "price": { "order": "desc" }},
      { "_score": { "order": "desc" }}
    ]
}
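The effect of listing two sort fields can be reproduced locally. In this Python sketch the hits are made-up values for illustration, not output from the index: results are ordered by price, and _score only decides the order among hits with the same price.

```python
# Hypothetical hits, invented for this example.
hits = [
    {"_id": "a", "_score": 0.5, "price": 222},
    {"_id": "b", "_score": 0.9, "price": 222},
    {"_id": "c", "_score": 0.7, "price": 333},
]

# Sort by price first, then by _score; both descending.
ordered = sorted(hits, key=lambda hit: (hit["price"], hit["_score"]), reverse=True)

print([hit["_id"] for hit in ordered])  # ['c', 'b', 'a']
```

Hit c comes first because of its higher price even though b has the higher score; b beats a only because their prices are equal.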

4. Aggregation

Aggregation allows us to achieve extremely convenient statistics and analysis of data. E.g:

  • What brand of mobile phone is the most popular?
  • The average price, the highest price, the lowest price of these phones?
  • How about the monthly sales of these phones?

These statistics are more convenient to implement than with SQL in a database, and the queries are very fast, close to real-time search.

4.1 Basic concepts

Elasticsearch offers several kinds of aggregation. The two most commonly used are buckets and metrics:

Bucket

A bucket groups data in some way; in ES each group is called a bucket. For example, dividing people by nationality gives a CN bucket, a USA bucket, a Japan bucket, and so on; dividing them by age range gives the buckets 0-10, 10-20, 20-30, 30-40, and so on.

Elasticsearch provides many ways to divide buckets:

  • Date Histogram Aggregation: groups by date interval; for example, with an interval of one week, each week automatically forms a group
  • Histogram Aggregation: groups by numeric interval, similar to the date histogram
  • Terms Aggregation: groups by term, that is, by the content of the field
  • Range Aggregation: groups numeric or date values into ranges; you specify the start and end of each range

In short, bucket aggregations only group data; they do not perform calculations. For that reason another kind of aggregation is often nested inside a bucket: the metrics aggregation.

Metrics

Once the data is grouped, we usually run calculations on each group, such as average, maximum, minimum, or sum. In ES these are called metrics.

Some commonly used metrics aggregations:

  • Avg Aggregation: average
  • Max Aggregation: maximum
  • Min Aggregation: minimum
  • Percentiles Aggregation: percentiles
  • Stats Aggregation: returns avg, max, min, sum, count, etc. at once
  • Sum Aggregation: sum
  • Top hits Aggregation: the top few documents
  • Value Count Aggregation: number of values

To test aggregations, we first import some data in bulk.

Create an index:

PUT /cars
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "keyword"
        },
        "make": {
          "type": "keyword"
        }
      }
    }
  }
}

Note: In ES, fields used for aggregation, sorting, and filtering are handled specially and must not be analyzed. Here we set the two string fields color and make to the keyword type, which is not analyzed, so they can take part in aggregations later.

Import the data:

POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
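The bulk body is newline-delimited JSON: one action line followed by one document line, each on a single line, with a final newline at the end. When sending it from code rather than Kibana, it can be assembled as in this Python sketch. The helper name build_bulk_body is hypothetical, and only two of the documents are shown to keep the example short.

```python
import json


def build_bulk_body(documents):
    """Pair every document with an index action line, one JSON object per line."""
    lines = []
    for document in documents:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(document))       # source line
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


cars = [
    {"price": 10000, "color": "red", "make": "honda", "sold": "2014-10-28"},
    {"price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05"},
]

body = build_bulk_body(cars)
print(body, end="")
```

Serializing each object with json.dumps guarantees that no document spans more than one line, which the format depends on.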

4.2 Aggregation into buckets

First, we divide the cars into buckets by color:

GET /cars/_search
{
    "size": 0,
    "aggs": {
        "popular_colors": {
            "terms": {
                "field": "color"
            }
        }
    }
}

  • size: the number of hits to return. It is set to 0 here because we only care about the aggregation results, not the matched documents, which improves efficiency
  • aggs: declares an aggregation; it is short for aggregations
    • popular_colors: the name of this aggregation, chosen freely
      • terms: how the buckets are divided, here by term
      • field: the field used to divide the buckets

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "red",
          "doc_count": 4
        },
        {
          "key": "blue",
          "doc_count": 2
        },
        {
          "key": "green",
          "doc_count": 2
        }
      ]
    }
  }
}

  • hits: empty, because we set size to 0
  • aggregations: the aggregation results
  • popular_colors: the aggregation name we defined
  • buckets: the buckets found; each distinct value of the color field forms one bucket
    • key: the value of the color field for this bucket
    • doc_count: the number of documents in this bucket

The aggregation shows that red cars are selling best.
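The bucket counts can be checked against the imported data with a few lines of Python. This sketch mimics what the terms aggregation reports, using collections.Counter; the helper terms_buckets is hypothetical, and ordering ties by key is chosen here to match the order in the result above.

```python
from collections import Counter

# The eight imported car transactions as (price, color, make).
CARS = [
    (10000, "red", "honda"),
    (20000, "red", "honda"),
    (30000, "green", "ford"),
    (15000, "blue", "toyota"),
    (12000, "green", "toyota"),
    (20000, "red", "honda"),
    (80000, "red", "bmw"),
    (25000, "blue", "ford"),
]


def terms_buckets(values):
    """Count each distinct value; order by count descending, then by key."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


buckets = terms_buckets(color for _, color, _ in CARS)
for bucket in buckets:
    print(bucket)
```

Each distinct color becomes one bucket, and doc_count is simply how many transactions carry that color.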

4.3 In-bucket metrics

The previous example tells us the number of documents in each bucket, which is useful. Usually, though, an application needs more complex metrics. For example, what is the average price of cars of each color?

So we need to tell Elasticsearch which field to use and which metric to compute. This information is nested inside the bucket, and the metric is computed over the documents in each bucket.

Now we add a metric that averages the price to the previous aggregation:

GET /cars/_search
{
    "size": 0,
    "aggs": {
        "popular_colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

  • aggs: we add a new aggs inside the previous aggregation (popular_colors). A metric is therefore also an aggregation, one that runs inside a bucket
  • avg_price: the name of the aggregation
  • avg: the type of metric, here the average
  • field: the field the metric is computed on

Result (the aggregations part):

  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "red",
          "doc_count": 4,
          "avg_price": {
            "value": 32500
          }
        },
        {
          "key": "blue",
          "doc_count": 2,
          "avg_price": {
            "value": 20000
          }
        },
        {
          "key": "green",
          "doc_count": 2,
          "avg_price": {
            "value": 21000
          }
        }
      ]
    }
  }

You can see that each bucket has its own avg_price field, which is the result of the metric aggregation.
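The averages can be verified from the imported data. The Python sketch below groups prices by color and averages each group, which is the same two-step idea (bucket first, metric inside the bucket). It is a local recomputation for illustration, not a call to Elasticsearch.

```python
from collections import defaultdict

# The eight imported car transactions as (price, color).
CARS = [
    (10000, "red"), (20000, "red"), (30000, "green"), (15000, "blue"),
    (12000, "green"), (20000, "red"), (80000, "red"), (25000, "blue"),
]

# Step 1: bucket the prices by color.
prices_by_color = defaultdict(list)
for price, color in CARS:
    prices_by_color[color].append(price)

# Step 2: compute the metric (average) inside each bucket.
avg_price = {
    color: sum(prices) / len(prices)
    for color, prices in prices_by_color.items()
}

print(avg_price)  # {'red': 32500.0, 'green': 21000.0, 'blue': 20000.0}
```

The values agree with the avg_price fields in the result: 32500 for red, 20000 for blue, and 21000 for green.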

4.4 Nested buckets in the bucket

In the previous example we nested a metric inside a bucket. Buckets can nest not only metrics but also other buckets; that is, each group can be divided into further groups.

For example, to count the manufacturers of the cars of each color, we divide each color bucket again by the make field:

GET /cars/_search
{
    "size": 0,
    "aggs": {
        "popular_colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                },
                "maker": {
                    "terms": {
                        "field": "make"
                    }
                }
            }
        }
    }
}

  • The original color bucket and avg calculation are unchanged
  • maker: a new bucket aggregation added under the nested aggs, named maker
  • terms: the buckets are again divided by term
  • field: the buckets are divided by the make field

Partial results:

{
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "red",
          "doc_count": 4,
          "maker": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "honda",
                "doc_count": 3
              },
              {
                "key": "bmw",
                "doc_count": 1
              }
            ]
          },
          "avg_price": {
            "value": 32500
          }
        },
        {
          "key": "blue",
          "doc_count": 2,
          "maker": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "ford",
                "doc_count": 1
              },
              {
                "key": "toyota",
                "doc_count": 1
              }
            ]
          },
          "avg_price": {
            "value": 20000
          }
        },
        {
          "key": "green",
          "doc_count": 2,
          "maker": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "ford",
                "doc_count": 1
              },
              {
                "key": "toyota",
                "doc_count": 1
              }
            ]
          },
          "avg_price": {
            "value": 21000
          }
        }
      ]
    }
  }
}

  • We can see that the new maker aggregation is nested inside each of the original color buckets.
  • Within each color, the documents are grouped by the make field.
  • Information we can read from the first bucket:
    • There are 4 red cars.
    • The average selling price of a red car is $32,500.
    • Three of them are made by Honda and one is made by BMW.
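To make the nesting concrete, here is a minimal Python sketch that walks a response shaped like the one above and prints one summary line per color bucket. The response dictionary is a trimmed copy of the "red" bucket, and the summarize helper is a hypothetical name used only for illustration. It is not part of any Elasticsearch client.

```python
# Trimmed copy of the "red" bucket from the partial results above.
response = {
    "aggregations": {
        "popular_colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "maker": {
                        "buckets": [
                            {"key": "honda", "doc_count": 3},
                            {"key": "bmw", "doc_count": 1},
                        ]
                    },
                    "avg_price": {"value": 32500},
                }
            ]
        }
    }
}


def summarize(resp):
    """Return one readable line per color bucket (hypothetical helper)."""
    lines = []
    for color in resp["aggregations"]["popular_colors"]["buckets"]:
        # Each color bucket carries its own nested "maker" buckets.
        makers = ", ".join(
            "{}: {}".format(m["key"], m["doc_count"])
            for m in color["maker"]["buckets"]
        )
        lines.append(
            "{}: {} cars, avg price {} ({})".format(
                color["key"],
                color["doc_count"],
                color["avg_price"]["value"],
                makers,
            )
        )
    return lines


for line in summarize(response):
    print(line)
```

The point to notice is that the metric (avg_price) and the nested bucket aggregation (maker) sit side by side inside every color bucket, so one loop over the outer buckets is enough to reach both.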

4.5. Other ways of dividing buckets

As mentioned earlier, there are many ways to divide documents into buckets, for example:

  • Date Histogram Aggregation: groups by a date interval. For example, if the interval is one week, the documents are automatically divided into one group per week
  • Histogram Aggregation: groups by a numeric interval, similar to the date histogram
  • Terms Aggregation: groups by the terms found in a field
  • Range Aggregation: groups numeric values or dates by ranges. You specify the start and end of each range, and the documents are grouped by those segments

In the previous example we used Terms Aggregation, which divides buckets according to terms.

Next, we look at a few other practical ones:

4.5.1. Histogram (stepped buckets)

Principle:

A histogram groups the values of a numeric field into steps of a fixed size. You specify the step size with the interval parameter.

Example:

Suppose you have a price field and set the interval to 200. The steps then look like this:

0, 200, 400, 600, …

Each value listed above is the key of a bucket and also the starting point of its interval.

If the price of a product is 450, which bucket does it fall into? The key is calculated as follows:

bucket_key = Math.floor((value - offset) / interval) * interval + offset

value: the value of the field in the current document, in this case 450

offset: the starting offset, 0 by default

interval: the step size, in this case 200

So the key is Math.floor((450 - 0) / 200) * 200 + 0 = 400
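The formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Elasticsearch's own code, and the function name bucket_key is chosen here for illustration.

```python
import math


def bucket_key(value, interval, offset=0):
    """Key of the histogram bucket that `value` falls into."""
    return math.floor((value - offset) / interval) * interval + offset


print(bucket_key(450, 200))              # 400
print(bucket_key(450, 200, offset=100))  # 300
```

With an offset of 100 the steps become 100, 300, 500, …, so the same price of 450 falls into the bucket whose key is 300.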

Operation:

For example, we group the cars by price and set the interval to 5000:

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 5000
      }
    }
  }
}

Result:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "price": {
      "buckets": [
        { "key": 10000, "doc_count": 2 },
        { "key": 15000, "doc_count": 1 },
        { "key": 20000, "doc_count": 2 },
        { "key": 25000, "doc_count": 1 },
        { "key": 30000, "doc_count": 1 },
        { "key": 35000, "doc_count": 0 },
        { "key": 40000, "doc_count": 0 },
        { "key": 45000, "doc_count": 0 },
        { "key": 50000, "doc_count": 0 },
        { "key": 55000, "doc_count": 0 },
        { "key": 60000, "doc_count": 0 },
        { "key": 65000, "doc_count": 0 },
        { "key": 70000, "doc_count": 0 },
        { "key": 75000, "doc_count": 0 },
        { "key": 80000, "doc_count": 1 }
      ]
    }
  }
}

You will find a large number of buckets with 0 documents in the middle, which clutters the result. They appear because min_doc_count defaults to 0 for a histogram.

We can set the min_doc_count parameter to 1 so that only buckets containing at least one document are returned and the empty buckets are filtered out.

Example:

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 5000,
        "min_doc_count": 1
      }
    }
  }
}

Result:

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "price": {
      "buckets": [
        { "key": 10000, "doc_count": 2 },
        { "key": 15000, "doc_count": 1 },
        { "key": 20000, "doc_count": 2 },
        { "key": 25000, "doc_count": 1 },
        { "key": 30000, "doc_count": 1 },
        { "key": 80000, "doc_count": 1 }
      ]
    }
  }
}

Perfect!
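The effect of min_doc_count can also be reproduced locally. The sketch below is a simplified model of the histogram bucketing in plain Python, not how Elasticsearch implements it. The prices list is an assumption: illustrative values chosen only so that the counts match the results shown above. The function name histogram is also hypothetical.

```python
import math
from collections import Counter


def histogram(values, interval, min_doc_count=0, offset=0):
    """Simplified model of a histogram aggregation over a list of numbers."""
    counts = Counter(
        math.floor((v - offset) / interval) * interval + offset for v in values
    )
    if not counts:
        return []
    buckets = []
    key = min(counts)
    # Walk every step between the lowest and highest key, so gaps show up
    # as buckets with doc_count 0 unless min_doc_count filters them out.
    while key <= max(counts):
        doc_count = counts.get(key, 0)
        if doc_count >= min_doc_count:
            buckets.append({"key": key, "doc_count": doc_count})
        key += interval
    return buckets


# Assumed prices, picked to match the bucket counts in the results above.
prices = [10000, 12000, 15000, 20000, 20000, 25000, 30000, 80000]

print(len(histogram(prices, 5000)))                   # 15 buckets, gaps included
print(len(histogram(prices, 5000, min_doc_count=1)))  # 6 buckets, gaps removed
```

With min_doc_count left at 0 the gap between 30000 and 80000 is filled with empty buckets, which is the first result above. Setting it to 1 leaves only the six non-empty buckets.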

4.5.2. Range

Range bucketing is similar to histogram bucketing in that numeric values are grouped into segments, but with the range method you must specify the start and end of each group yourself.