
Elasticsearch Migration Tool

Cross-version data migration for Elasticsearch.

Links:

Features:

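Cross-version migration
Overwrite the index name
Copy index settings and mappings
HTTP basic authentication
Dump an index to a local file
Load an index from a local file
HTTP proxy support
Sliced scroll (elasticsearch 5.0 +)
Run in the background
Generate test data by randomizing the source document IDs
Rename field names
Unify document type names
Specify which _source fields to return from the source
Specify a query string query to filter the data source
Rename source fields while bulk indexing
Incremental updates (add/update/delete of changed records) with --sync. Note: this uses a different implementation that handles only the changed records, but it is not as fast as the old approach
Load generation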

ESM is fast!

A 3-node cluster (3 * c5d.4xlarge, 16C, 32GB, 10Gbps)


root@ip-172-31-13-181:/tmp# ./esm -s https://localhost:8000 -d https://localhost:8000 -x logs1kw -y logs122 -m elastic:medcl123 -n elastic:medcl123 -w 40 --sliced_scroll_size=60 -b 5 --buffer_count=2000000  --regenerate_id
[12-19 06:31:20] [INF] [main.go:506,main] start data migration..
Scroll 10064570 / 10064570 [=================================================] 100.00% 55s
Bulk 10062602 / 10064570 [==================================================]  99.98% 55s
[12-19 06:32:15] [INF] [main.go:537,main] data migration finished.

Migrated about 10,000,000 documents (10,064,570 scrolled) in under a minute (55 seconds); the Nginx log documents were generated from kibana_sample_data_logs.
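As a rough check on the figure above, the average throughput can be derived from the log itself. The sketch below is illustrative only: the helper name `docs_per_second` is hypothetical, and because the log omits the year, a fixed placeholder year is prepended before parsing.

```python
from datetime import datetime

def docs_per_second(start: str, end: str, docs: int) -> float:
    """Average throughput between two esm log timestamps ("MM-DD HH:MM:SS")."""
    fmt = "%Y-%m-%d %H:%M:%S"
    # The log has no year; 2000 is an arbitrary placeholder that only
    # matters if the run crosses a year boundary.
    t0 = datetime.strptime("2000-" + start, fmt)
    t1 = datetime.strptime("2000-" + end, fmt)
    return docs / (t1 - t0).total_seconds()

# Timestamps and document count copied from the log output above.
rate = docs_per_second("12-19 06:31:20", "12-19 06:32:15", 10064570)
print(f"{rate:,.0f} docs/s")
```

The result is an average over the whole run; it says nothing about peak rates or about how the load was split between scrolling and bulk indexing.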

Before using ESM

Before running esm, manually prepare the target index, including its mappings and settings tuned for indexing speed, for example:


PUT your-new-index
{
  "settings": {
    "index.translog.durability": "async", 
    "refresh_interval": "-1", 
    "number_of_shards": 10,
    "number_of_replicas": 0
  }
}
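The settings above trade durability and search freshness for indexing speed, so they are usually reverted once the data is in place. This README does not describe that step; the sketch below assumes you want to return to Elasticsearch's defaults (`request` durability, a `1s` refresh interval, one replica). The function names are hypothetical, and the second body is meant for `PUT your-new-index/_settings`.

```python
import json

def speed_settings(shards: int) -> dict:
    """Index-creation body mirroring the example above."""
    return {
        "settings": {
            "index.translog.durability": "async",
            "refresh_interval": "-1",
            "number_of_shards": shards,
            "number_of_replicas": 0,
        }
    }

def restore_settings(replicas: int = 1) -> dict:
    """Body for PUT <index>/_settings after the migration (assumed defaults).

    number_of_shards is fixed at index creation and cannot be changed here.
    """
    return {
        "index": {
            "translog.durability": "request",
            "refresh_interval": "1s",
            "number_of_replicas": replicas,
        }
    }

print(json.dumps(speed_settings(10), indent=2))
print(json.dumps(restore_settings(), indent=2))
```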

Examples:

Copy index index_name from 192.168.1.x to 192.168.1.y:9200


./bin/esm  -s http://192.168.1.x:9200   -d http://192.168.1.y:9200 -x index_name  -w=5 -b=10 -c 10000

Copy index src_index from 192.168.1.x to 192.168.1.y:9200 and save it as dest_index


./bin/esm -s http://localhost:9200 -d http://localhost:9200 -x src_index -y dest_index -w=5 -b=100
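When esm is launched from a script, building the argument list programmatically avoids quoting mistakes. The following is a minimal sketch, not part of ESM: `esm_copy_command` is a hypothetical helper, and it uses only the `-s`, `-d`, `-x`, `-y`, `-w`, `-b` and `-c` flags that appear in this README.

```python
import shlex

def esm_copy_command(source, dest, src_index, dest_index=None,
                     workers=1, bulk_mb=5, count=None, binary="./bin/esm"):
    """Return the argv list for a plain index copy."""
    argv = [binary, "-s", source, "-d", dest, "-x", src_index]
    if dest_index:
        argv += ["-y", dest_index]          # save under a different name
    argv += ["-w", str(workers), "-b", str(bulk_mb)]
    if count is not None:
        argv += ["-c", str(count)]          # scroll "size" per request
    return argv

argv = esm_copy_command("http://localhost:9200", "http://localhost:9200",
                        "src_index", "dest_index", workers=5, bulk_mb=100)
print(shlex.join(argv))
# To actually start the migration: subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` without a shell means index patterns such as `source_index*` reach esm unchanged.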

Use the sync feature to incrementally update index src_index from 192.168.1.x to 192.168.1.y:9200


./bin/esm --sync -s http://localhost:9200 -d http://localhost:9200 -x src_index -y dest_index
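Conceptually, a sync pass compares the documents on both sides and derives three sets of operations: index, update and delete. The sketch below illustrates that comparison only; it is not ESM's implementation, `plan_sync` is a hypothetical name, and it assumes both sides fit in memory as dictionaries keyed by document ID.

```python
def plan_sync(source: dict, target: dict):
    """Compare two {doc_id: document} maps and return the IDs to
    index (missing in target), update (content differs) and
    delete (no longer present in source)."""
    to_index = [i for i in source if i not in target]
    to_update = [i for i in source if i in target and source[i] != target[i]]
    to_delete = [i for i in target if i not in source]
    return to_index, to_update, to_delete

src = {"1": {"a": 1}, "2": {"a": 2}, "3": {"a": 3}}
dst = {"2": {"a": 2}, "3": {"a": 30}, "4": {"a": 4}}
to_index, to_update, to_delete = plan_sync(src, dst)
print(to_index, to_update, to_delete)
```

Unchanged documents (ID "2" here) produce no operation, which is why this mode touches only changed records.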

Basic authentication


./bin/esm -s http://localhost:9200 -x "src_index" -y "dest_index"  -d http://localhost:9201 -n admin:111111

Copy settings and override the number of shards


./bin/esm -s http://localhost:9200 -x "src_index" -y "dest_index"  -d http://localhost:9201 -m admin:111111 -c 10000 --shards=50  --copy_settings

Copy settings and mappings, recreate the target index, add a query to the source fetch, and refresh after migration


./bin/esm -s http://localhost:9200 -x "src_index" -q=query:phone -y "dest_index"  -d http://localhost:9201  -c 10000 --shards=5  --copy_settings --copy_mappings --force  --refresh

Dump elasticsearch documents to a local file


./bin/esm -s http://localhost:9200 -x "src_index"  -m admin:111111 -c 5000 -q=query:mixer  --refresh -o=dump.bin

Dump the source and target indexes to local files and compare them to find differences quickly


./bin/esm --sort=_id -s http://localhost:9200 -x "src_index" --truncate_output --skip=_index -o=src.json
./bin/esm --sort=_id -s http://localhost:9200 -x "dst_index" --truncate_output --skip=_index -o=dst.json
diff -W 200 -ry --suppress-common-lines src.json dst.json
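Where `diff` is not available, the same comparison can be approximated in a few lines. This sketch assumes, as the line-based `diff` above does, that each dump holds one document per line; `diff_dumps` is a hypothetical helper and the sample lines are made up for illustration.

```python
def diff_dumps(src_lines, dst_lines):
    """Return (lines only in source, lines only in target), each sorted."""
    src = {line.rstrip("\n") for line in src_lines}
    dst = {line.rstrip("\n") for line in dst_lines}
    return sorted(src - dst), sorted(dst - src)

# Illustrative in-memory data; with real dumps, pass open file objects:
#   with open("src.json") as a, open("dst.json") as b:
#       only_src, only_dst = diff_dumps(a, b)
only_src, only_dst = diff_dumps(
    ['{"_id":"1","v":1}\n', '{"_id":"2","v":2}\n'],
    ['{"_id":"2","v":2}\n', '{"_id":"3","v":3}\n'],
)
print(only_src, only_dst)
```

Because whole lines are compared, a document whose content changed shows up in both result lists, once with each version.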

Load data from a dump file and bulk insert it into another es instance


./bin/esm -d http://localhost:9200 -y "dest_index"   -n admin:111111 -c 5000 -b 5 --refresh -i=dump.bin

Proxy support


 ./bin/esm -d http://123345.ap-northeast-1.aws.found.io:9200 -y "dest_index"   -n admin:111111  -c 5000 -b 1 --refresh  -i dump.bin  --dest_proxy=http://127.0.0.1:9743

Use sliced scroll (available in elasticsearch 5.0 and later) to speed up scrolling, and update the number of shards


 ./bin/esm -s=http://192.168.3.206:9200 -d=http://localhost:9200 -n=elastic:changeme -f --copy_settings --copy_mappings -x=bestbuykaggle  --sliced_scroll_size=5 --shards=50 --refresh
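For background, sliced scroll is an Elasticsearch feature in which each scroll request carries a `slice` object with an `id` and a `max`, so the slices can be read in parallel. How ESM builds its requests internally is not shown in this README; the sketch below only illustrates the shape of the Elasticsearch request bodies, and `sliced_scroll_bodies` is a hypothetical name.

```python
from typing import List, Optional

def sliced_scroll_bodies(slices: int, size: int = 10000,
                         query: Optional[dict] = None) -> List[dict]:
    """Build one search body per slice for a sliced scroll."""
    if slices < 2:
        # Elasticsearch requires max > 1 for slicing to apply.
        raise ValueError("sliced scroll needs at least 2 slices")
    query = query or {"match_all": {}}
    return [
        {"slice": {"id": i, "max": slices}, "size": size, "query": query}
        for i in range(slices)
    ]

bodies = sliced_scroll_bodies(5)
print(bodies[0])
```

Each body would be sent to the index's `_search` endpoint with a `scroll` parameter, one request stream per slice.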

Migrate 5.x to 6.x and unify all types to doc


./esm -s http://source_es:9200 -x "source_index*"  -u "doc" -w 10 -b 10 -t "10m" -d https://target_es:9200 -m elastic:passwd -n elastic:passwd -c 5000

When migrating to 7.x, you may need to rename _type to _doc


./esm -s http://localhost:9201 -x "source" -y "target"  -d https://localhost:9200 --rename="_type:type,age:myage"  -u"_doc"

Filter the migration with a range query


./esm -s https://192.168.3.98:9200 -m elastic:password -o json.out -x kibana_sample_data_ecommerce -q "order_date:[2020-02-01T21:59:02+00:00 TO 2020-03-01T21:59:02+00:00]"

Range query on a keyword field, with escaping


./esm -s https://192.168.3.98:9200 -m test:123 -o 1.txt -x test1  -q "@timestamp.keyword:[\"2021-01-17 03:41:20\" TO \"2021-03-17 03:41:20\"]"
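The backslash escaping above exists only to get the double quotes through the shell. A script can build the query string directly and pass it as a single argument instead. This is a minimal sketch with a hypothetical helper, `range_query`; it formats the same query string syntax used in the two commands above.

```python
import shlex

def range_query(field: str, start: str, end: str, quote_values: bool = False) -> str:
    """Format a query-string range clause: field:[start TO end]."""
    if quote_values:
        # Values containing spaces must be wrapped in double quotes.
        start, end = f'"{start}"', f'"{end}"'
    return f"{field}:[{start} TO {end}]"

q = range_query("@timestamp.keyword", "2021-01-17 03:41:20",
                "2021-03-17 03:41:20", quote_values=True)
print(q)               # the value esm should receive after -q
print(shlex.quote(q))  # the same value, quoted for pasting into a shell
```

When the argument list goes straight to `subprocess.run` without a shell, `q` can be used as-is after `-q`.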

Generate test data: if input.json contains 10 documents, the following command ingests 100 documents, which is useful for testing


./bin/esm -i input.json -d  http://localhost:9201 -y target-index1  --regenerate_id  --repeat_times=10

Select source fields


 ./bin/esm -s http://localhost:9201 -x my_index -o dump.json --fields=author,title

Rename fields while bulk indexing


./bin/esm -i dump.json -d  http://localhost:9201 -y target-index41  --rename=title:newtitle
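To make the `old:new` rename syntax concrete, the sketch below parses a spec such as `title:newtitle` or `_type:type,age:myage` and applies it to one document. It illustrates the intended effect only and is not ESM's code; both function names are hypothetical.

```python
def parse_rename(spec: str) -> dict:
    """Turn "old1:new1,old2:new2" into {"old1": "new1", "old2": "new2"}.

    Raises ValueError if a pair has no colon.
    """
    pairs = (item.split(":", 1) for item in spec.split(",") if item.strip())
    return {old.strip(): new.strip() for old, new in pairs}

def apply_rename(doc: dict, mapping: dict) -> dict:
    """Return a copy of doc with its top-level keys renamed per mapping."""
    return {mapping.get(key, key): value for key, value in doc.items()}

mapping = parse_rename("title:newtitle")
renamed = apply_rename({"title": "ESM", "author": "medcl"}, mapping)
print(renamed)
```

Fields that are not named in the spec pass through unchanged.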

Use buffer_count to control the memory used by ESM, and use gzip to compress network traffic


./esm -s https://localhost:8000 -d https://localhost:8000 -x logs1kw -y logs122 -m elastic:medcl123 -n elastic:medcl123 --regenerate_id -w 20 --sliced_scroll_size=60 -b 5 --buffer_count=1000000 --compress
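This README does not say how memory scales with buffer_count. A back-of-the-envelope estimate, assuming memory is dominated by the buffered documents themselves, multiplies the buffer size by the average document size. The function name and the 500-byte average below are assumptions for illustration; measure your own data before relying on the number.

```python
def buffer_memory_mib(buffer_count: int, avg_doc_bytes: int) -> float:
    """Rough lower bound on buffer memory in MiB.

    Ignores runtime overhead, in-flight bulk requests and compression buffers.
    """
    return buffer_count * avg_doc_bytes / (1024 * 1024)

# 1,000,000 buffered documents at an assumed 500 bytes each.
estimate = buffer_memory_mib(1_000_000, 500)
print(f"{estimate:.0f} MiB")
```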

Download

https://github.com/medcl/esm/releases

Compile:

If the downloaded release does not fit your environment, you can try compiling it yourself. A Go environment is required.

make build

Go version >= 1.7

Options


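Usage:
  esm [OPTIONS]

Application Options:
  -s, --source=                    source elasticsearch instance, e.g. http://localhost:9200
  -q, --query=                     query against the source elasticsearch instance, filters data before migration, e.g. name:medcl
      --sort=                      sort field when scrolling, e.g. _id (default: _id)
  -d, --dest=                      destination elasticsearch instance, e.g. http://localhost:9201
  -m, --source_auth=               basic auth of the source elasticsearch instance, e.g. user:pass
  -n, --dest_auth=                 basic auth of the target elasticsearch instance, e.g. user:pass
  -c, --count=                     number of documents at a time, i.e. "size" in the scroll request (10000)
      --buffer_count=              number of documents buffered in memory (100000)
  -w, --workers=                   number of concurrent bulk workers (1)
  -b, --bulk_size=                 bulk size in MB (5)
  -t, --time=                      scroll time (1m)
      --sliced_scroll_size=        size of the sliced scroll; must be > 1 to take effect (1)
  -f, --force                      delete the destination index before copying
  -a, --all                        copy indexes starting with . and _
      --copy_settings              copy index settings from the source
      --copy_mappings              copy index mappings from the source
      --shards=                    set the number of shards on newly created indexes
  -x, --src_indexes=               names of the indexes to copy, supports regex and comma-separated lists (_all)
  -y, --dest_index=                index name to save to, only one name allowed; the original index name is used if not specified
  -u, --type_override=             override the type name
      --green                      wait for the cluster status of both hosts to be green before dumping; otherwise yellow is acceptable
  -v, --log=                       set the log level, options: trace, debug, info, warn, error (INFO)
  -o, --output_file=               output documents of the source index to a local file
      --truncate_output=           truncate before dumping to the output file
  -i, --input_file=                index from a local dump file
      --input_file_type=           data type of the input file, options: dump, json_line, json_array, log_line (dump)
      --source_proxy=              set a proxy for source http connections, e.g. http://127.0.0.1:8080
      --dest_proxy=                set a proxy for target http connections, e.g. http://127.0.0.1:8080
      --refresh                    refresh after the migration has finished
      --sync=                      sync scrolls both the source and target index, compares the data and syncs (index/update/delete)
      --fields=                    filter source fields (whitelist), comma separated, e.g. col1,col2,col3,...
      --skip=                      skip source fields (blacklist), comma separated, e.g. col1,col2,col3,...
      --rename=                    rename source fields, comma separated, e.g. _type:type, name:myname
  -l, --logstash_endpoint=         target logstash tcp endpoint, e.g. 127.0.0.1:5055
      --secured_logstash_endpoint  target logstash tcp endpoint is secured by TLS
      --repeat_times=              repeat the source data N times to the destination; combine with regenerate_id to amplify the data size
  -r, --regenerate_id              regenerate IDs for documents; this overrides the existing document IDs from the data source
      --compress                   use gzip to compress traffic
  -p, --sleep=                     sleep N seconds after finishing a bulk request (-1)

Help Options:
  -h, --help                       Show this help message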

FAQ

If the scroll ID is too long, update elasticsearch.yml on the source cluster.


http.max_header_size: 16k
http.max_initial_line_length: 8k

Version compatibility

Source version	Target version
1.x	1.x
1.x	2.x
1.x	5.x
1.x	6.x
1.x	7.x
2.x	1.x
2.x	2.x
2.x	5.x
2.x	6.x
2.x	7.x
5.x	1.x
5.x	2.x
5.x	5.x
5.x	6.x
5.x	7.x
6.x	1.x
6.x	2.x
6.x	5.x
6.x	6.x
6.x	7.x
7.x	1.x
7.x	2.x
7.x	5.x
7.x	6.x
7.x	7.x
