GitLab Elasticsearch Indexer

This project indexes Git repositories into Elasticsearch for GitLab. The indexed data lets GitLab search code, wikis, and commits in its repositories through Elasticsearch.

The indexer is designed with a modular architecture that supports different indexing modes to optimize for various deployment scenarios. It uses structured logging to help with troubleshooting and debugging.

Dependencies

This project relies on the following dependencies:

ICU for text encoding
Go 1.20 or later for building from source
Gitaly for accessing Git repositories
Elasticsearch v7.x or compatible OpenSearch instance

Ensure the development packages for your platform are installed before running make:

Debian / Ubuntu


# apt install libicu-dev

macOS


$ brew install icu4c
$ export PKG_CONFIG_PATH="$(brew --prefix)/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"

Modes Architecture

The GitLab Elasticsearch Indexer supports multiple operating modes that can be configured using the GITLAB_INDEXER_MODE environment variable. Each mode is optimized for different use cases:

Advanced Mode (Default)

The Advanced Mode is the default mode for the indexer. It provides full-featured indexing with support for:

Indexing code (blobs), commits, and wikis
Project permission handling
Namespace traversal IDs
Schema versioning

This mode is recommended for most standard GitLab deployments.


export GITLAB_INDEXER_MODE=advanced # default if not specified

Chunk Mode

The Chunk Mode is an alternative indexing approach designed for large repositories or specialized deployment scenarios. This mode is currently under development and will provide enhanced features for handling very large codebases more efficiently.

To select a specific mode, set the GITLAB_INDEXER_MODE environment variable:


export GITLAB_INDEXER_MODE=chunk
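
As a rough illustration of how a mode value could be resolved, the following Go sketch reads GITLAB_INDEXER_MODE and falls back to the documented default. The function name resolveMode and the choice to reject unknown values are assumptions made for this example; the indexer's own handling may differ.

```go
package main

import (
	"fmt"
	"os"
)

// Mode names come from this README; everything else here is hypothetical.
const (
	modeAdvanced = "advanced"
	modeChunk    = "chunk"
)

// resolveMode maps the raw GITLAB_INDEXER_MODE value to a known mode.
// An empty value falls back to the documented default, "advanced".
// Rejecting unknown values is an assumption of this sketch.
func resolveMode(raw string) (string, error) {
	switch raw {
	case "", modeAdvanced:
		return modeAdvanced, nil
	case modeChunk:
		return modeChunk, nil
	default:
		return "", fmt.Errorf("unknown indexer mode %q", raw)
	}
}

func main() {
	mode, err := resolveMode(os.Getenv("GITLAB_INDEXER_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("selected mode:", mode)
}
```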

Usage

Chunk mode uses command-line flags to specify the adapter and connection details, with operation-specific options passed as JSON:


gitlab-elasticsearch-indexer \
  -mode chunk \
  -adapter elasticsearch \
  -connection '{"url": ["http://localhost:9200"]}' \
  -options '{
    "project_id": 123,
    "operation": "index",
    "partition_name": "gitlab-code-search",
    "partition_number": 0,
    "timeout": "5m",
    "chunk_size": 1024,
    "gitaly_config": {...}
  }'

Supported Adapters: elasticsearch, postgresql (planned), opensearch (planned)

Operations:

index (default): Index project files as chunks
delete: Remove all chunks for a project

Common Options:

project_id (required): Project ID
operation (defaults to index): Operation type (index|delete)
partition_name (required): Index partition name
partition_number (required): Index partition number
timeout (required): Operation timeout (e.g., 5m, 1h)

Index Operation Options:

from_sha, to_sha: Git commit range
chunk_size: Maximum chunk size in bytes
chunk_overlap: Overlap between chunks in bytes
chunk_strategy: Chunking strategy (see below)
gitaly_config: Gitaly connection configuration
gitaly_batch_size: Batch size for Gitaly operations
elastic_bulk_size: Bulk operation size for Elasticsearch
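
The common options above can be checked before an operation starts. The Go sketch below decodes the JSON, applies the documented default for operation, and verifies the fields marked as required. The struct and function names are hypothetical, and only the JSON keys are taken from this README. Pointer fields are used so that a legitimate value of 0 (as in partition_number) can be told apart from a missing key.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// chunkOptions mirrors the common option names listed above.
// The struct itself is hypothetical; only the JSON keys come from the README.
type chunkOptions struct {
	ProjectID       *int64 `json:"project_id"`
	Operation       string `json:"operation"`
	PartitionName   string `json:"partition_name"`
	PartitionNumber *int   `json:"partition_number"`
	Timeout         string `json:"timeout"`
}

// parseChunkOptions decodes the options JSON, applies the default
// operation, and checks the fields documented as required.
func parseChunkOptions(raw string) (chunkOptions, time.Duration, error) {
	var opts chunkOptions
	if err := json.Unmarshal([]byte(raw), &opts); err != nil {
		return opts, 0, fmt.Errorf("decode options: %w", err)
	}
	if opts.Operation == "" {
		opts.Operation = "index" // documented default
	}
	if opts.Operation != "index" && opts.Operation != "delete" {
		return opts, 0, fmt.Errorf("unsupported operation %q", opts.Operation)
	}
	if opts.ProjectID == nil {
		return opts, 0, errors.New("project_id is required")
	}
	if opts.PartitionName == "" {
		return opts, 0, errors.New("partition_name is required")
	}
	if opts.PartitionNumber == nil {
		return opts, 0, errors.New("partition_number is required")
	}
	if opts.Timeout == "" {
		return opts, 0, errors.New("timeout is required")
	}
	// Values such as "5m" and "1h" are valid Go duration strings.
	timeout, err := time.ParseDuration(opts.Timeout)
	if err != nil {
		return opts, 0, fmt.Errorf("parse timeout: %w", err)
	}
	return opts, timeout, nil
}

func main() {
	raw := `{"project_id": 123, "partition_name": "gitlab-code-search", "partition_number": 0, "timeout": "5m"}`
	opts, timeout, err := parseChunkOptions(raw)
	if err != nil {
		fmt.Println("invalid options:", err)
		return
	}
	fmt.Println(opts.Operation, *opts.ProjectID, timeout)
}
```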

Chunk Strategies

Chunk mode supports different chunking strategies that determine how files are split into chunks:

code_bytes (Default): Uses byte-based chunking optimized for performance. This is the recommended strategy for production use, providing fast and reliable indexing.
code_pre_bert (Experimental): Uses token-based chunking with pre-BERT token size limits.

⚠️ WARNING: This strategy is EXPERIMENTAL and NOT RECOMMENDED for production use. Performance benchmarks show it is approximately 18x slower than code_bytes:
- code_bytes: ~97 seconds to index the GitLab repository
- code_pre_bert: 30+ minutes, often resulting in timeout errors
This strategy should only be used for research and development purposes.

The chunking strategy is configured via the chunk_strategy option in the JSON options passed to chunk mode.
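
To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal Go sketch of byte-based chunking with overlap. It is an illustration of the general technique only, not the code_bytes implementation; the function name splitBytes and its validation rules are assumptions.

```go
package main

import "fmt"

// splitBytes cuts data into windows of at most size bytes, where each
// window starts overlap bytes before the end of the previous one.
// It is an illustrative helper, not the indexer's implementation.
func splitBytes(data []byte, size, overlap int) ([][]byte, error) {
	if size <= 0 {
		return nil, fmt.Errorf("chunk size must be positive, got %d", size)
	}
	if overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("overlap must be in [0, %d), got %d", size, overlap)
	}
	var chunks [][]byte
	step := size - overlap // how far each window advances
	for start := 0; start < len(data); start += step {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		chunks = append(chunks, data[start:end])
		if end == len(data) {
			break // the last window reached the end of the input
		}
	}
	return chunks, nil
}

func main() {
	chunks, err := splitBytes([]byte("abcdefghij"), 4, 1)
	if err != nil {
		fmt.Println(err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With a size of 4 and an overlap of 1, the ten-byte input yields "abcd", "defg", and "ghij". A purely byte-based split like this one can cut through a multi-byte UTF-8 character, which a real implementation would need to account for.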

Building & Installing

Local Build

To build and install the indexer locally:


make
sudo make install

This installs gitlab-elasticsearch-indexer to /usr/local/bin.

You can change the installation path with the PREFIX environment variable. Please remember to pass the -E flag to sudo if you do so.

Example:


PREFIX=/usr sudo -E make install

Development Helpers

The project includes several helpful Makefile targets to assist with development:


# View all available Makefile targets with descriptions
make help

# Run tests in watch mode (automatically re-run on file changes)
make watch-test

Using Docker

You can also build and use the indexer as a Docker image:


docker build . -t gitlab-elasticsearch-indexer

You can edit your shell profile (like ~/.zshrc) to use the image as a binary:


gitlab-elasticsearch-indexer() {
  docker run --rm -it gitlab-elasticsearch-indexer "$@"
}

Lefthook Static Analysis

Lefthook is a Git hooks manager that allows custom logic to be executed prior to Git committing or pushing. gitlab-elasticsearch-indexer comes with Lefthook configuration (lefthook.yml), which helps ensure code quality by running linters and static analysis tools automatically.

The configuration file is checked in but ignored until Lefthook is installed.

Install Lefthook

1. Install Lefthook.
2. Install the Lefthook Git hooks:
```
lefthook install
```
3. Test that Lefthook is working by running the Lefthook pre-push Git hook:
```
lefthook run pre-push
```

Lefthook will now automatically run configured checks before commits and pushes.

Testing

The project includes a comprehensive test suite and developer-friendly testing features to help ensure code quality.

Test Requirements

The test suite expects Gitaly and Elasticsearch to be running on the following ports:

Gitaly: 8075
Elasticsearch v7.14.2: 9201

Make sure you have docker and docker-compose installed. On macOS, you can use colima to run Docker if Docker Desktop is unavailable to you, for example because of its licensing terms.


brew install docker docker-compose colima
colima start

Quick Tests


# Start the test infrastructure (only needed once)
make test-infra

# Source the default connection settings
source .env.test

# Run the test suite
make test

# Run tests in watch mode (auto-rerun on file changes)
make watch-test

# Run a specific test
go test -v gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -run TestIndexingGitlabTest

If you want to re-create the test infrastructure, you can run make test-infra again.

Custom Test Configuration

For testing with custom configurations:

Start only the services you need:


# Start Gitaly
docker-compose up -d gitaly

# Start Elasticsearch
docker-compose up -d elasticsearch

Configure the test environment:


# These are the defaults from .env.test
export GITALY_CONNECTION_INFO='{"address": "tcp://localhost:8075", "storage": "default"}'
export ELASTIC_CONNECTION_INFO='{"url":["http://localhost:9201"], "index_name":"gitlab-test", "index_name_commits":"gitlab-test-commits"}'

Note: When using a Unix socket, use the format unix://FULL_PATH_WITH_LEADING_SLASH

Example with custom Gitaly connection:


# Source default connections
source .env.test

# Override Gitaly connection for GDK
export GITALY_CONNECTION_INFO='{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}'

# Run tests
make test
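
The connection JSON and the socket address rule above can be validated with a small helper. The following Go sketch decodes a GITALY_CONNECTION_INFO value and checks the address format. The type and function names are hypothetical, and accepting only the tcp:// and unix:// forms shown in this README, as well as requiring storage, are assumptions of the sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// gitalyConnection holds the two keys shown in the examples above.
// The type and function names are hypothetical.
type gitalyConnection struct {
	Address string `json:"address"`
	Storage string `json:"storage"`
}

// parseGitalyConnection decodes the JSON and checks the address format.
// Only the two address forms shown in this README are accepted here.
func parseGitalyConnection(raw string) (gitalyConnection, error) {
	var conn gitalyConnection
	if err := json.Unmarshal([]byte(raw), &conn); err != nil {
		return conn, fmt.Errorf("decode GITALY_CONNECTION_INFO: %w", err)
	}
	switch {
	case strings.HasPrefix(conn.Address, "tcp://"):
		// TCP addresses look like tcp://host:port.
	case strings.HasPrefix(conn.Address, "unix://"):
		// The socket path must be absolute, which gives unix:///path.
		path := strings.TrimPrefix(conn.Address, "unix://")
		if !strings.HasPrefix(path, "/") {
			return conn, errors.New("unix socket path must start with a leading slash")
		}
	default:
		return conn, fmt.Errorf("unsupported address %q", conn.Address)
	}
	if conn.Storage == "" {
		return conn, errors.New("storage is required") // assumption of this sketch
	}
	return conn, nil
}

func main() {
	raw := `{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}`
	conn, err := parseGitalyConnection(raw)
	if err != nil {
		fmt.Println("invalid connection info:", err)
		return
	}
	fmt.Printf("address=%s storage=%s\n", conn.Address, conn.Storage)
}
```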

Testing in GDK

You can test changes to the indexer in the GitLab Development Kit (GDK) in multiple ways.

Using the `GITLAB_ELASTICSEARCH_INDEXER_VERSION` File

Warning: Do not create tags to test code. Tags are created for released versions only.

The GITLAB_ELASTICSEARCH_INDEXER_VERSION file accepts commit SHAs and branch names. This method works for both local development and spec execution.

To test a branch or specific commit:

Update the GITLAB_ELASTICSEARCH_INDEXER_VERSION file with your branch name or commit SHA
Run gdk reconfigure to apply the changes

Building a Binary for GDK

To test changes in your GDK, build the indexer with the PREFIX environment variable set to your GDK directory. This installs the indexer directly in the GDK, making it available for immediate testing.


# Build and install directly to GDK
PREFIX=<gdk_install_directory>/gitlab-elasticsearch-indexer make install

Note: Running gdk update will reset the indexer back to the version specified in the GITLAB_ELASTICSEARCH_INDEXER_VERSION file. The specs use this file to build the indexer to <gdk_install_directory>/gitlab/tmp/tests/gitlab-elasticsearch-indexer.

Debugging Elasticsearch calls

Set the ELASTIC_DEBUG environment variable to print all calls to Elasticsearch.

Example:


ELASTIC_DEBUG=1 go test -v -run TestMixedOperationsBulkSizeTracking ./internal/mode/advanced/elastic

Debugging with Delve

Delve is a powerful Go debugger that can help troubleshoot issues.

Start a debugging session with:


dlv test <path-to-package> -- -test.run <regex-matching-test-name>

Example:


dlv test gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -- -test.run ^TestIndexingWikiBlobs$

Common debugging commands:

Set a breakpoint: break <path-to-file>:<line-number>
Continue execution until next breakpoint: continue
Print variable value: print <variable-name>
Step to next source line: next
Exit debugger: exit

For more details, see the Delve documentation.

Obtaining a package or Docker image for testing an MR

GitLab team members can use the build-package-and-qa job in their MR pipeline to trigger a pipeline in the omnibus-gitlab-mirror project. This pipeline produces:

An omnibus-gitlab package for Ubuntu (as an artifact of the Trigger:package job)
A Docker image (in the Trigger:gitlab-docker job)

These artifacts include the changes from the MR and can be used to deploy a GitLab instance locally for testing.

The job is automatically started if the MR includes changes to any of the dependencies of the project, which could potentially break builds in any of the operating systems GitLab provides packages for. For other types of MRs, this is available as a manual job for developers to run when needed.

Configuration Options

The GitLab Elasticsearch Indexer can be configured using both environment variables and command-line flags.

Environment Variables

| Variable | Description | Default | Example |
| --- | --- | --- | --- |
| `GITLAB_INDEXER_MODE` | The indexing mode to use | `advanced` | `advanced`, `chunk` |
| `GITLAB_INDEXER_DEBUG_LOGGING` | Enable debug logging | `false` | `true`, `1` |
| `CORRELATION_ID` | ID for tracking operations across components | Auto-generated | `abc123` |
| `GITALY_CONNECTION_INFO` | Gitaly connection details (JSON) | | `{"address": "unix:///path/to/gitaly.socket", "storage": "default"}` |
| `ELASTIC_CONNECTION_INFO` | Elasticsearch connection details (JSON) | | `{"url":["http://localhost:9200"], "index_name":"gitlab-production", "index_name_commits":"gitlab-production-commits"}` |
| `DEBUG` | Legacy debug mode (deprecated) | | `true` |

Command-line Flags

The indexer supports numerous command-line flags, particularly in advanced mode:

| Flag | Description | Example |
| --- | --- | --- |
| `--version` | Print version information and exit | |
| `--blob-type` | Type of blobs to index | `blob` (default), `wiki_blob` |
| `--skip-commits` | Skip indexing commits for the repo | |
| `--skip-blobs` | Skip indexing blobs for the repo | |
| `--visibility-level` | Project/Group visibility access level | `0`, `10`, `20` |
| `--repository-access-level` | Project repository access level | `0`, `10`, `20` |
| `--wiki-access-level` | Wiki repository access level | `0`, `10`, `20` |
| `--project-id` | Project ID | `42` |
| `--group-id` | Group ID | `24` |
| `--full-path` | Project or group full path | `group/project` |
| `--timeout` | Process timeout duration | `5m`, `1h` |
| `--traversal-ids` | Namespace traversal IDs for indexed documents | `5-1-6-` |
| `--hashed-root-namespace-id` | Hashed root namespace ID | `42` |
| `--schema-version-blob` | Schema version for blob documents (YYMM format) | `2305` |
| `--schema-version-commit` | Schema version for commit documents (YYMM format) | `2305` |
| `--schema-version-wiki` | Schema version for wiki documents (YYMM format) | `2305` |
| `--from-sha` | Starting commit SHA for indexing | `abc123...` |
| `--to-sha` | Ending commit SHA for indexing | `def456...` |
| `--archived` | Whether the project is archived | `true`, `false` |
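
The `--traversal-ids` example value `5-1-6-` ends with a hyphen. One plausible reading, stated here as an assumption rather than documented behavior, is that the value lists namespace IDs from the root downward and that the trailing hyphen keeps prefix comparisons unambiguous. The Go sketch below uses hypothetical helper names to show the effect.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatTraversalIDs joins IDs with "-" and appends a trailing "-",
// producing values shaped like the "5-1-6-" example. Treating the IDs
// as a root-first namespace path is an assumption of this sketch.
func formatTraversalIDs(ids []int64) string {
	var b strings.Builder
	for _, id := range ids {
		b.WriteString(strconv.FormatInt(id, 10))
		b.WriteByte('-')
	}
	return b.String()
}

// isUnder reports whether child starts with the ancestor's traversal IDs.
func isUnder(child, ancestor string) bool {
	return strings.HasPrefix(child, ancestor)
}

func main() {
	project := formatTraversalIDs([]int64{5, 1, 6})
	fmt.Println(project)
	// "5-1-" is a prefix of "5-1-6-".
	fmt.Println(isUnder(project, "5-1-"))
	// "5-1-" is not a prefix of "5-10-", thanks to the trailing hyphen.
	fmt.Println(isUnder(formatTraversalIDs([]int64{5, 10}), "5-1-"))
}
```

Without the trailing hyphen, the value `5-1` would also be a prefix of `5-10`, so namespace 10 would wrongly match a filter for namespace 1.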

Logging

The GitLab Elasticsearch Indexer uses structured JSON logging with the Go standard library's log/slog package. This provides:

Consistent log format with key-value pairs
Configurable log levels
Easy integration with log management systems

Debug Logging

Debug logging can be enabled by setting the GITLAB_INDEXER_DEBUG_LOGGING environment variable:


# Enable debug logging
export GITLAB_INDEXER_DEBUG_LOGGING=true
# or
export GITLAB_INDEXER_DEBUG_LOGGING=1

# Run the indexer with debug logging enabled
gitlab-elasticsearch-indexer [options] /path/to/repo

When debug logging is enabled, you'll see additional information about:

Mode selection and initialization
Elasticsearch queries and responses
Git operations
Performance metrics

Debug logs are automatically formatted as structured JSON for easy filtering and analysis.
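
For readers unfamiliar with log/slog, this Go sketch shows one way a JSON logger could switch its minimum level based on GITLAB_INDEXER_DEBUG_LOGGING. The helper names and the log messages are hypothetical, and the indexer's actual setup may differ. Note that log/slog is part of the standard library from Go 1.21 onward.

```go
package main

import (
	"log/slog"
	"os"
)

// debugEnabled treats "true" and "1" as on, matching the examples above.
// How other values are handled is an assumption of this sketch.
func debugEnabled(raw string) bool {
	return raw == "true" || raw == "1"
}

// newLogger builds a JSON logger whose minimum level depends on debug.
func newLogger(debug bool) *slog.Logger {
	level := slog.LevelInfo
	if debug {
		level = slog.LevelDebug
	}
	handler := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: level})
	return slog.New(handler)
}

func main() {
	logger := newLogger(debugEnabled(os.Getenv("GITLAB_INDEXER_DEBUG_LOGGING")))
	// Key-value pairs become JSON fields in the output line.
	logger.Info("indexer starting", "mode", "advanced")
	logger.Debug("emitted only when debug logging is on", "project_id", 42)
}
```

Each call writes one JSON object per line with time, level, and msg fields plus the supplied key-value pairs, which is what makes the output easy to filter in a log management system.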

CI/CD Configuration

Automatic Tag Creation

The project contains a CI job that automatically creates version tags based on the content of the VERSION file. When changes are merged to the main branch, the system checks if a tag for the current version exists and creates one if needed.

TAG_CREATOR_TOKEN Requirements

To enable automatic tag creation, you need to set up a GitLab CI/CD variable:

Variable Name: TAG_CREATOR_TOKEN
Type: Masked and Protected variable
Requirements:
- Must be a project access token with Developer role
- Scope required: api
- The bot user created with the token must have permission to create protected tags

To set up this token:

Create a project access token with Developer role and api scope
Add the token as a Masked and Protected CI/CD variable in your project settings
Go to your project's Settings > Repository > Protected Tags
Add the project bot user (appears as "Project bot: [project-name]") to the list of users allowed to create protected tags

Contributing

Please see the following documentation for contributing to this project:

Contribution guidelines
Development process documentation - Release process, maintainership, and versioning
Go style guide - Coding conventions, error handling, and logging patterns
