This project indexes Git repositories into Elasticsearch for GitLab. The indexed data enables GitLab to search through code, wikis, and commits in GitLab repositories using Elasticsearch's powerful search capabilities.
The indexer is designed with a modular architecture that supports different indexing modes to optimize for various deployment scenarios. It uses structured logging to help with troubleshooting and debugging.
This project relies on the following dependencies:
Ensure the development packages for your platform are installed before running make:
On Debian or Ubuntu:
# apt install libicu-dev
On macOS (with Homebrew):
$ brew install icu4c
$ export PKG_CONFIG_PATH="$(brew --prefix)/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"
The GitLab Elasticsearch Indexer supports multiple operating modes that can be configured using the GITLAB_INDEXER_MODE environment variable. Each mode is optimized for different use cases:
The Advanced Mode is the default mode for the indexer. It provides full-featured indexing of code, wikis, and commits.
This mode is recommended for most standard GitLab deployments.
export GITLAB_INDEXER_MODE=advanced # default if not specified
The Chunk Mode is an alternative indexing approach designed for large repositories or specialized deployment scenarios. This mode is currently under development and will provide enhanced features for handling very large codebases more efficiently.
To select a specific mode, set the GITLAB_INDEXER_MODE environment variable:
export GITLAB_INDEXER_MODE=chunk
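A wrapper script can mirror the fallback to advanced mode before it calls the indexer. The sketch below is not part of the indexer: resolve_indexer_mode is a hypothetical helper, and rejecting unknown values is an assumption about what a caller might want.

```shell
# Hypothetical helper: print the mode a wrapper script would pass on.
# Falls back to "advanced" when GITLAB_INDEXER_MODE is unset or empty.
resolve_indexer_mode() {
  _mode="${GITLAB_INDEXER_MODE:-advanced}"
  case "$_mode" in
    advanced|chunk)
      printf '%s\n' "$_mode"
      ;;
    *)
      # Assumption: treat anything else as a configuration error.
      printf 'unsupported GITLAB_INDEXER_MODE: %s\n' "$_mode" >&2
      return 1
      ;;
  esac
}
```

With the variable unset, resolve_indexer_mode prints advanced; with GITLAB_INDEXER_MODE=chunk it prints chunk.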
Chunk mode uses command-line flags to specify the adapter and connection details, with operation-specific options passed as JSON:
gitlab-elasticsearch-indexer \
  -mode chunk \
  -adapter elasticsearch \
  -connection '{"url": ["http://localhost:9200"]}' \
  -options '{
    "project_id": 123,
    "operation": "index",
    "partition_name": "gitlab-code-search",
    "partition_number": 0,
    "timeout": "5m",
    "chunk_size": 1024,
    "gitaly_config": {...}
  }'
Supported Adapters: elasticsearch, postgresql (planned), opensearch (planned)
Operations:
- index (default): Index project files as chunks
- delete: Remove all chunks for a project

Common Options:
- project_id (required): Project ID
- operation (defaults to index): Operation type (index|delete)
- partition_name (required): Index partition name
- partition_number (required): Index partition number
- timeout (required): Operation timeout (e.g., 5m, 1h)

Index Operation Options:
- from_sha, to_sha: Git commit range
- chunk_size: Maximum chunk size in bytes
- chunk_overlap: Overlap between chunks in bytes
- chunk_strategy: Chunking strategy (see below)
- gitaly_config: Gitaly connection configuration
- gitaly_batch_size: Batch size for Gitaly operations
- elastic_bulk_size: Bulk operation size for Elasticsearch

Chunk mode supports different chunking strategies that determine how files are split into chunks:
- code_bytes (Default): Uses byte-based chunking optimized for performance. This is the recommended strategy for production use, providing fast and reliable indexing.
- code_pre_bert (Experimental): Uses token-based chunking with pre-BERT token size limits.

⚠️ WARNING: The code_pre_bert strategy is EXPERIMENTAL and NOT RECOMMENDED for production use. Performance benchmarks show it is approximately 18x slower than code_bytes:
- code_bytes: ~97 seconds to index the GitLab repository
- code_pre_bert: 30+ minutes, often resulting in timeout errors

This strategy should only be used for research and development purposes.
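To make the interaction between chunk_size and chunk_overlap concrete, the following sketch prints the byte ranges a simple sliding-window chunker would produce. It is an illustration under assumptions, not the indexer's implementation: chunk_ranges is a hypothetical helper, and the rule that each chunk starts chunk_overlap bytes before the end of the previous one is assumed rather than taken from the code_bytes source.

```shell
# Hypothetical helper: print "start end" byte ranges (end exclusive) for
# TOTAL bytes split into windows of SIZE bytes that overlap by OVERLAP bytes.
# Assumes a plain sliding window; the real code_bytes strategy may pick
# chunk boundaries differently.
chunk_ranges() {
  _total=$1
  _size=$2
  _overlap=$3
  if [ "$_size" -le 0 ] || [ "$_overlap" -lt 0 ] || [ "$_overlap" -ge "$_size" ]; then
    echo "need size > 0 and 0 <= overlap < size" >&2
    return 1
  fi
  _start=0
  while [ "$_start" -lt "$_total" ]; do
    _end=$((_start + _size))
    # Clamp the last window to the end of the input.
    if [ "$_end" -gt "$_total" ]; then
      _end=$_total
    fi
    printf '%s %s\n' "$_start" "$_end"
    if [ "$_end" -eq "$_total" ]; then
      break
    fi
    # The next window begins OVERLAP bytes before the previous one ended.
    _start=$((_end - _overlap))
  done
}
```

For example, chunk_ranges 10 4 1 prints the ranges 0 4, 3 7, and 6 10, one per line.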
The chunking strategy is configured via the chunk_strategy option in the JSON options passed to chunk mode.
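As a sketch of how chunk_strategy sits alongside the other options, the helper below assembles an -options payload. build_chunk_options is a hypothetical function, the values (including the chunk_overlap of 128) are placeholders, and gitaly_config is left out because its shape is not described here.

```shell
# Hypothetical helper: print a JSON options payload for chunk mode.
# Arguments: project ID, then chunk strategy (defaults to code_bytes).
# gitaly_config is omitted; add it according to your Gitaly setup.
build_chunk_options() {
  _project_id=$1
  _strategy=${2:-code_bytes}
  cat <<EOF
{
  "project_id": $_project_id,
  "operation": "index",
  "partition_name": "gitlab-code-search",
  "partition_number": 0,
  "timeout": "5m",
  "chunk_size": 1024,
  "chunk_overlap": 128,
  "chunk_strategy": "$_strategy"
}
EOF
}

# Example (not run here): pass the payload to the indexer.
# gitlab-elasticsearch-indexer -mode chunk -adapter elasticsearch \
#   -connection '{"url": ["http://localhost:9200"]}' \
#   -options "$(build_chunk_options 123 code_bytes)"
```

Keeping the payload in one function makes it easy to switch strategies for an experiment without retyping the rest of the options.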
To build and install the indexer locally:
make
sudo make install
By default, gitlab-elasticsearch-indexer is installed to /usr/local/bin.
You can change the installation path with the PREFIX environment variable. If you do, pass the -E flag to sudo so that the variable is preserved.
Example:
PREFIX=/usr sudo -E make install
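If a script needs to know where the binary ends up, the path can be derived from PREFIX. This sketch rests on two assumptions that follow from the defaults above but are not stated outright: that PREFIX defaults to /usr/local, and that the binary is placed in the bin directory beneath it. indexer_install_path is a hypothetical helper.

```shell
# Hypothetical helper: print the expected location of the installed binary.
# Assumes the Makefile installs to $PREFIX/bin and defaults PREFIX to /usr/local.
indexer_install_path() {
  printf '%s/bin/gitlab-elasticsearch-indexer\n' "${PREFIX:-/usr/local}"
}
```

With PREFIX=/usr this prints /usr/bin/gitlab-elasticsearch-indexer.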
The project includes several helpful Makefile targets to assist with development:
# View all available Makefile targets with descriptions
make help
# Run tests in watch mode (automatically re-run on file changes)
make watch-test
You can also build and use the indexer as a Docker image:
docker build . -t gitlab-elasticsearch-indexer
You can edit your shell profile (like ~/.zshrc) to use the image as a binary:
function gitlab-elasticsearch-indexer() {
  docker run --rm -it gitlab-elasticsearch-indexer "$@"
}
Lefthook is a Git hooks manager that allows
custom logic to be executed prior to Git committing or pushing. gitlab-elasticsearch-indexer
comes with Lefthook configuration (lefthook.yml), which helps ensure code quality by running
linters and static analysis tools automatically.
The configuration file is checked in but ignored until Lefthook is installed.
Install Lefthook Git hooks:
lefthook install
Test Lefthook is working by running the Lefthook pre-push Git hook:
lefthook run pre-push
Lefthook will now automatically run configured checks before commits and pushes.
The project includes a comprehensive test suite and developer-friendly testing features to help ensure code quality.
The test suite expects Gitaly and Elasticsearch to be running. With the default settings in .env.test, Gitaly is reached on port 8075 and Elasticsearch on port 9201.
Make sure you have docker and docker-compose installed. On macOS, you can use colima to run Docker since Docker Desktop cannot be used due to licensing.
brew install docker docker-compose colima
colima start
# Start the test infrastructure (only needed once)
make test-infra
# Source the default connection settings
source .env.test
# Run the test suite
make test
# Run tests in watch mode (auto-rerun on file changes)
make watch-test
# Run a specific test
go test -v gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -run TestIndexingGitlabTest
If you want to re-create the test infrastructure, you can run make test-infra again.
For testing with custom configurations:
Start only the services you need:
# Start Gitaly
docker-compose up -d gitaly
# Start Elasticsearch
docker-compose up -d elasticsearch
Configure the test environment:
# These are the defaults from .env.test
export GITALY_CONNECTION_INFO='{"address": "tcp://localhost:8075", "storage": "default"}'
export ELASTIC_CONNECTION_INFO='{"url":["http://localhost:9201"], "index_name":"gitlab-test", "index_name_commits":"gitlab-test-commits"}'
Note: When using a Unix socket, use the format unix://FULL_PATH_WITH_LEADING_SLASH
Example with custom Gitaly connection:
# Source default connections
source .env.test
# Override Gitaly connection for GDK
export GITALY_CONNECTION_INFO='{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}'
# Run tests
make test
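The Unix socket format is easy to get wrong, because unix:// followed by an absolute path yields three slashes. The sketch below builds the connection JSON from a socket path and refuses relative paths. gitaly_socket_info is a hypothetical helper, not something shipped with the project.

```shell
# Hypothetical helper: print GITALY_CONNECTION_INFO JSON for a Unix socket.
# Arguments: absolute socket path, then storage name (defaults to "default").
gitaly_socket_info() {
  case "$1" in
    /*)
      printf '{"address": "unix://%s", "storage": "%s"}\n' "$1" "${2:-default}"
      ;;
    *)
      # The address needs the full path, including its leading slash.
      echo "socket path must be absolute: $1" >&2
      return 1
      ;;
  esac
}
```

A possible use is export GITALY_CONNECTION_INFO="$(gitaly_socket_info /gitlab/gdk/gitaly.socket)", which produces the same value as the GDK override shown above.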
You can test changes to the indexer in the GitLab Development Kit (GDK) in multiple ways.
Warning: Do not create tags to test code. Tags are created for released versions only.
The GITLAB_ELASTICSEARCH_INDEXER_VERSION file accepts commit SHAs and branch names. This method works for both local development and spec execution.
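To test a branch or specific commit, write the branch name or commit SHA to the GITLAB_ELASTICSEARCH_INDEXER_VERSION file, then run gdk reconfigure to apply the changes.
You can also test changes to the indexer in your GDK by running make install with the PREFIX environment variable set to your GDK directory:
# Build and install directly to GDK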
PREFIX=<gdk_install_directory>/gitlab-elasticsearch-indexer make install
Note: Running gdk update will reset the indexer back to the version specified in the GITLAB_ELASTICSEARCH_INDEXER_VERSION file. The specs use this file to build the indexer to <gdk_install_directory>/gitlab/tmp/tests/gitlab-elasticsearch-indexer.
Set the ELASTIC_DEBUG environment variable to print all calls to Elasticsearch.
Example:
ELASTIC_DEBUG=1 go test -v -run TestMixedOperationsBulkSizeTracking ./internal/mode/advanced/elastic
Delve is a powerful Go debugger that can help troubleshoot issues.
Start a debugging session with:
dlv test <path-to-package> -- -test.run <regex-matching-test-name>
Example:
dlv test gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -- -test.run '^TestIndexingWikiBlobs$'
Common debugging commands:
- break <path-to-file>:<line-number>
- continue
- print <variable-name>
- next
- exit

For more details, see the Delve documentation.
GitLab team members can use the build-package-and-qa job in their MR pipeline
to trigger a pipeline in the omnibus-gitlab-mirror project. This pipeline
produces:
- An omnibus-gitlab package for Ubuntu (as an artifact of the Trigger:package job)
- A Docker image (from the Trigger:gitlab-docker job)

These artifacts include the changes from the MR and can be used to deploy a GitLab instance locally for testing.
The job is automatically started if the MR includes changes to any of the dependencies of the project, which could potentially break builds in any of the operating systems GitLab provides packages for. For other types of MRs, this is available as a manual job for developers to run when needed.
The GitLab Elasticsearch Indexer can be configured using both environment variables and command-line flags.
| Variable | Description | Default | Example |
|---|---|---|---|
| GITLAB_INDEXER_MODE | The indexing mode to use | advanced | advanced, chunk |
| GITLAB_INDEXER_DEBUG_LOGGING | Enable debug logging | false | true, 1 |
| CORRELATION_ID | ID for tracking operations across components | Auto-generated | abc123 |
| GITALY_CONNECTION_INFO | Gitaly connection details (JSON) | | {"address": "unix:///path/to/gitaly.socket", "storage": "default"} |
| ELASTIC_CONNECTION_INFO | Elasticsearch connection details (JSON) | | {"url":["http://localhost:9200"], "index_name":"gitlab-production", "index_name_commits":"gitlab-production-commits"} |
| DEBUG | Legacy debug mode (deprecated) | | true |
The indexer supports numerous command-line flags, particularly in advanced mode:
| Flag | Description | Example |
|---|---|---|
| --version | Print version information and exit | |
| --blob-type | Type of blobs to index | blob (default), wiki_blob |
| --skip-commits | Skip indexing commits for the repo | |
| --visibility-level | Project/Group visibility access level | 0, 10, 20 |
| --repository-access-level | Project repository access level | 0, 10, 20 |
| --wiki-access-level | Wiki repository access level | 0, 10, 20 |
| --project-id | Project ID | 42 |
| --group-id | Group ID | 24 |
| --full-path | Project or group full path | group/project |
| --timeout | Process timeout duration | 5m, 1h |
| --traversal-ids | Namespace traversal IDs for indexed documents | 5-1-6- |
| --hashed-root-namespace-id | Hashed root namespace ID | 42 |
| --schema-version-blob | Schema version for blob documents (YYMM format) | 2305 |
| --schema-version-commit | Schema version for commit documents (YYMM format) | 2305 |
| --schema-version-wiki | Schema version for wiki documents (YYMM format) | 2305 |
| --from-sha | Starting commit SHA for indexing | abc123... |
| --to-sha | Ending commit SHA for indexing | def456... |
| --archived | Whether the project is archived | true, false |
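To show how several of these flags might be combined, the sketch below prints a command line instead of running it. The flag names come from the table, but the particular combination, the values, and the print_index_command helper itself are illustrative assumptions. Check the indexer's own help output for the flags a real run requires.

```shell
# Hypothetical dry-run helper: print (rather than execute) an advanced-mode
# command line. Arguments: project ID, full path, repository path.
# The values for --blob-type and --timeout are placeholders.
print_index_command() {
  _project_id=$1
  _full_path=$2
  _repo=$3
  printf 'gitlab-elasticsearch-indexer --project-id %s --full-path %s --blob-type blob --timeout 5m %s\n' \
    "$_project_id" "$_full_path" "$_repo"
}
```

Printing the command first makes it easy to review the flags before running anything against a real cluster.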
The GitLab Elasticsearch Indexer uses structured JSON logging with the Go standard library's log/slog package.
Debug logging can be enabled by setting the GITLAB_INDEXER_DEBUG_LOGGING environment variable:
# Enable debug logging
export GITLAB_INDEXER_DEBUG_LOGGING=true
# or
export GITLAB_INDEXER_DEBUG_LOGGING=1
# Run the indexer with debug logging enabled
gitlab-elasticsearch-indexer [options] /path/to/repo
When debug logging is enabled, the indexer emits additional detail about its operations.
Debug logs are automatically formatted as structured JSON for easy filtering and analysis.
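Because each record is a single JSON line, ordinary text tools are enough for a first pass at filtering. The sketch below assumes the default log/slog JSON handler layout, in which a debug record contains "level":"DEBUG". The indexer may customize its keys, and only_debug is a hypothetical helper.

```shell
# Hypothetical helper: keep only DEBUG records from JSON log lines on stdin.
# Assumes one record per line with the default slog "level" key and no
# whitespace around the colon; adjust the pattern if the layout differs.
only_debug() {
  grep -F '"level":"DEBUG"'
}
```

A possible use is gitlab-elasticsearch-indexer [options] /path/to/repo 2>&1 | only_debug, where merging stderr into stdout covers either output stream.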
The project contains a CI job that automatically creates version tags based on the content of the VERSION file. When changes are merged to the main branch, the system checks if a tag for the current version exists and creates one if needed.
To enable automatic tag creation, you need to set up a GitLab CI/CD variable:
- TAG_CREATOR_TOKEN: an access token with the api scope

To set up this token:
Please see the following documentation for contributing to this project: