GitLab Elasticsearch Indexer

This project indexes Git repositories into Elasticsearch for GitLab. The indexed data enables GitLab to search through code, wikis, and commits in GitLab repositories using Elasticsearch's powerful search capabilities.

The indexer is designed with a modular architecture that supports different indexing modes to optimize for various deployment scenarios. It uses structured logging to help with troubleshooting and debugging.

Dependencies

This project relies on the following dependencies:

  • ICU for text encoding
  • Go 1.20 or later for building from source
  • Gitaly for accessing Git repositories
  • Elasticsearch v7.x or compatible OpenSearch instance

Ensure the development packages for your platform are installed before running make:

Debian / Ubuntu

# apt install libicu-dev

macOS

$ brew install icu4c
$ export PKG_CONFIG_PATH="$(brew --prefix)/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"

Modes Architecture

The GitLab Elasticsearch Indexer supports multiple operating modes that can be configured using the GITLAB_INDEXER_MODE environment variable. Each mode is optimized for different use cases:

Advanced Mode (Default)

The Advanced Mode is the default mode for the indexer. It provides full-featured indexing with support for:

  • Indexing code (blobs), commits, and wikis
  • Project permission handling
  • Namespace traversal IDs
  • Schema versioning

This mode is recommended for most standard GitLab deployments.

export GITLAB_INDEXER_MODE=advanced # default if not specified

Chunk Mode

The Chunk Mode is an alternative indexing approach designed for large repositories or specialized deployment scenarios. This mode is currently under development and will provide enhanced features for handling very large codebases more efficiently.

To select a specific mode, set the GITLAB_INDEXER_MODE environment variable:

export GITLAB_INDEXER_MODE=chunk
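
A wrapper script can check the variable before it launches the indexer. The following is a minimal sketch: resolve_indexer_mode is a hypothetical helper, not part of the indexer. It assumes that an empty value should be treated like an unset one, and that only the two modes described above are valid.

```shell
# Hypothetical wrapper-side helper; it is not part of the indexer.
# Prints the mode that would be used, or fails for a value this README does not list.
resolve_indexer_mode() {
  # Assumption: an unset or empty variable falls back to the documented default.
  mode="${GITLAB_INDEXER_MODE:-advanced}"
  case "$mode" in
    advanced|chunk)
      printf '%s\n' "$mode"
      ;;
    *)
      printf 'unsupported GITLAB_INDEXER_MODE: %s\n' "$mode" >&2
      return 1
      ;;
  esac
}
```

How the indexer itself reacts to an unknown value is not covered here, so a check like this is only a guard in front of it.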

Usage

Chunk mode uses command-line flags to specify the adapter and connection details, with operation-specific options passed as JSON:

gitlab-elasticsearch-indexer \
  -mode chunk \
  -adapter elasticsearch \
  -connection '{"url": ["http://localhost:9200"]}' \
  -options '{
    "project_id": 123,
    "operation": "index",
    "partition_name": "gitlab-code-search",
    "partition_number": 0,
    "timeout": "5m",
    "chunk_size": 1024,
    "gitaly_config": {...}
  }'

Supported Adapters: elasticsearch, postgresql (planned), opensearch (planned)

Operations:

  • index (default): Index project files as chunks
  • delete: Remove all chunks for a project

Common Options:

  • project_id (required): Project ID
  • operation (defaults to index): Operation type (index|delete)
  • partition_name (required): Index partition name
  • partition_number (required): Index partition number
  • timeout (required): Operation timeout (e.g., 5m, 1h)
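
The common options can be assembled in a script rather than written by hand. The sketch below uses a hypothetical helper, chunk_options_json, that emits only the common options listed above and refuses to run when a required value is missing. It does no JSON escaping, so it assumes plain values without quotes or backslashes.

```shell
# Hypothetical helper: print the common chunk-mode options as JSON.
# Arguments: PROJECT_ID PARTITION_NAME PARTITION_NUMBER TIMEOUT [OPERATION]
chunk_options_json() {
  if [ "$#" -lt 4 ]; then
    echo 'usage: chunk_options_json PROJECT_ID PARTITION_NAME PARTITION_NUMBER TIMEOUT [OPERATION]' >&2
    return 1
  fi
  # operation is optional and defaults to index, as documented above.
  operation="${5:-index}"
  case "$operation" in
    index|delete) ;;
    *) echo "unknown operation: $operation" >&2; return 1 ;;
  esac
  # Naive formatting: values are inserted as-is, without JSON escaping.
  printf '{"project_id": %d, "operation": "%s", "partition_name": "%s", "partition_number": %d, "timeout": "%s"}\n' \
    "$1" "$operation" "$2" "$3" "$4"
}
```

The output can be passed straight to the flag, for example -options "$(chunk_options_json 123 gitlab-code-search 0 5m)". An index operation will typically need further options, such as gitaly_config, which this sketch leaves out.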

Index Operation Options:

  • from_sha, to_sha: Git commit range
  • chunk_size: Maximum chunk size in bytes
  • chunk_overlap: Overlap between chunks in bytes
  • chunk_strategy: Chunking strategy (see below)
  • gitaly_config: Gitaly connection configuration
  • gitaly_batch_size: Batch size for Gitaly operations
  • elastic_bulk_size: Bulk operation size for Elasticsearch
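
To get a feel for how chunk_size and chunk_overlap interact, the sketch below estimates the number of chunks for a file of a given size. It assumes a simple sliding window in which consecutive chunks start chunk_size minus chunk_overlap bytes apart. That is an illustrative model, not a description of how the indexer's strategies actually split files, and estimate_chunk_count is a hypothetical name.

```shell
# Hypothetical estimate under a fixed sliding-window assumption.
# Arguments: FILE_SIZE CHUNK_SIZE CHUNK_OVERLAP (all in bytes)
estimate_chunk_count() {
  size=$1
  chunk=$2
  overlap=$3
  if [ "$chunk" -le 0 ] || [ "$overlap" -lt 0 ] || [ "$overlap" -ge "$chunk" ]; then
    echo 'chunk_size must be positive and larger than chunk_overlap' >&2
    return 1
  fi
  # Empty files produce no chunks; small files fit into a single chunk.
  if [ "$size" -le 0 ]; then echo 0; return 0; fi
  if [ "$size" -le "$chunk" ]; then echo 1; return 0; fi
  # Each chunk after the first contributes (chunk - overlap) new bytes.
  stride=$((chunk - overlap))
  # Ceiling division of the bytes remaining after the first overlap by the stride.
  echo $(( (size - overlap + stride - 1) / stride ))
}
```

Under this model, a 2500-byte file with chunk_size 1024 and chunk_overlap 128 yields three chunks, starting at bytes 0, 896, and 1792.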

Chunk Strategies

Chunk mode supports different chunking strategies that determine how files are split into chunks:

  • code_bytes (Default): Uses byte-based chunking optimized for performance. This is the recommended strategy for production use, providing fast and reliable indexing.

  • code_pre_bert (Experimental): Uses token-based chunking with pre-BERT token size limits.

    ⚠️ WARNING: This strategy is EXPERIMENTAL and NOT RECOMMENDED for production use. Performance benchmarks show it is approximately 18x slower than code_bytes:

    • code_bytes: ~97 seconds to index the GitLab repository
    • code_pre_bert: 30+ minutes, often resulting in timeout errors

    This strategy should only be used for research and development purposes.

The chunking strategy is configured via the chunk_strategy option in the JSON options passed to chunk mode.

Building & Installing

Local Build

To build and install the indexer locally:

make
sudo make install

gitlab-elasticsearch-indexer is installed to /usr/local/bin by default.

You can change the installation path with the PREFIX environment variable. If you do, pass the -E flag to sudo so that the variable is preserved.

Example:

PREFIX=/usr sudo -E make install

Development Helpers

The project includes several helpful Makefile targets to assist with development:

# View all available Makefile targets with descriptions
make help
# Run tests in watch mode (automatically re-run on file changes)
make watch-test

Using Docker

You can also build and use the indexer as a Docker image:

docker build . -t gitlab-elasticsearch-indexer

You can edit your shell profile (like ~/.zshrc) to use the image as a binary:

function gitlab-elasticsearch-indexer() {
  docker run --rm -it gitlab-elasticsearch-indexer "$@"
}

Lefthook Static Analysis

Lefthook is a Git hooks manager that allows custom logic to be executed prior to Git committing or pushing. gitlab-elasticsearch-indexer comes with Lefthook configuration (lefthook.yml), which helps ensure code quality by running linters and static analysis tools automatically.

The configuration file is checked in but ignored until Lefthook is installed.

Install Lefthook

  1. Install lefthook

  2. Install Lefthook Git hooks:

    lefthook install
  3. Test Lefthook is working by running the Lefthook pre-push Git hook:

    lefthook run pre-push

Lefthook will now automatically run configured checks before commits and pushes.

Testing

The project includes a comprehensive test suite and developer-friendly testing features to help ensure code quality.

Test Requirements

The test suite expects Gitaly and Elasticsearch to be running on the following ports:

  • Gitaly: 8075
  • Elasticsearch v7.14.2: 9201

Make sure you have docker and docker-compose installed. On macOS, you can use colima to run Docker since Docker Desktop cannot be used due to licensing.

brew install docker docker-compose colima
colima start

Quick Tests

# Start the test infrastructure (only needed once)
make test-infra
# Source the default connection settings
source .env.test
# Run the test suite
make test
# Run tests in watch mode (auto-rerun on file changes)
make watch-test
# Run a specific test
go test -v gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -run TestIndexingGitlabTest

If you want to re-create the test infrastructure, you can run make test-infra again.

Custom Test Configuration

For testing with custom configurations:

  1. Start only the services you need:

    # Start Gitaly
    docker-compose up -d gitaly
    # Start Elasticsearch
    docker-compose up -d elasticsearch
  2. Configure the test environment:

    # These are the defaults from .env.test
    export GITALY_CONNECTION_INFO='{"address": "tcp://localhost:8075", "storage": "default"}'
    export ELASTIC_CONNECTION_INFO='{"url":["http://localhost:9201"], "index_name":"gitlab-test", "index_name_commits":"gitlab-test-commits"}'

    Note: When using a Unix socket, use the format unix://FULL_PATH_WITH_LEADING_SLASH

    Example with custom Gitaly connection:

    # Source default connections
    source .env.test
    # Override Gitaly connection for GDK
    export GITALY_CONNECTION_INFO='{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}'
    # Run tests
    make test
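
The three slashes in a Unix socket address come from the unix:// scheme followed by an absolute path. A small helper can enforce that when building the variable. The sketch below is illustrative only: gitaly_unix_connection_info is a hypothetical name, and the storage name defaults to "default" as in the examples above.

```shell
# Hypothetical helper: print GITALY_CONNECTION_INFO JSON for a Unix socket.
# Arguments: ABSOLUTE_SOCKET_PATH [STORAGE]
gitaly_unix_connection_info() {
  case "$1" in
    /*) ;;  # absolute path: unix:// plus /path gives the three-slash form
    *) echo "socket path must be absolute: $1" >&2; return 1 ;;
  esac
  # Naive formatting: the path and storage name are inserted without JSON escaping.
  printf '{"address": "unix://%s", "storage": "%s"}\n' "$1" "${2:-default}"
}
```

For example, export GITALY_CONNECTION_INFO="$(gitaly_unix_connection_info /gitlab/gdk/gitaly.socket)" produces the same value as the GDK override shown above.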

Testing in GDK

You can test changes to the indexer in the GitLab Development Kit (GDK) in multiple ways.

Using the GITLAB_ELASTICSEARCH_INDEXER_VERSION File

Warning: Do not create tags to test code. Tags are created for released versions only.

The GITLAB_ELASTICSEARCH_INDEXER_VERSION file accepts commit SHAs and branch names. This method works for both local development and spec execution.

To test a branch or specific commit:

  1. Update the GITLAB_ELASTICSEARCH_INDEXER_VERSION file with your branch name or commit SHA
  2. Run gdk reconfigure to apply the changes

Building a Binary for GDK

To test changes in your GDK, build the indexer with the PREFIX environment variable set to the indexer directory inside your GDK. This installs the indexer directly in the GDK, making it available for immediate testing:

# Build and install directly to GDK
PREFIX=<gdk_install_directory>/gitlab-elasticsearch-indexer make install

Note: Running gdk update will reset the indexer back to the version specified in the GITLAB_ELASTICSEARCH_INDEXER_VERSION file. The specs use this file to build the indexer to <gdk_install_directory>/gitlab/tmp/tests/gitlab-elasticsearch-indexer.

Debugging Elasticsearch calls

Set the ELASTIC_DEBUG environment variable to print all calls to Elasticsearch.

Example:

ELASTIC_DEBUG=1 go test -v -run TestMixedOperationsBulkSizeTracking ./internal/mode/advanced/elastic

Debugging with Delve

Delve is a powerful Go debugger that can help troubleshoot issues.

Start a debugging session with:

dlv test <path-to-package> -- -test.run <regex-matching-test-name>

Example:

dlv test gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -- -test.run ^TestIndexingWikiBlobs$

Common debugging commands:

  • Set a breakpoint: break <path-to-file>:<line-number>
  • Continue execution until next breakpoint: continue
  • Print variable value: print <variable-name>
  • Step to next source line: next
  • Exit debugger: exit

For more details, see the Delve documentation.

Obtaining a package or Docker image for testing an MR

GitLab team members can use the build-package-and-qa job in their MR pipeline to trigger a pipeline in the omnibus-gitlab-mirror project. This pipeline produces:

  • An omnibus-gitlab package for Ubuntu (as an artifact of the Trigger:package job)
  • A Docker image (in the Trigger:gitlab-docker job)

These artifacts include the changes from the MR and can be used to deploy a GitLab instance locally for testing.

The job is automatically started if the MR includes changes to any of the dependencies of the project, which could potentially break builds in any of the operating systems GitLab provides packages for. For other types of MRs, this is available as a manual job for developers to run when needed.

Configuration Options

The GitLab Elasticsearch Indexer can be configured using both environment variables and command-line flags.

Environment Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| GITLAB_INDEXER_MODE | The indexing mode to use | advanced | advanced, chunk |
| GITLAB_INDEXER_DEBUG_LOGGING | Enable debug logging | false | true, 1 |
| CORRELATION_ID | ID for tracking operations across components | Auto-generated | abc123 |
| GITALY_CONNECTION_INFO | Gitaly connection details (JSON) | | {"address": "unix:///path/to/gitaly.socket", "storage": "default"} |
| ELASTIC_CONNECTION_INFO | Elasticsearch connection details (JSON) | | {"url":["http://localhost:9200"], "index_name":"gitlab-production", "index_name_commits":"gitlab-production-commits"} |
| DEBUG | Legacy debug mode (deprecated) | | true |

Command-line Flags

The indexer supports numerous command-line flags, particularly in advanced mode:

| Flag | Description | Example |
|------|-------------|---------|
| --version | Print version information and exit | |
| --blob-type | Type of blobs to index | blob (default), wiki_blob |
| --skip-commits | Skip indexing commits for the repo | |
| --visibility-level | Project/Group visibility access level | 0, 10, 20 |
| --repository-access-level | Project repository access level | 0, 10, 20 |
| --wiki-access-level | Wiki repository access level | 0, 10, 20 |
| --project-id | Project ID | 42 |
| --group-id | Group ID | 24 |
| --full-path | Project or group full path | group/project |
| --timeout | Process timeout duration | 5m, 1h |
| --traversal-ids | Namespace traversal IDs for indexed documents | 5-1-6- |
| --hashed-root-namespace-id | Hashed root namespace ID | 42 |
| --schema-version-blob | Schema version for blob documents (YYMM format) | 2305 |
| --schema-version-commit | Schema version for commit documents (YYMM format) | 2305 |
| --schema-version-wiki | Schema version for wiki documents (YYMM format) | 2305 |
| --from-sha | Starting commit SHA for indexing | abc123... |
| --to-sha | Ending commit SHA for indexing | def456... |
| --archived | Whether the project is archived | true, false |

Logging

The GitLab Elasticsearch Indexer uses structured JSON logging with the Go standard library's log/slog package. This provides:

  • Consistent log format with key-value pairs
  • Configurable log levels
  • Easy integration with log management systems

Debug Logging

Debug logging can be enabled by setting the GITLAB_INDEXER_DEBUG_LOGGING environment variable:

# Enable debug logging
export GITLAB_INDEXER_DEBUG_LOGGING=true
# or
export GITLAB_INDEXER_DEBUG_LOGGING=1
# Run the indexer with debug logging enabled
gitlab-elasticsearch-indexer [options] /path/to/repo

When debug logging is enabled, you'll see additional information about:

  • Mode selection and initialization
  • Elasticsearch queries and responses
  • Git operations
  • Performance metrics

Debug logs are automatically formatted as structured JSON for easy filtering and analysis.
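
Because each log entry is a JSON object on its own line, ordinary text tools can filter by level. The sketch below assumes the default key names and compact encoding of log/slog's JSON handler (a top-level "level" key with values such as DEBUG or INFO). Whether the indexer customizes these is not stated here, and filter_log_level is a hypothetical helper.

```shell
# Hypothetical helper: keep only log lines of one level.
# Assumption: compact JSON lines containing "level":"<LEVEL>" as written by
# log/slog's JSON handler with default settings.
# Argument: LEVEL (for example DEBUG). Log lines are read from stdin.
filter_log_level() {
  grep -F "\"level\":\"$1\""
}
```

This is a plain substring match, so a message that happens to contain the same text would also pass. A JSON-aware tool is a better fit for anything beyond quick inspection.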

CI/CD Configuration

Automatic Tag Creation

The project contains a CI job that automatically creates version tags based on the content of the VERSION file. When changes are merged to the main branch, the system checks if a tag for the current version exists and creates one if needed.
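
The check described above can be sketched as a small shell function. This is not the project's CI job: missing_version_tag is a hypothetical helper, and the tag name format (a "v" prefix followed by the content of the VERSION file) is an assumption.

```shell
# Hypothetical sketch of the "does a tag for this version exist?" check.
# Argument: path to the VERSION file. Existing tag names are read from stdin, one per line.
# Prints the tag that would need to be created, or nothing if it already exists.
missing_version_tag() {
  # Assumption: tags are named "v" followed by the VERSION file content.
  tag="v$(tr -d '[:space:]' < "$1")"
  # -x matches whole lines only, -F treats the tag as a literal string.
  if grep -qxF -- "$tag"; then
    return 0
  fi
  printf '%s\n' "$tag"
}
```

In a checkout, git tag --list | missing_version_tag VERSION would print the tag name only when it has not been created yet.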

TAG_CREATOR_TOKEN Requirements

To enable automatic tag creation, you need to set up a GitLab CI/CD variable:

  • Variable Name: TAG_CREATOR_TOKEN
  • Type: Masked and Protected variable
  • Requirements:
    • Must be a project access token with Developer role
    • Scope required: api
    • The bot user created with the token must have permission to create protected tags

To set up this token:

  1. Create a project access token with Developer role and api scope
  2. Add the token as a Masked and Protected CI/CD variable in your project settings
  3. Go to your project's Settings > Repository > Protected Tags
  4. Add the project bot user (appears as "Project bot: [project-name]") to the list of users allowed to create protected tags

Contributing

Please see the following documentation for contributing to this project:
