Document Versioning in Amazon OpenSearch Service: OpenSearch as the Source of Truth. Part 3


Apr 11, 2025 - 21:57
In our previous discussion, we emphasized using a primary database as the source of truth, with OpenSearch serving as a search layer. However, certain scenarios necessitate managing document versioning directly within OpenSearch. This article explores strategies for handling document versioning in OpenSearch.

1. Two-Indices Approach

One effective method for managing document versioning involves using two separate indices:

1. Immutable Index:

  • Purpose: Stores every document version as an immutable record, providing a complete audit trail.
  • Advantage: Ensures that no version is overwritten, which is crucial for compliance and historical analysis.

2. Search Interface Index:

  • Purpose: Contains only the latest version of each document.
  • Advantage: Optimized for fast retrieval and efficient queries, as it reduces the amount of data to search through.

Trade-Off: While this dual-index method simplifies compliance and auditability, it significantly increases data storage and indexing operations. Maintaining two indices means higher ingestion costs, increased storage consumption, and more complex query execution, as both indices must remain synchronized.
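As a rough illustration of the dual write, the sketch below builds a single _bulk request body that records a version in the immutable index and replaces the latest copy in the search index. The index names (documents_history, documents_latest) and the version and content fields are assumptions made for this example; the article does not prescribe them.

```python
import json

# Hypothetical index names; the article does not name the two indices.
HISTORY_INDEX = "documents_history"
LATEST_INDEX = "documents_latest"


def build_dual_index_bulk(relation_id, version, content, update_time):
    """Return an NDJSON _bulk body that writes one version to both indices."""
    source = {
        "relationId": relation_id,
        "version": version,      # assumed field, used only in this sketch
        "content": content,      # assumed field, used only in this sketch
        "update_time": update_time,
    }
    lines = [
        # Immutable index: one _id per version. The "create" action fails if
        # that _id already exists, so a stored version is never overwritten.
        {"create": {"_index": HISTORY_INDEX, "_id": f"{relation_id}-v{version}"}},
        source,
        # Search index: one _id per logical document, so "index" replaces
        # the previous version with the latest one.
        {"index": {"_index": LATEST_INDEX, "_id": relation_id}},
        source,
    ]
    # The _bulk API expects newline-delimited JSON that ends with a newline.
    return "\n".join(json.dumps(line) for line in lines) + "\n"


body = build_dual_index_bulk(
    "doc1", 2, "OpenSearch advanced techniques", "2025-03-12T12:00:00Z"
)
print(body)
```

The printed body could be sent with POST _bulk. Both writes travel in one request, but _bulk is not transactional, so a failure of one action still leaves the two indices to be reconciled.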

2. Single-Index Approach for Versioned Documents in OpenSearch

When handling immutable documents with versioning in OpenSearch, a key challenge is ensuring search results reflect only the latest document versions while preserving older content for historical reference. Instead of modifying indices or adding flags like is_latest, we can achieve this with a single optimized query that:

  • Finds documents where the search term appears in either the latest content (latestContent) or previous versions (oldVersionsText).
  • Excludes documents where the term appears only in oldVersionsText.
  • Sorts the remaining hits by update_time so that the most recent versions come first.

Index Structure and Data Handling

Index Name: test_index

Stored Fields:

  • relationId (keyword) – Groups multiple versions of a document.
  • latestContent (text) – Stores the most recent searchable content.
  • oldVersionsText (text) – Stores previous versions of the content.
  • update_time (date) – Timestamp of the document's last update.

How Data is Managed:

  • Document Updates: When a document is updated, a new version is inserted. The previous version’s content is moved to oldVersionsText.
  • Determining Latest Version: The update_time field is used to identify the most recent version.
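The update flow above can be sketched as a small helper that derives the new version's document from the previous one. This is a minimal sketch using the field names from the index mapping in Step 1 (latestContent, oldVersionsText); the function name build_next_version is hypothetical, and it assumes that older texts accumulate in oldVersionsText, oldest first.

```python
def build_next_version(previous, new_content, update_time):
    """Build the document for a new version from the previous version's document."""
    # Assumption: history accumulates, oldest first. Copy the list so the
    # previous (immutable) document is not modified.
    old_versions = list(previous.get("oldVersionsText", []))
    old_versions.append(previous["latestContent"])
    return {
        "relationId": previous["relationId"],
        "latestContent": new_content,
        "oldVersionsText": old_versions,
        "update_time": update_time,
    }


first_version = {
    "relationId": "doc1",
    "latestContent": "OpenSearch performance optimization",
    "oldVersionsText": [],
    "update_time": "2025-03-11T10:00:00Z",
}
second_version = build_next_version(
    first_version, "OpenSearch advanced techniques", "2025-03-12T12:00:00Z"
)
print(second_version)
```

The result is indexed as a new document under its own _id, and the previous document stays in the index unchanged, which is why each relationId can have several hits.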

Important Consideration: Storing older versions in every document increases the index size significantly. Over time, this can impact performance and storage costs. This method, while effective in some scenarios, introduces a multi-step query, which may become a performance bottleneck at scale.

Why a Refined Query is Necessary

If we only search in latestContent, we may miss relevant results because the latest version might not contain the search term, while an older version does.

For example:

  1. A document initially contains “OpenSearch performance optimization” in latestContent.
  2. Later, the document is updated to “OpenSearch advanced techniques”, moving the previous text to oldVersionsText.
  3. A search for “performance optimization” would only find the outdated document unless we refine the query.

Optimized Query: How It Works

  • Searches in latestContent and oldVersionsText.
  • Ensures that if the search term appears only in oldVersionsText, the document is excluded.
  • Sorts the results by update_time so that the most recent versions come first.

Step-by-Step Guide

Step 1: Create the Index

PUT test_index
{
  "mappings": {
    "properties": {
      "relationId": {"type": "keyword"},
      "latestContent": {"type": "text"},
      "oldVersionsText": {"type": "text"},
      "update_time": {"type": "date"}
    }
  }
}

Step 2: Insert Sample Documents

POST test_index/_bulk
{"index": {"_id": "1"}}
{"relationId": "doc1", "latestContent": "OpenSearch advanced techniques", "oldVersionsText": ["OpenSearch performance optimization"], "update_time": "2025-03-12T12:00:00Z"}
{"index": {"_id": "2"}}
{"relationId": "doc2", "latestContent": "OpenSearch index tuning", "oldVersionsText": [], "update_time": "2025-03-12T13:00:00Z"}
{"index": {"_id": "3"}}
{"relationId": "doc1", "latestContent": "OpenSearch performance optimization", "oldVersionsText": [], "update_time": "2025-03-11T10:00:00Z"}

Step 3: Execute the Optimized Query

This query ensures that:

  • The search term appears in latestContent or oldVersionsText.
  • Documents where the term appears only in oldVersionsText are excluded.
  • Results are ordered by update_time, newest first.

GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "latestContent": "performance optimization" } },
        { "match": { "oldVersionsText": "performance optimization" } }
      ],
      "minimum_should_match": 1,
      "must_not": {
        "bool": {
          "must": [
            { "match": { "oldVersionsText": "performance optimization" } },
            { "bool": { "must_not": { "match": { "latestContent": "performance optimization" } } } }
          ]
        }
      }
    }
  },
  "sort": [{ "update_time": "desc" }]
}

How This Query Works

  1. Dual-Field Coverage: The should clause ensures that a document is considered if it contains the term "performance optimization" in either the latest content (latestContent) or in the older versions (oldVersionsText). This guarantees that we capture any document that might be relevant regardless of which field holds the term.
  2. Exclusion of Outdated Matches: The must_not clause excludes documents where the term appears only in oldVersionsText. If a document's latestContent does not contain the search term, that document is not returned, even if an older version does contain it. The inner bool matches documents that hit in oldVersionsText but not in latestContent, and only those documents are filtered out.
  3. Sorting by Update Time: The sort parameter orders the results by update_time in descending order, so the most recent versions appear first. Sorting only orders the hits; older versions of the same relationId that also match remain in the result set.
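To trace the three steps above without a cluster, the following sketch replays the boolean logic over the sample documents from Step 2 in plain Python. It is only an approximation: the tokens and match helpers are hypothetical stand-ins for the standard analyzer and for a match query with its default OR operator, and relevance scoring is ignored.

```python
import re

# The sample documents from Step 2.
DOCS = [
    {"_id": "1", "relationId": "doc1",
     "latestContent": "OpenSearch advanced techniques",
     "oldVersionsText": ["OpenSearch performance optimization"],
     "update_time": "2025-03-12T12:00:00Z"},
    {"_id": "2", "relationId": "doc2",
     "latestContent": "OpenSearch index tuning",
     "oldVersionsText": [],
     "update_time": "2025-03-12T13:00:00Z"},
    {"_id": "3", "relationId": "doc1",
     "latestContent": "OpenSearch performance optimization",
     "oldVersionsText": [],
     "update_time": "2025-03-11T10:00:00Z"},
]


def tokens(value):
    """Lowercased word tokens; a rough stand-in for the standard analyzer."""
    texts = value if isinstance(value, list) else [value]
    return {tok for text in texts for tok in re.findall(r"\w+", text.lower())}


def match(doc, field, query):
    """Approximate a match query: true if any query token occurs in the field."""
    return bool(tokens(query) & tokens(doc.get(field, [])))


def search(docs, query):
    hits = []
    for doc in docs:
        in_latest = match(doc, "latestContent", query)
        in_old = match(doc, "oldVersionsText", query)
        should = in_latest or in_old          # should + minimum_should_match: 1
        excluded = in_old and not in_latest   # the must_not clause
        if should and not excluded:
            hits.append(doc)
    # Same-format ISO 8601 UTC timestamps sort correctly as plain strings.
    return sorted(hits, key=lambda d: d["update_time"], reverse=True)


for term in ("performance optimization", "OpenSearch"):
    print(term, "->", [d["_id"] for d in search(DOCS, term)])
```

Under these assumptions, "performance optimization" selects only document 3: document 1 is excluded because the term appears solely in its oldVersionsText, while document 3, the earlier doc1 version stored as its own document, matches through its own latestContent. A search for "OpenSearch" returns documents 2, 1, and 3 in that order, so both doc1 versions appear in the results. Because a hit must match in latestContent to survive the must_not clause, the combined condition behaves like a match on latestContent alone.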

The Key Points

  • Retrieves all relevant documents: Ensures we don’t miss documents where the term appears in both latestContent and oldVersionsText.
  • Prevents returning outdated documents alone: If the term appears only in an old version, we exclude it.
  • No need for is_latest flags or index modifications: Simplifies indexing by handling filtering at the query level.
  • Balances accuracy and efficiency: Uses OpenSearch’s filtering capabilities without extra processing.

Considerations and Trade-Offs

  • Index Size Impact: Storing previous versions in oldVersionsText increases the index size over time. If document updates are frequent, this may require a cleanup strategy.
  • Query Complexity: This approach involves multiple steps in query execution (searching in both fields, filtering, and sorting), which could lead to performance degradation at scale.
  • Scalability: For high-update environments or large-scale deployments, consider periodic cleanup strategies or even alternative architectures (e.g., the two-indices approach) to maintain performance.
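One further consideration is that sorting by update_time orders hits but does not group them, so several versions of the same relationId can appear in one result set. A possible refinement, which is not part of the query in Step 3, is OpenSearch's collapse search parameter, which keeps the top-sorted hit per value of a keyword field. The sketch below only builds the request body; the function name build_collapsed_search is hypothetical.

```python
import json


def build_collapsed_search(term):
    """Build the Step 3 query body with field collapsing on relationId added."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"latestContent": term}},
                    {"match": {"oldVersionsText": term}},
                ],
                "minimum_should_match": 1,
                "must_not": {
                    "bool": {
                        "must": [
                            {"match": {"oldVersionsText": term}},
                            {"bool": {"must_not": {"match": {"latestContent": term}}}},
                        ]
                    }
                },
            }
        },
        # Keep one hit per relationId; with the sort below, that is the
        # most recently updated hit among the documents that match.
        "collapse": {"field": "relationId"},
        "sort": [{"update_time": "desc"}],
    }


print(json.dumps(build_collapsed_search("performance optimization"), indent=2))
```

The printed body can be sent to GET test_index/_search. Collapsing chooses among matching documents only, so the hit returned for a relationId is its newest matching version, which is not necessarily the newest version overall.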

Conclusion

Managing document versioning directly within OpenSearch is inherently complex. While OpenSearch can serve as the source of truth for versioned documents, it isn’t the optimal standalone solution for all production environments. There’s no one-size-fits-all answer; as many experienced consultants say, “it depends.” By deeply understanding the trade-offs, you can select and tailor the approach that best fits your specific use case.

This refined single-index strategy, leveraging the optimized query above, provides a powerful means to retrieve only the latest relevant document versions while still maintaining a comprehensive history of changes.