
Conversation

@jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Dec 17, 2025

Purpose and background context

With TDA 3.8.0, we can now retrieve record metadata columns in embeddings read methods. Filtering embeddings by action="index" prevents any attempt to update documents that do not exist in OpenSearch (action="delete"), which results in an API error. This is especially important given the current state of tim.opensearch.bulk_update, which will raise a BulkOperationError and cause the 'bulk_update_embeddings' CLI command to exit early.

This also includes a change to index embeddings when performing a reindex.
Note: This piece was straightforward, though it does introduce some redundancy; I think it's fine for now 🤓.
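The action="index" filter described above can be sketched in plain Python. This is an illustrative sketch only, not TDA's actual API: the EmbeddingRow class and updatable_embeddings helper are hypothetical names, standing in for rows returned by TDA's embeddings read methods with their record metadata columns.

```python
# Hypothetical sketch of the action="index" filter; the real TDA read
# methods and their parameters may differ.
from dataclasses import dataclass


@dataclass
class EmbeddingRow:
    timdex_record_id: str
    action: str  # "index" or "delete", from the record metadata columns


def updatable_embeddings(rows):
    """Yield only embeddings whose source record was indexed.

    Rows whose record carries action="delete" have no matching OpenSearch
    document, so attempting a bulk update on them raises an API error.
    """
    for row in rows:
        if row.action == "index":
            yield row


rows = [
    EmbeddingRow("libguides:guides-175846", "index"),
    EmbeddingRow("libguides:guides-000000", "delete"),
]
print([r.timdex_record_id for r in updatable_embeddings(rows)])
# → ['libguides:guides-175846']
```

Skipping the "delete" rows up front is what keeps bulk_update from ever seeing a document it cannot update.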

How can a reviewer manually see the effects of these changes?

The embeddings in tests/fixtures/dataset were updated to align with TDA v3.8.0, which renamed the timestamp column to embeddings_timestamp.

Review the contents of TIMDEX dataset at tests/fixtures/dataset

  1. Create a .env file and set:
TIMDEX_DATASET_LOCATION="tests/fixtures/dataset"
  2. Run an IPython terminal:
pipenv run ipython
  3. Load the TIMDEX dataset:
import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.embeddings import TIMDEXEmbeddings
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])
td.metadata.rebuild_dataset_metadata()
  4. Query data.embeddings:
td.conn.query("select run_id, embedding_timestamp, timdex_record_id from data.embeddings order by embedding_timestamp")

Output:

┌─────────────────────────────────────┬───────────────────────────────┬─────────────────────────┐
│               run_id                │      embedding_timestamp      │    timdex_record_id     │
│               varchar               │   timestamp with time zone    │         varchar         │
├─────────────────────────────────────┼───────────────────────────────┼─────────────────────────┤
│ 21e7f272-7b96-480c-9c25-36075355f…  │ 2025-12-17 12:40:40.674548-05 │ libguides:guides-175846 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686431-05 │ libguides:guides-175846 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686794-05 │ libguides:guides-175847 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687062-05 │ libguides:guides-175849 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687307-05 │ libguides:guides-175853 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687533-05 │ libguides:guides-175855 │
└─────────────────────────────────────┴───────────────────────────────┴─────────────────────────┘

Note: There are 6 embeddings across 2 run ids; the timdex_record_id="libguides:guides-175846" appears twice.

  5. Query data.current_embeddings:
td.conn.query("select run_id, embedding_timestamp, timdex_record_id from data.current_embeddings order by embedding_timestamp")

Output:

┌─────────────────────────────────────┬───────────────────────────────┬─────────────────────────┐
│               run_id                │      embedding_timestamp      │    timdex_record_id     │
│               varchar               │   timestamp with time zone    │         varchar         │
├─────────────────────────────────────┼───────────────────────────────┼─────────────────────────┤
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686431-05 │ libguides:guides-175846 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686794-05 │ libguides:guides-175847 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687062-05 │ libguides:guides-175849 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687307-05 │ libguides:guides-175853 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687533-05 │ libguides:guides-175855 │
└─────────────────────────────────────┴───────────────────────────────┴─────────────────────────┘

Note: There are 5 embeddings; timdex_record_id="libguides:guides-175846" appears only once, with the latest timestamp, associated with run_id="85cfe"....
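The relationship between the two views above can be illustrated with a small sketch. This is not TDA's implementation, just the dedup rule the output demonstrates: data.current_embeddings keeps only the row with the latest embedding_timestamp for each timdex_record_id.

```python
# Illustrative sketch (not TDA's actual implementation) of how
# data.current_embeddings relates to data.embeddings: keep the row with
# the latest embedding_timestamp per timdex_record_id.
from datetime import datetime

embeddings = [
    ("21e7f272", datetime(2025, 12, 17, 12, 40, 40), "libguides:guides-175846"),
    ("85cfe316", datetime(2025, 12, 17, 12, 41, 31), "libguides:guides-175846"),
    ("85cfe316", datetime(2025, 12, 17, 12, 41, 32), "libguides:guides-175847"),
]

current = {}
for run_id, ts, record_id in embeddings:
    # the later timestamp wins for each record id
    if record_id not in current or ts > current[record_id][1]:
        current[record_id] = (run_id, ts, record_id)

print(sorted(current))
# → ['libguides:guides-175846', 'libguides:guides-175847']
print(current["libguides:guides-175846"][0])
# → '85cfe316'
```

This mirrors the walkthrough: the older "21e7f272" embedding for libguides:guides-175846 is superseded by the later "85cfe316" run.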

Run a Local OpenSearch instance

  1. Launch a local OpenSearch instance.
  2. Reindex source="libguides" using the fixture.
pipenv run tim reindex-source -s libguides tests/fixtures/dataset

Log output:

2025-12-17 15:20:42,288 INFO tim.cli.main(): Logger 'root' configured with level=INFO
2025-12-17 15:20:42,288 INFO tim.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-12-17 15:20:42,288 INFO tim.opensearch.configure_opensearch_client(): OpenSearch request configurations: {'OPENSEARCH_BULK_MAX_CHUNK_BYTES': 104857600, 'OPENSEARCH_BULK_MAX_RETRIES': 50, 'OPENSEARCH_REQUEST_TIMEOUT': 120}
2025-12-17 15:20:42,288 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
2025-12-17 15:20:42,489 INFO opensearch.log_request_success(): PUT http://localhost:9200/libguides-2025-12-17t20-20-42 [status:200 request:0.199s]
2025-12-17 15:20:42,489 INFO tim.cli.reindex_source(): Index 'libguides-2025-12-17t20-20-42' created.
2025-12-17 15:20:42,492 INFO opensearch.log_request_success(): GET http://localhost:9200/_cat/aliases?format=json [status:200 request:0.003s]
2025-12-17 15:20:42,510 INFO opensearch.log_request_success(): POST http://localhost:9200/_aliases [status:200 request:0.017s]
2025-12-17 15:20:42,513 INFO opensearch.log_request_success(): GET http://localhost:9200/libguides-2025-12-17t20-20-42/_alias [status:200 request:0.003s]
2025-12-17 15:20:42,513 INFO tim.cli.reindex_source(): Index promoted. Current aliases for index 'libguides-2025-12-17t20-20-42': ['all-current', 'libguides']
2025-12-17 15:20:42,515 INFO timdex_dataset_api.dataset.load_pyarrow_dataset(): Dataset successfully loaded: 'tests/fixtures/dataset/data/records', 0.0s
/Users/jcuerdo/.local/share/virtualenvs/timdex-index-manager-P0Ce49-g/lib/python3.12/site-packages/duckdb_engine/__init__.py:630: SAWarning: Did not recognize type 'list' of column 'embedding_vector'
  columns = self._get_columns_info(rows, domains, enums, schema)  # type: ignore[attr-defined]
2025-12-17 15:20:42,876 INFO opensearch.log_request_success(): POST http://localhost:9200/_bulk [status:200 request:0.016s]
2025-12-17 15:20:42,876 INFO tim.opensearch.bulk_index(): Refreshing index.
2025-12-17 15:20:42,891 INFO opensearch.log_request_success(): POST http://localhost:9200/libguides-2025-12-17t20-20-42/_refresh [status:200 request:0.015s]
2025-12-17 15:20:42,936 INFO opensearch.log_request_success(): POST http://localhost:9200/_bulk [status:200 request:0.039s]
2025-12-17 15:20:42,936 INFO tim.opensearch.bulk_update(): Refreshing index.
2025-12-17 15:20:42,955 INFO opensearch.log_request_success(): POST http://localhost:9200/libguides-2025-12-17t20-20-42/_refresh [status:200 request:0.019s]
2025-12-17 15:20:42,955 INFO tim.cli.reindex_source(): Reindex source complete: {"index": {"created": 5, "updated": 0, "errors": 0, "total": 5}, "update": {"updated": 5, "errors": 0, "total": 5}}
2025-12-17 15:20:42,955 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.667618

Note: As the logs indicate, 5 records were indexed into OpenSearch and 5 records were updated with embeddings (embeddings are pulled from data.current_embeddings, so only the latest embedding per record is applied).

Includes new or updated dependencies?

YES - make update was run to resolve the following vulnerability:

Found 1 known vulnerability in 1 package
Name     Version ID             Fix Versions
-------- ------- -------------- ------------
filelock 3.20.0  CVE-2025-68146 3.20.1

Changes expectations for external applications?

YES | NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:
* With TDA 3.8.0, we can now retrieve record metadata columns in
embeddings read methods. Filtering embeddings by `action="index"`
prevents any attempt to update documents that do not exist
in OpenSearch (`action="delete"`), which results in an API error.
This is especially important given the current state of tim.opensearch.bulk_update,
which will raise a BulkOperationError and cause the 'bulk_update_embeddings'
CLI command to exit early.

This also includes an additional change to also index embeddings
when performing a reindex.

How this addresses that need:
* Filter embeddings by action="index"
* Install latest version of timdex-dataset-api (latest commit)
* Update embeddings in tests/fixtures/dataset to use 'embeddings_timestamp'

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-273
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 17, 2025 21:23
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner December 17, 2025 21:23
@coveralls

coveralls commented Dec 17, 2025

Pull Request Test Coverage Report for Build 20340697927

Details

  • 7 of 10 (70.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.4%) to 95.732%

Changes Missing Coverage:
File        Covered Lines   Changed/Added Lines   %
tim/cli.py  7               10                    70.0%
Totals Coverage Status
Change from base Build 20272814319: -0.4%
Covered Lines: 471
Relevant Lines: 492

💛 - Coveralls

@ghukill ghukill self-assigned this Dec 17, 2025
Contributor

@ghukill ghukill left a comment


For the most part, looking great! With the addition of action=index I think we avoid a whole class of errors for documents that may not exist. I was able to walk through the PR instructions and successfully index embeddings.

I left a request to update a docstring, but have a more substantial question. Unsure if this is a TIM or TDA bug, so not requesting changes for this PR at this time.

I was curious what would happen if we run reindex-source against a dataset without embeddings, which seems like a very real possibility. To do this, I copied the testing dataset to /tmp, removed the data/embeddings directory to simulate no embeddings for the dataset, and re-ran reindex-source:

cp -r tests/fixtures/dataset /tmp/use273
rm -r /tmp/use273/data/embeddings
pipenv run tim reindex-source -s libguides /tmp/use273

The following error is ultimately thrown:

...
...
  File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 383, in read_batches_iter
    data_query = self._build_query(
                 ^^^^^^^^^^^^^^^^^^
  File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 411, in _build_query
    embeddings_table = self.get_sa_table(table)
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 174, in get_sa_table
    raise ValueError(f"Could not find table '{table}' in DuckDB schema 'data'.")
ValueError: Could not find table 'current_embeddings' in DuckDB schema 'data'.

My hunch is that we should resolve this at the TDA level, where attempting to use embeddings.read_* methods should fail gracefully if embeddings don't exist (or maybe better yet just yield zero rows).

Thoughts @jonavellecuerdo?

If we elect to address this at the TDA level, I'd have no problem approving this PR with the assumption a ticket is created and this reindex-source will not throw errors if embeddings don't exist, because that is handled in TDA.
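The graceful-failure behavior proposed above could look something like the following sketch. This is a hypothetical wrapper, not actual TDA code: safe_read_batches and missing_table_reader are illustrative names, and the only thing taken from the source is the ValueError message that TDA's get_sa_table raises when the embeddings table is absent.

```python
# Sketch of "yield zero rows when embeddings don't exist" (hypothetical
# wrapper, not actual TDA code).
from collections.abc import Callable, Iterator


def safe_read_batches(read_batches_iter: Callable[[], Iterator]) -> Iterator:
    """Wrap an embeddings reader so a missing table yields nothing."""
    try:
        yield from read_batches_iter()
    except ValueError as exc:
        # Message raised by TDA when the DuckDB table is absent.
        if "Could not find table" in str(exc):
            return  # dataset has no embeddings; nothing to update
        raise


def missing_table_reader():
    # Simulates TDA's behavior against a dataset with no data/embeddings.
    raise ValueError(
        "Could not find table 'current_embeddings' in DuckDB schema 'data'."
    )


print(list(safe_read_batches(missing_table_reader)))
# → []
```

With something like this inside TDA, reindex-source against an embeddings-less dataset would simply index records and update nothing, rather than raising.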

tim/cli.py Outdated
Comment on lines 431 to 437
This CLI command performs the following:
1. creates a new index for the source
2. promotes this index as the primary for the source alias, and added to any other
aliases passed (e.g. 'timdex')
3. uses the TDA library to yield only current records from the parquet dataset
for the source
4. bulk index these records to the new Opensearch index
Contributor


We should probably update this part of the docstring to mention embeddings.

Contributor Author


Just updated the docstring!

@jonavellecuerdo
Contributor Author

@ghukill I agree with your suggestion to create a new ticket to handle this scenario at the TDA level. Whatever approach we decide on, I think it would be great if we applied it to the read_batches_iter method of TIMDEXDataset as well!

Contributor

@ghukill ghukill left a comment


Ticket created for the TDA updates: https://mitlibraries.atlassian.net/browse/USE-306.

Rest looks great!

@jonavellecuerdo jonavellecuerdo merged commit 96f0096 into main Dec 18, 2025
4 of 5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-273-bulk-update-embeddings-use-metadata branch December 18, 2025 19:19