
Conversation

@jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Dec 17, 2025

Purpose and background context

With TDA 3.8.0, we can now retrieve record metadata columns in embeddings read methods. Filtering embeddings by action="index" prevents any attempt to update documents that do not exist in OpenSearch (action="delete"), which results in an API error. This is especially important given the current state of tim.opensearch.bulk_update, which will raise a BulkOperationError and cause the 'bulk_update_embeddings' CLI command to exit early.

This also includes a change to index embeddings when performing a reindex.
Note: This piece was straightforward, though it does introduce some redundancy; I think it's fine for now 🤓.
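The action="index" filter described above can be sketched in plain Python. This is an illustrative sketch only, not TDA's actual API: the EmbeddingRow class and updatable_embeddings helper are hypothetical names, standing in for rows returned by TDA's embeddings read methods with their record metadata columns.

```python
# Hypothetical sketch of the action="index" filter; the real TDA read
# methods and their parameters may differ.
from dataclasses import dataclass


@dataclass
class EmbeddingRow:
    timdex_record_id: str
    action: str  # "index" or "delete", from the record metadata columns


def updatable_embeddings(rows):
    """Yield only embeddings whose source record was indexed.

    Rows whose record carries action="delete" have no matching OpenSearch
    document, so attempting a bulk update on them raises an API error.
    """
    for row in rows:
        if row.action == "index":
            yield row


rows = [
    EmbeddingRow("libguides:guides-175846", "index"),
    EmbeddingRow("libguides:guides-000000", "delete"),
]
print([r.timdex_record_id for r in updatable_embeddings(rows)])
# → ['libguides:guides-175846']
```

Skipping the "delete" rows up front is what keeps bulk_update from ever seeing a document it cannot update.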

How can a reviewer manually see the effects of these changes?

The embeddings in tests/fixtures/dataset were updated to align with TDA v3.8.0, which renamed the timestamp column to embeddings_timestamp.

Review the contents of TIMDEX dataset at tests/fixtures/dataset

  1. Create a .env file and set:
TIMDEX_DATASET_LOCATION="tests/fixtures/dataset"
  2. Run an IPython terminal:
pipenv run ipython
  3. Load the TIMDEX dataset:
import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.embeddings import TIMDEXEmbeddings
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])
td.metadata.rebuild_dataset_metadata()
  4. Query data.embeddings:
td.conn.query("select run_id, embedding_timestamp, timdex_record_id from data.embeddings order by embedding_timestamp")

Output:

┌─────────────────────────────────────┬───────────────────────────────┬─────────────────────────┐
│               run_id                │      embedding_timestamp      │    timdex_record_id     │
│               varchar               │   timestamp with time zone    │         varchar         │
├─────────────────────────────────────┼───────────────────────────────┼─────────────────────────┤
│ 21e7f272-7b96-480c-9c25-36075355f…  │ 2025-12-17 12:40:40.674548-05 │ libguides:guides-175846 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686431-05 │ libguides:guides-175846 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686794-05 │ libguides:guides-175847 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687062-05 │ libguides:guides-175849 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687307-05 │ libguides:guides-175853 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687533-05 │ libguides:guides-175855 │
└─────────────────────────────────────┴───────────────────────────────┴─────────────────────────┘

Note: There are 6 embeddings across 2 run ids; the timdex_record_id="libguides:guides-175846" appears twice.

  5. Query data.current_embeddings:
td.conn.query("select run_id, embedding_timestamp, timdex_record_id from data.current_embeddings order by embedding_timestamp")

Output:

┌─────────────────────────────────────┬───────────────────────────────┬─────────────────────────┐
│               run_id                │      embedding_timestamp      │    timdex_record_id     │
│               varchar               │   timestamp with time zone    │         varchar         │
├─────────────────────────────────────┼───────────────────────────────┼─────────────────────────┤
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686431-05 │ libguides:guides-175846 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.686794-05 │ libguides:guides-175847 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687062-05 │ libguides:guides-175849 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687307-05 │ libguides:guides-175853 │
│ 85cfe316-089c-4639-a5af-c861a7321…  │ 2025-12-17 12:41:31.687533-05 │ libguides:guides-175855 │
└─────────────────────────────────────┴───────────────────────────────┴─────────────────────────┘

Note: There are 5 embeddings; timdex_record_id="libguides:guides-175846" appears only once, with the latest timestamp, associated with run_id="85cfe"....
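The relationship between the two views above can be illustrated with a small sketch. This is not TDA's implementation, just the dedup rule the output demonstrates: data.current_embeddings keeps only the row with the latest embedding_timestamp for each timdex_record_id.

```python
# Illustrative sketch (not TDA's actual implementation) of how
# data.current_embeddings relates to data.embeddings: keep the row with
# the latest embedding_timestamp per timdex_record_id.
from datetime import datetime

embeddings = [
    ("21e7f272", datetime(2025, 12, 17, 12, 40, 40), "libguides:guides-175846"),
    ("85cfe316", datetime(2025, 12, 17, 12, 41, 31), "libguides:guides-175846"),
    ("85cfe316", datetime(2025, 12, 17, 12, 41, 32), "libguides:guides-175847"),
]

current = {}
for run_id, ts, record_id in embeddings:
    # the later timestamp wins for each record id
    if record_id not in current or ts > current[record_id][1]:
        current[record_id] = (run_id, ts, record_id)

print(sorted(current))
# → ['libguides:guides-175846', 'libguides:guides-175847']
print(current["libguides:guides-175846"][0])
# → '85cfe316'
```

This mirrors the walkthrough: the older "21e7f272" embedding for libguides:guides-175846 is superseded by the later "85cfe316" run.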

Run a Local OpenSearch instance

  1. Launch a local OpenSearch instance.
  2. Reindex source="libguides" using the fixture.
pipenv run tim reindex-source -s libguides tests/fixtures/dataset

Log output:

2025-12-17 15:20:42,288 INFO tim.cli.main(): Logger 'root' configured with level=INFO
2025-12-17 15:20:42,288 INFO tim.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-12-17 15:20:42,288 INFO tim.opensearch.configure_opensearch_client(): OpenSearch request configurations: {'OPENSEARCH_BULK_MAX_CHUNK_BYTES': 104857600, 'OPENSEARCH_BULK_MAX_RETRIES': 50, 'OPENSEARCH_REQUEST_TIMEOUT': 120}
2025-12-17 15:20:42,288 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
2025-12-17 15:20:42,489 INFO opensearch.log_request_success(): PUT http://localhost:9200/libguides-2025-12-17t20-20-42 [status:200 request:0.199s]
2025-12-17 15:20:42,489 INFO tim.cli.reindex_source(): Index 'libguides-2025-12-17t20-20-42' created.
2025-12-17 15:20:42,492 INFO opensearch.log_request_success(): GET http://localhost:9200/_cat/aliases?format=json [status:200 request:0.003s]
2025-12-17 15:20:42,510 INFO opensearch.log_request_success(): POST http://localhost:9200/_aliases [status:200 request:0.017s]
2025-12-17 15:20:42,513 INFO opensearch.log_request_success(): GET http://localhost:9200/libguides-2025-12-17t20-20-42/_alias [status:200 request:0.003s]
2025-12-17 15:20:42,513 INFO tim.cli.reindex_source(): Index promoted. Current aliases for index 'libguides-2025-12-17t20-20-42': ['all-current', 'libguides']
2025-12-17 15:20:42,515 INFO timdex_dataset_api.dataset.load_pyarrow_dataset(): Dataset successfully loaded: 'tests/fixtures/dataset/data/records', 0.0s
/Users/jcuerdo/.local/share/virtualenvs/timdex-index-manager-P0Ce49-g/lib/python3.12/site-packages/duckdb_engine/__init__.py:630: SAWarning: Did not recognize type 'list' of column 'embedding_vector'
  columns = self._get_columns_info(rows, domains, enums, schema)  # type: ignore[attr-defined]
2025-12-17 15:20:42,876 INFO opensearch.log_request_success(): POST http://localhost:9200/_bulk [status:200 request:0.016s]
2025-12-17 15:20:42,876 INFO tim.opensearch.bulk_index(): Refreshing index.
2025-12-17 15:20:42,891 INFO opensearch.log_request_success(): POST http://localhost:9200/libguides-2025-12-17t20-20-42/_refresh [status:200 request:0.015s]
2025-12-17 15:20:42,936 INFO opensearch.log_request_success(): POST http://localhost:9200/_bulk [status:200 request:0.039s]
2025-12-17 15:20:42,936 INFO tim.opensearch.bulk_update(): Refreshing index.
2025-12-17 15:20:42,955 INFO opensearch.log_request_success(): POST http://localhost:9200/libguides-2025-12-17t20-20-42/_refresh [status:200 request:0.019s]
2025-12-17 15:20:42,955 INFO tim.cli.reindex_source(): Reindex source complete: {"index": {"created": 5, "updated": 0, "errors": 0, "total": 5}, "update": {"updated": 5, "errors": 0, "total": 5}}
2025-12-17 15:20:42,955 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.667618

Note: As the logs indicate, 5 records were indexed into OpenSearch and 5 records were updated with embeddings (embeddings are pulled from data.current_embeddings, so only the latest embedding per record is applied).

Includes new or updated dependencies?

YES - make update was run to resolve the following vulnerability:

Found 1 known vulnerability in 1 package
Name     Version ID             Fix Versions
-------- ------- -------------- ------------
filelock 3.20.0  CVE-2025-68146 3.20.1

Changes expectations for external applications?

YES | NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:
* With TDA 3.8.0, we can now retrieve record metadata columns in
embeddings read methods. Filtering embeddings by `action="index"`
prevents any attempt to update documents that do not exist
in OpenSearch (`action="delete"`), which results in an API error.
This is especially important given the current state of tim.opensearch.bulk_update,
which will raise a BulkOperationError and cause the 'bulk_update_embeddings'
CLI command to exit early.

This also includes an additional change to also index embeddings
when performing a reindex.

How this addresses that need:
* Filter embeddings by action="index"
* Install latest version of timdex-dataset-api (latest commit)
* Update embeddings in tests/fixtures/dataset to use 'embeddings_timestamp'

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-273
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 17, 2025 21:23
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner December 17, 2025 21:23
@coveralls

coveralls commented Dec 17, 2025

Pull Request Test Coverage Report for Build 20340697927

Details

  • 7 of 10 (70.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.4%) to 95.732%

Changes Missing Coverage:
File        Covered Lines   Changed/Added Lines   %
tim/cli.py  7               10                    70.0%
Totals Coverage Status
Change from base Build 20272814319: -0.4%
Covered Lines: 471
Relevant Lines: 492

💛 - Coveralls

@ghukill ghukill self-assigned this Dec 17, 2025
Contributor

@ghukill ghukill left a comment


For the most part, looking great! With the addition of action=index I think we avoid a whole class of errors for documents that may not exist. I was able to walk through the PR instructions and successfully index embeddings.

I left a request to update a docstring, but have a more substantial question. Unsure if this is a TIM or TDA bug, so not requesting changes for this PR at this time.

I was curious what would happen if we run reindex-source against a dataset without embeddings, which seems like a very real possibility. To do this, I copied the testing dataset to /tmp, removed the data/embeddings directory to simulate no embeddings for the dataset, and re-ran reindex-source:

cp -r tests/fixtures/dataset /tmp/use273
rm -r /tmp/use273/data/embeddings
pipenv run tim reindex-source -s libguides /tmp/use273

The following error is ultimately thrown:

...
...
  File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 383, in read_batches_iter
    data_query = self._build_query(
                 ^^^^^^^^^^^^^^^^^^
  File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 411, in _build_query
    embeddings_table = self.get_sa_table(table)
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 174, in get_sa_table
    raise ValueError(f"Could not find table '{table}' in DuckDB schema 'data'.")
ValueError: Could not find table 'current_embeddings' in DuckDB schema 'data'.

My hunch is that we should resolve this at the TDA level, where attempting to use embeddings.read_* methods should fail gracefully if embeddings don't exist (or maybe better yet just yield zero rows).

Thoughts @jonavellecuerdo?

If we elect to address this at the TDA level, I'd have no problem approving this PR with the assumption a ticket is created and this reindex-source will not throw errors if embeddings don't exist, because that is handled in TDA.
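The graceful-failure behavior proposed above could look something like the following sketch. This is a hypothetical wrapper, not actual TDA code: safe_read_batches and missing_table_reader are illustrative names, and the only thing taken from the source is the ValueError message that TDA's get_sa_table raises when the embeddings table is absent.

```python
# Sketch of "yield zero rows when embeddings don't exist" (hypothetical
# wrapper, not actual TDA code).
from collections.abc import Callable, Iterator


def safe_read_batches(read_batches_iter: Callable[[], Iterator]) -> Iterator:
    """Wrap an embeddings reader so a missing table yields nothing."""
    try:
        yield from read_batches_iter()
    except ValueError as exc:
        # Message raised by TDA when the DuckDB table is absent.
        if "Could not find table" in str(exc):
            return  # dataset has no embeddings; nothing to update
        raise


def missing_table_reader():
    # Simulates TDA's behavior against a dataset with no data/embeddings.
    raise ValueError(
        "Could not find table 'current_embeddings' in DuckDB schema 'data'."
    )


print(list(safe_read_batches(missing_table_reader)))
# → []
```

With something like this inside TDA, reindex-source against an embeddings-less dataset would simply index records and update nothing, rather than raising.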

tim/cli.py Outdated
Comment on lines 431 to 437
This CLI command performs the following:
1. creates a new index for the source
2. promotes this index as the primary for the source alias, and added to any other
aliases passed (e.g. 'timdex')
3. uses the TDA library to yield only current records from the parquet dataset
for the source
4. bulk index these records to the new Opensearch index
Contributor


We should probably update this part of the docstring to mention embeddings.

Contributor Author


Just updated the docstring!

@jonavellecuerdo
Contributor Author

@ghukill I agree with your suggestion to create a new ticket to handle this scenario at the TDA level. Whatever approach we decide on, I think it would be great if we applied it to the read_batches_iter method of TIMDEXDataset as well!

Contributor

@ghukill ghukill left a comment


Ticket created for the TDA updates: https://mitlibraries.atlassian.net/browse/USE-306.

Rest looks great!

@jonavellecuerdo jonavellecuerdo merged commit 96f0096 into main Dec 18, 2025
4 of 5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the USE-273-bulk-update-embeddings-use-metadata branch December 18, 2025 19:19