-
Notifications
You must be signed in to change notification settings - Fork 0
Use metadata in 'bulk-update-embeddings' #375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use metadata in 'bulk-update-embeddings' #375
Conversation
Why these changes are being introduced: * With TDA 3.8.0, we can now retrieve record metadata columns in embeddings read methods. Filtering embeddings by `action="index"` prevents any attempt to update documents that do not exist in OpenSearch (`action="delete"`), which results in an API error.. This is important especially with the current state of tim.opensearch.bulk_update, which will raise a BulkOperationError and cause the 'bulk_update_embeddings' CLI command to exit early. This also includes an additional change to also index embeddings when performing a reindex. How this addresses that need: * Filter embeddings by action="index" * Install latest version of timdex-dataset-api (latest commit) * Update embeddings in fixtures/test/dataset to use 'embeddings_timestamp" Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-273
Pull Request Test Coverage Report for Build 20340697927Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the most part, looking great! With the addition of action=index I think we avoid a whole class of errors for documents that may not exist. I was able to walk through the PR instructions and successfully index embeddings.
I left a request to udpate a docstring, but have a more substantial question. Unsure if this is a TIM or TDA bug, so not requesting changes for this PR at this time.
I was curious what would happen if we run reindex-source against a dataset without embeddings, which seems like a very real possibility. To do this, I copied the testing dataset to /tmp, removed the data/embeddings directory to simulate no embeddings for the dataset, and re-ran reindex-source:
cp -r tests/fixtures/dataset /tmp/use273
rm -r /tmp/use273/data/embeddings
pipenv run tim reindex-source -s libguides /tmp/use273
The following error is ultimately thrown:
...
...
File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 383, in read_batches_iter
data_query = self._build_query(
^^^^^^^^^^^^^^^^^^
File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 411, in _build_query
embeddings_table = self.get_sa_table(table)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ghukill/.local/share/virtualenvs/timdex-index-manager-UCzC4db_/lib/python3.12/site-packages/timdex_dataset_api/embeddings.py", line 174, in get_sa_table
raise ValueError(f"Could not find table '{table}' in DuckDB schema 'data'.")
ValueError: Could not find table 'current_embeddings' in DuckDB schema 'data'.
My hunch is that we should resolve this at the TDA level, where attempting to use embeddings.read_* methods should fail gracefully embeddings don't exist (or maybe better yet just yield zero rows).
Thoughts @jonavellecuerdo?
If we elect to address this at the TDA level, I'd have no problem approving this PR with the assumption a ticket is created and this reindex-source will not throw errors if embeddings don't exist, because that is handled in TDA.
tim/cli.py
Outdated
| This CLI command performs the following: | ||
| 1. creates a new index for the source | ||
| 2. promotes this index as the primary for the source alias, and added to any other | ||
| aliases passed (e.g. 'timdex') | ||
| 3. uses the TDA library to yield only current records from the parquet dataset | ||
| for the source | ||
| 4. bulk index these records to the new Opensearch index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably update this part of the docstring to mention embeddings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just updated the docstring!
|
@ghukill I agree with your suggestion to create a new ticket to handle this scenario at the TDA level. Whatever approach we decide on, I think it would be great if we applied it to the |
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ticket created for the TDA updates: https://mitlibraries.atlassian.net/browse/USE-306.
Rest looks great!
Purpose and background context
With TDA 3.8.0, we can now retrieve record metadata columns in embeddings read methods. Filtering embeddings by
action="index"prevents any attempt to update documents that do not exist in OpenSearch (action="delete"), which results in an API error.. This is important especially with the current state of tim.opensearch.bulk_update, which will raise a BulkOperationError and cause the 'bulk_update_embeddings' CLI command to exit early.This also includes an additional change to also index embeddings when performing a reindex.
Note: This piece was straightforward but will note the redundant nature of this simple add (but think it's fine for now 🤓 ).
How can a reviewer manually see the effects of these changes?
The embeddings in
tests/fixtures/datasetwere updated to align with TDA v3.8.0, which renamed thetimestampcolumn asembeddings_timestamp.Review the contents of TIMDEX dataset at
tests/fixtures/dataset.envfile and set:data.embeddings:Output:
Note: There are 6 embeddings across 2 run ids; the
timdex_record_id="libguides:guides-175846"appears twice.data.current_embeddings:Output:
Note: There are 5 embeddings; the
timdex_record_id="libguides:guides-175846"appears only once from the latest timestamp, which is associated withrun_id="85cfe"....Run a Local OpenSearch instance
source="libguides"using the fixture.Log output:
Note: As the logs indicate, 5 records were indexed into OpenSearch and only 5 records were updated with embeddings (as embeddings are pulled from
data.current_embeddings).Includes new or updated dependencies?
YES -
make updatewas run to resolve the following vulnerability:Changes expectations for external applications?
YES | NO
What are the relevant tickets?
Code review