
DEICH-5839 - Summary harvesting

Tom Adam requested to merge DEICH-5839-summary-harvesting-from-bibbi into master

DEICH-5839 Mimir changes

Extended the mimir v2 API search to fetch missing summaries in addition to missing cover candidates. The changes are implemented so that harvesting of further kinds of absent data can easily be added to the code base.

The entry point of the search is, just like before, the no.deichman.mimir.discogs.controller.SearchController class (note that the package should be renamed to no.deichman.mimir.harvesting.controller). The controller now provides two endpoints, one for MARC21 search (/marc) and one for discogs (/discogs), for a cleaner separation of concerns.

The code is built around the AbsentData enum class. This enum contains all currently supported kinds of harvestable absent data (as of now, however, discogs only supports cover image). The design is somewhat biased towards data coming from MARC21: each enum value is connected to a corresponding data extractor, which is responsible for extracting that data from the responses of the MARC21 endpoints (BIBBI, ALMA, DFB).
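The enum-plus-extractor idea could look roughly like the sketch below. This is not the code from the merge request: MarcRecord is a hypothetical stand-in for a parsed MARC21 record, the extract method is an invented name, and the field tags (520 for summary, 856 for an electronic location that might hold a cover URL) are assumptions about where the data lives.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a parsed MARC21 record: field tag -> field values.
record MarcRecord(Map<String, List<String>> fields) {
    List<String> values(String tag) {
        return fields.getOrDefault(tag, List.of());
    }
}

// Each kind of absent data carries the extractor that knows how to find it.
enum AbsentData {
    // Assumption: cover URLs are taken from MARC21 field 856.
    COVER_IMAGE(rec -> rec.values("856")),
    // Assumption: summaries are taken from MARC21 field 520.
    SUMMARY(rec -> rec.values("520"));

    private final Function<MarcRecord, List<String>> extractor;

    AbsentData(Function<MarcRecord, List<String>> extractor) {
        this.extractor = extractor;
    }

    // Returns all candidates found in the record, or an empty list if none.
    List<String> extract(MarcRecord record) {
        return extractor.apply(record);
    }
}
```

With this shape, supporting a new kind of absent data means adding one enum constant with its extractor; callers that iterate over AbsentData.values() pick it up without further changes.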

MARC21 endpoints are queried in priority order (BIBBI, ALMA, DFB, from high to low). Once a kind of absent data is found, it is removed from the request sent to the next source. So if BIBBI returns a summary, no attempt is made to fetch the summary from ALMA or DFB (see the tests for details).
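The priority-ordered lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the mimir implementation: MarcSource and PriorityHarvester are hypothetical names, and AbsentData is reduced to bare constants to keep the example self-contained.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

enum AbsentData { COVER_IMAGE, SUMMARY }

// Hypothetical abstraction over one MARC21 source (BIBBI, ALMA or DFB).
interface MarcSource {
    Map<AbsentData, List<String>> fetch(String identifier, Set<AbsentData> wanted);
}

final class PriorityHarvester {
    private final List<MarcSource> sourcesByPriority; // highest priority first

    PriorityHarvester(List<MarcSource> sourcesByPriority) {
        this.sourcesByPriority = List.copyOf(sourcesByPriority);
    }

    Map<AbsentData, List<String>> harvest(String identifier, Set<AbsentData> absent) {
        Map<AbsentData, List<String>> found = new EnumMap<>(AbsentData.class);
        // EnumSet.copyOf rejects an empty non-EnumSet collection, so guard for it.
        Set<AbsentData> remaining = absent.isEmpty()
                ? EnumSet.noneOf(AbsentData.class)
                : EnumSet.copyOf(absent);
        for (MarcSource source : sourcesByPriority) {
            if (remaining.isEmpty()) {
                break; // everything was found, lower-priority sources are skipped
            }
            Map<AbsentData, List<String>> hits = source.fetch(identifier, Set.copyOf(remaining));
            hits.forEach((kind, values) -> {
                // Only accept kinds that are still missing and actually have values.
                if (remaining.contains(kind) && !values.isEmpty()) {
                    found.put(kind, List.copyOf(values));
                }
            });
            // Kinds found so far are not requested from the next source.
            remaining.removeAll(found.keySet());
        }
        return found;
    }
}
```

Constructed with the sources in the order BIBBI, ALMA, DFB, a summary hit from BIBBI means ALMA is asked for the cover image only, and DFB is not contacted at all once nothing remains.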

Multiple cover images are supported. The same is true for summaries, but this adds little value, since it would be hard for the client to decide which summary to use (currently the client picks the first hit in such a case).

Further changes:
-Removed the extraQueryParts option from QueryBuilderService. It is no longer needed, since we may now fetch other absent data in addition to the cover image.
-Removed the empty flag from Response.MarcResponse. It only added complexity, and empty responses can easily be filtered out on the server side.
-Removed the still-unused data harvesting from discogs. The previous implementation is in the repo, but it must be adapted to the new architecture if it is needed again (the implementation should then be based on the AbsentData enum concept as well).

DEICH-5839 - Changes in cover-harvester

Modified the code to handle multiple missing data types, and all MARC21 search end points.

Increased test coverage (never enough).

Improved type safety and error handling, so more stable runs can be expected.

DEICH-5839 - Euler

Made the changes required to enable fetching absent data from, and setting it in, virtuoso.

Refactored the API and renamed endpoints. The kind of absent data (COVER_IMAGE or SUMMARY) is now also returned with the absent data candidate response.

More consistent use of publicationId: it was removed from the request object and is now only part of the data setter URLs (the set-data and harvest-attempt endpoints).

Updated and renamed the absent data fetching sparql query.

Added required sparql templates to log harvesting attempts.

Changed harvest attempt naming - requires update script in prod - see the task comment added recently.

Added a SetDataRequest base class, which is to be used as the base class for any further data harvesting efforts. The request classes themselves are responsible for generating the insert sparql statement. The overwrite option was removed: it was not used and only added complexity.
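A rough sketch of how a request class might generate its own insert statement is given below. Only the SetDataRequest name comes from this merge request; the toInsertSparql and literal methods, the SetSummaryRequest subclass and the predicate URI are hypothetical, and the real classes in Euler may differ considerably.

```java
// Base class: every concrete request knows how to turn itself into SPARQL.
abstract class SetDataRequest {
    // The publication is passed in separately, since it is part of the URL
    // and not of the request object.
    abstract String toInsertSparql(String publicationUri);

    // Minimal escaping for a SPARQL string literal (hypothetical helper).
    static String literal(String value) {
        String escaped = value
                .replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
        return "\"" + escaped + "\"";
    }
}

// Hypothetical subclass for setting a harvested summary.
final class SetSummaryRequest extends SetDataRequest {
    private final String summary;

    SetSummaryRequest(String summary) {
        this.summary = summary;
    }

    @Override
    String toInsertSparql(String publicationUri) {
        // Assumption: the predicate URI is a placeholder, not the real ontology term.
        return "INSERT DATA { <" + publicationUri + "> "
                + "<http://example.org/ontology#hasSummary> "
                + literal(summary) + " }";
    }
}
```

Keeping statement generation in the request classes means the endpoint code only needs to call toInsertSparql and execute the result, whichever kind of data is being set.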

Added JSON serialization/deserialization tests for the SetDataRequest implementations.
Edited by Tom Adam
