As a site owner using Content Recommendations,
I would like the ability to schedule a re-import of already-indexed content in Content Recs,
so that:
a) there would be no need to manually trigger a re-scan, e.g. if we add or change meta tags or data attributes on pages.
b) content that has been unpublished/deleted from the site would be purged from the Content Recs index.
c) images would be re-scraped from the source pages, which is very useful if editors have changed or unpublished the original image since the page was last scraped.
Perhaps a better approach would be to pass a 'modified' time meta attribute, for example:
...
<meta property="article:published_time" content="2019-10-18T15:19:00.000-04:00" />
<meta property="article:modified_time" content="2023-01-18T15:19:00.000-04:00" />
...
The scheduled re-import process would then only re-index pages that have been modified since they were last scanned.
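To illustrate the check I have in mind, here is a rough sketch in Python. It is not how Content Recs works internally (I don't know that). The names `ModifiedTimeParser` and `needs_reimport` are made up for this example, and so are the rules that a page with no modified time always gets re-imported and that a timestamp without an offset is treated as UTC.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class ModifiedTimeParser(HTMLParser):
    """Hypothetical helper: reads <meta property="article:modified_time">."""

    def __init__(self):
        super().__init__()
        self.modified_time = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("property") == "article:modified_time" and attr.get("content"):
            parsed = datetime.fromisoformat(attr["content"])
            if parsed.tzinfo is None:
                # Assumption: a timestamp without an offset is treated as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            self.modified_time = parsed


def needs_reimport(html, last_scanned):
    """Return True if the page changed after `last_scanned` (timezone-aware)."""
    parser = ModifiedTimeParser()
    parser.feed(html)
    if parser.modified_time is None:
        # Assumption: no modified time on the page means re-import to be safe.
        return True
    return parser.modified_time > last_scanned


page = """
<meta property="article:published_time" content="2019-10-18T15:19:00.000-04:00" />
<meta property="article:modified_time" content="2023-01-18T15:19:00.000-04:00" />
"""

# Page was last scanned before the modification, so it should be re-imported.
print(needs_reimport(page, datetime(2022, 6, 1, tzinfo=timezone.utc)))
# Page was last scanned after the modification, so it can be skipped.
print(needs_reimport(page, datetime(2023, 6, 1, tzinfo=timezone.utc)))
```

With a check like this, a scheduled job would only need to fetch each page and compare timestamps, and could skip everything that has not changed.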
A note on this: I acknowledge that the initial scraping is currently resource-intensive, since it performs a full NLP analysis of the page content. However, an NLP re-scan would NOT be necessary for the suggested re-import. A superficial re-import of the factual properties (title, canonical URL, metatags, image) would be enough.
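As a rough idea of what such a superficial re-import could look like, here is a minimal Python sketch that only reads the factual properties from the page markup and does no content analysis at all. The names `FactualPropertiesParser` and `extract_factual_properties` are hypothetical, and taking the image from `og:image` is my assumption; Content Recs may well pick the image differently.

```python
from html.parser import HTMLParser


class FactualPropertiesParser(HTMLParser):
    """Hypothetical helper: collects title, canonical URL and meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.canonical_url = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and attr.get("rel") == "canonical":
            self.canonical_url = attr.get("href")
        elif tag == "meta":
            # Meta tags may be keyed by "property" (Open Graph) or "name".
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content") is not None:
                self.meta[key] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_factual_properties(html):
    """Return only the cheap-to-read properties; no NLP is involved."""
    parser = FactualPropertiesParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "canonical_url": parser.canonical_url,
        # Assumption: the page image is exposed via the og:image meta tag.
        "image": parser.meta.get("og:image"),
        "meta": parser.meta,
    }


sample = """
<html><head>
<title>Example page</title>
<link rel="canonical" href="https://example.com/page" />
<meta property="og:image" content="https://example.com/image.jpg" />
<meta property="article:modified_time" content="2023-01-18T15:19:00.000-04:00" />
</head><body></body></html>
"""

props = extract_factual_properties(sample)
print(props["title"], props["canonical_url"], props["image"])
```

Since this only parses the page head, it should be far cheaper than the initial scan.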
Thank you for your input. We will investigate the idea and see whether it's something we can address in a timely manner.