How can I identify how many documents are indexed by a given datasource, what those documents are, and the contents of those documents?
By default, every datasource adds metadata to each document it indexes. In addition to the document's contents (the text of a web page, for example), the datasource records information such as the page's URL and the date it was indexed. This metadata is stored in fields, just like the content itself. One of these fields, _lw_data_source_s, holds the name of the datasource (the leading underscore is not a typo; the field name really does start with an underscore).
Note that storing the datasource name in _lw_data_source_s is the default behavior, but a customized datasource might not populate this field. To change this behavior you would have to alter the datasource pipeline.
There is no dedicated screen with this information, but you can get it via the Query Workbench.
- Open the Query Workbench in the Fusion admin UI.
- Use the Add Field Facet button to add the field _lw_data_source_s as a facet.
- Run the default *:* query, which matches all documents. The facet lists each datasource name, with the number of documents from that datasource in parentheses.
Click a facet value to restrict the query output to only the documents from that datasource; the results then show which documents they are and what they contain.
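If you would rather script this than click through the UI, the same facet can be requested with standard Solr query parameters. The sketch below is only an illustration under some assumptions: it assumes you can reach a Solr-style select endpoint for your collection, and the URL and collection name in it (`localhost:8983`, `my_collection`) are hypothetical placeholders, not values from your installation. Your deployment may also require authentication, which is not shown here. The helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FIELD = "_lw_data_source_s"  # default field holding the datasource name


def facet_params():
    """Match-all query that returns no documents, only facet counts."""
    return {
        "q": "*:*",
        "rows": 0,
        "wt": "json",
        "facet": "true",
        "facet.field": FIELD,
        "facet.mincount": 1,
    }


def datasource_params(name, rows=10):
    """Match-all query filtered to the documents of one datasource."""
    # Quote the name so spaces and punctuation are matched literally.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return {"q": "*:*", "rows": rows, "wt": "json", "fq": f'{FIELD}:"{escaped}"'}


def parse_counts(response):
    """Turn Solr's flat [name, count, name, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][FIELD]
    return dict(zip(flat[0::2], flat[1::2]))


def fetch(select_url, params):
    """Send a GET request and decode the JSON response."""
    with urlopen(f"{select_url}?{urlencode(params)}") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical endpoint: replace with the select URL of your collection.
    url = "http://localhost:8983/solr/my_collection/select"
    for name, count in parse_counts(fetch(url, facet_params())).items():
        print(f"{name}: {count} documents")
        docs = fetch(url, datasource_params(name, rows=3))["response"]["docs"]
        for doc in docs:
            print("   ", doc.get("id"))
```

The first request mirrors the facet step in the Query Workbench; the second mirrors clicking a facet value, using a filter query on the same field.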