Sitecore Search can search for documents in addition to HTML. In this article, we will show you how to target PDF files.
Check the Attributes
In a previous article, we already added the following two items as Attributes for use with this SDK:
- File Type
- Parent_url
We will use these to attach PDF-specific attribute values to PDF files.
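To make the goal concrete, here is a rough sketch of how an indexed PDF document could carry these two attributes alongside its other fields. The field names follow the Document Extractor added later in this article; the values are purely hypothetical.
// Hypothetical example of an indexed PDF document carrying the two attributes
// (field names follow the extractor later in this article; values are made up):
const examplePdfDocument = {
  id: 'https___www_example_com_files_whitepaper_pdf',
  file_type: 'pdf',                                  // the "File Type" attribute
  parent_url: 'https://www.example.com/resources',   // the "Parent_url" attribute
  title: 'Example Whitepaper',
  description: 'Text extracted from the PDF body...'
};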
Crawler Settings
Add a target domain
The PDFs offered on sitecore.com are served from a different domain than the HTML pages. Take the following page as an example:
The URL of the PDF file is https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z, and the file is delivered via a CDN. This domain is not included in the current crawler configuration, so we need to add it first.
For this reason, add the domain to the settings of the source you are crawling.
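Conceptually, the crawler only follows URLs whose host appears in the source's list of allowed domains, which is why the CDN host has to be added explicitly. The following is a minimal sketch of that idea, not the actual crawler implementation, and the domain list is only an assumption based on the URLs above.
// Conceptual sketch of an allowed-domain check (not the actual crawler code).
// The domain list is an assumption based on the URLs discussed above.
const allowedDomains = ['www.sitecore.com', 'wwwsitecorecom.azureedge.net'];

function isAllowed(url) {
  const host = new URL(url).hostname;
  return allowedDomains.some((domain) => host === domain);
}

console.log(isAllowed('https://wwwsitecorecom.azureedge.net/-/media/example.pdf')); // true
console.log(isAllowed('https://cdn.example.com/other.pdf'));                        // false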
Max Depth setting
This item controls how many levels deep the crawler follows the links contained in a page when indexing. The setting is described on the following page:
By default, this value is 0 when Sitemap or Sitemap Index is selected, and 2 otherwise. Since we are using a sitemap in this case, we set this value to 1 so that the PDF links found on the pages listed in the sitemap are also crawled.
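To illustrate what the value controls, here is a conceptual sketch of depth-limited crawling. It is not Sitecore Search's implementation; with a sitemap trigger, depth 0 corresponds to the pages listed in the sitemap and depth 1 to the links found on those pages, which is where the PDF links live. The fetchLinks helper is hypothetical.
// Conceptual sketch of depth-limited crawling (not Sitecore Search's actual code).
// fetchLinks(url) is a hypothetical helper returning the links found on a page.
function crawl(startUrls, maxDepth, fetchLinks) {
  const queue = startUrls.map((url) => ({ url, depth: 0 }));
  const visited = new Set();
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);                        // depth 0: pages listed in the sitemap
    if (depth < maxDepth) {
      for (const link of fetchLinks(url)) {  // depth 1: links on those pages, e.g. PDFs
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return visited;
}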
With these settings, PDF files that are linked from crawled pages can now be crawled and indexed. However, to be able to search specifically for PDFs, we add the following processing.
Add PDF processing in Document Extractor
So far, the data acquired by the crawler has been processed by a single Document Extractor, but this time we want to handle PDF data differently.
First, add PDF handling as part of the Document Extractor configuration. The settings to add are as follows:
- Set the Name to PDF
- Select JavaScript as the extractor type
- For URLs to Match, select Glob Expression and set **/*.pdf* so that only URLs with a PDF extension are processed (a quick check of this pattern follows this list)
- When a PDF is served from Sitecore's Media Library, a key is appended after .pdf, which is why the pattern ends with a trailing *; if your URLs end in .pdf, the trailing * is not needed
- Place this extractor before the existing JavaScript extractor in the processing order
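As a quick sanity check of the pattern, the snippet below uses a close regex approximation of the **/*.pdf* glob (the same expression reused in the Request Extractor later in this article) against a few hypothetical URLs.
// A close regex approximation of the **/*.pdf* glob, used only to illustrate
// which URLs the pattern is intended to catch (hypothetical URLs):
const pdfPattern = /.*\.pdf(?:\?.*)?$/;
console.log(pdfPattern.test('https://www.example.com/files/report.pdf'));               // true
console.log(pdfPattern.test('https://www.example.com/-/media/report.pdf?md=20200622')); // true
console.log(pdfPattern.test('https://www.example.com/files/report.html'));              // false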
Click Add tagger and add the following JavaScript:
function extract(request, response) {
  // Entities decoded by decodeEntities below
  const translate = {
    'nbsp': ' ',
    'amp': '&',
    'quot': '"',
    'lt': '<',
    'gt': '>'
  };
  const translate_re = /&(nbsp|amp|quot|lt|gt);/g;

  // Decode named and numeric HTML entities in the extracted text
  function decodeEntities(encodedString) {
    return encodedString.replace(translate_re, function(match, entity) {
      return translate[entity];
    }).replace(/&#(\d+);/gi, function(match, numStr) {
      const num = parseInt(numStr, 10);
      return String.fromCharCode(num);
    });
  }

  function sanitize(text) {
    return text ? decodeEntities(String(text).trim()) : text;
  }

  const $ = response.body;
  const url = request.url;
  // Build a stable document ID from the URL
  const id = url.replace(/[.:/&?=%]/g, '_');
  let title = sanitize($('title').text());
  const description = $('body').text().substring(0, 7000);

  // Fall back to the parent (linking) page's title when the PDF title is too short
  const $p = request.context.parent.response.body;
  if (title.length <= 4 && $p) {
    title = $p('title').text();
  }

  // Inherit type and last_modified from the parent document
  const parentUrl = request.context.parent.request.url;
  const type = request.context.parent.documents[0].data.type;
  const last_modified = request.context.parent.documents[0].data.last_modified;

  return [{
    'id': id,
    'file_type': 'pdf',
    'type': type,
    'last_modified': last_modified,
    'title': title,
    'description': description,
    'parent_url': parentUrl
  }];
}
If the Localized checkbox is also checked and the screen looks like the one below, the standard crawl settings are complete.
A locale is also set for PDF files, but the URL of the PDF file itself does not contain a locale. Therefore, we derive the locale from the URL of the parent page that links to the PDF (the page stored in parent_url). The code looks like this:
function extract(request, response) {
  // Derive the locale from the URL of the parent (linking) page
  const parentUrl = request.context.parent.request.url;
  const locales = ['zh-cn', 'de-de', 'ja-jp', 'da'];
  for (const candidate of locales) {
    if (parentUrl.indexOf('/' + candidate + '/') >= 0) {
      // Pages under /da/ map to the da-dk locale
      const locale = (candidate === 'da') ? 'da-dk' : candidate;
      return locale.toLowerCase().replace('-', '_');
    }
  }
  // Default when the parent URL contains no locale segment
  return "en_us";
}
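To sanity-check the mapping outside of Sitecore Search, a minimal standalone sketch like the one below works, assuming the extract function above is defined in the same script. The request object is a simplified mock and the URLs are hypothetical.
// Simplified mock of the request object, only for testing extract() locally
function mockRequest(parentUrl) {
  return { context: { parent: { request: { url: parentUrl } } } };
}

console.log(extract(mockRequest('https://www.example.com/ja-jp/customers/page'), null)); // "ja_jp"
console.log(extract(mockRequest('https://www.example.com/da/products/page'), null));     // "da_dk"
console.log(extract(mockRequest('https://www.example.com/products/page'), null));        // "en_us"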
After configuration, the screen looks like this.
Finally, we add PDF handling to the Request Extractor. PDF links are not followed by default, so this additional processing is necessary:
function extract(request, response) {
  const $ = response.body;
  // Match links whose path ends in .pdf, with or without a query string
  const regex = /.*\.pdf(?:\?.*)?$/;
  // Collect the href of every anchor that points to a PDF and queue it for crawling
  return $('a')
    .toArray()
    .map((a) => $(a).attr('href'))
    .filter((url) => regex.test(url))
    .map((url) => ({ url }));
}
Once the Document Extractor and Locale Extractor have been configured for the source, the screen looks like this.
Click Publish to run the crawl with the new settings (this time the amount of target content increases to around 6,000 items).
Content Verification
Now let's check the crawled content. In the content list on the administration screen, specify the target source and the file type, and a list of PDF files is displayed. If you check the attributes of one of these items, you can see that PDF is set as the file type.
Summary
In this article, we added PDF files to the crawl and verified in the content list that their file type is set to PDF.