Tagged: Media Library

Sitecore: Apache Tika Integration for Secure Media File Indexing

Sitecore-Apache-Tika-Integration-Secure-Media

Problem

We have some secure PDFs in Media Library that were not getting indexed in Solr – They couldn’t be extracted using the PDFSharp library.

The logs were showing the error while extracting secure files

16804 12:04:53 ERROR DefaultMediaItemTextExtractor: Cannot extract content from media item with id ‘{442006A5-8CB6-4ABE-8855-786D2A870201}’.
Exception: PdfSharp.Pdf.IO.PdfReaderException
Message: The PDF document is protected with an encryption not supported by PDFsharp.
Source: PdfSharp
at PdfSharp.Pdf.Security.PdfStandardSecurityHandler.ValidatePassword(String inputPassword)
at PdfSharp.Pdf.IO.PdfReader.Open(Stream stream, String password, PdfDocumentOpenMode openmode, PdfPasswordProvider passwordProvider)
at PdfSharp.Pdf.IO.PdfReader.Open(String path, String password, PdfDocumentOpenMode openmode, PdfPasswordProvider provider)
at Sitecore.ContentSearch.ContentExtraction.Readers.PdfSharpReader.ReadAll(String filePath)
at Sitecore.ContentSearch.ContentExtraction.Common.DefaultMediaItemTextExtractor.ExtractTextFromMedia(MediaItem mediaItem)

38536 12:04:53 ERROR DefaultMediaItemTextExtractor: Cannot extract content from media item with id ‘{442006A5-8CB6-4ABE-8855-786D2A870201}’.
Exception: PdfSharp.Pdf.IO.PdfReaderException
Message: The PDF document is protected with an encryption not supported by PDFsharp.
Source: PdfSharp
at PdfSharp.Pdf.Security.PdfStandardSecurityHandler.ValidatePassword(String inputPassword)
at PdfSharp.Pdf.IO.PdfReader.Open(Stream stream, String password, PdfDocumentOpenMode openmode, PdfPasswordProvider passwordProvider)
at PdfSharp.Pdf.IO.PdfReader.Open(String path, String password, PdfDocumentOpenMode openmode, PdfPasswordProvider provider)
at Sitecore.ContentSearch.ContentExtraction.Readers.PdfSharpReader.ReadAll(String filePath)
at Sitecore.ContentSearch.ContentExtraction.Common.DefaultMediaItemTextExtractor.ExtractTextFromMedia(MediaItem mediaItem)

 

Solution

  • If you like to index the media content, Sitecore recommends using the following libraries IFilter, Apache Tika, or SolrCell. 

https://www.searchstax.com/docs.hc/can-we-use-apache-tika

  • Azure web apps have a limitation in using the IFilter library so I ended up using Apache Tika.

Steps to Integrate:

  • Download the Apache Tika server file –tika-server-1.22.jar.
    • Sitecore recommends Apache Tika version 1.22 refer to the compatibility table for your version
  • Save the server file in a folder on SOLR server e.g: c:\tika
  • In PowerShell navigate to the path and execute the following command to install.

java -jar tika-server-1.22.jar

Sitecore-Apache-Tika-Solr-Indexing.png

Note: The default hostname is localhost and the port is 9998.

If you would like a specific hostname and port number that could be included in the installation command as parameters

 java -jar tika-server-1.22.jar –host=<Tikahostname> –port=<portnumber>

After the installation is completed open the following URL http://localhost:9998 to see if it is working as expected. You should see the welcome message!

Sitecore-Apache-Tika-Solr-Indexing-2.png

  • Add the following patch file into App_Config/Include/zzz folder to replace DefaultMediaFileTextExtractor from Sitecore.ContentSearch.ContentExtration.

 

 

  • Last step – Let’s add Tika URL into ConnectionStrings.config file.

<add name=”tika” connectionString=”http://localhost:9998″ />

  • Let’s test quickly – Rebuild a Tree in the Developer Ribbon for one item or you could Rebuild the entire index.
  • Once the indexing is completed check and see if we have the media item available in the index.

Quick Tip: To search for a particular item in Solr, use the following query in the parameter q on your index page

_uniqueid:*[item id in lowercase without braces]*

 Sitecore-Apache-Tika-Solr-Indexing-3.png

Hope this helps.

Happy Sitecoring!

1

Migrate Sitecore Media Library Assets to DAM

Migrate Sitecore Media Library Assets to DAM

When we move into composable architecture. We will need to move the media assets to other platforms. Let’s explore methods of exporting Sitecore Media Library assets to Digital Assets Management (DAM) like Content Hub, AEM, etc. It’s a two-step process of exporting from the source and importing to the destination. We will export the entire Media Library to a zip file and also the asset details to a spreadsheet for validation.

Sitecore Media Library Export to file:

I was exploring the Sitecore Modules, but I realized It could be quickly done using PowerShell Extensions. Right-click on the Media Library node, Navigate to Scripts, and click Download. 

Sitecore_Media_Library_PowerShell_Download

The PowerShell script will run for a few mins in my case it ran for 20 minutes for 3GB (depending upon the Media Library size). If you run into timeout issues. Execute at folder levels and finally combine them.

Once the execution is completed it will prompt a pop-up to download the zip file.

P.S: The zip file is temporarily stored in the App_Data folder, but once we download it, it gets deleted.

Sitecore Media Library Export to CSV:

PowerShell extensions script to help export the media library assets file names and path to a spreadsheet.

 

 

Another approach to export the data is to use the content export tool.

https://github.com/estockwell-alpert/ContentExportTool

Import

Now the assets are ready to be imported to DAM

Sitecore Content Hub follow the steps in the following article https://docs.stylelabs.com/contenthub/3.5.x/content/user-documentation/content-user-manual/create/create-upload-content.html

Adobe Experience Manager you could use the bulk import process following the article

https://experienceleague.adobe.com/docs/experience-manager-learn/cloud-service/migration/bulk-import.html

I hope this helps.

Happy Sitecoring!

1