File Indexing with Solr
File indexing with Indexed Search has always been complicated and restricted to just a few file formats. Other search engine integrations for TYPO3 have not provided a good solution to file indexing either. With Apache Solr for TYPO3 we want to solve that problem.
In fact, we have had a first file indexer since mid-May 2010, implemented for a customer project. Because time was short, that indexer only worked with frontend indexing and with files managed through the fileadmin.
Goals and Requirements
Since we want to deliver the best search engine integration for TYPO3, we had higher goals though. We wanted to cover all aspects of file indexing in TYPO3. That means supporting two ways of indexing: the familiar frontend indexing used by Indexed Search, most other search engine integrations, and the public version of the Solr extension, as well as the Index Queue we invented for the early access version of our extension. It also means supporting files regardless of whether they are managed using the fileadmin or the DAM. Every combination of these needs to be supported.
Challenges
We all know how powerful and flexible TYPO3 is. That flexibility, however, also makes indexing content in TYPO3 harder in general. You can set a page to display a different page's content elements, you can put content elements on pages using TypoScript, and so on. So it is hard, if not impossible, to find out what content will end up on a page without actually rendering it.
However, any content that is rendered on a page goes through content object rendering in class tslib_cObj. Since TYPO3 4.4, we have a hook in the class' initialization method which allows us to track any content being rendered. We already use that hook in the Index Queue to find out which frontend user group access restrictions are applied to content elements on a page. For file indexing we are simply going to use the same hook to track which files are being rendered on a page.
Another, maybe even worse, issue is how files, and specifically links to files, are handled in TYPO3. The core has a standard way of handling file links. However, every extension that provides file lists handles them in its own non-standard way, especially extensions which provide downloads from the DAM.
Solutions
First we register a file indexer to be executed at the end of the page rendering process, just like our Solr content indexer and Indexed Search. A file extractor for the pages generated by TYPO3 is registered to be executed whenever an instance of tslib_cObj is created. The file extractor then hands the content element record over to a file detector factory to get a file detector which can find files linked by the given content element. The file extractor statically stores all the files found by file detectors during the page rendering process. At the end, when our file indexer is called, it asks the file extractor for all the files which have been found and puts them into a File Index Queue.
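The flow described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the extension's PHP code: the names FileExtractor, detect_files, on_content_object_created and on_page_rendered are hypothetical, the comma-separated format of the media field is an assumption, and the sketch keeps the collected files on an instance rather than in static storage.

```python
from dataclasses import dataclass, field


def detect_files(record: dict) -> list[str]:
    """Stand-in for a file detector: read comma-separated names from 'media'."""
    media = record.get("media", "")
    return [name.strip() for name in media.split(",") if name.strip()]


@dataclass
class FileExtractor:
    """Collects the files found while a single page is being rendered."""
    found: list[str] = field(default_factory=list)

    def on_content_object_created(self, record: dict) -> None:
        # Called once per rendered content element (the hook in the text).
        for path in detect_files(record):
            if path not in self.found:
                self.found.append(path)


def on_page_rendered(extractor: FileExtractor, queue: list[str]) -> None:
    """Stand-in for the file indexer: move the collected files into the queue."""
    queue.extend(extractor.found)
    extractor.found.clear()


extractor = FileExtractor()
file_index_queue: list[str] = []

# Three content elements are rendered on the page; one file is linked twice.
extractor.on_content_object_created({"CType": "uploads", "media": "a.pdf, b.doc"})
extractor.on_content_object_created({"CType": "text", "bodytext": "no files"})
extractor.on_content_object_created({"CType": "uploads", "media": "a.pdf"})

# End of page rendering: hand everything over to the queue in one step.
on_page_rendered(extractor, file_index_queue)
print(file_index_queue)
```

The point of the split is that the per-element hook only collects, while the single call at the end of rendering decides what happens with the collected files.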
The File Index Queue is a very interesting aspect of the whole file indexer: Until now every search engine integration that provided file indexing in TYPO3 indexed files at the moment they were found on a page. As a consequence, page rendering slows down and output is blocked until the files are indexed. Depending on the number of files found, the size of the files, and the server's capacity, this synchronous file indexing process can be bad in terms of user experience. To overcome this issue we store data about the files we found in a queue, which can then be worked through by an asynchronous scheduler task. This approach allows us to render the page without blocking or slowing down output and still index the files linked on a page.
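A minimal sketch of that idea, again in Python and under stated assumptions: FileIndexQueue, add and process are hypothetical names, the queue lives in memory here whereas a real queue would be persisted, and the index_file callback stands in for the expensive extraction and indexing step.

```python
from collections import deque


class FileIndexQueue:
    """In-memory stand-in for a persistent queue of files waiting to be indexed."""

    def __init__(self) -> None:
        self._pending = deque()
        self._known = set()

    def add(self, path: str) -> bool:
        """Queue a file once; return False if it is already waiting."""
        if path in self._known:
            return False
        self._known.add(path)
        self._pending.append(path)
        return True

    def process(self, index_file, limit: int) -> int:
        """Index up to `limit` queued files; return how many were handled."""
        handled = 0
        while self._pending and handled < limit:
            path = self._pending.popleft()
            self._known.discard(path)
            index_file(path)
            handled += 1
        return handled

    def __len__(self) -> int:
        return len(self._pending)


queue = FileIndexQueue()

# During page rendering: only cheap bookkeeping, no text extraction.
for path in ["fileadmin/a.pdf", "fileadmin/b.doc", "fileadmin/a.pdf"]:
    queue.add(path)

# Later, in a scheduled run: the expensive work happens here, in small batches.
indexed = []
queue.process(indexed.append, limit=1)
print(indexed, len(queue))
```

Limiting each run to a batch keeps a single scheduler run short even when a page links many large files; the rest simply waits for the next run.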
To solve the issue of every extension handling file links differently we introduced the concept of File Detectors. The Solr extension will come with a few file detectors for popular extensions, but other extensions can provide specialized file detectors, too. Every file detector can register itself for certain content element types and for specific fields of a content element. The file detector for files managed by the fileadmin, for example, registers to listen on the "uploads" content element type and specifically analyzes the media field of that content element. When handed a matching content element record, it then knows how to find the files linked by that specific content element.
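The registration mechanism could look roughly like the following Python sketch. It is not the extension's API: register_detector, detectors_for and fileadmin_detector are hypothetical names, and both the CType key and the comma-separated media value are assumptions about what a content element record looks like.

```python
from typing import Callable

Detector = Callable[[dict], list[str]]

# Maps (content element type, field name) to the detector responsible for it.
_registry: dict[tuple[str, str], Detector] = {}


def register_detector(ctype: str, field_name: str, detector: Detector) -> None:
    """Let a detector announce which element type and field it understands."""
    _registry[(ctype, field_name)] = detector


def detectors_for(record: dict) -> list[Detector]:
    """Factory step: pick the detectors registered for this record's type and fields."""
    ctype = record.get("CType", "")
    return [d for (t, f), d in _registry.items() if t == ctype and f in record]


def fileadmin_detector(record: dict) -> list[str]:
    """Find files listed in the media field of an 'uploads' content element."""
    names = record.get("media", "").split(",")
    return [n.strip() for n in names if n.strip()]


register_detector("uploads", "media", fileadmin_detector)

record = {"CType": "uploads", "media": "manual.pdf,prices.xls"}
files = [path for detect in detectors_for(record) for path in detect(record)]
print(files)
```

Because detectors are looked up by type and field, another extension can add support for its own content elements by registering one more detector, without touching the code that walks over the records.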
In the end, to actually extract a file's textual content we use Apache Tika for TYPO3. Tika is cool because it knows about 1,200 file formats and can read about half of them. This allows us to index PDF files, Microsoft Office files including the new XML-based formats, OpenOffice.org files, and pretty much any other file format you encounter in daily life. This has not been possible until now, as you needed a special text extraction service for nearly every file format. With Tika you can replace all of them with just one extension.
For the future we want to contribute our experience with file handling in TYPO3 to the File System Abstraction Layer project to make it easier and more consistent to work with files in TYPO3. These contributions will not only benefit Apache Solr for TYPO3, but also any other search engine integration and other extensions dealing with files.
Conclusion
So indexing files in TYPO3 comes with some challenges, and we think we have found a good solution for them. The current status is that we have implemented the file detector for the fileadmin, and files are being added to the File Index Queue during frontend indexing. Next up is making the Page Index Queue index files the same way and adding support for the DAM. With the existing foundation that should be relatively easy. Stay tuned for more to come, and let us know about any questions you may have and what you think about our approach.



