16.11.2011 17:32 - By: Stefan Sprenger

Using Apache Nutch with Solr for TYPO3

In the last blog post we showed you how to set up and configure Apache Nutch for indexing external websites. That post only covered using Nutch stand-alone, so many of you have asked how to integrate it into an existing installation of Solr for TYPO3.

The problem is that Nutch cannot index into the Solr schema provided by our TYPO3 extension out of the box, because the schema requires several fields that Nutch does not set by default.

The most important field is sitehash. It usually contains an MD5 hash of the site's domain, the encryption key and the extension name. The implementation of the site hash is located in tx_solr_Site (classes/class.tx_solr_site.php):

/**
 * Generates the site's unique Site Hash.
 *
 * The Site Hash is built from the site's main domain, the system encryption
 * key, and the extension "tx_solr". These components are concatenated and
 * md5-hashed.
 *
 * @return string Site Hash.
 */
public function getSiteHash() {
    $siteHash = md5(
        $this->getDomain() .
        $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'] .
        'tx_solr'
    );

    return $siteHash;
}

Because Nutch does not provide this logic out of the box, we need to add it via a custom Nutch plugin.
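
Since Nutch plugins are written in Java, the hashing part of such a plugin could look roughly like the following. This is only a minimal sketch of the calculation shown in getSiteHash() above. The class name SiteHash and its methods are hypothetical, and the wiring into Nutch's indexing pipeline is not shown. It assumes that the domain string is exactly what getDomain() returns in TYPO3, that the encryption key has been copied from the TYPO3 configuration, and that the concatenated string is hashed as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Hypothetical helper that mirrors the PHP getSiteHash() logic.
 * Not part of Nutch or of the Solr for TYPO3 extension.
 */
final class SiteHash {

    /** Constant suffix used by the PHP implementation. */
    private static final String EXTENSION_KEY = "tx_solr";

    private SiteHash() {
    }

    /**
     * Concatenates domain, encryption key and "tx_solr" and returns the
     * MD5 hash as a 32-character lowercase hex string, like PHP's md5().
     */
    static String compute(String domain, String encryptionKey) {
        return md5Hex(domain + encryptionKey + EXTENSION_KEY);
    }

    /** Returns the MD5 digest of the input (assumed UTF-8) as lowercase hex. */
    static String md5Hex(String input) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to avoid sign extension, pad each byte to two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException("MD5 algorithm not available", e);
        }
    }
}
```

A plugin would call SiteHash.compute() with the domain and encryption key of the TYPO3 site that should be able to find the documents, and write the result to the sitehash field of each document it indexes. Both values have to match the TYPO3 side exactly, otherwise the hashes differ and the documents will not show up in that site's search results.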

The remaining fields can be set using Nutch's Solr mappings that are located in conf/solrindex-mapping.xml.

Get in touch with us if you're interested in indexing external websites with your Solr installation.
