23.09.2011 12:34 - By: Stefan Sprenger

Index external websites with Apache Nutch

Life is not always easy for search engines nowadays. They have to provide a ton of features, scale up and down, and still deliver good search results.

Apache Solr is a state-of-the-art enterprise search technology. It has proven many times that it does its job well. But what if you need a feature that Solr does not support?
Besides indexing our own TYPO3 website, we also want to index external websites. Unfortunately, Apache Solr does not support this by itself.
Fortunately, Apache Solr integrates smoothly with the rest of the Apache ecosystem.

Apache Nutch is a highly scalable web crawler that has a Solr integration. In this article I will show you how to set up, configure and use Nutch with Solr.

We will use Nutch 1.3, which supports Apache Solr 3.X.

Installation

Let's get our hands dirty. We change to the directory /opt

cd /opt

and download Apache Nutch 1.3:

wget http://apache.mirror.clusters.cc//nutch/apache-nutch-1.3-bin.tar.gz

When the download is finished we unpack the archive:

tar xvfz apache-nutch-1.3-bin.tar.gz

Configuration

We move on to the directory nutch-1.3/runtime/local

cd nutch-1.3/runtime/local

and make the nutch script executable:

chmod +x bin/nutch

Please ensure that you have set the variable JAVA_HOME. Nutch needs to know where your Java installation is located:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
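
Before going on, it can help to verify that JAVA_HOME really points at a Java installation. The following is a minimal sketch; the function name check_java_home is our own and not part of Nutch:

```shell
# Hypothetical helper (not part of Nutch): fail early if JAVA_HOME
# is unset or does not contain an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "No java executable found under $JAVA_HOME/bin" >&2
    return 1
  fi
  echo "Using Java from $JAVA_HOME"
}
```

Calling check_java_home after the export prints the path in use, or an error message if the path is wrong.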

In the next step we configure the crawler. We open the file conf/nutch-site.xml and save the following configuration: 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>Solr Nutch spider</value>
</property>
</configuration>

Next we create the folder urls. Nutch reads every file inside urls to collect the websites it should visit.
Therefore we create the file urls/seeds with the following content:

http://www.dkd.de/
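
On the command line, the folder and the seed file can be created like this. This is a small sketch that assumes we are still in the directory runtime/local and want exactly the one start URL from above:

```shell
# Create the seed directory; -p avoids an error if it already exists.
mkdir -p urls
# Write the seed list, one start URL per line.
printf '%s\n' 'http://www.dkd.de/' > urls/seeds
# Print the file to double-check what Nutch will read.
cat urls/seeds
```

Further start URLs can be added as additional lines in the same file.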

Nutch ships with its own Solr schema, located in conf/schema.xml. We copy the schema to our Solr installation after making a small fix.
We need to change the line

<field name="content" type="text" stored="false" indexed="true"/>

to

<field name="content" type="text" stored="true" indexed="true"/>

Setting the attribute stored to true makes Solr save the crawled website's content in the index, so that it can be returned together with the search results.
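
If you prefer to script the schema fix instead of editing the file by hand, a sed one-liner is enough. The sketch below assumes the content field is declared on a single line exactly as shown above; the function name enable_stored_content is our own, and /path/to/solr/conf is only a placeholder for your Solr configuration directory:

```shell
# Hypothetical helper: flip stored="false" to stored="true" on the
# "content" field only. sed keeps the original file as <file>.bak.
enable_stored_content() {
  sed -i.bak '/<field name="content"/ s/stored="false"/stored="true"/' "$1"
}

# Example usage (paths are assumptions, adjust them to your setup):
#   enable_stored_content conf/schema.xml
#   cp conf/schema.xml /path/to/solr/conf/schema.xml
```

Other fields that use stored="false" are left untouched, because the substitution only runs on the line that declares the content field.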

Usage

After finishing the configuration we are ready to enjoy the power of Nutch.

To crawl the configured website we use the command crawl. We pass it the URL of our running Solr server and the depth of links to follow:

bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 2

The command supports the following options:

  • The option solr defines the Solr server that receives the crawled pages.
  • The option depth defines the depth of links to follow.
  • The option topN limits the number of pages that are fetched at each depth level.
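
To see how the three options fit together, here is a small wrapper sketch. The function run_crawl and the variable DRY_RUN are our own inventions for illustration; only the bin/nutch crawl call with the options solr, depth and topN comes from Nutch itself:

```shell
# Hypothetical wrapper: run_crawl <solr-url> <depth> <topN>
# With DRY_RUN set, the command is only printed, not executed.
run_crawl() {
  solr_url=$1
  depth=$2
  top_n=$3
  # Assemble the crawl command from this article plus a topN limit.
  set -- bin/nutch crawl urls -solr "$solr_url" -depth "$depth" -topN "$top_n"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Example usage (the topN value 50 is an arbitrary assumption):
#   DRY_RUN=1 run_crawl http://127.0.0.1:8983/solr/ 2 50
```

Printing the command first is a cheap way to check the arguments before starting a longer crawl.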

You can view the indexed websites using the Solr admin interface. If you do not want to search external websites via your TYPO3 website, we can provide alternative solutions like Tempo.
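
Besides the admin interface, the index can also be checked from the command line. The sketch below only builds a query URL for Solr's standard select handler; the function name solr_select_url is our own, and the field names url and title are assumed to exist as in the schema shipped with Nutch:

```shell
# Hypothetical helper: solr_select_url <solr-base-url> <rows>
# Builds a query URL that matches all documents and returns only the
# url and title fields as JSON. The base URL must end with a slash.
solr_select_url() {
  printf '%sselect?q=*:*&fl=url,title&rows=%s&wt=json\n' "$1" "$2"
}

# Example usage, assuming curl is installed and Solr is running:
#   curl -s "$(solr_select_url http://127.0.0.1:8983/solr/ 5)"
```

If the crawl worked, the response should list pages from the crawled website.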

Conclusion

Apache Nutch provides an easy-to-use solution for crawling and indexing external websites. It integrates well with Apache Solr.

This use case shows the great flexibility of the Apache ecosystem once again. If you want to search external websites using your Solr installation, or are interested in an alternative solution for displaying search results, get in contact with us!

Resources

Comments

Philipp, 23.09.11 21:58:
The third link is not working :(

Thanks for this great article.

Stefan, 26.09.11 13:23:
Hi Philipp,

thank you, I've fixed the link.

Heiko, 28.03.12 18:49:
Hi, does this short tutorial work with nutch-1.4?
I always get a "NoURLs to fetch - check your seed list and URL filters." error.
My seed list and the url filter are similar to the tutorial.