Nutch 1.7 introduced the option to integrate with Elasticsearch. Setting up that integration, however, turned out to be quite a treasure hunt for me. For anybody else who wants the same result without tearing out as much hair as I did, here are some simple instructions that will hopefully help you get Nutch talking to Elasticsearch.
I’m assuming you have both Nutch and Elasticsearch running fine on their own, by which I mean that Nutch does its crawl, fetch, parse thing and Elasticsearch does its indexing and searching magic, just not yet together.
All of the work is on the Nutch side: you need to edit nutch-site.xml in the conf directory to get things going. First, activate the Elasticsearch indexer plugin by adding the following property to nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
The part to add to the value here is indexer-elastic. Second, add the following elements to nutch-site.xml:
<!-- Elasticsearch properties -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
  <description>The hostname to send documents to using TransportClient. Either host and port must be defined or cluster.</description>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description></description>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
  <description>Default index to send documents to.</description>
</property>
<property>
  <name>elastic.max.bulk.docs</name>
  <value>250</value>
  <description>Maximum size of the bulk in number of documents.</description>
</property>
<property>
  <name>elastic.max.bulk.size</name>
  <value>2500500</value>
  <description>Maximum size of the bulk in bytes.</description>
</property>
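If you want to double-check your edits, here is a minimal Python sketch that looks for the settings discussed on this page. It is not part of Nutch: the function names are my own invention, and it only inspects the XML text you hand it, so it knows nothing about defaults that come from nutch-default.xml.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a {name: value} dict built from the <property> elements."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_elastic_settings(xml_text):
    """Return a list of problems found; an empty list means nothing obvious is wrong.

    Hypothetical helper for illustration only. The checks mirror this post,
    not any official Nutch validation.
    """
    props = read_properties(xml_text)
    problems = []
    if "indexer-elastic" not in props.get("plugin.includes", ""):
        problems.append("plugin.includes does not mention indexer-elastic")
    has_host_port = bool(props.get("elastic.host")) and bool(props.get("elastic.port"))
    has_cluster = bool(props.get("elastic.cluster"))
    if not (has_host_port or has_cluster):
        problems.append("set elastic.host and elastic.port, or elastic.cluster")
    if props.get("elastic.port") == "9200":
        problems.append("elastic.port is 9200 (HTTP); the transport port is usually 9300")
    if not props.get("elastic.index"):
        problems.append("elastic.index is not set")
    return problems


# A trimmed-down stand-in for nutch-site.xml, just to show the call.
SAMPLE = """<configuration>
  <property><name>plugin.includes</name><value>protocol-http|indexer-elastic</value></property>
  <property><name>elastic.cluster</name><value>elasticsearch</value></property>
  <property><name>elastic.index</name><value>nutch</value></property>
</configuration>"""

print(check_elastic_settings(SAMPLE))  # prints []
```

To run it against your real configuration, pass the contents of conf/nutch-site.xml instead of SAMPLE.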
Please adapt the settings to your own situation. In my case I run Elasticsearch on the same box, hence elastic.host is localhost for me. Another important setting is elastic.cluster: if you don’t know the cluster name (anymore), you can find it as cluster.name in the elasticsearch.yml file in the config directory of your Elasticsearch installation. The elastic.port setting is 9300 by default, which is the transport port Nutch talks to (the HTTP interface runs on port 9200; don’t use that for the Nutch integration). Lastly, create an index in Elasticsearch and use that index name for elastic.index in the configuration (I used the index name nutch in the conf file).
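For the index creation step, here is a small Python sketch that sends a PUT request for the index name to the HTTP interface. Note that this goes to port 9200, not to the 9300 port that Nutch uses. The function names are hypothetical, and I’m assuming Elasticsearch listens on localhost:9200 without authentication, so adjust to your own setup.

```python
import urllib.request


def build_create_index_request(index, host="localhost", http_port=9200):
    """Build (but do not send) a PUT request that asks Elasticsearch to create an index."""
    url = "http://{}:{}/{}".format(host, http_port, index)
    return urllib.request.Request(url, method="PUT")


def create_index(index, host="localhost", http_port=9200):
    """Send the request and return the raw response body as text.

    Assumes an unauthenticated Elasticsearch HTTP endpoint; urlopen raises
    an exception if the server is unreachable or answers with an error.
    """
    request = build_create_index_request(index, host, http_port)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling create_index("nutch") should create the index named in elastic.index above. If the index already exists, expect Elasticsearch to answer with an error, which surfaces here as an exception.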
That’s it. It is quite simple, but with almost no documentation available it took quite a bit of work to puzzle out, especially since what documentation there is turns out to be outdated (like references to conf/elasticsearch.conf, which isn’t needed) or to deal with Nutch 2.x (which, by the way, I didn’t get to work in any sane fashion). Anyway, that’s me grumbling; in the end I felt quite exhilarated that I got the damn thing to work.