Nutch Development (Part 1)
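Contents
Development environment
1. Importing the Nutch project into IDEA
2. The Nutch source tree
3. Nutch crawl steps
4. The launcher classes
5. The Nutch sh script
6. Running the Injector
6.1 Configuration
6.2 Creating a URL list
6.3 Creating the run configuration in IDEA
6.4 Equivalent command
7. The Injector main function
8. Running the Generator
8.1 Creating the run configuration in IDEA
8.2 Equivalent command
9. Running the Fetcher
9.1 Creating the run configuration in IDEA
9.2 Error analysis
9.3 Configuring http.agent.name
9.4 Equivalent command
10. Running ParseSegment
10.1 Creating the run configuration in IDEA
10.2 Equivalent command
11. Running CrawlDb
11.1 Creating the run configuration in IDEA
11.2 Equivalent command
12. Running LinkDb
12.1 Creating the run configuration in IDEA
12.2 Equivalent command
Next chapter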
Development environment
Linux (Ubuntu 20.04 LTS), IDEA, Nutch 1.18, Solr 8.11
Please credit the source when reposting. By 鸭梨的药丸哥
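1. Importing the Nutch project into IDEA
To develop against Nutch, it is best to download the Nutch source code as well. Get the source package from the official site.
Download address for version 1.18: https://www.apache.org/dyn/closer.lua/nutch/1.18/apache-nutch-1.18-src.tar.gz
I downloaded the source package on Linux, because many Nutch commands need to run on Linux. For convenience, I develop Nutch plugins on Linux as well.
Before building the source, make sure ant is installed. It can be installed with the following commands:
    sudo apt-get update
    sudo apt-get install ant
Build Nutch as an Eclipse project:
    ant eclipse
Then import the project into IDEA as an Eclipse project. There are plenty of guides for this online; import the Nutch source project in the usual way and choose the Eclipse project model when prompted.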
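2. The Nutch source tree
The directory layout produced by building the Nutch source differs slightly from the layout of the downloaded binary package.
    build/   # generated by the ant eclipse build
    conf/    # configuration files
    docs/    # API documentation
    ivy/     # folder for the Ivy dependency manager
    lib/     # placeholder folder for the Hadoop native libraries (not downloaded automatically; they speed up data (de)compression)
    src/     # source code
3. Nutch crawl steps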
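The whole Nutch crawl is split into many steps:
injector -> generator -> fetcher -> parseSegment -> updateCrawlDB -> Invert links -> Index -> DeleteDuplicates -> IndexMerger
1) Build the initial URL set.
2) Inject the URL set into the crawldb (inject).
3) Create a fetch list from the crawldb (generate).
4.1) Run the fetch and retrieve the page content (fetch).
4.2) Run the parse and extract information from the fetched pages (parse).
5) Update the database with the information obtained from the pages (updatedb).
6) Repeat steps 3 to 5 until the preset crawl depth is reached. This loop is known as the "generate/fetch/update" cycle.
7) Update the linkdb from the contents of the segments (invertlinks).
8) Build the index (index), for example in Solr.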
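The order of these steps can be sketched as a small Java program that only builds the sequence of nutch commands for a given crawl depth; it does not run Nutch. The class name CrawlPlan, the relative paths, and the "<segment-N>" placeholder are assumptions made for illustration. The arguments mirror the ones used later in this article, and the index step is left out because its arguments are not covered here.

```java
import java.util.ArrayList;
import java.util.List;

final class CrawlPlan {
    /**
     * Builds the ordered list of nutch invocations for a crawl of the given depth.
     * "<segment-N>" is a placeholder for the timestamped directory that generate creates.
     */
    static List<String> plan(String crawlDb, String urlDir, String segmentsDir,
                             String linkDb, int depth) {
        List<String> steps = new ArrayList<>();
        // Step 2: inject the seed URLs once.
        steps.add("inject " + crawlDb + " " + urlDir);
        // Steps 3 to 5: the generate/fetch/update cycle, repeated per round.
        for (int round = 1; round <= depth; round++) {
            String segment = segmentsDir + "/<segment-" + round + ">";
            steps.add("generate " + crawlDb + " " + segmentsDir + " -topN 100");
            steps.add("fetch " + segment + " -threads 16");
            steps.add("parse " + segment);
            steps.add("updatedb " + crawlDb + " -dir " + segmentsDir);
        }
        // Step 7: invert links after the last round.
        steps.add("invertlinks " + linkDb + " -dir " + segmentsDir);
        return steps;
    }

    public static void main(String[] args) {
        plan("myNutch/crawldb", "urls", "myNutch/segments", "myNutch/linkdb", 2)
            .forEach(System.out::println);
    }
}
```

With a depth of 2 the plan holds one inject, two rounds of four commands each, and one invertlinks.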
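An architecture diagram drawn by the Nutch author. It shows an older version of the architecture, from before Nutch had split out its full-text search functionality.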
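4. The launcher classes
The main launcher classes are the ones that the nutch script dispatches to, as listed in the next section.
5. The Nutch sh script
The Nutch sh script shows that, at its core, the nutch command simply invokes a specific launcher class to do the work.
The excerpt below shows that each COMMAND maps to a launcher class, and that the command-line arguments are then passed on to that class.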
    # figure out which class to run
    if [ "$COMMAND" = "crawl" ] ; then
      echo "Command $COMMAND is deprecated, please use bin/crawl instead"
      exit -1
    elif [ "$COMMAND" = "inject" ] ; then
      CLASS=org.apache.nutch.crawl.Injector
    elif [ "$COMMAND" = "generate" ] ; then
      CLASS=org.apache.nutch.crawl.Generator
    elif [ "$COMMAND" = "freegen" ] ; then
      CLASS=org.apache.nutch.tools.FreeGenerator
    elif [ "$COMMAND" = "fetch" ] ; then
      CLASS=org.apache.nutch.fetcher.Fetcher
    elif [ "$COMMAND" = "parse" ] ; then
      CLASS=org.apache.nutch.parse.ParseSegment
    elif [ "$COMMAND" = "readdb" ] ; then
      CLASS=org.apache.nutch.crawl.CrawlDbReader
    elif [ "$COMMAND" = "mergedb" ] ; then
      CLASS=org.apache.nutch.crawl.CrawlDbMerger
    elif [ "$COMMAND" = "readlinkdb" ] ; then
      CLASS=org.apache.nutch.crawl.LinkDbReader
    elif [ "$COMMAND" = "readseg" ] ; then
      CLASS=org.apache.nutch.segment.SegmentReader
    elif [ "$COMMAND" = "mergesegs" ] ; then
      CLASS=org.apache.nutch.segment.SegmentMerger
    elif [ "$COMMAND" = "updatedb" ] ; then
      CLASS=org.apache.nutch.crawl.CrawlDb
    elif [ "$COMMAND" = "invertlinks" ] ; then
      CLASS=org.apache.nutch.crawl.LinkDb
    elif [ "$COMMAND" = "mergelinkdb" ] ; then
      CLASS=org.apache.nutch.crawl.LinkDbMerger
    elif [ "$COMMAND" = "dump" ] ; then
      CLASS=org.apache.nutch.tools.FileDumper
    elif [ "$COMMAND" = "commoncrawldump" ] ; then
      CLASS=org.apache.nutch.tools.CommonCrawlDataDumper
    elif [ "$COMMAND" = "solrindex" ] ; then
      CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url="
      shift
    elif [ "$COMMAND" = "index" ] ; then
      CLASS=org.apache.nutch.indexer.IndexingJob
    elif [ "$COMMAND" = "solrdedup" ] ; then
      echo "Command $COMMAND is deprecated, please use dedup instead"
      exit -1
    elif [ "$COMMAND" = "dedup" ] ; then
      CLASS=org.apache.nutch.crawl.DeduplicationJob
    elif [ "$COMMAND" = "solrclean" ] ; then
      CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url= "
      shift; shift
    elif [ "$COMMAND" = "clean" ] ; then
      CLASS=org.apache.nutch.indexer.CleaningJob
    elif [ "$COMMAND" = "parsechecker" ] ; then
      CLASS=org.apache.nutch.parse.ParserChecker
    elif [ "$COMMAND" = "indexchecker" ] ; then
      CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
    elif [ "$COMMAND" = "filterchecker" ] ; then
      CLASS=org.apache.nutch.net.URLFilterChecker
    elif [ "$COMMAND" = "normalizerchecker" ] ; then
      CLASS=org.apache.nutch.net.URLNormalizerChecker
    elif [ "$COMMAND" = "domainstats" ] ; then
      CLASS=org.apache.nutch.util.domain.DomainStatistics
    elif [ "$COMMAND" = "protocolstats" ] ; then
      CLASS=org.apache.nutch.util.ProtocolStatusStatistics
    elif [ "$COMMAND" = "crawlcomplete" ] ; then
      CLASS=org.apache.nutch.util.CrawlCompletionStats
    elif [ "$COMMAND" = "webgraph" ] ; then
      CLASS=org.apache.nutch.scoring.webgraph.WebGraph
    elif [ "$COMMAND" = "linkrank" ] ; then
      CLASS=org.apache.nutch.scoring.webgraph.LinkRank
    elif [ "$COMMAND" = "scoreupdater" ] ; then
      CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
    elif [ "$COMMAND" = "nodedumper" ] ; then
      CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
    elif [ "$COMMAND" = "plugin" ] ; then
      CLASS=org.apache.nutch.plugin.PluginRepository
    elif [ "$COMMAND" = "junit" ] ; then
      CLASSPATH="$CLASSPATH:$NUTCH_HOME/test/classes/"
      if $local; then
        for f in "$NUTCH_HOME"/test/lib/*.jar; do
          CLASSPATH="${CLASSPATH}:$f";
        done
      fi
      CLASS=org.junit.runner.JUnitCore
    elif [ "$COMMAND" = "startserver" ] ; then
      CLASS=org.apache.nutch.service.NutchServer
    elif [ "$COMMAND" = "webapp" ] ; then
      CLASS=org.apache.nutch.webui.NutchUiServer
    elif [ "$COMMAND" = "warc" ] ; then
      CLASS=org.apache.nutch.tools.warc.WARCExporter
    elif [ "$COMMAND" = "updatehostdb" ] ; then
      CLASS=org.apache.nutch.hostdb.UpdateHostDb
    elif [ "$COMMAND" = "readhostdb" ] ; then
      CLASS=org.apache.nutch.hostdb.ReadHostDb
    elif [ "$COMMAND" = "sitemap" ] ; then
      CLASS=org.apache.nutch.util.SitemapProcessor
    elif [ "$COMMAND" = "showproperties" ] ; then
      CLASS=org.apache.nutch.tools.ShowProperties
    else
      CLASS=$COMMAND
    fi
6. Running the Injector
The main function for inject is in the Injector class of the org.apache.nutch.crawl package.
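The lookup from command name to launcher class, which the script excerpt performs with an if/elif chain, can be sketched in Java as a map with a fallback. This is only an illustration of the dispatch idea, not code from Nutch: the class name NutchCommandTable is hypothetical, and the table covers only the commands used in this article.

```java
import java.util.Map;

final class NutchCommandTable {
    // Subset of the command-to-class table from the nutch script.
    static final Map<String, String> CLASSES = Map.of(
        "inject", "org.apache.nutch.crawl.Injector",
        "generate", "org.apache.nutch.crawl.Generator",
        "fetch", "org.apache.nutch.fetcher.Fetcher",
        "parse", "org.apache.nutch.parse.ParseSegment",
        "updatedb", "org.apache.nutch.crawl.CrawlDb",
        "invertlinks", "org.apache.nutch.crawl.LinkDb",
        "index", "org.apache.nutch.indexer.IndexingJob");

    /** Mirrors the final else branch of the script: an unknown command is used as the class name. */
    static String resolve(String command) {
        return CLASSES.getOrDefault(command, command);
    }

    public static void main(String[] args) {
        System.out.println(resolve("inject"));
        System.out.println(resolve("com.example.MyTool")); // hypothetical class name, passed through
    }
}
```

This is why the Main Class fields in the IDEA run configurations below can be read directly off the script.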
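6.1 Configuration
To run inject, first add a plugin.folders setting to apache-nutch-1.18/conf/nutch-site.xml so that it overrides the default relative path. This is needed because the working directory when running through the nutch script differs from the one used when running directly from source.
    <property>
      <name>plugin.folders</name>
      <value>/home/liangwy/IdeaProjects/apache-nutch-1.18/src/plugin</value>
      <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
    </property>
6.2 Creating a URL list
    mkdir urls
    touch urls/seeds.txt
    vim urls/seeds.txt   # then enter the first batch of URLs to crawl
6.3 Creating the run configuration in IDEA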
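From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
    Name: Injector
    Main Class: org.apache.nutch.crawl.Injector (the main class in 1.x; check the source for the exact name, in 2.x it is called InjectorJob)
    VM options: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
    Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/apache-nutch-1.18/urls (the directory that holds the seed file seeds.txt)
Note: the Program arguments are the same as the arguments you would pass to the script provided by Nutch.
6.4 Equivalent command
Running this configuration is equivalent to using the nutch command in the bin/ directory of the binary Nutch distribution.
    ./nutch inject /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/apache-nutch-1.18/urls
7. The Injector main function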
The main function of Injector is as follows:
    public static void main(String[] args) throws Exception {
      int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
      System.exit(res);
    }
Injector is run through ToolRunner. Opening ToolRunner's run function shows that the method actually invoked in the end is Injector's own run function.
Method parameters:
    Configuration conf   # the Nutch configuration
    Tool tool            # the tool class to run (e.g. Injector, Generator)
    String[] args        # the command-line arguments passed to the tool
    public static int run(Configuration conf, Tool tool, String[] args) throws Exception {
      if (CallerContext.getCurrent() == null) {
        CallerContext ctx = (new Builder("CLI")).build();
        CallerContext.setCurrent(ctx);
      }
      if (conf == null) {
        conf = new Configuration();
      }
      // parse the generic options into the configuration
      GenericOptionsParser parser = new GenericOptionsParser(conf, args);
      tool.setConf(conf);
      String[] toolArgs = parser.getRemainingArgs();
      // the actual work is still done by the tool's own run
      return tool.run(toolArgs);
    }
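The delegation pattern in ToolRunner.run can be tried out without Hadoop on the classpath using the simplified stand-in below. Everything in it is an assumption made for illustration: MiniTool replaces org.apache.hadoop.util.Tool, a plain Map replaces Configuration, and the option handling covers only the "-D key=value" form rather than everything GenericOptionsParser accepts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToolRunnerSketch {
    /** Stand-in for Hadoop's Tool interface, reduced to what the sketch needs. */
    interface MiniTool {
        void setConf(Map<String, String> conf);
        int run(String[] args);
    }

    /** Consumes "-D key=value" pairs into conf, then delegates to the tool's own run. */
    static int run(Map<String, String> conf, MiniTool tool, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
        tool.setConf(conf);
        return tool.run(remaining.toArray(new String[0]));
    }

    /** Hypothetical tool that records what it receives and expects two arguments. */
    static final class EchoTool implements MiniTool {
        Map<String, String> conf;
        String[] seenArgs;

        public void setConf(Map<String, String> conf) { this.conf = conf; }

        public int run(String[] args) {
            this.seenArgs = args;
            return args.length == 2 ? 0 : -1;
        }
    }

    public static void main(String[] args) {
        EchoTool tool = new EchoTool();
        int res = run(new HashMap<>(), tool,
            new String[] {"-D", "http.agent.name=test", "myNutch/crawldb", "urls"});
        System.out.println("exit code " + res + ", tool saw " + tool.seenArgs.length + " args");
    }
}
```

The point of the pattern is that the tool never sees the generic options: they end up in the configuration, and only the remaining arguments reach its run method.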
8. Running the Generator
8.1 Creating the run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
    Name: Generator
    Main Class: org.apache.nutch.crawl.Generator
    Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100
Note: the Program arguments are the same as the arguments you would pass to the script provided by Nutch.
8.2 Equivalent command
Running this configuration is equivalent to using the nutch command in the bin/ directory of the binary Nutch distribution.
    ./nutch generate /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100
9. Running the Fetcher
9.1 Creating the run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
    Name: Fetcher
    Main Class: org.apache.nutch.fetcher.Fetcher
    Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16
Note: the Program arguments are the same as the arguments you would pass to the script provided by Nutch.
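The Fetcher takes one segment directory, here the timestamp-named 20220114175955 that the generate step created under segments. A small helper can pick the newest segment instead of copying the name by hand. The sketch below is an assumption-laden illustration, not part of Nutch: the class name LatestSegment is hypothetical, and it relies on segment names being timestamps so that the lexicographically largest name is the most recent one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

final class LatestSegment {
    /** Returns the subdirectory of segmentsDir with the largest name, if there is one. */
    static Optional<Path> find(Path segmentsDir) {
        try (Stream<Path> children = Files.list(segmentsDir)) {
            return children
                .filter(Files::isDirectory)
                .max(Comparator.comparing((Path p) -> p.getFileName().toString()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("usage: LatestSegment <segments-dir>");
            return;
        }
        System.out.println(find(Path.of(args[0]))
            .map(Path::toString)
            .orElse("no segment found"));
    }
}
```

The printed path can then be pasted into the Program arguments of the Fetcher and ParseSegment run configurations.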
9.2 Error analysis
The run fails because http.agent.name is not configured. This property can be set in conf/nutch-site.xml.
    Fetcher: No agents listed in 'http.agent.name' property.
    Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
        at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:563)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:431)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:545)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:518)
9.3 Configuring http.agent.name
Add the following configuration to conf/nutch-site.xml:
    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version and set their values appropriately.</description>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43,*</value>
    </property>
9.4 Equivalent command
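Running this configuration is equivalent to using the nutch command in the bin/ directory of the binary Nutch distribution.
    ./nutch fetch /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16
10. Running ParseSegment
10.1 Creating the run configuration in IDEA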
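From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
    Name: ParseSegment
    Main Class: org.apache.nutch.parse.ParseSegment
    Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955
Note: the Program arguments are the same as the arguments you would pass to the script provided by Nutch.
10.2 Equivalent command
Running this configuration is equivalent to using the nutch command in the bin/ directory of the binary Nutch distribution.
    ./nutch parse /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955
11. Running CrawlDb
11.1 Creating the run configuration in IDEA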
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
    Name: CrawlDb
    Main Class: org.apache.nutch.crawl.CrawlDb
    Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments
Note: the Program arguments are the same as the arguments you would pass to the script provided by Nutch.
11.2 Equivalent command
Running this configuration is equivalent to using the nutch command in the bin/ directory of the binary Nutch distribution.
    ./nutch updatedb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments
12. Running LinkDb
12.1 Creating the run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button, and create an Application:
    Name: LinkDb
    Main Class: org.apache.nutch.crawl.LinkDb
    Program arguments: /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/
Note: the Program arguments are the same as the arguments you would pass to the script provided by Nutch.
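What invertlinks computes can be pictured with a toy in-memory version: given the outlinks found on each page, build for every target URL the list of pages that link to it. This is a conceptual sketch only; it is not how LinkDb is implemented (that runs as a Hadoop job over the segments), and the class name InvertLinksSketch and the sample URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertLinksSketch {
    /** Turns a page -> outlinks map into a target -> inlinks map. */
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : outlinks.entrySet()) {
            for (String target : page.getValue()) {
                // Record the current page as a source of a link to target.
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(page.getKey());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new LinkedHashMap<>();
        outlinks.put("http://a.example/", List.of("http://b.example/", "http://c.example/"));
        outlinks.put("http://b.example/", List.of("http://c.example/"));
        invert(outlinks).forEach((url, sources) -> System.out.println(url + " <- " + sources));
    }
}
```

Pages that nobody links to, such as the seed in this example, do not appear in the inverted map at all.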
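12.2 Equivalent command
Running this configuration is equivalent to using the nutch command in the bin/ directory of the binary Nutch distribution.
    ./nutch invertlinks /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/
Next chapter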
The next chapter shows how to combine these steps into a single workflow.
Feel free to share. When reposting, please credit the source: 内存溢出