SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER

Authors: Joo Yong Lee (School of Computing, Soongsil University, Seoul, Korea; jylee@comp.ssu.ac.kr); Sang Ho Lee; Yanggon Kim
Conference: International Joint Conference on e-Business and Telecommunications (ICETE 2007) - Second International Conference on E-Business (ICE-B 2007)
Conference dates: July 28-31, 2007
Conference location: Barcelona, Spain
Publication year: 2007
ISBN: 978-989-8111-11-1
Pages: 86-91
Total pages: 6
Holding number: hyw01663(1)
Classification number: F713.36-53/I61/(2007)
Keywords: Web crawler; Parallel crawler; Scalability; Web database
Chinese translation (reference):
Language: eng
Abstract: As the size of the Web grows, it becomes increasingly important to parallelize a crawling process in order to complete downloading pages in a reasonable amount of time. This paper presents the design and implementation of an effective parallel web crawler. We first present various design choices and strategies for a parallel web crawler, and describe our crawler's architecture and implementation techniques. In particular, we investigate the URL distributor for URL balancing and the scalability of our crawler.