Common Crawl

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and provides its archives and datasets to the public free of charge. Its web archive consists of hundreds of terabytes of data from several billion webpages, and the organization completes four crawls a year. Common Crawl was founded in 2007 by Gil Elbaz. Advisors to the nonprofit include Peter Norvig and Joi Ito. The organization’s crawlers respect nofollow and robots.txt policies. Open-source code for processing Common Crawl’s datasets is publicly available.
