Common Crawl: An Open Repository of Web Data

  • Category: Technology

  • 1. Common Crawl: An Open Repository of Web Data. Lisa Green, 10 October 2012. London HUG: What Does The Data World Mean to Society? Lisa Green, 1 October 2012
  • 2. Photo license: Public Domain Origin:
  • 3. Photo license: CC-BY-SA Origin:
  • 4. Image license: CC-BY Origin:
  • 5. Still Nascent: even cheaper storage; even cheaper compute; education; open data. Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  • 6. Gratis, Proprietary, Libre, Commercial
  • 7. Progress, Insight, Analysis, Data
  • 8. Gil Elbaz
  • 9. Common Crawl Data: ~8 billion web pages; ~120 TB; 2008-2012; ARC files, JSON metadata, text files; available to anyone
  • 10. ARC Files: raw content. Metadata: status information; HTTP response code; file names & offsets of ARC files; HTML title; HTML meta tags; RSS/Atom information; all anchors/hyperlinks. Text Files: text only
  • 11. Change between 2010 and 2012: URLs with embedded data +6%; Microdata +14%; RDFa +26%
  • 12. 22% of Web pages contain Facebook URLs; 8% of Web pages implement Open Graph tags
  • 13. http://wikientities.appspot.com A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR. Explicit Topic Modeling: given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia. Given a concept represented in Wikipedia, it can tell which terms people most commonly use to describe the concept.
  • 14. Mapping French websites related to Open Data
  • 15. Other Use Examples: Apache Giraph testing; Maplight; Tineye; Factual; sentiment analysis projects
  • 16. In Development: N-gram and link graph extracts; Pig reader; more frequent full crawls; focused subset crawls at high frequency; Open Educational Resources
  • 17. Thank You. @commoncrawl. Lisa Green, @boudicca. London HUG: What Does The Data World Mean to Society? Lisa Green, 1 October 2012
  • Description
    Talk given by Lisa Green from the Common Crawl Foundation at the Hadoop User Group UK meetup on 10 October 2012 in London.
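
Slide 13 describes a corpus of anchor text, Wikipedia concept, and count triples built from the crawl. The slides do not say how that corpus was produced, so the following is only a minimal sketch of the general idea, using the Python standard library. The class name `WikipediaAnchorCounter`, the function `count_anchor_concepts`, and the rule that a concept is the title part of a `/wiki/<Title>` URL are all assumptions made for illustration, not details from the talk.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaAnchorCounter(HTMLParser):
    """Counts (anchor text, Wikipedia concept) pairs found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._concept = None  # concept of the <a> currently open, if any
        self._text = []       # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        self._concept = None  # forget any earlier, unclosed link
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        host = parsed.netloc.lower()
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        # Assumption: the concept is the title part of a /wiki/<Title> URL.
        if is_wikipedia and parsed.path.startswith("/wiki/"):
            self._concept = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            # Normalise whitespace and case so "Hadoop" and "hadoop" merge.
            anchor = " ".join("".join(self._text).split()).lower()
            if anchor:
                self.counts[(anchor, self._concept)] += 1
            self._concept = None


def count_anchor_concepts(pages):
    """Sums (anchor text, concept) counts over an iterable of HTML strings."""
    total = Counter()
    for html in pages:
        parser = WikipediaAnchorCounter()  # fresh parser state per page
        parser.feed(html)
        parser.close()
        total.update(parser.counts)
    return total


if __name__ == "__main__":
    sample_pages = [
        '<p>See <a href="http://en.wikipedia.org/wiki/Apache_Hadoop">Hadoop</a>.</p>',
        '<a href="https://en.wikipedia.org/wiki/Apache_Hadoop">hadoop</a>',
    ]
    for (anchor, concept), n in count_anchor_concepts(sample_pages).most_common():
        print(anchor, concept, n)
```

Grouping the resulting counts by concept gives the terms most often used to describe it, and grouping by anchor text gives candidate concepts for a term. At crawl scale the same counting step would presumably run as a distributed job rather than in a single process.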