Common Crawl: An Open Repository of Web Data

Technology

huguk
  • 1. London HUG. Common Crawl: An Open Repository of Web Data. What Does The Data World Mean to Society? Lisa Green, 10 October 2012
  • 2. Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
  • 3. Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
  • 4. Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  • 5. Still Nascent • Even cheaper storage • Even cheaper compute • Education • Open Data. Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  • 6. Gratis • Proprietary • Libre • Commercial
  • 7. Progress • Insight • Analysis • Data
  • 8. Gil Elbaz
  • 9. Common Crawl Data • ~8 billion web pages • ~120 TB • 2008-2012 • ARC files, JSON metadata, text files • Available to anyone
  • 10. ARC Files - Raw Content. Metadata: • Status information • HTTP response code • File names & offsets of ARC files • HTML title • HTML meta tags • RSS/Atom information • All anchors/hyperlinks. Text Files - Text Only. http://commoncrawl.org/get-started
  • 11. Change between 2010 and 2012 • URLs with embedded data +6% • Microdata +14% • RDFa +26%. http://webdatacommons.org
  • 12. 22% of Web pages contain Facebook URLs • 8% of Web pages implement Open Graph tags
  • 13. http://wikientities.appspot.com. A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR. Explicit Topic Modeling: • Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts. • Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
  • 14. Mapping French websites related to Open Data
  • 15. Other Use Examples • Apache Giraph Testing • Maplight • Tineye • Factual • Sentiment Analysis Projects
  • 16. In Development • N-gram and Link Graph Extracts • Pig Reader • More Frequent Full Crawls • Focused Subset Crawls at High Frequency • Open Educational Resources
  • 17. Thank You. Lisa Green, lisa@commoncrawl.org, www.commoncrawl.org, @commoncrawl. London HUG: What Does The Data World Mean to Society? Lisa Green, @boudicca, 1 October 2012
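
Slides 9 and 10 describe the raw crawl content as ARC files. As a rough illustration of what reading one involves, the sketch below parses a single record, assuming the ARC version 1 layout in which each record starts with a header line of five space-separated fields (URL, IP address, archive date, content type, length) followed by that many bytes of content. The names `ArcRecord` and `read_arc_record` and the sample data are hypothetical, not part of any Common Crawl tooling, and the sketch ignores the gzip compression that real ARC files typically use.

```python
import io
from typing import BinaryIO, NamedTuple, Optional


class ArcRecord(NamedTuple):
    """One ARC v1 record (hypothetical helper type for this sketch)."""
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    length: int
    body: bytes


def read_arc_record(stream: BinaryIO) -> Optional[ArcRecord]:
    """Read the next record from an uncompressed ARC v1 stream, or None at the end."""
    line = stream.readline()
    # Records are separated by a blank line; skip any separators.
    while line in (b"\n", b"\r\n"):
        line = stream.readline()
    if not line:
        return None  # end of stream

    fields = line.decode("utf-8", errors="replace").rstrip("\r\n").split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 header fields, got {len(fields)}")
    url, ip_address, archive_date, content_type, length_text = fields

    length = int(length_text)
    body = stream.read(length)
    if len(body) != length:
        raise ValueError("record body is shorter than the declared length")
    return ArcRecord(url, ip_address, archive_date, content_type, length, body)


# Made-up sample record: the body holds the raw HTTP response for the page.
payload = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html><title>Hi</title></html>"
)
header = (
    f"http://example.com/ 192.0.2.1 20120101000000 text/html {len(payload)}\n"
).encode("ascii")
sample = io.BytesIO(header + payload + b"\n")

record = read_arc_record(sample)
print(record.url, record.content_type, record.length)
```

Because the header declares the content length up front, a reader that already knows a record's byte offset (the metadata on slide 10 lists file names and offsets of ARC files) could seek straight to that record instead of scanning the whole file.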
    Description
    Talk given by Lisa Green from the Common Crawl Foundation at the Hadoop User Group UK meetup on 10 October in London