2010-04-21 Can we trust web page meta data?

Speaker: Anders Ardö, EIT

Anders Ardö presents recent research on how trustworthy the metadata found in Web pages is. It is a statistical study of embedded metadata in a sample of more than 4 million HTML Web pages. The study tries to determine and quantify the validity of this metadata. Of particular interest is whether it is trustworthy enough for determining the topic of a Web page. Datasets are collected by a Web crawler running both as a general and as a focused crawler. The metadata fields 'title', 'author', 'keywords', 'description', and 'language' are analyzed in detail, together with Dublin Core metadata. The study reveals problems with how metadata is created. Among the 75% of all Web pages that have interesting metadata, the field 'language' is the most trustworthy. All other metadata fields show a high degree of duplication, which degrades their usefulness. The strict answer to the title question is 'No'. There is nevertheless a lot of meaningful and useful information, but it must be interpreted and used with care. The study also provides statistics on the usage of metadata today and how it has changed over time.
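
As an illustration of the kind of measurement described above, here is a minimal Python sketch of extracting embedded metadata from a page and estimating how often a field's value is duplicated across pages. It is not the method used in the study. The names MetaExtractor, extract_metadata, and duplication_rate are hypothetical, and the definition of duplication (the share of pages whose value for a field also appears on at least one other page) is an assumption made for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Fields named in the abstract; Dublin Core is matched by the "dc." prefix.
FIELDS = {"author", "keywords", "description", "language"}


class MetaExtractor(HTMLParser):
    """Collects <title> text and selected <meta name=... content=...> values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if content and (name in FIELDS or name.startswith("dc.")):
                self.meta[name] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.meta["title"] = self.meta.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict of lower-cased field name -> value for one page."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


def duplication_rate(pages, field):
    """Share of pages (among those with the field) whose value is not unique.

    Assumed definition for this sketch, not taken from the study.
    """
    values = [meta[field] for meta in pages if field in meta]
    if not values:
        return 0.0
    counts = Counter(values)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(values)
```

Given a list of HTML documents fetched by a crawler, one could call extract_metadata on each and then duplication_rate(pages, "description") to see how much of that field is repeated. A real study would also need to normalize values and group pages by site before drawing conclusions.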

