The Content Name Collection

Dataset: unibas-url-names-2014-08

Intro

The unibas-url-names-2014-08 dataset consists of 215 files. Each of them contains 10'000'000 URL content names, except the last one, which contains 4'314'011. Hence, the full dataset comprises 2'144'314'011 URL content names.
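The total follows from 214 full files plus the shorter last file. A minimal sketch of that arithmetic in shell (the variable names are only illustrative):

```shell
# Figures taken from the dataset description above.
full_files=214            # 215 files minus the shorter last one
names_per_file=10000000   # names in each full file
last_file=4314011         # names in the last file

# Integer arithmetic; needs a shell with 64-bit arithmetic (e.g. bash).
total=$((full_files * names_per_file + last_file))
echo "$total"
```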

Examples

Features of unibas-url-names-2014-08:

Duplicates are possible and very likely, because we cut off appended parameters, which are often the only difference between URLs. A nice side effect: how often a name recurs approximates the popularity of the URL. If you want every URL to appear just once, check out our unibas-url-names-2014-08-unique dataset.

unordered

lowercase and uppercase letters

The UTF-8 encoded .txt files are compressed using LZMA2/xz.
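Since duplicates reflect popularity and the files ship xz-compressed, one way to look at the most frequent names is to stream a file through sort and uniq without unpacking it to disk. The following is only a sketch: the function name top_names and the .xz file name in the usage comment are assumptions, not part of the dataset.

```shell
# top_names [N]: read names on stdin and print the N most frequent ones,
# each preceded by its count (default N = 10).
top_names() (
    export LC_ALL=C   # byte-wise comparison: locale-independent and faster
    sort | uniq -c | sort -rn | head -n "${1:-10}"
)

# Usage on one compressed file (hypothetical file name):
#   xz --decompress --stdout unibas-url-names-2014-08-000.txt.xz | top_names 20
```

Counts obtained this way cover only the files that are piped in; ranking the whole dataset would mean concatenating all decompressed streams before the sort step.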

Feel free to modify the names yourself. We suggest simple command-line tools such as cat and [g]awk, which are easy to apply to the .txt files (cut, sed or grep can be used as alternatives).

Example (removing protocol prefix): cat unibas-url-names-2014-08-teaser.txt | gawk -F'.' '{print $2}' > out.txt
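The gawk example splits each line on "." and prints the second field, so its result depends on how the names are laid out. As an alternative sketch with sed, assuming the names begin with a scheme such as "http://" (the function name strip_scheme is hypothetical):

```shell
# strip_scheme: read names on stdin and drop a leading "scheme://" if present.
# Assumption: names look like "http://host/path"; lines without a scheme
# are passed through unchanged.
strip_scheme() {
    sed -E 's#^[A-Za-z][A-Za-z0-9+.-]*://##'
}

# Usage (teaser file name taken from the example above):
#   strip_scheme < unibas-url-names-2014-08-teaser.txt > out.txt
```

Reading the file via redirection also avoids the extra cat process.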