The Content Name Collection

Dataset: unibas-url-names-2014-08-unique

Intro

The unibas-url-names-2014-08-unique dataset consists of 88 files. Each file carries 10'000'000 URL content names, except the last one, which carries 896'633. Hence, the full dataset comprises 870'896'633 URL content names.
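The total can be reproduced with plain shell arithmetic. This is only a small sketch of the calculation; the variable names are illustrative and not part of the dataset.

```shell
# 88 files in total: 87 full files plus one shorter final file.
full_files=87
names_per_full_file=10000000
names_in_last_file=896633

# Total number of URL content names across all files.
total=$(( full_files * names_per_full_file + names_in_last_file ))
echo "$total"   # prints 870896633
```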

Examples

Features of unibas-url-names-2014-08-unique:

no duplicates

ordered

lowercase and uppercase letters

The .txt files are UTF-8 encoded and compressed using LZMA2/xz.
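A file can be streamed without unpacking it to disk, and the "no duplicates" and "ordered" features can be spot-checked along the way. The following is a minimal sketch: the helper name check_names and the file name in the usage comment are hypothetical, and it assumes that "ordered" means byte-wise (C locale) order, which this description does not specify. If the files were sorted under a different collation, the check will report a disorder even though the file is intact.

```shell
# check_names: read names from standard input and succeed only if they are
# in strictly ascending byte-wise order, i.e. sorted and free of duplicates.
# Assumption: the dataset is ordered by byte value (C locale).
check_names() {
  # -c checks the order without producing sorted output;
  # combined with -u it also fails on two equal lines.
  LC_ALL=C sort -c -u
}

# Usage (hypothetical file name): decompress to stdout and check the stream.
#   xz -dc some-file.txt.xz | check_names && echo "sorted, no duplicates"
```

Because sort -c only compares adjacent lines, the check runs in a single pass and does not need to hold a whole 10'000'000-line file in memory.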

Feel free to modify the names yourself. We suggest simple command-line tools such as cat and [g]awk, which are easy to apply to the .txt files (cut, sed or grep can be used as alternatives).

Example (removing protocol prefix):

cat unibas-url-names-2014-08-teaser.txt | gawk -F'://' '{print $2}' > out.txt
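Since sed is named as an alternative, the same kind of modification can be sketched with it. The helper name strip_scheme is hypothetical, and the sketch assumes that each name starts with a URL scheme such as http:// (names without one are passed through unchanged).

```shell
# strip_scheme: remove a leading "scheme://" from every line on standard input.
# Assumption: the scheme is a letter followed by letters, digits, '+', '.' or '-'.
strip_scheme() {
  sed -E 's|^[A-Za-z][A-Za-z0-9+.-]*://||'
}

# Usage with the teaser file:
#   strip_scheme < unibas-url-names-2014-08-teaser.txt > out.txt
```

Anchoring the pattern at the start of the line leaves any later "://" in a path or query string untouched.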