The Content Name Collection

Dataset: unibas-icn-names-2014-08-unique

Intro

The unibas-icn-names-2014-08-unique dataset consists of 88 files. Each file contains 10'000'000 ICN content names, except the last one, which contains 501'646. Hence, the full dataset comprises 870'501'646 ICN content names.
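The total can be checked with a little shell arithmetic. The sketch below is only an illustration: the function names expected_total and count_names are hypothetical, and count_names assumes the xz command-line tool is installed and that the downloaded .xz files are passed as arguments.

```shell
# Expected total: 87 full files plus one shorter final file.
expected_total() {
  files=88
  per_file=10000000
  last_file=501646
  echo $(( (files - 1) * per_file + last_file ))
}

# Count the names (one per line) in one or more .xz files given as
# arguments, decompressing to a pipe instead of to disk. Requires xz.
count_names() {
  for f in "$@"; do
    xz -dc -- "$f"
  done | wc -l
}
```

Running expected_total prints 870501646. Running count_names on all 88 downloaded files should report the same number if the download is complete.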

Features of unibas-icn-names-2014-08-unique:

The ICN content names are hierarchical: the part of each URL after the protocol, up to and including the top-level domain, is inverted, so that the name starts with the top-level domain.

no duplicates

ordered

lowercase and uppercase letters

"www" removed

The UTF-8 encoded .txt files are compressed using LZMA2/xz.
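To make the inversion in the feature list concrete, here is a minimal sketch of how a URL might be turned into a hierarchical name. It is not the procedure used to build the dataset. The function name url_to_name, the "/" separator, and the leading "/" are assumptions made for illustration; check them against the actual files before relying on them.

```shell
# Sketch: turn a URL into a hierarchical name (assumed format, see above).
url_to_name() {
  printf '%s\n' "$1" | awk '{
    url = $0
    # Drop the protocol, e.g. "http://" (protocol is exclusive).
    sub(/^[A-Za-z][A-Za-z0-9+.-]*:\/\//, "", url)
    # Split into host and path at the first slash.
    slash = index(url, "/")
    if (slash > 0) {
      host = substr(url, 1, slash - 1)
      path = substr(url, slash)
    } else {
      host = url
      path = ""
    }
    # Remove a leading "www." label.
    sub(/^www\./, "", host)
    # Reverse the host labels so the top-level domain comes first.
    n = split(host, label, /\./)
    name = ""
    for (i = n; i >= 1; i--) name = name "/" label[i]
    # Letter case is left untouched.
    print name path
  }'
}
```

With these assumptions, url_to_name 'http://www.example.com/a/B.html' prints /com/example/a/B.html.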

We did not add explicit scheme prefixes from different vendors (ccnx:, ndn:, lci:, (url:), ...). If you need them, feel free to add them yourself. We suggest simple command-line tools such as cat and [g]awk, which are easy to apply to the .txt files (cut, sed, or grep can be used as alternatives).

Example (adding scheme prefix): cat unibas-icn-names-2014-08-teaser.txt | gawk '{print "ccnx:"$0}' > out.txt
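A variation on the prefixing command, for cases where the prefix should be a parameter or where the uncompressed .txt should not be written to disk. This is a sketch only: the function name add_prefix and the file names names.txt.xz, names-ccnx.txt.xz, names.txt, and out.txt are placeholders, not files from the dataset.

```shell
# Prepend a scheme prefix to every name read from standard input.
# $1 is the prefix, e.g. "ccnx:" or "ndn:".
add_prefix() {
  awk -v prefix="$1" '{ print prefix $0 }'
}

# Stream a compressed file through the filter and recompress the result
# (placeholder file names; requires xz):
#   xz -dc names.txt.xz | add_prefix "ccnx:" | xz > names-ccnx.txt.xz
#
# Equivalent sed command for a fixed prefix on an uncompressed file:
#   sed 's/^/ccnx:/' names.txt > out.txt
```

Because awk -v interprets backslash escapes in the assigned value, this form is best suited to plain prefixes such as the ones listed above.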