Towards Knowledge Graph Profiling

Knowledge Graphs, such as DBpedia, YAGO, or Wikidata, are valuable resources for building intelligent applications like data analytics tools or recommender systems. Understanding what those Knowledge Graphs contain is a crucial prerequisite for selecting a Knowledge Graph for the task at hand. Hence, Knowledge Graph profiling (i.e., quantifying the structure and contents of Knowledge Graphs, as well as their differences) is essential for fully utilizing the power of Knowledge Graphs. In this paper, I discuss methods for Knowledge Graph profiling, highlight crucial differences between the big, well-known Knowledge Graphs DBpedia, YAGO, and Wikidata, and take a look at current developments of new, complementary Knowledge Graphs such as DBkWik and WebIsALOD.

6.
10/22/17 Heiko Paulheim 6
Motivation
• In the coming days, you’ll see quite a few works
– that use DBpedia for doing X
– that use Wikidata for doing Y
– ...
• If you see them, do you ever ask yourselves:
– Why DBpedia and not Wikidata (or the other way round)?

7.
Motivation
• Questions:
– which knowledge graph should I use for which purpose?
– are there significant differences?
– would it help to combine them?
• For answering those questions, we need knowledge graph profiling
– making quantitative statements about knowledge graphs
– defining measures
– defining setups in which to measure them

9.
Knowledge Graph Creation: Cyc
• The beginning
– Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present
– >900 person years
– Far from completion
– Used to exist until 2017

25.
Caveats
• Reading the diagrams right…
• So, Wikidata contains more persons
– but fewer instances of all the interesting subclasses?
• There are classes like Actor in Wikidata
– but they are hardly used
– rather: modeled using the profession relation
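
The caveat about class usage can be illustrated with a minimal Python sketch over toy triples. The entity names, the "type" and "profession" predicates, and the "Actor" class are made-up placeholders, not actual Wikidata identifiers; the point is only that counting by class membership alone can miss instances that are modeled through a relation.

```python
# Toy triples (subject, predicate, object); all identifiers are illustrative.
TRIPLES = [
    ("alice", "type", "Person"),
    ("alice", "profession", "Actor"),
    ("bob", "type", "Person"),
    ("bob", "profession", "Actor"),
    ("carol", "type", "Person"),
    ("carol", "type", "Actor"),
]


def instances_by_type(triples, cls):
    """Entities that are explicitly typed with the given class."""
    return {s for s, p, o in triples if p == "type" and o == cls}


def instances_by_relation(triples, predicate, value):
    """Entities that point to the given value via the given relation."""
    return {s for s, p, o in triples if p == predicate and o == value}


typed = instances_by_type(TRIPLES, "Actor")
by_profession = instances_by_relation(TRIPLES, "profession", "Actor")

# Counting only typed instances under-reports actors in this toy graph.
print("typed as Actor:", len(typed))
print("typed or profession:", len(typed | by_profession))
```

A profile that reports only the first count would suggest the graph knows few actors, although the information is present in a different modeling pattern.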

26.
Caveats
• Reading the diagrams right… (ctd.)
• So, Wikidata contains more data on countries, but fewer countries?
• First: Wikidata only counts current, actual countries
– DBpedia and YAGO also count historical countries
• “KG1 contains fewer X than KG2” can mean
– it actually contains fewer instances of X
– it contains equally many or more instances, but they are not typed with X (see later)
• Second: we count single facts about countries
– Wikidata records some time-indexed information, e.g., population
– Each point in time contributes a fact
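
The difference between counting entities and counting facts can be sketched as follows. This is a toy illustration with invented country names and population values, not data from any of the knowledge graphs discussed; it assumes facts are stored as (subject, predicate, value, year) tuples.

```python
# Toy facts as (subject, predicate, value, year); all values are invented.
KG_TIME_INDEXED = [
    ("country_a", "population", 100, 2000),
    ("country_a", "population", 110, 2010),
    ("country_a", "population", 120, 2017),
]

KG_PLAIN = [
    ("country_a", "population", 120, None),
    ("historical_country", "population", 50, None),
]


def count_entities(facts):
    """Number of distinct subjects described in the fact list."""
    return len({subject for subject, _pred, _value, _year in facts})


def count_facts(facts):
    """Number of single facts; each point in time contributes one fact."""
    return len(facts)


for name, kg in [("time-indexed", KG_TIME_INDEXED), ("plain", KG_PLAIN)]:
    print(name, "entities:", count_entities(kg), "facts:", count_facts(kg))
```

In this toy setup, the time-indexed graph has more facts but fewer entities, which mirrors the pattern described for countries above.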

27.
Overlap of Knowledge Graphs
• How much do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
[Figure: diagram relating DBpedia, YAGO, Wikidata, NELL, and OpenCyc]
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
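
Counting interlinks can be sketched in a few lines of Python. The entity and link identifiers are invented, and the choice to count linked entities rather than raw links (so that an entity with several links is counted once) is an assumption of this sketch, not necessarily the measure used in the cited paper.

```python
def linked_fraction(entities, links):
    """Fraction of entities in one graph that have at least one link into another graph.

    entities: set of entity identifiers in the source graph
    links: iterable of (source_entity, target_entity) pairs
    """
    if not entities:
        return 0.0
    linked = {source for source, _target in links if source in entities}
    return len(linked) / len(entities)


# Toy data: four entities, two of which are linked (a2 has two links).
KG_A = {"a1", "a2", "a3", "a4"}
LINKS_A_TO_B = [("a1", "b1"), ("a2", "b7"), ("a2", "b8")]

print(linked_fraction(KG_A, LINKS_A_TO_B))
```

Because explicit links are incomplete, a number obtained this way is only a lower bound on the actual overlap.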

28.
Overlap of Knowledge Graphs
• How much do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017

29.
Overlap of Knowledge Graphs
• Links between Knowledge Graphs are incomplete
– The Open World Assumption also holds for interlinks
• But we can estimate their number
• Approach:
– find link set automatically with different heuristics
– determine precision and recall on existing interlinks
– estimate actual number of links
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017

30.
Overlap of Knowledge Graphs
• Idea:
– Given that the link set F is found
– And the (unknown) actual link set would be C
• Precision P: Fraction of F which is actually correct
– i.e., measures how much |F| is over-estimating |C|
• Recall R: Fraction of C which is contained in F
– i.e., measures how much |F| is under-estimating |C|
• From that, we estimate $|C| = |F| \cdot P \cdot \frac{1}{R}$
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
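
The estimate can be sketched in Python as follows. The function names and the toy link sets are hypothetical. In particular, judging precision only on source entities that already have an existing interlink is an assumption of this sketch about how "precision and recall on existing interlinks" might be determined; the cited paper may proceed differently.

```python
def precision_recall(found, reference):
    """Precision and recall of a found link set against existing interlinks.

    Assumption: precision is judged only on found links whose source entity
    is covered by the reference set, since other links cannot be verified.
    """
    covered_sources = {source for source, _target in reference}
    judged = {link for link in found if link[0] in covered_sources}
    correct = judged & reference
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


def estimate_true_links(num_found, precision, recall):
    """Estimate |C| = |F| * P * (1 / R)."""
    if recall <= 0:
        raise ValueError("recall must be positive to estimate the link count")
    return num_found * precision / recall


# Toy data: existing interlinks and an automatically found link set.
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
found = {
    ("a1", "b1"), ("a2", "b2"), ("a3", "b9"),
    ("a5", "b5"), ("a6", "b6"), ("a7", "b7"),
}

p, r = precision_recall(found, reference)
print("precision:", p, "recall:", r)
print("estimated number of links:", estimate_true_links(len(found), p, r))
```

Multiplying by P corrects for wrong links in F, and dividing by R corrects for correct links that the heuristic did not find.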

38.
Common Shortcomings of Knowledge Graphs
• What reasons can cause incomplete results?
• Two possible problems:
– The resource at hand is not of type dbo:Writer
– The genre relation to dbr:Science_Fiction is missing
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?x WHERE {
  ?x a dbo:Writer .
  ?x dbo:genre dbr:Science_Fiction .
}
ORDER BY ?x
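
The two failure modes can be made concrete with a small Python sketch that evaluates the same conjunctive pattern over toy triples. The subjects writer_1 to writer_3 are invented placeholders, and the matching below is a simplified stand-in for a SPARQL engine, not a DBpedia client.

```python
# Toy triples; the subjects are invented placeholders.
TRIPLES = {
    ("writer_1", "rdf:type", "dbo:Writer"),
    ("writer_1", "dbo:genre", "dbr:Science_Fiction"),
    # Typed as a writer, but the genre fact is missing.
    ("writer_2", "rdf:type", "dbo:Writer"),
    # Genre fact is present, but the type statement is missing.
    ("writer_3", "dbo:genre", "dbr:Science_Fiction"),
}


def subjects(triples, predicate, obj):
    """All subjects that have the given predicate/object pair."""
    return {s for s, p, o in triples if p == predicate and o == obj}


writers = subjects(TRIPLES, "rdf:type", "dbo:Writer")
scifi = subjects(TRIPLES, "dbo:genre", "dbr:Science_Fiction")

# The query is a conjunction, so a result needs both triples.
answers = sorted(writers & scifi)
# Candidates that fail only one of the two patterns.
only_typed = sorted(writers - scifi)
only_genre = sorted(scifi - writers)

print("query results:", answers)
print("typed, but no genre fact:", only_typed)
print("genre fact, but not typed:", only_genre)
```

Entities in the last two lists are not necessarily errors (a writer may simply work in another genre), but they are the places where a missing type or a missing relation would silently remove a correct answer.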

43.
Work in Progress: DBkWik
• Why stop at Wikipedia?
• Wikipedia is based on the MediaWiki software
– ...and so are thousands of Wikis
– Fandom by Wikia: >385,000 Wikis on special topics
– WikiApiary: reports >20,000 installations of MediaWiki on the Web

44.
Work in Progress: DBkWik
• Back to our original example...

45.
Work in Progress: DBkWik
• Back to our original example...

57.
Work in Progress: WebIsALOD
• Current challenges and works in progress
– Distinguishing instances and classes
• i.e., subclass-of vs. instance-of relations
– Splitting instances
• Bauhaus is a goth band
• Bauhaus is a German school
– Knowledge extraction from pre- and post-modifiers
• Bauhaus is a goth band → genre(Bauhaus, Goth)
• Bauhaus is a German school → location(Bauhaus, Germany)
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
Paper presentation: Resource track, Tuesday, 2:30 pm
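
The idea of extracting knowledge from pre-modifiers, as in the Bauhaus examples above, might look roughly like the following Python sketch. It is purely illustrative: the hand-written rule table, the function name, and the restriction to sentences of the form "X is a <modifier> <head>" are assumptions of this sketch and do not describe the WebIsALOD implementation.

```python
# Hypothetical rule table mapping a pre-modifier to a (relation, value) pair.
MODIFIER_RULES = {
    "goth": ("genre", "Goth"),
    "german": ("location", "Germany"),
}


def extract_from_hypernym(sentence):
    """Parse 'X is a <modifier> <head>' and derive a fact from the modifier.

    Returns None if the sentence does not match the pattern, has no
    pre-modifier, or the modifier is not covered by the rule table.
    """
    subject, separator, rest = sentence.partition(" is a ")
    if not separator:
        return None
    words = rest.strip().rstrip(".").split()
    if len(words) < 2:
        return None  # only a head noun, no pre-modifier
    modifier, head = words[0], words[-1]
    rule = MODIFIER_RULES.get(modifier.lower())
    if rule is None:
        return None
    relation, value = rule
    return {
        "subject": subject,
        "hypernym": head,
        "fact": f"{relation}({subject}, {value})",
    }


for text in ["Bauhaus is a goth band", "Bauhaus is a German school"]:
    print(extract_from_hypernym(text))
```

Note that both sentences share the subject string "Bauhaus", which is why splitting instances is listed as a separate challenge: the two derived facts belong to different real-world entities.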

58.
Takeaways
• Knowledge Graphs contain a massive amount of information
– Various trade-offs in their creation
– That lead to different profiles
– ...and different shortcomings
• Knowledge Graph Profiling
– What is in a knowledge graph?
– At which level of detail is it described?
– How different are knowledge graphs?
• Various methods exist for
– ...addressing the various shortcomings
• New kids on the block
– DBkWik and WebIsALOD
– Focus on long tail entities