Dataset Statistics

DBpedia 2014 Data Set Statistics

This page provides statistics about the DBpedia 2014 release. The release contains localized editions of DBpedia for 125 languages which have been extracted from the Wikipedia edition in the corresponding language. For 28 out of these languages, we report the overall number of things (instances) being described in the localized version of DBpedia as well as the number of facts (statements) that have been extracted from infoboxes describing these things. Afterwards, we report the number of instances of popular classes within these 28 DBpedia editions.

Dataset statistics for DBpedia 3.9 are published separately. Below we compare the numbers between the two releases.

1 Instances, Properties, and Statements per Language

The same thing, for instance a person or city, might be described by multiple pages within Wikipedia editions in different languages. Pages describing the same thing are often interlinked by cross-language links within Wikipedia.

When DBpedia extracts data from these pages, it produces two types of data sets. The localized data sets contain all things that are described in a specific language, and each thing is identified by a language-specific URI. In addition, we produce a canonicalized data set for each language. The canonicalized data sets contain only things for which a corresponding page exists in the English edition of Wikipedia. Within all canonicalized data sets, the same thing is identified by the same URI from the generic namespace http://dbpedia.org/resource/.
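The relationship between localized and canonicalized identifiers can be illustrated with a minimal sketch. It is not the DBpedia extraction code: the function names, the `english_titles` lookup (standing in for Wikipedia's cross-language links), and the `http://<lang>.dbpedia.org/resource/` pattern for language-specific URIs are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

GENERIC_NS = "http://dbpedia.org/resource/"  # generic namespace named in the text


def localized_uri(lang: str, title: str) -> str:
    """Build a language-specific URI (the URI pattern is an assumption)."""
    return f"http://{lang}.dbpedia.org/resource/{title.replace(' ', '_')}"


def canonical_uri(
    lang: str,
    title: str,
    english_titles: Dict[Tuple[str, str], str],
) -> Optional[str]:
    """Return the generic-namespace URI, or None if no English page is linked.

    `english_titles` is a hypothetical lookup from (language, page title)
    to the title of the corresponding English Wikipedia page.
    """
    english = english_titles.get((lang, title))
    if english is None:
        return None  # thing stays in the localized data set only
    return GENERIC_NS + english.replace(" ", "_")


# Hypothetical cross-language link: Italian page -> English page
links = {("it", "Monaco di Baviera"): "Munich"}

print(localized_uri("it", "Monaco di Baviera"))
print(canonical_uri("it", "Monaco di Baviera", links))
print(canonical_uri("it", "Pagina senza equivalente", links))
```

A thing without an English counterpart yields `None` here, which mirrors why the "Instances, CD, all" counts in the statistics are lower than the "Instances, LD, all" counts for every language except English.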

DBpedia uses two different extractors to extract data from Wikipedia infoboxes. The mapping-based extractor extracts data only for the infoboxes for which a language-specific extraction mapping to the DBpedia ontology exists in the DBpedia mapping wiki. Based on these mappings, it normalizes the different names that are used in various languages to refer to the same property. The second extractor is the raw infobox extractor which uses a generic heuristic to extract data from all infoboxes. The raw infobox extractor does not normalize property names but produces language-specific properties that directly reflect the property name in the Wikipedia infobox.
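The difference between the two extractors can be sketched as follows. This is a deliberately simplified illustration, not the actual extraction framework: the `MAPPINGS` table, the infobox keys, and both function names are hypothetical, and real mappings are defined per infobox template in the mapping wiki.

```python
from typing import Dict, Tuple

# Hypothetical excerpt of mappings: (language, infobox key) -> ontology property
MAPPINGS: Dict[Tuple[str, str], str] = {
    ("en", "birth_date"): "birthDate",
    ("de", "GEBURTSDATUM"): "birthDate",
}


def mapping_based_extract(lang: str, infobox: Dict[str, str]) -> Dict[str, str]:
    """Keep only mapped keys and rename them to the shared ontology property."""
    return {
        MAPPINGS[(lang, key)]: value
        for key, value in infobox.items()
        if (lang, key) in MAPPINGS
    }


def raw_extract(lang: str, infobox: Dict[str, str]) -> Dict[str, str]:
    """Keep every key; property names stay language-specific."""
    return {f"{lang}:{key}": value for key, value in infobox.items()}


german_infobox = {"GEBURTSDATUM": "1879-03-14", "KURZBESCHREIBUNG": "Physiker"}

print(mapping_based_extract("de", german_infobox))  # only the mapped key, normalized
print(raw_extract("de", german_infobox))            # all keys, unnormalized
```

In this sketch the mapping-based output contains fewer but normalized properties, while the raw output covers every infobox key, which matches the pattern in the statistics where raw property counts are far larger than mapping property counts.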

Below we report the overall number of things (instances), different ontology and raw-infobox properties, infobox statements and type statements for all 28 languages for which mappings exist in the DBpedia mapping wiki. The rows are sorted according to the number of instances for which mapping-based infobox data exists (Instances, CD, withMD column).

The column headings have the following meanings:

LD = Localized Data Sets.

CD = Canonicalized Data Sets.

all = Overall number of instances in the data set, calculated based on the labels and redirects dumps.

withMD = Number of instances for which mapping-based infobox data exists.

Raw Properties = Number of different properties that are generated by the raw infobox extractor.

Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor.

Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor.

Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor; includes type statements.

| Language | Instances, LD, all | Instances, CD, all | Instances, CD, withMD | Raw Properties, CD | Mapping Properties, CD | Raw Statements, CD | Mapping Statements, CD | Type Statements, CD |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| en | 4,584,616 | 4,584,616 | 4,232,626 | 55,986 | 1,122 | 68,091,260 | 56,549,445 | 28,563,803 |
| it | 1,128,909 | 745,345 | 540,474 | 10,591 | 249 | 13,840,025 | 7,413,922 | 3,929,338 |
| de | 1,692,634 | 857,196 | 479,731 | 11,695 | 420 | 9,677,586 | 6,059,745 | 3,468,237 |
| nl | 1,774,536 | 674,849 | 455,222 | 8,100 | 634 | 8,044,539 | 5,857,801 | 3,118,581 |
| es | 1,086,296 | 683,251 | 419,328 | 17,347 | 457 | 9,728,204 | 6,538,847 | 3,190,529 |
| fr | 1,504,453 | 942,505 | 415,390 | 15,111 | 595 | 11,521,313 | 6,234,623 | 3,396,756 |
| pl | 1,043,400 | 653,571 | 411,883 | 7,751 | 219 | 8,554,227 | 5,590,196 | 3,189,677 |
| pt | 812,610 | 552,362 | 321,211 | 14,637 | 522 | 7,069,586 | 4,801,340 | 2,185,948 |
| ru | 1,119,142 | 579,612 | 266,562 | 15,665 | 141 | 8,825,572 | 3,717,635 | 1,986,532 |
| ja | 913,488 | 397,907 | 134,380 | 15,981 | 342 | 4,403,612 | 2,028,745 | 1,002,180 |
| ca | 426,696 | 289,485 | 128,544 | 10,183 | 175 | 3,643,659 | 1,574,797 | 962,352 |
| eu | 178,822 | 139,023 | 90,948 | 2,947 | 118 | 2,010,728 | 916,523 | 577,224 |
| hu | 260,512 | 171,391 | 76,273 | 7,806 | 268 | 2,429,115 | 830,290 | 536,290 |
| ko | 276,881 | 178,872 | 58,937 | 8,503 | 377 | 1,409,638 | 878,745 | 458,870 |
| tr | 233,737 | 143,914 | 57,034 | 9,008 | 370 | 1,636,893 | 825,459 | 443,345 |
| cs | 296,094 | 193,674 | 48,356 | 6,368 | 291 | 2,272,303 | 649,900 | 377,149 |
| bg | 161,427 | 112,571 | 44,698 | 5,095 | 223 | 964,269 | 599,891 | 333,355 |
| ar | 266,386 | 170,430 | 44,298 | 11,008 | 254 | 1,185,465 | 479,823 | 316,167 |
| id | 354,326 | 142,616 | 43,980 | 11,514 | 329 | 1,599,822 | 653,002 | 347,255 |
| el | 96,301 | 67,390 | 36,255 | 4,437 | 445 | 389,068 | 382,708 | 252,492 |
| sl | 140,612 | 85,167 | 25,494 | 4,844 | 406 | 950,604 | 323,292 | 212,340 |
| hr | 135,272 | 92,952 | 12,003 | 3,674 | 139 | 827,890 | 200,690 | 106,691 |
| ga | 30,670 | 27,674 | 4,176 | 1,231 | 67 | 83,457 | 51,086 | 31,872 |
| bn | 29,631 | 26,136 | 2,160 | 6,609 | 83 | 271,070 | 30,350 | 19,015 |
| be (new) | 71,656 | 52,040 | 23,512 | 4,998 | 175 | 557,540 | 301,188 | 168,132 |
| cy (new) | 57,127 | 43,127 | 11,945 | 2,084 | 28 | 204,058 | 59,428 | 54,578 |
| sk (new) | 192,410 | 138,492 | 5,268 | 4,757 | 25 | 1,814,997 | 70,207 | 21,148 |
| sr (new) | 246,996 | 189,158 | 138,166 | 6,069 | 470 | 2,278,757 | 1,853,525 | 873,394 |

The following table integrates the dataset statistics for DBpedia 3.9 with the statistics presented above, allowing a comparison between the versions. The %-columns contain the percentage increase in the number of instances, properties, or statements in version 2014 relative to 3.9. There are four new languages in the 2014 release for which property mappings have become available: Belarusian (be), Serbian (sr), Welsh (cy), and Slovak (sk); their numbers can be found in the last four rows of the table. The decrease in the number of raw properties is due to the fact that triple de-duplication was introduced in 2014.

| Language | Instances, LD, all (3.9) | Instances, LD, all (2014) | Instances, LD, all (%) | Instances, CD, all (3.9) | Instances, CD, all (2014) | Instances, CD, all (%) | Instances, CD, withMD (3.9) | Instances, CD, withMD (2014) | Instances, CD, withMD (%) | Raw Properties, CD (3.9) | Raw Properties, CD (2014) | Raw Properties, CD (%) | Mapping Properties, CD (3.9) | Mapping Properties, CD (2014) | Mapping Properties, CD (%) | Raw Statements, CD (3.9) | Raw Statements, CD (2014) | Raw Statements, CD (%) | Mapping Statements, CD (3.9) | Mapping Statements, CD (2014) | Mapping Statements, CD (%) | Type Statements, CD (3.9) | Type Statements, CD (2014) | Type Statements, CD (%) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| en | 4,258,406 | 4,584,616 | 7.7 | 4,258,406 | 4,584,616 | 7.7 | 3,255,435 | 4,232,626 | 30 | 51,736 | 55,986 | 8.2 | 1,373 | 1,122 | -18.3 | 70,147,399 | 68,091,260 | -2.9 | 41,804,545 | 56,549,445 | 35.3 | 16,366,701 | 28,563,803 | 74.5 |
| it | 1,029,528 | 1,128,909 | 9.7 | 672,981 | 745,345 | 10.8 | 473,595 | 540,474 | 14.1 | 10,241 | 10,591 | 3.4 | 211 | 249 | 18.0 | 14,366,288 | 13,840,025 | -3.7 | 5,724,415 | 7,413,922 | 29.5 | 2,364,096 | 3,929,338 | 66.2 |
| de | 1,547,785 | 1,692,634 | 9.4 | 779,104 | 857,196 | 10 | 327,548 | 479,731 | 46.5 | 10,659 | 11,695 | 9.7 | 327 | 420 | 28.4 | 9,284,326 | 9,677,586 | 4.2 | 4,070,927 | 6,059,745 | 48.9 | 1,800,424 | 3,468,237 | 92.6 |
| nl | 1,461,314 | 1,774,536 | 21.4 | 590,014 | 674,849 | 14.4 | 368,688 | 455,222 | 23.5 | 7,481 | 8,100 | 8.3 | 642 | 634 | -1.2 | 7,916,452 | 8,044,539 | 1.6 | 5,039,583 | 5,857,801 | 16.2 | 2,144,581 | 3,118,581 | 45.4 |
| es | 1,003,158 | 1,086,296 | 8.3 | 621,472 | 683,251 | 9.9 | 376,975 | 419,328 | 11.2 | 15,992 | 17,347 | 8.5 | 549 | 457 | -16.8 | 9,147,643 | 9,728,204 | 6.3 | 5,950,626 | 6,538,847 | 9.9 | 2,305,659 | 3,190,529 | 38.4 |
| fr | 1,378,099 | 1,504,453 | 9.2 | 856,004 | 942,505 | 10.1 | 346,214 | 415,390 | 20 | 13,990 | 15,111 | 8 | 689 | 595 | -13.6 | 10,741,192 | 11,521,313 | 7.3 | 5,273,302 | 6,234,623 | 18.2 | 2,145,950 | 3,396,756 | 58.3 |
| pl | 960,880 | 1,043,400 | 8.6 | 598,754 | 653,571 | 9.2 | 334,214 | 411,883 | 23.2 | 7,478 | 7,751 | 3.7 | 264 | 219 | -17.0 | 8,113,838 | 8,554,227 | 5.4 | 4,624,126 | 5,590,196 | 20.9 | 2,031,952 | 3,189,677 | 57 |
| pt | 764,132 | 812,610 | 6.3 | 511,741 | 552,362 | 7.9 | 298,475 | 321,211 | 7.6 | 13,740 | 14,637 | 6.5 | 620 | 522 | -15.8 | 6,934,107 | 7,069,586 | 2 | 4,489,235 | 4,801,340 | 7.0 | 1,641,916 | 2,185,948 | 33.1 |
| ru | 999,165 | 1,119,142 | 12 | 516,870 | 579,612 | 12.1 | 236,067 | 266,562 | 12.9 | 14,771 | 15,665 | 6.1 | 149 | 141 | -5.4 | 8,390,368 | 8,825,572 | 5.2 | 3,174,725 | 3,717,635 | 17.1 | 1,315,619 | 1,986,532 | 51 |
| ja | 860,917 | 913,488 | 6.1 | 370,912 | 397,907 | 7.3 | 115,227 | 134,380 | 16.6 | 14,752 | 15,981 | 8.3 | 395 | 342 | -13.4 | 4,353,518 | 4,403,612 | 1.2 | 1,674,891 | 2,028,745 | 21.1 | 656,290 | 1,002,180 | 52.7 |
| ca | 400,271 | 426,696 | 6.6 | 267,856 | 289,485 | 8.1 | 119,675 | 128,544 | 7.4 | 9,391 | 10,183 | 8.4 | 184 | 175 | -4.9 | 4,057,610 | 3,643,659 | -10.2 | 1,420,025 | 1,574,797 | 10.9 | 757,526 | 962,352 | 27 |
| eu | 150,294 | 178,822 | 19 | 119,752 | 139,023 | 16.1 | 74,114 | 90,948 | 22.7 | 2,683 | 2,947 | 9.8 | 97 | 118 | 21.6 | 2,381,903 | 2,010,728 | -15.6 | 975,775 | 916,523 | -6.1 | 456,815 | 577,224 | 26.4 |
| hu | 239,711 | 260,512 | 8.7 | 157,034 | 171,391 | 9.1 | 68,939 | 76,273 | 10.6 | 7,283 | 7,806 | 7.2 | 298 | 268 | -10.1 | 2,859,593 | 2,429,115 | -15.1 | 669,836 | 830,290 | 24.0 | 358,586 | 536,290 | 49.6 |
| ko | 237,506 | 276,881 | 16.6 | 154,397 | 178,872 | 15.9 | 47,081 | 58,937 | 25.2 | 7,605 | 8,503 | 11.8 | 435 | 377 | -13.3 | 1,276,866 | 1,409,638 | 10.4 | 646,461 | 878,745 | 35.9 | 271,610 | 458,870 | 68.9 |
| tr | 213,820 | 233,737 | 9.3 | 127,281 | 143,914 | 13.1 | 47,673 | 57,034 | 19.6 | 8,172 | 9,008 | 10.2 | 438 | 370 | -15.5 | 1,701,192 | 1,636,893 | -3.8 | 648,288 | 825,459 | 27.3 | 270,546 | 443,345 | 63.9 |
| cs | 263,317 | 296,094 | 12.4 | 172,763 | 193,674 | 12.1 | 40,549 | 48,356 | 19.3 | 5,873 | 6,368 | 8.4 | 340 | 291 | -14.4 | 2,192,854 | 2,272,303 | 3.6 | 556,742 | 649,900 | 16.7 | 244,058 | 377,149 | 54.5 |
| bg | 146,608 | 161,427 | 10.1 | 101,310 | 112,571 | 11.1 | 43,961 | 44,698 | 1.7 | 4,728 | 5,095 | 7.8 | 268 | 223 | -16.8 | 950,554 | 964,269 | 1.4 | 564,830 | 599,891 | 6.2 | 225,843 | 333,355 | 47.6 |
| ar | 215,042 | 266,386 | 23.9 | 129,600 | 170,430 | 31.5 | 25,325 | 44,298 | 74.9 | 9,492 | 11,008 | 16 | 286 | 254 | -11.2 | 883,730 | 1,185,465 | 34.1 | 256,761 | 479,823 | 86.9 | 143,042 | 316,167 | 121 |
| id | 208,891 | 354,326 | 69.6 | 113,047 | 142,616 | 26.2 | 33,385 | 43,980 | 31.7 | 10,264 | 11,514 | 12.2 | 372 | 329 | -11.6 | 1,417,031 | 1,599,822 | 12.9 | 449,244 | 653,002 | 45.4 | 199,564 | 347,255 | 74 |
| el | 84,359 | 96,301 | 14.2 | 57,249 | 67,390 | 17.7 | 27,856 | 36,255 | 30.2 | 3,695 | 4,437 | 20.1 | 461 | 445 | -3.5 | 287,562 | 389,068 | 35.3 | 275,669 | 382,708 | 38.8 | 159,570 | 252,492 | 58.2 |
| sl | 136,684 | 140,612 | 2.9 | 80,102 | 85,167 | 6.3 | 23,584 | 25,494 | 8.1 | 4,473 | 4,844 | 8.3 | 474 | 406 | -14.3 | 1,335,247 | 950,604 | -28.8 | 265,908 | 323,292 | 21.6 | 151,203 | 212,340 | 40.4 |
| hr | 127,930 | 135,272 | 5.7 | 82,016 | 92,952 | 13.3 | 11,452 | 12,003 | 4.8 | 3,501 | 3,674 | 4.9 | 158 | 139 | -12.0 | 779,862 | 827,890 | 6.2 | 168,804 | 200,690 | 18.9 | 74,455 | 106,691 | 43.3 |
| ga | 19,450 | 30,670 | 57.7 | 17,350 | 27,674 | 59.5 | 3,791 | 4,176 | 10.2 | 1,128 | 1,231 | 9.1 | 72 | 67 | -6.9 | 76,746 | 83,457 | 8.7 | 41,331 | 51,086 | 23.6 | 21,847 | 31,872 | 45.9 |
| bn | 25,811 | 29,631 | 14.8 | 20,753 | 26,136 | 25.9 | 1,275 | 2,160 | 69.4 | 5,467 | 6,609 | 20.9 | 86 | 83 | -3.5 | 176,630 | 271,070 | 53.5 | 13,852 | 30,350 | 119.1 | 6,856 | 19,015 | 177.3 |
| be | | 71,656 | | | 52,040 | | | 23,512 | | | 4,998 | | | 175 | | | 557,540 | | | 301,188 | | | 168,132 | |
| cy | | 57,127 | | | 43,127 | | | 11,945 | | | 2,084 | | | 28 | | | 204,058 | | | 59,428 | | | 54,578 | |
| sk | | 192,410 | | | 138,492 | | | 5,268 | | | 4,757 | | | 25 | | | 1,814,997 | | | 70,207 | | | 21,148 | |
| sr | | 246,996 | | | 189,158 | | | 138,166 | | | 6,069 | | | 470 | | | 2,278,757 | | | 1,853,525 | | | 873,394 | |
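The %-columns in the comparison table above appear to be relative changes from 3.9 to 2014, rounded to one decimal place. A minimal sketch of that calculation follows, using two English values taken from the table; the function name `percent_change` is an illustrative choice, not part of any DBpedia tooling.

```python
def percent_change(old: int, new: int) -> float:
    """Relative change from `old` to `new` in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


# English, Instances, LD, all: 4,258,406 (3.9) -> 4,584,616 (2014)
print(percent_change(4_258_406, 4_584_616))  # 7.7

# English, Mapping Properties, CD: 1,373 (3.9) -> 1,122 (2014)
print(percent_change(1_373, 1_122))  # -18.3
```

Negative results indicate a decrease relative to 3.9. For the four new languages there is no 3.9 value to divide by, so their %-cells stay empty.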

2 Instances of Selected Classes per Language

The table below reports the number of instances for a set of selected classes within the canonicalized DBpedia data sets for each language.