Obsolete

Bayesian analysis of UK place-names

or, where is Ambridge, anyway?

The idea

This is an idea inspired by these initially disconnected thoughts:

Drawing maps of the distribution of place-name types is an old game -
see e.g. Smith's English place-name elements or Watts' Cambridge dictionary of English place-names (pages li-lxii). I have even automated the process myself with my maps here. But this process is deterministic.

By contrast, we all have a feeling about different place-name types in the various regions. For example, Buckfastleigh could only be in Devon, Giggleswick in Yorkshire, and Flegg in Norfolk.
But is there any science in this? Could we capture feelings such as these
in a mathematical model? If we could, it would have to be a probabilistic
model.

In the last few years big advances in Bayesian text classification (both
in theory and in practice) have been stimulated by the requirements of spam
filtering. Can we exploit these advances?

Made-up place-names in novels and films always sound wrong, even though they are made up of the right elements. Why is this?

The method

So this is what I have done:

I built Bayesian models for each square in the National Grid, using all
village and town names in each square (29326 names in total).

For each non-existent name below, I computed the posterior probability for it to be in each square according to the model.

If the probability was above 1%, I coloured the square with an approximately spectral colour proportional to the probability. Dark blue indicates a very small probability, and bright red the maximum probability, with the levels in between being light blue, green, yellow, and orange, except that in the maps, the largest probability is scaled to 1 to determine the colour. See the key lower down on this page.

Thus, the reddest square would be the likeliest place for the non-existent name to be located, if it really existed.

In essence, I am doing text classification with 49 categories, rather than the three (spam, possible spam, legitimate email) as is done in spam filtering.

Some initial checks

Just to check that the whole idea is sensible, we'll do some real names first. Note that the model does not know the location of any place-name.
But if this idea is at all reasonable, it should get real English place-names in roughly the right places, and it should be able to locate Welsh and Gaelic names. Here are some checks of this type. (Note that there really is a Cambridge in Gloucestershire.)

Made-up names from radio, television, films, and comics

Here, the lower the peak probability, the less the name resembles a real name.

Names made up by Hardy

Names made up by the Brontës

Made-up names from other novels

Names made up by me

Technical details

This is the colour coding used in the maps:

For computing the Bayesian models, I used the dbacl software by Laird Breyer.

The next table gives the number of names per square used to build the Bayesian models, and the easting, northing, latitude and longitude of its bottom left-hand corner.

square

number

easting

northing

latitude

longitude

NA

53

0

900

57.810

-8.739

NB

355

100

900

57.889

-7.063

NC

768

200

900

57.945

-5.380

ND

430

300

900

57.978

-3.691

NF

267

0

800

56.918

-8.577

NG

799

100

800

56.994

-6.941

NH

995

200

800

57.048

-5.298

NJ

677

300

800

57.080

-3.650

NK

113

400

800

57.091

-2.000

NM

820

100

700

56.098

-6.825

NN

967

200

700

56.151

-5.220

NO

887

300

700

56.182

-3.611

NR

621

100

600

55.203

-6.716

NS

956

200

600

55.253

-5.147

NT

966

300

600

55.284

-3.575

NU

218

400

600

55.294

-2.000

NV

556

0

500

54.239

-8.142

NW

710

100

500

54.307

-6.613

NX

639

200

500

54.356

-5.078

NY

723

300

500

54.386

-3.540

NZ

717

400

500

54.395

-2.000

SB

194

100

400

53.411

-6.515

SC

7

200

400

53.458

-5.013

SD

716

300

400

53.487

-3.507

SE

1022

400

400

53.496

-2.000

SH

644

200

300

52.561

-4.951

SJ

1061

300

300

52.588

-3.476

SK

1201

400

300

52.597

-2.000

SM

145

100

200

51.618

-6.335

SN

655

200

200

51.663

-4.892

SO

988

300

200

51.689

-3.447

SP

1170

400

200

51.698

-2.000

SR

26

100

100

50.721

-6.251

SS

474

200

100

50.764

-4.836

ST

983

300

100

50.790

-3.419

SU

1046

400

100

50.799

-2.000

SW

336

100

0

49.824

-6.172

SX

547

200

0

49.866

-4.783

SY

249

300

0

49.891

-3.392

SZ

212

400

0

49.900

-2.000

TA

272

500

400

53.487

-0.493

TF

803

500

300

52.588

-0.524

TG

296

600

300

52.561

0.951

TL

1006

500

200

51.689

-0.553

TM

473

600

200

51.663

0.892

TQ

1236

500

100

50.790

-0.581

TR

306

600

100

50.764

0.836

TV

20

500

0

49.891

-0.608

This website uses no cookies. This page was last modified 2013-05-12 10:17
by .