Category Archives: Creative Commons

pHash performs its mathematical operations on every pixel of the image at its original size. Therefore, when an image is resized, the resulting hash differs slightly depending on the image size. My assumption is that if every image larger than a certain size is resized down to that size before hashing, the overall matching quality will improve.

I tested the same set of image samples as in the previous post; however, because of the processing time, the comparison was performed on 3644 images.

To find which size works well for normalization, I resized the images to widths of 2000, 1500, and 1000 pixels. I then measured the Hamming distance between the hash of each image and the hashes of copies resized to between 90% and 10%.
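As a minimal sketch of the comparison step, the Hamming distance between two 64-bit hashes is the number of bits that differ. The hash values and the `is_match` helper below are made up for illustration; the threshold of 4 mirrors the first set of tables, which count images whose distance is bigger than 4.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two 64-bit hash values."""
    return bin((hash_a ^ hash_b) & 0xFFFFFFFFFFFFFFFF).count("1")


def is_match(hash_a: int, hash_b: int, threshold: int = 4) -> bool:
    """Treat two hashes as the same image when the distance is within threshold."""
    return hamming_distance(hash_a, hash_b) <= threshold


original = 0xF0F0F0F0F0F0F0F0   # made-up hash value for illustration
resized = original ^ 0b10110    # the same hash with three bits flipped

print(hamming_distance(original, resized))  # 3
print(is_match(original, resized))          # True
```

Raising the threshold to 6, as in the second set of tables, only requires passing `threshold=6`.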

Images with Hamming distance greater than 4

normalization size 2000

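| Scale | 5000 < | 4000 < | 3000 < | 2000 < | 1000 < | < 1000 | 5000 < (%) | 4000 < (%) | 3000 < (%) | 2000 < (%) | 1000 < (%) | < 1000 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 90% | 0 | 0 | 0 | 0 | 6 | 17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.23 |
| 80% | 0 | 0 | 0 | 1 | 12 | 19 | 0.00 | 0.00 | 0.00 | 0.01 | 0.16 | 0.25 |
| 70% | 0 | 0 | 0 | 1 | 18 | 36 | 0.00 | 0.00 | 0.00 | 0.01 | 0.24 | 0.48 |
| 60% | 0 | 0 | 0 | 12 | 48 | 87 | 0.00 | 0.00 | 0.00 | 0.16 | 0.64 | 1.16 |
| 50% | 0 | 0 | 3 | 26 | 77 | 141 | 0.00 | 0.00 | 0.04 | 0.35 | 1.03 | 1.89 |
| 40% | 0 | 0 | 9 | 62 | 172 | 272 | 0.00 | 0.00 | 0.12 | 0.83 | 2.30 | 3.64 |
| 30% | 1 | 12 | 54 | 156 | 333 | 475 | 0.01 | 0.16 | 0.72 | 2.09 | 4.45 | 6.35 |
| 20% | 27 | 99 | 246 | 424 | 693 | 851 | 0.36 | 1.32 | 3.29 | 5.67 | 9.27 | 11.38 |
| 10% | 163 | 360 | 753 | 1093 | 1442 | 1636 | 2.18 | 4.82 | 10.07 | 14.62 | 19.29 | 21.89 |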

normalization size 1500

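| Scale | 5000 < | 4000 < | 3000 < | 2000 < | 1000 < | < 1000 | 5000 < (%) | 4000 < (%) | 3000 < (%) | 2000 < (%) | 1000 < (%) | < 1000 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 90% | 0 | 0 | 0 | 0 | 2 | 13 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.17 |
| 80% | 0 | 0 | 0 | 0 | 7 | 14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.19 |
| 70% | 0 | 0 | 0 | 1 | 15 | 33 | 0.00 | 0.00 | 0.00 | 0.01 | 0.20 | 0.44 |
| 60% | 0 | 0 | 0 | 2 | 25 | 64 | 0.00 | 0.00 | 0.00 | 0.03 | 0.33 | 0.86 |
| 50% | 0 | 0 | 0 | 7 | 46 | 110 | 0.00 | 0.00 | 0.00 | 0.09 | 0.62 | 1.47 |
| 40% | 0 | 0 | 4 | 25 | 123 | 223 | 0.00 | 0.00 | 0.05 | 0.33 | 1.65 | 2.98 |
| 30% | 0 | 0 | 18 | 86 | 247 | 389 | 0.00 | 0.00 | 0.24 | 1.15 | 3.30 | 5.20 |
| 20% | 6 | 27 | 116 | 257 | 520 | 678 | 0.08 | 0.36 | 1.55 | 3.44 | 6.96 | 9.07 |
| 10% | 137 | 308 | 654 | 969 | 1313 | 1507 | 1.83 | 4.12 | 8.75 | 12.96 | 17.57 | 20.16 |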

normalization size 1000

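| Scale | 5000 < | 4000 < | 3000 < | 2000 < | 1000 < | < 1000 | 5000 < (%) | 4000 < (%) | 3000 < (%) | 2000 < (%) | 1000 < (%) | < 1000 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 90% | 0 | 0 | 0 | 0 | 0 | 11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.15 |
| 80% | 0 | 0 | 0 | 0 | 0 | 7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 |
| 70% | 0 | 0 | 0 | 0 | 5 | 23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.31 |
| 60% | 0 | 0 | 0 | 0 | 6 | 45 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.60 |
| 50% | 0 | 0 | 0 | 0 | 26 | 90 | 0.00 | 0.00 | 0.00 | 0.00 | 0.35 | 1.20 |
| 40% | 0 | 0 | 0 | 3 | 56 | 156 | 0.00 | 0.00 | 0.00 | 0.04 | 0.75 | 2.09 |
| 30% | 0 | 0 | 2 | 17 | 132 | 274 | 0.00 | 0.00 | 0.03 | 0.23 | 1.77 | 3.67 |
| 20% | 0 | 4 | 39 | 122 | 354 | 512 | 0.00 | 0.05 | 0.52 | 1.63 | 4.74 | 6.85 |
| 10% | 61 | 161 | 406 | 679 | 999 | 1193 | 0.82 | 2.15 | 5.43 | 9.08 | 13.36 | 15.96 |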

Images with Hamming distance greater than 6

normalization size 2000

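| Scale | 5000 < | 4000 < | 3000 < | 2000 < | 1000 < | < 1000 | 5000 < (%) | 4000 < (%) | 3000 < (%) | 2000 < (%) | 1000 < (%) | < 1000 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 90% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 80% | 0 | 0 | 0 | 1 | 2 | 3 | 0.00 | 0.00 | 0.00 | 0.01 | 0.03 | 0.04 |
| 70% | 0 | 0 | 0 | 0 | 4 | 11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.15 |
| 60% | 0 | 0 | 0 | 0 | 8 | 21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.11 | 0.28 |
| 50% | 0 | 0 | 0 | 6 | 20 | 46 | 0.00 | 0.00 | 0.00 | 0.08 | 0.27 | 0.62 |
| 40% | 0 | 0 | 4 | 21 | 46 | 94 | 0.00 | 0.00 | 0.05 | 0.28 | 0.62 | 1.26 |
| 30% | 0 | 0 | 11 | 45 | 106 | 175 | 0.00 | 0.00 | 0.15 | 0.60 | 1.42 | 2.34 |
| 20% | 4 | 14 | 63 | 142 | 286 | 381 | 0.05 | 0.19 | 0.84 | 1.90 | 3.83 | 5.10 |
| 10% | 59 | 153 | 347 | 539 | 752 | 869 | 0.79 | 2.05 | 4.64 | 7.21 | 10.06 | 11.63 |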

normalization size 1500

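| Scale | 5000 < | 4000 < | 3000 < | 2000 < | 1000 < | < 1000 | 5000 < (%) | 4000 < (%) | 3000 < (%) | 2000 < (%) | 1000 < (%) | < 1000 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 90% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 80% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 70% | 0 | 0 | 0 | 0 | 2 | 9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.12 |
| 60% | 0 | 0 | 0 | 0 | 6 | 19 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.25 |
| 50% | 0 | 0 | 0 | 1 | 10 | 36 | 0.00 | 0.00 | 0.00 | 0.01 | 0.13 | 0.48 |
| 40% | 0 | 0 | 0 | 8 | 28 | 76 | 0.00 | 0.00 | 0.00 | 0.11 | 0.37 | 1.02 |
| 30% | 0 | 0 | 3 | 26 | 81 | 150 | 0.00 | 0.00 | 0.04 | 0.35 | 1.08 | 2.01 |
| 20% | 1 | 4 | 30 | 88 | 221 | 316 | 0.01 | 0.05 | 0.40 | 1.18 | 2.96 | 4.23 |
| 10% | 39 | 99 | 257 | 433 | 639 | 756 | 0.52 | 1.32 | 3.44 | 5.79 | 8.55 | 10.11 |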

normalization size 1000

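| Scale | 5000 < | 4000 < | 3000 < | 2000 < | 1000 < | < 1000 | 5000 < (%) | 4000 < (%) | 3000 < (%) | 2000 < (%) | 1000 < (%) | < 1000 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 90% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 80% | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| 70% | 0 | 0 | 0 | 0 | 1 | 8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.11 |
| 60% | 0 | 0 | 0 | 0 | 2 | 15 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.20 |
| 50% | 0 | 0 | 0 | 0 | 9 | 35 | 0.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.47 |
| 40% | 0 | 0 | 0 | 0 | 14 | 62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.19 | 0.83 |
| 30% | 0 | 0 | 0 | 4 | 35 | 104 | 0.00 | 0.00 | 0.00 | 0.05 | 0.47 | 1.39 |
| 20% | 0 | 0 | 11 | 39 | 138 | 233 | 0.00 | 0.00 | 0.15 | 0.52 | 1.85 | 3.12 |
| 10% | 14 | 38 | 135 | 270 | 449 | 566 | 0.19 | 0.51 | 1.81 | 3.61 | 6.01 | 7.57 |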

Conclusion

According to the test results, resizing before hashing gives a better matching percentage, so it can be a way to improve matching. However, the false positive matching percentage is also important and should be considered.

DCT Hash in pHash was selected as the image similarity search algorithm for the Creative Commons image license search. Recently, we found that some images are not matched when they are resized, so I tested this with Flickr CC images.

Conclusion

The results show that when an image is resized, some images cannot be detected. A possible solution is to resize an image down to a certain size before hashing whenever it is bigger than that size. I tested normalization widths of 2000, 1500, and 1000 pixels.
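A minimal sketch of that normalization rule, assuming the aspect ratio is preserved and only the width is compared against the limit. The function name `normalized_size` is hypothetical; the actual resizing would be done by whatever image library computes the hash.

```python
def normalized_size(width: int, height: int, max_width: int = 2000) -> tuple:
    """Return the size to hash at: shrink to max_width if wider, else unchanged."""
    if width <= max_width:
        return width, height          # small images are hashed as they are
    scale = max_width / width
    return max_width, max(1, round(height * scale))


print(normalized_size(4000, 3000, max_width=2000))  # (2000, 1500)
print(normalized_size(800, 600, max_width=1000))    # (800, 600)
```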

Currently, the APIs for adding and matching image licenses take a pHash value that is extracted from the image. This hash value is a 64-bit binary value. For fast processing, the database and the C++ daemon handle it as an unsigned long long. However, while Anna was developing the JavaScript pHash module, a problem came up: when the JavaScript code printed the output hash value, the last 4 or 5 characters were wrong. That was because JavaScript numbers are double-precision floating-point values, which can represent integers exactly only up to 2^53.
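The precision loss can be reproduced outside JavaScript, because a Python `float` is the same IEEE 754 double that JavaScript uses for its numbers. The sketch below uses a made-up hash value; passing the hash as a hexadecimal string is shown only as one possible workaround, not necessarily the one the project adopted.

```python
hash_value = 0xFEDCBA9876543211   # made-up 64-bit hash, larger than 2**53

# A double has a 53-bit significand, so the low bits of a 64-bit value are lost.
as_double = float(hash_value)
print(int(as_double) == hash_value)    # False

# Carrying the hash as a 16-digit hex string keeps every bit.
as_text = format(hash_value, "016x")
print(int(as_text, 16) == hash_value)  # True
```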

The PHP API doesn’t have to be changed because it avoids the problem by using base64 encoding.

For the MySQL database field, we decided to keep the 64-bit unsigned integer type for the DCT hash value. That way, the value does not need to be converted from a string type to a number type when it is loaded into memory for indexing.
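The post does not say exactly what is base64 encoded, so the following is only a sketch under an assumption: the 64-bit hash is packed into 8 big-endian bytes and sent as base64 text, which never passes through a floating-point number. The helper names `encode_hash` and `decode_hash` are hypothetical.

```python
import base64
import struct


def encode_hash(value: int) -> str:
    """Pack an unsigned 64-bit hash into 8 bytes and return it as base64 text."""
    packed = struct.pack(">Q", value)   # ">Q" = big-endian unsigned 64-bit
    return base64.b64encode(packed).decode("ascii")


def decode_hash(text: str) -> int:
    """Reverse of encode_hash: base64 text back to the 64-bit integer."""
    (value,) = struct.unpack(">Q", base64.b64decode(text))
    return value


token = encode_hash(0xFEDCBA9876543211)   # made-up hash value
print(token, decode_hash(token) == 0xFEDCBA9876543211)
```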

Picture

UI

Previously, my colleague Anna made a page that searches for similar images from an uploaded file or from a link. This UI page can be hosted either on the server or outside the server. It uses only the PHP API, without accessing the database directly.

PHP API

This is an open API that provides functions for adding, deleting, and matching images. It can be accessed by anyone who wants this functionality. The UI page or a client implementation such as a browser extension uses this API. The matching result is in JSON format.
The API performs Add/Delete/Match by asking the “C++ Daemon”, without changing the database itself.
It will be permitted read-only access to the database only.

C++ Daemon

All adding and deleting operations will be done in this daemon. By doing so, we can remove the problem of synchronization between the database and the index used for matching, because this daemon will keep the content index in memory at all times for fast matching.
Because this daemon is active all the time, it works as a domain socket server that receives requests from the “PHP API” and returns the results. The PHP API will send its requests over the domain socket.
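The request/response pattern described above can be sketched in a few lines. This is an illustration in Python rather than the C++ daemon itself, and everything about the wire format is an assumption: newline-delimited JSON, the `op`, `id`, `hash`, and `max_distance` fields, and the dictionary used as the in-memory index. `socket.socketpair()` stands in for a real domain socket bound to a filesystem path so that the example is self-contained.

```python
import json
import socket
import threading

index = {}  # in-memory index: image id -> 64-bit hash (hypothetical layout)


def handle(request: dict) -> dict:
    """Apply one add/delete/match request to the in-memory index."""
    op = request["op"]
    if op == "add":
        index[request["id"]] = request["hash"]
        return {"ok": True}
    if op == "delete":
        return {"ok": index.pop(request["id"], None) is not None}
    if op == "match":
        query = request["hash"]
        limit = request.get("max_distance", 4)
        hits = [image_id for image_id, stored in index.items()
                if bin(stored ^ query).count("1") <= limit]
        return {"ok": True, "matches": sorted(hits)}
    return {"ok": False, "error": "unknown op"}


def serve(conn: socket.socket) -> None:
    """Daemon side: answer newline-delimited JSON requests until the peer closes."""
    with conn, conn.makefile("rw") as stream:
        for line in stream:
            stream.write(json.dumps(handle(json.loads(line))) + "\n")
            stream.flush()


def call(stream, request: dict) -> dict:
    """API side: send one request and wait for the reply."""
    stream.write(json.dumps(request) + "\n")
    stream.flush()
    return json.loads(stream.readline())


# Two connected sockets: one end plays the daemon, the other the PHP API.
daemon_end, api_end = socket.socketpair()
threading.Thread(target=serve, args=(daemon_end,), daemon=True).start()
with api_end, api_end.makefile("rw") as api:
    call(api, {"op": "add", "id": "img1", "hash": 0b1111})
    print(call(api, {"op": "match", "hash": 0b1101}))
```

Because only the daemon mutates `index`, there is a single writer and nothing to synchronize between the index and its callers, which is the design point made in the text.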

MySQL

The database contains all the metadata about CC-licensed images, along with the thumbnail paths used to show previews in the matching results.

So far, I have tested Pastec in terms of the quality of image matching. In this post, I test the speed of adding and searching.

Adding images to index

First, I added 100 images, which took 48.339 seconds. Then I added all the directories from 22 to 31. Those images were uploaded to Wikimedia Commons from 2013.12.22 to 2013.12.31.

| Directory | Start | End | Duration | Count | Average |
|---|---|---|---|---|---|
| 22 | 17:32:42 | 18:43:50 | 01:11:08 | 8785 | 00:00.49 |
| 23 | 18:43:50 | 19:42:03 | 00:58:13 | 7314 | 00:00.48 |
| 24 | 19:42:03 | 20:28:56 | 00:46:53 | 6001 | 00:00.47 |
| 25 | 20:28:57 | 21:28:02 | 00:59:05 | 7783 | 00:00.46 |
| 26 | 21:28:02 | 22:41:12 | 01:13:10 | 9300 | 00:00.47 |
| 27 | 22:41:19 | 23:54:28 | 01:13:09 | 9699 | 00:00.45 |
| 28 | 00:54:28 | 01:53:23 | 00:58:55 | 7912 | 00:00.45 |
| 29 | 00:53:23 | 02:27:42 | 01:34:19 | 11839 | 00:00.48 |
| 30 | 02:27:42 | 03:31:48 | 01:04:06 | 8827 | 00:00.44 |
| 31 | 03:31:48 | 04:23:15 | 00:51:27 | 6880 | 00:00.45 |

The average time for adding an image was around 0.46 seconds, and it didn’t increase as the index grew. Most of the time for adding an image is spent extracting features.
I saved the index file for the 100 images, for directories 22 to 26, and for directories 22 to 31. The sizes were 8.7 MB, 444.1 MB, and 935.8 MB, respectively.
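The per-image averages in the table can be recomputed from the Duration and Count columns. The sketch below copies those two columns from the table above and divides total seconds by total images; the helper name `to_seconds` is just for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss duration into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (directory, duration, image count) copied from the adding table
rows = [
    ("22", "01:11:08", 8785), ("23", "00:58:13", 7314),
    ("24", "00:46:53", 6001), ("25", "00:59:05", 7783),
    ("26", "01:13:10", 9300), ("27", "01:13:09", 9699),
    ("28", "00:58:55", 7912), ("29", "01:34:19", 11839),
    ("30", "01:04:06", 8827), ("31", "00:51:27", 6880),
]

for directory, duration, count in rows:
    print(directory, round(to_seconds(duration) / count, 2))

total_seconds = sum(to_seconds(duration) for _, duration, _ in rows)
total_images = sum(count for _, _, count in rows)
overall = total_seconds / total_images
print(total_images, round(overall, 2))  # 84340 0.46
```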

Searching images

I loaded the index file for the 100 images and searched for all 100 images that had been added.

| Directory | Start | End | Duration | Count | Average |
|---|---|---|---|---|---|
| 22 | | | 00:01:14 | 100 | 00:00.74 |

Searching took 1m14.781s. Since there were 100 images, the average time to search for one image was 0.74 seconds.

Then I loaded the index file that contains the index for the 39,183 images in directories 22 to 26.

| Directory | Start | End | Duration | Count | Average |
|---|---|---|---|---|---|
| 22 | 09:00:05 | 11:21:06 | 02:21:01 | 8785 | 00:00.96 |
| 23 | 11:21:06 | 13:13:52 | 01:52:46 | 7314 | 00:00.93 |
| 24 | 13:13:52 | 14:48:26 | 01:34:34 | 6001 | 00:00.95 |
| 25 | 14:48:26 | 16:48:44 | 02:00:18 | 7783 | 00:00.93 |
| 26 | 16:48:44 | 19:13:11 | 02:24:27 | 9300 | 00:00.93 |

This time, the average time for searching for one image was 0.95 seconds.

Then I loaded the index file that contains the index for the 84,340 images in directories 22 to 31.

| Directory | Start | End | Duration | Count | Average |
|---|---|---|---|---|---|
| 22 | 19:32:54 | 22:44:09 | 03:11:15 | 8785 | 00:01.31 |
| 23 | 20:44:09 | 23:16:59 | 02:32:50 | 7314 | 00:01.25 |
| 24 | 01:16:59 | 03:24:52 | 02:07:53 | 6001 | 00:01.28 |
| 25 | 03:24:52 | 06:11:33 | 02:46:41 | 7783 | 00:01.28 |
| 26 | 06:11:33 | 09:30:53 | 03:19:20 | 9300 | 00:01.29 |

The search was performed for the same images, from directories 22 to 26. The average search time was 1.3 seconds.

Conclusion

Adding an image took around 0.46 seconds.

Adding time didn’t vary with index size.

The time to search for an image varied with index size.

When the index size was 100, 39,183, and 84,340 images, the search time was 0.74, 0.95, and 1.3 seconds, respectively.
In the chart, the y-axis is time in milliseconds. Around 0.6 seconds is likely spent reading an image and extracting features, and the search time increases in proportion to the size of the index.
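As a rough check of the proportional growth, a straight line can be fitted to the three measurements with ordinary least squares. This is a sketch, not part of the original test: the intercept estimates the fixed per-search cost (reading the image and extracting features) and the slope estimates the extra time per indexed image.

```python
sizes = [100, 39183, 84340]   # images in the index
times = [0.74, 0.95, 1.3]     # measured seconds per search

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(times) / n

# Ordinary least-squares slope and intercept for time = intercept + slope * size
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

print(f"fixed cost per search: {intercept:.2f} s")
print(f"extra time per 10,000 indexed images: {slope * 10_000:.3f} s")
```

With only three data points the fit is coarse, so the intercept should be read as an approximate figure rather than a precise measurement of the fixed cost.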

In the previous test of Pastec, I used 900 JPEG images that were mainly computer-generated. This time, I tested images from the Wikimedia Commons archive of CC-licensed images uploaded from 2013-12-25 to 2013-12-30. The archives are zip files of 17 GB to 41 GB, each containing around 10,000 files, including jpg, gif, png, tiff, ogg, pdf, djvu, svg, and webm. Before testing, I deleted the xml, pdf, djvu, and webm files. That left 55,643 images.

Indexing

Indexing 55,643 images took around 12 hours, and the index file was 622 MB. At first, I made a separate index file for each day. However, Pastec can load only one index file, so I added all six days’ images and saved them to one index file.

While indexing, there were some errors.

Pastec uses OpenCV, and OpenCV doesn’t support gif and svg, so files in these two formats could not be opened.

Pastec only adds images that are at least 150×150 pixels.

There were zero-byte images: 153 of the 55,643 files. The Wikimedia web pages for them show valid images, but the zero-byte files still cause an error.

One tiff image caused a crash inside Pastec. It needs debugging.
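The problems listed above could be caught with a pre-filter before files are handed to Pastec. The following is only a sketch, not something Pastec provides: `skip_reason` is a hypothetical helper, the image dimensions are assumed to come from whatever tool reads the file, and the size rule assumes that an image with a side below 150 pixels is rejected.

```python
import os

UNSUPPORTED_EXTENSIONS = {".gif", ".svg"}  # formats the post reports OpenCV could not open
MIN_SIDE = 150                             # assumed reading of the 150x150 limit


def skip_reason(path, width=None, height=None):
    """Return why a file should be skipped before indexing, or None to keep it."""
    if os.path.splitext(path)[1].lower() in UNSUPPORTED_EXTENSIONS:
        return "unsupported format"
    if os.path.getsize(path) == 0:
        return "zero-byte file"
    if width is not None and height is not None:
        if width < MIN_SIDE or height < MIN_SIDE:
            return "smaller than 150x150"
    return None


print(skip_reason("diagram.svg"))  # unsupported format
```

The extension check runs first, so unsupported files are rejected without touching the disk.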

Searching

After loading the 622 MB index file, images can be searched. Searching for all 55,643 images took around 15 hours. Every search extracts features from the query image before searching the index, so searching takes more time.

Search result

Among the 55,643 images, 751 (1.43%) were smaller than 150×150, so they were not added. 51,479 images had a proper size and a format that OpenCV supports, so they were indexed and can be searched.

42,931 (83%) images were matched only with themselves (exactly the same image)

8,459 (15%) images were matched with more than one image

90 (0.17%) images were not matched with any image, even with themselves.

Images that didn’t match any image

These 90 images were properly indexed, but didn’t match even with themselves.

55 images were png images that include transparency. The remaining cases were jpg images.

14 images were long panorama images, like the following

6 images were simple images, like the following

8 were vague images: the lines are not clear, or the photographs are out of focus

Other cases
These two images are a bit out of focus.

The original size of this image is 150×150 pixels. Maybe it is too small and simple.

Images matched with more than one image

8,459 images were matched with more than one image. To compare the results, I generated an HTML file that shows all the match results, like the following:

I converted all images to 250×250 pixels using the convert -resize 250x250 filename command so that they could be shown on one page. The HTML file was 6.8 MB and shows 64,630 images.

As I mentioned in my previous post, Pastec is good at detecting rotated or cropped images.
Almost all matches were reasonable (similar). The following are notable matches:

In these two cases, the logo was matched.

This match looks like a false positive.

This match is also a false positive.

In this case, the map is shifted.

This is an obvious false positive; maybe the sharp part of the airplane was matched with part of the roof.

From my observation, there were fewer than 50 obvious false positive matches that don’t share any object, which was 0.08%. Wrong matches usually occurred when the image contained graphs or documents. When the image was a normal photograph, the result was very reliable.