Image Extraction From a URL in Scala

There are few methods or posts that explain how to extract the main image from a URL. Unfortunately, I recently needed exactly this feature for a project at my company. I searched in many places but found few results. Some people provide Chrome plugins or code to obtain all the images in a URL (this is quite simple: you only need to parse the page's HTML and find every img tag), but that is not what we want. I want to get the one main image of a URL, that is, a single image within the linked page that is likely to reflect its whole content. We live in the internet era and do not lack information. In fact, we are already lost in too many messages. If we can read less text and let a few images tell the story, life can speed up. (Of course, if you prefer a slow life, just enjoy it.) That is the purpose: we want to show an image before the user clicks the URL.

Idea:

I have already written down why we need this feature. The next step is to explain how we achieve it. We follow these steps:

1. Obtain all img tags by parsing the URL's HTML.

2. Filter out all well-known bad images, such as logos, brand marks, and icons. (Nobody wants a single icon to represent an article's content.)

3. Filter out all images whose size does not qualify: too long, too wide, too small, and so on. (A main image needs a reasonable size to hold its place on the page.)

4. Obtain the real size of the remaining images and apply step 3 again. (Sometimes the img tag does not carry width/height attributes. In that case, we need to read the real dimensions from the image link.)

5. Sort the remaining images by their real image area. (We believe that the larger the image, the more likely it is to be the main image.)
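
The steps above can be sketched in Scala roughly as follows. This is only a minimal illustration, not the code from the project: the names (MainImage, Img, pick, badWords, minSide, maxAspectRatio) and the threshold values are my own assumptions, and the regex-based tag extraction is a simplification to keep the example dependency-free. A real HTML parser such as jsoup would be more robust.

```scala
import java.io.InputStream
import javax.imageio.ImageIO
import scala.util.Try

object MainImage {
  // A candidate image; width/height are None when the tag does not declare them.
  final case class Img(src: String, alt: String, width: Option[Int], height: Option[Int]) {
    def area: Long = (for (w <- width; h <- height) yield w.toLong * h).getOrElse(0L)
  }

  // Hypothetical tuning parameters, not values from the original project.
  val badWords: Seq[String] = Seq("logo", "brand", "icon")
  val minSide = 100          // reject images narrower or shorter than this
  val maxAspectRatio = 3.0   // reject banners and thin strips

  private val imgTag = """(?is)<img\b[^>]*>""".r

  // Read one quoted attribute from a tag (ignores prefixed names such as data-src).
  private def attr(tag: String, name: String): Option[String] = {
    val re = ("(?is)(?<![\\w-])" + name + "\\s*=\\s*[\"']([^\"']*)[\"']").r
    re.findFirstMatchIn(tag).map(_.group(1).trim)
  }

  private def intAttr(tag: String, name: String): Option[Int] =
    attr(tag, name).flatMap(v => Try(v.stripSuffix("px").toInt).toOption)

  // Step 1: collect every img tag that has a non-empty src.
  def extract(html: String): Seq[Img] =
    imgTag.findAllIn(html).toSeq.flatMap { tag =>
      attr(tag, "src").filter(_.nonEmpty).map { src =>
        Img(src, attr(tag, "alt").getOrElse(""), intAttr(tag, "width"), intAttr(tag, "height"))
      }
    }

  // Step 2: well-known bad images, detected by keywords in the image URL.
  def looksBad(img: Img): Boolean = {
    val s = img.src.toLowerCase
    badWords.exists(w => s.contains(w))
  }

  // Step 3: size check; images of unknown size pass and are measured in step 4.
  def sizeOk(img: Img): Boolean = (img.width, img.height) match {
    case (Some(w), Some(h)) =>
      val ratio = math.max(w, h).toDouble / math.max(1, math.min(w, h))
      w >= minSide && h >= minSide && ratio <= maxAspectRatio
    case _ => true
  }

  // Step 4: read the real dimensions from image bytes without decoding the pixels.
  def readSize(in: InputStream): Option[(Int, Int)] = Try {
    val iis = ImageIO.createImageInputStream(in)
    try {
      val readers = ImageIO.getImageReaders(iis)
      if (!readers.hasNext) None
      else {
        val r = readers.next()
        try { r.setInput(iis); Some((r.getWidth(0), r.getHeight(0))) }
        finally r.dispose()
      }
    } finally iis.close()
  }.toOption.flatten

  // Download an image and measure it; src must already be an absolute URL.
  def fetchSize(src: String): Option[(Int, Int)] =
    Try(java.net.URI.create(src).toURL.openStream()).toOption.flatMap { in =>
      try readSize(in) finally in.close()
    }

  // Steps 1 to 5 combined: the largest surviving image wins.
  def pick(html: String, sizeOf: String => Option[(Int, Int)] = fetchSize): Option[Img] = {
    val measured = extract(html)
      .filterNot(looksBad)
      .filter(sizeOk)
      .flatMap { img =>
        if (img.width.isDefined && img.height.isDefined) Some(img)
        else sizeOf(img.src).map { case (w, h) => img.copy(width = Some(w), height = Some(h)) }
      }
      .filter(sizeOk)
    if (measured.isEmpty) None else Some(measured.maxBy(_.area))
  }
}
```

Calling MainImage.pick(html) on a downloaded page returns the chosen image, if any. The sizeOf parameter exists so that the network lookup can be swapped for a cache or a stub. Relative src values would need to be resolved against the page URL before fetchSize can download them.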

I have one more filter: I already know the URL's main topic, so the img's value/description can also be used as a measure when sorting.
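
One possible way to mix the topic into the sort is to boost an image's area by how many topic words appear in its description. The sketch below is an assumption on my part, not the formula used in the project: TopicRank, Candidate, overlap, and the boost factor are all hypothetical names and values.

```scala
object TopicRank {
  // A candidate whose real size is already known; description is e.g. its alt text.
  final case class Candidate(src: String, description: String, width: Int, height: Int)

  // Lower-cased words of at least three characters.
  def words(text: String): Set[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.length >= 3).toSet

  // Fraction of topic words that also appear in the description, between 0 and 1.
  def overlap(topic: String, description: String): Double = {
    val t = words(topic)
    if (t.isEmpty) 0.0 else (t intersect words(description)).size.toDouble / t.size
  }

  // Hypothetical weighting: a full topic match multiplies the area by (1 + boost).
  def score(c: Candidate, topic: String, boost: Double = 2.0): Double =
    c.width.toDouble * c.height * (1.0 + boost * overlap(topic, c.description))

  // Best candidate first.
  def rank(cs: Seq[Candidate], topic: String): Seq[Candidate] =
    cs.sortBy(c => -score(c, topic))
}
```

With this weighting, a smaller image whose description matches the topic can outrank a larger image with no description, while pure area still decides when nothing matches.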

To be honest, this approach does not find the main image of a URL 100% of the time, even though we use several methods to filter and sort. You also need to tune the parameters, such as the size thresholds and the list of bad keywords, to get good results.