Apart from that terrible reference, it is indeed time to tell you where the anime database stands right now. But first of all, I thought it would be best to clear up what this is really all about, since my previous post on the subject was really just me typing at 200wpm while breathing heavily as the idea cast its spell over me.

What is this sh*t?

There’s a bunch of anime databases out there apart from MyAnimeList, such as Anime Planet, AniDB and Anime News Network, to name a few. Websites like these contain anime/manga/novel entries, each detailing one item. They can be compared to IMDb, which does the same – except for movies. Sometimes it’s useful for such a site to offer a RESTful API, which lets developers fetch item details from its database and add them to their own applications. Because the last thing we want to do is input all the anime/manga data into our own databases by hand. Why not let the computer do it for us, amirite?

via https://codeplanet.io/principles-good-restful-api-design/

Now, back to MyAnimeList. MAL has an API but it’s very lacking. You can’t fetch anime, manga, people or even character details directly. Furthermore, the output is in XML rather than JSON. 😦

Okay, what now?

So what do we do? We create our own. Let’s say that we now have an API that can fetch the data for any anime or manga from its link by means of scraping.

Let’s talk about scraping. Scraping is a method that fetches a web page and goes through all the nicely written/s HTML code, using an algorithm that extracts the information you need from that page. When there’s no API, this is the only solution. That, or we use another service that provides an API, but I really wanted to see how far I could go with this project – so why not?
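To make that a bit more concrete, here is a minimal sketch of the idea in Python (Jikan itself is PHP, so this is not its code). The markup in `SAMPLE_PAGE` and the class names `title` and `episodes` are made up for illustration; a real MAL page looks nothing like this tidy.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup. The real page layout is an unknown here.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Example Anime</h1>
  <span class="episodes">12</span>
</body></html>
"""


class DetailScraper(HTMLParser):
    """Collects the text of elements whose class attribute is in WANTED."""

    WANTED = {"title", "episodes"}  # assumed class names, not MAL's

    def __init__(self):
        super().__init__()
        self.details = {}
        self._current = None  # class of the wanted element we are inside

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.WANTED:
            self._current = cls

    def handle_data(self, data):
        # Only keep non-empty text found inside a wanted element.
        if self._current and data.strip():
            self.details[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def scrape(html):
    """Parse an HTML string and return the extracted details as a dict."""
    parser = DetailScraper()
    parser.feed(html)
    return parser.details


print(scrape(SAMPLE_PAGE))  # {'title': 'Example Anime', 'episodes': '12'}
```

In a real scraper, the HTML would come from an HTTP request instead of a string constant, and the extraction rules would have to follow whatever the target page actually looks like.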

What’s left?

We now have code that scrapes the web page and returns juicy data that you can cache/save/add/whatever. This requires you to give the algorithm a link to the page you want scraped, but there are hundreds of thousands of anime and manga out there. It would be ridiculous to leave that to human hands. This is where the Crawler comes in.

The Crawler

What a ‘Crawler’ generally does is start at some page and scan that page for other links. Those links get saved, and then it visits them in turn, and this recursively keeps on going and going and going.
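Here is a rough Python sketch of that loop. To keep it runnable without touching the network, the "website" is a made-up dictionary (`FAKE_SITE`) mapping each page to the links on it. The sketch uses a queue and a `seen` set rather than literal recursion, so pages that link back to each other don't send it in circles forever.

```python
from collections import deque

# Hypothetical in-memory "site": page -> links found on that page.
FAKE_SITE = {
    "/": ["/anime/1", "/about"],
    "/anime/1": ["/anime/2", "/"],
    "/anime/2": ["/anime/1"],
    "/about": [],
}


def crawl(start, get_links):
    """Visit every page reachable from `start`, each exactly once."""
    seen = {start}           # pages already queued, to avoid revisiting
    queue = deque([start])   # pages waiting to be visited
    order = []               # pages in the order they were visited
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", lambda page: FAKE_SITE.get(page, []))
print(visited)  # ['/', '/anime/1', '/about', '/anime/2']
```

A real crawler would replace the lambda with a function that downloads the page and pulls the links out of its HTML.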

Now, as the crawler is doing its job, the scraper goes through the newly populated cache of links and gets the data from each one. This is basically how search engines index pages.

But we’re making a really specific crawler. What I’m looking for are links to anime entries within MAL, as I mentioned before, which follow this pattern: https://myanimelist.net/anime/{anime id}

The crawler looks for links with this pattern and saves them, then we have the scraper go through them, and we get an indexed database!
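Matching that pattern can be sketched with a regular expression. This assumes the `{anime id}` part is purely numeric, which the post doesn't spell out, and the sample HTML is invented for the example.

```python
import re

# Assumption: the anime id is a run of digits right after /anime/.
ANIME_LINK = re.compile(r"https://myanimelist\.net/anime/(\d+)")


def extract_anime_ids(html):
    """Return the unique anime ids linked in `html`, in first-seen order."""
    ids = []
    for match in ANIME_LINK.finditer(html):
        anime_id = int(match.group(1))
        if anime_id not in ids:
            ids.append(anime_id)
    return ids


# Hypothetical page fragment with one manga link and one duplicate.
sample = """
<a href="https://myanimelist.net/anime/1/Example">one</a>
<a href="https://myanimelist.net/manga/2/Other">a manga, ignored</a>
<a href="https://myanimelist.net/anime/20">twenty</a>
<a href="https://myanimelist.net/anime/1/Example">one again</a>
"""

print(extract_anime_ids(sample))  # [1, 20]
```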

What’s new?

Due to busy college life and other projects, I haven’t been able to give this the attention it needs to get finished. However, as summer approaches, I find myself once again with a lot of time on my hands.

Realizing that MyAnimeList was lacking a simple API to fetch anime or manga details, I decided to create my own. I teased a few screenshots at the end of the previous related post as well. I basically decided to create an unofficial API that lets you simply do what you can’t do from the official API.

Meet ‘Jikan’ – The Unofficial MyAnimeList API

This is the Scraper I’ve been talking about. It’s written in object-oriented PHP. So far it can fetch Anime, Manga and some Character details. It’s going to do a lot more, very soon.

Hell, I even got a domain for it: http://jikan.me, although there is nothing to be seen there at the moment. For now, I plan on hosting the API there once it’s finished so that others can use it with ease. Jikan returns data in JSON format in response to a simple, RESTful GET request.

It seems I’ve gotten quite sidetracked. Right now I have a solid algorithm to fetch the details required to make an Anime database. The next obvious step would be to make a robust crawler, right?

No.

That would double the bandwidth and processing power needed. Each page would have to be downloaded and scanned twice: once for the crawler, once for the scraper. I do realize that I previously used the crawler method and got a list of quite a few anime with their details, but it was not until a few days later that I realized MAL had a sitemap.

According to this and this we have two less time-consuming methods. The first one is a sitemap of anime listings meant for crawlers/search engines. The second one is a method to download a huge list of entries using wildcards in the search. Personally, I have a terrible internet speed, and for now I just want to confirm that this works by testing my API against the data it scrapes. The sitemap goes up to 33,000 anime IDs, whereas the wildcard search results yield more than 107,000 anime IDs! I’ll go with the former, which covers roughly 30% of the entries.
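As a rough sketch of the sitemap route, here is how the anime IDs could be pulled out of a sitemap in Python. It assumes MAL’s sitemap follows the standard sitemaps.org XML format, which I haven’t shown here, and the three-entry `SAMPLE_SITEMAP` is made up. The last lines just check the "roughly 30%" figure from the numbers above.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the standard sitemap protocol (assumed to apply to MAL).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumption: anime ids are numeric.
ANIME_URL = re.compile(r"https://myanimelist\.net/anime/(\d+)")

# Hypothetical sitemap contents, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://myanimelist.net/anime/1</loc></url>
  <url><loc>https://myanimelist.net/anime/5</loc></url>
  <url><loc>https://myanimelist.net/about</loc></url>
</urlset>
"""


def anime_ids_from_sitemap(xml_text):
    """Return the anime ids of every <loc> entry that is an anime URL."""
    root = ET.fromstring(xml_text)
    ids = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        match = ANIME_URL.match((loc.text or "").strip())
        if match:
            ids.append(int(match.group(1)))
    return ids


print(anime_ids_from_sitemap(SAMPLE_SITEMAP))  # [1, 5]

# Share of entries the sitemap covers, using the figures quoted above.
coverage = 33_000 / 107_000
print(f"{coverage:.1%}")  # 30.8%
```

Each ID found this way can go straight to the scraper, so no page has to be downloaded just to discover links.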