Need a database dump of a comprehensive manga/webcomics database like mangaupdates.
Has anybody scraped the site? Or is there some other source?
>>367066
Somebody must have been scraping it for a while, since they specifically disabled going past the 50th page of database search results to discourage it. There was also a fake mangaupdates site, sheeky forums style, but I don't remember what its URL was. That may have been a couple of years ago.
>>367084
What I feel is that the site works like a black box, and it is too important to non-Japanese manga fans for that. Something that important should not be a black box. We need to scrape all the data and do a db dump.
>>367102
Maybe scrape archive.org's cache for every series? Fill in every series ID from 1 to 144285 (the current highest series) after:
https://www.mangaupdates.com/series.html?id=
Some have been pruned as either duplicates or oneshots absorbed into collections.
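A rough sketch of that idea in Python, assuming the Wayback Machine availability endpoint (https://archive.org/wayback/available) is used to check whether a snapshot exists for each series page. The function names are made up for illustration, and pruned IDs would simply come back with no snapshot:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL quoted in the post above; the ID is appended after "id=".
SERIES_URL = "https://www.mangaupdates.com/series.html?id={}"
# Wayback Machine availability endpoint (returns JSON).
WAYBACK_API = "https://archive.org/wayback/available"


def series_url(series_id: int) -> str:
    """Build the live series page URL for one ID."""
    return SERIES_URL.format(series_id)


def availability_query(series_id: int) -> str:
    """Build the availability query URL for one series page."""
    return WAYBACK_API + "?" + urlencode({"url": series_url(series_id)})


def closest_snapshot(payload: dict):
    """Pull the closest snapshot URL out of a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(series_id: int, timeout: float = 30.0):
    """Ask the Wayback Machine for one series page (performs a network call)."""
    with urlopen(availability_query(series_id), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup(n)` for each n from 1 to 144285 would give the list of snapshot URLs to download. How complete archive.org's coverage of the series pages is remains unknown.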
>>367166
Brute forcing the links up to 150000 was my plan, but it cannot be done without something like http://luminati.io/. I don't want to make the effort if it has already been done before or alternatives exist, though.
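For what it's worth, the enumeration loop itself is simple. Below is a minimal sketch in Python of a slow, resumable pass over the ID range. It does nothing about IP blocking, and whether the site would tolerate even a throttled crawl from one address is an assumption. `fetch` is a hypothetical callable you would supply (it returns the page text, or None for a pruned ID), and `done` is a set of already-visited IDs that you would persist between runs:

```python
import time
from typing import Callable, Dict, Iterable, Optional, Set


def crawl(
    ids: Iterable[int],
    fetch: Callable[[int], Optional[str]],
    done: Set[int],
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, str]:
    """Fetch each ID once, skipping IDs already in `done`, pausing between requests."""
    results: Dict[int, str] = {}
    for series_id in ids:
        if series_id in done:
            continue  # already handled in an earlier run
        page = fetch(series_id)
        done.add(series_id)  # mark as visited even if the ID was pruned
        if page is not None:
            results[series_id] = page
        sleep(delay)  # throttle: one request per `delay` seconds
    return results
```

At a 2 second delay, 150000 IDs take about 300000 seconds, roughly three and a half days, so being able to stop and resume matters more than raw speed.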
I've been searching high and low for a dataset comprising mangaupdates' manga info, tags and so on, to no avail.
Would really appreciate it if someone has done it before and is willing to share.
Also, have a bump.
>>367240
Found the thread about the fake mangaupdates site. It's dead now, so you won't be able to ask whoever was running that for what they had.
https://www.mangaupdates.com/showtopic.php?tid=52190
>>367240
I'm willing to scrape the site as well if I don't find similar datasets. I could write a new scraper in my free time. Will dump the data on pastebin or something. What would you use it for?
>>367308
All kinds of things, but first of all to write an automatic renamer to standardize the file names of my (huge ass) collection, and to help people in the same boat as me in the process (madokami for example).
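To give an idea of what such a renamer could look like, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the alias table stands in for the title and alternate-title data a dump would provide, the output pattern "Title v02 c012.ext" is just one possible convention, and the filename parsing only handles the simple v/vol and c/ch/chapter markers:

```python
import re
from typing import Dict, List, Optional


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so spelling variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_index(records: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized title and alias to its canonical title."""
    index: Dict[str, str] = {}
    for canonical, aliases in records.items():
        for name in [canonical, *aliases]:
            index[normalize(name)] = canonical
    return index


def standard_name(filename: str, index: Dict[str, str]) -> Optional[str]:
    """Return a standardized file name, or None if the title is not recognized."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        stem, ext = filename, ""
    # Drop [group] and (year) style tags, and treat underscores as spaces.
    cleaned = re.sub(r"[\[(].*?[\])]", " ", stem).replace("_", " ")
    vol = re.search(r"\bv(?:ol)?\.?\s*(\d+)", cleaned, re.IGNORECASE)
    ch = re.search(r"\bc(?:h(?:apter)?)?\.?\s*(\d+)", cleaned, re.IGNORECASE)
    # The title is whatever precedes the first volume or chapter marker.
    markers = [m.start() for m in (vol, ch) if m]
    raw_title = cleaned[: min(markers)] if markers else cleaned
    canonical = index.get(normalize(raw_title))
    if canonical is None:
        return None
    parts = [canonical]
    if vol:
        parts.append("v%02d" % int(vol.group(1)))
    if ch:
        parts.append("c%03d" % int(ch.group(1)))
    name = " ".join(parts)
    return name + "." + ext.lower() if ext else name
```

With an index built from `{"Example Series": ["Ex Series"]}`, a file called `[Group]_Ex_Series_v2_ch12.ZIP` would come out as `Example Series v02 c012.zip`, while unrecognized titles return None so they can be reviewed by hand.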
I am planning to get into AI and machine learning, but I need something that motivates me. Such a dataset is sure to help me put together a mini project or two about my manga consumption and my involvement with all kinds of people over the years.
Also, that site is generally the backbone of scanlation today, and it'd be disastrous for the community if something were to happen to it. Think of nyaa for example: it left a huge mess to clean up. Why the people over at MU never release their dataset as a free resource is beyond me, but we need to preserve the damn thing for sure.
>>367364
>Why the people over at MU never release their dataset as a free resource is beyond me
Because the site hasn't had any sort of programmer and thus has been ignoring reasonable suggestions for improvement for years. You think they want to give potential competitors a free head start? If the database goes elsewhere, they've got nothing but founder and network effects in their favor, and both of those are pretty weak given that 90% of the readers just go to online readers anyway. It'd be easy to advertise the shiny, new site on mangafox, batoto and the like.