

Thread replies: 7
Thread images: 1

/co/mrade here.

I'm trying to scrape the archives of a webcomic from 10 years ago so I can put it all in a .cbr file and not have to go through the website on a browser.

I've got a pretty good handle on how wget works from Googling, but the trouble I'm having I think has to do with how the website is set up.

When you View Image a comic from the archives, it shows up as
http://www.errantstory.com/comics/2003-01-31.jpg
You can change the filename to a different comic and it will go to that comic right away, but if you go to
http://www.errantstory.com/comics
it just redirects you to the main page.

The command I'm using is
wget -r -nd -P /home/myname/Downloads/ES -A "jpg" http://www.errantstory.com/comics/
but it must be getting redirected, because it only downloads the one comic that's on the main page.

Is there a way to get wget to try every possible filename in that directory so it can download them? Or any way to stop it from getting redirected?
>>
save this file http://pastebin.com/raw/v07TYA9J

then run:

wget -i <filename>

you'll get a bunch of 404s but at the end of it all, you'll have every comic jpg.
>>
>>56676839
-e robots=off
and
-U 'Mozilla/5.0'
is always a good idea in case a website doesn't like robots
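
For reference, a rough sketch of how those two options could be combined with the -i download from the earlier post. The function name fetch_comics, the DRY_RUN switch and the default output directory ES are made up for illustration; -nc (skip files already downloaded) and -w 1 (wait one second between requests) are extra assumptions added to be polite to the server, not something from the thread.

```shell
# Hypothetical helper: download every URL listed in a file.
# Usage: fetch_comics LIST_FILE [DEST_DIR]
fetch_comics() {
  list=$1
  dest=${2:-ES}   # assumed default output directory
  # -e robots=off : ignore robots.txt
  # -U            : send a browser-like User-Agent string
  # -nc, -w 1     : assumptions: skip existing files, pause 1s between requests
  set -- wget -e robots=off -U 'Mozilla/5.0' -nc -w 1 -P "$dest" -i "$list"
  if [ -n "$DRY_RUN" ]; then
    printf '%s\n' "$*"   # only print the command that would run
  else
    "$@"
  fi
}
```

Running it with DRY_RUN=1 set prints the wget command instead of executing it, which is a cheap way to check the flags before hitting the site.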
>>
>>56677095
That's exactly what I was just trying to do!

I got as far as getting the list of dates exported to a text file. Can I ask what you used to add the URL and .jpg to all of them?
>>
>>56677146

the script/loop i wrote added them.
otherwise you could do regex replace in sublime text

^ means start of line so just replace ^ with the url
$ means end of line so replace $ with .jpg
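
The same two replacements can probably be done from the command line with sed instead of an editor. This is only a sketch: dates.txt and urls.txt are assumed filenames, and the two sample dates are written out just so the snippet runs on its own.

```shell
# Assumption: dates.txt holds one date per line, e.g. 2003-01-31.
# Two sample lines so the example is self-contained.
printf '%s\n' 2003-01-31 2003-02-03 > dates.txt

# First expression: ^ matches the start of each line, so this prepends the URL.
# Second expression: $ matches the end of each line, so this appends .jpg.
sed -e 's|^|http://www.errantstory.com/comics/|' -e 's|$|.jpg|' dates.txt > urls.txt

cat urls.txt
```

Using | as the sed delimiter avoids having to escape the slashes in the URL.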

here's my hack job script:

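# Print a candidate URL for every year/month/day combination.
# Dates that don't exist (or have no comic) will just 404 when wget tries them.
for x in {2002..2012}; do
  for y in $(seq -f "%02g" 1 12); do
    for z in $(seq -f "%02g" 1 31); do
      echo "http://www.errantstory.com/comics/$x-$y-$z.jpg"
    done
  done
done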
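
A possible variation on the loop above, in case the impossible dates (Feb 30 and so on) are worth skipping: it only prints real calendar dates, so the only 404s left should be days with no comic. The function names days_in_month and gen_urls and the output file urls.txt are made up for this sketch; the 2002 to 2012 range is taken from the script above.

```shell
base="http://www.errantstory.com/comics"

# Echo the number of days in month $2 of year $1 (Gregorian leap-year rule).
days_in_month() {
  case $2 in
    4|6|9|11) echo 30 ;;
    2)
      if [ $(( $1 % 4 )) -eq 0 ] && { [ $(( $1 % 100 )) -ne 0 ] || [ $(( $1 % 400 )) -eq 0 ]; }; then
        echo 29
      else
        echo 28
      fi ;;
    *) echo 31 ;;
  esac
}

# Print one URL per real calendar date from 2002-01-01 through 2012-12-31.
gen_urls() {
  y=2002
  while [ "$y" -le 2012 ]; do
    m=1
    while [ "$m" -le 12 ]; do
      last=$(days_in_month "$y" "$m")
      d=1
      while [ "$d" -le "$last" ]; do
        printf '%s/%04d-%02d-%02d.jpg\n' "$base" "$y" "$m" "$d"
        d=$((d + 1))
      done
      m=$((m + 1))
    done
    y=$((y + 1))
  done
}

gen_urls > urls.txt
```

The resulting urls.txt can then be fed to wget -i the same way as the pastebin list.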
>>
>>56677296
Cool. Thanks.
>>
>>56677323
no probs, all the best

This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.