
What's the best way to scrape part of a website for offline


Thread replies: 12
Thread images: 2

File: 1460609152734.jpg (177KB, 600x740px)
What's the best way to scrape part of a website for offline reading? I'm on GNU+Linux so preferably something that isn't too complicated for a relative noob. I want to have all of Warosu's /g/ saved on my computer.
>>
here comes the plane, open your mouth~

wget >>60818943
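
For anyone who wants the wget suggestion spelled out, here is a minimal sketch. It assumes GNU wget is installed and that the site tolerates automated mirroring. The helper names `build_wget_command` and `mirror`, the wait time, and the output directory are made up for illustration. The options are standard wget flags, but whether they capture every thumbnail depends on where the site hosts its images: wget does not follow links to other hosts by default.

```python
import subprocess


def build_wget_command(url, dest_dir, wait_seconds=2):
    """Return an argument list for a polite recursive wget mirror of url."""
    return [
        "wget",
        "--mirror",                # recursive download with timestamping
        "--no-parent",             # never climb above the starting directory
        "--page-requisites",       # also fetch images/CSS needed to render pages
        "--convert-links",         # rewrite links so the copy browses offline
        "--adjust-extension",      # save HTML pages with an .html suffix
        f"--wait={wait_seconds}",  # pause between requests
        "--random-wait",           # vary the pause a little
        f"--directory-prefix={dest_dir}",
        url,
    ]


def mirror(url, dest_dir):
    """Run wget; raises CalledProcessError if wget exits non-zero."""
    subprocess.run(build_wget_command(url, dest_dir), check=True)
```

Calling `mirror("https://warosu.org/g/", "warosu-g")` would start the download; printing `" ".join(build_wget_command(...))` first shows the equivalent shell command.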
>>
Some kid actually turned this in for a grade. The fact that we have become this outrageously stupid as a species makes me furious.
>>
>>60818943

php file_get_contents(), $dom = new DOMDocument(); $xpath = new DOMXPath($dom)

easy as shit
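
The same fetch-then-parse idea can be sketched in Python with only the standard library. This is not a translation of the PHP calls: `urlopen` stands in for file_get_contents(), and an `HTMLParser` subclass stands in for the DOMDocument/XPath pair, since the standard library has no XPath engine for real-world HTML. The names `LinkAndImageParser`, `fetch` and `extract` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndImageParser(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(html, base_url):
    """Return (links, images) found in html, resolved against base_url."""
    parser = LinkAndImageParser(base_url)
    parser.feed(html)
    return parser.links, parser.images
```

Something like `links, images = extract(fetch(url), url)` would then give the thread links and image URLs found on one page.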
>>
>>60818986

Some kid actually thought these pictures are real. The fact that we have become this outrageously stupid as a species makes me furious.
>>
>>60818997
Even if you're only pretending to be retarded, you're still being retarded.
>>
>>60818990
How do I translate this into a command? I want this (https://warosu.org/g/) with every thumbnail, image, post, thread, etc. saved on my hard drive. I'm an Ubuntu user so I'm not too familiar with the CLI lingo.
>>
>>60819006
He's right though. These pictures are old as hell and fake as fuck. They're still funny though.
>>
>>60818943
>>60819072
>I want to have all of Warosu's /g/ saved on my computer.
Not happening. Don't bother.
>>
>>60819221
Too large? I have a lot of storage space. I wonder if I should just find some neet on /r9k/ to pay and have him manually save every thread one by one.
>>
>>60819072
build a crawler with Python
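
A rough sketch of what such a crawler's main loop could look like. It is a plain breadth-first traversal that stays under the starting URL. The `fetch` and `extract_links` callables are assumptions supplied by the caller, not part of any library, and the page limit and delay are arbitrary defaults.

```python
import time
from collections import deque


def crawl(start_url, fetch, extract_links, max_pages=100, delay=1.0):
    """Breadth-first crawl of pages whose URL starts with start_url.

    fetch(url) -> str and extract_links(html, url) -> iterable of absolute
    URLs are provided by the caller. Returns a dict {url: html}.
    """
    pages = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(html, url):
            link = link.split("#", 1)[0]  # drop #fragment anchors
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
        if delay:
            time.sleep(delay)  # go easy on the server
    return pages
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other, and the prefix check is what keeps it from wandering off to the rest of the internet.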
>>
File: 1496982985498.jpg (35KB, 640x480px)
>>60819072
can you write js? python? ruby?
1. grab the pages
2. parse it
3. ???
4. profit!
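
For offline reading, the unnamed step 3 is presumably "write it to disk". One possible sketch follows, assuming you already have each page's URL and text. The helpers `local_path` and `save_page` are hypothetical. Query strings are ignored here, so two URLs that differ only in their query would overwrite each other.

```python
from pathlib import Path
from urllib.parse import urlsplit


def local_path(url, root):
    """Map a URL to a file path under root, mirroring the URL's host and path."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # directory-style URLs need a file name
    # Drop empty and dot segments so the result cannot escape root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return Path(root, parts.netloc, *segments)


def save_page(url, content, root):
    """Write content (a str) for url under root and return the file path."""
    target = local_path(url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target
```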
This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.