Spiders & Scraping

Thread replies: 10
Thread images: 1

Anonymous
Spiders & Scraping 2016-10-28 00:48:25 Post No. 57272836

File: oython.png (8KB, 200x200px)

Hey Anon. I recently started learning about Spiders and Scraping with Python. Just wondering if anyone has any good sites on learning to scrape anonymously? I have been trying to read anything I can get my hands on. I feel like this is such a valuable skill and I want to learn to do it as best as I can.

Anonymous 2016-10-28 00:53:18 Post No.57272900

You're not being clear. What do you want to scrape?

Anonymous 2016-10-28 00:53:41 Post No.57272909

Is there a reason to do it anonymously?

If there is then the answer is to use a proxy.
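A minimal sketch of the proxy suggestion, using only the standard library so it runs without extra installs. The proxy address is a made-up placeholder and the function names are hypothetical; nothing here is from the thread beyond "use a proxy".

```python
import urllib.request

# Hypothetical proxy address; replace with a proxy you actually control or rent.
PROXY_URL = "http://127.0.0.1:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url=PROXY_URL, timeout=10):
    """Fetch url through the proxy and return the raw response body as bytes."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

If you are on requests instead, the same idea is `requests.get(url, proxies={"http": PROXY_URL, "https": PROXY_URL})`. Either way the target site sees the proxy's address, while the proxy operator still sees yours.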

Anonymous 2016-10-28 00:54:41 Post No.57272922

>>57272836
m8 just use the requests + htmlparser libraries it's real ez
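A rough sketch of the parsing half of that suggestion, assuming "htmlparser" means Python's standard-library `html.parser` module. The class and function names are made up for illustration. It pulls every link out of a page you have already downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of absolute link URLs found in the html string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

With requests installed, the fetching half would be something like `extract_links(requests.get(url, timeout=10).text, url)`.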

Anonymous 2016-10-28 02:11:02 Post No.57273819

>>57272836
Cefpython is comfy as fuck!

Anonymous 2016-10-28 02:54:44 Post No.57274300

>>57272836
Install the following and do some tutorials.
http://docs.python-requests.org/en/master/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Anonymous 2016-10-28 04:48:49 Post No.57275453

>>57272836

why anonymous? just google "scrapy" and you'll get plenty, youtube it too

if you're trying to do craigslist, their listings only update every 15 minutes anyway, so set your interval to 15 minutes and you won't get banned.

I have a VPS, so I have my VPS and my home computer each scraping craigslist every 30 minutes, offset by 15 minutes
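One way the two-machine offset could be sketched: each machine runs the same script with a different offset (0 on one, 15 on the other), so together they hit the site once every 15 minutes. The function names and the idea of anchoring the schedule to midnight are assumptions for illustration, not the poster's actual setup.

```python
import time
from datetime import datetime, timedelta


def next_run(now, interval_minutes, offset_minutes=0):
    """Return the first scheduled time strictly after `now`.

    Runs are scheduled at offset, offset + interval, offset + 2 * interval, ...
    minutes past midnight.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - midnight).total_seconds() / 60 - offset_minutes
    periods_done = int(elapsed // interval_minutes) + 1
    return midnight + timedelta(minutes=offset_minutes + periods_done * interval_minutes)


def run_on_schedule(scrape, interval_minutes, offset_minutes=0):
    """Call scrape() at every scheduled time, forever."""
    while True:
        target = next_run(datetime.now(), interval_minutes, offset_minutes)
        time.sleep(max(0.0, (target - datetime.now()).total_seconds()))
        scrape()
```

The home computer would call `run_on_schedule(scrape, 30, 0)` and the VPS `run_on_schedule(scrape, 30, 15)`, where `scrape` is whatever function does the actual fetching.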

Anonymous 2016-10-28 04:56:07 Post No.57275529

>>57273819
this. skip all the stupid requests/scrapy shit and go straight to cefpython/selenium/ghost.py/QtWebEngine because that's the only shit that actually works

Anonymous 2016-10-28 09:17:29 Post No.57277538

>>57272922
As long as that's all you do, it's easy. Once you want something more than that, it gets freakin hard.

How do you decide when to re-crawl a certain site?
How would you handle millions of pages that need to be crawled?
What if you want to be certain that no duplication happens? (To avoid crawling the same URL twice, you need to recognize whether you've already crawled it, out of millions of URLs.)
What if you want to limit the load you put on a target site? If you just rush a site with everything you've got, it's basically a DDoS...
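A toy sketch of two of those problems, duplicate detection and per-site load limiting. Everything here (the `Frontier` class, the 5-second default delay, the normalization rules) is an assumption for illustration. An in-memory set is fine for thousands of URLs, but millions would need a database or a probabilistic structure, and re-crawl scheduling is not covered at all.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lowercase scheme and host, drop the fragment; keep path and query as-is.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """Queue of URLs to crawl that skips duplicates and spaces out hits per host."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.seen = set()      # normalized URLs queued so far
        self.pending = []      # normalized URLs waiting to be fetched
        self.last_hit = {}     # host -> time of the last request to it

    def add(self, url):
        """Queue url unless an equivalent URL was queued before. Return True if queued."""
        key = normalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.pending.append(key)
        return True

    def pop_ready(self):
        """Return a URL whose host has rested long enough, or None if none is ready."""
        now = self.clock()
        for i, url in enumerate(self.pending):
            host = urlsplit(url).netloc
            last = self.last_hit.get(host)
            if last is None or now - last >= self.min_delay:
                self.last_hit[host] = now
                return self.pending.pop(i)
        return None
```

A crawler loop would call `pop_ready()`, fetch the URL it returns, feed any new links back through `add()`, and sleep briefly whenever `pop_ready()` returns `None`.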

Anonymous 2016-10-28 11:40:28 Post No.57278662

>>57272836
Async + aiohttp and some html parsing.
Much faster than scraping synchronously.
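A sketch of the async pattern with a cap on simultaneous requests. To keep it runnable without aiohttp installed, the actual download is passed in as a coroutine function. `crawl`, `fake_fetch` and the limit of 10 are illustrative names and values, not anything from the thread.

```python
import asyncio


async def crawl(urls, fetch, max_concurrent=10):
    """Run fetch(url) for every URL in the list, at most max_concurrent at a time.

    Returns a dict mapping each URL to its result, or to the exception it raised.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        # The semaphore keeps only max_concurrent fetches in flight at once.
        async with semaphore:
            return await fetch(url)

    results = await asyncio.gather(
        *(bounded_fetch(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))


async def fake_fetch(url):
    """Stand-in for a real HTTP request; waits briefly and echoes the URL."""
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"
```

Run it with `asyncio.run(crawl(urls, fake_fetch))`. With aiohttp, `fetch` would open the URL on one shared `aiohttp.ClientSession` (`async with session.get(url) as resp: return await resp.text()`) instead of sleeping.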

Use a database to store each source website and all the links found on it.
That way you can also check for duplicates easily.
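A minimal sketch of that idea with SQLite from the standard library. The table layout and function names are assumptions. The `UNIQUE` constraint plus `INSERT OR IGNORE` lets the database do the duplicate check for you.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the link database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS links (
               source_url TEXT NOT NULL,
               link_url   TEXT NOT NULL,
               UNIQUE (source_url, link_url)
           )"""
    )
    return conn


def count_links(conn):
    return conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]


def store_links(conn, source_url, links):
    """Store links found on source_url; return how many rows were actually new."""
    before = count_links(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO links (source_url, link_url) VALUES (?, ?)",
        [(source_url, link) for link in links],
    )
    conn.commit()
    return count_links(conn) - before


def already_seen(conn, link_url):
    """Return True if link_url was recorded from any source page."""
    row = conn.execute(
        "SELECT 1 FROM links WHERE link_url = ? LIMIT 1", (link_url,)
    ).fetchone()
    return row is not None
```

Pass a file path to `open_db` to keep the data between runs. For large crawls an index on `link_url` would likely be needed to keep `already_seen` fast.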
