Hey Anon. I recently started learning about spiders and scraping with Python. Just wondering if anyone knows any good sites for learning to scrape anonymously? I have been trying to read anything I can get my hands on. I feel like this is such a valuable skill and I want to learn to do it as well as I can.
You're not being clear. What do you want to scrape?
Is there a reason to do it anonymously?
If there is, the answer is to use a proxy.
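Rough sketch of the proxy route, stdlib only. The proxy address and the user agent string are placeholders (assumptions, not real services), so nothing is actually fetched here. If you use requests instead, the equivalent is the `proxies=` argument to `requests.get`.

```python
import urllib.request

# Hypothetical proxy address; swap in one you actually control or rent.
PROXY_URL = "http://127.0.0.1:8080"

def build_proxied_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

opener = build_proxied_opener(PROXY_URL)
# Set an explicit User-Agent instead of the default Python-urllib one.
opener.addheaders = [("User-Agent", "example-scraper/0.1")]

# Uncomment once PROXY_URL points at a live proxy:
# with opener.open("https://example.com/", timeout=10) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```

Keep in mind the proxy operator still sees your traffic, so this only hides your address from the target site.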
>>57272836
m8 just use requests + html.parser (the stdlib HTML parser), it's real ez
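Something like this, as a minimal sketch. The parsing half uses the stdlib `html.parser`; the class and function names are made up for the example, and the sample HTML is a stand-in so it runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links

# With requests installed, the HTML would come from something like:
#   html = requests.get(url, timeout=10).text
sample = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(sample, "https://example.com/"))
# ['https://example.com/about', 'https://example.org/x']
```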
>>57272836
Cefpython is comfy as fuck!
>>57272836
Install the following and do some tutorials.
http://docs.python-requests.org/en/master/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
>>57272836
why anonymous? just google "scrapy" and you'll get plenty, youtube it too
if you're trying to do craigslist, their listings only update every 15 minutes anyway, so set your interval to 15 minutes and you won't get banned.
I have a VPS, so I have my VPS and my home computer each scraping craigslist every 30 minutes, offset by 15 minutes
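The offset setup can be sketched like this. The function names are made up for the example, and the 30/15 minute numbers are just the ones from this post, not anything craigslist documents. Each machine runs the same loop with a different `offset`.

```python
import time

def seconds_until_next_run(now, interval, offset=0):
    """Seconds to wait until the next slot at offset, offset + interval, ...

    All values are in seconds; `now` is a Unix timestamp.
    Always returns a value in (0, interval], so a fast job never re-runs
    inside the same slot.
    """
    elapsed = (now - offset) % interval
    return interval - elapsed

def run_forever(scrape, interval=30 * 60, offset=0):
    """Call scrape() once per interval, aligned to the given offset."""
    while True:
        time.sleep(seconds_until_next_run(time.time(), interval, offset))
        scrape()

# VPS:           run_forever(scrape, interval=30 * 60, offset=0)
# home computer: run_forever(scrape, interval=30 * 60, offset=15 * 60)
```

Between the two machines the site gets hit once every 15 minutes, but each address only shows up every 30.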
>>57273819
this. skip all the stupid requests/scrapy shit and go straight to cefpython/selenium/ghost.py/QtWebEngine because that's the only shit that actually works
>>57272922
As long as that's all you do, it's easy. Once you want more than that, it gets freakin hard.
How do you decide when to re-crawl a given site?
How do you handle millions of pages that need to be crawled?
What if you want to be certain that nothing gets crawled twice? (You need to recognize whether you've already crawled a URL, out of millions of URLs.)
What if you want to limit the load you put on a target site? If you just rush a site with everything you've got, it's basically a DoS attack...
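For the load question at least, a per-host minimum delay goes a long way. This is a minimal sketch, not a full crawler: `HostThrottle` is a made-up name, the 1 second default is an arbitrary assumption, and the clock and sleep functions are injectable only so it can be tested without waiting.

```python
import time
from urllib.parse import urlsplit

class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.last_hit = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request url; return seconds slept."""
        host = urlsplit(url).netloc
        now = self.clock()
        last = self.last_hit.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay:
            self.sleep(delay)
        # Record when this request actually goes out.
        self.last_hit[host] = now + delay
        return delay

throttle = HostThrottle(min_delay=2.0)
# Call throttle.wait(url) right before every request to url.
```

Different hosts don't block each other, so a crawl across many sites stays fast while each individual site only sees one request per `min_delay`.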
>>57272836
asyncio + aiohttp and some HTML parsing.
Much faster than scraping synchronously, since most of the time is spent waiting on the network
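Rough shape of it, as a sketch. The concurrency part is plain asyncio; `crawl` and `fake_fetch` are made-up names, and `fake_fetch` is a stand-in so this runs without a network or aiohttp installed. The limit of 10 is an arbitrary assumption.

```python
import asyncio

async def crawl(urls, fetch, max_concurrency=10):
    """Fetch all urls concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep one bad page from killing the run
                return url, exc

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(results)

async def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

# With aiohttp installed, fetch could instead be:
#   async def fetch(url):
#       async with session.get(url) as resp:
#           return await resp.text()
# where session is an aiohttp.ClientSession opened once for the whole crawl.

pages = asyncio.run(
    crawl(["https://example.com/a", "https://example.com/b"], fake_fetch)
)
print(sorted(pages))
```

The semaphore is what keeps "much faster" from turning into hammering the target, so don't drop it.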
Use a database to store each source page and all the links found on it.
That way you can also check for duplicates easily
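A minimal sketch with the stdlib `sqlite3` module. The table layout and function names are assumptions for the example; the point is that a primary key on the URL makes the database do the duplicate check for you.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the table if needed and return a connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"   # PRIMARY KEY rejects duplicate URLs
        " source TEXT,"            # page the link was found on (NULL for seeds)
        " crawled_at REAL)"        # NULL until the page has been fetched
    )
    return conn

def add_url(conn, url, source=None):
    """Insert url; return True if it was new, False if already known."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO pages (url, source) VALUES (?, ?)", (url, source)
    )
    conn.commit()
    return cur.rowcount == 1

conn = open_store()
print(add_url(conn, "https://example.com/"))                            # True
print(add_url(conn, "https://example.com/a", "https://example.com/"))   # True
print(add_url(conn, "https://example.com/"))                            # False
```

Pass a file path instead of `:memory:` to keep the list between runs. URLs that differ only in fragments or tracking parameters still count as different here, so normalize them before inserting if that matters.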