
Spiders & Scraping


Thread replies: 10
Thread images: 1

File: oython.png (8KB, 200x200px)
Hey Anon. I recently started learning about Spiders and Scraping with Python. Just wondering if anyone has any good sites on learning to scrape anonymously? I have been trying to read anything I can get my hands on. I feel like this is such a valuable skill and I want to learn to do it as best as I can.
>>
You're not being clear, what do you want to scrape?
>>
Is there a reason to do it anonymously?

If there is, then the answer is to use a proxy.
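
A minimal sketch of routing requests through a proxy, using only the standard library's `urllib.request`. The function name `build_proxy_opener` and the proxy address `http://127.0.0.1:8080` are made up for illustration; nothing in the thread says which proxy or HTTP library to use.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical local proxy address; replace with one you actually run or rent.
opener = build_proxy_opener("http://127.0.0.1:8080")
# opener.open("https://example.com/", timeout=10)  # uncomment to make a real request
```

A proxy only hides your IP from the target site; the proxy operator still sees the traffic.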
>>
>>57272836
m8 just use the requests + htmlparser libraries it's real ez
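
Roughly what that looks like, assuming "htmlparser" means the standard library's `html.parser` module. The names `LinkParser`, `extract_links` and `fetch_links` are made up for this sketch, and `requests` is a third-party package, so it is only imported inside the function that hits the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse an HTML string and return the links found in it."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch_links(url):
    """Download a page with requests and return its links."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_links(response.text, url)


sample = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(sample, "https://example.com/"))
# prints ['https://example.com/about', 'https://example.org/x']
```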
>>
>>57272836
Cefpython is comfy as fuck!
>>
>>57272836
Install the following and do some tutorials.
http://docs.python-requests.org/en/master/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
>>
>>57272836

why anonymous? just google "scrapy" and you'll get plenty, youtube it too

if you're trying to do craigslist, their listings only update every 15 minutes anyway, so set your interval to 15 minutes and you won't get banned.

I have a vps, so I have my vps and my home computer each scraping craigslist every 30 minutes, offset by 15 minutes
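
A sketch of how the offset schedule could be computed: each machine runs on a 30 minute grid, one shifted by 15 minutes, so together they check every 15 minutes. The function `next_run` and the dates are made up for illustration.

```python
import math
from datetime import datetime, timedelta


def next_run(now, interval_min, offset_min=0):
    """Return the first grid time at or after `now`.

    The grid starts at midnight plus offset_min and repeats every interval_min.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - midnight).total_seconds() / 60
    # Whole intervals needed to reach or pass `now` (never negative).
    steps = max(0, math.ceil((elapsed - offset_min) / interval_min))
    return midnight + timedelta(minutes=offset_min + steps * interval_min)


now = datetime(2016, 10, 26, 10, 7)
print(next_run(now, 30, offset_min=0))   # machine A: 2016-10-26 10:30:00
print(next_run(now, 30, offset_min=15))  # machine B: 2016-10-26 10:15:00
```

Each machine would then sleep until its own `next_run` time, scrape once, and repeat.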
>>
>>57273819
this. skip all the stupid requests/scrapy shit and go straight to cefpython/selenium/ghost.py/QtWebEngine because that's the only shit that actually works
>>
>>57272922
As long as that's all you do, it is easy. Once you want something more than that, it becomes freakin hard.

How do you decide when to re-crawl a given site?
How would you handle millions of pages that need to be crawled?
What if you want to be certain that nothing gets crawled twice? (You need to recognize whether you've already crawled a url, out of millions of urls.)
What if you want to limit the load you put on a target site? If you just rush a site with everything you've got, it is basically a DDoS.
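
One possible shape for two of those problems (duplicate URLs and per-site load), as a sketch only. The class `Frontier`, its method names and the 5 second delay are made up here. Re-crawl timing is left out, and at millions of URLs the in-memory set would probably need to become a database or a Bloom filter.

```python
import hashlib
import heapq
from urllib.parse import urldefrag, urlsplit


class Frontier:
    """URL queue that skips duplicates and spaces out requests per host."""

    def __init__(self, per_host_delay=5.0):
        self.per_host_delay = per_host_delay  # seconds between hits on one host
        self.seen = set()        # 16-byte digests of URLs already queued
        self.next_allowed = {}   # host -> earliest time of its next request
        self.heap = []           # entries are (ready_time, sequence, url)
        self.sequence = 0        # tie-breaker so equal times stay in FIFO order

    @staticmethod
    def normalize(url):
        """Drop the #fragment and lowercase scheme and host."""
        url, _fragment = urldefrag(url.strip())
        parts = urlsplit(url)
        return parts._replace(
            scheme=parts.scheme.lower(), netloc=parts.netloc.lower()
        ).geturl()

    def add(self, url, now):
        """Queue a URL; return False if it was already seen."""
        url = self.normalize(url)
        digest = hashlib.blake2b(url.encode("utf-8"), digest_size=16).digest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        host = urlsplit(url).netloc
        # Schedule after the previous request to the same host plus the delay.
        ready = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready + self.per_host_delay
        heapq.heappush(self.heap, (ready, self.sequence, url))
        self.sequence += 1
        return True

    def pop_ready(self, now):
        """Return the next URL whose time has come, or None."""
        if self.heap and self.heap[0][0] <= now:
            return heapq.heappop(self.heap)[2]
        return None
```

Storing fixed-size digests instead of full URLs keeps memory per entry constant, which is the main reason for hashing here.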
>>
>>57272836
Async + aiohttp and some html parsing.
Much faster than doing scraping synchronously

Use a database to store source website and all the links from it.
That way you can also check for duplicates easily
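
A sketch of that idea with `asyncio` and `sqlite3` from the standard library. The table layout and the names `open_store`, `save_link`, `crawl` and `fake_fetch` are made up for illustration. `fetch` is assumed to be any coroutine that takes a URL and returns the links found on that page; a real one would wrap aiohttp and an HTML parser, while the demo uses a fake so it runs offline.

```python
import asyncio
import sqlite3


def open_store(path=":memory:"):
    """Open the link database; the PRIMARY KEY on url enforces uniqueness."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY, source TEXT NOT NULL)"
    )
    return db


def save_link(db, source, url):
    """Insert a link; return True if the URL was new, False if already stored."""
    cur = db.execute(
        "INSERT OR IGNORE INTO links (url, source) VALUES (?, ?)", (url, source)
    )
    return cur.rowcount == 1


async def crawl(urls, fetch, db, max_concurrent=5):
    """Fetch all urls concurrently and store their links; return count of new links."""
    sem = asyncio.Semaphore(max_concurrent)  # cap the number of requests in flight

    async def worker(url):
        async with sem:
            return url, await fetch(url)

    results = await asyncio.gather(*(worker(u) for u in urls))
    new = 0
    for source, links in results:
        for link in links:
            new += save_link(db, source, link)
    db.commit()
    return new


async def fake_fetch(url):
    """Stand-in for a real aiohttp-based fetcher; every page has the same two links."""
    await asyncio.sleep(0)
    return ["https://example.com/a", "https://example.com/b"]


db = open_store()
pages = ["https://example.com/", "https://example.com/2"]
print(asyncio.run(crawl(pages, fake_fetch, db)))  # prints 2: duplicates are ignored
```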
This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.