Let's say I have a simple web-scraping tool that I created as a tutorial project.
I scrape a house reseller site every 5-10 min (random interval), take all new houses added since the last scrape, and add them to my database.
I have a couple of questions, in different areas:
1. What's the best way to take all the new houses (with their parameters such as sq. meters, price, etc.) and store them in a database? How do I store only the new houses, given all the old ones will already be in the database?
2. Should I store every house with a separate query, or should I first build a batch of new houses and save them in a single ACID transaction? If a batch is the solution, what would be the best way to implement it?
3. How do I avoid getting banned? My plan is to use a random delay of 5-10 min and a list of real user agents. Anything else I can do?
4. Any general tips for a total noob?
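For questions 1 and 2, one common pattern is to give each listing a unique key (most reseller sites expose some listing ID, often in the URL) and insert the whole batch inside one transaction, letting the database skip rows it already has. A minimal sketch with SQLite; the `listing_id` field and schema here are assumptions, not the real site's data:

```python
import sqlite3

# Hypothetical schema: listing_id is whatever unique ID the site exposes.
# The PRIMARY KEY constraint is what lets the DB reject duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS houses (
        listing_id TEXT PRIMARY KEY,
        sq_meters  REAL,
        price      REAL
    )
""")

def save_batch(listings):
    # One transaction for the whole batch: either every new row lands
    # or none do. INSERT OR IGNORE skips listings already stored.
    with conn:  # commits on success, rolls back on exception
        conn.executemany(
            "INSERT OR IGNORE INTO houses (listing_id, sq_meters, price) "
            "VALUES (?, ?, ?)",
            [(h["listing_id"], h["sq_meters"], h["price"]) for h in listings],
        )

# Second call re-sends a1 but only b2 is actually inserted.
save_batch([{"listing_id": "a1", "sq_meters": 80, "price": 120000}])
save_batch([{"listing_id": "a1", "sq_meters": 80, "price": 120000},
            {"listing_id": "b2", "sq_meters": 95, "price": 150000}])
print(conn.execute("SELECT COUNT(*) FROM houses").fetchone()[0])  # 2
```

This way you don't have to diff "new vs. old" in Python at all; you can just insert everything you scraped and let the unique key do the dedup. `executemany` inside one transaction is also much faster than committing per row.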
Just a slight bump
General web-scraping discussion is also welcome
Luring people with hot pic and inane post
>>61050746
Looks like she's wearing one of those Smooth Grooves
Bump because I'm curious
By the way, what are the best languages for writing web scrapers? Personally, I'm a big fan of Perl.
>>61052542
Python+BeautifulSoup
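A minimal sketch of that combo, folding in the OP's random-delay and user-agent ideas. The URL, user-agent strings, and CSS selectors here are all made up; you'd swap in the real site's markup:

```python
import random
import time

import requests
from bs4 import BeautifulSoup

# Made-up example values; replace with the real site and real UA strings.
URL = "https://example.com/listings"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleUA/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleUA/1.0",
]

def parse_listings(html):
    # Assumes each listing is a <div class="listing" data-id="..."> with
    # .price and .size children -- check the actual page source.
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.select("div.listing"):
        yield {
            "listing_id": div.get("data-id"),
            "price": div.select_one(".price").get_text(strip=True),
            "sq_meters": div.select_one(".size").get_text(strip=True),
        }

def scrape_once():
    # Pick a random user agent per request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    resp = requests.get(URL, headers=headers, timeout=30)
    resp.raise_for_status()
    return list(parse_listings(resp.text))

def run_forever():
    while True:
        for house in scrape_once():
            print(house)
        time.sleep(random.uniform(5 * 60, 10 * 60))  # random 5-10 min delay
```

Separating `parse_listings` from the HTTP call also makes the parser easy to test against saved HTML without hitting the site.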