
So I'm writing a script that parses sites and writes retrieved data to a CSV file



File: Screenshot (19).png (62KB, 1360x765px)
So I'm writing a script that parses sites and writes the retrieved data to a CSV file.
Here is the code:
from bs4 import BeautifulSoup
import requests
import csv

r = requests.get('http://www.mediadata.it/en/aziende-comunicatori/elenco/{}/')
data = r.text
soup = BeautifulSoup(data, "html.parser")

with open('mbsmediadata.csv', 'w') as csvfile:
    fieldnames = ['nome', 'responsabili', 'email', 'posizione']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # NOTE: zip() here is given a single find_all but four names are unpacked,
    # and writeheader() runs on every iteration, so the header repeats per row
    for i, j, z, y in zip(soup.find_all('h5', attrs={'class': 'ng-binding'})):
        writer.writeheader()
        writer.writerow({'nome': i.text, 'responsabili': j.text, 'email': z.text, 'posizione': y.text})

but the output format is shit tier. I've tried reading a lot of documentation and previous questions, but even though .format() doesn't throw syntax errors, it doesn't actually format anything.
The second issue is that the fieldnames end up written in every row, and Google Sheets only imports those fieldnames.
Do you know how to fix this?

pic related, it's the shitty format output
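For the repeated-fieldnames problem, the usual fix is to call writer.writeheader() once, before the loop, so the header row is written a single time. A minimal sketch with the same fieldnames; the dummy row below just stands in for whatever the scraping loop actually yields:

import csv

# dummy data standing in for the scraped tuples -- not real output
rows = [('Acme', 'Mario Rossi', 'mario@example.com', 'CEO')]

with open('mbsmediadata.csv', 'w', newline='') as csvfile:
    fieldnames = ['nome', 'responsabili', 'email', 'posizione']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()  # called once, before the loop, so the header appears only in the first row
    for nome, resp, email, pos in rows:
        writer.writerow({'nome': nome, 'responsabili': resp, 'email': email, 'posizione': pos})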
>>
>>62279848
this is the retarded way of doing it
Read the page source and find the actual data source. The page is Angular, so there's obviously some sort of REST endpoint providing the data. Find the endpoint, scrape the endpoint.
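Roughly like this — the endpoint URL and the JSON keys below are guesses for illustration only; you'd copy the real XHR URL from the Network tab of the browser dev tools while the page loads:

import csv
import requests

# hypothetical endpoint -- NOT the real one; grab the actual request URL from the Network tab
url = 'http://www.mediadata.it/api/aziende-comunicatori'
records = requests.get(url).json()  # Angular apps usually fetch their data as JSON

with open('mbsmediadata.csv', 'w', newline='') as csvfile:
    fieldnames = ['nome', 'responsabili', 'email', 'posizione']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for rec in records:
        # keep only the declared columns; missing keys become empty cells
        writer.writerow({k: rec.get(k, '') for k in fieldnames})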
>>
>>62279893
I pasted the wrong code, bro:
from bs4 import BeautifulSoup
import requests
import csv

r = requests.get('https://www.paginegialle.it/ricerca/pizzerie/Milano?mr=50')
data = r.text
soup = BeautifulSoup(data, "html.parser")

with open('mbsprprova.csv', 'w') as csvfile:
    fieldnames = ['nome', 'indirizzo', 'telefono']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # zip the three matched node lists so each row pairs a name, an address and a phone
    for i, j, z in zip(soup.find_all('span', attrs={'itemprop': 'name'}),
                       soup.find_all('span', attrs={'class': 'street-address'}),
                       soup.find_all('div', attrs={'class': 'tel elementPhone'})):
        writer.writeheader()  # still inside the loop, so the header repeats per row
        writer.writerow({'nome': i.text, 'indirizzo': j.text, 'telefono': z.text})
>>
I like your chaining solution here, but I'm not sure how you'll fix the address like that.
Pastebin: ZRfd5Kch
>>
here make something of yourself kiddo
from bs4 import BeautifulSoup
import requests

data = requests.get('https://www.paginegialle.it/ricerca/pizzerie/Milano?mr=50')
soup = BeautifulSoup(data.text, "lxml")

businesses = []
# map the CSS class of each address <span> to a nicer field name
mapping = {
    'street-address': 'address',
    'postal-code': 'postcode',
    'locality': 'city',
    'region': 'state'
}

for i, j, z in zip(soup.find_all('span', attrs={'itemprop': 'name'}),
                   soup.find_all('div', attrs={'itemprop': 'address'}),
                   soup.find_all('div', attrs={'class': 'tel elementPhone'})):
    data = {}
    data['name'] = i.text.strip()

    # each address component sits in its own <span>; its first class says which field it is
    for addressfield in j.find_all('span'):
        tomap = str(addressfield.attrs['class'][0])
        data[mapping[tomap]] = addressfield.text.strip()

    # a listing can have several phone numbers separated by commas
    data['telephones'] = [x.strip() for x in z.text.strip().split(',')]
    # print(z.text)
    print(data)
    businesses.append(data)
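And if the goal is still a CSV that Google Sheets imports cleanly, something along these lines should work on top of that businesses list — header written once, phone list flattened into a single cell (the output filename is just an example):

import csv

# assumes `businesses` is the list of dicts built by the loop above
fieldnames = ['name', 'address', 'postcode', 'city', 'state', 'telephones']

with open('pizzerie.csv', 'w', newline='') as csvfile:  # example filename
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()  # single header row
    for b in businesses:
        row = {k: b.get(k, '') for k in fieldnames}
        row['telephones'] = '; '.join(b.get('telephones', []))  # flatten the phone list into one cell
        writer.writerow(row)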