
Scraping

Thread replies: 32
Thread images: 7

File: Capture3.png (96KB, 993x693px)
Perhaps someone here is more familiar with this. I constantly frequent websites from 2 different banks to see what prices they quote. Now, instead of going back and forth the whole day I am writing my own Java program to display the current values for me which I plan to get through scraping the websites of both banks.

I have written a simple class (using JSoup) that extracts an element by its CSS selector (pic related). In Chrome, I go to their website and right-click > [Inspect] to find the element's CSS selector. This approach works fine for Deutsche Bank (see below) but not for Commerzbank (see below). What am I missing here?

Would really appreciate some help. I have a feeling Commerzbank works with some "lightstreamer" application which is not easily scrapable, as I can't extract anything from their site. It basically gives me nothing instead of any value or string.
>>
File: Capture2.png (371KB, 1679x1049px)
Works for Deutsche Bank (pic related).
>>
>>61270513
Stop cheaping out and get a data provider. Scraping is for beggars.
>>
File: Capture1.png (414KB, 1679x1049px)
Does not work for Commerzbank (pic related).
>>
And let me tell you that sites like Yahoo Finance have started spoofing live prices and historical prices to combat scrapers, so your scraping days won't last.
>>
>>61270524
I work at a bank and I have access to Reuters Eikon and Bloomberg market data services there. This is however for a side project at home, which is not as far progressed yet. If it does, I would be willing to pay money. But this would be the interim solution for now.
>>
set your user-agent
>>
>>61270528
Set a useragent. I am sure this will help.
Some sites do not output anything if you are a bot or they see that you have no useragent.
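
For reference, a minimal sketch of what "setting a user agent" looks like in plain Java (11 or newer, using the built-in java.net.http client). The class name, method names and the browser string are made up for illustration. With JSoup the equivalent would be a userAgent(...) call on the connection before get().

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class UserAgentFetch {
    // Assumed browser-like string; any realistic value can be substituted.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36";

    // Builds a GET request that carries the User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as text.
    static String fetch(String url) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling UserAgentFetch.fetch(url) returns whatever HTML the server sends to that user agent. If the value is still missing from that HTML, the user agent was probably not the problem.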
>>
File: Capture4.png (95KB, 1086x700px)
>>61270565
Thanks for the tip. I will read up on that. You mean something like in the picture? Still does not seem to work.
>>
>>61270626
Dump the HTML and compare with the web site
>>
I'd guess the second bank retrieves the data with some javascript executed on load rather than the data being returned inside the HTML
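
A rough way to test that guess, sketched in Java: take the raw HTML the scraper downloaded (for JSoup, the document's outerHtml()) and check whether a value you can see in the browser is in there at all. The class and method names are hypothetical, and the check is deliberately naive (a site may format the number differently in markup).

```java
final class RawHtmlCheck {
    // True when the text seen in the browser is already in the served HTML.
    static boolean servedInHtml(String rawHtml, String visibleValue) {
        return rawHtml != null && rawHtml.contains(visibleValue);
    }

    // Turns the check into a short hint about which scraping approach fits.
    static String diagnose(String rawHtml, String visibleValue) {
        return servedInHtml(rawHtml, visibleValue)
                ? "static: a CSS selector on the downloaded HTML should work"
                : "dynamic: the value is filled in later by script";
    }
}
```

If the result is "dynamic", no selector will help on the downloaded HTML, because the element is empty until the page's scripts run.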
>>
File: Capture5.png (75KB, 659x701px)
>>61270893
>>61270970

I guess one has to send some form of request to the "lightstreamer" to receive any values. Just looking at the HTML does not yield much. (pic related).
>>
>>61270626
4chan also blocks URL requests with no UserAgent, so you can confirm if you've set it up correctly by trying to scrape this thread, before trying the bank website again.
>>
>>61270991
They're using websockets (fucking why) to get the data.
So you're shit out of luck. You could try Headless Chrome to actually load the page and then use JS to pull out the data.
>>
File: Capture6.png (99KB, 1285x861px)
>>61270995
4chan seems to work with or without setting a user agent. I tried both
>>
>>61271048
Just mentioned it because when I tried scraping inside a Python script it'd fail when using urllib2, so whatever Java lib you're using is probably already sending a known-good UserAgent. But some sites are pickier about the UserAgent string than others, and require stuff like info about character encodings and charsets.
>>
File: Capture7.png (46KB, 884x593px)
>>61271041
Hmm, that sounds like a lot of work unfortunately, and I wouldn't be surprised if it still doesn't work then. Seems like Commerzbank has guarded their website against automated access... I managed to get it done before with iMacros, a program loosely based on Visual Basic (pic related), but that required the browser to stay open, and the program would essentially go through all the fields you would like to extract and copy them (simulating roughly what a user would do with mouse and CTRL+C, I guess). It was also slow.
>>
>>61271199
>https://developers.google.com/web/updates/2017/04/headless-chrome
I realized that it won't support Windows until the next version of Chrome. It's a full browser, it just doesn't render to screen.

Maybe that page will help, but I haven't used it personally.
>>
>>61270513
I wrote quite a few scrapers and as a general suggestion I would recommend you use Groovy instead of Java for this kind of project. You will save literally days.
>>
>>61271240
>>61271041
Why not an actual websockets library?
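
For what it's worth, Java 11 and newer ship a WebSocket client in java.net.http, so no extra library is strictly needed. A minimal sketch, with assumptions: the class name is made up, the endpoint URL has to be read out of the browser's network tab, and a server like Lightstreamer will most likely expect its own handshake or subscription messages before it sends any prices. Those messages are not shown here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

final class QuoteSocketListener implements WebSocket.Listener {
    private final StringBuilder buffer = new StringBuilder();
    private final Consumer<String> onMessage;

    QuoteSocketListener(Consumer<String> onMessage) {
        this.onMessage = onMessage;
    }

    // Appends one text frame; returns the whole message once the last
    // frame has arrived, otherwise null.
    String accept(CharSequence data, boolean last) {
        buffer.append(data);
        if (!last) {
            return null;
        }
        String message = buffer.toString();
        buffer.setLength(0);
        return message;
    }

    @Override
    public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
        String message = accept(data, last);
        if (message != null) {
            onMessage.accept(message);
        }
        ws.request(1); // ask for the next frame
        return null;   // the frame data can be reclaimed immediately
    }

    // Opens the connection; url is whatever ws:// or wss:// endpoint the page uses.
    static WebSocket connect(String url, Consumer<String> onMessage) {
        return HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create(url), new QuoteSocketListener(onMessage))
                .join();
    }
}
```

Usage would be something like QuoteSocketListener.connect(endpoint, System.out::println), followed by sending whatever subscription message the server protocol requires via sendText.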
>>
>>61271687
Gonna further this and suggest using the Geb library within Groovy. Makes creating browser bots really quick.
>>
Use a browser proxy to find the requests the site makes?
>>
Why are you using Java when Python is literally made for this shit? BeautifulSoup and Mechanize.

Or, selenium.
>>
>>61274529
Are you sure this would work? I tried Python and BeautifulSoup before and it did not work (though I did not use Selenium or Mechanize).
>>
>>61270513
If the bank requires javascript enabled for whatever reason, try using a headless browser. Or a ws library
>>
>>61271199
check out selenium for java, used it a couple of times and it's surprisingly easy
>>
>>61276697
>>61276571
Thanks, will take a look at both approaches.
>>
Take a look at the page's source code (not via Inspect Element) and check if the data is there. If it is not, the content is probably generated with JavaScript after the page loads. Then you need something like Selenium.
>>
>>61276571
This
>>
>>61276697
Not sure if you are dead set on using Java, but I use a JS approach with CasperJS. It loads all the scripts properly and can fire click events and whatnot.

http://casperjs.org/
>>
>>61276697
Use Groovy/Geb for Selenium drivers. Raw Selenium is an unnecessary headache.
>>
>>61270513
Your Java is severely triggering me


This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.