Scraping

Thread replies: 32
Thread images: 7

Anonymous 2017-07-08 08:36:20 Post No.61270513

File: Capture3.png (96KB, 993x693px)

Perhaps someone here is more familiar with this. I constantly frequent websites from 2 different banks to see what prices they quote. Now, instead of going back and forth the whole day I am writing my own Java program to display the current values for me which I plan to get through scraping the websites of both banks.

I have written a simple class (using JSoup) that extracts an element by its CSS selector (pic related). In Chrome, I go on their website and right-click > [Inspect] to find the CSS node. This approach works fine for Deutsche Bank (see below) but not for Commerzbank (see below). What am I missing here?

I would really appreciate some help. I have a feeling Commerzbank uses some "lightstreamer" application that is not easily scrapable, as I can't extract anything from their site. I basically get nothing back instead of a value or string.

Anonymous 2017-07-08 08:37:04 Post No.61270517

File: Capture2.png (371KB, 1679x1049px)

Works for Deutsche Bank (pic related).

Anonymous 2017-07-08 08:37:51 Post No.61270524

>>61270513
Stop cheapening out and get a data provider. Scraping is for beggars

Anonymous 2017-07-08 08:38:08 Post No.61270528

File: Capture1.png (414KB, 1679x1049px)

Does not work for Commerzbank (pic related).

Anonymous 2017-07-08 08:38:55 Post No.61270537

And let me tell you that sites like Yahoo Finance have started spoofing live and historical prices to combat scrapers, so your scraping days won't last.

Anonymous 2017-07-08 08:39:42 Post No.61270545

>>61270524
I work at a bank and have access to Reuters Eikon and Bloomberg market data services there. This, however, is for a side project at home that has not progressed very far yet. If it does, I would be willing to pay for data. Scraping would be the interim solution for now.

Anonymous 2017-07-08 08:40:08 Post No.61270549

set your user-agent

Anonymous 2017-07-08 08:41:42 Post No.61270565

>>61270528
Set a useragent. I am sure this will help.
Some sites do not output anything if you are a bot or they see that you have no useragent.
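
For reference, a minimal sketch of what setting a user agent can look like in Java. It uses the standard `java.net.http` client (Java 11 or newer) rather than the JSoup class from the screenshots, so that it stands alone; with JSoup the equivalent is `Jsoup.connect(url).userAgent(...)`. The class name, the user-agent string, and the way the URL is passed in are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UserAgentFetch {
    // Example browser-like value (an assumption); copy the one your own browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36";

    // Builds a GET request that carries an explicit User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as text.
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java UserAgentFetch <url>");
            return;
        }
        // Prints the HTML exactly as the server returned it.
        System.out.println(fetch(args[0]));
    }
}
```

If the output is the same with and without the header, the user agent is probably not what is blocking the scraper.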

Anonymous 2017-07-08 08:49:35 Post No.61270626

File: Capture4.png (95KB, 1086x700px)

>>61270565
Thanks for the tip. I will read up on that. You mean something like in the picture? Still does not seem to work.

Anonymous 2017-07-08 09:25:53 Post No.61270893

>>61270626
Dump the HTML and compare with the web site

Anonymous 2017-07-08 09:34:11 Post No.61270970

I'd guess the second bank retrieves the data with some javascript executed on load rather than the data being returned inside the HTML
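
One way to test that guess is to check whether the number shown in the browser occurs anywhere in the HTML the server actually sends. The sketch below is only an illustration: the markup, the element id `bid`, the script name, and the value `101.25` are made up and not taken from either bank's site.

```java
public class StaticHtmlCheck {
    // Returns true if the text visible in the browser also occurs in the raw HTML.
    static boolean appearsInStaticHtml(String rawHtml, String visibleText) {
        return rawHtml != null
                && visibleText != null
                && !visibleText.isEmpty()
                && rawHtml.contains(visibleText);
    }

    public static void main(String[] args) {
        // Hypothetical page where the server renders the price into the HTML.
        String serverRendered = "<span id=\"bid\">101.25</span>";
        // Hypothetical page where a script fills the empty element after load.
        String scriptFilled = "<span id=\"bid\"></span><script src=\"quotes.js\"></script>";

        System.out.println(appearsInStaticHtml(serverRendered, "101.25")); // prints true
        System.out.println(appearsInStaticHtml(scriptFilled, "101.25"));   // prints false
    }
}
```

If the check fails on the downloaded HTML, a CSS selector can only ever find the empty placeholder, no matter which headers are sent.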

Anonymous 2017-07-08 09:37:34 Post No.61270991

File: Capture5.png (75KB, 659x701px)

>>61270893
>>61270970

I guess one has to send some form of request to the "lightstreamer" to receive any values. Just looking at the HTML does not yield much. (pic related).

Anonymous 2017-07-08 09:38:29 Post No.61270995

>>61270626
4chan also blocks URL requests with no UserAgent, so you can confirm if you've set it up correctly by trying to scrape this thread, before trying the bank website again.

Anonymous 2017-07-08 09:44:29 Post No.61271041

>>61270991
They're using websockets (fucking why) to get the data.
So you're shit out of luck. You could try Headless Chrome to actually load the page and then use JS to pull out the data.

Anonymous 2017-07-08 09:45:06 Post No.61271048

File: Capture6.png (99KB, 1285x861px)

>>61270995
4chan seems to work with or without setting a user agent. I tried both

Anonymous 2017-07-08 09:51:22 Post No.61271099

>>61271048
Just mentioned it because when I tried scraping from a Python script it would fail when using urllib2, so whatever Java lib you're using is probably already sending a known-good UserAgent. But some sites are pickier than others about the UserAgent string, and require stuff like info about character encodings and charsets.

Anonymous 2017-07-08 10:05:47 Post No.61271199

File: Capture7.png (46KB, 884x593px)

>>61271041
Hmm, that sounds like a lot of work unfortunately, and I wouldn't be surprised if it still didn't work. Seems like Commerzbank has guarded their website against automated access... I managed to get it done before with iMacros, a program loosely based on Visual Basic (pic related), but that required the browser to stay open, and the program would essentially go through all the fields you want to extract and copy them (roughly simulating what a user would do with the mouse and CTRL+C, I guess). It was also slow.

Anonymous 2017-07-08 10:10:27 Post No.61271240

>>61271199
>https://developers.google.com/web/updates/2017/04/headless-chrome
I realized that it won't support Windows until the next version of Chrome. It's a full browser, it just doesn't render to screen.

Maybe that page will help, but I haven't used it personally.

Anonymous 2017-07-08 11:06:19 Post No.61271687

>>61270513
I wrote quite a few scrapers and as a general suggestion I would recommend you use Groovy instead of Java for this kind of project. You will save literally days.

Anonymous 2017-07-08 11:38:14 Post No.61271926

>>61271240
>>61271041
Why not an actual websockets library?
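
A rough sketch of that idea, using the WebSocket client built into Java 11's `java.net.http` package. The endpoint URL has to be found first (for example in the browser's network tab) and is passed in as an argument here; the class names and the ten-second listening window are assumptions. The sketch only opens the socket and prints whatever text arrives. It does not speak the "lightstreamer" protocol, so any subscription messages the server expects before it sends quotes would still have to be worked out.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

public class QuoteSocket {
    // Collects text fragments and hands each complete message to a callback.
    static class MessageListener implements WebSocket.Listener {
        private final StringBuilder buffer = new StringBuilder();
        private final Consumer<String> onMessage;

        MessageListener(Consumer<String> onMessage) {
            this.onMessage = onMessage;
        }

        // Appends one fragment; fires the callback once the message is complete.
        void accept(CharSequence data, boolean last) {
            buffer.append(data);
            if (last) {
                onMessage.accept(buffer.toString());
                buffer.setLength(0);
            }
        }

        @Override
        public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
            accept(data, last);
            webSocket.request(1); // ask the client for the next frame
            return null;          // null means this fragment is fully processed
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java QuoteSocket <wss-url>");
            return;
        }
        WebSocket socket = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create(args[0]), new MessageListener(System.out::println))
                .join();
        Thread.sleep(10_000); // listen for ten seconds, then close cleanly
        socket.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```

Compared with a headless browser this avoids loading the whole page, at the cost of having to reproduce whatever handshake the site's own script performs.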

Anonymous 2017-07-08 03:26:04 Post No.61274133

>>61271687
Gonna further this and suggest using the Geb library within Groovy. Makes creating browser bots really quick.

Anonymous 2017-07-08 03:56:22 Post No.61274510

Use a browser proxy to find the requests the site makes?

Anonymous 2017-07-08 03:57:31 Post No.61274529

Why are you using Java when Python is literally made for this shit? BeautifulSoup and Mechanize.

Or, selenium.

Anonymous 2017-07-08 06:56:21 Post No.61276536

>>61274529
Are you sure this would work? I tried Python and BeautifulSoup before and it did not work (though I did not use Selenium or Mechanize).

Anonymous 2017-07-08 06:58:52 Post No.61276571

>>61270513
If the bank requires javascript enabled for whatever reason, try using a headless browser. Or a ws library

Anonymous 2017-07-08 07:11:08 Post No.61276697

>>61271199
check out selenium for java, used it a couple of times and it's surprisingly easy

Anonymous 2017-07-08 07:21:13 Post No.61276820

>>61276697
>>61276571
thanks will take a look into both ways.

Anonymous 2017-07-08 07:23:51 Post No.61276854

Take a look at the page's source code (View Source, not Inspect Element) and check whether the data is there. If it is not, the page is probably being filled in by JavaScript, and you need something like Selenium.

Anonymous 2017-07-08 07:24:47 Post No.61276868

>>61276571
This

Anonymous 2017-07-08 07:26:09 Post No.61276886

>>61276697
Not sure if you are dead set on using Java but I use a JS approach with CasperJS, it loads all the scripts properly and can fire click events and whatnot

http://casperjs.org/

Anonymous 2017-07-08 08:01:52 Post No.61277255

>>61276697
Use Groovy/Geb for Selenium drivers. Raw Selenium is an unnecessary headache.

Anonymous 2017-07-08 10:31:53 Post No.61279265

>>61270513
Your Java is severely triggering me
