How would you read the text from a webpage with Java? Without 3rd party libraries

Thread replies: 22
Thread images: 1

Anonymous
2016-01-27 12:59:42 Post No. 52649295

How would you read the text from a webpage with Java? Without 3rd party libraries

Stack Overflow suggested using the Html.fromHtml method, but it doesn't seem to be part of Java 8.

Anonymous 2016-01-27 13:21:57 Post No.52649453

>>52649295
Just use Html.fromHtml

Anonymous 2016-01-27 13:25:32 Post No.52649482

>>52649295
streamreader

Anonymous 2016-01-27 13:28:24 Post No.52649500

>>52649482
That's how I download the entire page, but I need something to distinguish the HTML shit from the actual text
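
For reference, the download step described here might look roughly like the sketch below, using only java.net and java.io. The class and method names are made up for illustration, and UTF-8 is an assumption; a real page may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the thread): fetches a page as raw HTML.
final class PageDownloader {

    // Reads the whole stream as UTF-8 text. The charset is an assumption.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Opens the URL and returns the response body, tags and all.
    static String download(String address) {
        try (InputStream in = new URL(address).openStream()) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PageDownloader.download with a page address returns the markup as one string, which is exactly the problem: the tags still have to be separated from the text afterwards.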

Anonymous 2016-01-27 13:30:21 Post No.52649515

You need to be more specific about what "read the text" means.

Anonymous 2016-01-27 13:30:27 Post No.52649517

>>52649295
>using Java to scrape web pages
aaaand another fine example of using the wrong tech

Anonymous 2016-01-27 13:31:11 Post No.52649527

Regexes

Anonymous 2016-01-27 13:31:57 Post No.52649534

>>52649517
Not OP, but what would you suggest?

Anonymous 2016-01-27 13:33:14 Post No.52649549

>>52649500
Parse the header; that will tell you how big it is. Cut it off, and the rest is HTML. If you want to parse the HTML from scratch, you can gf.
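
A rough sketch of that header/body split, assuming the raw response was read from a plain socket into a string. With URL.openStream or HttpURLConnection the headers are already stripped, so this step is not needed there. The class name is hypothetical. Also note that Content-Length counts bytes rather than characters, and is absent when the server uses chunked transfer encoding.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: splits a raw HTTP/1.x response into headers and body.
final class RawHttpResponse {
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    RawHttpResponse(String raw) {
        // The header block ends at the first empty line (CRLF CRLF).
        int split = raw.indexOf("\r\n\r\n");
        if (split < 0) {
            throw new IllegalArgumentException("no header/body separator found");
        }
        String[] lines = raw.substring(0, split).split("\r\n");
        // lines[0] is the status line, e.g. "HTTP/1.1 200 OK"; skip it.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                // Header names are case-insensitive, so store them lowercased.
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }
        // Everything after the separator is the body, i.e. the HTML.
        body = raw.substring(split + 4);
    }
}
```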

Anonymous 2016-01-27 13:33:15 Post No.52649550

>>52649534
Python + beautiful soup

Anonymous 2016-01-27 13:33:55 Post No.52649557

>>52649534
That would probably be 3 or 4 poor lines of code in PHP without any lib. Python comes to mind too.

Anonymous 2016-01-27 13:34:14 Post No.52649562

>>52649527
:^)

Anonymous 2016-01-27 13:37:59 Post No.52649600

>>52649534
Python + default HTMLParser

Anonymous 2016-01-27 13:38:16 Post No.52649601

>>52649534
Python

Anonymous 2016-01-27 13:39:11 Post No.52649614

>>52649295
Use Python, Scala, Ruby or Perl.

Those are some languages with not-so-annoying parsing.

Java has decent parsing power, but using it is so fucking verbose and annoying...

Anonymous 2016-01-27 13:43:32 Post No.52649644

>>52649614
OP here: I have to use java for this shitty uni class

Anonymous 2016-01-27 13:48:48 Post No.52649702

>>52649644
Uni class... so no Jsoup or HtmlCleaner or Jericho either? (Three decent HTML parser libs for Java.)

Well, I guess you'll just have to deal with the verbosity. It's not hard. Just annoying. Too much work for real life projects, eh.

Anonymous 2016-01-27 13:51:46 Post No.52649731

>>52649527
How do you read something like this?
>replaceAll("\\<[^>]*>","")

Replace all \\ and everything between < >?

>>52649702
Nope, no 3rd party libraries. Sudoku inbound
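
On reading that regex: in Java source, "\\<" is the two-character regex \<, which matches a literal '<' (the backslash is not actually needed). Then [^>]* matches any run of characters that are not '>', and the final > closes the tag. So the call deletes every tag and keeps the text between them. A minimal sketch, with a made-up class name:

```java
// Hypothetical demo class; the regex is the one quoted in the post above.
final class TagStripper {

    static String stripTags(String html) {
        // \\<     a literal '<' (escaped in the Java string, escape is optional)
        // [^>]*   zero or more characters that are not '>'
        // >       the closing '>'
        return html.replaceAll("\\<[^>]*>", "");
    }
}
```

For example, TagStripper.stripTags("<p>Hello <b>world</b>!</p>") returns "Hello world!". This is only safe on tidy markup: script and style contents, comments containing '>', and entities such as &amp; all pass through or break it.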

Anonymous 2016-01-27 13:57:54 Post No.52649808

>>52649731
> Nope, no 3rd party libraries. Sudoku inbound
It's not *that* bad, DESU. Just verbose.

Feel free to do the same exercise with one of the languages I suggested; it should be easier when it's some ugly ass real life HTML. If it's only an exercise HTML or one of the few neat web pages, there's nothing to be afraid of...

Anonymous 2016-01-27 14:30:40 Post No.52650150

>>52649808
>Feel free to do the same exercise with one of the languages I suggested
This is not bad advice, desu. But if you're strapped for time, don't bother obviously.

I suggest using regexes. Just Google regex helper or something like that (I think I use regex101) and copy and paste the file you're supposed to be parsing into there. Fiddle with the regex until everything you're looking for is selected, then boom you're done. All you've got to do is look through the Java documentation for how to apply that bad boy and you're good to go.

Regexes are bullshit to learn, but they're too useful not to use in the long run. Might as well get started now.
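
Applying a regex worked out in a tester usually comes down to java.util.regex.Pattern and Matcher. The sketch below assumes the goal is the contents of each paragraph element; the pattern, class name, and method name are illustrative rather than anything from the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the inner text of every <p>...</p> element.
final class ParagraphFinder {

    // <p\b[^>]*>  opening tag, with or without attributes (\b keeps <pre> out)
    // (.*?)       lazily captured contents; DOTALL lets '.' span line breaks
    // </p>        closing tag; CASE_INSENSITIVE also accepts <P>...</P>
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> paragraphs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        // find() advances to the next match; group(1) is the captured contents.
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }
}
```

Nested tags inside a paragraph are returned as-is, so a second pass that strips tags may still be needed.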

Anonymous 2016-01-27 14:32:31 Post No.52650175

>>52650150
>using regex to parse xml
kill yourself

Anonymous 2016-01-27 14:38:08 Post No.52650240

>>52650175
How do you suggest he do it in Java with no external libs then? I don't know enough about Java and its standard library to think of any less insane solution right now.
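
One option that stays inside the JDK is the HTML parser that ships with Swing (javax.swing.text.html). It is built around an old HTML DTD (3.2, as far as I know), so modern pages may not parse cleanly, but it does hand over text and tags separately. A minimal sketch, with a made-up class name:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls the text out of HTML with the JDK's Swing parser.
final class SwingTextExtractor {

    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                text.append(data).append(' ');
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse the separators added above into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Joining the runs with a space is a simplification: a word split by an inline tag comes out as two words, and punctuation that directly follows a closing tag gains a space in front of it.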
