
How would you read the text from a webpage with Java? Without 3rd party libraries


Thread replies: 22
Thread images: 1

File: cs.jpg (669KB, 1280x1240px)
How would you read the text from a webpage with Java? Without 3rd party libraries


Stack Overflow suggested using the Html.fromHtml method, but it doesn't seem to be part of Java 8.
>>
>>52649295
Just use Html.fromHtml
>>
>>52649295
streamreader
>>
>>52649482
That's how I download the entire page, but I need something to distinguish the HTML shit from the actual text
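
For reference, the "download the entire page" step with a stream reader could look roughly like the sketch below. The class name PageDownloader, the UTF-8 assumption and the example.com address are placeholders, not anything OP posted. It only fetches the raw markup, so the tags are still in there afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class PageDownloader {

    // Reads the whole stream into one string, line by line.
    // Assumption: the page is UTF-8; a real page may declare another charset.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Opens the address and returns the raw HTML, tags included.
    static String download(String address) throws IOException {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address, swap in the page you actually need.
        System.out.println(download("http://example.com/"));
    }
}
```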
>>
you need to be more specific about "read the text"
>>
>>52649295
>using java to scrape web pages
aaaand another fine example of using the wrong tech
>>
Regexes
>>
>>52649517
not op, but what would you suggest?
>>
>>52649500
Parse the header, which will tell you how big the response is, and cut it off. The rest is HTML. If you want to parse the HTML from scratch you can gf.
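
A rough sketch of that "cut off the header" idea, assuming you already hold a complete raw HTTP response as a string (for example from a plain socket). RawHttpResponse is a made-up class name. Note that Content-Length counts bytes, not characters, and that URL.openStream() or URLConnection already hand you only the body, so this matters only if you speak HTTP yourself.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class RawHttpResponse {
    // Header names are stored lower-cased, since HTTP header names are case-insensitive.
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    RawHttpResponse(String raw) {
        // The header block ends at the first empty line (CRLF CRLF).
        int split = raw.indexOf("\r\n\r\n");
        if (split < 0) {
            throw new IllegalArgumentException("no blank line after the headers");
        }
        String[] lines = raw.substring(0, split).split("\r\n");
        // lines[0] is the status line; the remaining lines are "Name: value".
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(
                        lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT),
                        lines[i].substring(colon + 1).trim());
            }
        }
        // Everything after the blank line is the HTML.
        body = raw.substring(split + 4);
    }

    // Returns the declared body size in bytes, or -1 if the header is missing.
    int contentLength() {
        String value = headers.get("content-length");
        return value == null ? -1 : Integer.parseInt(value);
    }
}
```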
>>
>>52649534
Python + beautiful soup
>>
>>52649534
That would probably be 3 or 4 poor lines of code in PHP without any lib. Python comes to mind too.
>>
>>52649527
:^)
>>
>>52649534
Python + default HTMLParser
>>
>>52649534
Python
>>
>>52649295
Use Python, Scala, Ruby or Perl.

These are some languages where parsing is not so annoying.

Java has decent parsing power, but using it is so fucking verbose and annoying...
>>
>>52649614
OP here: I have to use java for this shitty uni class
>>
>>52649644
Uni class... so no Jsoup or HtmlCleaner or Jericho either? (Three decent HTML parser libs for Java.)

Well, I guess you'll just have to deal with the verbosity. It's not hard. Just annoying. Too much work for real life projects, eh.
>>
>>52649527
How do you read something like this?
>replaceAll("\\<[^>]*>","")

Replace all \\ and everything between < >?


>>52649702
Nope, no 3rd party libraries. Sudoku inbound
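
On reading that replaceAll: in Java source the string "\\<" is the two-character regex \<, which just means a literal <. Then [^>]* is "any run of characters that are not >", followed by the closing >. So it replaces every <...> tag with nothing, and there is no actual backslash being matched. A small sketch (TagStripper is a made-up name):

```java
final class TagStripper {

    // "\\<"   -> regex \<  : a literal '<' (the backslash is harmless here)
    // "[^>]*" -> zero or more characters that are not '>'
    // ">"     -> the closing '>'
    static String stripTags(String html) {
        return html.replaceAll("\\<[^>]*>", "");
    }

    public static void main(String[] args) {
        // prints: Hello world!
        System.out.println(stripTags("<p>Hello <b>world</b>!</p>"));
    }
}
```

This is good enough for tidy exercise HTML at best. It leaves entities like &amp; undecoded, keeps whatever sits inside script and style blocks, and gets confused by a > inside an attribute value.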
>>
>>52649731
> Nope, no 3rd party libraries. Sudoku inbound
It's not *that* bad, DESU. Just verbose.

Feel free to do the same exercise with one of the languages I suggested, it should be easier when it's some ugly ass real life HTML. If it's only an exercise HTML or one of the few neat web pages, there's nothing to be afraid of...
>>
>>52649808
>Feel free to do the same exercise with one of the languages I suggested
This is not bad advice, desu. But if you're strapped for time, don't bother obviously.

I suggest using regexes. Just Google regex helper or something like that (I think I use regex101) and copy and paste the file you're supposed to be parsing into there. Fiddle with the regex until everything you're looking for is selected, then boom you're done. All you've got to do is look through the Java documentation for how to apply that bad boy and you're good to go.

Regexes are bullshit to learn, but they're too useful not to use in the long run. Might as well get started now.
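
Applying a regex you worked out in a tester usually comes down to Pattern and Matcher from java.util.regex. A minimal sketch, assuming the thing you want is the contents of every p element; the pattern and the ParagraphFinder name are illustrative guesses, not something from the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphFinder {

    // <p\b[^>]*>  opening tag with optional attributes (\b keeps <pre> from matching)
    // (.*?)       the paragraph contents, matched lazily
    // </p>        closing tag
    // DOTALL lets '.' cross line breaks; CASE_INSENSITIVE accepts <P> as well.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed contents of every paragraph, in document order.
    static List<String> paragraphs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }
}
```

Nested tags inside a paragraph come back as-is, so you would still have to strip them afterwards.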
>>
>>52650150
>using regex to parse xml
kill yourself
>>
>>52650175
How do you suggest he do it in Java with no external libs then? I don't know enough about Java and its standard library to think of any less insane solution right now.
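
One option that stays inside the JDK: Swing ships an HTML parser (javax.swing.text.html.parser.ParserDelegator) that calls you back for each run of text it finds. It targets old HTML (3.2), so treat the sketch below as something that may cope with simple pages rather than a proper parser; SwingTextExtractor is a made-up name and the sample markup is invented.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingTextExtractor {

    // Feeds the HTML through the JDK's built-in parser and collects
    // every text chunk it reports, separated by single spaces.
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        // true = ignore any charset declared in the document itself
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

To run it on a real page, wrap the page's input stream in an InputStreamReader and pass that in instead of the StringReader.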


This is a 4chan archive - all of the content originated from that site.