Thread replies: 7
Thread images: 1

Hey guys,

I'm parsing files (wiki pages, basically) using the re package in Python (might not matter) and I'm having a problem with the following. The structure looks like this:

=== Header1 ===
[[link1]], [[link2]],
[[link3]]
=== Header2 ===
[[link3]]
=== Header3 ===
[[link1]], [[link2]]

This is read in as a STRING, i.e.
STRING = "=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n=== Header2 ===\n[[link3]]\n=== Header3 ===\n[[link1]], [[link2]]"

Here the names are dummies. The names and the number of headers and links are basically random.

I'd like to get the data, e.g. in a dict.

d = {"Header1": ["link1", "link2", "link3"], "Header2": ["link3"], "Header3": ["link1", "link2"]}
>>
Here is the reference for the re package that I'm using to try to assemble some code:

https://www.tutorialspoint.com/python/python_reg_expressions.htm
>>
>>58424118
>(might not matter)
It does, though.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Seriously, stop using re and start using an HTML parser.
>>
>>58424458
mhm, well I'd rather stay inside the framework I use for the pre- and post-processing of the data, though.
My main question is how to handle the possible repetition of the [[...]] links.
>>
>>58424535
You mean you want to capture repeated groups?
>>
>>58424673
yes

(among other things - for example tell it to go from "=== Header1 ===" to the next "===" or end of file, and not further)
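
For what it's worth, a rough sketch of how that could look with re alone, assuming every header sits on its own line as "=== name ===" and every link is a plain [[name]] with no nesting. The names SECTION, LINK and parse_sections are made up for illustration. The lookahead stops each section at the next line starting with "===" or at the end of the string, and the links are pulled out of each section in a second pass.

```python
import re

STRING = ("=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n"
          "=== Header2 ===\n[[link3]]\n"
          "=== Header3 ===\n[[link1]], [[link2]]")

# One section = a "=== name ===" line plus everything up to (but not
# including) the next line that starts with "===", or the end of the string.
SECTION = re.compile(
    r"^=== *(?P<header>[^\n]+?) *===[ \t]*\n(?P<body>.*?)(?=^===|\Z)",
    re.MULTILINE | re.DOTALL,
)

# A link is whatever sits between "[[" and the nearest "]]".
LINK = re.compile(r"\[\[(.+?)\]\]")


def parse_sections(text):
    """Map each header name to the list of link names found below it."""
    return {
        m.group("header"): LINK.findall(m.group("body"))
        for m in SECTION.finditer(text)
    }


d = parse_sections(STRING)
print(d)
```

Note that a plain dict keeps only the last section if two headers share the same name, so a list of (header, links) pairs may be the safer container if that can happen.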
>>
>>58424701
It's possible you can't with re: http://stackoverflow.com/questions/9764930/capturing-repeating-subpatterns-in-python-regex
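
A small illustration of the limitation described there, using a made-up one-line input: when a capture group sits inside a repetition, re keeps only the text from the last iteration. Calling re.findall with the inner pattern is the usual workaround, since it returns one entry per match.

```python
import re

line = "[[link1]], [[link2]], [[link3]]"

# The capture group is repeated by the outer "+", so only the text
# captured on its final iteration is kept in the match object.
m = re.match(r"(?:\[\[(\w+)\]\](?:, *)?)+", line)
last = m.group(1)
print(last)   # link3

# findall scans for every non-overlapping match of the inner pattern
# and returns the captured group of each one.
links = re.findall(r"\[\[(\w+)\]\]", line)
print(links)  # ['link1', 'link2', 'link3']
```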

Well, it's a bad idea anyway. Better get the HTML and parse it properly.