Hey guys,
I'm parsing files (wiki pages, basically) using the re package in Python (might not matter) and I'm having a problem with the following. The structure looks like this:
=== Header1 ===
[[link1]], [[link2]],
[[link3]]
=== Header2 ===
[[link3]]
=== Header3 ===
[[link1]], [[link2]]
This is read in as a STRING, i.e.
STRING = "=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n=== Header2 ===\n[[link3]]\n=== Header3 ===\n[[link1]], [[link2]]"
Here the names are dummies. The names and the number of headers and links are basically arbitrary.
I'd like to get the data, e.g. in a dict.
d = {"Header1": ["link1", "link2", "link3"], "Header2": ["link3"], "Header3": ["link1", "link2"]}
Here is the reference for the re package that I'm using to try to assemble some code:
https://www.tutorialspoint.com/python/python_reg_expressions.htm
>>58424118
>(might not matter)
It does, though.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Seriously, stop using re and start using an HTML parser.
>>58424458
mhm, well I'd rather stay inside the framework I use for the pre- and post-processing of the data, though.
My main question is how to take into account the possible repetition of [[]]
>>58424535
You mean you want to capture repeated groups?
>>58424673
yes
(among other things - for example tell it to go from "=== Header1 ===" to the next "===" or end of file, and not further)
>>58424701
It's possible you can't with re: http://stackoverflow.com/questions/9764930/capturing-repeating-subpatterns-in-python-regex
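To see what goes wrong, here is a minimal sketch using a single line of the dummy links from the first post. The sample string and variable names are made up for illustration. A capture group inside a repeated pattern only keeps its last iteration, while re.findall returns every match:

```python
import re

# Dummy sample line, modelled on the links in the first post.
line = "[[link1]], [[link2]], [[link3]]"

# One pattern that repeats a capture group: the group is overwritten on
# every iteration, so only the final link survives.
m = re.match(r"(?:\[\[(\w+)\]\](?:,\s*)?)+", line)
last_only = m.group(1)
print(last_only)  # link3

# findall scans the string and collects every non-overlapping match.
all_links = re.findall(r"\[\[(\w+)\]\]", line)
print(all_links)  # ['link1', 'link2', 'link3']
```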
Well, it's a bad idea anyway. Better get the HTML and parse it properly.
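That said, if staying inside re is a hard requirement, a two-pass approach might be enough for input as regular as the example: split on the header lines first, then collect the links of each section separately. This is only a sketch. It assumes every header sits on its own line as "=== name ===" and that link names contain no nested brackets. parse_sections is a made-up helper name, not part of any library.

```python
import re

STRING = ("=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n"
          "=== Header2 ===\n[[link3]]\n"
          "=== Header3 ===\n[[link1]], [[link2]]")

def parse_sections(text):
    """Map each header name to the list of link names in its section."""
    # With a capture group in the pattern, re.split keeps the captured
    # header names: [preamble, header1, body1, header2, body2, ...].
    parts = re.split(r"^===\s*(.+?)\s*===\s*$", text, flags=re.MULTILINE)
    result = {}
    # Pair each header with the body that follows it. A body ends at the
    # next header line or at the end of the string.
    for header, body in zip(parts[1::2], parts[2::2]):
        result[header] = re.findall(r"\[\[(.+?)\]\]", body)
    return result

d = parse_sections(STRING)
print(d)
```

Real wiki markup has piped links, templates and nesting that this does not handle, which is where a proper parser pays off.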