Hey guys, I'm parsing files (Wiki pages, basically) using


Anonymous
2017-01-10 06:59:40 Post No. 58424118

File: Screen Shot 2017-01-10 at 19.58.04.png (25KB, 587x143px)

Hey guys,

I'm parsing files (Wiki pages, basically) using the re package in Python (might not matter), and I'm having a problem with the following. The structure looks like this:

=== Header1 ===
[[link1]], [[link2]],
[[link3]]
=== Header2 ===
[[link3]]
=== Header3 ===
[[link1]], [[link2]]

This is read in as a STRING, i.e.
STRING = "=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n=== Header2 ===\n[[link3]]\n=== Header3 ===\n[[link1]], [[link2]]"

The names here are dummies. The names and the number of headers and links are basically arbitrary.

I'd like to get the data into a structure such as a dict, e.g.

d = {"Header1": ["link1", "link2", "link3"], "Header2": ["link3"], "Header3": ["link1", "link2"]}

OP 2017-01-10 07:02:30 Post No.58424156

Here is the reference for the re package that I'm using to try to assemble some code:

https://www.tutorialspoint.com/python/python_reg_expressions.htm

Anonymous 2017-01-10 07:23:09 Post No.58424458

>>58424118
>(might not matter)
It does, though.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Seriously, stop using re and start using an HTML parser.

Anonymous 2017-01-10 07:27:14 Post No.58424535

>>58424458
Mhm, well, I'd rather stay inside the framework I use for the pre- and post-processing of the data, though.
My main question is how to handle the fact that [[...]] can repeat any number of times.
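
For the repetition alone, one possible approach is to let re.findall do the repeating instead of the pattern itself. This is only a minimal sketch: the variable names are illustrative, and it assumes link names never contain "]]".

```python
import re

# Body of the first section from the example; the variable name is illustrative.
text = "[[link1]], [[link2]],\n[[link3]]"

# findall returns one entry per non-overlapping match of the capture group,
# so any number of [[...]] occurrences is handled without repeating the group.
links = re.findall(r"\[\[(.*?)\]\]", text)
print(links)  # ['link1', 'link2', 'link3']
```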

Anonymous 2017-01-10 07:34:17 Post No.58424673

>>58424535
You mean you want to capture repeated groups?

Anonymous 2017-01-10 07:36:01 Post No.58424701

>>58424673
yes

(Among other things. For example, I want to tell it to match from "=== Header1 ===" up to the next "===" or the end of the file, and no further.)
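
The "up to the next === or end of file" part could be sketched with a lazy match plus a lookahead. The pattern and the names section_re and sections are assumptions made for illustration, and the sketch only considers "=== ... ===" headers that start at the beginning of a line, as in the example.

```python
import re

STRING = ("=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n"
          "=== Header2 ===\n[[link3]]\n"
          "=== Header3 ===\n[[link1]], [[link2]]")

# A header line, then a lazy body that stops right before the next line
# starting with "===" or at the very end of the string.
# MULTILINE makes ^ match at every line start; DOTALL lets . cross newlines.
section_re = re.compile(
    r"^===[ \t]*(?P<header>[^\n]+?)[ \t]*===[ \t]*(?:\n|\Z)"
    r"(?P<body>.*?)(?=^===|\Z)",
    re.MULTILINE | re.DOTALL,
)

sections = [(m.group("header"), m.group("body"))
            for m in section_re.finditer(STRING)]

for header, body in sections:
    print(header, repr(body))
```

The lookahead does not consume the next header, so finditer can pick it up as the start of the following match.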

Anonymous 2017-01-10 07:41:10 Post No.58424759

>>58424701
It's possible that you can't do that with re: http://stackoverflow.com/questions/9764930/capturing-repeating-subpatterns-in-python-regex

Well, it's a bad idea anyway. Better get the HTML and parse it properly.
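
To illustrate the limitation mentioned above: in Python's re, a capture group inside a repetition keeps only its last match. A common workaround is two passes, first splitting on the headers and then collecting the links in each body. The following is a rough sketch under the assumptions of the example in the first post (headers of the form "=== name ===" on their own line, links of the form [[name]]); parse_sections is a hypothetical helper name, not something from the thread.

```python
import re

STRING = ("=== Header1 ===\n[[link1]], [[link2]],\n[[link3]]\n"
          "=== Header2 ===\n[[link3]]\n"
          "=== Header3 ===\n[[link1]], [[link2]]")

# A repeated capture group only remembers its final repetition.
m = re.match(r"(?:\[\[(\w+)\]\](?:,\s*)?)+", "[[link1]], [[link2]], [[link3]]")
print(m.group(1))  # link3


def parse_sections(text):
    """Map each '=== Header ===' to the list of [[links]] that follow it."""
    # With one capture group, re.split returns
    # [text_before_first_header, header1, body1, header2, body2, ...].
    parts = re.split(r"^===[ \t]*(.+?)[ \t]*===[ \t]*$", text,
                     flags=re.MULTILINE)
    result = {}
    for header, body in zip(parts[1::2], parts[2::2]):
        # setdefault/extend merges sections that share a header name.
        result.setdefault(header, []).extend(
            re.findall(r"\[\[(.*?)\]\]", body))
    return result


print(parse_sections(STRING))
```

Real wiki markup has nested templates, piped links such as [[target|label]], and other header levels, none of which this sketch handles.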
