
{S/E/D/HELP} Is there a SED wizard in the audience?

File: index.jpg (9KB, 259x194px)
{S/E/D/HELP}

Is there a SED wizard in the audience?

This sed searches through some text, successively finds instances of the keyword "conditions", and prints the value that follows "subreddit_id" on each matching line:

sed -n '/conditions/{s/.*,"subreddit_id":"\([^"]*\)".*/\1/;p;}' /home/a/Desktop/subredditidsearch.txt

The contents of the data set "subredditidsearch.txt":

{"ups":1,"created_utc":"1204329608","subreddit":"reddit.com","link_id":"t3_69ta9","author_flair_text":null,"score":1,"subreddit_id":"t5_6","body":"What conditions would have to be met in order for the U.S. led invasion of Iraq to be considered a genocide?","name":"t1_c03bgax","distinguished":null,"edited":false,"parent_id":"t1_c03bg9s","archived":true,"author_flair_css_class":null,"gilded":0,"retrieved_on":1425832318,"id":"c03bgax","controversiality":0,"downs":0,"author":"bsiviglia9","score_hidden":false}
{"edited":false,"parent_id":"t1_c03bc4g","distinguished":null,"name":"t1_c03bgay","body":"You missed his their/they're mixup.","score":1,"subreddit_id":"t5_6","author_flair_text":null,"link_id":"t3_6aezn","ups":1,"subreddit":"reddit.com","created_utc":"1204329608","score_hidden":false,"author":"Cyrius","downs":0,"controversiality":0,"id":"c03bgay","retrieved_on":1425832318,"author_flair_css_class":null,"gilded":0,"archived":true}

Now, if I change "subreddit_id" to, say, "author", it prints the correct author, in this case "bsiviglia9".

Is there an easy way to modify this sed to print both "subreddit_id" and "author" when it finds the keyword? Any help would be greatly appreciated!
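
One possible approach (a rough sketch, untested against the full dump; it assumes GNU sed and that "author" comes after "subreddit_id" on the line, which holds for both sample records) is a single substitution with two capture groups, printing only when the substitution succeeds:

# print subreddit_id and author for lines containing "conditions"
sed -n '/conditions/{s/.*"subreddit_id":"\([^"]*\)".*"author":"\([^"]*\)".*/\1 \2/p;}' /home/a/Desktop/subredditidsearch.txt

For the sample data above this would print "t5_6 bsiviglia9" (only the first record contains the keyword).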
>>
Why don't you go back to Reddit and ask them?
>>
>>59038482
/thread
>>
>>59038482
I've had more luck with sed and regex questions in here than on reddit desu.

Besides, I'm trying to data mine reddit, not ask them for help.
>>
>>59038431

>working with JSON
>not just using the right tool for the job like Node.js

Kill yourself

>JSON file named .txt

Literally kill yourself.
>>
>>59038592
>>working with JSON
That's how the data set is structured. Blame the dumper, not me.

>not just using the right tool for the job like Node.js

I have nearly two billion reddit comments to look through. With sed I manage to saturate the write speed of the HDDs I'm storing it on.

>JSON file named .txt
It's a set of strings, call the police
>>
>>59038642
read speed*
>>
>>59038592
>Node.js

heartylaugh.flv
>>
>>59038715
underrated post
>>
Bump.
My current solution is to run the entire command once for every item I want to extract, but since a full pass through the data set takes me 11 hours, it would save me a lot of time if I could use a capture group or something to grab, say, both subreddit_id and author and have it print both. I've been trying various things but I can't seem to get it to work; the sex regex is very unfamiliar to me.
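
If the key order is not guaranteed across the dump, a variant worth trying (again only a sketch, untested) keeps a copy of the line in the hold space and runs one substitution per field, printing the two values on separate lines:

# h = save the line, g = restore it, so each field is extracted from the full line
sed -n '/conditions/{h;s/.*"subreddit_id":"\([^"]*\)".*/\1/p;g;s/.*"author":"\([^"]*\)".*/\1/p;}' /home/a/Desktop/subredditidsearch.txt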
>>
>>59039389
sed regex*
>>
bumping for interest.
>>
sed is terrible for this
>>
>>59039919
sed works wonderfully, and the speed is incredible. It also handles a 1TB+ data set with ease.

It's actually perfect for the job.
>>
>>59039938
>sed is perfect for parsing json
lmfao
>>
>>59039389
>because it takes me 11 hours to run through the data set
Use a JSON parser and dump to database you utter retard.
>>
File: 1446968882748.jpg (153KB, 1920x1080px)
>>59040127
>Use a JSON parser and dump to database you utter retard.

Name a JSON parser that can handle 6 billion lines at 200+ MB/s and I'll give it a try
>>
I took a shot at it lol. Maybe it helps? :)

sed -R "s/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
(\w+).*/ID: [\1] by \3 contained topic with the word \2/" <input.txt >output.txt
>>
>>59040293
I'll give it a try, hold on
>>
>>59040293

So I tried the following:

sed -r "s/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
> (\w+).*/ID: [\1] by \3 contained topic with the word \2/" /home/a/Desktop/subredditidsearch.txt

Which gives me:

sed: -e expression #1, char 59: unterminated `s' command
>>
>>59040685
You're on Linux, right? If so, try replacing " with ' and using a capital R instead of -r.
>>
>>59040750

Yeah, Ubuntu.

Now I'm getting this:

a@1:~$ sed R 's/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
> (\w+).*/ID: [\1] by \3 contained topic with the word \2/' /home/a/Desktop/subredditidsearch.txt
sed: can't read s/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
(\w+).*/ID: [\1] by \3 contained topic with the word \2/: No such file or directory
{"ups":1,"created_utc":"1204329608","subreddit":"reddit.com","link_id":"t3_69ta9","author_flair_text":null,"score":1,"subreddit_id":"t5_6","body":"What conditions would have to be met in order for the U.S. led invasion of Iraq to be considered a genocide?","name":"t1_c03bgax","distinguished":null,"edited":false,"parent_id":"t1_c03bg9s","archived":true,"author_flair_css_class":null,"gilded":0,"retrieved_on":1425832318,"id":"c03bgax","controversiality":0,"downs":0,"author":"bsiviglia9","score_hidden":false}
{"edited":false,"parent_id":"t1_c03bc4g","distinguished":null,"name":"t1_c03bgay","body":"You missed his their/they're mixup.","score":1,"subreddit_id":"t5_6","author_flair_text":null,"link_id":"t3_6aezn","ups":1,"subreddit":"reddit.com","created_utc":"1204329608","score_hidden":false,"author":"Cyrius","downs":0,"controversiality":0,"id":"c03bgay","retrieved_on":1425832318,"author_flair_css_class":null,"gilded":0,"archived":true}
>>
>>59040848
-R. Also fix your filepath
>>
>>59040941
>-R

sed: invalid option -- 'R'
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
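
For the record, three separate things trip this up: the s command spans a raw newline, which is what produces the "unterminated `s' command" error; GNU sed has no -R option (only -r or -E for extended regex); and \d is not part of sed's regex dialect. The later attempt also dropped the dash, so sed took "R" as the script and tried to open the quoted expression as an input file, hence the "No such file or directory" message. A single-line rework of the earlier suggestion, untested and still assuming the keys appear in the order subreddit_id, body, author:

# print "ID: [...] by ... contained topic with the word conditions" for matching lines only
sed -nE 's/.*"subreddit_id":"([^"]+)".*(conditions).*"author":"([^"]+)".*/ID: [\1] by \3 contained topic with the word \2/p' /home/a/Desktop/subredditidsearch.txt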
>>
>>59040165
No clue how fast your machine is. But here are fast full JSON parsers:

https://github.com/fabienrenaud/java-json-benchmark

https://github.com/miloyip/nativejson-benchmark
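
Those are library benchmarks (Java and C/C++). On the command line, a streaming JSON processor such as jq can at least do the extraction in one pass per file, one object per line as in the samples above, though no claim is made here that it reaches 200+ MB/s:

# JSON-aware version: select records whose body contains "conditions", emit subreddit_id and author as TSV
jq -r 'select(.body | contains("conditions")) | [.subreddit_id, .author] | @tsv' /home/a/Desktop/subredditidsearch.txt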
>>
>>59040165
You're almost guaranteed to be bottlenecked on IO, not CPU, unless you're running a toaster.
>>
>>59038571
> I'm trying to data mine reddit, not ask them for help.
How about you first throw the data into something more suitable, such as Apache Spark?

It'll make the actual data mining far easier going forward.