
Pastebin Downloader Script Troubleshooting


Thread replies: 9
Thread images: 2

File: Old Code.jpg (169KB, 1004x406px)
I've got an old python pastebin downloader script that stopped working.

It's supposed to give me the paste in the following format:

[Title of paste]
[Author]

[Paste link]
[Author's bin]

[Last Edit]
[Date retrieved]

[The paste contents...]

When I try to run lots of pastebin links in the command prompt, it spits out these errors, and doesn't download them all. When I only load one link, it downloads it, but throws a bunch of garbage into the "last edit" section of the paste.

>[Error code in command prompt window starts]

Traceback (most recent call last):
  File "C:\Users\...blahblah...\storydownloader.py", line 22, in <module>
    file.write('"{0}"\nBy: {1}\n\nhttp://pastebin.com/u/{1}\n{2}\n\nLast Edit: {3}\nRetrieved: {4}\n\n{5}'.format(name, author, url, lastdate, time.strftime("%d/%m/%Y"), story))
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200f' in position 69778: character maps to <undefined>

>[Error code in command prompt window ends]

I tried running the code on 42 links at once, and got that error. Only a few stories downloaded, and they all had a huge bunch of code in between the last edit, and the actual date of the last edit.

I did not get this error when I tried running it through with just one story, but the produced file still ended up having a lot of code in the last edit section.

I tried manually updating from version 3.4 of python to the current 3.6 version, but the error still happened.
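
A note on why the upgrade likely changes nothing: in both 3.4 and 3.6, open() without an encoding argument uses the locale's preferred encoding, which the traceback shows is cp1252 here, and cp1252 has no mapping for '\u200f'. A minimal sketch of the failure and of one possible workaround (passing encoding="utf-8" explicitly). The sample text and the temp file name are made up for illustration.

```python
import os
import tempfile

# Made-up sample text containing U+200F (right-to-left mark),
# the character named in the traceback.
text = "story text with a right-to-left mark: \u200f"

# Encoding to cp1252 fails because U+200F has no cp1252 mapping.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding explicitly makes the write independent of the locale.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

Applied to the script, this would mean opening the output file with encoding='utf-8'. Whatever opens the .txt afterwards then has to read it as UTF-8 too.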

I'm starting from square one. I can kind of parse html well enough to rip bandcamp songs by hand, and I once messed around with visual basic in high school to build a metal detector or an automatic electromagnet on a rotating arm. I don't know how long it would take before I could reliably troubleshoot this and write my own scripts to automate grabbing things from webpages.

I'll put the code in the next post.
>>
>>299225

from urllib.request import urlopen
from html import unescape
import re
import time
import string
import sys
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
for url in sys.argv[1:]:
    data = str(urlopen(url).read())
    quickdata = re.sub(r'<div id="code_buttons">.*','',data)
    filename = re.sub(r' - Pastebin.com</title>.*','',re.sub(r'.*<title>','',quickdata))
    name = filename
    author = re.sub(r'".*','',re.sub(r'.*By: <a href="/u/','',quickdata))
    if author[:5] == "b'<!D":
        author = 'Unknown'
    lastdate = re.sub(r' [0-9][0-9]:.*','',re.sub(r'.*<span title="Last edit on:','',quickdata))
    if lastdate[:5] == "b'<!D":
        lastdate = re.sub(r' [0-9][0-9]:.*','',re.sub(r'.*on <span title="','',quickdata))
    story = unescape(data[data.find('"return catchTab(this,event)"')+30:data.find('</textarea>')].replace(r'\r\n','\n').replace(r"\'",r"'"))
    filename = ''.join(c for c in filename if c in valid_chars)
    file = open(filename + '.txt','w')
    file.write('"{0}"\nBy: {1}\n\nhttp://pastebin.com/u/{1}\n{2}\n\nLast Edit: {3}\nRetrieved: {4}\n\n{5}'.format(name, author, url, lastdate, time.strftime("%d/%m/%Y"), story))
    file.close()
- - - - -
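
One possible explanation for the junk in the "last edit" field, offered as a guess rather than a diagnosis: the script calls str() on the downloaded bytes, which gives the "b'...'" repr (that is why it checks for "b'<!D"), and re.sub hands back its input unchanged when a pattern does not match. If the page markup has changed since the script was written, a large chunk of the page can end up in a field. A hedged sketch of a more defensive approach using decode() and re.search with a capture group. The helper name first_match and the HTML fragment are invented for illustration and are NOT Pastebin's real markup.

```python
import re

def first_match(pattern, html, default="Unknown"):
    """Return capture group 1 of the first match, or default if there is none."""
    m = re.search(pattern, html, flags=re.DOTALL)
    return m.group(1) if m else default

# Made-up page fragment for illustration; NOT Pastebin's real markup.
raw = b'<title>My Story - Pastebin.com</title>\n<a href="/u/someone">someone</a>'

# str(raw) produces the repr of the bytes object, prefix and quotes included.
repr_prefix = str(raw)[:5]

# decode() gives the actual text instead.
html = raw.decode("utf-8")

title = first_match(r"<title>(.*?) - Pastebin\.com</title>", html)
author = first_match(r'href="/u/([^"]+)"', html)
# No "Last edit on" span exists in the fragment, so the default is returned
# instead of the whole page.
last_edit = first_match(r'<span title="Last edit on: ([^"]+)"', html)
```

With this shape, a pattern that stops matching produces "Unknown" rather than a wall of HTML, which at least makes the breakage obvious. The patterns themselves would still need to be checked against whatever the page currently serves.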

Obviously, the best long-term solution would be to learn python, but I've got my hands full for the time being. Going from "Has used some visual basic in a high school class, can sort of parse html and rip bandcamp songs by hand, and sorta messed with PHP for a school project but doesn't fully get it" to being able to write simple retrieval scripts like this, how long would that take?

Or, if you can, tell me which chapters of the documentation I should read in order to figure out how this code works, so that I can do some troubleshooting that way. I'd rather get this working with help, and then set learning to make scripts as a side project for the next while.
>>
>>299228
>file.write('"{0}"\nBy: {1}\n\nhttp://pastebin.com/u/{1}\n{2}\n\nLast Edit: {3}\nRetrieved: {4}\n\n{5}'.format(name, author, url, lastdate, time.strftime("%d/%m/%Y"), story))

Replace "story" on this line with "story.encode('utf-8')" and let me know if it works.
>>
>>299231
It didn't work.

The problem with the "last edit" is still there, and linebreaks seem to be replaced with "/n", resulting in a wall of text since I have wordwrap on.
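
The literal linebreak markers are what you would expect from the suggested change: str.encode() returns a bytes object, and putting bytes through str.format() inserts its repr, with the b'...' wrapper and escaped newlines, rather than the text. A small sketch with a made-up two-line story:

```python
# Made-up sample; any text with a newline shows the effect.
story = "line one\nline two"

# encode() returns bytes, not str.
as_bytes = story.encode("utf-8")

# Formatting bytes into a str uses the bytes repr: the b'...' wrapper,
# with the newline shown as a literal backslash followed by n.
formatted = "{0}".format(as_bytes)

# Formatting the original str keeps the real newline.
formatted_str = "{0}".format(story)
```

So the encoding probably has to be handled where the file is opened, not on the string passed to format().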
>>
>>299234
I also got a bunch of new errors, but I can't seem to copy them from the command window.
>>
>>299235
Try https://pastebin.com/3rwKWLNm
I just got rid of his html unescaping.
>>
File: screenshot.5.jpg (218KB, 661x430px)
>>299249
I still get the "last edit" error, and now every ">" has been changed into "&gt;".

I capped the errors I got.

This code used to work fine, but I haven't used it in months. I did think of updating my python library, from 3.4 to 3.6, I think, but that didn't do anything.
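
The "&gt;" symptom fits the unescaping step having been removed: the page stores the paste with HTML entities, and html.unescape (which the original script imports) is what turns them back into characters. A minimal sketch with a made-up escaped string:

```python
from html import unescape

# Made-up sample of entity-escaped text, as it might appear inside a page.
escaped = "&gt;be me &amp; post &lt;code&gt;"

# unescape() converts HTML entities back to the characters they stand for.
plain = unescape(escaped)
```

Here plain ends up as ">be me & post <code>", so keeping the unescape() call on the story text is probably what you want.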
>>
>>299256
The error in your screenshot is saying that the link you are giving it is returning a 404 error.
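
If one dead link in a batch is the issue, a rough sketch of how a loop could skip failures instead of stopping at the first one. The function name fetch_all and its fetch parameter are hypothetical, added only so the example can be tried without a network. HTTPError and URLError are the real exception classes from urllib.error.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_all(urls, fetch=None):
    """Fetch each URL, collecting failures instead of stopping at the first."""
    if fetch is None:
        # Default: download the page body as bytes.
        fetch = lambda u: urlopen(u).read()
    pages, failures = {}, {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except HTTPError as err:
            # HTTPError is a subclass of URLError, so it must be caught first.
            failures[url] = "HTTP {}".format(err.code)
        except URLError as err:
            failures[url] = str(err.reason)
    return pages, failures
```

Printing the failures dict at the end would show which of the links returned a 404, so they can be checked by hand.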
>>
>>299266
That's odd, but I don't know if that's important. That's not what showed up in the errors in the original version of the code (listed near the top of the thread).

For the record, my testing method is to run the script with one pastebin link first, then check the file for errors. The command window doesn't give errors at this stage. The second round of testing is running the script with around forty-two different links. The code works with multiple links, so I normally just keep adding new ones to Notepad++ and run the script when it starts looking full.


This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.