I've got an old python pastebin downloader script that stopped working.
It's supposed to give me the paste in the following format:
[Title of paste]
[Author]
[Paste link]
[Author's bin]
[Last Edit]
[Date retrieved]
[The paste contents...]
When I try to run lots of pastebin links in the command prompt, it spits out these errors, and doesn't download them all. When I only load one link, it downloads it, but throws a bunch of garbage into the "last edit" section of the paste.
>[Error code in command prompt window starts]
Traceback (most recent call last):
  File "C:\Users\...blahblah...\storydownloader.py", line 22, in <module>
    file.write('"{0}"\nBy: {1}\n\nhttp://pastebin.com/u/{1}\n{2}\n\nLast Edit: {3}\nRetrieved: {4}\n\n{5}'.format(name, author, url, lastdate, time.strftime("%d/%m/%Y"), story))
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200f' in position 69778: character maps to <undefined>
>[Error code in command prompt window ends]
I tried running the code on 42 links at once, and got that error. Only a few stories downloaded, and they all had a huge bunch of code in between the last edit, and the actual date of the last edit.
I did not get this error when I tried running it through with just one story, but the produced file still ended up having a lot of code in the last edit section.
I tried manually updating from version 3.4 of python to the current 3.6 version, but the error still happened.
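The traceback above points at the file write, not the download. On Windows, open() without an encoding argument uses the locale's code page (cp1252 here, going by the traceback), and cp1252 has no mapping for U+200F, so only pastes containing such characters would fail. A minimal sketch of the difference, assuming a UTF-8 output file is acceptable; the text and file name are made up for illustration:

```python
import os
import tempfile

# Made-up text containing U+200F (RIGHT-TO-LEFT MARK), the character named in the traceback.
text = "Some story text\u200f with an invisible mark"

# cp1252 cannot represent U+200F, so encoding to it fails the same way the script did.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding="utf-8" to open() avoids relying on the platform default.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back to confirm the character survived.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(cp1252_ok)
print(round_tripped == text)
```

Upgrading Python would not be expected to change this, since the default still came from the Windows locale in both 3.4 and 3.6.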
I'm starting from square one: I can kind of parse HTML well enough to rip Bandcamp songs by hand, and I once messed around with Visual Basic in high school to build a metal detector or an automatic electromagnet on a rotating arm. I don't know how long it would take before I could reliably troubleshoot this and write my own scripts to automate grabbing things from webpages.
I'll put the code in the next post.
>>299225
from urllib.request import urlopen
from html import unescape
import re
import time
import string
import sys
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
for url in sys.argv[1:]:
    data = str(urlopen(url).read())
    quickdata = re.sub(r'<div id="code_buttons">.*','',data)
    filename = re.sub(r' - Pastebin.com</title>.*','',re.sub(r'.*<title>','',quickdata))
    name = filename
    author = re.sub(r'".*','',re.sub(r'.*By: <a href="/u/','',quickdata))
    if author[:5] == "b'<!D":
        author = 'Unknown'
    lastdate = re.sub(r' [0-9][0-9]:.*','',re.sub(r'.*<span title="Last edit on:','',quickdata))
    if lastdate[:5] == "b'<!D":
        lastdate = re.sub(r' [0-9][0-9]:.*','',re.sub(r'.*on <span title="','',quickdata))
    story = unescape(data[data.find('"return catchTab(this,event)"')+30:data.find('</textarea>')].replace(r'\r\n','\n').replace(r"\'",r"'"))
    filename = ''.join(c for c in filename if c in valid_chars)
    file = open(filename + '.txt','w')
    file.write('"{0}"\nBy: {1}\n\nhttp://pastebin.com/u/{1}\n{2}\n\nLast Edit: {3}\nRetrieved: {4}\n\n{5}'.format(name, author, url, lastdate, time.strftime("%d/%m/%Y"), story))
    file.close()
- - - - -
Obviously, the best long-term solution would be to learn python, but I've got my hands full for the time being. Going from "Has used some visual basic in a high school class, can sort of parse html and rip bandcamp songs by hand, and sorta messed with PHP for a school project but doesn't fully get it" to being able to write simple retrieval scripts like this, how long would that take?
Or, if you can, tell me which chapters of the documentation I should read in order to figure out how this code works, so that I can do some troubleshooting that way. I'd rather get this working with help, and then set learning to make scripts as a side project for the next while.
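One detail that helps when reading the script: urlopen(url).read() returns bytes, and calling str() on bytes gives its repr, which starts with b' and spells line breaks out as backslash sequences. That is why the script compares against "b'<!D" and replaces r'\r\n'. A small sketch of the difference between that and decoding, assuming the page is UTF-8 (an assumption; the sample bytes are made up):

```python
# Made-up stand-in for what urlopen(url).read() returns: raw bytes.
raw = b"<!DOCTYPE html>\r\n<title>caf\xc3\xa9 - Pastebin.com</title>"

as_repr = str(raw)             # the repr of the bytes object, not the text itself
as_text = raw.decode("utf-8")  # the actual text, assuming the page is UTF-8

print(as_repr[:5])
print(as_text.splitlines()[1])
```

With decoded text, line breaks and non-ASCII characters arrive as real characters instead of escape sequences that have to be patched up afterwards.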
>>299228
>file.write('"{0}"\nBy: {1}\n\nhttp://pastebin.com/u/{1}\n{2}\n\nLast Edit: {3}\nRetrieved: {4}\n\n{5}'.format(name, author, url, lastdate, time.strftime("%d/%m/%Y"), story))
Replace "story" on this line with "story.encode('utf-8')" and let me know if it works.
>>299231
It didn't work.
The problem with the "last edit" is still there, and linebreaks seem to be replaced with "\n", resulting in a wall of text since I have wordwrap on.
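The literal escape sequences in place of line breaks are what that change would be expected to produce: formatting a bytes object into a str inserts its repr, b'...' with the escapes spelled out. A small demonstration with a made-up string:

```python
story = "line one\nline two"  # made-up sample text

# Formatting bytes into a str uses the bytes repr, so the newline
# becomes the two characters backslash and n.
as_bytes = "{0}".format(story.encode("utf-8"))

# Formatting the str directly keeps the real line break.
as_str = "{0}".format(story)

print(as_bytes)
print(as_str)
```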
>>299234
I also got a bunch of new errors, but I can't seem to copy them from the command window.
>>299235
Try https://pastebin.com/3rwKWLNm
I just got rid of his html unescaping.
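For context, html.unescape is what turns HTML entities in the page source back into the characters they stand for, so removing it would leave entities such as &gt; in the saved text. A quick illustration with a made-up line:

```python
from html import unescape

# Made-up sample of text as it might appear in page source.
raw_line = "&gt;be me &amp; write &quot;stories&quot;"

# unescape converts each entity back to its character.
clean_line = unescape(raw_line)

print(clean_line)
```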
>>299249
I still get the "last edit" error, and now every ">" has been changed into "&gt;".
I capped the errors I got.
This code used to work fine, but I haven't used it in months. I did think of updating my python library, from 3.4 to 3.6, I think, but that didn't do anything.
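The stray markup in the "Last Edit" field is consistent with how the script extracts it: re.sub only deletes what matches, so if the page no longer contains the marker text, the whole page passes through unchanged. A sketch of an alternative using re.search with a capture group and a fallback value. The function name and the HTML snippets are made up for illustration, and Pastebin's real markup may differ:

```python
import re

def extract_last_edit(html):
    """Return the text after 'Last edit on:' in a span title, or 'Unknown'."""
    # The marker is taken from the pattern in the original script.
    match = re.search(r'<span title="Last edit on:\s*([^"]*)"', html)
    return match.group(1).strip() if match else "Unknown"

# Made-up snippets, for illustration only.
old_layout = '<span title="Last edit on: Monday 3rd of April 2017">edited</span>'
new_layout = '<div class="date">Apr 3rd, 2017</div>'

print(extract_last_edit(old_layout))
print(extract_last_edit(new_layout))
```

Capturing only the wanted piece means a layout change produces "Unknown" rather than a page's worth of HTML in the output file.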
>>299256
The error in your screenshot is saying that the link you are giving it is returning a 404 error.
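A 404 on one link raises urllib.error.HTTPError, which ends the whole run if nothing catches it. A sketch of a loop that skips failing links and keeps going; download_all, fetch_page and the fetch parameter are hypothetical names, not part of the original script:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch_page(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()

def download_all(urls, fetch=fetch_page):
    """Fetch every URL, collecting failures instead of stopping at the first one."""
    pages, failures = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except HTTPError as err:
            # err.code holds the HTTP status, e.g. 404 for a missing paste.
            failures.append((url, err.code))
    return pages, failures
```

Printing the failures list at the end would show which links need checking by hand.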
>>299266
That's odd, but I don't know if that's important. That's not what showed up in the errors in the original version of the code (listed near the top of the thread).
For the record, my testing method is to run the script with one pastebin link first, then check the file for errors. The command window doesn't give errors at this stage. The second round of testing is running the script with around forty-two different links. The code works with multiple links, so I normally just keep adding new ones to notepad++ and run the script when it starts looking full.