
What webcrawlers have you made that you use on a regular basis?

File: web-crawlers.jpg (41KB, 600x600px)
What webcrawlers have you made that you use on a regular basis?
>>
one that checks every chaturbate room for a pair of tits, then it takes a screenshot, compresses it down to a 32x32 gif and sends it to my phone
>>
>>56657292
why
>>
>>56657292
ur my hero m8
>>
EVERYONE GIMME YOUR WEBCRAWLER IDEAS
>>
One that logs into my bank account and pulls any cash movements.

A crawler to download the posted/favorited videos of a given user from xhamster.
>>
>>56657276
How do I make one? Where can I get examples? What do I need? Is this like imacros in Firefox? I am really interested
>>
>>56659412
I used an HTML parser module for Python called BeautifulSoup; you won't need more than this 99% of the time. If more interaction with the website is needed, Selenium is my go-to module.
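Something like this gets you most of the way (just a sketch; the URL and CSS class are made up, point it at whatever you're actually scraping):

# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# hypothetical listing page, purely for illustration
resp = requests.get("https://example.com/listing")
soup = BeautifulSoup(resp.text, "html.parser")

# print the text and href of every link inside elements with class "item"
for link in soup.select(".item a"):
    print(link.get_text(strip=True), link.get("href"))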
>>
>>56659441
is there any method of doing this without resorting to the cucks of programming languages?

is it easy to do with C or Ada?
>>
>>56660154
i hope this is a bait post
you are constrained mostly by your network speed, so a C/Ada program running faster makes absolutely no difference
and no, it isn't easy either
just use python/ruby/any scripting language w/ a library that has bindings to a native HTML parser
>>
>>56660216
Mostly I don't know those and would like to use something I'm comfortable with
>>
Just came up with the idea for a crawler that looks for credit card info or pics of cards (e.g. on Twitter) and makes donations to something worthwhile, like cloning Harambe
>>
>>56659441
>I used an HTML parser module for Python called BeautifulSoup
BeautifulSoup is nice, but it's slow as shit. Unless you REALLY need the error tolerance, use lxml.
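Rough equivalent in lxml (sketch; the URL and XPath are placeholders):

# pip install requests lxml
import requests
from lxml import html

# placeholder URL, just to show the API
doc = html.fromstring(requests.get("https://example.com/listing").content)

# XPath straight against the parsed tree; lxml wraps a C parser, so this is fast
for href in doc.xpath("//a/@href"):
    print(href)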

>>56660154
Why the crap would you want to tackle a problem like this using C or Ada?

>cucks of programming languages
Are you 12?

>>56660244
>Mostly i dont know those
Then learn one. If you know C well then I can't imagine you'll have trouble picking up Python.

>>56660266
>Automatic image recognition against random images on Twitter
Good luck.
>>
>>56660284
Training a neural net to recognize credit cards would be stupidly easy. Throw in some OCR magic and hook it up to Twitter's API and there you go. Something similar was already done; there used to be an account that would retweet photos of credit cards.
>>
Tempted to make an interpals crawler that searches girls' profiles for keywords and sends a message built around those keywords
>>
>>56660352
>Training a neural net to recognize credit cards would be stupidly easy.
Maybe?
I suspect the broad range of patterns and images on credit cards would make identifying them tricky, but I don't have real experience in that area.
>>
>>56657276
did one to log into Mergent Online and search for the top performers of the day in, well, some market like the NYSE. It downloads financial statements, competitors, etc.

pretty useless thing to do
>>
parse university canteen websites

offer machine-readable data
>>
>>56657276
any guides on building one?
>>
>>56660216
how hard would it be to do it in C though? Just for fun, to learn and become more familiar with C.
Or is it just too bad of an idea?
>>
>>56657276
WEBCRAWLING IN MY SKIN
>>
>>56660997
i was fucking listening to crawling too
>>
I work at an industrial equipment distributor.
Made some scripts to gather and process data from manufacturers' pages in order to use it on our company page.
Does this count as crawling? I haven't really looked into the definition of crawling, I just made my junk do what I needed it to.
>>
>>56659412
The Scrapy framework for Python is very good and uses concurrent requests.
A lot of options are also available.
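Minimal spider sketch (the domain and selectors are made up; run it with scrapy runspider spider.py):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # placeholder start page
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        # yield one item per link on the page
        for href in response.css("a::attr(href)").getall():
            yield {"url": href}
        # follow a (hypothetical) next-page link; Scrapy schedules requests concurrently
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)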
>>
>>56657276
Made some python/scrapy cronjobs to automatically like the fb/twitter posts of my gf every hour or so.

Cause you know, I'm a vagina slave developer with no time for childishness like social networks.
>>
Sometimes I like to use nmap to scan millions of random IPs on port 80 and then see if a web page resolves. It's usually just boring shit like Chinese sites and stuff. I found someone's home videos once.
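The same idea in plain Python, if anyone wants to play with it (sketch only, and obviously be careful who you scan):

import random, socket

def random_ip():
    # naive: doesn't skip reserved/private ranges
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

for _ in range(100):
    ip = random_ip()
    s = socket.socket()
    s.settimeout(0.5)
    try:
        s.connect((ip, 80))
        print("port 80 open:", ip)
    except OSError:
        pass
    finally:
        s.close()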
>>
One for "subscribing" to youtube channels, without having an account, navigating through a laggy GUI or getting distracted from my work by recommendations.
After the scan, the videos open in a vlc media stream.
Quite comfy on low-end computers.
>>
>>56661851
Nicely done.
>>
>>56661851
>One for "subscribing" to youtube channels, without having an account
I do that too.

>After the scan, the videos open in a vlc media stream.
Huh, okay. Mine returns an Atom feed that gets read by my feed reader.
>>
>>56661851
>>56661978
Are you doing that via the Youtube API? I did a similar API to RSS kind of thing for search results a while ago, but it had some arbitrary limits in API v2 or whatever it was at the time.
>>
>>56662051
>>56661978
Y'know, every channel does have an RSS feed. Y'can just use that.
>>
>>56662096
Uuhh uh whaat
>>
>>56662051
>Are you doing that via the Youtube API?
God no. The Youtube API actually requires you to authenticate with an account.

I'm just scraping the HTML of the Uploads page (or Playlist page) and the individual Video pages. To save re-scraping the same pages over and over, I store the info on the Video pages in a SQLite DB between scrapes.

If Google doesn't like me doing that, then they're free to bring back channel RSS feeds.
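The caching part is just something like this (sketch; the schema is made up):

import sqlite3

db = sqlite3.connect("videos.db")
db.execute("CREATE TABLE IF NOT EXISTS videos (id TEXT PRIMARY KEY, title TEXT)")

def seen(video_id):
    # skip re-scraping video pages we already have
    return db.execute("SELECT 1 FROM videos WHERE id = ?", (video_id,)).fetchone() is not None

def store(video_id, title):
    db.execute("INSERT OR IGNORE INTO videos VALUES (?, ?)", (video_id, title))
    db.commit()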

>>56662096
>every channel does have an RSS feed
That's been gone for years.
>>
>>56662107
Yeah. There was some specific URL you paste the channel's ID after, but sometimes even just viewing source and looking for 'RSS' works. I'll see if I have the URL saved somewhere.
All you need is RSS for that, yeah.
>>
>>56662139
>That's been gone for years.
It's not. It's still there, just not obviously available.
>>
>>56662159
>It's not. It's still there, just not obviously available.
Shit, really? I did a bunch of searching for stuff like that before I wrote the scraper, but I found nothing that still worked.
Do you have any information you could post / link to?
>>
File: foo.png (93KB, 707x337px)
I have some automated betting process going on with a few sports betting sites.
>>
>>56662185
https://www.youtube.com/feeds/videos.xml?channel_id=[HEX-ID]

[HEX-ID] => search for the tag "channel-external-id" on the channel HTML
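Sketch of gluing that together (the channel URL is a placeholder, and the tag name is whatever the channel HTML contains at the time, so this may break):

import re, urllib.request

# hypothetical channel page
page = urllib.request.urlopen("https://www.youtube.com/user/SomeChannel").read().decode("utf-8", "replace")
m = re.search(r'channel-external-id="([^"]+)"', page)
if m:
    feed = "https://www.youtube.com/feeds/videos.xml?channel_id=" + m.group(1)
    print(urllib.request.urlopen(feed).read()[:200])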
>>
>>56662241
this seems cool... is it checking the betting lines on those games?
>>
>>56662185
>>56662366
However, on some channels you can just view source and ctrl+F 'rss'. For example, from the channel of a random video in my recommended videos:
https://www.youtube.com/channel/UCxr2d4As312LulcajAkKJYw
https://www.youtube.com/feeds/videos.xml?channel_id=UCxr2d4As312LulcajAkKJYw

Otherwise you'll have to do it that way. Though right now I'm having trouble finding a channel it doesn't work for, even though it didn't work for a lot of the channels in my RSS before. So it's pretty helpful still, apparently.
>>
>>56660901
if there are existing scraping libs in C like BeautifulSoup, then easy

Otherwise, starting from scratch would be an intermediate task for a new C programmer
>>
>>56662139
>channel RSS feeds are gone
It's still there, I'm using it right now...
>>
>>56660901
I suggest going for libcurl and libtidy.
libtidy comes with a buffer that can be passed to curl_easy_setopt on CURLOPT_WRITEDATA.
But listen to >>56660216
>>
>>56662366
>>56662420
>>56662522
>https://www.youtube.com/feeds/videos.xml?channel_id=UCqbkm47qBxDj-P3lI9voIAw
Alright, I don't know if that was added since I wrote this thing, or if I missed it somehow.
Still, thanks!
>>
>>56660284
>>I used an HTML parser module for Python called BeautifulSoup
>BeautifulSoup is nice, but it's slow as shit. Unless you REALLY need the error tolerance, use lxml.
What the fuck is this and how do I use it?
I'm manually parsing HTML with C right now...
>protip: it just werks