
What webcrawlers have you made that you use on a regular basis?

File: web-crawlers.jpg (41KB, 600x600px)
What webcrawlers have you made that you use on a regular basis?
>>
one that checks every chaturbate room for a pair of tits, then it takes a screenshot, compresses it down to a 32x32 gif and sends it to my phone
>>
>>56657292
why
>>
>>56657292
ur my hero m8
>>
EVERYONE GIMME YOUR WEBCRAWLER IDEAS
>>
One that logs into my bank account and pulls any cash movements.

A crawler to download the posted/favorited videos of a given user from xhamster.
>>
>>56657276
How do I make one? Where can I get examples? What do I need? Is this like imacros in Firefox? I am really interested
>>
>>56659412
I used an HTML parser module for Python called BeautifulSoup; you won't need more than this 99% of the time. If more interaction with the website is needed, Selenium is my go-to module.
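Something like this gets you most of the way (just a sketch; the URL and CSS class are made up, point it at whatever you're actually scraping):

# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# hypothetical listing page, purely for illustration
resp = requests.get("https://example.com/listing")
soup = BeautifulSoup(resp.text, "html.parser")

# print the text and href of every link inside elements with class "item"
for link in soup.select(".item a"):
    print(link.get_text(strip=True), link.get("href"))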
>>
>>56659441
is there any method of doing this without resorting to the cucks of programming languages?

is it easy to do with C or Ada?
>>
>>56660154
i hope this is a bait post
you are constrained mostly by your network speed, so a C/Ada program running faster makes absolutely no difference
and no, it isn't easy either
just use python/ruby/any scripting language w/ a library that has bindings to a native HTML parser
>>
>>56660216
Mostly I don't know those and would like to use something I'm comfortable with
>>
Just came up with the idea for a crawler that looks for credit card info or pics of cards (e.g. on Twitter) and makes donations to something worthwhile, like cloning Harambe
>>
>>56659441
>I used an HTML parser module for Python called BeautifulSoup
BeautifulSoup is nice, but it's slow as shit. Unless you REALLY need the error tolerance, use lxml.
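Rough equivalent in lxml (sketch; the URL and XPath are placeholders):

# pip install requests lxml
import requests
from lxml import html

# placeholder URL, just to show the API
doc = html.fromstring(requests.get("https://example.com/listing").content)

# XPath straight against the parsed tree; lxml wraps a C parser, so this is fast
for href in doc.xpath("//a/@href"):
    print(href)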

>>56660154
Why the crap would you want to tackle a problem like this using C or Ada?

>cucks of programming languages
Are you 12?

>>56660244
>Mostly i dont know those
Then learn one. If you know C well then I can't imagine you'll have trouble picking up Python.

>>56660266
>Automatic image recognition against random images on Twitter
Good luck.
>>
>>56660284
Training a neural net to recognize credit cards would be stupidly easy. Throw in some OCR magic and hook it up to Twitter's API and there you go. Something similar was already done; there used to be an account that would retweet photos of credit cards.
>>
Tempted to make an interpals crawler that searches girls' profiles for keywords and sends a message built around those keywords
>>
>>56660352
>Training a neural net to recognize credit cards would be stupidly easy.
Maybe?
I suspect the broad range of patterns and images on credit cards would make identifying them tricky, but I don't have real experience in that area.
>>
>>56657276
did one to log into Mergent Online and search for the top performers of the day in, well, some market like the NYSE. It downloads financial statements, competitors, etc.

pretty useless thing to do
>>
parse university canteen websites

offer machine-readable data
>>
>>56657276
any guides on building one?
>>
>>56660216
how hard would it be to do it in C though? Just for fun, to learn and become more familiar with C.
Or is it just too bad of an idea?
>>
>>56657276
WEBCRAWLING IN MY SKIN
>>
>>56660997
i was fucking listening to crawling too
>>
I work at an industrial equipment distributor.
Made some scripts to gather and process data from manufacturers' pages in order to use it on our company page.
Does this count as crawling? I haven't really looked into the definition of crawling, I just made my junk do what I needed it to.
>>
>>56659412
The Scrapy framework for Python is very good and uses concurrent requests.
A lot of options are also available.
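Minimal spider sketch (the domain and selectors are made up; run it with scrapy runspider spider.py):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # placeholder start page
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        # yield one item per link on the page
        for href in response.css("a::attr(href)").getall():
            yield {"url": href}
        # follow a (hypothetical) next-page link; Scrapy schedules requests concurrently
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)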
>>
>>56657276
Made some python/scrapy cronjobs to automatically like the fb/twitter posts of my gf every hour or so.

Cause you know, I'm a vagina slave developer with no time for childishness like social networks.
>>
Sometimes I like to use nmap to scan millions of random IPs on port 80 and then see if a web page resolves. It's usually just boring shit like Chinese sites and stuff. I found someone's home videos once.
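The same idea in plain Python, if anyone wants to play with it (sketch only, and obviously be careful who you scan):

import random, socket

def random_ip():
    # naive: doesn't skip reserved/private ranges
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

for _ in range(100):
    ip = random_ip()
    s = socket.socket()
    s.settimeout(0.5)
    try:
        s.connect((ip, 80))
        print("port 80 open:", ip)
    except OSError:
        pass
    finally:
        s.close()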
>>
One for "subscribing" to youtube channels, without having an account, navigating through a laggy GUI or getting distracted from my work by recommendations.
After the scan, the videos open in a vlc media stream.
Quite comfy on low-end computers.
>>
>>56661851
Nicely done.
>>
>>56661851
>One for "subscribing" to youtube channels, without having an account
I do that too.

>After the scan, the videos open in a vlc media stream.
Huh, okay. Mine returns an Atom feed that gets read by my feed reader.
>>
>>56661851
>>56661978
Are you doing that via the Youtube API? I did a similar API to RSS kind of thing for search results a while ago, but it had some arbitrary limits in API v2 or whatever it was at the time.
>>
>>56662051
>>56661978
Y'know, every channel does have an RSS feed. Y'can just use that.
>>
>>56662096
Uuhh uh whaat
>>
>>56662051
>Are you doing that via the Youtube API?
God no. The Youtube API actually requires you to authenticate with an account.

I'm just scraping the HTML of the Uploads page (or Playlist page) and the individual Video pages. To save re-scraping the same pages over and over, I store the info on the Video pages in a SQLite DB between scrapes.

If Google doesn't like me doing that, then they're free to bring back channel RSS feeds.
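The caching part is just something like this (sketch; the schema is made up):

import sqlite3

db = sqlite3.connect("videos.db")
db.execute("CREATE TABLE IF NOT EXISTS videos (id TEXT PRIMARY KEY, title TEXT)")

def seen(video_id):
    # skip re-scraping video pages we already have
    return db.execute("SELECT 1 FROM videos WHERE id = ?", (video_id,)).fetchone() is not None

def store(video_id, title):
    db.execute("INSERT OR IGNORE INTO videos VALUES (?, ?)", (video_id, title))
    db.commit()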

>>56662096
>every channel does have an RSS feed
That's been gone for years.
>>
>>56662107
Yeah. There was some specific URL you paste the channel's ID after, but sometimes even just viewing source and looking for 'RSS' works. I'll see if I have the URL saved somewhere.
All you need is RSS for that, yeah.
>>
>>56662139
>That's been gone for years.
It's not. It's still there, just not obviously available.
>>
>>56662159
>It's not. It's still there, just not obviously available.
Shit, really? I did a bunch of searching for stuff like that before I wrote the scraper, but I found nothing that still worked.
Do you have any information you could post / link to?
>>
File: foo.png (93KB, 707x337px)
I have some automated betting process going on with a few sports betting sites.
>>
>>56662185
https://www.youtube.com/feeds/videos.xml?channel_id=[HEX-ID]

[HEX-ID] => search for the tag "channel-external-id" on the channel HTML
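Sketch of gluing that together (the channel URL is a placeholder, and the tag name is whatever the channel HTML contains at the time, so this may break):

import re, urllib.request

# hypothetical channel page
page = urllib.request.urlopen("https://www.youtube.com/user/SomeChannel").read().decode("utf-8", "replace")
m = re.search(r'channel-external-id="([^"]+)"', page)
if m:
    feed = "https://www.youtube.com/feeds/videos.xml?channel_id=" + m.group(1)
    print(urllib.request.urlopen(feed).read()[:200])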
>>
>>56662241
this seems cool... is it checking the betting lines on those games?
>>
>>56662185
>>56662366
However, on some channels you can just view source and ctrl+F 'rss'. For example, from the channel of a random video in my recommended videos:
https://www.youtube.com/channel/UCxr2d4As312LulcajAkKJYw
https://www.youtube.com/feeds/videos.xml?channel_id=UCxr2d4As312LulcajAkKJYw

Otherwise you'll have to do it that way. Though right now I'm having trouble finding a channel it doesn't work for, even though it didn't work for a lot of the channels in my RSS before. So it's pretty helpful still, apparently.
>>
>>56660901
if there are existing scraping libs in C like BeautifulSoup, then easy

Otherwise, starting from scratch would be an intermediate task for a new C programmer
>>
>>56662139
>channel RSS feeds are gone
It's still there, I'm using it right now...
>>
>>56660901
I suggest going for libcurl and libtidy.
libtidy comes with a buffer that can be passed to curl_easy_setopt on CURLOPT_WRITEDATA.
But listen to >>56660216
>>
>>56662366
>>56662420
>>56662522
>https://www.youtube.com/feeds/videos.xml?channel_id=UCqbkm47qBxDj-P3lI9voIAw
Alright, I don't know if that was added since I wrote this thing, or if I missed it somehow.
Still, thanks!
>>
>>56660284
>>I used an HTML parser module for Python called BeautifulSoup
>BeautifulSoup is nice, but it's slow as shit. Unless you REALLY need the error tolerance, use lxml.
What the fuck is this and how do I use it?
I'm manually parsing HTML with C right now...
>protip: it just werks