
I am building a search engine


Thread replies: 122
Thread images: 19

File: 1493136569774.jpg (1MB, 2048x1537px)
It is located at wibr.me. I am sick of Google and the boring results it returns, so I'm building my own. I want to rebuild the web to be like it was in the early days, when mostly personal and hobbyist pages existed that were lightweight and easy to read. Please try it out and give me feedback. If you want to submit a page to it, I'd appreciate it. There aren't many pages indexed at the moment.
>>
File: 1491762690407.gif (2MB, 540x501px)
What is your policy on copyright?
Will your system be built on top of any other platform or system?
Will your system use or depend on third-party libraries or systems?
What languages do you use for programming?
Do you use any form of logging in your software?
Can we see the source?
>>
>>60237613
>What is your policy on copyright?
I'm just a one man show so I honestly haven't really considered this. I don't have a policy except to say if it's illegal where I live, I probably wouldn't be able to have it. :(

>Will your system be built on top of any other platform or system?
The LAMP stack. That's it.

>What languages do you use for programming?
The pages are PHP. The web crawler is pure C (for the speed).

>Do you use any form of logging in your software?
I really, really don't want to log people's IPs. That adds a huge layer of complexity and for what purpose? To feed to the government? Screw them. My fear is what would happen if they tried to compel me? This will probably never get any real popularity so I don't think it would be a worry.
>>
File: 1490027717885.jpg (110KB, 640x640px)
>>60237681
Neat. I like the slim 'n cozy feel.
>>
>>60237766
Thanks
>>
bump for interest
>>
>>60237564
are you just not indexing big sites or something?
I searched for wikipedia and got stallman.org
>>
>>60237956
stallman > wikipedia

But in all seriousness I don't think I want to index Wikipedia or other wikis all that much. It's not trying to be like Google, where it will deliver you the answer to the technical question you may search for at the expense of all the fun interesting sites being buried. I could never compete with them on that level anyway. Rather I want this to be more of a search engine for when you have no idea what you're looking for. Maybe you just have a general topic of interest. It's hard to put into words. Kind of just flying by the seat of my pants here.
>>
>>60237956
Also, all the big sites are bloated and full of scripts. They aren't going to make it on this search engine.
>>
>>60238018
This might just be absolutely genius.
>>
>>60237564
results are a little weak
>>
>>60238018
what do you mean don't know what you're looking for? how does it work?
>>
File: Cbadapture.png (5KB, 447x116px)
Its shit
>>
>>60238045
:D
>>60238049
yeah, it will get better as I can improve the code behind it, but also there just aren't many pages indexed.
>>
>>60238070
Say you want to find pages about.. cats. But not a page on say, "cat hair removal". "Cat hair removal" is a technical question that google is best suited for. Of course, as more pages get indexed, this sort of query might actually work.
>>
>>60238093
Sorry :(
>>
>>60238102
but can't you just google "cat"? what does it do differently?

the surprise me feature is pretty good though
>>
>>60238151
What I hope to do, is when you search for 'cat', you will get a bunch of results of people's personal webpages, about their own cats. If you google 'cat', you get SPCA type results, or mainstream news articles, or pages that are mainly profit/ad driven that yield little genuine content. Those are not the type of results I want it to generate.
>>
>>60237564
>tfw almost every search I've done pointed me to stallman.org
Is this some kind of joke

Also, I appreciate your efforts OP, but nothing I've searched for gave me meaningful results.
>>
>>60238177
ah okay makes sense now

you should put a tagline somewhere on the main page that explains that
>>
>>60238180
heheh.
Well come back in a year, I hope to have more pages indexed by then.
>>
>>60238240
Yeah I think I might have to do just that
>>
Searched for 007 and got this, that's the kind of thing you would get on old Google, good job OP.
https://goldeneye007.detstar.com
>>
File: 1483313465864.jpg (24KB, 313x367px)
>>60237564
>one man show trying to build a search engine
>LAMP
>pure C "for speed"

I smell a college sophomore who is trying to spin his data structures final into a half assed product because they couldn't get an internship for the summer.
>>
>>60238263
Yeah those are the pages I need to find among all the crap on the interweb.

>>60238304
I come from the electronics/hardware side of things, long since graduated. SW is a skill I like to explore and keep up with. Maybe it looks half assed to you but I'm happy with the way it's turning out.
>>
I like this feature too

https://wibr.me/surprise/

Not gonna use this for anything I'm really trying to find out, but for finding things you don't know you want to know it seems to work great.
>>
>>60238425
>tfw this is the first website I found
http://motherfuckingwebsite.com/
Seems like I'll be using this often
>>
>>60238425
>>60238546
Glad you like it. Only way to keep that interesting is to keep the same kind of requirements for what gets indexed.
>>
>>60238304
I smell a retard.
>>
>>60238425
That took me to
>http://www.angelfire.com/trek/caver
>>
>>60238425
http://vxheaven.org/
Neat
>>
>>60238830
Interesting stuff
>>
yeah i'm liking the surprise feature. feels like late 90s, early 2000s. but then i'm just using it as a stumbleupon which was/is already a thing to go to random pages.
>>
>>60237564
Give sites that use non-free javascript lower search priority

Blacklist sites which abuse search hits for clicks.
For instance, sites that create empty pages for certain keystrokes like jiskha.com or chegg.com which puts everything behind a paywall (advertising).
>>
File: screebsgitty.png (5KB, 607x338px)
>>60237564
Need some work
>>
>>60239309
also let us submit sites to blacklist

you can start with twitter, youtube, google, wikipedia, facebook, reddit and 4chan.

Although some of these are indeed useful, they do not need to appear in search results.
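
A minimal sketch of the blacklist idea, in C to match the crawler. It is an assumption about how such a check could look, not anything OP has described: the function name `is_blacklisted` is made up, the top-level domains are guessed from the site names listed in this post, and hostnames are assumed to be lowercased already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical blacklist built from the sites named in the post.
 * The exact domains (.com/.org) are an assumption. */
static const char *const BLACKLIST[] = {
    "twitter.com", "youtube.com", "google.com", "wikipedia.org",
    "facebook.com", "reddit.com", "4chan.org",
};

/* Returns 1 if host equals a blacklisted domain or is a subdomain of one,
 * 0 otherwise. Expects a lowercase hostname without scheme or path. */
int is_blacklisted(const char *host)
{
    assert(host != NULL);
    size_t hlen = strlen(host);
    for (size_t i = 0; i < sizeof BLACKLIST / sizeof BLACKLIST[0]; i++) {
        size_t blen = strlen(BLACKLIST[i]);
        if (hlen < blen)
            continue;
        const char *tail = host + (hlen - blen);
        if (strcmp(tail, BLACKLIST[i]) != 0)
            continue;
        /* Exact match, or the match starts right after a dot. */
        if (hlen == blen || tail[-1] == '.')
            return 1;
    }
    return 0;
}
```

Matching on the dot boundary means `en.wikipedia.org` is caught while an unrelated host that merely ends in the same letters, such as `notreddit.com`, is not.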
>>
You should make a search engine that ignores the top 1000 sites.
>>
>>60239309
>>60239368

These sites will never be able to be indexed on here. They are all retroactively banned I tell you.

>>60239340
You can help me with that...
>>
>>60239414
>These sites will never be able to be indexed on here. They are all retroactively banned I tell you.
Except the chans
>>
>>60237564

What an awful place to work. Are these tools even on the spectrum? What is privacy? What is personal space? Fucking disgusting
>>
>>60239718
rude
>>
File: 1471949190434.png (58KB, 842x456px)
Maybe it's a good thing.
>>
>>60239414
ultimately if you blacklist enough crap and set up filters, you could create a fairly neat search engine without having to manually approve each address.
>>
>>60238304
>I smell a college sophomore who is trying to spin his data structures final into a half assed product because they couldn't get an internship for the summer.

fucking savage I love it
this is so true

any lone man trying to create a search engine is either delusional, or a sophomore in uni
>>
>>60239810
>learn linux
>learn php
>learn sql
>learn how search engines work
>improve coding skills
>can leverage new skills to get better job

hurr durr totally delusional.
>>
http://www.bbspot.com/News/2000/5/clock_rift.html

Now this is the type of content I want on /g/
>>
Your algorithm decides which sites to include by how much CSS they use and the general size of images I suppose?
I think it's cool but how do you make money? Or rather: Why should I use your search engine instead of just keep using google?
You're literally trying to make a separate web happen.
Yes technically every search engine is a separate web that allows you to peek into different parts of the internet.
But which sites will get a higher pagerank than other sites? Those that use less CSS? So a site that could be seriously popular might only show up on page 3 because it is too 'modern' or too corporate?
>>
>>60237564
Anon, be careful. One day, you *will* index CP. This could be a life ruining event for you. You are not Google and do not have over 9000 jewish lawyers on your payroll.
>>
>>60239878
>linking to illegal material is now illegal
>???
>>
>>60239869
>>Your algorithm decides which sites to include by how much CSS they use and the general size of images I suppose?

This I want to know, How does this algorithm work that filters out 'big' websites? Or do you possibly just go through your web crawler's results yourself?
>>
>>60239907
Go set up a CP link site. Lemme know how that works out for you.

Yes, the law is retarded.
>>
>>60239869
I'm not trying to replace or supplant google. This is as I said earlier up the thread, a totally different kind of approach. Google wants to index the entire web, and deliver you the exact answer you are looking for related usually to a technical question. Whereas I want my search engine to be more of a stroll in a cozy village. You wanted to go for a walk, you weren't sure exactly what you will run into, and you end up seeing some interesting things along the way. The sites also will fit a certain aesthetic criteria. Those which do not fit the criteria are not indexed.

Yes it's strange. It's not going to impress most people. Just a niche. But that's how it works. I'm fine with it.

The problem with Google indexing literally everything is that you end up with many awful, cancerous pages. They are primarily ad driven, click bait type pages that make your modern computer sweat. Or they are boring as hell associations, corporate landing pages, mainstream news articles, etc. The gem type pages I am looking for get lost and crowded out by all the other filth.
>>
>>60239869
>>60239992

I'm not sure I want to explain how that part works :D I will say however, that only the page you submit will get indexed. Thought about crawling other sub-pages and decided against it. It relies entirely on user submissions (or me finding the pages myself and submitting them).
>>
>>60237564
It's great that you're building your own search-engine because most search-engines are heavily censored and monitored.

There is one alternative that may interest you, http://yacy.net/

YaCy is P2P based and it's written in Java. It really isn't very good. I've been running it for years anyway, though.

You should probably look into that and think long and hard about how you want to be sorting your index.

YaCy has a ton of pages in its index, and since it's P2P the entire network "knows" about a large portion of the Internet. Where this project really breaks down is its ability to sort search-results: it doesn't. At all. It's almost like the code which supposedly ranks search-results simply pulls /dev/random and outputs the pages in that order.
>>
Thanks for the feedback yall, time for bed.
>>
>>60240050
I hate the name. Try something more like "frontier.web". It's more meaningful to what it seems like you're trying to do, and I'm pretty sure that people are sick of meaningless names, especially if you're trying to distance yourself from the existing Silicon Valley circle jerk.
>>
>>60240152
alright will check that out and thanks for the feedback. night
>>
>>60240050

> The sites also will fit a certain aesthetic criteria. Those which do not fit the criteria are not indexed.

yeah but sites change their design when they get more popular. So at some point you may be removing sites from your search engine which became popular due to your search engine?
that's a funny scenario
>>
>>60240186
oh well :)
>>
File: hipsterglasses.jpg (13KB, 400x167px)
>>60240186
>more popular
I think you mean, "too mainstream".
>>
This is really neat anon. I bookmarked this.
>>
Any plans to monetize it yet? If so, how?
>>
>>60240206

look, if you want actual advice: Make a search engine that *only* finds high performance websites. You ping them, if they respond and load well, you include them in search results.
You can start by just requiring sites to be under lets say 250kb or 500kb and go from there to more features.
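
A rough sketch of what such a size gate could look like, in C to match the crawler. This is not OP's actual criteria (OP declines to describe them later in the thread); `is_lightweight` and `count_occurrences` are made-up names, the script count is an extra assumption borrowed from the "full of scripts" complaint, and only the HTML itself is measured, not images or stylesheets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in hay (case-sensitive). */
size_t count_occurrences(const char *hay, const char *needle)
{
    assert(hay != NULL && needle != NULL);
    size_t len = strlen(needle);
    size_t count = 0;
    if (len == 0)
        return 0;
    for (const char *p = strstr(hay, needle); p != NULL;
         p = strstr(p + len, needle))
        count++;
    return count;
}

/* Returns 1 if the HTML is at most max_bytes long and contains at most
 * max_scripts "<script" opening tags, 0 otherwise. */
int is_lightweight(const char *html, size_t max_bytes, size_t max_scripts)
{
    assert(html != NULL);
    if (strlen(html) > max_bytes)
        return 0;
    return count_occurrences(html, "<script") <= max_scripts;
}
```

With the numbers from the post, a call would be something like `is_lightweight(page, 250 * 1024, 0)`. As OP points out in the reply, a check like this alone would also let parked domain pages through, so it can only be a first filter.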
>>
>>60240166
I'm not sure how much I like it either, but it's short and easy to remember. Also, unfortunately most of the premium English domains are taken, so you have to resort to some kind of weird new word.
>>
>>60240228
A lightweight, minimal ad to the right of the search results will one day be put in, is the plan anyway.

>>60240240
So every page taken by a domain shark will get indexed then :). They all have responsive, lightweight pages. You need an extra level of QC.

fuck i need to go to bed but can't stop
>>
>>60240264
> So every page taken by a domain shark will get indexed then

mate then add your own level of QC. But you need some kind of actually objectively useful draw other than "this search engine only has sites I like"
>>
>>60240283
As I said, hobbyist and personal web pages that are lightweight. That's the main criteria.
>>
bump'd

I submitted some websites
>>
File: asdf.png (20KB, 614x211px)
>>60237564
>>
>>60238546
>http://motherfuckingwebsite.com
lmfao
>>
Good work OP! i will submit some websites
>>
File: 1455294753473.png (212KB, 582x571px)
I like the idea, OP, it's like a niche stumbleupon. I'm somehow getting addicted to it.

>>60240555
nice
>>
>>60237681
>I really, really don't want to log people's IPs. That adds a huge layer of complexity
i dont think you know how to program
>>
That fucking surprise hahhahah
>>
>>60237564
congratulations! unlike front end guys, you are really an engineer.
>>
>surprise me
>home.mcom.com/home/welcome.html

FUCK
>>
File: 123.png (158KB, 1916x972px)
>>60237564
Does it even use keywords?
Looks like completely random redirect from yahoo search. If it does not find shit just make it say "sorry nothing relevant found onii-chan"
>>
>>60240006
Bullshit, i highly doubt every "young models/ family nudism/ lolita" site that links to literal cp is run by some pro. Some of them are obviously amateurs and only half are Russian.

OP, put a disclaimer on the site and have a report feature, that's all you really need.
>>
>>60238425
I got http://www.omfgdogs.com/
before I read the other comments I thought, this just brought you to a random but useless site
>>
File: lolnoob hackerman.png (910KB, 1024x1024px)
>surprise me
>cia.gov
>>
File: Hitler.jpg (13KB, 289x292px)
Awesome site, OP. Surprise feature a fucking best.
>>
File: hqdefault.jpg (17KB, 480x360px)
>>60245615
>surprise me
>heavensgate.com
>>
File: Screenshot_2017-05-06-10-26-02.png (213KB, 720x1280px)
>>60237564
>Surprise me
Ok, that's kinda funny m8. But the regular search makes no sense, the few things it brings back don't even have the same words in them.
>>
>>60244785
I can improve upon it, but honestly for a technical question like the one you are asking, you should probably stick with Google.
>>
>>60244893
>OP, put a disclaimer on the site and have a report feature, that's all you really need.
Will look into that! Thanks.
>>
>>60246695
Yeah I think that will improve as more pages get indexed.
>>
>>60237613
>What is your policy on copyright?
I have thought about creating a search engine too, and this was actually one of the biggest questions on my mind: the legality of running automated bots to spider people's sites. I don't want to end up in rape you in the ass prison because of some bullshit laws. Remember what they did to that kid at MIT, threatening him with like 50 years in prison for automatically downloading papers that were available through the campus network?
>>
>>60246686
>hqdefault.jpg

Question: did u make your image that?
>>
search: cats
https://www.fimfiction.net/
FUUUUUUCK
REEEEE
>>
is it actually using the words you input in the "search" OP ?
>>
>>60247923
yes, but there aren't a lot of pages in this yet so it doesn't have much to work with right now
>>
>>60237564

What is the user agent of your crawler and how does it work?

Do you read robots.txts? IMO you shouldn't because fuck 'em
>>
>>60240226
>>60243267
>>60244336
>>60244635

Thanks anons :)

>>60248304
User Agent? Sorry I'm not sure what you mean but I'll try to answer. I'm using libcurl to download the submitted page. Then it goes to an HTML parser that I wrote to separate the text from the HTML. Then it gets put into the db. There were some off-the-shelf solutions for a web crawler and even a full search engine, but they are often just as hard to figure out as building your own.

>robots.txts
I don't read it but if the page has "noindex", it won't get indexed.
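
A rough sketch of the two steps described in this post, in C like the crawler: pulling the visible text out of downloaded HTML, and skipping a page that carries a robots "noindex" meta tag. This is not OP's parser, which has not been shown. `strip_tags`, `find_ci` and `has_noindex` are made-up names, and the matching is deliberately naive: no entity decoding, no special handling of script or style bodies, and no HTML comments.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy html into out without the tags. Tags and runs of whitespace are
 * collapsed to a single space; leading/trailing spaces are dropped.
 * Returns the number of bytes written, not counting the terminator. */
size_t strip_tags(const char *html, char *out, size_t out_size)
{
    assert(html != NULL && out != NULL && out_size > 0);
    size_t n = 0;
    int in_tag = 0;
    for (const char *p = html; *p != '\0' && n + 1 < out_size; p++) {
        char c = *p;
        if (c == '<') {
            in_tag = 1;
            c = ' ';                /* a tag boundary separates words */
        } else if (in_tag) {
            if (c == '>')
                in_tag = 0;
            continue;               /* skip everything inside the tag */
        }
        if (isspace((unsigned char)c)) {
            if (n > 0 && out[n - 1] != ' ')
                out[n++] = ' ';
        } else {
            out[n++] = c;
        }
    }
    while (n > 0 && out[n - 1] == ' ')
        n--;
    out[n] = '\0';
    return n;
}

/* Case-insensitive search for needle within the first len bytes of hay. */
static const char *find_ci(const char *hay, size_t len, const char *needle)
{
    size_t m = strlen(needle);
    if (m == 0 || m > len)
        return NULL;
    for (size_t i = 0; i + m <= len; i++) {
        size_t j = 0;
        while (j < m && tolower((unsigned char)hay[i + j]) ==
                        tolower((unsigned char)needle[j]))
            j++;
        if (j == m)
            return hay + i;
    }
    return NULL;
}

/* Returns 1 if any <meta ...> tag mentions both "robots" and "noindex". */
int has_noindex(const char *html)
{
    assert(html != NULL);
    const char *p = html;
    size_t left = strlen(html);
    const char *tag;
    while ((tag = find_ci(p, left, "<meta")) != NULL) {
        const char *end = strchr(tag, '>');
        if (end == NULL)
            break;
        size_t tag_len = (size_t)(end - tag);
        if (find_ci(tag, tag_len, "robots") && find_ci(tag, tag_len, "noindex"))
            return 1;
        left -= (size_t)(end + 1 - p);
        p = end + 1;
    }
    return 0;
}
```

A crawler along these lines would call `has_noindex` on the raw download first and only run `strip_tags` and the database insert when it returns 0.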
>>
>>60248185
cool just checking, glad to know you're awake, hope you slept well :3
>>
I searched for "dog" and got "puppy linux" as the first result. I like it.
>>
>>60238018
Please continue development. This could be very valuable. Most might not get it but you're right... Searching for interesting things when you don't really know what you're looking for is actually a very useful thing.
>>
>>60248416

This is what it looks like to the site admin when you crawl a website

172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET / HTTP/1.1" 200 1088 "-" "-"
172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET /posts/post4.html HTTP/1.1" 200 2923 "-" "-"
172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET /posts/post1.html HTTP/1.1" 200 2146 "-" "-"


This is what another random bot looks like

100.43.81.141 - - [06/May/2017:08:16:33 -0600] "GET /contact.html HTTP/1.1" 200 741 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"


the difference is you have no user agent.
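
For anyone who wants to pick the user agent out of log lines like these, here is a small C sketch. It assumes the lines are in the common "combined" access log layout, where the last double-quoted field is the User-Agent; `log_user_agent` is a made-up name, and user agents containing escaped quotes are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the last double-quoted field of an access log line into out.
 * In the combined log layout that field is the User-Agent, and "-"
 * means the client sent none.
 * Returns 0 on success, -1 if no quoted field is found. */
int log_user_agent(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL && out_size > 0);
    const char *close = strrchr(line, '"');
    if (close == NULL || close == line)
        return -1;
    /* Walk back from the closing quote to the matching opening quote. */
    const char *open = close - 1;
    while (open > line && *open != '"')
        open--;
    if (*open != '"')
        return -1;
    size_t len = (size_t)(close - open - 1);
    if (len >= out_size)
        len = out_size - 1;         /* truncate rather than overflow */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

Run against the lines above, the crawler's requests come back as `-` while the other bot's line yields its full `Mozilla/5.0 (compatible; YandexBot/3.0; ...)` string.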
>>
>>60248498
Never do sleep well but that helps when you have projects to work on.

>>60248646
Will do!
>>
>>60248824
Ahhh!, interdasting
>>
>>60248824

curl_easy_setopt(curl, CURLOPT_USERAGENT, "WibrBot");


I haven't used libcurl and just grabbed this example off the web
>>
>>60248924
Well that looks like an easy fix. Thanks I appreciate you helping me with that!
>>
>search for an album
>get Reverse engineering the 76477 "Space Invaders" sound effect chip from die photos

can't complain
>>
>>60237681
>The LAMP stack. That's it.
>search engine

LMAO
>>
How often do you plan on working on this? I'd be interested in keeping up with development. It's been a while since /g/ has spit out something cool.
>>
>>60244372
>The LAMP stack
He's got no chance.
>>
>>60251659
>>60252114
k
>>
>>60251894
I'll keep at getting more and more sites. Probably will never stop. It's been pretty fun.
>>
Open source it.
>>
>>60237681
>The web crawler is pure C (for the speed).
lolololololol
that makes no sense
the code isn't going to be the bottleneck when webcrawling
>>
>>60253901
Oh. Should I have used Rust?
>>
>>60237564
>wibr.me
It's fun. I like the surprise feature
>>
File: anonwebsearch.png (129KB, 1705x939px)
>>60237564
>was interested
>went to site
>search basic bitch "water cooling computer"
>pic related
get your shit together anon
>>
>>60253973
clojure, obviously
>>
>>60254129
I think it's just trying to tell you that you have your priorities out of sorts. Should be focusing on beer instead. But in all seriousness, I don't have any pages about watercooling your computer that are indexed.
>>
File: Screenshot_20170507-082400.png (265KB, 1057x1724px)
Well whoa, seems like you really do like stallman.org, Anon. It's turning up for most searches.
>>
Don't host or link to images.
Your web crawler will inevitably pick up on child pornography, which you will be obliged by law to handle and report to the FBI the moment you spot it.
>>
I searched "ass". First result: http://motherfuckingwebsite.com/
>>
Hey this is vulnerable to xss if you use "> as a bypass plz fix
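
The usual fix for this kind of bug is to escape the query before echoing it back into the page. OP says the pages are PHP, where the built-in `htmlspecialchars()` does that job. The sketch below shows the same idea in C only to stay in the crawler's language; `html_escape` is a made-up name and nothing here is known about how the site actually prints the query.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write an HTML-escaped copy of in to out, so that characters such as
 * '"' and '>' cannot close an attribute or open a new tag.
 * Returns 0 on success, -1 if out is too small (out is then emptied). */
int html_escape(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL && out_size > 0);
    size_t n = 0;
    for (const char *p = in; *p != '\0'; p++) {
        char one[2] = { *p, '\0' };
        const char *rep;
        switch (*p) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#39;";  break;
        default:   rep = one;      break;
        }
        size_t len = strlen(rep);
        if (n + len >= out_size) {  /* keep room for the terminator */
            out[0] = '\0';
            return -1;
        }
        memcpy(out + n, rep, len);
        n += len;
    }
    out[n] = '\0';
    return 0;
}
```

A query beginning with `">` then comes out as `&quot;&gt;`, which the browser shows as text instead of treating it as the end of the input tag.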
>>
>>60237564
This open office gives me the fucking chills.
Enjoy not being able to concentrate on anything worthwhile, hipster fucks.
>>
File: literally-nothing.png (9KB, 1598x760px)
>>60237564
>wibr.me
>>
bump, good luck.
maybe i can use your engine as a secondary one after duckduckgo to just look up interesting things


This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.