I am building a search engine

Thread replies: 21
Thread images: 4

Located at wiby.me. I made a thread a couple of months ago; this will be my final update. I want to rebuild the web to be like the 90s / early 00s, when pages weren't so bloated and were based more on a subject of interest than on making money. I am trying to gather as many of these kinds of pages as I can, so if you know any, please submit them.
It used to be called wibr, but I changed it to wiby because it's easier to sound out. The search engine will now crawl indexed pages on a weekly basis to keep them updated.
Some server info: This is a LAMP server hosted on a VPS with an SSD. I wrote the crawler and update scheduler in C. That's about 60 KB compiled and does the job well. The site is usually pretty snappy, but there aren't many users either.
Anyhoo, I will keep trying to find old-school style pages to keep it growing even if there aren't many submissions, although I've had a good number of contributions.
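For anyone curious what a weekly recrawl check can look like, here is a rough sketch in Python. It is not the actual C scheduler; the function name, the 7-day constant, and the dictionary of last-crawl times are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed interval: "weekly" is taken to mean 7 days.
RECRAWL_INTERVAL = timedelta(days=7)

def pages_due(last_crawled, now):
    """Return URLs whose last crawl is at least a week old, oldest first.

    last_crawled is a hypothetical mapping of URL -> datetime of last crawl.
    """
    due = [
        (when, url)
        for url, when in last_crawled.items()
        if now - when >= RECRAWL_INTERVAL
    ]
    # Sorting the (timestamp, url) pairs puts the stalest pages first.
    return [url for when, url in sorted(due)]
```

A scheduler would call something like this on a timer and hand the returned URLs to the crawler.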
>>
bump
Can't provide anything since I'm a babby but I'm rooting for you anon
>>
File: 1493758636838.jpg (71KB, 639x480px)
>>61608945
>>61608971
Same. good luck mane
>>
>>61608945
http://prog.ide.sk/pas.php is a nice page for learning the basics of the Pascal language. It's in Hungarian though.
>>
Ubuntu server, eww
>>
>>61608945
Keep posting, I'll gladly provide cool pages to index.
http://www.catb.org/esr/
http://ifs.nog.cc/
http://wwwtxt.org/tagged/txt

bonus:
http://txti.es/
>>
>>61608945
does it werk without js
>>
File: 1501025220657.jpg (106KB, 800x600px)
>>61608945
http://cpuville.com/

This is a nice page about a DIY CPU a guy built out of transistors.
>>
>>61611734
>>61608945
And here's another DIY CPU, this guy made his in the 1970s out of salvaged parts.
http://members.iinet.net.au/~daveb/simplex/simplex.html

There's a whole plethora of pages about homebrew CPUs in this thing called the Homebrew CPUs Webring. What's even better is that they are all comfy web 1.0 pages.

Ima go submit a few to your engine.
>>
>>61608971
>>61609028
Thanks g/ents
>>61609098
Indexed
>>61609412
200 MB RAM usage
>>61610454
Yes
>>61610414
>>61611734
Appreciate you providing all those quality pages, thanks!
>>
>>61611793
Thanks!!!!
>>
>>61611827
This is probably the funniest thing I've read on any of the Homebrew CPUs Webring pages:

An encore for Magic-1?

Shortly after I declared Magic-1 "hardware complete", I casually mentioned to my wife that I was starting to think about Magic-2. Her response was swift, and final:

"No, there will be no Magic-2!"

I can't blame her. She was an extraordinary good sport during Magic-1's design and construction - especially during the wire-wrapping phase. For most of a year, she put up with electronic junk littering the kitchen table, wire-wrap insulation fragments on the floor and a husband often lost in concentration while the kids were hollering for attention.

She's the love of my life, the woman I plan on growing old with, mother of my children, my partner and best friend. I have to respect her wishes on this.

So, there will be no Magic-2.

Instead, we'll call the follow-on project "Magic-16".

http://www.homebrewcpu.com/magic-16.htm
>>
File: 1476415300409.jpg (48KB, 500x393px)
>>61611871
> I have to respect her wishes on this.
>So, there will be no Magic-2.
>Instead, we'll call the follow-on project "Magic-16".
>>
>>61608945
I happen to do search for a living, so I'm curious about your tech.

* What analysis/normalization do you apply to text?
* How do you rank? Classic TF/IDF, PageRank-style, ...?
* How do you deal with different languages?
* How do you handle semantics and synonyms?
* What user input error correction mechanisms do you employ?

I'm happy for any info, and would be glad to share experience.
>>
why not
http://textfiles.com/

it has
textfiles.com/sex/fuckdead.txt
>>
>>61608945
suckless
cat-v
9p.io
>>
>>61612233
>* What analysis/normalization do you apply to text?
Whatever is employed within the realm of the SQL full text search
>* How do you rank? Classic TF/IDF, PageRank-style, ...?
No page ranking. There are no cancerous pages on wiby worth suppressing.
>* How do you deal with different languages?
This will only be an issue for the 'surprise me' feature. Eventually I'll have to automatically determine the language of the page submitted, and then include/remove those for the surprise feature depending on what region of the world you live in based on IP. English pages will always be included however.
>* How do you handle semantics and synonyms?
Is a one man show here, so I'm not handling this. Hopefully you will find a page with a word match somewhere.
>* What user input error correction mechanisms do you employ?
None. Only checking for security h4x. Users will have to know how to spell I guess.
>>
>>61612578
>>* What analysis/normalization do you apply to text?
>Whatever is employed within the realm of the SQL full text search
Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.

>>* How do you rank? Classic TF/IDF, PageRank-style, ...?
>No page ranking. There are no cancerous pages on wiby worth suppressing.
Ranking is not about "suppressing" things, but about putting more relevant content first. TF/IDF is quite useful because it scores a page higher when a query term shows up often in that page but rarely across the rest of the index, which gets you decent relevance for very little effort. If you're not doing that, what does determine the order of results?
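To make the TF/IDF idea concrete, here is a minimal sketch in Python. It assumes whitespace tokenization and a smoothed IDF (one common formulation among several); the function names and sample pages are invented for illustration and say nothing about how wiby or any particular RDBMS scores results.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive assumption: lowercase and split on whitespace only.
    return text.lower().split()

def tf_idf_rank(query, docs):
    """Return indices of matching docs, highest TF-IDF score first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many docs each term appears at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scored = []
    for i, tokens in enumerate(tokenized):
        if not tokens:
            continue
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            # Smoothed IDF stays positive even if every doc has the term.
            idf = math.log((1 + n) / (1 + df[term])) + 1
            score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scored.append((-score, i))
    # Negated score sorts best first; the index breaks ties.
    return [i for _, i in sorted(scored)]
```

With pages like "homebrew cpu built from transistors" and "cpu cpu cpu design notes", a query for "cpu" puts the second one first because the term makes up a larger share of that page.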

>>* How do you deal with different languages?
>This will only be an issue for the 'surprise me' feature.
I don't think so. Multi-language capabilities of RDBMSes are currently rather limited as far as I know. So systems I've worked with before employ different ways of processing different languages, because languages have different stopwords and **very** different ways of stemming.

>Eventually I'll have to automatically determine the language of the page submitted, and then include/remove those for the surprise feature depending on what region of the world you live in based on IP. English pages will always be included however.
I don't like the idea of limiting content based on the physical location. Think about people traveling, or VPN users. Maybe the browser's locale would be a better approach.
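The browser locale normally arrives as the Accept-Language request header, so a rough Python sketch of using it could look like this. The function names, the set of available languages, and the English fallback are assumptions for illustration; wildcard tags are not handled.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into lowercase tags, best first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # A tag without a q value counts as q=1.
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality <= 0:
            continue  # q=0 means "not acceptable".
        # Negated quality sorts best first; position keeps header order on ties.
        prefs.append((-quality, position, tag.strip().lower()))
    prefs.sort()
    return [tag for _, _, tag in prefs]

def pick_language(header, available, default="en"):
    """Pick the first preferred primary language that the index has."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]
        if primary in available:
            return primary
    return default
```

Unlike an IP lookup, this follows the user when they travel or sit behind a VPN, and falling back to English matches the plan to always include English pages.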
>>
>>61612578
cont.

>>* How do you handle semantics and synonyms?
>Is a one man show here, so I'm not handling this. Hopefully you will find a page with a word match somewhere.
That's probably worth looking at. It's been a very important topic for me as an implementer, and a huge point of frustration as a user, often having trouble to find the right word to look up things. Not finding something because of wrong words and giving up then is a very annoying experience.

>>* What user input error correction mechanisms do you employ?
>None. Only checking for security h4x. Users will have to know how to spell I guess.
Also worth looking at. Simple fuzzy search based on permutations up to a certain Levenshtein distance can already get you most of the way.
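A minimal sketch of that in Python, under one simplification: instead of generating permutations of the query, it compares the query word against a known word list and keeps anything within a maximum edit distance. The names and the default threshold of 2 are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(
                prev[j] + 1,         # delete ca
                cur[j - 1] + 1,      # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance of word, closest first."""
    scored = [(levenshtein(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_distance]
```

The vocabulary could simply be the set of words already in the index, so a misspelled query gets mapped to something that will actually return results. Scanning the whole list per query is fine for a small index but would need a smarter structure for a large one.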
>>
>>61612938
>Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.

You'll have to excuse me because I'm just flying by the seat of my pants with all of this. I'm not sure what you're asking.

>what does determine the order of results?

It's prioritizing matches in titles first, followed by matches within the body, using the full text search algorithm.
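Roughly, that ordering can be pictured with the Python sketch below. It is a stand-in only: it uses plain whole-word matching instead of a real SQL full text search, and the page dictionaries and function name are hypothetical.

```python
def search(query, pages):
    """Return URLs with title matches first, then body-only matches."""
    terms = query.lower().split()
    if not terms:
        return []
    title_hits, body_hits = [], []
    for page in pages:
        title_words = set(page["title"].lower().split())
        body_words = set(page["body"].lower().split())
        if all(t in title_words for t in terms):
            title_hits.append(page["url"])
        elif all(t in body_words for t in terms):
            body_hits.append(page["url"])
    # Title matches always outrank body-only matches.
    return title_hits + body_hits
```

Within each group the pages keep their stored order here, since no further relevance score is involved.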

>because they have different stopwords, and **very** different ways of stemming.

Hwelp, I don't know what to say. I'm hoping the black box that is SQL can handle it. Would have to do a lot more research on these subjects.

>Maybe the browser's locale would be a better approach.

I shall remember this when I do eventually have to tackle it.

>It's been a very important topic for me as an implementer, and a huge point of frustration as a user, often having trouble to find the right word to look up things. Not finding something because of wrong words and giving up then is a very annoying experience.

Yes, I will try to improve the search algorithm over time. Though I prefer to stick to simple, brutish solutions since it's more of a hobby project for me at the moment.
>>
>>61613229
>>Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.
>You'll have to excuse me because I'm just flying by the seat of my pants with all of this. I'm not sure what you're asking.
The question was basically what database software you're using.

In any case, is there any possibility to get a dump of your index in a generic format (CSV, JSON, whatever) to play around with? I feel like trying some things to improve search quality by running it through some specialized setups, and I'm too lazy to write a crawler and curate some sites.


This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.