Alright /g/
I need some ideas to develop a big data system for my final year project
Ideally it should be something that integrates multiple sources of data (even in different formats) and after applying different machine learning algorithms produces some insights from that data
I can't think of a problem or data sets I could use for this
Anyone have any ideas?
>>56956453
1. write a webcrawler for 4chan
2. get posts from /g/, /v/, /pol/, /b/ and /mlp/ over one month
3. map/reduce posts to topics/badwords/..
4. do some analytical circlejerking with some figures
5. come back here and post results
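A minimal sketch of step 3, assuming the crawler already hands you (board, text) pairs. The sample posts, the BADWORDS set and the function names are all made up for illustration. On a real cluster the same map and reduce functions would run in Hadoop or Spark instead of plain Python:

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: (board, post text) pairs a crawler might yield.
POSTS = [
    ("g", "install gentoo you shill"),
    ("v", "this game is shit"),
    ("g", "shit thread, shill harder"),
]

# Hypothetical word list; swap in whatever you want to count.
BADWORDS = {"shit", "shill"}

def map_post(post):
    """Map step: turn one post into partial counts keyed by (board, word)."""
    board, text = post
    words = (w.strip(".,!?").lower() for w in text.split())
    return Counter((board, w) for w in words if w in BADWORDS)

def reduce_counts(a, b):
    """Reduce step: merge two partial counts into one."""
    return a + b

def count_badwords(posts):
    """Run the map step over every post, then fold the results together."""
    return reduce(reduce_counts, map(map_post, posts), Counter())

if __name__ == "__main__":
    for (board, word), n in sorted(count_badwords(POSTS).items()):
        print(f"/{board}/ {word}: {n}")
```

Keeping the map step per-post and the reduce step associative is what lets you split the work across machines later.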
>>56956453
do like a german anon did once
get all 4chan posts/information and try to separate them by user, using writing patterns and shit like that
>>56957010
Source, plox.
>>56956453
Nobody will care about what exactly the result of that is, right? Just about the data system?
Shit, put some data from a US or Swiss or whatever statistics bureau into Apache Spark / Hadoop shit and wrangle it by state / canton.
Not big enough data? Grab wikipedia and try to learn what was the cause for most reverts in the changelog, at what time of day it was done, who did it, yadda yadda. Or whatever.
Or try to classify the pictures and comments posted on some fucking social network.
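For the Wikipedia revert idea, a rough sketch of the "what time of day, who did it" part. The rows below are invented and only shaped like something you might pull out of a revision history dump. The is_revert check is a crude guess based on edit summaries, not how Wikipedia itself flags reverts:

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows; real ones would come from a revision history dump.
REVISIONS = [
    {"user": "A", "timestamp": "2016-10-01T03:15:00", "comment": "Reverted edits by B"},
    {"user": "C", "timestamp": "2016-10-01T03:40:00", "comment": "rv vandalism"},
    {"user": "B", "timestamp": "2016-10-01T14:05:00", "comment": "added citation"},
    {"user": "A", "timestamp": "2016-10-02T14:30:00", "comment": "Undid revision 123"},
]

def is_revert(comment):
    """Crude heuristic (an assumption): common revert phrasing in the edit summary."""
    return comment.lower().startswith(("revert", "rv ", "undid"))

def reverts_by_hour(revisions):
    """Count reverts per hour of day (0-23)."""
    hours = Counter()
    for rev in revisions:
        if is_revert(rev["comment"]):
            hours[datetime.fromisoformat(rev["timestamp"]).hour] += 1
    return hours

def reverts_by_user(revisions):
    """Count how many reverts each user made."""
    return Counter(r["user"] for r in revisions if is_revert(r["comment"]))

if __name__ == "__main__":
    print(reverts_by_hour(REVISIONS))
    print(reverts_by_user(REVISIONS))
```

The same two group-bys map directly onto a Spark or Hadoop job once the data stops fitting on one box.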
>>56956453
Try weather prediction, you have massive amounts of weather data to train on, just look into the Yahoo Weather API.
>>56957114
They even tracked the weather?
Those motherfuckers..
>>56956453
try using the EM clustering algorithm to diagnose diseases in patients. I was going to do the same thing for my master's but unfortunately I chose to work in actuarial models (more $$).
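A toy sketch of what EM clustering looks like, assuming a single made-up measurement per patient (body temperature) and exactly two Gaussian components. The data, the function names and the min/max initialisation are all illustrative assumptions. Keep in mind the clusters are just groups, turning them into a diagnosis would still need labels or a doctor:

```python
import math

# Hypothetical body temperatures in Celsius, one per patient.
TEMPS = [36.5, 36.8, 37.0, 36.6, 39.5, 40.1, 39.8, 40.3]

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    # Crude initialisation: put the two means at the min and max of the data.
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in xs:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances from those responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances, resp

if __name__ == "__main__":
    weights, means, variances, resp = em_two_gaussians(TEMPS)
    print("means:", means)
    print("hard assignments:", [round(r[1]) for r in resp])
```

For real patient data with many features you would use a multivariate mixture from a library rather than rolling your own.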
>>56957114
Wouldn't do this for a final year project.
It won't work well and in most instances you'll just get poor grades for that.
Usually you just do some damn shit nobody else has done (too) much and get some result.
If you're good like that, how about you instead try to find possible (past) rivers and lakes and so on from topographical data. Or find sites where humans mined minerals or had stone quarries.
My suggestion: Get a shitload of images or audio files and train a sparse autoencoder on them.
Next, using the vector representation you got, you can create an image/audio search engine using cosine or some other similarity measure on the feature space you created.
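The search half is simple once you have the vectors. A minimal sketch, assuming the autoencoder already gave you one feature vector per file. The file names and the 3-dimensional vectors are invented placeholders, real codes would be much longer:

```python
import math

# Hypothetical index: file name -> feature vector from the trained encoder.
INDEX = {
    "cat_01.jpg": [0.9, 0.1, 0.0],
    "cat_02.jpg": [0.8, 0.2, 0.1],
    "car_01.jpg": [0.0, 0.1, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=3):
    """Return the top_k (name, score) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

if __name__ == "__main__":
    # The query vector would be the encoder output for the query image.
    for name, score in search([1.0, 0.0, 0.0], INDEX, top_k=2):
        print(f"{name}: {score:.3f}")
```

This brute-force scan is fine for a few thousand files. Past that you would want an approximate nearest neighbour index.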
>>56957148
Why would you use an unsupervised learning technique for a classification task?
>>56957032
don't have it, read about it here a few months back, the guy even said that the German NSA contacted him about it
>>56957596
Oh shit, if they are able to track people down by their way of posting.. I don't like this idea.
BigData is cool, but why are all BigData applications evil deeds?