Any machine learning / stats people here?
I need to fit a curve by manipulating 3 variables. The process that calculates my guesses is expensive to run (~45 minutes), so I would like to minimize the number of guesses I make. Are there statistical methods that can give me the optimal numbers to try?
This *is* the problem.
I am not going to describe the underlying process much, because it's pretty irrelevant and I'd like to treat it as a black box.
I can make a weak assumption that the results vary smoothly with the input vars, or are at least smooth most of the time.
Is it expensive because you have a lot of data points?
Work on samples.
Fit on one sample and validate performance on another.
Decide your model structure doing that.
Allow for a few more degrees of freedom and fit on the full dataset.
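That workflow can be sketched quickly. This is a toy stand-in, assuming a numpy polynomial fit in place of whatever model structure you actually use, and synthetic data in place of the real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the full dataset (replace with your data).
x = np.linspace(0, 10, 1000)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

# Split into a fitting sample and a validation sample.
idx = rng.permutation(x.size)
fit_idx, val_idx = idx[:500], idx[500:]

# Try a few model structures (here: polynomial degrees) on the
# fitting sample, score each on the held-out sample, keep the best.
best_deg, best_err = None, np.inf
for deg in range(1, 10):
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], deg)
    err = np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2)
    if err < best_err:
        best_deg, best_err = deg, err

# Finally, refit the chosen structure on the full dataset.
final = np.polyfit(x, y, best_deg)
```

The point is that model *structure* gets chosen on held-out data, and only the final fit sees everything.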
What is the degree of your polynomial? Have you done a Bayesian analysis to compare which of your models has the highest information content?
Note this does not mean you should pick the model with the best accuracy, as it might overfit.
Here is more: http://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/
Good link. Cross-validation is probably the most important part, as it's the best defense against overfitting.
It's not a polynomial. The input vars go into a complex model which spits out outputs after a long calculation. We're basically tweaking a subset of tweakables to study its behavior. I would like to fit the output onto some preexisting curves.
Sorry. My inner smartass took over for a moment.
It is hard to answer your question without more info. The easy answer is grid search, but at 45 mins a pop to test a combination of parameters, that's not what you are looking for and I suspect this is exactly what you are already trying to avoid. Is there any more info you can provide about the parameter search space?
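To put numbers on why grid search hurts here: even a coarse grid over 3 variables multiplies out fast at 45 minutes per run. The bounds below are made up purely for illustration:

```python
import itertools

# Hypothetical search ranges for the three input variables.
grid_a = [0.1, 0.5, 1.0, 5.0]
grid_b = [10, 20, 40, 80]
grid_c = [0.01, 0.1, 1.0]

# Every combination is one 45-minute run.
combos = list(itertools.product(grid_a, grid_b, grid_c))
runtime_hours = len(combos) * 45 / 60
print(len(combos), "runs ~", runtime_hours, "hours")  # prints: 48 runs ~ 36.0 hours
```

And that's a *coarse* grid; refining any axis multiplies the total again.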
I'm playing around with PETSe. We're studying the collapse of structures (e.g. bridges) under pressure. The curve is stress/strain. We put in some custom physics (don't ask me for details, I'm just one of the unpaid student / slave labor) and would like to fit our output to some curves with exact analytical solutions.
OK. I have almost zero domain specific knowledge about what you are working on, but I believe I understand what you are trying to accomplish.
When you say that you are trying to "fit a curve" to your data, that implies that you actually have data with which to work. Let's go back to the "black box" concept. If I understand correctly, you are feeding a set of input values into your computation engine. After waiting 45 minutes, you get an output value. That is a single data point corresponding to your chosen input values. Then you try another set of input values, wait another 45 minutes, and get another output value. That would be your second data point. Then, lather rinse and repeat... Do I understand your situation properly?
Now, after you do the above you have a bunch of data points to work with and you want to fit a curve to this data. You are hoping to fit a curve with an analytical solution, correct? Is this going to work for you? I don't know. I would need to look at the data to answer that, but if the data has a decently tight pattern to it your chances are good.
How much data do you need to fit the curve? Well, my standard answer is as much as you can get your hands on. But gathering your data is expensive time-wise, so how little can you get away with? Once again, it depends on the nature of the data. How about if you gather 50 or 100 data points and start trying to fit your curve. Curve-fitting for this amount of data will be lightning fast, so you can start playing around with something. Continue gathering more data while you play. Add more data to your model as it becomes available. See where that process takes you.
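The refit-as-you-go part really is cheap. A sketch with scipy's `curve_fit`, where the exponential model form and the data are made-up stand-ins for your analytical curve and your expensive runs:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical analytical form you are fitting toward (stand-in).
def model(x, a, b, c):
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(1)

# Pretend these are the expensive data points gathered so far.
x_data = np.linspace(0, 4, 50)
y_data = model(x_data, 2.5, 1.3, 0.5) + rng.normal(0, 0.05, x_data.size)

# Refitting every time new points arrive costs essentially nothing.
params, cov = curve_fit(model, x_data, y_data, p0=[1, 1, 0])
```

So the 45-minute cost is all in gathering points; the curve-fitting loop itself can run every time a new point lands.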
>45 minute objective function runtime
jesus christ man you're fucked
you need to bring that down man or you're never going to get anywhere; any optimization approach is going to need to run that hundreds to thousands of times
Not quite right: we enter three variables, and the output is a curve. We would like this curve to be the same as the analytical curve, so that we can figure out what our numbers actually represent.
Use regression since you already have data; download Minitab... BTW, yield strength cannot be obtained through the hardness number of the sample.
>The curve is stress/strain.
It should be representable as a splined polynomial, like pic related.
>all these "i took a stats class once" answers
OP nobody here has any clue how to solve your problem. you're the expert here unfortunately.
try seeing if your lab has any collaborators with experience in this shit, or if there are any labs on campus that do a lot of optimization stuff
I would try to run several of these in parallel, then converge on it à la simulated annealing.
Alternatively, I've been messing around with the Nelder-Mead method. It seems applicable.
I don't get it.
Why would it take 45 minutes to compute if there are only 3 variables?
What happens if you simply try to apply some general linear models in R or Python or matlab or whatever package you like?
How inadequate is it then?
If that's still inadequate, then try k-nearest-neighbours regression or some variant thereof.
No idea why it's taking you 45 minutes with 3 variables.
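If you do go the k-NN regression route, the idea is simple enough to sketch without any library, on made-up data standing in for your (inputs, output) pairs:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Average the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Toy data: output is just the sum of the 3 input variables.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (200, 3))
y = X.sum(axis=1)

pred = knn_predict(X, y, np.array([0.5, 0.5, 0.5]))
```

It only interpolates where you already have runs, though, so it's more useful as a cheap surrogate than as a way to pick the next point to try.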
>Are there some statistical methods that can give the optimal numbers to try?
Yes. It's called regression; there are several methods, but with only 3 parameters, even the simplest should do.
What's your model?
How many data points do you already have?
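For whatever data points already exist, the simplest version is a least-squares linear regression of the output against the 3 parameters. A sketch in numpy, with toy data standing in for real runs:

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend each row is one expensive run: 3 input variables per run.
X = rng.uniform(-1, 1, (30, 3))
true_coeffs = np.array([2.0, -1.0, 0.5])
y = X @ true_coeffs + 0.3 + rng.normal(0, 0.01, 30)

# Least-squares fit with an intercept column appended.
A = np.hstack([X, np.ones((30, 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
```

If a linear fit is badly off, that itself tells you something about how nonlinear the black box is in those 3 parameters.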
Hey guys, how come bootstrapping and cross-validation are said to be used for the same thing, i.e. testing and validating a model? FWIW I'm coming from a computer science background and this is in the context of neural networks. I don't have much formal stats training.
Maybe I have a super basic misunderstanding of the concepts, but the way I see it, in k-fold CV I pick a K (say 10), split the data into K sections, train the model on K-1 sections, and validate it against the remaining one. That gives me one error; I repeat this K-1 more times, rotating the held-out section, and average the errors. Thus I know how likely it is for the model to generalize to a population (ish).
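That description of k-fold CV maps directly to code. A minimal sketch in plain numpy, using a toy linear-fit problem as the model:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 100)
y = 3 * x + rng.normal(0, 0.1, 100)

K = 10
folds = np.array_split(rng.permutation(100), K)  # K disjoint sections

errors = []
for i in range(K):
    val = folds[i]
    train = np.concatenate([folds[j] for j in range(K) if j != i])
    coeffs = np.polyfit(x[train], y[train], 1)   # train on K-1 sections
    errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))

cv_error = np.mean(errors)  # average held-out error over the K folds
```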
Now with bootstrapping, it seems like I repeatedly draw random samples of data points (with replacement), calculate some statistics like the mean and variance on each resample, and maybe fit a model each time if it's a regression problem. Then I aggregate all the fits I got. Or something.
However, CV seems to VALIDATE an already existing, trained model, while bootstrapping seems to actually TRAIN on a dataset? How do people use bootstrapping for validation then?
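One way bootstrapping gets used for validation: fit on each bootstrap resample, then score on the points that resample happened to leave out (the "out-of-bag" points). A sketch in numpy on a toy linear problem:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)

oob_errors = []
for _ in range(200):
    # Draw indices with replacement: one bootstrap resample.
    boot = rng.integers(0, 100, 100)
    oob = np.setdiff1d(np.arange(100), boot)  # points never drawn
    if oob.size == 0:
        continue
    coeffs = np.polyfit(x[boot], y[boot], 1)  # TRAIN on the resample
    # VALIDATE on the out-of-bag points this fit never saw.
    oob_errors.append(np.mean((np.polyval(coeffs, x[oob]) - y[oob]) ** 2))

oob_error = np.mean(oob_errors)
```

So each individual fit does train, but the validation numbers come from the held-out remainder, which is what makes it a validation scheme rather than just refitting.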
So, your output is more complex than I initially understood, which makes the problem more interesting, but I think the same concepts and approach apply. My next question is this: do you actually know the equation for the analytical curve that you are trying to fit your parameters to? If you do, then you can construct an error measure for each set of parameters you test. Once you have this error measure in hand, you can probably apply a gradient-descent approach to the error function to lead you much more quickly to your desired parameter values.
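That error-measure idea can be sketched end to end. Here the simulator is a made-up polynomial stand-in, the target is a stand-in analytical curve, and the gradient comes from finite differences, since the real black box gives no derivatives (note each gradient step costs 6 extra simulator calls here, which is exactly why evaluation count matters at 45 minutes a call):

```python
import numpy as np

x_grid = np.linspace(0, 1, 50)
target = 2.0 * x_grid ** 2 + 0.5         # the analytical curve (stand-in)

def simulate(params):
    """Stand-in for the 45-minute run: params -> output curve."""
    a, b, c = params
    return a * x_grid ** 2 + b * x_grid + c

def error(params):
    # Mean squared mismatch between simulated and analytical curves.
    return np.mean((simulate(params) - target) ** 2)

# Gradient descent on the error, with finite-difference gradients.
p = np.array([0.0, 0.0, 0.0])
h, lr = 1e-5, 0.5
for _ in range(2000):
    grad = np.array([
        (error(p + h * e) - error(p - h * e)) / (2 * h)
        for e in np.eye(3)
    ])
    p -= lr * grad
```

With the real simulator you'd want far fewer, smarter steps, but the structure (curve mismatch as a scalar error, then descend on it) is the same.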
OP hasn't provided any info re: his computing environment or available resources. I had assumed that if he had access to parallel computation resources then he wouldn't have posted in the first place. But, that might have been a bad assumption...