Hello /sci/, I'm trying to estimate the probability of two

Thread replies: 15
Thread images: 4

Anonymous 2016-05-11 02:48:44 Post No. 8065235

File: chart.png (193KB, 800x600px)

Hello /sci/,
I'm trying to estimate the probability of two statistically independent events. I know that in conditional probability
P(A|B) = P(A&B)/P(B)
but for statistically independent events:
P(A&B) = P(A)*P(B)
which means that
P(A|B) = P(A)

This results in charts like this one, where each of my categories ends up with the same probability for every other category. The lines differ only according to the number of members in each category.
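To see why every line comes out the same, here is a minimal sketch with made-up numbers (the marginals `p_job` and `p_band` are hypothetical, not read off the chart). If the joint table is built as P(A)*P(B), dividing by P(B) simply hands back P(A):

```python
import numpy as np

# Hypothetical marginals, for illustration only.
p_job = np.array([0.5, 0.3, 0.2])        # P(A): three job categories
p_band = np.array([0.1, 0.4, 0.3, 0.2])  # P(B): four earnings bands

# Independence assumption: P(A&B) = P(A) * P(B)
joint = np.outer(p_job, p_band)

# P(A|B) = P(A&B) / P(B); broadcasting divides column j by p_band[j]
cond = joint / p_band

print(cond)  # each column is a copy of p_job
```

So under the independence assumption the conditional carries no more information than the marginal, whatever transformation is applied afterwards.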

Is there a better transformation I can do on my data, such that each of these lines is different, and so that each category produces a different probability?

I can try and explain myself more clearly if that doesn't make sense.

Thanks for your help.

Anonymous 2016-05-11 03:12:15 Post No.8065270

>>8065235
>Yvalues
>Xvalues

Anonymous 2016-05-11 03:14:25 Post No.8065275

File: chart.png (198KB, 800x600px)

>>8065270
They are placeholder variable names. Is this better for you?

Anonymous 2016-05-11 03:19:59 Post No.8065284

File: chartkey.png (10KB, 300x240px)

It occurs to me that these events actually aren't independent. If someone is employed in Agriculture, for instance, there is some chance that they earn in band 2 per week, some chance they earn in band 3 per week, etc., and these chances depend on the industry of employment. I just don't know what these probabilities are, and that's what I want to determine. How do I work this out from what I have?
Pic shows the earnings bands the numbers represent.

Anonymous 2016-05-11 03:26:17 Post No.8065295

>>8065284
Are you the guy who posted this type of data with income and age paired and job and age paired and wanted to know how to estimate the income and job pairings?

Anonymous 2016-05-11 03:30:06 Post No.8065302

>>8065295
Yeah, I'm that guy. Turns out computing the number of permutations of a matrix with ~5000000 entries in either axis is really, really computationally intensive. I'm trying to use probability to make it easier.

Anonymous 2016-05-11 03:39:13 Post No.8065318

>>8065284
Yes, obviously income and job are not independent. The chance of earning X given you have job Y is the number of people who earn X and have job Y divided by the number of people who have job Y. But you don't actually have this data, do you?
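For reference, if the joint counts were available, the calculation would be a one-liner. A small sketch, where the `counts` table is entirely made up:

```python
import numpy as np

# Hypothetical counts: rows are jobs Y, columns are earnings bands X.
counts = np.array([
    [30, 50, 20],    # job 0
    [10, 40, 150],   # job 1
])

# Number of people in each job (sum over bands), kept as a column vector.
job_totals = counts.sum(axis=1, keepdims=True)

# P(X|Y): each row divided by that job's total, so every row sums to 1.
p_band_given_job = counts / job_totals
```

Unlike the independence case, the rows here can differ from one another, which is exactly the information that the marginal totals alone do not contain.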

>>8065302
Well, there is an easier way to do it computation-wise, which is to let the matrix take continuous values instead of just integers. Then you can use calculus to find the average value for each element. Unfortunately this involves solving for the hypervolume of a generalized polyhedron in a very large number of dimensions (this is called a polytope). Basically, the allowable range for each element can be represented as a side of the polytope, and the hypervolume represents the probability of a particular value for the element, which allows you to calculate its expected value. Unfortunately this is probably way over your head and still too difficult to program.

Essentially you can't do what you're trying to do. Even if you could get the expected value of each element, this is just the average value it would take if all permutations were equally likely. But we know not all permutations of a job and an income are equally likely.

Anonymous 2016-05-11 03:45:22 Post No.8065324

File: Industry.png (37KB, 800x600px)

>>8065318
Ah, that second paragraph is a good explanation of what I'm trying to do, mathematically. Thanks for it.

No, you're right, I don't have the data concerning how many people earn X with job Y. That's what I'm trying to infer from what I've got.
So it's P(X|Y) = P(X&Y)/P(Y), then. But since I'm trying to infer P(X|Y) and I don't have P(X&Y), I was trying to use independence to assert that P(X&Y) == P(X)*P(Y), which isn't true.
So I can't work this out from what I have? Can I estimate it in any sensible way?

Side note, I was able to make some nice graphs from the data I did actually have.

Anonymous 2016-05-11 04:05:46 Post No.8065350

>>8065324
Yes, you can estimate it by the method discussed in the previous thread or this one, but this just uses the assumption that all possible permutations of the data are equally likely, which is not accurate. It's the best way of estimating without any more information, but it's not accurate. This won't give you a uniform distribution between incomes and jobs, as jobs with many people and incomes with many people will result in a higher estimated number of people with both that job and that income. But that's all that the information you have can tell you.
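A rough sketch of what that averaging looks like under one reading of "all permutations equally likely", namely that every way of re-pairing individual people with earnings bands is equally likely. That reading is an assumption; weighting every distinct table equally, as in the polytope description, gives different averages. The marginal counts below are made up. Sampling random shuffles avoids enumerating the permutations, and under this assumption the average settles on (job total * band total) / population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginal counts (both sum to the same population).
job_counts = np.array([50, 30, 20])        # people per job
band_counts = np.array([10, 40, 30, 20])   # people per earnings band
n = job_counts.sum()

# One label per person.
jobs = np.repeat(np.arange(len(job_counts)), job_counts)
bands = np.repeat(np.arange(len(band_counts)), band_counts)

# Average the job-by-band table over random re-pairings of people.
n_samples = 2000
total = np.zeros((len(job_counts), len(band_counts)))
for _ in range(n_samples):
    shuffled = rng.permutation(bands)
    np.add.at(total, (jobs, shuffled), 1)  # tally each (job, band) pair
mc_average = total / n_samples

# Closed form under the same assumption: row total * column total / n.
closed_form = np.outer(job_counts, band_counts) / n
```

Big jobs and big bands get big cells, but the table is just the independence estimate again, so it says nothing about how job and income actually depend on each other.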

Anonymous 2016-05-11 04:08:42 Post No.8065354

>>8065350
Okay, the method proposed was computing the average value in each cell of the matrix for each possible permutation. However, this is computationally beyond the scope of my project. Is there a way I can determine the average value for each cell without computing each permutation? There was another method proposed where you started by assuming an even distribution, eg the values of the first row are all 1/n where n is the number of columns, but this only works if you have the same number of rows and columns, and I don't.

Anonymous 2016-05-11 04:13:27 Post No.8065362

>>8065354
Yes, I told you just now. Use calculus to find the expected value of the matrix. But that will be too hard also. Your data set is too big and uncorrelated to do what you want. I suggest you rethink entirely what you're trying to do.

>There was another method proposed where you started by assuming an even distribution, eg the values of the first row are all 1/n where n is the number of columns, but this only works if you have the same number of rows and columns, and I don't.
No, that method doesn't work at all. I thought I already told you that.

Anonymous 2016-05-11 04:15:46 Post No.8065365

>>8065362
Nah, I think you came up with that one first and then realised it wouldn't work, but you didn't explicitly state it. I considered it as well after I worked out that the other method wouldn't work for such a large dataset.

Yeah, it's not too much of a problem. I'll just have to say that I needed more information to achieve what I wanted to achieve. Thanks for all your help, anon.

Anonymous 2016-05-11 11:56:15 Post No.8066004

>>8065324
wtf do the numbers on the x-axis mean?

Anonymous 2016-05-11 13:31:10 Post No.8066163

You have data a_{ij} and b_{ik}, and want to find probabilities p_i, q_j and r_k such that

[math]p_i q_j = a_{ij}[/math] and [math]p_i r_k = b_{ik}[/math].

Take logs to get linear constraints:

[math]\log p_i + \log q_j = \log a_{ij}[/math]
[math]\log p_i + \log r_k = \log b_{ik}[/math]

Now find the minimum of the quadratic

[math]E^2 = \sum_i (\log p_i)^2 + \sum_j (\log q_j)^2 + \sum_k (\log r_k)^2[/math]

subject to the constraints. This will just require solving a system of linear equations, and will give you an estimate for the unknowns p_i, q_j and r_k.

A bit hacky, but might give you something that's reasonable and readily calculated.
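A minimal sketch of how this could be set up with NumPy. Two assumptions of mine, not from the post: the constraints are fitted in the least-squares sense (real data will generally not satisfy them exactly), and all entries of `a` and `b` are strictly positive so the logs exist. The function name `fit_log_linear` and the test data are hypothetical.

```python
import numpy as np

def fit_log_linear(a, b):
    """Fit log p_i + log q_j = log a_ij and log p_i + log r_k = log b_ik."""
    n_i, n_j = a.shape
    n_k = b.shape[1]
    assert b.shape[0] == n_i, "a and b must share the i axis"
    rows, rhs = [], []
    for i in range(n_i):
        for j in range(n_j):
            row = np.zeros(n_i + n_j + n_k)
            row[i] = 1.0            # coefficient of log p_i
            row[n_i + j] = 1.0      # coefficient of log q_j
            rows.append(row)
            rhs.append(np.log(a[i, j]))
        for k in range(n_k):
            row = np.zeros(n_i + n_j + n_k)
            row[i] = 1.0                  # coefficient of log p_i
            row[n_i + n_j + k] = 1.0      # coefficient of log r_k
            rows.append(row)
            rhs.append(np.log(b[i, k]))
    A = np.array(rows)
    y = np.array(rhs)
    # The system is rank deficient (shifting log p up and log q, log r down
    # changes nothing), so lstsq returns the minimum-norm solution, i.e. the
    # one with the smallest sum of squared logs, which is E^2 above.
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.exp(x[:n_i]), np.exp(x[n_i:n_i + n_j]), np.exp(x[n_i + n_j:])

# Made-up data that factorises exactly, so the fit should reproduce it.
p_true = np.array([0.2, 0.3, 0.5])
q_true = np.array([0.6, 0.4])
r_true = np.array([0.1, 0.2, 0.7])
a = np.outer(p_true, q_true)
b = np.outer(p_true, r_true)
p, q, r = fit_log_linear(a, b)
```

Only the products p_i q_j and p_i r_k are pinned down; the individual p, q and r are determined up to a common scale factor and will not sum to 1 unless rescaled.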

Anonymous 2016-05-11 17:34:30 Post No.8066586

>>8066163
You could also give up the independence assumption and just try minimizing

[math]E^2=\sum_{ijk} p^2_{ijk}[/math]

subject to

[math]\sum_k p_{ijk}=a_{ij}[/math] and [math]\sum_j p_{ijk}=b_{ik}[/math] and [math]\sum_{ijk}p_{ijk}=1[/math].

Again, you will get a linear system. You might have to adjust a bit to get positivity.
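For this particular objective the linear system can be solved by hand, so here is a small sketch under the assumption that a and b have matching row sums (both give the same p_i). Setting the derivative of the Lagrangian to zero forces the form p_{ijk} = lambda_{ij} + mu_{ik}, and the expression in the code is one such form that meets both marginal constraints. The function name and the example tables are made up; I read i as age, j as income and k as job, which is also an assumption.

```python
import numpy as np

def min_sum_squares(a, b):
    """Minimise sum p_ijk^2 s.t. sum_k p_ijk = a_ij and sum_j p_ijk = b_ik."""
    n_j = a.shape[1]
    n_k = b.shape[1]
    p_i = a.sum(axis=1)
    assert np.allclose(p_i, b.sum(axis=1)), "marginals must agree"
    # Additive solution: a_ij / K + b_ik / J - p_i / (J * K)
    return (a[:, :, None] / n_k
            + b[:, None, :] / n_j
            - p_i[:, None, None] / (n_j * n_k))

# Hypothetical tables: a is age x income, b is age x job.
a = np.array([[0.1, 0.3],
              [0.2, 0.4]])
b = np.array([[0.2, 0.1, 0.1],
              [0.3, 0.2, 0.1]])
p = min_sum_squares(a, b)

# Summing out i gives an income-by-job table.
income_job = p.sum(axis=0)
```

Nothing in the formula prevents negative entries when some a_{ij} or b_{ik} is small relative to p_i, which is where the positivity adjustment would come in.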
