I would use my right hand to slap Alison and Charlie and get them to do their work again... this time forcing them to label their samples.
I would then take them into my office and hold an inquest on how two completely different data-sets got mixed up in the first place.
At which point I would probably find out Alison and Charlie have been knocking boots on company time. Alison and Charlie would then be fired, as this is against company policy.
Due to the difficulty in finding new jobs Alison and Charlie's fledgling romance would end... leaving Alison unexpectedly pregnant and Charlie with an 18 year bill for child support.
The moral of this story? Label your fking work.
Note: I was unable to solve this problem as my math skills are poor.
PEOPLE!!! Are you hackers or not!?!? Download the Excel spreadsheet and look at the raw numbers. The numbers in columns A, D, and E have 2 significant digits after the decimal point, whereas the numbers in columns B, C, and F have 13 significant digits after the decimal point! No calculations necessary!
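That observation is easy to automate. A quick sketch, assuming the spreadsheet has been exported to CSV-style rows of numeric strings (the sample values below are invented stand-ins, not the original data):

```python
# Count digits after the decimal point in each column's raw text values.
# Hand-entered/rounded data tends to have few; computed data has many.

def decimal_places(s):
    """Number of digits after the decimal point in a numeric string."""
    _, _, frac = s.partition(".")
    return len(frac)

def max_decimals_per_column(rows):
    """rows: list of lists of numeric strings, one inner list per record."""
    cols = list(zip(*rows))
    return [max(decimal_places(v) for v in col) for col in cols]

# First column: rounded to 2 places; second: full machine precision.
rows = [
    ["63.52", "17.2531148101457"],
    ["61.10", "18.9114784017184"],
]
print(max_decimals_per_column(rows))  # → [2, 13]
```

The key is to inspect the values as text; loading them as floats would throw away the trailing-zero information that gives the trick away.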
As far as the test goes, if D is small and p is high, you cannot reject the hypothesis that the two datasets came from the same distribution. The p-value is roughly how often you would get similar-looking data by chance, assuming the null hypothesis (in this case, that they are drawn from the same distribution).
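For anyone who wants to try it, scipy's two-sample KS test takes a couple of lines. The synthetic columns below just stand in for the spreadsheet data (here the two sources differ in mean, purely for illustration):

```python
# Two-sample Kolmogorov-Smirnov test: small p suggests different sources,
# large p means we cannot reject a common source.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(loc=150, scale=10, size=40)   # stand-in for source A
b = rng.normal(loc=165, scale=10, size=40)   # a clearly different source
a2 = rng.normal(loc=150, scale=10, size=40)  # another draw from A's source

stat_ab, p_ab = ks_2samp(a, b)
stat_aa, p_aa = ks_2samp(a, a2)
print(f"A vs B:  D={stat_ab:.3f}, p={p_ab:.4g}")  # small p: likely different
print(f"A vs A': D={stat_aa:.3f}, p={p_aa:.4g}")  # large p: cannot reject
```

Note that with n=40 per sample the test has limited power, so "cannot reject" is weak evidence of sameness.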
In light of this evidence, if they are not lying to us, and each of these sets really came from an A-like or B-like distribution, I'd say fairly confidently that:
there is a 98.3% chance that f is b-like
there is a 98.3% chance that e is a-like
there is a 89.3% chance that d is a-like
there is a 53.1% chance that c is b-like
If these are multiplied together, it appears that there is only a 45.8% chance that they are all classified correctly?
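The arithmetic itself checks out (whether multiplying these particular numbers is legitimate is another question, as the reply below argues):

```python
# Product of the four per-column "classification" probabilities above.
probs = [0.983, 0.983, 0.893, 0.531]
p_all = 1.0
for p in probs:
    p_all *= p
print(round(p_all, 3))  # → 0.458
```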
You're asking a good question -- but you know a lot more than what you write above.
The main thing is, you know that C, D, E, and F came from either A or B. The p-values above don't account for that; they just say what's the chance, due to random fluctuation, that a sample could have come from the same source as A.
That's reflected in the fact that the pairs of p-values don't add to one! (Like (A,C) and (B,C) in the table above.)
You also implicitly know that at least one of {C,D,E,F} is A-like and one is B-like (otherwise there would not be a problem). So even if you know P(X and Y have same source) for all (X,Y), which you don't, you couldn't multiply them.
Finally, the p-value returned by the KS test will underestimate the true probability of discrepancy. This is because it's only looking at one thing, the maximum absolute difference between the empirical CDFs. The significant differences between the distributions may lie elsewhere, like in the tails, and the KS test is known to be relatively insensitive to tail behavior. (Although at n=40 you won't be able to see far into the tails.)
There are a host of other tests that use the same idea (empirical CDF difference) but weight differently. Some can be more effective than the KS test if you're looking for certain types of difference. Here's an OK overview, albeit for the goal of assessing normality:
In a real problem, it's always a good idea to use more empirical-cdf tests than just the KS test, to compare variances and other moments as some people in the thread have done, and to make histogram or CDF plots -- especially if you're in just 1 dimension and the plots are easy to interpret.
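One concrete example of such a test is the k-sample Anderson-Darling test, which weights the ECDF difference more heavily in the tails than KS does. A sketch with synthetic stand-in data (scipy caps the reported significance level to the [0.001, 0.25] range and may warn when it clips):

```python
# Anderson-Darling k-sample test: more tail-sensitive than KS.
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(1)
a = rng.normal(150, 10, 40)       # stand-in for source A
b = rng.normal(165, 10, 40)       # a clearly different source
a2 = rng.normal(150, 10, 40)      # another draw from A's source

res_diff = anderson_ksamp([a, b])
res_same = anderson_ksamp([a, a2])
print("A vs B:  stat=%.3f sig=%.3f" % (res_diff.statistic, res_diff.significance_level))
print("A vs A': stat=%.3f sig=%.3f" % (res_same.statistic, res_same.significance_level))
```

Running several ECDF-based tests and eyeballing the plots, as suggested above, is more robust than trusting any single statistic.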
I used my app https://app.prettygraph.com to quickly look at the histograms. B and F look almost identical with a very normal distribution. The others look more like twin-peaks.
Edit: If you want to try, copy the data and click the "Paste data from Excel link" on the left under the Data Tab, paste and load. Then choose histogram under the Graph tab. It plots the X var (ignores Y). Sorry it's a bit shabby, I haven't worked on it in a while.
An interesting problem. My first guess is that the distribution of teenagers' weights is two-peaked (boys & girls), but this assumes that the teenagers are around the same age and that the peaks are far enough apart to be detectable in the samples. I can't be arsed to check this hypothesis though :-)
Anyway, a serious approach to this problem would require comparing against solid real-life data. Off the bat assumptions about distributions of real life data are often very wrong.
Orange http://orange.biolab.si/
"Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics."
Weka http://www.cs.waikato.ac.nz/ml/weka/
"Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes."
haha thanks, but definitely trolling. This is perhaps the most inelegant solution in the world. But it does the job in this case. Output is 10.6014, 4.44403, 5.71353, 9.46718, 10.6471, 5.12933, for columns A to F, respectively
Several people have said they would, or have, plotted histograms - but how?
* By hand?
* With Excel?
* With R?
* With processing?
I'm a little sad to see that this item has been flagged heavily, but I guess there are people who think this is sufficiently off-topic that it should be in the same category as spam.
It treats single newlines as the same paragraph and double newlines as a new one. Alternatively, if you want to post code, you can indent it by four spaces. Like this:

    par(mfrow = c(3, 2))
    dat <- read.table("data.txt", sep = "\t", header = TRUE)
    for (let in c("A", "B", "C", "D", "E", "F"))
        plot(density(dat[, let]), main = let)
We cannot say with 100% certainty that a particular set of data fits in either group. This is more of an AI problem than a statistics problem, although in this particular case the differences between the two types of data sets are big enough that a relatively crude statistical analysis can generate a clear answer to the question.
In a more general sense, I would tend to use a Self Organising Map (http://en.wikipedia.org/wiki/Self-organizing_map) to identify the two groups of data, and then discriminate between them. The dimensions of the SOM would be different statistical analyses of the data sets - the mean, the absolute spread, the standard deviation, the median, etc.
The nice thing about using a SOM is that it will show you whether or not you have succeeded in finding a measure that can successfully discriminate between different data sets - whilst at the same time actually doing the discrimination for you.
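For anyone curious what that looks like in practice, here is a toy 1-D SOM sketch in numpy. Every choice below (node count, learning-rate schedule, the particular summary statistics used as features) is an arbitrary illustration, not a prescription:

```python
# Minimal 1-D self-organizing map over per-column summary statistics.
import numpy as np

def column_features(col):
    """Summary statistics used as SOM input dimensions."""
    return np.array([np.mean(col), np.std(col), np.median(col),
                     np.max(col) - np.min(col)])

def train_som(X, n_nodes=4, epochs=100, lr0=0.5, sigma0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_nodes, X.shape[1]))       # node weight vectors
    for t in range(epochs):
        frac = 1 - t / epochs                        # decay schedule
        lr, sigma = lr0 * frac, sigma0 * frac + 1e-3
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            d = np.abs(np.arange(n_nodes) - bmu)          # grid distance
            W += lr * np.exp(-d**2 / (2 * sigma**2))[:, None] * (x - W)
    return W

def assign(X, W):
    """Map each feature vector to its best-matching node."""
    return np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in X])

# Six synthetic columns: three wide-spread, three narrow-spread.
rng = np.random.default_rng(2)
cols = [rng.normal(150, 12, 40) for _ in range(3)] + \
       [rng.normal(150, 5, 40) for _ in range(3)]
X = np.array([column_features(c) for c in cols])
X = (X - X.mean(axis=0)) / X.std(axis=0)             # standardize features
W = train_som(X)
labels = assign(X, W)
print(labels)
```

If the features separate the two sources, the wide and narrow columns end up on different nodes of the map, which is the "doing the discrimination for you" part.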
Permutation tests! Easy to use and easy to explain.
The two examples have the same mean and median, but differ substantially in their min/max/standard deviation/median absolute deviation. Pick your spread, they differ. I got this by poking around in IPython.
Find the standard deviation of A. Reshuffle A and B together, taking half of the resulting list, calculate the standard deviation, record it and repeat that a thousand times. See that you basically never get a standard deviation as large as you did for A alone, so they meaningfully differ.
Repeat on each of the data sets, and you see that D and E are plausibly similar to A whereas C and F are not. Repeat for B to see if we're being fucked with, and lo and behold it all seems to work out, though D is a little iffy.
I used numpy and ten lines of glue code. Happy to post if there is interest.
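The reshuffling procedure described above can be sketched like this (my own reconstruction with synthetic stand-in columns, not necessarily the same ten lines):

```python
# Permutation test on the standard deviation: pool two columns, reshuffle,
# take a half-sized resample, and see how often its std is as large as
# the first column's std alone.
import numpy as np

def perm_test_std(a, b, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    target = np.std(a)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if np.std(pooled[: len(a)]) >= target:
            count += 1
    return count / n_iter  # fraction of shuffles with spread >= target

rng = np.random.default_rng(3)
a = rng.normal(150, 12, 40)  # wide-spread column (stand-in for A)
b = rng.normal(150, 5, 40)   # narrow column (stand-in for B)
print(perm_test_std(a, b))   # near 0: A's spread is not a shuffling artifact
```

A result near zero says you basically never reproduce A's spread by shuffling, so the two columns meaningfully differ; a result near one says A's spread is unremarkable within the pool.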
I'm sure you can easily identify which set is which using statistical methods. It's probably as simple as finding out which type of distribution each set resembles. Then again, my undergrad statistics is very rusty.
For the sake of completeness, my first steps before sorting and plotting were to:
1. take the mean, median and mode of each set,
2. try and fail to use more sophisticated statistical analysis [1],
3. have a look at the minimum and maximum value for each set.
By #3 I already had a good enough guess, but I felt I had to see it.
[1] In fact, could anyone point me to a refresher? I'm talking about distribution curves, regression, that kind of thing. I seem to have lost hang of it — couldn't pass my own "5 minutes to implement the simple thing" criterion.
I would look at the mean and variance in each column and cluster by that. My intuition is that the variance of temperature measured in Fahrenheit will be greater than that of weight in kg.
Checking it seems that this is borne out. So is the data real? What is being assessed here?
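A sketch of that check with made-up columns (wide spread standing in for the Fahrenheit temperatures, narrow for the kg weights):

```python
# Summarize each column by mean and standard deviation; two sources that
# share a mean but differ in spread split cleanly on the std axis.
import numpy as np

rng = np.random.default_rng(4)
columns = {
    "A": rng.normal(150, 12, 40), "B": rng.normal(150, 5, 40),
    "C": rng.normal(150, 5, 40),  "D": rng.normal(150, 12, 40),
    "E": rng.normal(150, 12, 40), "F": rng.normal(150, 5, 40),
}
for name, col in columns.items():
    print(f"{name}: mean={np.mean(col):7.2f}  std={np.std(col):6.2f}")
```

With spreads this different the grouping is visible by eye; no formal clustering step is needed.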
Interesting to see that people prefer std dev to variance for this problem.
Excel, compare standard deviation + means in a slightly hand-wavey way. The mean is around the same for each interestingly, however the standard deviation varies in such a way as to imply which belongs to which.
I used python with pylab/matplotlib to sort and plot the columns. It was easy to spot the two sets. I suppose a histogram would have worked nicely. This also would have been easy with my selected tools.
as others have said, first thing to do is plot the data. but if you want a statistical test (with significances, so that you have some idea of how confident you can be in your decision) then you could use the two-sample ks test. it's likely more useful than a simple t-test here because it's sensitive to shape.
but to be honest the first thing i would do is google for suitable approaches...
I have only two numbers, 50 and 60 - one is a weight and another is a temperature. Now I have another number, 80. What kind of number is that? No way to know. The fact that there are multiple values makes no difference - the data is the data!
Does the thread starter wonder about what software tool (i.e. language) people would use, or does he wonder about what method people would use? As far as language goes, the problem requires mathematics on a level so simple that practically any programming language is more than sufficient - the histogram method I'd use for solving this problem would be doable in BASIC on a VIC-20 or on a TI-81 pocket calculator.
This is silly, but I find it to be more on-topic and fun than a lot of other items that I see on HN. It's really educational to see how people look at the problem.