I would use my right hand to slap Alison and Charlie and get them to do their work again... this time forcing them to label their samples.
I would then take them into my office and hold an inquest on how two completely different data-sets got mixed up in the first place.
At which point I would probably find out Alison and Charlie have been knocking boots on company time. Alison and Charlie would then be fired, as this is against company policy.
Due to the difficulty in finding new jobs Alison and Charlie's fledgling romance would end... leaving Alison unexpectedly pregnant and Charlie with an 18 year bill for child support.
The moral of this story? Label your fking work.
Note: I was unable to solve this problem as my math skills are poor.
PEOPLE!!! Are you hackers or not!?!? Download the Excel spreadsheet and look at the raw numbers. The numbers in columns A, D, and E have 2 significant digits after the decimal point, whereas the numbers in columns B, C, and F have 13 significant digits after the decimal point! No calculations necessary!
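That observation is easy to automate. A quick sketch, assuming the spreadsheet has been exported to CSV-style rows of numeric strings (the sample values below are invented stand-ins, not the original data):

```python
# Count digits after the decimal point in each column's raw text values.
# Hand-entered/rounded data tends to have few; computed data has many.

def decimal_places(s):
    """Number of digits after the decimal point in a numeric string."""
    _, _, frac = s.partition(".")
    return len(frac)

def max_decimals_per_column(rows):
    """rows: list of lists of numeric strings, one inner list per record."""
    cols = list(zip(*rows))
    return [max(decimal_places(v) for v in col) for col in cols]

# First column: rounded to 2 places; second: full machine precision.
rows = [
    ["63.52", "17.2531148101457"],
    ["61.10", "18.9114784017184"],
]
print(max_decimals_per_column(rows))  # → [2, 13]
```

The key is to inspect the values as text; loading them as floats would throw away the trailing-zero information that gives the trick away.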
As far as the test goes, if D is small and p is high, you cannot reject the hypothesis that the two datasets came from the same distribution. The p-value is roughly how often you would get similar-looking data by chance, assuming the null hypothesis (in this case, that they are drawn from the same distribution).
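For anyone who wants to try it, scipy's two-sample KS test takes a couple of lines. The synthetic columns below just stand in for the spreadsheet data (here the two sources differ in mean, purely for illustration):

```python
# Two-sample Kolmogorov-Smirnov test: small p suggests different sources,
# large p means we cannot reject a common source.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(loc=150, scale=10, size=40)   # stand-in for source A
b = rng.normal(loc=165, scale=10, size=40)   # a clearly different source
a2 = rng.normal(loc=150, scale=10, size=40)  # another draw from A's source

stat_ab, p_ab = ks_2samp(a, b)
stat_aa, p_aa = ks_2samp(a, a2)
print(f"A vs B:  D={stat_ab:.3f}, p={p_ab:.4g}")  # small p: likely different
print(f"A vs A': D={stat_aa:.3f}, p={p_aa:.4g}")  # large p: cannot reject
```

Note that with n=40 per sample the test has limited power, so "cannot reject" is weak evidence of sameness.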
In light of this evidence, if they are not lying to us, and each of these sets really came from an A-like or B-like distribution, I'd say fairly confidently that:
there is a 98.3% chance that f is b-like
there is a 98.3% chance that e is a-like
there is a 89.3% chance that d is a-like
there is a 53.1% chance that c is b-like
If these are multiplied together, it appears that there is only a 45.8% chance that they are all classified correctly?
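The arithmetic itself checks out (whether multiplying these particular numbers is legitimate is another question, as the reply below argues):

```python
# Product of the four per-column "classification" probabilities above.
probs = [0.983, 0.983, 0.893, 0.531]
p_all = 1.0
for p in probs:
    p_all *= p
print(round(p_all, 3))  # → 0.458
```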
You're asking a good question -- but you know a lot more than what you write above.
The main thing is, you know that C, D, E, and F came from either A or B. The p-values above don't account for that; they just say what's the chance, due to random fluctuation, that a sample could have come from the same source as A.
That's reflected in the fact that the pairs of p-values don't add to one! (Like (A,C) and (B,C) in the table above.)
You also implicitly know that at least one of {C,D,E,F} is A-like and one is B-like (otherwise there would not be a problem). So even if you know P(X and Y have same source) for all (X,Y), which you don't, you couldn't multiply them.
Finally, the p-value returned by the KS test will underestimate the true probability of discrepancy. This is because it's only looking at one thing, the maximum absolute difference between the empirical CDFs. The significant differences between the distributions may lie elsewhere, like in the tails, and the KS test is known to be relatively insensitive to tail behavior. (Although at n=40 you won't be able to see far into the tails.)
There are a host of other tests that use the same idea (empirical CDF difference) but weight differently. Some can be more effective than the KS test if you're looking for certain types of difference. Here's an OK overview, albeit for the goal of assessing normality:
In a real problem, it's always a good idea to use more empirical-cdf tests than just the KS test, to compare variances and other moments as some people in the thread have done, and to make histogram or CDF plots -- especially if you're in just 1 dimension and the plots are easy to interpret.
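One concrete example of such a test is the k-sample Anderson-Darling test, which weights the ECDF difference more heavily in the tails than KS does. A sketch with synthetic stand-in data (scipy caps the reported significance level to the [0.001, 0.25] range and may warn when it clips):

```python
# Anderson-Darling k-sample test: more tail-sensitive than KS.
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(1)
a = rng.normal(150, 10, 40)       # stand-in for source A
b = rng.normal(165, 10, 40)       # a clearly different source
a2 = rng.normal(150, 10, 40)      # another draw from A's source

res_diff = anderson_ksamp([a, b])
res_same = anderson_ksamp([a, a2])
print("A vs B:  stat=%.3f sig=%.3f" % (res_diff.statistic, res_diff.significance_level))
print("A vs A': stat=%.3f sig=%.3f" % (res_same.statistic, res_same.significance_level))
```

Running several ECDF-based tests and eyeballing the plots, as suggested above, is more robust than trusting any single statistic.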
I used my app https://app.prettygraph.com to quickly look at the histograms. B and F look almost identical with a very normal distribution. The others look more like twin-peaks.
Edit: If you want to try, copy the data and click the "Paste data from Excel link" on the left under the Data Tab, paste and load. Then choose histogram under the Graph tab. It plots the X var (ignores Y). Sorry it's a bit shabby, I haven't worked on it in a while.
An interesting problem. My first guess is that the distribution of teenagers' weights is two-peaked (boys & girls), but this assumes that the teenagers are around the same age and that the peaks are far enough apart to be detectable in the samples. I can't be arsed to check this hypothesis though :-)
Anyway, a serious approach to this problem would require comparing against solid real-life data. Off the bat assumptions about distributions of real life data are often very wrong.
Orange http://orange.biolab.si/
"Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics."
Weka http://www.cs.waikato.ac.nz/ml/weka/
"Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes."
haha thanks, but definitely trolling. This is perhaps the most inelegant solution in the world. But it does the job in this case. Output is 10.6014, 4.44403, 5.71353, 9.46718, 10.6471, 5.12933, for columns A to F, respectively
Several people have said they would, or have, plotted histograms - but how?
* By hand?
* With Excel?
* With R?
* With processing?
I'm a little sad to see that this item has been flagged heavily, but I guess there are people who think this is sufficiently off-topic that it should be in the same category as spam.
It treats single newlines as the same paragraph and double newlines as a new one. Alternatively, if you want to post code, you can indent it by four spaces. Like this:

    par(mfrow = c(3, 2))
    dat <- read.table("data.txt", sep = "\t", header = TRUE)
    for (let in c("A", "B", "C", "D", "E", "F"))
        plot(density(dat[, let]), main = let)
We cannot say with 100% certainty that a particular set of data fits in either group. This is more of an AI problem than a statistics problem, although in this particular case the differences between the two types of data sets are big enough that a relatively crude statistical analysis can generate a clear answer to the question.
In a more general sense, I would tend to use a Self Organising Map (http://en.wikipedia.org/wiki/Self-organizing_map) to identify the two groups of data, and then discriminate between them. The dimensions of the SOM would be different statistical analyses of the data sets - the mean, the absolute spread, the standard deviation, the median, etc.
The nice thing about using a SOM is that it will show you whether or not you have succeeded in finding a measure that can successfully discriminate between different data sets - whilst at the same time actually doing the discrimination for you.
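For anyone curious what that looks like in practice, here is a toy 1-D SOM sketch in numpy. Every choice below (node count, learning-rate schedule, the particular summary statistics used as features) is an arbitrary illustration, not a prescription:

```python
# Minimal 1-D self-organizing map over per-column summary statistics.
import numpy as np

def column_features(col):
    """Summary statistics used as SOM input dimensions."""
    return np.array([np.mean(col), np.std(col), np.median(col),
                     np.max(col) - np.min(col)])

def train_som(X, n_nodes=4, epochs=100, lr0=0.5, sigma0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_nodes, X.shape[1]))       # node weight vectors
    for t in range(epochs):
        frac = 1 - t / epochs                        # decay schedule
        lr, sigma = lr0 * frac, sigma0 * frac + 1e-3
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            d = np.abs(np.arange(n_nodes) - bmu)          # grid distance
            W += lr * np.exp(-d**2 / (2 * sigma**2))[:, None] * (x - W)
    return W

def assign(X, W):
    """Map each feature vector to its best-matching node."""
    return np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in X])

# Six synthetic columns: three wide-spread, three narrow-spread.
rng = np.random.default_rng(2)
cols = [rng.normal(150, 12, 40) for _ in range(3)] + \
       [rng.normal(150, 5, 40) for _ in range(3)]
X = np.array([column_features(c) for c in cols])
X = (X - X.mean(axis=0)) / X.std(axis=0)             # standardize features
W = train_som(X)
labels = assign(X, W)
print(labels)
```

If the features separate the two sources, the wide and narrow columns end up on different nodes of the map, which is the "doing the discrimination for you" part.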
Permutation tests! Easy to use and easy to explain.
The two examples have the same mean and median, but differ substantially in their min/max/standard deviation/median absolute deviation. Pick your spread, they differ. I got this by poking around in IPython.
Find the standard deviation of A. Reshuffle A and B together, taking half of the resulting list, calculate the standard deviation, record it and repeat that a thousand times. See that you basically never get a standard deviation as large as you did for A alone, so they meaningfully differ.
Repeat on each of the data sets, and you see that D and E are plausibly similar to A whereas C and F are not. Repeat for B to see if we're being fucked with, and lo and behold it all seems to work out, though D is a little iffy.
I used numpy and ten lines of glue code. Happy to post if there is interest.
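The reshuffling procedure described above can be sketched like this (my own reconstruction with synthetic stand-in columns, not necessarily the same ten lines):

```python
# Permutation test on the standard deviation: pool two columns, reshuffle,
# take a half-sized resample, and see how often its std is as large as
# the first column's std alone.
import numpy as np

def perm_test_std(a, b, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    target = np.std(a)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if np.std(pooled[: len(a)]) >= target:
            count += 1
    return count / n_iter  # fraction of shuffles with spread >= target

rng = np.random.default_rng(3)
a = rng.normal(150, 12, 40)  # wide-spread column (stand-in for A)
b = rng.normal(150, 5, 40)   # narrow column (stand-in for B)
print(perm_test_std(a, b))   # near 0: A's spread is not a shuffling artifact
```

A result near zero says you basically never reproduce A's spread by shuffling, so the two columns meaningfully differ; a result near one says A's spread is unremarkable within the pool.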
I'm sure you can easily identify which set is which using statistical methods. It's probably as simple as finding out which type of distribution each set resembles. Then again, my undergrad statistics is very rusty.
For the sake of completeness, my first steps before sorting and plotting were to:
1. take the mean, median and mode of each set,
2. try and fail to use more sophisticated statistical analysis [1],
3. have a look at the minimum and maximum value for each set.
By #3 I already had a good enough guess, but I felt I had to see it.
[1] In fact, could anyone point me to a refresher? I'm talking about distribution curves, regression, that kind of thing. I seem to have lost hang of it — couldn't pass my own "5 minutes to implement the simple thing" criterion.
I would look at the mean and variance in each column and cluster by that. My intuition is that the variance of temperature measured in Fahrenheit will be greater than that of weight in kg.
Checking it seems that this is borne out. So is the data real? What is being assessed here?
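A sketch of that check with made-up columns (wide spread standing in for the Fahrenheit temperatures, narrow for the kg weights):

```python
# Summarize each column by mean and standard deviation; two sources that
# share a mean but differ in spread split cleanly on the std axis.
import numpy as np

rng = np.random.default_rng(4)
columns = {
    "A": rng.normal(150, 12, 40), "B": rng.normal(150, 5, 40),
    "C": rng.normal(150, 5, 40),  "D": rng.normal(150, 12, 40),
    "E": rng.normal(150, 12, 40), "F": rng.normal(150, 5, 40),
}
for name, col in columns.items():
    print(f"{name}: mean={np.mean(col):7.2f}  std={np.std(col):6.2f}")
```

With spreads this different the grouping is visible by eye; no formal clustering step is needed.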
Interesting to see that people prefer std dev to variance for this problem.
Excel, compare standard deviation + means in a slightly hand-wavey way. The mean is around the same for each interestingly, however the standard deviation varies in such a way as to imply which belongs to which.
I used python with pylab/matplotlib to sort and plot the columns. It was easy to spot the two sets. I suppose a histogram would have worked nicely. This also would have been easy with my selected tools.
as others have said, first thing to do is plot the data. but if you want a statistical test (with significances, so that you have some idea of how confident you can be in your decision) then you could use the two-sample ks test. it's likely more useful than a simple t-test here because it's sensitive to shape.
but to be honest the first thing i would do is google for suitable approaches...
I have only two numbers, 50 and 60 - one is a weight and another is a temperature. Now I have another number, 80. What kind of number is that? No way to know. The fact that there are multiple values makes no difference - the data is the data!
Does the thread starter wonder about what software tool (i.e. language) people would use, or does he wonder about what method people would use? As far as language goes, the problem requires mathematics on a level so simple that practically any programming language is more than sufficient - the histogram method I'd use for solving this problem would be doable in BASIC on a VIC-20 or on a TI-81 pocket calculator.
This is silly, but I find it to be more on-topic and fun than a lot of other items that I see on HN. It's really educational to see how people look at the problem.