Standard Deviation Differences Between Excel and R (and my code in Cube Voyager)

July 24th, 2015

I had a need to get the correlation of count to assignment in Cube Voyager.  I don't know how to do this off the top of my head, and I'm instantly mistrusting of doing things I'd normally trust to R or Excel.  So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation.  I didn't make it that far down on the page and used the first, which generally made Voyager Code like this:

I left the print statements in, because the output is important.

Avg AADT_TRK = 1121.77
Avg VOLUME = 822.03
n = 230.00



sdx1 = 1588160175
sdy1 = 1196330474
n = 230.00
sd AADT_TRK = 2627.75
sd Volume = 2280.67
r2 = 155.06

Note the standard deviations above.  Ignore the R2 because it's most certainly not correct!

Again, mistrusting my own calculations, I imported the DBF into R and looked at the standard deviations:

> sd(trkIn$AADT_TRK)
[1] 2633.476
> sd(trkIn$V_1)
[1] 2285.64

Now Standard Deviation is pretty easy to compute.  So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

Screenshot 2015-07-24 11.09.18

WTF? Am I right or wrong???

So I started looking into it and recalled something about n vs. n-1 in the RMSE equation and discussion in the latest Model Validation and Reasonableness Checking Manual.  So I decided to manually code the standard deviation in Excel and use sqrt(sum(x-xavg)^2/n-1) instead of Excel's function:

Looky there!  Matching Numbers!

Looky there! Matching Numbers!

It's not that Excel is incorrect, it's not using Bessel's Correction.  R is.

/A

Interesting INT() Issue Between Cube and Excel

July 24th, 2012

 

I don't know about anyone else, but I do a lot of calculation prototyping in Excel before applying that in scripts.  One of the most recent was to do a script to add expansion zones (also known as "dummy zones", although they aren't really dumb, just undeveloped!).

The problem I had was related to the following equation:

R=INT((819-N)/22)+1   Where N={820..906}

In Excel, the results are as below (click on it if it is too small to see):

In Cube, I got the result of (click on it to expand, and I only took it into Excel to move stuff around and make it easier to see):

Note the sheer number of zeroes in the Cube version and all the numbers are 'off'.

The reason, as I looked into things was because of how INT() works differently in the two platforms.  In Cube, INT simply removes everything to the right of the decimal, so INT(-0.05) = 0, and INT(-1.05)=-1.  In Excel, INT rounds down to the nearest integer.  This means that negative values will be different between the two platforms.  Note the table below.

Excel Cube
3.4 3 3
2.3 2 2
1.1 3 1
0.5 0 0
0 0 0
-0.5 -1 0
-1.1 -2 -1
-2.3 -3 -2
-3.4 -4 -3

While neither software is truly wrong in it's approach (there is no standard spec for INT()) it is important to know why things may not work as expected.

Using GAWK to Get Through CTPP Data

August 18th, 2011

The 3-year CTPP website lacks a little in usability (just try getting a county-county matrix out of it).

One of the CTPP staff pointed me to the downloads, which are a double-edge sword. On one hand, you have a lot of data without an interface in the way. On the other hand, you have a lot of data.

I found it was easiest to use GAWK to get through the data, and it was pretty easy:

gawk '/.*COUNTY_CODE.*/' *.csv >Filename.txt

Where COUNTY_CODE is the code from Pn-Labels-xx.txt where n is the part number (1,2, or 3) and xx is the state abbreviation.

NOTE: Look up the county code EACH TIME.  It changes among parts 1, 2, and 3.

This command will go through all .csv files and output any line with the county code to the new file.

UPDATE

I have multiple counties to deal with.  There's an easy way to start on getting a matrix:

gawk '/C4300US.*(21037|21015|21117).*32100.*/' *.csv >TotalFlowsNKY.csv

This results in a CSV table of only the total flows from three Northern Kentucky counties (21037, 21015, 21117; Campbell, Boone, and Kenton county, respectfully).  For simplicity's sake, I didn't include all 11 that I used.

Finishing Up

Then, I did a little Excel magic to build a matrix for all 11 counties and externals.  The formula is shown.  I have an additional sheet which is basically a cross reference of the county FIPS codes to the name abbreviations I'm using.  See the image below (click for a larger version).

After this, I built a matrix in Excel.  The matrix uses array summation (when you build this formula, you press CTRL+Enter to set it up right, else the returned value will be 0).

Using these techniques, I was able to get a journey to work matrix fairly quickly and without a lot of manual labor.

NOTE

You need to have GNUWin32 installed to use gawk.