## XKCD-style Plots In R

May 17th, 2016

Because XKCD is cool...

Note that the data is totally made up. Well, except that it certainly feels about right, given the Delta flights I was on both to and from Denver...


In any iterative process, it is nice to see progress.

Recently, it was gravity model distribution (again), so I had a process in R to calibrate friction factors, and part of it exported the trip length frequency chart:

```r
ggsave(paste("TLF", Purpose, Period, "_Iter", iteration, "plot.png", sep = ""))
```

That left me with 30 charts showing progress. So, for kicks, I animated the charts using ImageMagick.

To do this, you need ImageMagick (scroll down for the Windows binaries) and the following command line:

```
convert -resize 75% TLFHBWOP_Iter%dplot.png[1-30] HBWOP.gif
```

The resize parameter is optional; I used it to reduce the image size to something Twitter allows.

From time to time, it makes sense to not re-invent the wheel by building something in R that is already available in another program, like Cube Voyager.

This is incredibly easy:

```r
system("voyager.exe script.S /Start /CloseWhenDone")
```

That's it! BUT...

If you want the output from a matrix available to R, the last line of your Cube script should include something like:

```
*cube2omxw output.mat
```

This will convert the output matrix to an OMX file. The asterisk tells Cube to run the rest of the line at the command prompt.

Important requirements: voyager.exe and cube2omxw.exe must be in your path.
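Once the OMX file exists, getting it into R is straightforward, since OMX is HDF5 underneath with matrices stored in a "data" group. A sketch using the rhdf5 package; the file and matrix names here are made up, and a toy matrix is written first to stand in for what cube2omxw would produce:

```r
library(rhdf5)  # OMX is HDF5; matrices live under the "data" group

# Toy stand-in for a converted matrix (cube2omxw would have written the real one)
f <- tempfile(fileext = ".omx")
h5createFile(f)
h5createGroup(f, "data")
h5write(matrix(1:9, nrow = 3), f, "data/output")

mat <- h5read(f, "data/output")  # comes back as a plain R matrix
mat
```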

Heatmaps are an easy way to look at data when you have a lot of it. The heatmap below is from a 30-day traffic count. You can easily and quickly see that morning peaks are northbound, afternoon peaks are southbound, and weekend peaks are in the midday.

There isn't a ton of code needed to do this:
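A minimal sketch of the approach, assuming the counts are in long format (day, hour, direction, volume), with simulated data standing in for the real 30-day counts:

```r
library(ggplot2)

# Simulated hourly counts by direction for 30 days (stand-in for real data)
counts <- expand.grid(day = 1:30, hour = 0:23, direction = c("NB", "SB"))
counts$volume <- rpois(nrow(counts), lambda = 300)

p <- ggplot(counts, aes(x = day, y = hour, fill = volume)) +
  geom_tile() +
  facet_wrap(~ direction) +
  scale_fill_gradient(low = "white", high = "red")
p
```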

This method could be used with shorter-term data with fewer steps. I used this 30-day count because it tells more of a story than a two or three day count.

In travel modeling, we use matrices to do things like zone-to-zone trip flows. The matrix is like a table, but with an equal number of rows and columns, each representing a location in the model (a traffic analysis zone, like a Census Block Group). In the OKI region, we have 2,299 zones, which means there are 5,285,401 cells. Too many to look at, and we don't have reliable data at that level. However, we do have semi-reliable data at the county level.

The Open Matrix Format (OMX) is used to get these matrix files out of a proprietary format and into something that can be read by many programs. We use this at OKI to move data out of Cube (a proprietary software product) and into R (an open source statistical programming package).

Summarizing a matrix to a county flow table in R is actually pretty easy:
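A sketch of the summarization step with a toy matrix (the real one would come out of the OMX file). rowsum() collapses rows by group, so doing it twice, with a transpose in between, collapses both dimensions:

```r
# Toy zone-to-zone matrix: 6 zones, all flows equal to 1
trips  <- matrix(1, nrow = 6, ncol = 6)
county <- c("A", "A", "A", "B", "B", "B")  # county of each zone, in zone order

# Collapse rows by county, transpose, collapse again: county-to-county flows
flows <- rowsum(t(rowsum(trips, county)), county)
flows  # 2 x 2 table; every cell is 9 (3 zones x 3 zones)
```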

This doesn't take very long, either. Anyone familiar with R can see where the code can be revised to summarize to districts.

*Note: the reason Hamilton, Campbell, and Dearborn county names are cut off is related to a bug in Cube. They (Citilabs) are aware of the bug, but it has not been fixed yet.*

I had a need to get the correlation of count to assignment in Cube Voyager. I don't know how to do this off the top of my head, and I'm instantly mistrustful of doing things I'd normally trust to R or Excel. So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation. I didn't make it very far down the page and used the first formula, which produced Voyager code like this:

I left the print statements in, because the output is important.

```
Avg AADT_TRK = 1121.77
Avg VOLUME = 822.03
n = 230.00
sdx1 = 1588160175
sdy1 = 1196330474
n = 230.00
sd AADT_TRK = 2627.75
sd Volume = 2280.67
r2 = 155.06
```

Note the standard deviations above. Ignore the R2 because it's most certainly not correct!
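For reference, the value being chased is a one-liner once the data is in R; a toy sketch with made-up volumes (cor() returns Pearson's r, so square it for r-squared):

```r
# Made-up count and assigned-volume pairs
count    <- c(1200, 300, 2500, 800, 1600)
assigned <- c(1100, 350, 2300, 900, 1500)

r <- cor(count, assigned)  # Pearson product-moment correlation
r^2                        # r-squared; always between 0 and 1
```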

Again, mistrusting my own calculations, I imported the DBF into R and looked at the standard deviations:

```
> sd(trkIn$AADT_TRK)
[1] 2633.476
> sd(trkIn$V_1)
[1] 2285.64
```

Now Standard Deviation is pretty easy to compute. So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

So I started looking into it and recalled something about n vs. n-1 in the RMSE equation and the discussion in the latest Model Validation and Reasonableness Checking Manual. So I decided to manually code the standard deviation in Excel, using sqrt(sum((x-xavg)^2)/(n-1)) instead of Excel's function:

It's not that Excel is incorrect; it's just not applying Bessel's correction. R is.
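The difference is easy to reproduce in a few lines of R, comparing the population form (divide by n) to the sample form (divide by n - 1):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd_pop  <- sqrt(sum((x - mean(x))^2) / n)        # divide by n (no correction)
sd_samp <- sqrt(sum((x - mean(x))^2) / (n - 1))  # Bessel's correction

sd_pop                     # 2
all.equal(sd_samp, sd(x))  # TRUE: R's sd() divides by n - 1
```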

/A

If you don't follow me on Twitter or the Open Model Data site, you may have missed that Cube 6.4 made some DLL changes that rendered the prior version of the Cube2OMX converter unusable. I jumped in (partly because I installed Cube 6.4) and fixed the problem. You can get the source or a binary on Github.

I did this because sending matrices to DBF files means you'll have a 500 MB DBF for a matrix with 3,200+ zones. Normal R routines chug on it. By contrast, OMX files are around 10% of that size (60 MB), and R can read them quite quickly, in a matter of seconds instead of minutes.

So the first thing I wanted to do in R with OMX files is Trip Length Frequency Plots. This isn't incredibly quick, but it works. The code is below. According to one run in R, it takes around 6 minutes to run (on 3,312 zones). The largest part is the loop in the function, and this could probably be parallelized (is that a word?) using doParallel.

Code below or here...
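This isn't the original script, just a miniature sketch of the TLF idea with simulated matrices standing in for the OMX reads (rhdf5::h5read(file, "data/name") is one way to pull the real ones):

```r
set.seed(42)

# Simulated stand-ins for the trip matrix and travel-time skim from the OMX files
trips <- matrix(runif(100, 0, 50), nrow = 10)
times <- matrix(runif(100, 1, 60), nrow = 10)

# Bin the trips into 5-minute travel-time bands and sum each band
bands <- cut(as.vector(times), breaks = seq(0, 60, by = 5))
tlf <- tapply(as.vector(trips), bands, sum)
tlf[is.na(tlf)] <- 0  # bands with no trips

barplot(tlf, las = 2, main = "Trip Length Frequency")
```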