XKCD-style Plots In R

May 17th, 2016

Because XKCD is cool...

Note that the data is totally made up.  Well, except that it certainly feels pretty well correct, given the Delta flights I was on going both to and from Denver...

Resulting Plot

Resulting Plot

Animated Charts from R

April 29th, 2016

In any iterative process, it is nice to see progress.

Recently, it was gravity model distribution (again), so I had a process in R to calibrate friction factors, and part of it exported the trip length frequency chart:

ggsave(paste("TLF", Purpose, Period, "_Iter", iteration, "plot.png", sep=""))

So I was left with 30 charts showing progress.  So for kicks (a little), I animated the charts using ImageMagick:

One of the charts

One of the charts. Click for larger and animated.

To do this, you need Imagemagick (scroll down for Windows) and the following command line:

convert -resize 75% TLFHBWOP_Iter%dplot.png[1-30] HBWOP.gif

The resize parameter is optional, I used it to reduce the image size down to something allowable by Twitter.

Running Cube Scripts from R Scripts

April 26th, 2016

From time to time, it makes sense to not re-invent the wheel by building something in R that is already available in another program, like Cube Voyager.

This is incredibly easy:

system("voyager.exe script.S /Start /CloseWhenDone")

That's it!  BUT...

If you want the output from a matrix available to R, the last line of your Cube script should include something like:

*cube2omxw output.mat

This will convert the output matrix to an omx file.  The star runs it from the command prompt.

Important requirements: voyager.exe and cube2omxw.exe must be in your path.

 

Simple Heatmap in R with ggplot2

March 18th, 2016

Heatmaps are an easy way to look at data when you have a lot.  The heatmap below is from a 30 day traffic count.  You can easily and quickly see that peaks in the morning are northbound, peaks in the afternoon are southbound, and peaks on the weekend are in the midday.

Heatmap

Heatmap

There isn't a ton of code needed to do this:

Data to play with

This method could be used with shorter-term data with fewer steps.  I used this 30-day count because it tells more of a story than a two or three day count.

R + OMX County Flows

October 8th, 2015

In travel modeling, we use matrices to do things like zone-to-zone trip flows.  The matrix is like a table, but with an equal number of rows and columns, each representing a location in the model (a traffic analysis zone, like a Census Block Group).  In the OKI region, we have 2,299 zones, which means there are 5,285,401 cells.  Too many to look at, and we don't have reliable data at that level.  However, we do have semi-reliable data at the county level.

The Open Matrix Format (OMX) is used to get these matrix files out of a proprietary format and into something that can be read by many programs.  We use this at OKI to move data out of Cube (a proprietary software product) and into R (an open source statistical programming package).

Summarizing a matrix to a county flow table in R is actually pretty easy:

This doesn't take very long, either.  Anyone familiar with R can see where the code can be revised to summarize to districts.

This is what the data looks like. Note that this is not verified data (please do not use it!).

This is what the data looks like. Note that this is not verified data (please do not use it!).

Note: the reason Hamilton, Campbell, and Dearborn county names are cut off is related to a bug in Cube.  They (Citilabs) are aware of the bug, but it has not been fixed yet.

Standard Deviation Differences Between Excel and R (and my code in Cube Voyager)

July 24th, 2015

I had a need to get the correlation of count to assignment in Cube Voyager.  I don't know how to do this off the top of my head, and I'm instantly mistrusting of doing things I'd normally trust to R or Excel.  So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation.  I didn't make it that far down on the page and used the first, which generally made Voyager Code like this:

I left the print statements in, because the output is important.

Avg AADT_TRK = 1121.77
Avg VOLUME = 822.03
n = 230.00



sdx1 = 1588160175
sdy1 = 1196330474
n = 230.00
sd AADT_TRK = 2627.75
sd Volume = 2280.67
r2 = 155.06

Note the standard deviations above.  Ignore the R2 because it's most certainly not correct!

Again, mistrusting my own calculations, I imported the DBF into R and looked at the standard deviations:

> sd(trkIn$AADT_TRK)
[1] 2633.476
> sd(trkIn$V_1)
[1] 2285.64

Now Standard Deviation is pretty easy to compute.  So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

Screenshot 2015-07-24 11.09.18

WTF? Am I right or wrong???

So I started looking into it and recalled something about n vs. n-1 in the RMSE equation and discussion in the latest Model Validation and Reasonableness Checking Manual.  So I decided to manually code the standard deviation in Excel and use sqrt(sum(x-xavg)^2/n-1) instead of Excel's function:

Looky there!  Matching Numbers!

Looky there! Matching Numbers!

It's not that Excel is incorrect, it's not using Bessel's Correction.  R is.

/A

R + OMX: Trip Length Frequency Plots

May 8th, 2015

If you don't follow me on twitter or the Open Model Data site, you may have missed that Cube 6.4 makes some DLL changes that rendered the prior version of the Cube2OMX converter unusable.  I jumped in (partly because I installed Cube 6.4) and fixed the problem.  You can get the source or a binary on Github.

I did this because sending matrices to DBF files means that you'll have a 500 MB DBF for a matrix that's 3200+ zones.  Normal R routines chug on it.  On the contrary, OMX files are around 10% of that size (60 MB) and R can read them quite quickly - in a matter of seconds instead of minutes.

So the first thing I wanted to do in R with OMX files is Trip Length Frequency Plots.  This isn't incredibly quick, but it works.  The code is below.  According to one run in R, it takes around 6 minutes to run (on 3,312 zones).  The largest part is the loop in the function, and this could probably be parallelized (is that a word?) using doParallel.

Code below or here...