Reading Chart Images in R

August 6th, 2019

I ran into a non-computer issue recently – a marathon I was going to run had no hill profile on the race’s website. Many runners, myself included, use hill profiles to gauge how we want to run a race. For example, the Cincinnati Flying Pig is one where many runners would want to keep their pace moderated during the first 6 miles because miles 7-9 are a ginormous hill.

Since the race didn’t have an elevation profile on their website, I looked elsewhere for information, and came upon several Strava results that include an elevation profile. This is good, but I wanted to take it one step further and compare the marathon I was going to run – the Toledo Glass City Marathon – to one I had already run – the Cincinnati Flying Pig Marathon.

This is a screenshot of the elevation profile from the Flying Pig Marathon from my Strava race in 2018
This is a screenshot of the elevation profile of the Glass City Marathon from some random runner’s Strava upload of the race

Turning the above images into data requires a few steps. The first is simply cropping down to the chart area.

The next step is to read in the pixel data. This requires the imager package in R (install.packages("imager")).

The first part of the code below loads the libraries and reads the files in as greyscale.

library(imager)
library(ggplot2)
library(dplyr)

fpm.ei = grayscale(load.image("Pictures\\FPM_HillProfile_IR.png"))
gcm.ei = grayscale(load.image("Pictures\\GCM_HillProfile_IR.png"))

The values are input as a matrix indexed with the 0,0 point at the upper-left corner of the image, and each value is an estimate of how light the pixel is (0 is black, 1 is white). This means that an initial plot looks upside-down.

This can be fixed and simplified using the ggplot2 snippet below, and while I was at it, I pulled out the grey background and factored the values to 0 (dark) or 1 (light):

ggplot(as.data.frame(fpm.ei), aes(x = x, y = y, color = as.factor(ifelse(value<0.88, 0, 1)))) + 
  geom_point() + 
  scale_y_reverse()
This is the elevation profile

Note: 0.88 was used because the mean of the data is 0.8812. 0.5 wouldn’t work because the min is 0.5294 – this is because the foreground is grey, not black.

The next part is some processing. The processing does a few things:

  1. Do a gradient along the y axis – this removes the dashed lines
  2. Filter the matrix to just the black areas
  3. Get the minimum of the y axis location to get the contour of the line
fpm.gr <- imgradient(fpm.ei,"y")
fpm.grdf = as.data.frame(fpm.gr) %>%
  filter(value == 0)

fpm.grdf2 = fpm.grdf %>%
  group_by(x) %>%
  summarize(y = min(y))

To analyze, we need to relate the pixels to actual values. The x direction is easy – I used 0 and 26.2, the official distance of a marathon. The y direction is not, so I looked at the pixels and measurements from the images and related the pixels to an elevation – in the case of the Flying Pig, y = 27 => 800 feet MSL and y = 171 => 500 feet MSL. Since the relationship is linear (each pixel represents a fixed number of feet), I used a linear model. Glass City’s values are 28 => 660 feet MSL and 146 => 580 feet MSL.
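The x-direction conversion isn’t shown above; a rough sketch, assuming the traced line spans the full width of the cropped chart, would be:

#mileage: scale pixel x positions to 0-26.2 miles (assumes the line spans the full chart width)
fpm.grdf2$mileage = (fpm.grdf2$x - min(fpm.grdf2$x)) /
  (max(fpm.grdf2$x) - min(fpm.grdf2$x)) * 26.2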

lmFP = lm(yy ~ y,
  data.frame(y = c(27, 171), yy = c(800, 500))
)

summary(lmFP)

fpm.grdf2$elevation = predict(lmFP, newdata = fpm.grdf2)

ggplot(fpm.grdf2, aes(x = mileage, y = elevation)) + geom_point()
Elevation profile as points

This is what we want – the elevation points! Note the y axis is elevation in feet and the x axis is the mileage of the marathon. Doing the same exercise with the Glass City Marathon elevation chart yields this…
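The Glass City steps mirror the Flying Pig code; a sketch, using the calibration points noted above and the same mileage assumption, looks like this:

gcm.gr <- imgradient(gcm.ei, "y")
gcm.grdf2 <- as.data.frame(gcm.gr) %>%
  filter(value == 0) %>%
  group_by(x) %>%
  summarize(y = min(y))

#same full-width assumption as the Flying Pig mileage sketch
gcm.grdf2$mileage = (gcm.grdf2$x - min(gcm.grdf2$x)) /
  (max(gcm.grdf2$x) - min(gcm.grdf2$x)) * 26.2

lmGC = lm(yy ~ y, data.frame(y = c(28, 146), yy = c(660, 580)))
gcm.grdf2$elevation = predict(lmGC, newdata = gcm.grdf2)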

Let’s compare them, shall we? The code below formats the data, and then draws a chart.

compareM = rbind.data.frame(
  fpm.grdf2 %>%
    arrange(mileage) %>%
    mutate(Marathon = "Flying Pig",
           elevationGain = elevation - lag(elevation, 1)),
  gcm.grdf2 %>%
    arrange(mileage) %>%
    mutate(Marathon = "Glass City",
           elevationGain = elevation - lag(elevation, 1))
)

ggplot(compareM, aes(x = mileage, y = elevation, color = Marathon)) + geom_line(size = 1.2) +
  ylab("Elevation in Feet MSL") +
  xlab("Miles") +
  ggtitle("Marathon Elevation Comparison", "Flying Pig (Cincinnati) vs. Glass City (Toledo)")

So the chart above shows that the Glass City is mere child’s play compared to the Flying Pig.

Animated Charts from R

April 29th, 2016

In any iterative process, it is nice to see progress.

Recently, that iterative process was gravity model distribution (again): I had a process in R to calibrate friction factors, and part of it exported the trip length frequency chart each iteration:

ggsave(paste("TLF", Purpose, Period, "_Iter", iteration, "plot.png", sep=""))

That left me with 30 charts showing progress. So, partly for kicks, I animated the charts using ImageMagick:

One of the charts.

To do this, you need ImageMagick (there is a Windows build on their download page) and the following command line:

convert -resize 75% TLFHBWOP_Iter%dplot.png[1-30] HBWOP.gif

The resize parameter is optional; I used it to reduce the image to a size allowable by Twitter.

Running Cube Scripts from R Scripts

April 26th, 2016

From time to time, it makes sense to not re-invent the wheel by building something in R that is already available in another program, like Cube Voyager.

This is incredibly easy:

system("voyager.exe script.S /Start /CloseWhenDone")

That’s it!  BUT…

If you want the output from a matrix available to R, the last line of your Cube script should include something like:

*cube2omxw output.mat

This will convert the output matrix to an omx file.  The star runs it from the command prompt.

Important requirements: voyager.exe and cube2omxw.exe must be in your path.
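A quick way to check that from R (using base R’s Sys.which):

#Returns the full path for each executable, or "" if it is not on the PATH
Sys.which(c("voyager.exe", "cube2omxw.exe"))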

 

Simple Heatmap in R with ggplot2

March 18th, 2016

Heatmaps are an easy way to look at data when you have a lot of it.  The heatmap below is from a 30 day traffic count.  You can easily and quickly see that peaks in the morning are northbound, peaks in the afternoon are southbound, and peaks on the weekend are in the midday.

Heatmap

There isn’t a ton of code needed to do this:
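A rough sketch of the idea with ggplot2’s geom_tile – the data frame and field names (counts, Date, Hour, Direction, Volume) are made up for illustration:

library(ggplot2)

#counts: a data frame with Date, Hour, Direction, and Volume columns (assumed names)
ggplot(counts, aes(x = Hour, y = Date, fill = Volume)) +
  geom_tile() +
  facet_wrap(~ Direction) +
  scale_fill_gradient(low = "white", high = "red") +
  xlab("Hour of Day")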

Data to play with

This method could be used with shorter-term data with fewer steps.  I used this 30-day count because it tells more of a story than a two or three day count.

R + OMX County Flows

October 8th, 2015

In travel modeling, we use matrices to do things like zone-to-zone trip flows.  The matrix is like a table, but with an equal number of rows and columns, each representing a location in the model (a traffic analysis zone, like a Census Block Group).  In the OKI region, we have 2,299 zones, which means there are 5,285,401 cells.  Too many to look at, and we don’t have reliable data at that level.  However, we do have semi-reliable data at the county level.

The Open Matrix Format (OMX) is used to get these matrix files out of a proprietary format and into something that can be read by many programs.  We use this at OKI to move data out of Cube (a proprietary software product) and into R (an open source statistical programming package).

Summarizing a matrix to a county flow table in R is actually pretty easy:
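A rough sketch of the summarization, assuming the OMX matrix has already been read into a square matrix called trips and that zone_county maps each zone number to a county name:

library(reshape2)
library(dplyr)

#trips: square zone-to-zone matrix already read from the OMX file (assumed)
#zone_county: character vector of county names, one entry per zone (assumed)
flows <- melt(trips, varnames = c("i", "j"), value.name = "trips") %>%
  mutate(county_i = zone_county[i], county_j = zone_county[j]) %>%
  group_by(county_i, county_j) %>%
  summarize(trips = sum(trips))

#cast the flows to a county-by-county table
dcast(flows, county_i ~ county_j, value.var = "trips")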

This doesn’t take very long, either.  Anyone familiar with R can see where the code can be revised to summarize to districts.

This is what the data looks like. Note that this is not verified data (please do not use it!).

Note: the reason Hamilton, Campbell, and Dearborn county names are cut off is related to a bug in Cube.  They (Citilabs) are aware of the bug, but it has not been fixed yet.

Standard Deviation Differences Between Excel and R (and my code in Cube Voyager)

July 24th, 2015

I had a need to get the correlation of count to assignment in Cube Voyager. I don’t know how to do this off the top of my head, and I’m instantly mistrusting of doing things I’d normally trust to R or Excel. So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation. I didn’t make it very far down the page and used the first formula listed, which led to Voyager code like this:

I left the print statements in, because the output is important.

Avg AADT_TRK = 1121.77
Avg VOLUME = 822.03
n = 230.00



sdx1 = 1588160175
sdy1 = 1196330474
n = 230.00
sd AADT_TRK = 2627.75
sd Volume = 2280.67
r2 = 155.06

Note the standard deviations above. Ignore the R2 because it’s most certainly not correct!

Again, mistrusting my own calculations, I imported the DBF into R and looked at the standard deviations:

> sd(trkIn$AADT_TRK)
[1] 2633.476
> sd(trkIn$V_1)
[1] 2285.64

Now Standard Deviation is pretty easy to compute. So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

WTF? Am I right or wrong???

So I started looking into it and recalled something about n vs. n-1 in the RMSE equation and discussion in the latest Model Validation and Reasonableness Checking Manual. So I decided to manually code the standard deviation in Excel and use sqrt(sum((x - xavg)^2) / (n - 1)) instead of Excel’s function:

Looky there! Matching Numbers!

It’s not that Excel is incorrect; it’s just not using Bessel’s Correction here, while R is.
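A quick illustration of the difference in R (with made-up numbers):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   #made-up values

sd(x)                                         #sample SD, divides by n-1 (Bessel's Correction)
sqrt(sum((x - mean(x))^2) / (length(x) - 1))  #same thing, by hand
sqrt(sum((x - mean(x))^2) / length(x))        #population SD, divides by n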

/A

R + OMX: Trip Length Frequency Plots

May 8th, 2015

If you don’t follow me on twitter or the Open Model Data site, you may have missed that Cube 6.4 makes some DLL changes that rendered the prior version of the Cube2OMX converter unusable.  I jumped in (partly because I installed Cube 6.4) and fixed the problem.  You can get the source or a binary on Github.

I did this because sending matrices to DBF files means you’ll have a 500 MB DBF for a matrix with 3,200+ zones, and normal R routines chug on it.  By contrast, OMX files are around 10% of that size (60 MB), and R can read them quite quickly – in a matter of seconds instead of minutes.

So the first thing I wanted to do in R with OMX files is Trip Length Frequency Plots.  This isn’t incredibly quick, but it works.  The code is below.  According to one run in R, it takes around 6 minutes to run (on 3,312 zones).  The largest part is the loop in the function, and this could probably be parallelized (is that a word?) using doParallel.

Code below or here
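The core of the idea, assuming the trip matrix and the time skim have already been read from OMX into square matrices called trips and time, is roughly:

library(ggplot2)

#trips, time: square zone-to-zone matrices already read from the OMX files (assumed)
tlf <- data.frame(
  minutes = as.vector(round(time)),
  trips   = as.vector(trips)
)
tlf <- aggregate(trips ~ minutes, data = tlf, FUN = sum)

ggplot(tlf, aes(x = minutes, y = trips)) +
  geom_bar(stat = "identity") +
  xlab("Travel Time (minutes)") +
  ylab("Trips")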

R Quick-Take: Reading a Ton of Files in a Few Lines

December 1st, 2014

I just downloaded 2,159 traffic count files over the Internet. I’m going to have to work with these in various ways.

So the following quick snippet of code reads all of them into one data frame:
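The idea, sketched with an assumed folder name:

#Read every CSV in the counts folder into one data frame (the folder name is assumed)
files <- list.files("counts", pattern = "\\.csv$", full.names = TRUE)
counts <- do.call(rbind, lapply(files, read.csv))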

Using R with Phant

November 3rd, 2014

Last week on another blog, I showed a way to connect a temperature and humidity sensor to a Beaglebone Black and read it using some Python code that ultimately started with Adafruit.

So to be able to play (a little) AND after complaining about my office temperature last week, I decided to plug this thing in at work and set it up with Phant, the IOT data service from Sparkfun.  Then I wrote a quick R script to get the JSON file and plot the temperature.

Plot of Temperature

The code is below:
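A rough equivalent of that script, with a placeholder Phant URL and field names:

library(jsonlite)
library(ggplot2)

#The public stream URL and the temp field name are placeholders, not the real ones
phant <- fromJSON("https://data.sparkfun.com/output/PUBLIC_KEY.json")
phant$timestamp <- as.POSIXct(phant$timestamp, format = "%Y-%m-%dT%H:%M:%S")

ggplot(phant, aes(x = timestamp, y = temp)) +
  geom_line() +
  ylab("Temperature (F)")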

It’s pretty simple, although the plot could use me spending a bit more time on it… and perhaps limiting the data to the last hour or day or something.  All of that can be done within R pretty easily.  Also, I did make a file available for anyone that wishes to play (the file is NOT ‘live’, the Phant server is on a different network).

/A

Quick R Trick: Lists and Data Frames

October 24th, 2014

I didn’t know what else to call this, but you can use a character vector of field names to pull multiple fields out of a data frame at once:

MIN=c('NAICS21','NAICS22','NAICS23')

temp=maz[,MIN] #temp is now a data frame of all rows with just NAICS21, NAICS22, and NAICS23

temp=maz[,c("TAZ",MIN)] #temp is now a data frame of all rows with just TAZ, NAICS21, NAICS22, and NAICS23

This can be incredibly useful in many situations (moving ES202 data to model employment categories is one of many).

 

Lookups in R: The REAL Right Way!

September 9th, 2014

After waiting long enough for things to run, I stumbled into a better way to do lookups.

mapply on matrix objects.

Basically, I do this:

library(foreign)   #for read.dbf
library(reshape2)  #for acast

TSkimLBWPk<-read.dbf("data/TSPKLBW.DBF") #Read the local bus walk skim

TSKimLBWPK_IWAIT=(acast(TSkimLBWPk,I~J,value.var="V1",drop=FALSE,fill=0)) #build a matrix from the data

TSKimLBWPK.IWAIT<-function(i,j) {
  if(i<=nrow(TSKimLBWPK_IWAIT) && j<=ncol(TSKimLBWPK_IWAIT))
    return(TSKimLBWPK_IWAIT[i,j])
  else return(0)
} #build a function to do the lookup, returning 0 if it is asked for a zone not in the skim

TripsAllPk$LBW.IWAIT=mapply(TSKimLBWPK.IWAIT,TripsAllPk$PTAZ,TripsAllPk$ATAZ) #do the lookup

That’s it. This takes the input DBF (which has I, J, V1, V2, etc. fields), converts to a matrix for a quick lookup, and then applies it.

It runs in about 3 minutes.

Lookups in R: The Wrong Way and the Right Way

August 28th, 2014

I recently wrote a script that takes DBF exports of Cube matrices and prepares them for Biogeme.  The main… well, only reason I did this in R was because I was considering using mlogit for model estimation.  I ultimately decided to ‘go with what I know’ and changed course to use Biogeme. Mind you, the part of Stairway to Heaven applies: “There are two paths you can go by, but in the long run / There’s still time to change the road you’re on.”

The Wrong Way

I’ve changed my code already, so pardon that this is from memory.  Also, these are snippets – I have a lot more code than this.

HSkimPk<-read.dbf("data/HSKIM_PK3.dbf")

for(rn in 1:nrow(TripsAll)){
  HSkimPkRow<-subset(HSkimPk,I==TripsAll[rn,"PTAZ"] & J==TripsAll[rn,"ATAZ"])
  TripsAll[rn,"DA.IVT"]<-HSkimPkRow[,"V1"]
  ...
}

This took no less than 17 hours to complete for around 23,000 trip records and for values from 5 different tables * 2 time periods.

The Right Way

I (obviously) wanted something that wouldn't take forever, especially as I was working in Biogeme and seeing things that made me think that I wanted to change ONE LITTLE THING.  This seems to always happen.

I took a different approach that by my calculations should be much quicker.

HSkimPk<-read.dbf("data/HSKIM_PK3.dbf")
HSkimPkD<-acast(HSkimPk,I ~ J,value.var="V2",drop=FALSE,fill=0)
HSkimPkT<-acast(HSkimPk,I ~ J,value.var="V1",drop=FALSE,fill=0)

for(rn in 1:nrow(TripsAll)){
  I<-TripsAll[rn,"PTAZ"]
  J<-TripsAll[rn,"ATAZ"]
  if(I<=nrow(HSkimPkT) & J<=ncol(HSkimPkT)){
    TripsAll[rn,"DA.IVT"]<-HSkimPkT[I,J]
  }
}

Since this is currently running, my only metrics are to look at the time per 50 rows (in my real code, I have a line that outputs a timestamp every 50 rows), and it is taking about 0.27 seconds per record, compared to somewhere around 4.5 seconds per record.  While not perfect, I'll take an estimated completion of 1.75 hours compared to 17 (update: 2 hours).  However, I will say that Cube is faster in this regard and that I may not have the fastest R solution.

Improved Nice, Quick, Effective Sampler in R

August 1st, 2014

I found a problem with the sampler I was using last week.  It is an uncontrolled random sampler, which presents some problems when doing choice modeling.

To deal with the issue, I created a better one.  This version is a stratified random sampler that samples from each income and auto group (which I need to have samples in all income/auto groups in order to develop and test my auto ownership model).  The sampling process works like this:

  1. If there are more than 10 samples for an income/auto group, sample using last week’s random method
  2. If there are at least 2 but not more than 10 samples, pick one at random to be tested and put the rest into the develop group
  3. If there is 1 sample, put it in both
  4. If there are no samples in the group, print an error to the screen

The code is below:
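A sketch of the logic in the list above (the hh data frame and the INCOME and HHVEH4 field names are assumed):

develop <- NULL
test <- NULL

for (inc in unique(hh$INCOME)) {
  for (aut in unique(hh$HHVEH4)) {
    grp <- subset(hh, INCOME == inc & HHVEH4 == aut)
    n <- nrow(grp)
    if (n > 10) {
      #plenty of records: plain 80/20 random split, as in last week's sampler
      idx <- sample(1:n, size = round(0.8 * n))
      develop <- rbind(develop, grp[idx, ])
      test <- rbind(test, grp[-idx, ])
    } else if (n >= 2) {
      #few records: hold one out at random for testing, develop on the rest
      idx <- sample(1:n, size = 1)
      develop <- rbind(develop, grp[-idx, ])
      test <- rbind(test, grp[idx, ])
    } else if (n == 1) {
      #a single record goes into both groups
      develop <- rbind(develop, grp)
      test <- rbind(test, grp)
    } else {
      print(paste("No households with income", inc, "and autos", aut))
    }
  }
}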


-A-

Nice, Quick, Effective Sampler in R

July 25th, 2014

It’s always good to test models.  In the best case, we’d test a model against a different dataset than the one used to estimate it.  In a not-the-best-but-still-not-that-bad case, we’d use 80% of a dataset to estimate a model and 20% of it to test against.

In a lot of cases, someone uses a Gibbs Sampler to get that 80%.  I didn’t feel like over-complicating things, so I wrote a simple random sampler and added some checks to verify that the sample is good.  The following worked well for me.
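A sketch of that sampler (hh and HHSize6 are assumed names):

#hh is assumed to be the household data frame
hh$rand <- runif(nrow(hh))
hhEstimate <- subset(hh, rand <  0.8)   #~80% for model estimation
hhTest     <- subset(hh, rand >= 0.8)   #~20% for model testing

#quick check that the split preserves the household size distribution
table(hh$HHSize6) / nrow(hh)
table(hhEstimate$HHSize6) / nrow(hhEstimate)
table(hhTest$HHSize6) / nrow(hhTest)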

So, to show that it worked, the screenshots below compare the distributions (click on the pictures to see them large enough to be legible): pctHH is the input percentage, PctHHS is the sample for model estimation, and PctHHT is the sample for model testing.

(Four screenshots comparing the pctHH, PctHHS, and PctHHT distributions.)

 

These will ultimately be part of something larger, but so far, so good.

-A-

Iterating Through DBFs – R Style!

March 6th, 2014

Anyone familiar with transportation modeling is familiar with processes that iterate through data.  Gravity models iterate, feedback loops iterate, assignment processes iterate (well, normally), model estimation processes iterate, gravity model calibration steps iterate, shadow cost loops iterate… the list goes on.

Sometimes it’s good to see what is going on during those iterations, especially with calibration.  For example, in calibrating friction factors in a gravity model, I’ve frequently run around 10 iterations.  However, as an experiment I set the iterations on a step to 100 and looked at the result:

This is the mean absolute error in percentage of observed trips to modeled trips in distribution.  Note the oscillation that starts around iteration 25 – this was not useful nor detected.  Note also that the best point was very early in the iteration process – at iteration 8.

After telling these files to save after each iteration (an easy process), I faced the issue of trying to quickly read 99 files and get some summary statistics.  Writing that in R was not only the path of least resistance, it was so fast to run that it was probably the fastest solution.  The code I used is below, with comments:
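A sketch of the approach, with assumed file and field names:

library(foreign)

#File names and field names are assumed for illustration
iterations <- 1:99
mae <- sapply(iterations, function(i) {
  tlf <- read.dbf(paste("TLF_Iter", i, ".dbf", sep = ""))
  #mean absolute error, in percent, of modeled vs. observed trips
  mean(abs(tlf$MODELED - tlf$OBSERVED) / tlf$OBSERVED) * 100
})

plot(iterations, mae, type = "l", xlab = "Iteration", ylab = "Mean Absolute Error (%)")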

ISLR Fridays: Introduction

February 7th, 2014

UPDATE 2014-03-24: I pushed everything back because lots of things have been busy.  

UPDATE 2014-02-25: I pushed everything back 2 weeks because lots of things have been busy.  

Last week, I posted a link to a set of free books to this blog.  Not long after, I got a twitter message from a friend:

You and I should setup to study the R book jointly. Somebody pushing along is tremendously helpful to me. Interested?

The R book is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

So I decided I’m going to post biweekly to this blog for the next 18 weeks and talk about what I’ve learned.  Responses are welcome in the comments or via email at andrew .- -.-. siliconcreek .-.-.- net (related comments may be posted to this blog).

The schedule is something like this, based on the chapters of the books:

  1. Statistical Learning - April 18
  2. Linear Regression -  May 2
  3. Classification - May 16
  4. Resampling Methods - May 30
  5. Linear Model Selection and Regularization - June 13 (Friday the 13th???)
  6. Moving Beyond Linearity - July 4 (well, this is when it will post to the blog)
  7. Tree-Based Methods - July 18
  8. Support Vector Machines - August 1
  9. Unsupervised Learning – August 15

So this will be not-too-intense, and with my current workload being spent a lot on waiting for models to run (I’m waiting on one right now, which is partly why I read the introduction), I should be able to spend some time on it.

In addition to the exercises in the book, I intend to post a link to a sanitized version of the Greater Cincinnati Household Travel Survey.  This sanitized version will have a number of changes made to it to protect the privacy of the survey participants (for example, we will not include the names, phone numbers, addresses, or GPS coordinates).

 

Free Statistical Learning Texts!

January 30th, 2014

A few free statistical text books have been posted to the interwebs courtesy of a few universities.  Head over to Hyndsight (Rob Hyndman’s blog) for the links.  One of the books has applications to R.

I’ve downloaded the first two (Elements of Statistical Learning and Introduction to Statistical Learning with applications in R) and sent them to my Nexus 7 Kindle App for later reading.

Illustration of Gravity Model K Factor Effects

December 10th, 2013

While the use of K factors can be very questionable when applied inappropriately, they have been a practice in gravity models for a very long time.  For some regions where psychological boundaries (e.g. state lines, river crossings, etc.) cause an effect on travel, K factors have been used to correct problems.

I decided to take a closer look at the effects of K factors on a small model in R.  I fixed the friction factors to 1 to eliminate their effects and just show the effects of the K factors.

Using a single constraint gravity model, the effects are quite pronounced:

This is the base – all K factors are set to 1

Scenario 2 – K factors for 1-5 and 5-1 are set to 2.

Scenario 3 – the K factors for 1-5 and 5-1 are set to 0.5.

Looking at the three, two things stand out.  First, a K of 2 or 0.5 does not mean that twice or half as many trips will be forecasted to those zones.  Second, since this is a singly-constrained model, the production totals are all the same, but the attraction totals vary.  The code to run this is on a Github Gist.

This is just a quick example.  It would change with a doubly-constrained model (which I haven’t attempted yet).  The details of that may be posted soon.
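For reference, here is a rough sketch of how K factors enter a singly-constrained gravity model, with made-up productions and attractions and the friction factors fixed at 1 as described above (this is not the gist code):

P  <- c(100, 200, 150, 120, 80)    #productions (made up)
A  <- c(120, 180, 160, 100, 90)    #attractions (made up)
ff <- matrix(1, 5, 5)              #friction factors fixed at 1
kf <- matrix(1, 5, 5)              #base scenario: all K factors 1
kf[1, 5] <- kf[5, 1] <- 2          #scenario 2: K for 1-5 and 5-1 set to 2

#Singly-constrained gravity: each zone's productions are split in proportion to A * F * K
trips <- t(sapply(1:5, function(i) P[i] * A * ff[i, ] * kf[i, ] / sum(A * ff[i, ] * kf[i, ])))

rowSums(trips)  #production totals are preserved
colSums(trips)  #attraction totals shift with the K factors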

Gravity Model Calibration in R (Example)

December 3rd, 2013

Calibrating a gravity model for the first time is difficult.  I stumbled upon a webpage from a professor at Ohio State that really helps.  One thing I like to do is actually do the examples because I can ensure that my code works and it gives me some ideas.

The code to replicate Dr. Viton’s work is below.
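As a rough illustration of the mechanics, a generic three-zone doubly-constrained gravity model with iterative balancing looks like the sketch below; the productions, attractions, and friction factors are made up and are not Dr. Viton’s numbers:

P  <- c(1000, 1500, 500)                     #productions (made up)
A  <- c(1200, 1000, 800)                     #attractions (made up)
ff <- matrix(c(1.0, 0.4, 0.2,
               0.4, 1.0, 0.5,
               0.2, 0.5, 1.0), 3, 3)         #friction factors (made up)

Bj <- rep(1, 3)
for (iter in 1:50) {
  Ai <- 1 / (ff %*% (A * Bj))                #origin balancing factors
  Bj <- 1 / (t(ff) %*% (P * Ai))             #destination balancing factors
}
trips <- diag(as.vector(P * Ai)) %*% ff %*% diag(as.vector(A * Bj))

rowSums(trips)  #should match P after convergence
colSums(trips)  #should match A after convergence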

Obviously this runs really quickly, since it is only three zones.  It is nice that R has some matrix tricks that help make it easy.  One thing to note is that running this for a large matrix in R takes forever (at least the way this is written).  It is possible to run it in parallel across all processors, but it still takes forever.

Mapping with R Markdown

November 20th, 2013

A natural extension of my last blog post about using R Markdown is to take it a step further and make maps with it.  RMarkdown is a great tool to not only create the map, but to automate creating the map AND putting the map in a document.

This is a work-in-progress.  This post is a day late (I’ve been trying to keep to a Tuesday posting schedule on this blog, and it’s Wednesday afternoon as I finally type this).  I’m still not happily done, but I’m also ANGRY at the output of the distribution model (now that I’ve changed a lot of stuff in the model), so I’m having to take a step back and redo friction factors.  Rather than post a finished product, I decided to post the work in progress, which is probably as helpful as the finished product.

If you haven’t read last week’s post on using RMarkdown, do so now.  I’ll wait.  Back?  Good.

The code:

The first part, above “Manually create the lines” is basic library-loading and data input (and some manipulation/subsetting)

The second part is creating the desire lines.  These are created by first creating a list with two coordinate pairs in it (l1 – l7).  Those objects are then turned into individual lines (Sl1-Sl7).  The individual lines are packaged into one Lines object (basically, a list of lines) (DL).  Finally, that object is prepared for the map (deslines<-list…).

The third part is the text.  There are two functions here: one gets the mid-point of a line, the other gets the angle of a line.  There was a lot of trial and error that went into this.  In the lines after the functions (the txt1-txt7 and mpt1-mpt7 objects), the text is formatted and map objects are created for it.

The fourth part is the counties for the map.  The col=… and cpl<-… lines handle the colors for the counties.

The last part is drawing the map.  The spplot() function handles all of that.  The primary map is made out of the counties, and the lines and text are added in the sp.layout=… portion of the command.

That’s it!  It really isn’t difficult as long as you remember trigonometry (and of course, you don’t even have to do that since it is in my code above).  I also included at the bottom some references to the most useful resources I found while doing this, as there are many, many, MANY more ways to do this, options to play with, etc.
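To make the structure concrete, here is a compressed sketch of the desire-line pieces described above; the coordinates, label text, and the counties layer (assumed to be a SpatialPolygonsDataFrame with a TRIPS field) are all hypothetical:

library(sp)

#Hypothetical endpoints for one desire line (longitude/latitude pairs)
l1  <- Line(rbind(c(-84.51, 39.10), c(-84.19, 39.76)))
Sl1 <- Lines(list(l1), ID = "line1")
DL  <- SpatialLines(list(Sl1))
deslines <- list("sp.lines", DL, lwd = 2)

#Midpoint and angle for a text label along the line
mid <- c(mean(c(-84.51, -84.19)), mean(c(39.10, 39.76)))
ang <- atan2(39.76 - 39.10, -84.19 - -84.51) * 180 / pi
txt1 <- list("sp.text", mid, "1,234 trips", srt = ang, cex = 0.8)

#'counties' is assumed to be a SpatialPolygonsDataFrame with a TRIPS field, read elsewhere
spplot(counties, "TRIPS", sp.layout = list(deslines, txt1))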

Example Report.  NOTE: THE DATA IS INCORRECT.

Travel Model Reports with R and knitr

November 12th, 2013

I’ve had my share of complaining about the various “reporting” platforms available to those of us that do travel modeling.  I’ve looked at a few options, and nothing has stuck as “the one”. Until now.

In the meantime, I’ve noticed that a lot of groups have adopted Markdown.  It’s found its way onto Github via Jekyll.  Jekyll’s found its way into my life as a quick blogging and site-building solution.  Then, I stumbled upon RStudio’s RMarkdown.  This is becoming a goldmine because RStudio is a great platform for developing things in R (including presentations and R Markdown).  Even better, the RMarkdown documents can be run via R (in a batch wrapper).  The only missing link is the ability to read matrix files directly.  I guess we can’t have everything, but I have a solution for that, too.

What Is This Markdown Thing And Why Should I Care?

Markdown is a pretty easy thing to grasp.  It’s open and fairly flexible.  It’s a text markup format that is easy to read when not rendered.  Easy to read means easy to write.  The openness means that you can do things with it.  In the case of RMarkdown, you can add R code blocks and LaTeX equations.  I will admit that LaTeX equations are not legible until rendered, but once you start adding R into the mix, the focus shifts away from reading the unrendered RMarkdown and toward reading the rendered output.

The link to Github (above) goes to their Markdown cheat sheet.  That alternates between Markdown and HTML output and it’s pretty easy to see how things work.

Getting Model Run Results into RMarkdown and into Rendered Output

There’s a number of things that need to happen to get model run results into R/RMarkdown and then to Output:

  1. Output data to a format R understands
  2. Write RMarkdown document
  3. Write RScript to render RMarkdown to HTML
  4. Write Windows Batch File to automate the RScript

Output data to a format R understands

In the case of zonal data, R can understand CSV out of the box, and with the appropriate library, can understand DBF.  With matrix files, Voyager will export them to DBF with a simple script file:

This script simply reads a bunch of matrix files and outputs them to two DBF files, one for the peak-period distribution and one for the off-peak-period distribution.

One important thing to note is that I didn’t put paths in this.  I run it from the command line in the model folder, and it picks up the files in that folder and outputs the DBFs there.  This is something that would have to be tested when placed into a model run file.

Resist the urge to do this in two separate steps.  The join process in R takes forever, and reading the data into memory may take a while, too.

Write RMarkdown document

The RMarkdown document is where the magic happens.  Fortunately, Knitr (the R Package that does all this magic) does a few things to split code blocks.  If you want to build my file, add all these together into one file and name it something.rmd

There are three code blocks that do this.  They are importing, summarizing, and graphing.

Importing Data

This block does three things:

  1. Loads libraries.  The foreign library is used to read DBF files.  The plyr library is used to join and summarize the data frames (from the DBF inputs).  The ggplot2 library is used for plots.
  2. Sets a few variables.  Since the OKI model is actually two MPO models, we do three reports of everything – one for the entire model, one for the OKI (Cincinnati) region, and one for the MVRPC (Dayton) region.  zones_oki and zones_mv are used to control which report is which.
  3. Imports the DBF files.  Those read.dbf lines are where that magic happens.  Again, since this is run in the model folder, no paths are used.
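A sketch of what that chunk looks like – the zone ranges and DBF names here are placeholders, not the real OKI/MVRPC values:

library(foreign)   #read.dbf for the DBF inputs
library(plyr)      #join and summarize data frames
library(ggplot2)   #plots

#Zone ranges and file names below are placeholders, not the real values
zones_oki <- 1:2299
zones_mv  <- 2300:3312

distPK <- read.dbf("DISTPK.DBF")   #peak-period distribution
distOP <- read.dbf("DISTOP.DBF")   #off-peak-period distribution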

Summarizing Data

This block does three things:

  1. It rounds the logsum values to provide some grouping to the values
  2. It gets a subset of the model (for OKI)
  3. It summarizes the rounded values to prepare for charting them
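A sketch of that chunk, with assumed field names (LOGSUM, I, TRIPS):

#Field names here are assumed for illustration
distPK$LogsumR <- round(distPK$LOGSUM, 1)            #1. round the logsum values for grouping
distPK.oki <- subset(distPK, I %in% zones_oki)       #2. subset to the OKI region
sumPK <- ddply(distPK.oki, .(LogsumR), summarise,    #3. summarize the rounded values
               Trips = sum(TRIPS))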

Charting Data

This block does one thing: it draws the chart using the ggplot tool.  This is pretty advanced (although not bad, and the online help is good).  However, for this I’m going to hold to my recommendation to get a copy of The R Graphics Cookbook (where I learned how to use ggplot).  The concepts and examples in the book are far greater than what I will post here.
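For what it’s worth, the core of the chunk is only a few lines (building on the summary table sketched above):

ggplot(sumPK, aes(x = LogsumR, y = Trips)) +
  geom_bar(stat = "identity") +
  xlab("Rounded Logsum") + ylab("Trips")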

One point that should not be lost is that text elements (Markdown headings, plain text, or formatted text) can be added outside of the R code chunks (the parts delimited by triple backticks). This way, reports can actually look good!

Once this part is complete, the hardest stuff is over.

Write RScript to render RMarkdown to HTML

The RScript to render the RMarkdown file to HTML is pretty simple:
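A sketch, with an assumed file name:

library(knitr)
knit2html("TravelModelReport.Rmd")   #file name is assumed; output is TravelModelReport.html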

This writes the .Rmd file out to the same filename as .html. You can have as many knit2html lines as needed

There are ways to write the files out to PDF (I haven’t looked into them… perhaps that would be a future blog topic?).

Write Windows Batch File to automate the RScript

The last step is to write a batch file “wrapper” that can be run as part of the travel demand model run.  This is really quite easy:

The first line sets the path to include R (on my system it isn’t in the path, and my path statement is already 9 miles long). The second line runs the R script file (ReportR.R) in R.

 

That’s It! It seems like a lot of work goes into this, but it isn’t as difficult as some of the other reporting platforms out there.

PS: Stay tuned for some example reports

Update:

Example Report (generated from RMarkdown file)

Example RMarkdown File

Trip Rates: Averages and Analysis of Variance

September 13th, 2013

This is second in the R in Transportation Modeling Series of posts.

I’ve been going between R, the R Graphics Cookbook, NCHRP Report 716, and several other tasks, and finally got a chance to get back to actually estimating the trip generation rates in the model.

The Process

NCHRP Report 716 lays out a process that looks something like this when you make a graphic out of it:

 Flowchart of Trip Rate Process

So, the first part is to summarize the trips by persons, workers, income, and vehicles.  I’m going to skip income (I don’t care for using that as a variable).

This can be done pretty easily in R with a little bit of subsetting and the ddply summaries that I’ve written about before. I split the households into three groups based on area type: area type 1 is rural, 2 is suburban, and I combined 3 and 4 (urban and CBD).


hh.1<-subset(hhs,AreaType==1)
hh.1.sumper<-ddply(hh.1,.(HHSize6),summarise,T.HBW=sum(HBW),T.HH=length(HHSize6),TR=T.HBW/T.HH)
hh.1.sumwrk<-ddply(hh.1,.(Workers4),summarise,T.HBW=sum(HBW),T.HH=length(Workers4),TR=T.HBW/T.HH)
hh.1.sumaut<-ddply(hh.1,.(HHVeh4),summarise,T.HBW=sum(HBW),T.HH=length(HHVeh4),TR=T.HBW/T.HH)

This makes three tables that looks like this:

>hh.1.sumper
HHSize6 T.HBW T.HH TR
1 1 9 54 0.1666667
2 2 77 98 0.7857143
3 3 24 38 0.6315789
4 4 36 40 0.9000000
5 5 18 10 1.8000000
6 6 4 6 0.6666667

Once all of this is performed (and it isn't much, as the code is very similar among three of the four lines above), you can analyze the variance:

#Perform analysis of variance, or ANOVA
> hh.1.perfit<-aov(TR~HHSize6,data=hh.1.sumper)
> hh.1.wrkfit<-aov(TR~Workers4,data=hh.1.sumwrk)
> hh.1.autfit<-aov(TR~HHVeh4,data=hh.1.sumaut)
#Print summaries
> summary(hh.1.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.4824 0.4824 1.987 0.231
Residuals 4 0.9712 0.2428

> summary(hh.1.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 0.1113 0.1113 0.184 0.697
Residuals 3 1.8146 0.6049

> summary(hh.1.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 0.0994 0.09938 0.536 0.54
Residuals 2 0.3705 0.18526

The items above indicate that none of the three items above (persons per household, workers per household, or autos per household) are very significant predictors of home based work trips per household. Admittedly, I was a bit concerned here, but I pressed on to do the same for suburban and urban/CBD households and got something a little less concerning.

Suburban households

> summary(hh.2.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 0.6666 0.6666 23.05 0.0172 *
Residuals 3 0.0868 0.0289
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 1.8951 1.8951 11.54 0.0425 *
Residuals 3 0.4926 0.1642
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.6530 0.6530 10.31 0.0326 *
Residuals 4 0.2534 0.0634
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Urban and CBD Households

> summary(hh.34.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 1.8904 1.8904 32.8 0.0106 *
Residuals 3 0.1729 0.0576
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 5.518 5.518 680 0.000124 ***
Residuals 3 0.024 0.008
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.7271 0.7271 9.644 0.036 *
Residuals 4 0.3016 0.0754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Another Way

Another way to do this is to do the ANOVA without summarizing the data. The results may not be the same or even support the same conclusion.

Rural

> summary(hh.1a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 15.3 15.310 10.61 0.00128 **
Residuals 244 352.0 1.442
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.1a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 60.46 60.46 48.08 3.64e-11 ***
Residuals 244 306.81 1.26
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.1a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 4.6 4.623 3.111 0.079 .
Residuals 244 362.6 1.486
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Suburban

> hh.2a.perfit<-aov(HBW~HHSize6,data=hh.2)
> hh.2a.wrkfit<-aov(HBW~Workers4,data=hh.2)
> hh.2a.autfit<-aov(HBW~HHVeh4,data=hh.2)
> summary(hh.2a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 136.1 136.05 101.9 <2e-16 ***
Residuals 1160 1548.1 1.33
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 376.8 376.8 334.4 <2e-16 ***
Residuals 1160 1307.3 1.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 103.2 103.20 75.72 <2e-16 ***
Residuals 1160 1580.9 1.36
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Urban/CBD

> hh.34a.perfit<-aov(HBW~HHSize6,data=hh.34)
> hh.34a.wrkfit<-aov(HBW~Workers4,data=hh.34)
> hh.34a.autfit<-aov(HBW~HHVeh4,data=hh.34)
> summary(hh.34a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 77.1 77.07 64.38 4.93e-15 ***
Residuals 639 765.0 1.20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 222.0 221.96 228.7 <2e-16 ***
Residuals 639 620.1 0.97
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 91.1 91.12 77.53 <2e-16 ***
Residuals 639 751.0 1.18
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

What this illustrates is that there is a difference between analyzing the group averages and analyzing the raw household-level numbers.

The next part of this will be to test a few different models to determine the actual trip rates to use, which will be the subject of the next blog post.

Mode Choice Modeling with R

June 14th, 2013

I started this post (and the work to go with it) as a companion to A Self Instructing Course in Mode Choice Modeling by Bhat and Koppelman.  That’s because I wanted to see whether I could reproduce the work in the book in R – and (now) I can.

To continue with this, please get the CD files from my last blog post.  You’ll specifically need “SF MTC Work MC Data.sav”, which is in SPSS format.

The first part:

library(foreign)
library(mlogit)

The items above simply load the libraries.  If any of these are not found, go to Packages (on the menu bar) – Install Packages… and select your closest mirror and select the missing package (either foreign or mlogit).

Next, read in the data and we’ll add a field, too, as there is no unique id in this dataset.

inTab<-read.spss(file.choose(),to.data.frame=T,use.value.labels=F)
inTab$HHPerID=inTab$hhid*100+inTab$perid

The first line reads in the SPSS file (it asks you for the file).  The second adds a "HHPerID" field, which is unique to each case.

The next part is to format the data for mlogit.  This is quite a challenge because it has to be JUST RIGHT or there will be errors.

mc<-mlogit.data(inTab,choice="chosen",shape="long",chid.var="HHPerID",alt.var="altnum",drop.index=T)

The first parts of this are pretty obvious (inTab is the input table, choice="chosen" is the choice field).  Shape="long" indicates that the data is multiple records per case.  "Wide" would indicate each record is on its own line.  chid.var is the case id variable.  alt.var is the alternatives.  drop.index drops the index field out of the resulting table.

Finally, we'll run a simple multinomial logit estimate on this.

nlb<-mlogit(chosen~cost+tvtt|hhinc,mc)

For such a short piece of code, there is a lot going on here.  The formula is (simply) chosen=cost+tvtt+hhinc, BUT hhinc is alternative specific and cost and travel time are not.  So the utilities for this would be something like:

$$U_{da}=\beta_{cost}*cost+\beta_{tt}*tvtt$$

$$U_{sr2}=\beta_{cost}*cost+\beta_{tt}*tvtt+\beta_{inc,sr2}*hhinc+K_{sr2}$$

$$U_{sr3}=\beta_{cost}*cost+\beta_{tt}*tvtt+\beta_{inc,sr3}*hhinc+K_{sr3}$$

$$U_{transit}=\beta_{cost}*cost+\beta_{tt}*tvtt+\beta_{inc,transit}*hhinc+K_{transit}$$

$$U_{walk}=\beta_{cost}*cost+\beta_{tt}*tvtt+\beta_{inc,walk}*hhinc+K_{walk}$$

$$U_{bike}=\beta_{cost}*cost+\beta_{tt}*tvtt+\beta_{inc,bike}*hhinc+K_{bike}$$

 

The result is this:

>summary(nlb)

Call:
mlogit(formula = chosen ~ cost + tvtt | hhinc, data = mc, method = "nr",
print.level = 0)

Frequencies of alternatives:
1 2 3 4 5 6
0.7232054 0.1028037 0.0320143 0.0990257 0.0099423 0.0330086

nr method
6 iterations, 0h:0m:6s
g'(-H)^-1g = 5.25E-05
successive function values within tolerance limits

Coefficients :
Estimate Std. Error t-value Pr(>|t|)
2:(intercept) -2.17804077 0.10463797 -20.8150 < 2.2e-16 ***
3:(intercept) -3.72512379 0.17769193 -20.9639 < 2.2e-16 ***
4:(intercept) -0.67094862 0.13259058 -5.0603 4.186e-07 ***
5:(intercept) -2.37634141 0.30450385 -7.8040 5.995e-15 ***
6:(intercept) -0.20681660 0.19410013 -1.0655 0.286643
cost -0.00492042 0.00023890 -20.5965 < 2.2e-16 ***
tvtt -0.05134065 0.00309940 -16.5647 < 2.2e-16 ***
2:hhinc -0.00216998 0.00155329 -1.3970 0.162406
3:hhinc 0.00035756 0.00253773 0.1409 0.887952
4:hhinc -0.00528636 0.00182881 -2.8906 0.003845 **
5:hhinc -0.01280827 0.00532413 -2.4057 0.016141 *
6:hhinc -0.00968627 0.00303306 -3.1936 0.001405 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -3626.2
McFadden R^2: 0.25344
Likelihood ratio test : chisq = 2462 (p.value = < 2.22e-16)

And this matches the self-instructing course manual, page 76 (under "Base Model").

Nested Logit

R can do simple nested logit calculations, but unfortunately they have to be *very* simple (which is uncharacteristic for R).  The best thing to do is get a copy of Biogeme and read the next post in this series.

Linear and Nonlinear Models in R

June 7th, 2013

This post will talk about building linear and non-linear models of trip rates in R.  If you haven’t read the first part of this series, please do so, partly because this builds on it.

Simple Linear Models

Simple linear models are, well, simple in R.  An example of a fairly easy linear model with two factors is:

#hbsh is a data frame of home-based-shopping trips (N) per household, built from the inTab data
hbsh.lm.W_H<-lm(N~Workers4+HHSize6,data=hbsh)


This creates a simple linear home-based-shopping trip generation model based on workers and household size.  Once the estimation completes (it should take less than a second), the summary should show the following data:




> summary(hbsh.lm.W_H)

Call:
lm(formula = N ~ Workers4 + HHSize6, data = hbsh)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2434 -1.1896 -0.2749  0.7251 11.2946 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.79064    0.10409  17.203  < 2e-16 ***
Workers4    -0.02690    0.05848  -0.460    0.646    
HHSize6      0.24213    0.04365   5.547 3.58e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.649 on 1196 degrees of freedom
Multiple R-squared: 0.03228,    Adjusted R-squared: 0.03066 
F-statistic: 19.95 on 2 and 1196 DF,  p-value: 3.008e-09



What all this means is:




Trips = -0.0269*workers+0.24213*HHSize+1.79064




The important things to note are that the intercept is very significant (that's bad) and the R2 is 0.03066 (that's horrible).  There's more here, but those are details.






Non-Linear Least Squares








When doing a non-linear model, the nls function is the way to go.  The code below runs a non-linear least-squares model estimation on a trips data frame built earlier (the call is long and wraps to a second line).



#trips: a data frame of household trip totals with T.HBO, HHSize6, and AreaType fields (built earlier)
trips.hbo.nls.at3p<-nls(T.HBO~a*log(HHSize6+b),data=subset(trips,AreaType>=3),
    start=c(a=1,b=1),trace=TRUE)



The nls call does the actual non-linear least-squares estimation.  The input formula is T.HBO = a * log(HHSize6 + b).  In this type of model, starting values for a and b have to be given to the model.




The summary of this model is a little different:




> summary( trips.hbo.nls.at3p)

Formula: T.HBO ~ a * log(HHSize6 + b)

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a   1.8672     0.1692  11.034  < 2e-16 ***
b   1.2366     0.2905   4.257 2.58e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.095 on 402 degrees of freedom

Number of iterations to convergence: 4 
Achieved convergence tolerance: 1.476e-07 




nls doesn't report an R2 because it can't directly.  However, we can compute one ourselves because we know the actual values and the model's predicted values.  So, one thing that can be done is a plot:




> plot(c(0,10),c(0,10),type='l',xlab='Observed Trips',ylab='Predicted Trips')
> points(subset(trips,AreaType>=3)$T.HBO,fitted(trips.hbo.nls.at3p),col='red')




The resulting graph looks like this.  Not particularly good, but there is also no scale as to the frequency along the 45° line.



R2 is still a good measure here.  There's probably an easier way to do this, but this way is pretty simple.  




testTable<-data.frame(cbind(subset(trips,AreaType>=3)$T.HBO,fitted(trips.hbo.nls.at3p)))
cor(testTable$X1,testTable$X2)




Since I didn't correct the column names when I created the data frame, R used X1 and X2, as evidenced by checking the summary of testTable:




> summary(testTable)
       X1               X2       
 Min.   : 0.000   Min.   :1.503  
 1st Qu.: 1.000   1st Qu.:1.503  
 Median : 2.000   Median :2.193  
 Mean   : 2.072   Mean   :2.070  
 3rd Qu.: 3.000   3rd Qu.:2.193  
 Max.   :23.000   Max.   :3.696  




So the value is pretty bad (and note that cor() returns the correlation r, not R2; squaring it gives an R2 of about 0.076)...




> cor(testTable$X1,testTable$X2)
[1] 0.2755101




It's better than some of the others; after all, this is semi-random human behavior.



That's it for now.  My next post will be... MORE R!



Also, I have a quick shout-out to Jeremy Raw at FHWA for help via email related to this.  He helped me through some issues via email, and parts of his email helped parts of this post.

				

Getting Started in R

May 24th, 2013

Setting Up

Download R from http://www.r-project.org. Install it normally (on Windows)… Double-click, next, next, next, etc.

Create a project folder with your data and with a shortcut to R (shout-out to Brian Gregor at Oregon DOT for this little trick). Also copy/move the data CSV there.

Inputting and Looking at Data

The data is in CSV, which R reads out of the box, so we'll just load the data. I'm not a fan of typing in long filepaths, so I use the file.choose() function to browse for the data. Note that in many cases the foreign library is needed for other formats (like DBF), but not for CSV.

inTab<-read.csv(file.choose())
summary(inTab)

In the code above, we've loaded the CSV into the inTab data frame (a data object in R) and got a summary of it. There are a few tricks to see parts of the data.

inTab$HHID (only the HHID values)
inTab[1:2] (only the first two fields)
inTab[1:10,] (only the first 10 rows)
inTab[1:10,1] (only the first field of the first 10 rows)

Data can be charted in R as well. A simple histogram is very simple to do in R.

hist(inTab$HHSize)

Sometimes data needs to be summarized. There is a function to do that, but first you'll probably have to install a package. To do that, go to Packages – Install Packages, select your closest mirror, and find plyr in the list and install it.

Once plyr is installed (it shouldn't take long), you can load the package and use ddply to summarize data.

library(plyr)
inTab.Per<-ddply(inTab,.(HHID,HHSize6,Workers4,HHVEH4,INCOME,WealthClass),
AreaType=min(HomeAT,3),summarise,T.HBSH=min(sum(TP_Text=='HBSh'),6),
T.HBSC=sum(TP_Text=='HBS'),T.HBSR=sum(TP_Text=='HBSoc'),T.HBO=sum(TP_Text=='HBO'))

Where inTab is the input table, .(HHID,HHSize6,Workers4,HHVEH4,INCOME,WealthClass) are the input fields to summarize by, AreaType=min(HomeAT,3) is a calculated area type field, and everything following ‘summarise’ are the summaries.

Conclusion

This is a crash course in R, and in the last steps, you basically computed average trip rates.  Next week’s post will be to run linear and non-linear models on this data.

A Self Instructing Course in Mode Choice Modeling

May 20th, 2013

One way to ensure you understand how your software of choice works is to compare it to known outcomes. For example, while learning Biogeme, I converted and ran some of the scenarios in A Self Instructing Course in Mode Choice Modeling in Biogeme and found interesting issues where certain values coming out of Biogeme were the reciprocal of those in the manual. Neither is wrong, but when applying the data to a model, you have to know these things.

I’ve decided to do the same thing in R, and I had a lot of problems getting the CD. I luckily found one on my hard drive. It is here.

For the sake of making life easier on anyone that gets here looking for the manual, it’s here.

New Series on R in Transportation Modeling [Updated 10 October 2013]

May 17th, 2013

I’ve been doing a lot of statistical stuff over the past several weeks, and I think it is worth some value to the Interwebs if I try and post some of it.  I’m considering making it a course of some sort with some scrubbed HHTS data (no, I can’t post real peoples’ locations and names, I think I might get in a little bit of trouble for that).

The “syllabus” is roughly something like this (last update: 10 October 2013):

  1. Intro to R: getting data in, making summaries
  2. Trip rates – Linear and Non-linear modeling 6/7/13
  3. Mode Choice Estimation in R 6/14/13
  4. Trip rates – Averages 9/13/13
  5. Complex Mode Choice Estimation in Biogeme <-Coming in two weeks or less!
  6. Distribution Friction Factors
  7. Distribution K Factors
  8. Outputs and Graphics

I can’t guarantee that these will be the next eight weeks’ worth of posts – there will probably be some weeks with a different post, since I don’t know if I can get all this stuff done on schedule, even with the head start I have.

In other news…

I’ve been disabling comments on these short posts that really don’t warrant any sort of responses.  I’ve been getting a lot of spam on this blog, so I’m cutting it down where I can.  This is not a global thing, I’ve been turning them off on certain posts, but the default is on.