In any iterative process, it is nice to see progress.

Recently, it was gravity model distribution (again), so I had a process in R to calibrate friction factors, and part of it exported the trip length frequency chart:

From time to time, it makes sense to not re-invent the wheel by building something in R that is already available in another program, like Cube Voyager.

Heatmaps are an easy way to look at data when you have a lot. The heatmap below is from a 30 day traffic count. You can easily and quickly see that peaks in the morning are northbound, peaks in the afternoon are southbound, and peaks on the weekend are in the midday.

This method could be used with shorter-term data with fewer steps. I used this 30-day count because it tells more of a story than a two or three day count.

In travel modeling, we use matrices to do things like zone-to-zone trip flows. The matrix is like a table, but with an equal number of rows and columns, each representing a location in the model (a traffic analysis zone, like a Census Block Group). In the OKI region, we have 2,299 zones, which means there are 5,285,401 cells. Too many to look at, and we don't have reliable data at that level. However, we do have semi-reliable data at the county level.

The Open Matrix Format (OMX) is used to get these matrix files out of a proprietary format and into something that can be read by many programs. We use this at OKI to move data out of Cube (a proprietary software product) and into R (an open source statistical programming package).

Summarizing a matrix to a county flow table in R is actually pretty easy:

This doesn't take very long, either. Anyone familiar with R can see where the code can be revised to summarize to districts.

This is what the data looks like. Note that this is not verified data (please do not use it!).

Note: the reason Hamilton, Campbell, and Dearborn county names are cut off is related to a bug in Cube. They (Citilabs) are aware of the bug, but it has not been fixed yet.

I had a need to get the correlation of count to assignment in Cube Voyager. I don't know how to do this off the top of my head, and I'm instantly mistrusting of doing things I'd normally trust to R or Excel. So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation. I didn't make it that far down on the page and used the first, which generally made Voyager Code like this:

I left the print statements in, because the output is important.

Now Standard Deviation is pretty easy to compute. So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

WTF? Am I right or wrong???

So I started looking into it and recalled something about n vs. n-1 in the RMSE equation and discussion in the latest Model Validation and Reasonableness Checking Manual. So I decided to manually code the standard deviation in Excel and use sqrt(sum(x-xavg)^2/n-1) instead of Excel's function:

If you don't follow me on twitter or the Open Model Data site, you may have missed that Cube 6.4 makes some DLL changes that rendered the prior version of the Cube2OMX converter unusable. I jumped in (partly because I installed Cube 6.4) and fixed the problem. You can get the source or a binary on Github.

I did this because sending matrices to DBF files means that you'll have a 500 MB DBF for a matrix that's 3200+ zones. Normal R routines chug on it. On the contrary, OMX files are around 10% of that size (60 MB) and R can read them quite quickly - in a matter of seconds instead of minutes.

So the first thing I wanted to do in R with OMX files is Trip Length Frequency Plots. This isn't incredibly quick, but it works. The code is below. According to one run in R, it takes around 6 minutes to run (on 3,312 zones). The largest part is the loop in the function, and this could probably be parallelized (is that a word?) using doParallel.

Last week on another blog, I showed a way to connect a temperature and humidity sensor to a Beaglebone Black and read it using some Python code that ultimately started with Adafruit.

So to be able to play (a little) AND after complaining about my office temperature last week, I decided to plug this thing in at work and set it up with Phant, the IOT data service from Sparkfun. Then I wrote a quick R script to get the JSON file and plot the temperature.

Plot of Temperature

The code is below:

It's pretty simple, although the plot could use me spending a bit more time on it... and perhaps limiting the data to the last hour or day or something. All of that can be done within R pretty easily. Also, I did make a file available for anyone that wishes to play (the file is NOT 'live', the Phant server is on a different network).

After waiting forever enough to get things to run, I stepped into a better way to do lookups.

mapply on matrix objects.

Basically, I do this:

TSkimLBWPk<-read.dbf("data/TSPKLBW.DBF") #Read the local bus walk skim

TSKimLBWPK_IWAIT=(acast(TSkimLBWPk,I~J,value.var="V1",drop=FALSE,fill=0)) #build a matrix from the data

TSKimLBWPK.IWAIT<-function(i,j) {
if(i<=nrow(TSKimLBWPK_IWAIT) && j<=ncol(TSKimLBWPK_IWAIT))
return(TSKimLBWPK_IWAIT[i,j])
else return(0)
} #build a function to lookup, returning 0 if it is looking for a zone not found

TripsAllPk$LBW.IWAIT=mapply(TSKimLBWPK.IWAIT,TripsAllPk$PTAZ,TripsAllPk$ATAZ) #do the lookup

That's it. This takes the input DBF (which has I, J, V1, V2, etc. fields), converts to a matrix for a quick lookup, and then applies it.

I recently wrote a script that takes DBF exports of Cube matrices and prepares them for Biogeme. The main... well, only reason I did this in R was because I was considering using mlogit for model estimation. I ultimately decided to 'go with what I know' and changed course to use Biogeme. Mind you, the part of Stairway to Heaven applies: "There are two paths you can go by, but in the long run / There's still time to change the road you're on."

The Wrong Way

I've changed my code already, so pardon that this is from memory. Also, these are snippets - I have a lot more code than this.

HSkimPk<-read.dbf("data/HSKIM_PK3.dbf")
for(rn in 1:nrow(TripsAll)){
HSkimPkRow<-subset(HSkimPk,I==TripsAll[rn,"PTAZ"] & J==TripsAll[rn,"ATAZ")
TripsAll$DA.IVT<-HSkimPkRow[,"V1"]
...
}

This took no less than 17 hours to complete for around 23,000 trip records and for values from 5 different tables * 2 time periods.

The Right Way

I (obviously) wanted something that wouldn't take forever, especially as I was working in Biogeme and seeing things that made me think that I wanted to change ONE LITTLE THING. This seems to always happen.

I took a different approach that by my calculations should be much quicker.

Since this is currently running, my only metrics are to look at the time per 50 rows (in my real code, I have a line that outputs a timestamp every 50 rows), and it is taking about 0.27 seconds per record, compared to somewhere around 4.5 seconds per record. While not perfect, I'll take an estimated completion of 1.75 hours compared to 17 (update: 2 hours). However, I will say that Cube is faster in this regard and that I may not have the fastest R solution.

I found a problem with the sampler I was using last week. It is an uncontrolled random sampler, which presents some problems when doing choice modeling.

To deal with the issue, I created a better one. This version is a stratified random sampler that samples from each income and auto group (which I need to have samples in all income/auto groups in order to develop and test my auto ownership model). The sampling process works like this:

If there are more than 10 samples for an income/auto group, sample using last week's random method

If there is at least 2 but not more than 10 samples, pick a random sample to be tested, put the rest into the develop group

If there is 1 sample, put it in both

If there are no samples in the group, print an error to the screen

It's always good to test models. In the best case, we'd test models against a different dataset than what we used to develop data. In not-the-best-but-still-not-that-bad of cases, we'd use 80% of a dataset to estimate a model, and 20% of it to test against.

In a lot of cases, someone uses a Gibbs Sampler to get that 80%. I didn't feel like over-complicating things, so I decided to do a simple random sampler and added in some checks to check that the sample is good. The following worked well for me.

So to show that it worked. Click on the pictures to see them large enough to be legible. The pctHH is the input percentage, the PctHHS is the sample for model estimation, and PctHHT is the sample for model testing.

These will ultimately be part of something larger, but so far, so good.

Anyone familiar with transportation modeling is familiar with processes that iterate through data. Gravity models iterate, feedback loops iterate, assignment processes iterate (well, normally), model estimation processes iterate, gravity model calibration steps, shadow cost loops iterate... the list goes on.

Sometimes it's good to see what is going on during those iterations, especially with calibration. For example, in calibrating friction factors in a gravity model, I've frequently run around 10 iterations. However, as an experiment I set the iterations on a step to 100 and looked at the result:

This is the mean absolute error in percentage of observed trips to modeled trips in distribution. Note the oscillation that starts around iteration 25 - this was not useful nor detected. Note also that the best point was very early in the iteration process - at iteration 8.

After telling these files to save after each iteration (an easy process), I faced the issue of trying to quickly read 99 files and get some summary statistics. Writing that in R was not only the path of least resistance, it was so fast to run that it was probably the fastest solution. The code I used is below, with comments:

So I decided I'm going to post biweekly to this blog for the next 18 weeks and talk about what I've learned. Responses are welcome in the comments or via email at andrew .- -.-. siliconcreek .-.-.- net (related comments may be posted to this blog).

The schedule is something like this, based on the chapters of the books:

Statistical Learning - April 18

Linear Regression - May 2

Classification - May 16

Resampling Methods - May 30

Linear Model Selection and Regularization - June 13 (Friday the 13th???)

Moving Beyond Linearity - July 4 (well, this is when it will post to the blog)

Tree-Based Methods - July 18

Support Vector Machines - August 1

Unsupervised Learning - August 15

So this will be not-too-intense, and with my current workload being spent a lot on waiting for models to run (I'm waiting on one right now, which is partly why I read the introduction), I should be able to spend some time on it.

In addition to the exercises in the book, I intend to post a link to a sanitized version of the Greater Cincinnati Household Travel Survey. This sanitized version will have a number of changes made to it to protect the privacy of the survey participants (for example, we will not include the names, phone numbers, addresses, or GPS coordinates).

A few free statistical text books have been posted to the interwebs courtesy of a few universities. Head over to Hyndsight (Rob Hyndman's blog) for the links. One of the books has applications to R.

I've downloaded the first two (Elements of Statistical Learning and Introduction to Statistical Learning with applications in R) and sent them to my Nexus 7 Kindle App for later reading.

While the use of K factors can be very questionable when applied inappropriately, they have been a practice in gravity models for a very long time. For some regions where psychological boundaries (e.g. state lines, river crossings, etc.) cause an effect on travel, K factors have been used to correct problems.

I decided to take a closer look on the effects of K factors on a small model in R. I fixed the friction factors to 1 to eliminate the effects of the friction factors an just show the effects of K factors.

Using a single constraint gravity model, the effects are quite pronounced:

This is the base - all K factors are set to 1

Scenario 2 - K factors for 1-5 and 5-1 are set to 2.

Scenario 3 - the K factors for 1-5 and 5-1 are set to 0.5.

Looking at the three, the two things that stand out is that a K of 2 or 0.5 does not mean that twice or half as many trips will be forecasted to those zones. Also, since this is a single-constrained model, the production totals are all the same, but the attraction totals vary. The code to run this is on a Github Gist.

This is just a quick example. It would change with a doubly-constrained model (which I haven't attempted yet). The details of that may be posted soon.

Calibrating a gravity model for the first time is difficult. I stumbled upon a webpage from a professor at Ohio State that really helps. One thing I like to do is actually do the examples because I can ensure that my code works and it gives me some ideas.

The code to replicate Dr. Viton's work is below.

Obviously this runs really quick, since it is only three zones. It is nice that R has some matrix tricks that help make it easy. One thing to note is that running this for a large matrix in R takes forever (at least the way this is written). It is possible to run it parallel across all processors, but it still takes forever.

A natural extension of my last blog post about using R Markdown is to take it a step further and make maps with it. RMarkdown is a great tool to not only create the map, but to automate creating the map AND putting the map in a document.

This is a work-in-progress. This is a day late (I've been trying to keep to a Tuesday posting schedule to this blog, and well, it's Wednesday afternoon when I'm finally typing this). I'm still not happily done, but I'm also ANGRY at the the output of the distribution model (now that I've changed a lot of stuff in the model), so I'm having to take a step back and redo friction factors, so rather than post a finished product, I decided to post the work in progress, which is probably as helpful as the finished product.

If you haven't read last week's post on using RMarkdown, do so now. I'll wait. Back? Good.

The code:

The first part, above "Manually create the lines" is basic library-loading and data input (and some manipulation/subsetting)

The second part is creating the desire lines. These are created by first creating a list with two coordinate pairs in it (l1 - l7). Those objects are then created into individual lines (Sl1-Sl7). The individual lines are packaged into one Lines obeject (basically, a list of lines) (DL). Finally, that object is prepared for the map (deslines<-list...).

The third part is the text. There are two functions here, one is to get the mid-point of the line, the other is to get the angle of the line. There was a lot of trial and error that went into this. In the lines after the functions, the txt1-txt7 and mpt1-mpt7 objects, the text is formatted and map objects are created for them.

The fourth part is the counties for the map. The col=... and cpl<-... lines handle the colors for the counties.

The last part is drawing the map. The spplot() function handles all of that. The primary map is made out of the counties, and the lines and text is added in the sp.layout=... portion of the command.

That's it! It really isn't difficult as long as you remember trigonometry (and of course, you don't even have to do that since it is in my code above. I also included some references at the bottom of the most useful resources when I was doing this, as there are many, many, MANY more ways to do this, options to play with, etc.

I've had my share of complaining about the various "reporting" platforms available to those of us that do travel modeling. I've looked at a few options, and nothing has stuck as "the one". Until now.

In the meantime, I've noticed that a lot of groups have adopted Markdown. It's found it's way onto Github via Jekyll. Jeckyll's found it's way into my life as a quick blogging and site-building solution. Then, I stumbled upon RStudio RMarkdown. This is becoming a goldmine because RStudio is a great platform for developing things in R (including presentations and R Markdown). Even better, the RMarkdown documents can be run via R (in a batch wrapper). The only missing link is the ability to read matrix files directly. I guess we can't have everything, but I have a solution for that, too.

What Is This Markdown Thing And Why Should I Care?

Markdown is a pretty easy thing to grasp. It's open and fairly flexible. It's a text markup format that is easy to read when not rendered. Easy to read means easy to write. The open-ness means that you can do things with it. In the case of RMarkdown, you can add R code blocks and LaTeX equations. I will admit that LaTeX equations are not legible until rendered, but when you start adding R in the equation, the focus shifts less on reading the unrendered RMarkdown and more on reading the rendered output.

The link to Github (above) goes to their Markdown cheat sheet. That alternates between Markdown and HTML output and it's pretty easy to see how things work.

Getting Model Run Results into RMarkdown and into Rendered Output

There's a number of things that need to happen to get model run results into R/RMarkdown and then to Output:

Output data to a format R understands

Write RMarkdown document

Write RScript to render RMarkdown to HTML

Write Windows Batch File to automate the RScript

Output data to a format R understands

In the case of zonal data, R can understand CSV out of the box, and with the appropriate library, can understand DBF. With matrix files, Voyager will export them to DBF with a simple script file:

This script simply reads a bunch of matrix files and outputs them to two DBF files, one for the peak-period distribution and one for the off-peak-period distribution.

One important thing to note in this is that I didn't put paths in this. I run this from the command line in the model folder and it picks up the files in that folder and outputs the DBFs into that folder. This is something that would have to be testing when placed into a model run file.

Resist the urge to do this in two separate steps. The join process in R takes forever, and reading the data into memory may take a while, too.

Write RMarkdown document

The RMarkdown document is where the magic happens. Fortunately, Knitr (the R Package that does all this magic) does a few things to split code blocks. If you want to build my file, add all these together into one file and name it something.rmd

There are three code blocks that do this. They are importing, summarizing, and graphing.

Importing Data

This block does three things:

Loads libraries. The foreign library is used to read DBF files. The plyr library is used to join and summarize the data frames (from the DBF inputs). The ggplot2 library is used for plots.

Sets a few variables. Since the OKI model is actually two MPO models, we do three reports of everything - one for the entire model, one for the OKI (Cincinnati) region, and one for the MVRPC (Dayton) region. zones_oki and zones_mv are used to control which report is which.

Imports the DBF files. Those read.dbf lines are where that magic happens. Again, since this is run in the model folder, no paths are used.

Summarizing Data

This block does three things:

It rounds the logsum values to provide some grouping to the values

It gets a subset of the model (for OKI)

It summarizes the rounded values to prepare for charting them

Charting Data

This block does one thing: it draws the chart using the ggplot tool. This is pretty advanced (although not bad, and the online help is good). However, for this I'm going to hold to my recommendation to get a copy of The R Graphics Cookbook (where I learned how to use ggplot). The concepts and examples in the book are far greater than what I will post here.

One point that should not be lost is that text elements (either Markdown headings, etc., or just text, or formatted text) can be added into this outside of the ```...``` blocks. This way, reports can actually look good!

Once this part is complete, the hardest stuff is over.

Write RScript to render RMarkdown to HTML

The RScript to render the RMarkdown file to HTML is pretty simple:

This writes the .Rmd file out to the same filename as .html. You can have as many knit2html lines as needed

There are ways to write the files out to PDF (I haven't looked into them... perhaps that would be a future blog topic?).

Write Windows Batch File to automate the RScript

The last step is to write a batch file "wrapper" that can be run as part of the travel demand model run. This is really quite easy:

The first line sets the path to include R (on my system it isn't in the path, and my path statement is already 9 miles long). The second line runs the R script file (ReportR.R) in R.

That's It! It seems like a lot of work goes into this, but it isn't as difficult as some of the other reporting platforms out there.

This is second in the R in Transportation Modeling Series of posts.

I've been going between R, R Graphics Cookbook, NCHRP Report 716, and several other tasks, and finally got a chance to get back on actually performing the trip generation rates in the model.

The Process

NCHRP Report 716 lays out a process that looks something like this when you make a graphic out of it:

So, the first part is to summarize the trips by persons, workers, income, and vehicles. I'm going to skip income (I don't care for using that as a variable).

This can be done pretty easily in R with a little bit of subsetting and ddply summaries that I've written about before. I split this into three groups based on area type, area type 1 is rural, 2 is suburban, and I combined 3 and 4 - urban and CBD.

This makes three tables that looks like this:
>hh.1.sumper
HHSize6 T.HBW T.HH TR
1 1 9 54 0.1666667
2 2 77 98 0.7857143
3 3 24 38 0.6315789
4 4 36 40 0.9000000
5 5 18 10 1.8000000
6 6 4 6 0.6666667

Once all of this is performed (and it isn't much, as the code is very similar among three of the four lines above), you can analyze the variance:
#Perform analysis of variance, or ANOVA
> hh.1.perfit<-aov(TR~HHSize6,data=hh.1.sumper)
> hh.1.wrkfit<-aov(TR~Workers4,data=hh.1.sumwrk)
> hh.1.autfit<-aov(TR~HHVeh4,data=hh.1.sumaut)
#Print summaries
>summary(hh.1.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.4824 0.4824 1.987 0.231
Residuals 4 0.9712 0.2428

> summary(hh.1.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 0.1113 0.1113 0.184 0.697
Residuals 3 1.8146 0.6049

> summary(hh.1.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 0.0994 0.09938 0.536 0.54
Residuals 2 0.3705 0.18526

The items above indicate that none of the three items above (persons per household, workers per household, or autos per household) are very significant predictors of home based work trips per household. Admittedly, I was a bit concerned here, but I pressed on to do the same for suburban and urban/CBD households and got something a little less concerning.

Suburban households
> summary(hh.2.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 0.6666 0.6666 23.05 0.0172 *
Residuals 3 0.0868 0.0289
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 1.8951 1.8951 11.54 0.0425 *
Residuals 3 0.4926 0.1642
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.6530 0.6530 10.31 0.0326 *
Residuals 4 0.2534 0.0634
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Urban and CBD Households
> summary(hh.34.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 1.8904 1.8904 32.8 0.0106 *
Residuals 3 0.1729 0.0576
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 5.518 5.518 680 0.000124 ***
Residuals 3 0.024 0.008
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.7271 0.7271 9.644 0.036 *
Residuals 4 0.3016 0.0754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Another Way

Another way to do this is to do the ANOVA without summarizing the data. The results may not be the same or even support the same conclusion.

Rural
> summary(hh.1a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 15.3 15.310 10.61 0.00128 **
Residuals 244 352.0 1.442
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.1a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 60.46 60.46 48.08 3.64e-11 ***
Residuals 244 306.81 1.26
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.1a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 4.6 4.623 3.111 0.079 .
Residuals 244 362.6 1.486
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Suburban
> hh.2a.perfit<-aov(HBW~HHSize6,data=hh.2)
> hh.2a.wrkfit<-aov(HBW~Workers4,data=hh.2)
> hh.2a.autfit<-aov(HBW~HHVeh4,data=hh.2)
> summary(hh.2a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 136.1 136.05 101.9 <2e-16 ***
Residuals 1160 1548.1 1.33
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 376.8 376.8 334.4 <2e-16 ***
Residuals 1160 1307.3 1.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 103.2 103.20 75.72 <2e-16 ***
Residuals 1160 1580.9 1.36
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Urban/CBD
> hh.34a.perfit<-aov(HBW~HHSize6,data=hh.34)
> hh.34a.wrkfit<-aov(HBW~Workers4,data=hh.34)
> hh.34a.autfit<-aov(HBW~HHVeh4,data=hh.34)
> summary(hh.34a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 77.1 77.07 64.38 4.93e-15 ***
Residuals 639 765.0 1.20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 222.0 221.96 228.7 <2e-16 ***
Residuals 639 620.1 0.97
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 91.1 91.12 77.53 <2e-16 ***
Residuals 639 751.0 1.18
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

What this illustrates is that there is a difference between averages and raw numbers.

The next part of this will be to test a few different models to determine the actual trip rates to use, which will be the subject of the next blog post.

I started this post (and the work to go with it) as a companion to A Self Instructing Course in Mode Choice Modeling by Bhat and Koppelman. That's because I could reproduce the work in the book in R and can (now) reproduce in R.

To continue with this, please get the CD files from my last blog post. You'll specifically need "SF MTC Work MC Data.sav", which is in SPSS format.

The first part:

library(foreign)
library(mlogit)

The items above simply load the libraries. If any of these are not found, go to Packages (on the menu bar) - Install Packages... and select your closest mirror and select the missing package (either foreign or mlogit).

Next, read in the data and we'll add a field, too, as there is no unique id in this dataset.

The first parts of this are pretty obvious (inTab is the input table, choice="chosen" is the choice field). Shape="long" indicates that the data is multiple records per case. "Wide" would indicate each record is on its own line. chid.var is the case id variable. alt.var is the alternatives. drop.index drops the index field out of the resulting table.

Finally, we'll run a simple multinomial logit estimate on this.

nlb<-mlogit(chosen~cost+tvtt|hhinc,mc)

For such a short piece of code, there is a lot going on here. The formula is (simply) chosen=cost+tvtt+hhinc, BUT hhinc is alternative specific and cost and travel time are not. So the utilities for this would be something like:

And this matches the self-instructing course manual, page 76 (under "Base Model").

Nested Logit

R can do simple nested logit calculations, but unfortunately they have to be *very* simple (which is uncharacteristic for R). The best thing to do is get a copy of Biogeme and read the next post in this series.

This post will talk about building linear and non-linear models of trip rates in R. If you haven't read the first part of this series, please do so, partly because this builds on it.

Simple Linear Models

Simple linear models are, well, simple in R. An example of a fairly easy linear model with two factors is:

This creates a simple linear home-based-shopping trip generation model based on workers and household size. Once the estimation completes (it should take less than a second), the summary should show the following data:

> summary(hbsh.lm.W_H)
Call:
lm(formula = N ~ Workers4 + HHSize6, data = hbsh)
Residuals:
Min 1Q Median 3Q Max
-2.2434 -1.1896 -0.2749 0.7251 11.2946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.79064 0.10409 17.203 < 2e-16 ***
Workers4 -0.02690 0.05848 -0.460 0.646
HHSize6 0.24213 0.04365 5.547 3.58e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.649 on 1196 degrees of freedom
Multiple R-squared: 0.03228, Adjusted R-squared: 0.03066
F-statistic: 19.95 on 2 and 1196 DF, p-value: 3.008e-09

What all this means is:

Trips = -0.0269*workers+0.24213*HHSize+1.79064

The important things to note on this is that the intercept is very significant (that's bad) and the R2 is 0.03066 (that's horrible). There's more here, but it's more details.

Non-Linear Least Squares

When doing a non-linear model, the nls function is the way to go. The two lines below create a trips data frame, and then run a non-linear least-squares model estimation on it (note that the first line is long and wraps to the second line).

The second line does the actual non-linear least-squares estimation. The input formula is T=a*e^(HHSize+b). In this type of model, starting values for a and b have to be given to the model.

The summary of this model is a little different:

> summary( trips.hbo.nls.at3p)
Formula: T.HBO ~ a * log(HHSize6 + b)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.8672 0.1692 11.034 < 2e-16 ***
b 1.2366 0.2905 4.257 2.58e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.095 on 402 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 1.476e-07

It doesn't perform R2 on this because it can't directly. However, we can because we know the actual values and the model predicts the values. So, one thing that can be done is a plot:

The resulting graph looks like this. Not particularly good, but there is also no scale as to the frequency along the 45° line.
R2 is still a good measure here. There's probably an easier way to do this, but this way is pretty simple.

Since I didn't correct the column names when I created the data frame, R used X1 and X2, as evidenced by checking the summary of testTable:

> summary(testTable)
X1 X2
Min. : 0.000 Min. :1.503
1st Qu.: 1.000 1st Qu.:1.503
Median : 2.000 Median :2.193
Mean : 2.072 Mean :2.070
3rd Qu.: 3.000 3rd Qu.:2.193
Max. :23.000 Max. :3.696

So the R2 value is pretty bad...

> cor(testTable$X1,testTable$X2)
[1] 0.2755101

It's better than some of the others, after all, this is semirandom human behavior.
That's it for now. My next post will be... MORE R!
Also, I have a quick shout-out to Jeremy Raw at FHWA for help via email related to this. He helped me through some issues via email, and parts of his email helped parts of this post.

The data is in CSV, so we need to load the foreign library, and then we'll load the data. I'm not a fan of typing in long filepaths, so I use the file.choose() function to browse for the data. Note that in many cases the

inTab<-read.csv(file.choose())
summary(inTab)

In the code above, we've loaded the dbf into the inTab data frame (a data object in R) and got a summary of it. There's a few tricks to see parts of the data.

inTab$HHID (only the HHID values)
inTab[1:2] (only the first two fields)
inTab[1:10,] (only the first 10 rows)
inTab[1:10,1] (only the first field of the first 10 rows)

Data can be charted in R as well. A simple histogram is very simple to do in R.

hist(inTab$HHSize)

Sometimes data needs to be summarized. There is a function to do that, but first you'll probably have to download a package. To download the module, go to Packages - Install Packages. From the list, find plyr and install it.

Once plyr is installed (it shouldn't take long), you can load the module and use ddply to summarize data.

Where inTab is the input table, .(HHID,HHSize6,HHVEH4,INCOME,WealthClass) are input fields to summarize by, AreaType=min(HomeAT,3) is a calculated field to summarize by, and everything following 'summarise' are the summaries.

Conclusion

This is a crash course in R, and in the last steps, you basically computed average trip rates. Next week's post will be to run linear and non-linear models on this data.

One thing to ensure you understand how your software of choice works is to compare it to known outcomes. For example, while learning Biogeme, I converted and ran some of the scenarios in A Self Instructing Course in Mode Choice Modeling in Biogeme and found interesting issues where certain values coming out of Biogeme were the reciprocal of those in the manual. Neither is wrong, but when applying the data to a model, you have to know these things.

I've decided to do the same thing in R, and I had a lot of problems getting the CD. I luckily found one on my hard drive. It is here.

For the sake of making life easier on anyone that gets here looking for the manual, it's here.

I've been doing a lot of statistical stuff over the past several weeks, and I think it is worth some value to the Interwebs if I try and post some of it. I'm considering making it a course of some sort with some scrubbed HHTS data (no, I can't post real peoples' locations and names, I think I might get in a little bit of trouble for that).

The "syllabus" is roughly something like this (last update: 10 October 2013):

Complex Mode Choice Estimation in Biogeme <-Coming in two weeks or less!

Distribution Friction Factors

Distribution K Factors

Outputs and Graphics

I can't guarantee that these will be the next eight weeks worth of posts - there will probably be some weeks with a different post, since I don't know if I can get all this stuff done in six weeks, even with the head start I have.

In other news...

I've been disabling comments on these short posts that really don't warrant any sort of responses. I've been getting a lot of spam on this blog, so I'm cutting it down where I can. This is not a global thing, I've been turning them off on certain posts, but the default is on.

This is the personal blog of a transportation planner, civil engineer, computer programmer, non-professional photographer, ham radio operator, and just all around geek.