Finding Stuff in Big CSV Files

October 16th, 2017

If you have an activity based model OR big(ish) data, from time-to-time, you need to find something. One record, possibly one in half a million or one in a million. You need GNUWin tools for these if you're on Windows.

Getting the First Line

Getting the first line is pretty easy with the head command:

>head -n 1 file.csv

>head -n 1 jointParticipantResults.csv

If you want the last, record, replace 'head' with 'tail'.

Getting the Number of Rows

This is a pretty simple awk script that returns the number of rows:

>awk 'END {print NR}' jointParticipantResults.csv

Getting a Specific Record

This is a simple awk script that returns the row where the  third field is 158568.  Looking at the first script above, the third field is the hhid field:

>awk '$3 == 158568 {print $0}' FS="," jointParticipantResults.csv

Note the FS part - that tells awk that the field separator is a comma.

GnuWin32 Trick: Quickly Finding Text in File

June 1st, 2017

The downside of not managing a good library of scripts is forgetting where some code is written.  Case in point: I wrote a nice RMSE script, but I forgot where it was.

So I found it with the following command line:

grep "rmse" $ls -R ./*/*.R ./*/*.r

The first part of the script - grep "rmse" - tells the grep command to look for rmse.  The second part - $ls -R .//.R .//.r - tells what files to look through (that command is list recursive looking for *.R and *.r files).

Announcing The Ruby OMX API

October 23rd, 2015

I am happy to announce that there is now a Ruby API for OMX.  This is a read-only API that supports a few ways of reading a matrix, returning an array of J for a given I, an array of I for a given J, and returning the value at a matrix address.

More documentation is available on Github (including the all-important install instructions).

Let me know if you have any questions.  Post issues and bugs to the Github issues tracker.

The motivation behind yet-another-API, Ruby seems (operative word!!!) that it handles being a web-based API better than a lot of other languages.  I’ve built a few just to test things out - for example, I built a versioned API that responds with random quotes from Yogi Berra... And please don’t build that in to anything, I have the free Heroku plan, so that may disappear at a random time!  Aside from the time that it took to adapt my mess of Voyager+Java+C+++Python(GRRR!)+Basic syntax to Ruby, it wasn’t at all difficult and it is incredibly easy to add another API version.  I would like to have a semi-live map of skims that I can click on a zone and see colors for the selected attribute/matrix (e.g. travel time).

Standard Deviation Differences Between Excel and R (and my code in Cube Voyager)

July 24th, 2015

I had a need to get the correlation of count to assignment in Cube Voyager.  I don't know how to do this off the top of my head, and I'm instantly mistrusting of doing things I'd normally trust to R or Excel.  So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation.  I didn't make it that far down on the page and used the first, which generally made Voyager Code like this:

I left the print statements in, because the output is important.

Avg AADT_TRK = 1121.77
Avg VOLUME = 822.03
n = 230.00

sdx1 = 1588160175
sdy1 = 1196330474
n = 230.00
sd AADT_TRK = 2627.75
sd Volume = 2280.67
r2 = 155.06

Note the standard deviations above.  Ignore the R2 because it's most certainly not correct!

Again, mistrusting my own calculations, I imported the DBF into R and looked at the standard deviations:

> sd(trkIn$AADT_TRK)
[1] 2633.476
> sd(trkIn$V_1)
[1] 2285.64

Now Standard Deviation is pretty easy to compute.  So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

Screenshot 2015-07-24 11.09.18

WTF? Am I right or wrong???

So I started looking into it and recalled something about n vs. n-1 in the RMSE equation and discussion in the latest Model Validation and Reasonableness Checking Manual.  So I decided to manually code the standard deviation in Excel and use sqrt(sum(x-xavg)^2/n-1) instead of Excel's function:

Looky there!  Matching Numbers!

Looky there! Matching Numbers!

It's not that Excel is incorrect, it's not using Bessel's Correction.  R is.


Adventures in Model Validation: Why RMSE is NOT a Stand-Alone Measure

November 10th, 2014

It seems like the major component of model validation is Root Mean Square Error, or RMSE.  RMSE is basically this:


And %RMSE is:


These are useful measures to measure the error, but the problem with using them as a wholesale measure is that they ignore the DIFFERENCE of the error - this is evident in the numerator of the RMSE equation where the difference is squared. Any number squared becomes POSITIVE.

In the Model Validation and Reasonableness Checking Manual, the first item in assignment aggregate checks is VMT, as it well should be.  Consider the following scenario:

Two model runs, one with assignments 20-40% high, and the other with assignments 20-40% low, both compared to the counts.  They can have nearly the same RMSE (overall, it'll probably be around 30% FOR BOTH), but the VMT will show one ~30% high and the other ~30% low.

Lookups in R: The REAL Right Way!

September 9th, 2014

After waiting forever enough to get things to run, I stepped into a better way to do lookups.

mapply on matrix objects.

Basically, I do this:

TSkimLBWPk<-read.dbf("data/TSPKLBW.DBF") #Read the local bus walk skim

TSKimLBWPK_IWAIT=(acast(TSkimLBWPk,I~J,value.var="V1",drop=FALSE,fill=0)) #build a matrix from the data

TSKimLBWPK.IWAIT<-function(i,j) {
if(i<=nrow(TSKimLBWPK_IWAIT) && j<=ncol(TSKimLBWPK_IWAIT))
else return(0)
} #build a function to lookup, returning 0 if it is looking for a zone not found

TripsAllPk$LBW.IWAIT=mapply(TSKimLBWPK.IWAIT,TripsAllPk$PTAZ,TripsAllPk$ATAZ) #do the lookup

That's it. This takes the input DBF (which has I, J, V1, V2, etc. fields), converts to a matrix for a quick lookup, and then applies it.

It runs in about 3 minutes.

Lookups in R: The Wrong Way and the Right Way

August 28th, 2014

I recently wrote a script that takes DBF exports of Cube matrices and prepares them for Biogeme.  The main... well, only reason I did this in R was because I was considering using mlogit for model estimation.  I ultimately decided to 'go with what I know' and changed course to use Biogeme. Mind you, the part of Stairway to Heaven applies: "There are two paths you can go by, but in the long run / There's still time to change the road you're on."

The Wrong Way

I've changed my code already, so pardon that this is from memory.  Also, these are snippets - I have a lot more code than this.


for(rn in 1:nrow(TripsAll)){
HSkimPkRow<-subset(HSkimPk,I==TripsAll[rn,"PTAZ"] & J==TripsAll[rn,"ATAZ")

This took no less than 17 hours to complete for around 23,000 trip records and for values from 5 different tables * 2 time periods.

The Right Way

I (obviously) wanted something that wouldn't take forever, especially as I was working in Biogeme and seeing things that made me think that I wanted to change ONE LITTLE THING.  This seems to always happen.

I took a different approach that by my calculations should be much quicker.

HSkimPkD<-acast(HSkimPk,I ~ J,value.var="V2",drop=FALSE,fill=0)
HSkimPkT<-acast(HSkimPk,I ~ J,value.var="V1",drop=FALSE,fill=0)

for(rn in 1:nrow(TripsAll)){
if(I<=nrow(HSkimPkT) & J<=nrow(HSkimPkT)){

Since this is currently running, my only metrics are to look at the time per 50 rows (in my real code, I have a line that outputs a timestamp every 50 rows), and it is taking about 0.27 seconds per record, compared to somewhere around 4.5 seconds per record.  While not perfect, I'll take an estimated completion of 1.75 hours compared to 17 (update: 2 hours).  However, I will say that Cube is faster in this regard and that I may not have the fastest R solution.

Improved Nice, Quick, Effective Sampler in R

August 1st, 2014

I found a problem with the sampler I was using last week.  It is an uncontrolled random sampler, which presents some problems when doing choice modeling.

To deal with the issue, I created a better one.  This version is a stratified random sampler that samples from each income and auto group (which I need to have samples in all income/auto groups in order to develop and test my auto ownership model).  The sampling process works like this:

  1. If there are more than 10 samples for an income/auto group, sample using last week's random method
  2. If there is at least 2 but not more than 10 samples, pick a random sample to be tested, put the rest into the develop group
  3. If there is 1 sample, put it in both
  4. If there are no samples in the group, print an error to the screen

The code is below:


Nice, Quick, Effective Sampler in R

July 25th, 2014

It's always good to test models.  In the best case, we'd test models against a different dataset than what we used to develop data.  In not-the-best-but-still-not-that-bad of cases, we'd use 80% of a dataset to estimate a model, and 20% of it to test against.

In a lot of cases, someone uses a Gibbs Sampler to get that 80%.  I didn't feel like over-complicating things, so I decided to do a simple random sampler and added in some checks to check that the sample is good.  The following worked well for me.

So to show that it worked.  Click on the pictures to see them large enough to be legible.  The pctHH is the input percentage, the PctHHS is the sample for model estimation, and PctHHT is the sample for model testing.

Screenshot 2014-07-25 11.22.18

Screenshot 2014-07-25 11.22.31

Screenshot 2014-07-25 11.22.46

Screenshot 2014-07-25 11.22.56


These will ultimately be part of something larger, but so far, so good.


Using Variance and Standard Deviation in Trip Rate Checks

June 24th, 2014

During the peer review, one thing that we were told to look at was the variance in the cells of our cross-classification table.

Like all statistics, variance needs to be looked at in context with other statistics.  To illustrate this, I created some fake data In Excel.


This is what the data looks like. It is all random and perfect, but you can see that some is "tighter" than others.

The first thing I looked at was the variance compared to the mean.

This is the variance compared to the mean. Note that the colors are the same as on the input sheet. The green line (which had the largest random scatter component) is way high compared to the other two, which are much more centralized around the mean.

This is the variance compared to the mean. Note that the colors are the same as on the input sheet. The green line (which had the largest random scatter component) is way high compared to the other two, which are much more centralized around the mean.

I looked at the standard deviation as well, but variance looks better to me.

I looked at the standard deviation as well, but variance looks better to me.


What does this mean?  Well, lets take a look at it using a subset of NHTS 2009 data.

I looked at the average trips per household by number of workers, and the table below lists the average by worker.

Screenshot 2014-06-23 08.09.32

The var/mean (the varpct column in the table) is pretty high.  The correlation isn't bad at 0.616.  However, there are 3,143 observed trips in the table, and this method results in the same number of trips.

Next, I looked at average trips per household by workers and income.

Screenshot 2014-06-23 09.10.11

Variance/mean of trips

Average Trip Rates by HH

Average Trip Rates by HH

They're a little better looking.  The correlation is 0.623, so that's a little better, and the total trips is 3,143.  Yes, that is the exact number of trips in the table.  Looking at the sum of the difference (which in this case should be zero), it came up to -5.432x10^-14... that's quite close to zero!

Mean trips per household by workers and autos (rows are autos, columns are workers)

Mean trips per household by workers and autos (rows are autos, columns are workers)

Variance/mean of workers and autos (rows are autos, columns are workers)

Variance/mean trips per household by workers and autos (rows are autos, columns are workers)

By workers and autos was similar - the correlation was 0.619, and the total trips is 3,143.  Since it was the exact same number as the last time, I checked the difference in the same manner, and found that the total difference was 4.14x10^-14.  Different by a HUGE (not!) amount.


There are a few things to look at.

Summary of various factors.

Summary of various factors.

Ultimately, this wasn't as different as I hoped to show.  I was hoping to show some bad data vs. good data from the NHTS, but without intentionally monkeying with the data, it was hard to get something better.


New Project In The Works: GPS Data Processing

June 3rd, 2014

GPS household surveys are all the rage these days - all the cool kids have one, and those that don't have one seem to think they are somehow not as cool.  Of course, if everyone jumped off a bridge...

Anyway, I'm the lucky manager of the department that received the first GPS-only household survey.  Since that time, three other agencies (that I can think of, at least) have these and each time there has been some improvements in the quality of the processing algorithms that I am trying to take advantage of.  And everyone else gets the advantages, too, as I posted it on Github.

This currently uses sources presented at the Innovations in Travel Modeling Conference 2014 by Marcelo Oliveira at Westat.  Westat wrote NCHRP 775 (08-89), and while the document had been completed since sometime before the conference, I was unable to sweet talk someone into emailing me an advance copy (it wasn't for the lack of trying).  I did receive word that the document was to be released on Friday the 13th of June.  I really hope the date is a coincidence.

The code on Github has a fairly detailed readme file (that shows up when you scroll down from the source listing, but it basically says that it's here, it's a work in progress, and I don't include data.  My hope is that others can jump in and help and we can get a fairly nice system working.  My other hope is that we can standardize processing a little and reduce the cost of these surveys.

This is my first use of Maven.  I can't say I'm really using it now, so I have a lot to learn.

There are other side projects that may come out of this, but they will brought up in due time.


Quick Cube Voyager Trick: Arrays

December 24th, 2013

While injecting bugs into a script, I found a quick trick in Cube related to arrays.  I didn't test it to see if it works on a single-dimension array, but it DOES work on a multi-dimension array.

ARRAY Test=8,8


This sets everything in Test to 1, so if you write:

LOOP _x=1,8

LOOP _y=1,8

PRINT LIST=Test[_x][_y]



The response will be '1' 64 times.


Transportation Modeling Books to Read

December 17th, 2013

This is a two-part post.  The first part are books that I've read that I think are really important to the modeling community.  These books are important for the development

Recommended Items

A Self-Instructing Course in Mode Choice Models. Frank Koppelman and Chandra Bhat
This is an excellent resource to self-teach multinomial and nested logit modeling. It comes with many examples (a few of which I have discussed here) and talks about many of the tests and metrics that are important to good model formulation and evaluation.

Travel Model Validation and Reasonability Checking Manual. TMIP.
This is a great resource of validation checking and what to look for in regards to reasonableness checking.

Special Report 288: Metropolitan Travel Forecasting: Current Practice and Future Direction. TRB.
This is a critical look at many of the modeling techniques we hold dear to our hearts.  I've been tempted to re-read it and see if things are a little better now that it has been over 5 years since it was released.

Kenneth Train's Website (thanks to Krishnan Viswanathan).  It didn't dawn on me that this should be part of this list, but it should.  I've seen his website (and maybe even linked to it previously) while working on multinomial and nested logit modeling with R.  His website is a treasure trove of discrete choice analysis

On My To-Read List

"Hubris or humility? Accuracy issues for the next 50 years of travel demand modeling". David Hartgen. Transportation volume 40 issue 6.

Computational and Mathematical Modeling in the Social Sciences. Scott de Marchi.

Calibration of Trip Distribution Models by Generalized Linear Models. John Shrewsbury, University of Canterbury.

Megaprojects and Risk: An Anatomy of Ambition. Bent Flyvbjerg, Nils Bruzelius, Werner Rothengatter.



Am I missing any?  Add a recommendation in the comments.

Illustration of Gravity Model K Factor Effects

December 10th, 2013

While the use of K factors can be very questionable when applied inappropriately, they have been a practice in gravity models for a very long time.  For some regions where psychological boundaries (e.g. state lines, river crossings, etc.) cause an effect on travel, K factors have been used to correct problems.

I decided to take a closer look on the effects of K factors on a small model in R.  I fixed the friction factors to 1 to eliminate the effects of the friction factors an just show the effects of K factors.

Using a single constraint gravity model, the effects are quite pronounced:

This is the base - all K factors are set to 1

This is the base - all K factors are set to 1

Scenario 2 - K factors for 1-5 and 5-1 are set to 2.

Scenario 2 - K factors for 1-5 and 5-1 are set to 2.

Scenario 3 - the K factors for 1-5 and 5-1 are set to 0.5.

Scenario 3 - the K factors for 1-5 and 5-1 are set to 0.5.

Looking at the three, the two things that stand out is that a K of 2 or 0.5 does not mean that twice or half as many trips will be forecasted to those zones.  Also, since this is a single-constrained model, the production totals are all the same, but the attraction totals vary.  The code to run this is on a Github Gist.

This is just a quick example.  It would change with a doubly-constrained model (which I haven't attempted yet).  The details of that may be posted soon.

DOS Commands You Should Know: FIND

November 26th, 2013

Recently, I stumbled upon a problem in my new mode choice and distribution code - I was setting unavailable modes to -9999 to ensure that there was no chance of the model to choose an unavailable mode.  I found later that using that value was a bit extreme and I should be using something like -15 (and the difference causes wild logsum values).

After changing these values in 10 scripts, I wanted to ensure that ALL were changed so I didn't end up running them and finding that I had to wait another 15 minutes after finding an error (or worse, not immediately finding the error!).

So, I used the FIND command in DOS.

All of my distribution files begin with 25 and end with .S, so I used:

find "=-9999" 25*.S"

Missed a few in these files.  The filename is listed there so I can go to it and fix it.

Missed a few in these files. The filename is listed there so I can go to it and fix it.

Missed a bunch in this file.  This is why I checked :-)

Missed a bunch in this file. This is why I checked 🙂


Mapping with R Markdown

November 20th, 2013

A natural extension of my last blog post about using R Markdown is to take it a step further and make maps with it.  RMarkdown is a great tool to not only create the map, but to automate creating the map AND putting the map in a document.

This is a work-in-progress.  This is a day late (I've been trying to keep to a Tuesday posting schedule to this blog, and well, it's Wednesday afternoon when I'm finally typing this). I'm still not happily done, but I'm also ANGRY at the the output of the distribution model (now that I've changed a lot of stuff in the model), so I'm having to take a step back and redo friction factors, so rather than post a finished product, I decided to post the work in progress, which is probably as helpful as the finished product.

If you haven't read last week's post on using RMarkdown, do so now.  I'll wait.  Back?  Good.

The code:

The first part, above "Manually create the lines" is basic library-loading and data input (and some manipulation/subsetting)

The second part is creating the desire lines.  These are created by first creating a list with two coordinate pairs in it (l1 - l7).  Those objects are then created into individual lines (Sl1-Sl7).  The individual lines are packaged into one Lines obeject (basically, a list of lines) (DL).  Finally, that object is prepared for the map (deslines<-list...).

The third part is the text.  There are two functions here, one is to get the mid-point of the line, the other is to get the angle of the line.  There was a lot of trial and error that went into this.  In the lines after the functions, the txt1-txt7 and mpt1-mpt7 objects, the text is formatted and map objects are created for them.

The fourth part is the counties for the map.  The col=... and cpl<-... lines handle the colors for the counties.

The last part is drawing the map.  The spplot() function handles all of that.  The primary map is made out of the counties, and the lines and text is added in the sp.layout=... portion of the command.

That's it!  It really isn't difficult as long as you remember trigonometry (and of course, you don't even have to do that since it is in my code above.  I also included some references at the bottom of the most useful resources when I was doing this, as there are many, many, MANY more ways to do this, options to play with, etc.


Travel Model Reports with R and knitr

November 12th, 2013

I've had my share of complaining about the various "reporting" platforms available to those of us that do travel modeling.  I've looked at a few options, and nothing has stuck as "the one". Until now.

In the meantime, I've noticed that a lot of groups have adopted Markdown.  It's found it's way onto Github via Jekyll.  Jeckyll's found it's way into my life as a quick blogging and site-building solution.  Then, I stumbled upon RStudio RMarkdown.  This is becoming a goldmine because RStudio is a great platform for developing things in R (including presentations and R Markdown).  Even better, the RMarkdown documents can be run via R (in a batch wrapper).  The only missing link is the ability to read matrix files directly.  I guess we can't have everything, but I have a solution for that, too.

What Is This Markdown Thing And Why Should I Care?

Markdown is a pretty easy thing to grasp.  It's open and fairly flexible.  It's a text markup format that is easy to read when not rendered.  Easy to read means easy to write.  The open-ness means that you can do things with it.  In the case of RMarkdown, you can add R code blocks and LaTeX equations.  I will admit that LaTeX equations are not legible until rendered, but when you start adding R in the equation, the focus shifts less on reading the unrendered RMarkdown and more on reading the rendered output.

The link to Github (above) goes to their Markdown cheat sheet.  That alternates between Markdown and HTML output and it's pretty easy to see how things work.

Getting Model Run Results into RMarkdown and into Rendered Output

There's a number of things that need to happen to get model run results into R/RMarkdown and then to Output:

  1. Output data to a format R understands
  2. Write RMarkdown document
  3. Write RScript to render RMarkdown to HTML
  4. Write Windows Batch File to automate the RScript

Output data to a format R understands

In the case of zonal data, R can understand CSV out of the box, and with the appropriate library, can understand DBF.  With matrix files, Voyager will export them to DBF with a simple script file:

This script simply reads a bunch of matrix files and outputs them to two DBF files, one for the peak-period distribution and one for the off-peak-period distribution.

One important thing to note in this is that I didn't put paths in this.  I run this from the command line in the model folder and it picks up the files in that folder and outputs the DBFs into that folder.  This is something that would have to be testing when placed into a model run file.

Resist the urge to do this in two separate steps.  The join process in R takes forever, and reading the data into memory may take a while, too.

Write RMarkdown document

The RMarkdown document is where the magic happens.  Fortunately, Knitr (the R Package that does all this magic) does a few things to split code blocks.  If you want to build my file, add all these together into one file and name it something.rmd

There are three code blocks that do this.  They are importing, summarizing, and graphing.

Importing Data

This block does three things:

  1. Loads libraries.  The foreign library is used to read DBF files.  The plyr library is used to join and summarize the data frames (from the DBF inputs).  The ggplot2 library is used for plots.
  2. Sets a few variables.  Since the OKI model is actually two MPO models, we do three reports of everything - one for the entire model, one for the OKI (Cincinnati) region, and one for the MVRPC (Dayton) region.  zones_oki and zones_mv are used to control which report is which.
  3. Imports the DBF files.  Those read.dbf lines are where that magic happens.  Again, since this is run in the model folder, no paths are used.

Summarizing Data

This block does three things:

  1. It rounds the logsum values to provide some grouping to the values
  2. It gets a subset of the model (for OKI)
  3. It summarizes the rounded values to prepare for charting them

Charting Data

This block does one thing: it draws the chart using the ggplot tool.  This is pretty advanced (although not bad, and the online help is good).  However, for this I'm going to hold to my recommendation to get a copy of The R Graphics Cookbook (where I learned how to use ggplot).  The concepts and examples in the book are far greater than what I will post here.

One point that should not be lost is that text elements (either Markdown headings, etc., or just text, or formatted text) can be added into this outside of the ```...``` blocks. This way, reports can actually look good!

Once this part is complete, the hardest stuff is over.

Write RScript to render RMarkdown to HTML

The RScript to render the RMarkdown file to HTML is pretty simple:

This writes the .Rmd file out to the same filename as .html. You can have as many knit2html lines as needed

There are ways to write the files out to PDF (I haven't looked into them... perhaps that would be a future blog topic?).

Write Windows Batch File to automate the RScript

The last step is to write a batch file "wrapper" that can be run as part of the travel demand model run.  This is really quite easy:

The first line sets the path to include R (on my system it isn't in the path, and my path statement is already 9 miles long). The second line runs the R script file (ReportR.R) in R.


That's It! It seems like a lot of work goes into this, but it isn't as difficult as some of the other reporting platforms out there.

PS: Stay tuned for some example reports


Example Report (generated from RMarkdown file)

Example RMarkdown File

AutoHotKey + Cube Voyager: Curly Braces

October 29th, 2013

I am rebuilding the mode choice and distribution parts of the OKI model (that may be obvious from the last two or three posts on this blog), and decided to use AutoHotKey to speed things up.  I wanted to add the MO=... NAME=... and AUTOMDARRAY lines to the MATO and two DBI lines:


In the OKI model, {ZONES_MV} is all internal zones. I used this, did some other copy and paste magic, and was greeted with the errors:

F(795): DBI[2] ARRAYFIELDS=NDSTX must have a valid numeric size specified (-#)

F(795): DBI[1] ARRAYFIELDS=TAZ must have a valid numeric size specified (-#)

It seems the curly braces aren't immediately cool with AHK.

The solution (per this post, and it works) is to enclose the curly braces in curly braces:


Cube Voyager XCHOICE: The Missing Help Doc

October 22nd, 2013

First note: I sent this to the support people at Citilabs recently, so maybe on the next update they'll include this in the help document, as I think it is sorely missing.  Or maybe I'm just crazy (you kinda have to be crazy to do transportation modeling*) and have my own way of thinking.

In the OKI model, we have a nested logit mode choice structure that's pretty common for MPOs our size - it's had a history in several other MPOs.  The nesting structure looks like this:


OKI Mode Choice


The part that I think is missing in the Voyager Help File for XCHOICE is this:

SPLIT=TRAN 0.516 LB 0.516 EB 0.516 LR,
SPLIT=AUTO 0.516 SR 0.516 DA,
SPLIT=TOTAL 0.477 AUTO 0.477 TRAN 0.477 NONM,

More importantly, WHY this is important:


All those green, blue, red, and yellow marks are pointing things out - like what connects to what and so on.  Once these connections are made, you can get answers without a lot of effort.  It's quite nice, really.  However, I haven't figured out where to put upper-level coefficients, but in my latest estimation runs, those are out.

More Stuff

One of the more important things to get out of the mode choice model is the logsum values that would be used as the impedance for a gravity model distribution.  Once you have the above, it's pretty easy.

First off, make your demand matrix have 100 trips per IJ pair:


Then, get the exponential of the SPLITCOMP matrix to get a usable logsum value.


Note that in the OKI model (and in other models that use walk access markets), this needs to be done multiple times and accumulated to another matrix:


And then there is some post-processing that needs to be done at the end of the ILOOP:

MW[702]=MW[701]/MW[2] ;Denom


704 is the output matrix.


* I have other forms of craziness, too.  I've recently announced - and made it "Facebook official" to my close friends and family - that I'm going to run a half marathon next spring.

Biogeme Hints and Tips and My Biogeme Workflow

October 15th, 2013

I've been working a lot with Biogeme, the open-source discrete model estimation tool.  This is really a great tool, and it is distributed free of charge.  However, it has a few "quirks" that I've hit all the time.  That being said, there's a few things I've learned:

  • Text fields in the data file are bad.  The only quotes should be on the top line of the file.  Quoted items throughout the data file will cause Biogeme to error with "No Data In The Sample" or another similar error message.
  • Similar to above, blanks are bad.  Zero fill empty cells.  For CSV files, this is easy - open the file in Excel and replace {blank} with 0.
  • For a large data file and/or when using Network GEV simulation, run it on Linux.  Michel Bierlaire (Biogeme's author) has advocated this many times on the Biogeme Yahoo Group.  My Ubuntu 12.04 virtual machine (running under VirtualBox) runs circles around it's Windows 7-64bit host.
  • Things are case sensitive.

My Workflow

Since everything I use is in Windows (sometimes by choice, sometimes by necessity), I've figured out a workflow that works.  The main dataset is in Microsoft Access where I have several queries and linked tables.

My Biogeme Workflow

In the graphic above, the one thing that ties the two sides together is Dropbox (which could easily be another service, or a network drive).  This allows me to easily view the output html files in the browser on either my Windows host or my Ubuntu VM.  I get the running speed of Ubuntu, and all my data work is on my Windows computer, which makes life easier on me, as I'm used to most of the Windows tools I use, not so much with the Linux equivalents.


Reading and Analyzing Cube Voyager PT Route Trace Files

October 8th, 2013

After an unsuccessful attempt in trading mode choice calibration with bugfixing in Cube, I ended up figuring out several things about Voyager PT and route trace files.

Trace Reports

The trace reports are files that describe the transit routes from one zone to another.  This is a little more detailed than the skim matrices, as the skims don't tell you which routes are used or the operator of the routes (since there are five transit providers in the OKI model, this is a major concern).  The trace reports look like this (this is from a subset of the OKI model):

REval Route(s) from Origin 10 to Destination 18

10 -> 4835
4835 -> 4839 -> 4865 lines Rt1IB Rt10IB Rt10OB Rt28MOB Rt32IB Rt27IB
4865 -> 4859 -> 18 lines Rt5TankOB Rt5TankIB Rt7TankOB Rt7TankIB Rt9TankIB Rt9TankOB Rt11TankIB Rt16TankIB Rt23TankIB Rt25TankIB
Cost= 65.11 Probability=1.0000

REval Route(s) from Origin 10 to Destination 19

10 -> 4835
4835 -> 4839 -> 19 lines Rt1IB Rt10IB Rt10OB Rt28MOB Rt32IB Rt27IB
Cost= 33.42 Probability=1.0000

REval Route(s) from Origin 10 to Destination 20

10 -> 4835
4835 -> 4839 -> 20 lines Rt1IB Rt10IB Rt10OB Rt28MOB Rt32IB Rt27IB
Cost= 33.42 Probability=1.0000

Voyager PT Setup

There is one thing that's not really well documented in Cube Help, and that is how to generate the trace reports.  The one thing that has to be done is on the FILEI ROUTEI or FILEO ROUTEO lines, the line must include both REPORTI=z-Z and REPORTO=z-Z.  The report will not generate if only one side is there - both keys have to be there.  There must also be a FILEO REPORTO file, but that's required for any Voyager PT run.

So What?

Having this was only half the battle.  I needed a matrix of the operator, and for the entire 3,200 (ish) zone model run, the resulting report file is over 1.2 GB (and that doesn't include Dayton's transit!).  So I had to go through this file quickly.  Enter Java.

I wrote a quick (both in code and in running) Java program to read the report file and a DBF of the lines (which I got from the geodatabase route layer). The program sorts through the report file looking for the excerpts above, stores them in memory, and then scans through them to output a dbf with the from-zone (I), to-zone (J), and the operator. This is multi-threaded, so on my 8-core beast, it runs very fast.

The Github repository is here.

Now, on to mode choice...

Trip Rates: Averages and Analysis of Variance

September 13th, 2013

This is second in the R in Transportation Modeling Series of posts.

I've been going between R, R Graphics Cookbook, NCHRP Report 716, and several other tasks, and finally got a chance to get back on actually performing the trip generation rates in the model.

The Process

NCHRP Report 716 lays out a process that looks something like this when you make a graphic out of it:

 Flowchart of Trip Rate Process

So, the first part is to summarize the trips by persons, workers, income, and vehicles.  I'm going to skip income (I don't care for using that as a variable).

This can be done pretty easily in R with a little bit of subsetting and ddply summaries that I've written about before. I split this into three groups based on area type, area type 1 is rural, 2 is suburban, and I combined 3 and 4 - urban and CBD.

hh.1<-subset(hhs,AreaType==1) hh.1.sumper<-ddply(hh.1,.(HHSize6),summarise,T.HBW=sum(HBW),T.HH=length(HHSize6),TR=T.HBW/T.HH) hh.1.sumwrk<-ddply(hh.1,.(Workers4),summarise,T.HBW=sum(HBW),T.HH=length(Workers4),TR=T.HBW/T.HH) hh.1.sumaut<-ddply(hh.1,.(HHVeh4),summarise,T.HBW=sum(HBW),T.HH=length(HHVeh4),TR=T.HBW/T.HH)

This makes three tables that looks like this:

1 1 9 54 0.1666667
2 2 77 98 0.7857143
3 3 24 38 0.6315789
4 4 36 40 0.9000000
5 5 18 10 1.8000000
6 6 4 6 0.6666667

Once all of this is performed (and it isn't much, as the code is very similar among three of the four lines above), you can analyze the variance:

#Perform analysis of variance, or ANOVA
> hh.1.perfit<-aov(TR~HHSize6,data=hh.1.sumper) > hh.1.wrkfit<-aov(TR~Workers4,data=hh.1.sumwrk) > hh.1.autfit<-aov(TR~HHVeh4,data=hh.1.sumaut) #Print summaries >summary(hh.1.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.4824 0.4824 1.987 0.231
Residuals 4 0.9712 0.2428

> summary(hh.1.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 0.1113 0.1113 0.184 0.697
Residuals 3 1.8146 0.6049

> summary(hh.1.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 0.0994 0.09938 0.536 0.54
Residuals 2 0.3705 0.18526

The items above indicate that none of the three items above (persons per household, workers per household, or autos per household) are very significant predictors of home based work trips per household. Admittedly, I was a bit concerned here, but I pressed on to do the same for suburban and urban/CBD households and got something a little less concerning.

Suburban households

> summary(hh.2.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 0.6666 0.6666 23.05 0.0172 *
Residuals 3 0.0868 0.0289
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 1.8951 1.8951 11.54 0.0425 *
Residuals 3 0.4926 0.1642
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.2.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.6530 0.6530 10.31 0.0326 *
Residuals 4 0.2534 0.0634
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Urban and CBD Households

> summary(hh.34.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 1.8904 1.8904 32.8 0.0106 *
Residuals 3 0.1729 0.0576
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 5.518 5.518 680 0.000124 ***
Residuals 3 0.024 0.008
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 0.7271 0.7271 9.644 0.036 *
Residuals 4 0.3016 0.0754
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Another Way

Another way to do this is to do the ANOVA without summarizing the data. The results may not be the same or even support the same conclusion.


> summary(hh.1a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 15.3 15.310 10.61 0.00128 **
Residuals 244 352.0 1.442
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.1a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 60.46 60.46 48.08 3.64e-11 ***
Residuals 244 306.81 1.26
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.1a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 4.6 4.623 3.111 0.079 .
Residuals 244 362.6 1.486
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


> hh.2a.perfit<-aov(HBW~HHSize6,data=hh.2) > hh.2a.wrkfit<-aov(HBW~Workers4,data=hh.2) > hh.2a.autfit<-aov(HBW~HHVeh4,data=hh.2) > summary(hh.2a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 136.1 136.05 101.9 <2e-16 *** Residuals 1160 1548.1 1.33 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > summary(hh.2a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 376.8 376.8 334.4 <2e-16 *** Residuals 1160 1307.3 1.1 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > summary(hh.2a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 103.2 103.20 75.72 <2e-16 *** Residuals 1160 1580.9 1.36 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


> hh.34a.perfit<-aov(HBW~HHSize6,data=hh.34) > hh.34a.wrkfit<-aov(HBW~Workers4,data=hh.34) > hh.34a.autfit<-aov(HBW~HHVeh4,data=hh.34) > summary(hh.34a.perfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHSize6 1 77.1 77.07 64.38 4.93e-15 ***
Residuals 639 765.0 1.20
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(hh.34a.wrkfit)
Df Sum Sq Mean Sq F value Pr(>F)
Workers4 1 222.0 221.96 228.7 <2e-16 *** Residuals 639 620.1 0.97 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > summary(hh.34a.autfit)
Df Sum Sq Mean Sq F value Pr(>F)
HHVeh4 1 91.1 91.12 77.53 <2e-16 *** Residuals 639 751.0 1.18 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

What this illustrates is that there is a difference between averages and raw numbers.

The next part of this will be to test a few different models to determine the actual trip rates to use, which will be the subject of the next blog post.

Mode Choice Modeling with R

June 14th, 2013

I started this post (and the work to go with it) as a companion to A Self Instructing Course in Mode Choice Modeling by Bhat and Koppelman.  That's because I could reproduce the work in the book in R and can (now) reproduce in R.

To continue with this, please get the CD files from my last blog post.  You'll specifically need "SF MTC Work MC Data.sav", which is in SPSS format.

The first part:


The items above simply load the libraries.  If any of these are not found, go to Packages (on the menu bar) - Install Packages... and select your closest mirror and select the missing package (either foreign or mlogit).

Next, read in the data and we'll add a field, too, as there is no unique id in this dataset.


The first line reads in the SPSS file (it asks you for the file).  The second adds a "HHPerID" field, which is unique to each case.

The next part is to format the data for mlogit.  This is quite a challenge because it has to be JUST RIGHT or there will be errors.


The first parts of this are pretty obvious (inTab is the input table, choice="chosen" is the choice field).  Shape="long" indicates that the data is multiple records per case.  "Wide" would indicate each record is on its own line.  chid.var is the case id variable.  alt.var is the alternatives.  drop.index drops the index field out of the resulting table.

Finally, we'll run a simple multinomial logit estimate on this.


For such a short piece of code, there is a lot going on here.  The formula is (simply) chosen=cost+tvtt+hhinc, BUT hhinc is alternative specific and cost and travel time are not.  So the utilities for this would be something like:








The result is this:


mlogit(formula = chosen ~ cost + tvtt | hhinc, data = mc, method = "nr",
print.level = 0)

Frequencies of alternatives:
1 2 3 4 5 6
0.7232054 0.1028037 0.0320143 0.0990257 0.0099423 0.0330086

nr method
6 iterations, 0h:0m:6s
g'(-H)^-1g = 5.25E-05
successive function values within tolerance limits

Coefficients :
Estimate Std. Error t-value Pr(>|t|)
2:(intercept) -2.17804077 0.10463797 -20.8150 < 2.2e-16 ***
3:(intercept) -3.72512379 0.17769193 -20.9639 < 2.2e-16 ***
4:(intercept) -0.67094862 0.13259058 -5.0603 4.186e-07 ***
5:(intercept) -2.37634141 0.30450385 -7.8040 5.995e-15 ***
6:(intercept) -0.20681660 0.19410013 -1.0655 0.286643
cost -0.00492042 0.00023890 -20.5965 < 2.2e-16 ***
tvtt -0.05134065 0.00309940 -16.5647 < 2.2e-16 ***
2:hhinc -0.00216998 0.00155329 -1.3970 0.162406
3:hhinc 0.00035756 0.00253773 0.1409 0.887952
4:hhinc -0.00528636 0.00182881 -2.8906 0.003845 **
5:hhinc -0.01280827 0.00532413 -2.4057 0.016141 *
6:hhinc -0.00968627 0.00303306 -3.1936 0.001405 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -3626.2
McFadden R^2: 0.25344
Likelihood ratio test : chisq = 2462 (p.value = < 2.22e-16)

And this matches the self-instructing course manual, page 76 (under "Base Model").

Nested Logit

R can do simple nested logit calculations, but unfortunately they have to be *very* simple (which is uncharacteristic for R).  The best thing to do is get a copy of Biogeme and read the next post in this series.

Linear and Nonlinear Models in R

June 7th, 2013

This post will talk about building linear and non-linear models of trip rates in R.  If you haven't read the first part of this series, please do so, partly because this builds on it.

Simple Linear Models

Simple linear models are, well, simple in R.  An example of a fairly easy linear model with two factors is:


This creates a simple linear home-based-shopping trip generation model based on workers and household size.  Once the estimation completes (it should take less than a second), the summary should show the following data:

> summary(hbsh.lm.W_H)

lm(formula = N ~ Workers4 + HHSize6, data = hbsh)

    Min      1Q  Median      3Q     Max 
-2.2434 -1.1896 -0.2749  0.7251 11.2946 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.79064    0.10409  17.203  < 2e-16 ***
Workers4    -0.02690    0.05848  -0.460    0.646    
HHSize6      0.24213    0.04365   5.547 3.58e-08 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.649 on 1196 degrees of freedom
Multiple R-squared: 0.03228,    Adjusted R-squared: 0.03066 
F-statistic: 19.95 on 2 and 1196 DF,  p-value: 3.008e-09

What all this means is:

Trips = -0.0269*workers+0.24213*HHSize+1.79064

The important things to note on this is that the intercept is very significant (that's bad) and the R2 is 0.03066 (that's horrible).  There's more here, but it's more details.

Non-Linear Least Squares

When doing a non-linear model, the nls function is the way to go.  The two lines below create a trips data frame, and then run a non-linear least-squares model estimation on it (note that the first line is long and wraps to the second line).


The second line does the actual non-linear least-squares estimation.  The input formula is T=a*e^(HHSize+b).  In this type of model, starting values for a and b have to be given to the model.

The summary of this model is a little different:

> summary(

Formula: T.HBO ~ a * log(HHSize6 + b)

  Estimate Std. Error t value Pr(>|t|)    
a   1.8672     0.1692  11.034  < 2e-16 ***
b   1.2366     0.2905   4.257 2.58e-05 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.095 on 402 degrees of freedom

Number of iterations to convergence: 4 
Achieved convergence tolerance: 1.476e-07 

It doesn't perform R2 on this because it can't directly.  However, we can because we know the actual values and the model predicts the values.  So, one thing that can be done is a plot:

> plot(c(0,10),c(0,10),type='l',xlab='Observed Trips',ylab='Predicted Trips')
> points(subset(trips,AreaType>=3)$T.HBO,fitted(,col='red')

The resulting graph looks like this.  Not particularly good, but there is also no scale as to the frequency along the 45° line.

R2 is still a good measure here.  There's probably an easier way to do this, but this way is pretty simple.  


Since I didn't correct the column names when I created the data frame, R used X1 and X2, as evidenced by checking the summary of testTable:

> summary(testTable)
       X1               X2       
 Min.   : 0.000   Min.   :1.503  
 1st Qu.: 1.000   1st Qu.:1.503  
 Median : 2.000   Median :2.193  
 Mean   : 2.072   Mean   :2.070  
 3rd Qu.: 3.000   3rd Qu.:2.193  
 Max.   :23.000   Max.   :3.696  

So the R2 value is pretty bad...

> cor(testTable$X1,testTable$X2)
[1] 0.2755101

It's better than some of the others, after all, this is semirandom human behavior.

That's it for now.  My next post will be... MORE R!

Also, I have a quick shout-out to Jeremy Raw at FHWA for help via email related to this.  He helped me through some issues via email, and parts of his email helped parts of this post.


A Self Instructing Course in Mode Choice Modeling

May 20th, 2013

One thing to ensure you understand how your software of choice works is to compare it to known outcomes.  For example, while learning Biogeme, I converted and ran some of the scenarios in A Self Instructing Course in Mode Choice Modeling in Biogeme and found interesting issues where certain values coming out of Biogeme were the reciprocal of those in the manual.  Neither is wrong, but when applying the data to a model, you have to know these things.

I've decided to do the same thing in R, and I had a lot of problems getting the CD.  I luckily found one on my hard drive.  It is here.

For the sake of making life easier on anyone that gets here looking for the manual, it's here.

Taking CSV Exported Cube Voyager Path Files to A New Level Using GAWK (part 1)

January 30th, 2013

In a prior post, I link to some code that outputs a path file.  I've done something a tad different because I needed some select link analysis and reading the path file in Cube was taking far too long to do it the normal way.

So, I took that program on Github and extended it to perform a selected link:

And this outputs a few GB of paths in CSV format.  I went from 42 GB of paths in the AM to 3.4 GB of CSV paths.  Still not good enough. The next thing I did was use GAWK to get just the Origin and Destination

This returns a CSV file of just the origin and destination (which can be linked to the vehicle trip matrix).

Part 2 will discuss how to link to a vehicle trip matrix and if this approach actually works!

Reading a Cube Voyager Path File from Java

October 8th, 2012

As a follow-up to my prior post, this is how to use the Cube Voyager API to read a path file.  I highly recommend you read the other article first, as it talks more about what is going on here.

The Interface

The interface for the path reader is larger because of the return structure.  The code below includes the interfaces to the DLL calls and the structure for the path data returned by some of them.  Note that I didn't do PathReaderReadDirect.  It doesn't seem to work (or I'm not trying hard enough).

The Code

Once the interface is in place, the code is reasonably simple.  However, I'm seeing a few interesting things in the costs and volumes in both C++ and in Java, so I wouldn't use those values.  I guess if you need to determine the costs, you should save the costs with the loaded highway network to a DBF file and read that into an array that can be used to store and get the values.

The Final Word... For Now

Java is a great programming language.  Using these DLLs can help you do some interesting stuff.  However, it seems that there are very few people using the API, which is concerning.  I personally would like to see an interface for reading .NET files and writing matrices.  But I can't expect Citilabs to put time in on that when it seems there are so few people using it.


Reading a Cube Voyager Matrix from Java using JNA

October 5th, 2012

I've begun to really enjoy Java.  It's hot, black exterior exposes a sweet bitterness that matches few other things in this world.  Oh, wait, this is supposed to be about the other Java - the programming language!

The "Holy Grail" of API programming with Cube Voyager to me has been using the API in Java.  I can program in C++ quite well, but I have a staff that can't.  We're likely going to be going to a Java based modeling structure in the next few years, so  it makes sense to write everything in Java and keep the model down to two languages - Cube Voyager and Java.

Setting up the Java Environment

There are three things to do to setup the Java environment to make this work.  The first is to place the Cube DLL in the right location.  The second is to get JNA and locate the libraries to where you need them.  The final is to setup the Java execution environment.

First, copy the VoyagerFileAccess.dll file (and probably it's associated lib file) to C:\Windows.  It should work.  I'm using a Windows 7-64 bit machine, so if it doesn't work, try C:\Windows\System32 and C:\Windows\System.

Second, get JNA.  This allows the Java compiler to connect to the DLL.  The latest version can be downloaded from Github (go down to "Downloads" under the just scroll down 'till you see it, and get both platform.jar and jna.jar).

If you're on a 64-bit computer, the second thing to do is to set your jdk environment to use a 32-bit compiler.  I use Eclipse as my IDE, so this is done through the project properties.  One location is the Java Build Path - on the Libraries tab, make sure the JRE System Library is set to a 32-bit compiler.  In the Java Build Path screenshot below, you can see that all the locations are in C:\Program Files (x86) - this is an easy (although not foolproof) way to show that this is a 32-bit compiler.

Java Build Path Window

While you're setting up stuff in this window, make sure the jna.jar and platform.jar are linked here as well (click "Add External JARs..." and locate those two files).

Another place to check in Eclipse is the Java Compiler settings, which should have "Use Compliance from execution environment..." checked.

The Programming

The thing that makes this work is this part of the programming.  You can see in this that I create an interface t0 the Voyager DLL file by loading the DLL, and then setup some pointer objects to hold the memory pointer variable (the "state" variable in all of these) and set up the functions to read from the matrix.

The next part that makes this work is the actual programming. In the code below, the first thing I do is define vdll as an instance of the voyagerDLL interface.  Then, I open a matrix file (yes, it is hard-coded, but this is an example!), get the number of matrices, zones, the names, and I start reading the matrix (in the for loops).  I only print every 100th value, as printing each one makes this slow a bit. The actual reading is quite fast.  Finally, I close the matrix and the program terminates.


The big issue I noticed is that if the matrix is not found, the Pointer variable returned by MatReaderOpen will be null, but nothing will be in the error value.  I've tried redefining the error value to be a string in the interface, but it does the same thing.  However, I don't recall if it did anything in C++.  At any rate, there needs to be some error checking after the matrix is opened to ensure that it actually has opened, else the program will crash (and it doesn't do a normal crash).

Next Up

The next thing I'm going to do is the path files.

Using the Voyager API for Path Analysis

August 3rd, 2012

Just posted on Github: Path2CSV

This is a tool that will read a Cube Voyager Path file and output the contents by node to a CSV file.  The code is written in C++ and available under the GPL3 license.


Interesting INT() Issue Between Cube and Excel

July 24th, 2012


I don't know about anyone else, but I do a lot of calculation prototyping in Excel before applying that in scripts.  One of the most recent was to do a script to add expansion zones (also known as "dummy zones", although they aren't really dumb, just undeveloped!).

The problem I had was related to the following equation:

R=INT((819-N)/22)+1   Where N={820..906}

In Excel, the results are as below (click on it if it is too small to see):

In Cube, I got the result of (click on it to expand, and I only took it into Excel to move stuff around and make it easier to see):

Note the sheer number of zeroes in the Cube version and all the numbers are 'off'.

The reason, as I looked into things was because of how INT() works differently in the two platforms.  In Cube, INT simply removes everything to the right of the decimal, so INT(-0.05) = 0, and INT(-1.05)=-1.  In Excel, INT rounds down to the nearest integer.  This means that negative values will be different between the two platforms.  Note the table below.

Excel Cube
3.4 3 3
2.3 2 2
1.1 3 1
0.5 0 0
0 0 0
-0.5 -1 0
-1.1 -2 -1
-2.3 -3 -2
-3.4 -4 -3

While neither software is truly wrong in it's approach (there is no standard spec for INT()) it is important to know why things may not work as expected.

More Voyager PT + AWK Goodness

September 20th, 2011

One thing I've missed from the old TranPlan days was the reporting group.  We've used that for many years to compare our transit loadings by major corridor.  Unfortunately, that functionality was lost going to PT.  I still need it, though, and enter awk.

The script below looks at the transit line file and outputs ONLY the line code, comma-separated.  It uses a loop to check each field for ' NAME=' and 'USERN2', which is where we now store our reporting group codes.

	for (i=1;i<20;i++)
		if($i~/ NAME=/)
			printf "%s,",substr($i,8,length($i)-8)
			printf "%s\n",substr($i,9)

The contents of the above need to be saved to a .awk file - I used trn.awk.

To call this, I use a Pilot script to call awk and pass the input and get the output.


The output of this is a simple two-column comma-separated-value file of the route ID and the reporting group.

Emailing an alert that a model run is complete in Cube Voyager

March 6th, 2011

When you are doing many model runs, it makes life easier to know if the modelrun is complete.  The code is below.

SUBJECT='Subject Line Here',
MESSAGE='Message Goes Here',

The things you replace here are pretty obvious.  If you have questions about the SMTPSERVER parameter, ask your IT person.  Also, for Windows domains, the USERNAME parameter should be 'DOMAIN\USERNAME' (you may be able to use your email address, depending on your email setup).

Python and Cube

December 19th, 2010

One of the advantages of the ESRI ArcGIS Framework is that you can write Python scripts that do GIS things and run them from Cube.  Even better, Cube uses the Geodatabase format, so you can store and retrieve things from there.

The first thing that is needed is a python script.  The below is an example that we're not using at the moment, but it merges multiple transit line files together.

import arcgisscripting, sys, os

gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolBox/Toolboxes/Data Management Tools.tbx")

print sys.argv




print input
print output

if gp.Exists(output):

#input=input1+"_PTLine" +';'+input2 +"_PTLine"


print gp.GetMessage

del gp

To call this, we add the following in a Pilot script:

* {CATALOG_DIR}\Inputs\Transit.mdb\Routes1 {CATALOG_DIR}\Inputs\Transit.mdb\Routes2 {CATALOG_DIR}\Inputs\Transit.mdb\RoutesCombined

This makes it easy to create geoprocessing steps in ArcMap, export them to Python, and call them from the model.

Top 6 Resources for a Travel Modeler to Work From Home

December 16th, 2010

It's the most wonderful time of the year, isn't it?  Nothing says "winter" like 6 inches of snow that keeps you from going to the office!

Over the years, I've amassed a set of utilities, many of them free, to make my life easier.  This list can sometimes take the place of things that I would normally use in the office, other times they are things that sync to the "cloud" and I use them both in the office and at home.

1. Dropbox

I don't care too much for USB "thumb" drives, and I've had my fair share of leaving them at home or at work and needing them at the opposite location.  Dropbox clears up this mess, as there are no USB drives to lose or leave in the wrong place.  NOTE: the link that I have IS a referral link.  Clicking on that and creating an account results in both of us getting an extra 250 MB of space with the free account (starts at 2 GB, max for free is 8 GB).

2. Evernote

I take a lot of notes, both on the road at conferences and at the office.  Evernote is what I use to keep them organized.

3. Google Docs

Unless you want to spring for Microsoft Office at home, Google Docs is the way to go.  There are several others including Zoho and Office Online, but I haven't used them.  Google Docs has great collaboration features, document versioning, and its free.  Just make sure to back it up! The only problem: no DBF file support.

4. Notepad++

This is perhaps the greatest text editor.  It understands and does some context highlighting (etc) for many programming languages.  Even better, Colby from Citilabs uploaded his language definition file for Cube Voyager to the user group!

5. Microsoft Visual {whatever} Express Edition

The Express Edition tools have become our go-to tools for new development, particularly MS Visual C++ EE and MS Visual Basic EE.  Since they're free, you can have copies both at home and work.

6. Eclipse

This one's almost optional, but for those working with Java models, this is the standard IDE, and it is open source.

Any tools to add?  Add them in the comments below.

Using a Class Object to Help Read a Control File

December 5th, 2010

One thing we're used to in travel modeling is control files.  It seems to harken back to the days of TranPlan where everything had a control file to control the steps.

In my case, I have a control file for my nested logit mode choice program, and because of the age of the mode choice program, I want to redesign it.  The first part of this is reading the control file, and I did a little trick to help with reading each control file line.  With C++, there is no way to read variables in from a file (like there is with FORTRAN).

The first part of the code reads the control file, and you will see that once I open and read the control file, I section it out (the control file has sections for files ($FILES), operation parameters ($PARAMS), operation options ($OPTIONS), and mode choice parameters ($PARMS). Each section ends with an end tag ($END). This adds the flexibility of being able to re-use variables in different locations.

After the section, the next portion of the code reads the line and checks to see if FPERIN is found. If it is, a ControlFileEntry object is created. This object is a class that is used to return the filename held in the object. This makes it easy to reduce code.

int readControlFile(char *controlFileName){
	cout << "Reading " << controlFileName << endl;
	//Read the control file
	string line;
	bool inFiles=false, inParams=false, inOptions=false, inParms=false;
	ifstream controlFile(controlFileName);
		return 1;
		//check the vars sections
			cout << "Checking files" << endl;
				controlFileEntry cfe(line);
	return 0;

The controlFileEntry is code is below.  This is used at the top of the code, just below the preprocessor directives (the #include stuff).

class controlFileEntry{
	string filename;
	controlFileEntry(string Entry){
	string Entry;
	int beg;
	int end;

The class has one public member, filename, which is what is read in the code where it is used. There are two public functions. The first is the constructor (controlFileEntry) which is used when creating the object. The second is the de-constructor (~controlFileEntry), which sets the beg, end, and filename variables to zero and blank.  The beg, end (misnomer), and the line sent to it are private and cannot be used in code.

This can be extended, as the file entry type is fine when there are quotes around the item (it is setup for that, note the -2 in beg).  I wrote a different one for integers, floating point, and boolean values.

class controlParamEntry{
	int ivalue;
	bool bvalue;
	double dvalue;
	controlParamEntry(string Entry){
		}else if(Entry.substr(beg,end)=="F"){
	string Entry;
	int beg;
	int end;

As you can see above, there are return values for floating point (dvalue), integer (ivalue), and boolean (bvalue).

Tune in next week to see more of the code.

Reading a Matrix File in C++ and Doing Something With It

November 28th, 2010

Last week's post showed how to open a path file using C++. This week's post will show how to open a Cube Voyager matrix file in C++.


Like last week, we need to reference VoyagerFileAccess.lib in the project.  We also need to add the external references in the header file as below:

extern "C" __declspec(dllimport)void* MatReaderOpen(const char *filename, char *errMsg, int errBufLen);
extern "C" __declspec(dllimport)int MatReaderGetNumMats(void* state);
extern "C" __declspec(dllimport)int MatReaderGetNumZones(void* state);
extern "C" __declspec(dllimport)int MatReaderGetMatrixNames(void* state, char **names);
extern "C" __declspec(dllimport)int MatReaderGetRow(void* state, const int mat, const int row, double *buffer);
extern "C" __declspec(dllimport)void MatReaderClose(void* state);

Also ensure that the project is setup with the character set of "Not Set" as opposed to Unicode, which seems to be a default in MS Visual C++ Express Edition.

Main Application

The main application is fairly simple and just opens the matrix and outputs the number of tables and zones to the screen.

#include "stdafx.h"
#include <stdio.h>
#include <iostream.h>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
	char errMsg[256]="";
	// Open the matrix
	void* matrixState=MatReaderOpen(argv[1],errMsg,256);
	// Get number of tables in the matrix
	int nMats = MatReaderGetNumMats(matrixState);
	// Get number of zones in the matrix
	int nZones = MatReaderGetNumZones(matrixState);
	// Output to screen
	cout << "File " << argv[1] << endl;
	cout << "Number of Tables....." << nMats << endl;
	cout << "Number of Zones......" << nZones << endl;
	// Close the matrix
	cout << "Matrix Closed." << endl;
	cout << "Press any key to close" << endl; 	char tmp; 	cin >> tmp;
	return 0;

The output looks like the below:

Using the Path Files in the New Cube Voyager API

November 23rd, 2010

Matthew M., The Citilabs Director of Development, released an API to use C/C++ to read Voyager Matrixes and Highway Path Files.  I have started into using this API in C++ to read path files.

The first thing with the path files is that it is (as indicated in the documentation) Highway Path Files only.  I first tried PT Route Files (which can be read in Viper in the same way one could use Highway Path files), but alas, you receive an error when trying to do that.

For this, I have created a console application, which could become something to run in a model run.


The first thing is to setup your include file with the DLL references.

Start by adding a reference to VoyagerFileAccess.lib.  In Visual C++ Express 2010, right-click on your solution name and add an existing item (and point it to VoyagerFileAccess.lib).  Then, in a header file (or in your source file, but normal programming conventions seem to dictate that these items belong in headers), add the following lines:

extern "C" __declspec(dllimport)void* PathReaderOpen(const char *filename, char *errMsg, int errBufLen);
extern "C" __declspec(dllimport)void* PathReaderClose(void* state);
extern "C" __declspec(dllimport)int PathReaderGetNumZones(void* state);
extern "C" __declspec(dllimport)int PathReaderGetNumTables(void* state);

These lines tell the compiler that these four functions are imported through a DLL (and thus, it uses the lib file to know where to go).

The next thing, since this is a console application, is to correct the Character Set.  Right-click on the solution, go to properties, select Configuration Properties - General, and set the Character Set to "Not Set".  If you leave it on Unicode, your command line arguments will have only one letter.  See the screen capture below.

Main Application

This is a small application that just shows some simple reading the zones and tables in the path file.  The application takes one command-line argument.

The source, fully commented, is below.

#include "stdafx.h"

#include <Windows.h>

#include <stdio.h>

#include <iostream>

using namespace std;

int _tmain(int argc, char* argv[])


// dim variables
char errorMessage[256]="";
int Zones,Tables;

// Opens the path file and sets the Zones and Tables variables
void* state=PathReaderOpen(argv[1],errorMessage,256);

// Dumps the variables to the screen
cout << "State of PathReaderOpen: " << state << endl;
cout << "PathReaderErrorMessage: " << errorMessage << endl;
cout << "Zones: " << Zones << endl;
cout << "Tables: " << Tables << endl;

// Closes the path file
cout << "Path Reader Closed";

// This makes the command window wait for input from the user before closing
char tmp;
cin >> tmp;

return 0;

For debugging, you will want to set the command-line argument.  This is done by right-clicking on the solution and going to Configuration - Debugging.  See the screen capture below.


The output of this is fairly simple:

In the coming weeks, I will post more about using this new API.

Voyager + C++ With Multi-Dimensional Arrays (Part 2: Writing)

November 7th, 2010

This is part 2 of using Cube Voyager Multi-Dimensional Arrays with C++. To see part 1, click here.

Building on last weeks post, the below shows the modifications necessary in Cube. The first thing I added is the variable itself (else, you will get one of those inexplicable errors). In this case, I add MDARRAY2 as a variable that is the same dimensions as MDARRAY. The second part that I add (which is after the CALL statement) is just to report the values stored in MDARRAY2.




LOOP _c=1,5
  LOOP _r=1,5
    PRINT PRINTO=1 LIST='MDARRAY[',_c(1.0),'][',_r(1.0),']=',MDARRAY[_c][_r](8.6)


LOOP _c=1,5
  LOOP _r=1,5
    PRINT PRINTO=1 LIST='MDARRAY2[',_c(1.0),'][',_r(1.0),']=',MDARRAY2[_c][_r](8.6)

In C++, I add a second variable for MDARRAY2 (called TableRecord2). It is critical that this is a double* variable, as this needs to be a pointer so Cube can access updated values of the variable. Similar with how I read MDARRAY into TableRecord, I do the same with MDARRAY2 and TableRecord2, which reads the pointers to MDARRAY2 into TableRecord2. Then, as I iterate through TableRecord, I set TableRecord2 to 10 * TableRecord. After this, the DLL is complete and Cube ultimately prints all the values to the print output.

int TableReader (Callstack* Stack){
	double* TableRecord;
	double* TableRecord2;
	char message[100];

	for(int x=0;x<=24;x++){ 	if(&TableRecord!=0){ 			 		sprintf(message,"TableRecord=%f",TableRecord[x]); 		Stack->pfPrnLine(1,message);
	return 0;

Additional Considerations

If you decide to use this, you may want to pass the sizes of each dimension if it is important. Then, you can write a function to take the sequential value and return the column or row.

Voyager + C++ With Multi-Dimensional Arrays (Part 1: Reading)

October 31st, 2010

This is part 1 of this subject. Part 2 will be about writing values to the arrays.

One of the cool things with the latest version of Cube Voyager is multi-dimensional arrays. However, it appears behind the scenes (or at least to C++) that the multi-dimensional arrays are a wrapper over a single-dimension array.

The easiest way to show this is to make a random array and send it to the print file. Making the random array in Cube is simple:




LOOP _c=1,5
  LOOP _r=1,5
    PRINT PRINTO=1 LIST='MDARRAY[',_c(1.0),'][',_r(1.0),']=',MDARRAY[_c][_r](8.6)


Then, in C++ we can pull 25 (!) array values from this:

int TableReader (Callstack* Stack){
	double* TableRecord;
	char message[100];

	for(int x=0;x<=24;x++){
	return 0;

For fixed size multi-dimensional arrays, this isn't really an issue. It would be very easy to wrap the Stack->pfFindVar line in a set of loops that fills a multi-dimensional array.

Voyager + C++ With DBI Part 1: Number Fields

October 17th, 2010

This is part 1 of this subject.  Part 2, using C++ with DBI String Fields, will be in a few weeks, once I figure it out!

Extending the posts relating to using C++ with Cube Voyager, the next thing to look at is using C++ with the DBI input module that was added in Cube 5.1.

The key to making this happen is the Matrix FILEI help section that discusses that certain things are held in arrays. My last post on this subject got into arrays a little, but there are a few new tricks to use.

The code below (C++) is some simple source code that reads the input database (DBI.1) and reads the built-in array of field values.

typedef struct { int I,J,Zones,Iterations;
				double** MW;
				void* (*pfFindVar)(char*,...);
				void* (*pfPrnLine)(int,char*);
} Callstack;

int TableReader (Callstack* Stack){

	double* TableRecord;
	char message[100];


	for(int x=0;x<=40;x++){ 		if(&TableRecord[x]!=0){ 			sprintf(message,"Table %i=%f",x,TableRecord[x]); 			Stack->pfPrnLine(1,message);

	return 0;

This reads all the NUMERIC (note the emphasis!) fields and dumps them to the print file. There is a trick in here - I have a table with 39 fields, but I pull 40 fields. If you actually use that 40th field in the sprintf statement, it crashes. This is why I test to ensure that &TableRecord[x] (the pointer to the table record) does not equal 0.

Normally in Cube, one would read the database fields using DI.1.FIELDNAME. This is certainly possible in C++. Note the code below (where HHPERSONID is the field name):

int TableReader2 (Callstack* Stack){
	double HHPersonId;
	char message[100];


	return 0;

This is similar to the code example above.

Tune in next week when I get into more DBI with C++.

Using a C++ DLL in Cube

October 10th, 2010

One thing that can drastically speed Cube is using a DLL to do big tasks, like Nested Logit Mode Choice. However, doing this can be fraught with hair-pulling errors.  This post shows some techniques to keep your hair on your head.  This post is written for a travel demand modeler, not a computer science person!

RTFM (Read The Fine Manual)

Read the help file for Matrix CALL statement.  The struct statement is pretty important, and the sprintf lines will be used throughout.

Memory Pointers

One of the most important things to understand is that because there are so many variables that can be passed between Cube and the C++ DLL, the memory pointers are passed instead.  Also, one of those "pull your hair out" things relates to this - if you attempt to access a memory pointer that hasn't been initialized, you get a crash that gives no error.

Because of this, the variables in the struct statement have a *, which notes that it is a memory pointer.

To keep from getting the crash-with-no-error, the following statement works well to test and allows a default to be used if the variable 'MarketSegments' is not set in Cube.

int MarketSegments=4;


Matrix In, Matrix Out

While the Help file says that you can get to defined working matrixes using

static double **MW;

I can't get it to work using C++ (I have gotten it to work in C).  Instead, use the following:

static double **MW=NULL;

This will enable you to use MW[m][j] (where m is the MW number, and j is the j-zone).

You can also set the MW variables, but it does NOTHING if you don't set the MW to something in Cube Voyager.  Ergo, if you set


Your output will be 0 unless you do the following in Cube Voyager


Array Variables

One of the tricks I use to get array variables out of Cube is this

float ArrayVariable[7]={0,0,0,0,0,0,0};  //Note: I'm using 1-6.  Setting this to 7 means 0-6.  Setting it to 6 would mean 0-5
double* tmpAV=NULL;
for(int x=1;x<=6;x++)

This code above checks that the ArrayVariable, fills them into a temporary variable, and then sets the actual variable.

Compilation Linker Settings

When compiling, you need to set the EXPORT flag so the name is predictable and correct.  To do this, go to your project's property pages - Configuration Properties - Linker - Command Line.  You need to add "/EXPORT:FunctionName" under Additional Options.  See the screenshot below


Other Weird Stuff

Any error in C++ that does not cause a compilation error results in one of those useless "this program has an error and will be closed" and crashes Task Monitor.  That being said, write messages to the output file frequently (if at least during debugging).  This can assist with finding typos (like, say, %10.65f in an sprintf statement, which means 65 decimal places in a 10-width line).

Cube Voyager: Using Cluster with DBI

October 3rd, 2010

Credit for this goes to Citilabs Support, although I made some adaptations.

In Matrix when using DBI, PAR ZONES=1 will effectively shut off Cluster. Therefore, the following works really well.



start = (I-1)*recs+1
end = I*recs

LOOP _r=start,end

This script sets each core to process an equal portion of the database with any remainder (e.g if you cluster 4 records over 3 cores) to the last core.

Cube Voyager Speed Clinic

September 26th, 2010

There are several issues with long travel demand model run times.  Deep down, these are supposed to be planning tools, and taking too long for results can reduce the practicality of using a travel demand model in decision making.

In Cube Voyager, I've been finding more ways to cut runtimes down as much as possible, and below is the list with some of the rationale.

Keep JLoops at a Minimum

The Matrix program runs in an implied ILOOP.  This being the case, anything in the script runs many times (as many as the zones you have).  Using a JLOOP in a 2,000 zone model means that there are  4,000,000 calculations to be done for each COMP statement.  What happens if you have 3 COMP statements?  You go from 4,000,000 to 12,000,000 calculations.  This is even worse if the calculations include a lookup function, which are generally slow.

Keep Files Small - Only Write What You Have To

This is a no-brainer.  The more that Cube (or anything, for that matter) has to write to disk, the longer the runtime.

Replace RECI with DBI wherever possible

DBI can be clustered (look for a post on that in the coming weeks).  While I'm not sure if there is any difference on one core, being able to use Cluster is the trump card.

Use Cluster Dynamically and Wherever Possible

Standardize the Cluster ID in your code, but set the process list to a catalog key as with below:


Using Cluster to your advantage is critical to having a fast model.

Tour-Based Modeling: Why is it Important?

June 12th, 2010

One thing that is constantly bounced around is why tour-based modeling is better than trip based modeling.  We've been using trip based modeling for 50 years, isn't it timeless?


Fifty years ago, when the trip based modeling methodologies were developed, the primary reason was to evaluate highway improvements.  While tolling was in use, the bonding requirements were likely different.  Transit, while extremely important, was not in the public realm (the streetcars were normally privately owned by the area's electric company).

Now, there are a lot of demands on travel models:

  • Tolling/Toll Road analysis at a better level
  • Different tolling schemes (area tolling, cordon tolling)
  • Travel Demand Management (telecommuting, flex hours, flex time, alternative schedules)
  • Better freight modeling (which now is becoming commodity flow and commercial vehicle modeling)
  • Varying levels of transit (local bus, express bus, intercity bus, BRT, light rail, and commuter rail

While many of these can be done with trip based models, most of them cannot be done well with trip based models.  There are a number of reasons, but the few that come to mind are aggregation bias, modal inconsistency, and household interrelationships.

Aggregation Bias

Aggregation bias occurs when averages are used to determine an outcome.  For example, using a zonal average vehicles per household, you miss the components that form the average, such as:

20 households, average VPHH = 2.2
2 HH VPHH = 0
5 HH VPHH = 1
4 HH VPHH = 2
6 HH VPHH = 3
3 HH VPHH = 4+

The trip generation and modal choices (car, bus, bike, walk, etc.) among these households are all different, and are even more more different if you look at the number of workers per household.

Modal Inconsistency

In trip based modeling, "people" are not tracked throughout their day.  So, if someone rides the bus to work, there is nothing in the model to ensure that they don't drive from work to get lunch.  While we don't want to force people to use the same mode, since many people will use the bus to get to work and then walk to lunch or to go shopping during lunch, we want to make sure that there is some compatibility of modes.

Household Interrelationships

One of the features of of tour based models is determining each person's daily activity pattern.  During this process, certain person types can determine what another person is doing.  For example, if a preschool age child is staying home, an adult (whether they are a worker or not) HAS to stay home.  Another example is if a school-non-driving-age child is going on a non-mandatory trip, an adult must accompany them.  Trip based models don't know about the household makeup and the household interaction.

The above are only three of the many reasons why tour-based modeling is important.  There are many more, but I feel these are some of the most important and some of the easiest to understand.