I started this post (and the work to go with it) as a companion to A Self Instructing Course in Mode Choice Modeling by Bhat and Koppelman. That's because I could reproduce the work in the book in R and can (now) reproduce in R.
To continue with this, please get the CD files from my last blog post. You'll specifically need "SF MTC Work MC Data.sav", which is in SPSS format.
The first part:
The items above simply load the libraries. If any of these are not found, go to Packages (on the menu bar) - Install Packages... and select your closest mirror and select the missing package (either foreign or mlogit).
Next, read in the data and we'll add a field, too, as there is no unique id in this dataset.
The first line reads in the SPSS file (it asks you for the file). The second adds a "HHPerID" field, which is unique to each case.
The next part is to format the data for mlogit. This is quite a challenge because it has to be JUST RIGHT or there will be errors.
The first parts of this are pretty obvious (inTab is the input table, choice="chosen" is the choice field). Shape="long" indicates that the data is multiple records per case. "Wide" would indicate each record is on its own line. chid.var is the case id variable. alt.var is the alternatives. drop.index drops the index field out of the resulting table.
Finally, we'll run a simple multinomial logit estimate on this.
For such a short piece of code, there is a lot going on here. The formula is (simply) chosen=cost+tvtt+hhinc, BUT hhinc is alternative specific and cost and travel time are not. So the utilities for this would be something like:
The result is this:
>summary(nlb) Call: mlogit(formula = chosen ~ cost + tvtt | hhinc, data = mc, method = "nr", print.level = 0) Frequencies of alternatives: 1 2 3 4 5 6 0.7232054 0.1028037 0.0320143 0.0990257 0.0099423 0.0330086 nr method 6 iterations, 0h:0m:6s g'(-H)^-1g = 5.25E-05 successive function values within tolerance limits Coefficients : Estimate Std. Error t-value Pr(>|t|) 2:(intercept) -2.17804077 0.10463797 -20.8150 < 2.2e-16 *** 3:(intercept) -3.72512379 0.17769193 -20.9639 < 2.2e-16 *** 4:(intercept) -0.67094862 0.13259058 -5.0603 4.186e-07 *** 5:(intercept) -2.37634141 0.30450385 -7.8040 5.995e-15 *** 6:(intercept) -0.20681660 0.19410013 -1.0655 0.286643 cost -0.00492042 0.00023890 -20.5965 < 2.2e-16 *** tvtt -0.05134065 0.00309940 -16.5647 < 2.2e-16 *** 2:hhinc -0.00216998 0.00155329 -1.3970 0.162406 3:hhinc 0.00035756 0.00253773 0.1409 0.887952 4:hhinc -0.00528636 0.00182881 -2.8906 0.003845 ** 5:hhinc -0.01280827 0.00532413 -2.4057 0.016141 * 6:hhinc -0.00968627 0.00303306 -3.1936 0.001405 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Log-Likelihood: -3626.2 McFadden R^2: 0.25344 Likelihood ratio test : chisq = 2462 (p.value = < 2.22e-16)
And this matches the self-instructing course manual, page 76 (under "Base Model").
R can do simple nested logit calculations, but unfortunately they have to be *very* simple (which is uncharacteristic for R). The best thing to do is get a copy of Biogeme and read the next post in this series.