Lookups in R: The Wrong Way and the Right Way

I recently wrote a script that takes DBF exports of Cube matrices and prepares them for Biogeme.  The main... well, only reason I did this in R was that I was considering using mlogit for model estimation.  I ultimately decided to 'go with what I know' and changed course to use Biogeme.  Mind you, that line from Stairway to Heaven applies: "There are two paths you can go by, but in the long run / There's still time to change the road you're on."

The Wrong Way

I've changed my code already, so pardon that this is from memory.  Also, these are snippets - I have a lot more code than this.


for(rn in 1:nrow(TripsAll)){
  # One subset() per trip record: scans the entire skim table every iteration
  HSkimPkRow <- subset(HSkimPk, I == TripsAll[rn, "PTAZ"] & J == TripsAll[rn, "ATAZ"])

This took no less than 17 hours to complete for around 23,000 trip records, pulling values from 5 different tables × 2 time periods.
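To see why this pattern is so slow, here is a self-contained toy version (the table names match the post, but the sizes and data are invented): subset() scans every row of the skim table once per trip, so the total work is roughly nrow(HSkimPk) × nrow(TripsAll) comparisons.

```r
# Toy data (hypothetical sizes): a long-format skim table and a trip table
HSkimPk <- expand.grid(I = 1:100, J = 1:100)
HSkimPk$V1 <- runif(nrow(HSkimPk))                 # a skim value per I-J pair
TripsAll <- data.frame(PTAZ = sample(1:100, 500, replace = TRUE),
                       ATAZ = sample(1:100, 500, replace = TRUE))

# Per-row subset: each iteration scans all 10,000 skim rows
times <- numeric(nrow(TripsAll))
for (rn in 1:nrow(TripsAll)) {
  row <- subset(HSkimPk, I == TripsAll[rn, "PTAZ"] & J == TripsAll[rn, "ATAZ"])
  times[rn] <- row$V1
}
```

Even at this toy scale the loop does half a million comparisons; at 23,000 trips against a real zone system, the hours add up.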

The Right Way

I (obviously) wanted something that wouldn't take forever, especially as I was working in Biogeme and seeing things that made me think that I wanted to change ONE LITTLE THING.  This seems to always happen.

I took a different approach that, by my calculations, should be much quicker.

library(reshape2)  # provides acast()

# Reshape the long skim table into wide I-by-J matrices, one per value column
HSkimPkD <- acast(HSkimPk, I ~ J, value.var = "V2", drop = FALSE, fill = 0)
HSkimPkT <- acast(HSkimPk, I ~ J, value.var = "V1", drop = FALSE, fill = 0)

for(rn in 1:nrow(TripsAll)){
  I <- TripsAll[rn, "PTAZ"]
  J <- TripsAll[rn, "ATAZ"]
  if(I <= nrow(HSkimPkT) & J <= ncol(HSkimPkT)){
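To make the reshaping step concrete, here is a self-contained toy version (data and sizes invented): once the long table is cast into a matrix, each lookup becomes a direct index instead of a table scan. Note that acast() names the rows and columns after the zone numbers, so indexing by character is the safe general form; numeric indexing only coincides with it when the zone numbers are contiguous starting at 1.

```r
library(reshape2)

# Toy long-format skim table (hypothetical sizes)
HSkimPk <- expand.grid(I = 1:100, J = 1:100)
HSkimPk$V1 <- runif(nrow(HSkimPk))

# Wide I-by-J matrix: rows named by origin zone, columns by destination zone
HSkimPkT <- acast(HSkimPk, I ~ J, value.var = "V1", drop = FALSE, fill = 0)

# One lookup is now a constant-time matrix index
HSkimPkT["5", "17"]
```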

Since this is currently running, my only metric is the time per 50 rows (in my real code, I have a line that outputs a timestamp every 50 rows), and it is taking about 0.27 seconds per record, compared to somewhere around 4.5 seconds per record before.  While not perfect, I'll take an estimated completion of 1.75 hours over 17 (update: it finished in 2 hours).  That said, Cube is still faster in this regard, and I may not have the fastest R solution.
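For what it's worth, the remaining loop can be removed entirely: an R matrix accepts a two-column index matrix, so all the lookups can be done in one vectorized call. A sketch with invented toy data (the PkTime column name is hypothetical, not from the original script):

```r
library(reshape2)

# Toy data (hypothetical sizes), reshaped as in the post
HSkimPk <- expand.grid(I = 1:100, J = 1:100)
HSkimPk$V1 <- runif(nrow(HSkimPk))
HSkimPkT <- acast(HSkimPk, I ~ J, value.var = "V1", drop = FALSE, fill = 0)
TripsAll <- data.frame(PTAZ = sample(1:100, 23000, replace = TRUE),
                       ATAZ = sample(1:100, 23000, replace = TRUE))

# A two-column character matrix indexes (row, col) pairs in one shot
idx <- cbind(as.character(TripsAll$PTAZ), as.character(TripsAll$ATAZ))
TripsAll$PkTime <- HSkimPkT[idx]   # hypothetical result column; no loop needed
```

This trades the per-record loop for a single subscript operation, which R executes in compiled code.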

