Reading Chart Images in R

August 6th, 2019

I ran into a non-computer issue recently – a marathon I was going to run had no hill profile on the race’s website. Many runners, myself included, use hill profiles to gauge how we want to run a race. For example, the Cincinnati Flying Pig is one where many runners would want to keep their pace moderated during the first 6 miles because miles 7-9 are a ginormous hill.

Since the race didn’t have an elevation profile on their website, I looked elsewhere for information, and came upon several Strava results that include an elevation profile. This is good, but I wanted to take it one step further and compare the marathon I was going to run – the Toledo Glass City Marathon – to one I had already run – the Cincinnati Flying Pig Marathon.

This is a screenshot of the elevation profile from the Flying Pig Marathon from my Strava race in 2018
This is a screenshot of the elevation profile of the Glass City Marathon from some random runner’s Strava upload of the race

Using the above images requires a few steps. The first is simply cropping down to the chart area.

The next step is to read in the pixel data. This requires the imager package in R (install.packages("imager")).

The first part of the code below loads the libraries and reads the files in as greyscale.

library(imager)
library(ggplot2)
library(dplyr)

# read the cropped screenshots and convert them to grayscale
fpm.ei = grayscale(load.image("Pictures\\FPM_HillProfile_IR.png"))
gcm.ei = grayscale(load.image("Pictures\\GCM_HillProfile_IR.png"))

The values are read in as a matrix indexed with the 0,0 point at the upper-left corner of the image, with each value an estimate of the pixel’s lightness (0 is black, 1 is white). This means that an initial plot looks upside-down.

This can be fixed and simplified by using the ggplot2 snippet below, and while I was at it, I pulled out the grey background and factored the values to 0 (dark) or 1 (light):

ggplot(as.data.frame(fpm.ei),
       aes(x = x, y = y, color = as.factor(ifelse(value < 0.88, 0, 1)))) +
  geom_point() +
  scale_y_reverse()   # flip the y axis so the image is right-side up

This is the elevation profile

Note: 0.88 was used because the mean of the data is 0.8812. 0.5 wouldn’t work because the minimum is 0.5294 – the foreground is grey, not black.
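
Those statistics are easy to pull, by the way – a quick check, not part of the pipeline:

# distribution of the grayscale values behind the threshold choice
v = as.data.frame(fpm.ei)$value
mean(v)   # ~0.8812
min(v)    # ~0.5294 – the grey foreground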

The next part is some processing. The processing does a few things:

  1. Do a gradient along the y axis – this removes the dashed lines
  2. Filter the matrix to just the black areas
  3. Take the minimum y location in each column to get the contour of the line

# y-direction gradient: the dashed gridlines produce nonzero values
fpm.gr <- imgradient(fpm.ei, "y")

# keep only the pixels where the vertical gradient is zero
fpm.grdf = as.data.frame(fpm.gr) %>%
  filter(value == 0)

# the topmost (minimum y) pixel in each column traces the line
fpm.grdf2 = fpm.grdf %>%
  group_by(x) %>%
  summarize(y = min(y))

To analyze, we need to relate the pixels to actual values. The x direction is easy – I used 0 and 26.2, the official distance in miles of a marathon. The y direction is not as simple, so I compared pixel locations against the axis labels in the images and related pixel rows to elevations – in the case of the Flying Pig, y = 27 corresponds to 800 feet MSL and y = 171 to 500 feet MSL. Since the relationship is linear (one pixel equals a fixed number of feet), I used a linear model. Glass City’s values are y = 28 at 660 feet MSL and y = 146 at 580 feet MSL.

lmFP = lm(yy ~ y,
  data = data.frame(y = c(27, 171), yy = c(800, 500))
)

summary(lmFP)

fpm.grdf2$elevation = predict(lmFP, newdata = fpm.grdf2)
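
One step not shown above is creating the mileage column used below; a minimal sketch, assuming the traced line spans the full width of the cropped chart:

fpm.grdf2$mileage = 26.2 * (fpm.grdf2$x - min(fpm.grdf2$x)) /
  (max(fpm.grdf2$x) - min(fpm.grdf2$x))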

ggplot(fpm.grdf2, aes(x = mileage, y = elevation)) + geom_point()
Elevation profile as points

This is what we want – the elevation points! Note the y axis is elevation in feet and the x axis is the mileage of the marathon. Doing the same exercise with the Glass City Marathon elevation chart yields this…
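
The Glass City chart gets the same treatment; a sketch using the calibration points quoted above, and assuming gcm.grdf2 was built from gcm.ei with the same gradient/filter/summarize steps:

lmGC = lm(yy ~ y,
  data = data.frame(y = c(28, 146), yy = c(660, 580))
)
gcm.grdf2$elevation = predict(lmGC, newdata = gcm.grdf2)
gcm.grdf2$mileage = 26.2 * (gcm.grdf2$x - min(gcm.grdf2$x)) /
  (max(gcm.grdf2$x) - min(gcm.grdf2$x))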

Let’s compare them, shall we? The code below formats the data, and then draws a chart.

# stack the two profiles, tagging each marathon and computing
# the point-to-point elevation change
compareM = rbind.data.frame(
  fpm.grdf2 %>%
    arrange(mileage) %>%
    mutate(Marathon = "Flying Pig",
           elevationGain = elevation - lag(elevation, 1)),
  gcm.grdf2 %>%
    arrange(mileage) %>%
    mutate(Marathon = "Glass City",
           elevationGain = elevation - lag(elevation, 1))
)

ggplot(compareM, aes(x = mileage, y = elevation, color = Marathon)) + geom_line(size = 1.2) +
  ylab("Elevation in Feet MSL") +
  xlab("Miles") +
  ggtitle("Marathon Elevation Comparison", "Flying Pig (Cincinnati) vs. Glass City (Toledo)")

So the chart above shows that the Glass City is mere child’s play compared to the Flying Pig.

XKCD-style Plots In R

May 17th, 2016

Because XKCD is cool…

Note that the data is totally made up.  Well, except that it certainly feels about right, given the Delta flights I was on going both to and from Denver…
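
The plot itself only takes a few lines; a minimal sketch using the xkcd package (assumes the package and its Humor Sans font are installed – and the data here is as invented as the original’s):

library(ggplot2)
library(xkcd)

# made-up turbulence ratings, in the spirit of the original
flights = data.frame(leg = c("To Denver", "From Denver"),
                     bumpiness = c(7, 9))

ggplot(flights, aes(x = leg, y = bumpiness)) +
  geom_bar(stat = "identity") +
  theme_xkcd() +
  ggtitle("Turbulence, as Remembered")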

Resulting Plot

Animated Charts from R

April 29th, 2016

In any iterative process, it is nice to see progress.

Recently, it was gravity model distribution (again), so I had a process in R to calibrate friction factors, and part of it exported the trip length frequency chart:

# builds names like TLFHBWOP_Iter1plot.png, one per iteration
ggsave(paste("TLF", Purpose, Period, "_Iter", iteration, "plot.png", sep=""))

So I was left with 30 charts showing progress.  So, for kicks, I animated the charts using ImageMagick:

One of the charts.

To do this, you need ImageMagick and the following command line:

convert -resize 75% TLFHBWOP_Iter%dplot.png[1-30] HBWOP.gif

The resize parameter is optional; I used it to reduce the image size down to something allowable by Twitter.
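
If you’d rather not leave R, the magick package wraps the same ImageMagick functionality; a sketch (not from the original post):

library(magick)

# read the 30 per-iteration charts, shrink them, and write an animated GIF
frames = image_read(sprintf("TLFHBWOP_Iter%dplot.png", 1:30))
frames = image_resize(frames, "75%")
image_write(image_animate(frames, fps = 5), "HBWOP.gif")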

Running Cube Scripts from R Scripts

April 26th, 2016

From time to time, it makes sense to not re-invent the wheel by building something in R that is already available in another program, like Cube Voyager.

This is incredibly easy:

system("voyager.exe script.S /Start /CloseWhenDone")

That’s it!  BUT…

If you want the output from a matrix available to R, the last line of your Cube script should include something like:

*cube2omxw output.mat

This will convert the output matrix to an OMX file.  The leading star tells Voyager to run the line from the command prompt.

Important requirements: voyager.exe and cube2omxw.exe must be in your PATH.
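
From there, you can pull the matrix into R. OMX files are HDF5 underneath, so one option is the rhdf5 package; a minimal sketch, assuming the converter wrote output.omx and guessing at the matrix name:

library(rhdf5)

h5ls("output.omx")    # list the file's contents; matrices live in the "data" group
mat = h5read("output.omx", "data/MW1")    # "MW1" is a guessed matrix name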

 

Simple Heatmap in R with ggplot2

March 18th, 2016

Heatmaps are an easy way to look at a lot of data at once.  The heatmap below is from a 30 day traffic count.  You can easily and quickly see that peaks in the morning are northbound, peaks in the afternoon are southbound, and peaks on the weekend are in the midday.

Heatmap

There isn’t a ton of code needed to do this:
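
A sketch of one way to draw it with ggplot2, under assumed file and column names (date, hour, direction, volume):

library(ggplot2)

# hypothetical 30-day count file
counts = read.csv("count_30day.csv")

ggplot(counts, aes(x = hour, y = as.Date(date), fill = volume)) +
  geom_tile() +
  facet_wrap(~direction) +
  scale_fill_gradient(low = "white", high = "red") +
  xlab("Hour of Day") + ylab("Date")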

Data to play with

This method could be used with shorter-term data with fewer steps.  I used this 30-day count because it tells more of a story than a two or three day count.

R + OMX County Flows

October 8th, 2015

In travel modeling, we use matrices to store things like zone-to-zone trip flows.  The matrix is like a table, but with an equal number of rows and columns, each representing a location in the model (a traffic analysis zone, like a Census Block Group).  In the OKI region, we have 2,299 zones, which means there are 5,285,401 cells.  Too many to look at, and we don’t have reliable data at that level.  However, we do have semi-reliable data at the county level.

The Open Matrix Format (OMX) is used to get these matrix files out of a proprietary format and into something that can be read by many programs.  We use this at OKI to move data out of Cube (a proprietary software product) and into R (an open source statistical programming package).

Summarizing a matrix to a county flow table in R is actually pretty easy:
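
A sketch of the idea, assuming an OMX trip matrix and a zone-to-county lookup file (all names invented):

library(rhdf5)
library(dplyr)

trips  = h5read("trips.omx", "data/trips")   # assumed file and matrix names
lookup = read.csv("zone_county.csv")         # assumed columns: zone, county

# flatten the matrix; as.vector is column-major, matching expand.grid's order
flows = expand.grid(i = seq_len(nrow(trips)), j = seq_len(ncol(trips)))
flows$trips = as.vector(trips)

flows %>%
  left_join(lookup, by = c("i" = "zone")) %>% rename(from = county) %>%
  left_join(lookup, by = c("j" = "zone")) %>% rename(to = county) %>%
  group_by(from, to) %>%
  summarize(trips = sum(trips))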

This doesn’t take very long, either.  Anyone familiar with R can see where the code can be revised to summarize to districts.

This is what the data looks like. Note that this is not verified data (please do not use it!).

Note: the reason Hamilton, Campbell, and Dearborn county names are cut off is related to a bug in Cube.  They (Citilabs) are aware of the bug, but it has not been fixed yet.

Standard Deviation Differences Between Excel and R (and my code in Cube Voyager)

July 24th, 2015

I had a need to get the correlation of count to assignment in Cube Voyager. I don’t know how to do this off the top of my head, and I’m instantly mistrusting of doing things I’d normally trust to R or Excel. So I looked on Wikipedia for the Pearson product-moment correlation coefficient and ended up at Standard Deviation. I didn’t make it very far down the page and used the first formula listed (the population form, without Bessel’s correction), which turned into Voyager code like this:

I left the print statements in, because the output is important.

Avg AADT_TRK = 1121.77
Avg VOLUME = 822.03
n = 230.00

sdx1 = 1588160175
sdy1 = 1196330474
n = 230.00
sd AADT_TRK = 2627.75
sd Volume = 2280.67
r2 = 155.06

Note the standard deviations above. Ignore the R2 because it’s most certainly not correct!

Again, mistrusting my own calculations, I imported the DBF into R and looked at the standard deviations:

> sd(trkIn$AADT_TRK)
[1] 2633.476
> sd(trkIn$V_1)
[1] 2285.64

Now Standard Deviation is pretty easy to compute. So WHY ARE THESE DIFFERENT?

Just for fun, I did the same in Excel:

WTF? Am I right or wrong???

So I started looking into it and recalled something about n vs. n-1 from the RMSE equation and discussion in the latest Model Validation and Reasonableness Checking Manual. So I decided to manually code the standard deviation in Excel and use sqrt(sum((x - xavg)^2) / (n - 1)) instead of Excel’s function:

Looky there! Matching Numbers!

It’s not that Excel is incorrect; the function I used just doesn’t apply Bessel’s Correction, while R’s sd() does.
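
The difference is easy to demonstrate in R with toy numbers (made up for illustration):

x = c(2, 4, 4, 4, 5, 5, 7, 9)
n = length(x)

sd(x)                             # R divides by n - 1 (Bessel's correction): ~2.14
sqrt(sum((x - mean(x))^2) / n)    # dividing by n gives the smaller value: 2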

/A

R + OMX: Trip Length Frequency Plots

May 8th, 2015

If you don’t follow me on Twitter or the Open Model Data site, you may have missed that Cube 6.4 made some DLL changes that rendered the prior version of the Cube2OMX converter unusable.  I jumped in (partly because I installed Cube 6.4) and fixed the problem.  You can get the source or a binary on GitHub.

I did this because sending matrices to DBF files means you’ll have a 500 MB DBF for a matrix with 3,200+ zones, and normal R routines chug on it.  By contrast, OMX files are around 10% of that size (60 MB), and R can read them quite quickly – in a matter of seconds instead of minutes.

So the first thing I wanted to do in R with OMX files is Trip Length Frequency Plots.  This isn’t incredibly quick, but it works.  The code is below.  According to one run in R, it takes around 6 minutes to run (on 3,312 zones).  The largest part is the loop in the function, and this could probably be parallelized (is that a word?) using doParallel.

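A condensed sketch of the approach, under assumed file and matrix names (the original wrapped this in a function, whose loop is the slow part):

library(rhdf5)
library(dplyr)
library(ggplot2)

trips = h5read("trips.omx", "data/trips")   # assumed names
time  = h5read("skims.omx", "data/time")

# bin each zone pair's travel time to whole minutes and sum the trips
tlf = data.frame(minutes = floor(as.vector(time)),
                 trips   = as.vector(trips)) %>%
  group_by(minutes) %>%
  summarize(trips = sum(trips))

ggplot(tlf, aes(x = minutes, y = trips)) + geom_bar(stat = "identity")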