Using GAWK to Get Through CTPP Data

The 3-year CTPP website lacks a little in usability (just try getting a county-county matrix out of it).

One of the CTPP staff pointed me to the downloads, which are a double-edge sword. On one hand, you have a lot of data without an interface in the way. On the other hand, you have a lot of data.

I found it was easiest to use GAWK to get through the data, and it was pretty easy:

gawk '/.*COUNTY_CODE.*/' *.csv >Filename.txt

Where COUNTY_CODE is the code from Pn-Labels-xx.txt where n is the part number (1,2, or 3) and xx is the state abbreviation.

NOTE: Look up the county code EACH TIME.  It changes among parts 1, 2, and 3.

This command will go through all .csv files and output any line with the county code to the new file.


I have multiple counties to deal with.  There's an easy way to start on getting a matrix:

gawk '/C4300US.*(21037|21015|21117).*32100.*/' *.csv >TotalFlowsNKY.csv

This results in a CSV table of only the total flows from three Northern Kentucky counties (21037, 21015, 21117; Campbell, Boone, and Kenton county, respectfully).  For simplicity's sake, I didn't include all 11 that I used.

Finishing Up

Then, I did a little Excel magic to build a matrix for all 11 counties and externals.  The formula is shown.  I have an additional sheet which is basically a cross reference of the county FIPS codes to the name abbreviations I'm using.  See the image below (click for a larger version).

After this, I built a matrix in Excel.  The matrix uses array summation (when you build this formula, you press CTRL+Enter to set it up right, else the returned value will be 0).

Using these techniques, I was able to get a journey to work matrix fairly quickly and without a lot of manual labor.


You need to have GNUWin32 installed to use gawk.




Tags: , , ,

Comments from Other Sites

Comments are closed.