#TRBAM Twitter Data Mining Project
I have been interested in playing around with Twitter as a data mining resource. Today, I happened to stumble upon an article on Getting Genetics Done that covers exactly that (just for a different conference).
I looked into their script, and it points to a Twitter command-line program called t.
That and a little bit of shell scripting gave me something I could run to get the tweets from the last 10 minutes:
The upshot is that each run gives me the last 10 minutes of tweets as CSV. This can easily be run via cron:
*/10 * * * * sh /root/trbam_tweets/searchTweets.sh >/var/www/tstat.txt 2>&1
I have the output redirected to somewhere I'll be able to check from DC, as I don't know what my access will be like or how much I'll be able to do before then. I will make the data available to other researchers since it is all public tweets... That being said, if I (@okiAndrew) follow you on Twitter and you've made your timeline private, contact me if you're concerned (or don't use "#trbam"). I don't know whether protected tweets would show up in the search - I DO have to be authenticated with Twitter, though.
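To illustrate the 10-minute window idea, here is a minimal Python sketch (not the shell script itself) that keeps only the CSV rows posted within a window before a given time. The column names (`ID`, `Posted at`, `Screen name`, `Text`), the timestamp format, the `recent_rows` helper, and the sample rows are all assumptions for illustration, so check them against what t actually emits.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Assumed layout of the "Posted at" column, e.g. "2012-11-03 14:05:00 +0000".
POSTED_AT_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def recent_rows(csv_text, now, window=timedelta(minutes=10)):
    """Return the rows whose "Posted at" falls within `window` before `now`."""
    cutoff = now - window
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        posted = datetime.strptime(row["Posted at"], POSTED_AT_FORMAT)
        if cutoff < posted <= now:
            rows.append(row)
    return rows


# Made-up sample data, not real tweets.
sample = (
    "ID,Posted at,Screen name,Text\n"
    '1,2012-11-03 14:05:00 +0000,example_user,"Looking forward to #trbam"\n'
    '2,2012-11-03 13:40:00 +0000,example_user,"Older #trbam tweet"\n'
)
now = datetime(2012, 11, 3, 14, 10, tzinfo=timezone.utc)
kept = recent_rows(sample, now)
print([row["ID"] for row in kept])  # ['1']
```

Because consecutive runs can still overlap or leave small gaps at the window edges, this kind of filter does not remove the need to deduplicate later.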
Duplicates and Misses
I am going to write some code (whenever I get some spare time) to import the CSV files into MySQL or CouchDB or something. This will allow me to use the tweet ID as a way to test for and remove (or not import) duplicates.
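A minimal sketch of that idea, assuming the CSV has `ID`, `Posted at`, `Screen name`, and `Text` columns (an assumption about t's output). It uses Python's built-in SQLite as a stand-in for MySQL or CouchDB, and the `import_tweets` and `count_tweets` helpers and the sample rows are hypothetical. The tweet ID is the primary key, so a row that was already imported is simply skipped.

```python
import csv
import io
import sqlite3


def count_tweets(conn):
    """Number of tweets currently stored."""
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]


def import_tweets(conn, csv_text):
    """Import CSV rows, skipping tweet IDs that already exist.

    Returns how many new rows were added.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id TEXT PRIMARY KEY, posted_at TEXT, screen_name TEXT, text TEXT)"
    )
    before = count_tweets(conn)
    rows = [
        (r["ID"], r["Posted at"], r["Screen name"], r["Text"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # The primary key on id makes duplicates a no-op instead of an error.
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return count_tweets(conn) - before


# Two made-up files from overlapping runs; tweet 101 appears in both.
HEADER = "ID,Posted at,Screen name,Text\n"
file_a = (
    HEADER
    + "100,2012-11-03 14:01:00 +0000,example_user,First #trbam tweet\n"
    + "101,2012-11-03 14:09:00 +0000,example_user,Second #trbam tweet\n"
)
file_b = (
    HEADER
    + "101,2012-11-03 14:09:00 +0000,example_user,Second #trbam tweet\n"
    + "102,2012-11-03 14:12:00 +0000,example_user,Third #trbam tweet\n"
)

conn = sqlite3.connect(":memory:")
first = import_tweets(conn, file_a)
second = import_tweets(conn, file_b)
print(first, second, count_tweets(conn))  # 2 1 3
```

The same approach should carry over to MySQL with a unique key on the ID column, though the exact "ignore duplicates" syntax differs between databases.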
As far as misses are concerned, that's just life. This script is being fired off every 10 minutes - there are 144 files from each day, there are 71 days left until the annual meeting starts as I type this, and TRBAM lasts for 5 days... so that's about 11,000 files (plus more because people will still talk about it afterwards). I'm not sure anyone has a count of how many tweets there were last year (and I'm not going looking), and Twitter's API may decide to hate me during this.
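For anyone checking the math, the file count works out like this (the variable names are just for illustration):

```python
MINUTES_PER_DAY = 24 * 60
interval_minutes = 10      # cron fires every 10 minutes
days_until_meeting = 71    # as of writing
meeting_days = 5           # length of TRBAM

files_per_day = MINUTES_PER_DAY // interval_minutes
total_files = files_per_day * (days_until_meeting + meeting_days)
print(files_per_day, total_files)  # 144 10944
```

So "about 11,000" is 10,944 before counting any post-meeting chatter.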
Where is this Going?
Many of the charts in the Getting Genetics Done article are great and can easily be done in R. I'll have a few more to add, I'm sure, and as soon as others get their hands on the data, there will be many more. I may also use Hadoop (or something) to do some text analysis.
Another place this will be going is #ESRIUC. I've submitted an abstract for their conference. I don't know if I'm going, but either way there's usually some good stuff there.