Note: This article was written in November of 2011 and documents the process I took to build an interactive map for a now-defunct side project of mine called Goalfinch. I shut down Goalfinch recently, but plenty of people found this article very informative at the time, so I’ve resurrected it here.
I’d been toying around with ideas for cool ancillary features for Goalfinch for a while, and finally settled on creating an interactive map of weight loss goals. I knew what I wanted: a Google-maps-style, draggable, zoomable, slick-looking map, with the ability to combine raster images and style-able vector data. And I didn’t want to use Flash. But as a complete geographic information systems (GIS) neophyte, I had no idea where to start. Luckily there are some new technologies in this area that greatly simplified this project. I’m going to show you how they all fit together so you can create your own interactive maps for the browser.
The main components of the weight loss goals map are:
- Browser-based JavaScript code that assembles the map and styles the data (Polymaps, Protovis, jQuery)
- Server-based application that provides the data for each layer (TileStache, MongoDB, PostGIS, Pylons)
- Server-based Python code that runs periodically to search Twitter and update the weight loss goal data
I’ll cover each component separately in upcoming posts, but I’ll start with a high-level description of how the components work together for those of you who are new to web-based interactive maps.
Serving information-rich content to the browser requires programmers to think carefully about performance. For an interactive, detail-filled map of the globe, we could serve a single, very high-resolution image, but it would take a while to load. If we want our map to show up and be usable right away, we need a different strategy. That’s why most online maps (such as Google Maps) use a technique called tiling. With tiling we load a series of smaller images (tiles) that cover the visible map area and dynamically load tiles covering other areas only as the user pans to them. Tiles can be images stitched together by the browser or vector data for a particular geographical region. This lets us display the map relatively quickly without having to wait for the non-visible images to load. Another advantage to tiling is that we can load different tiles for different zoom levels. So when the map initially appears zoomed all the way out we don’t have to overload the browser with all the geographic complexities that won’t even be discernible at this scale.
So we have Polymaps assembling the map in the browser on the fly, but where is this data coming from? The short answer is: wherever we want. Here’s the long answer.
For the image tiles, the conventional approach has been to collect a bunch of geographic data from somewhere like OpenStreetMap, shove it into a database, and use that data to render PNG files for the various zoom levels. If you want complete control over how your image tiles look, this is the only way to go. I, however, only wanted a basic, monochrome gray map on which to overlay SVG, and found the perfect solution in the CloudMade Maps API and their free developer account. So rather than building and hosting the map tiles myself, I was able to pull in map tiles from CloudMade’s servers in my Polymaps code.
The vector data for state and county boundaries is served from my own server as GeoJSON using a combination of TileStache – a cache for image and vector map tiles – and PostgreSQL/PostGIS. Integrating TileStache with my existing Pylons application was a breeze. Learning all about PostGIS, shapefiles, SRIDs, projections, and polygon simplification was quite a bit more pain for me, so hopefully my upcoming post on that will help other newcomers get these details right.
Finally, to get the data I was actually interested in, I wrote a Python script to repeatedly ask Twitter’s search API for tweets related to weight loss in each county across the US, store the results in MongoDB, and do some simple natural language processing to determine how much weight each user wanted to lose. This made it possible to calculate the average weight loss goals of Twitter users on a per-location basis.
Our map is assembled by rendering a layer of image tiles and overlaying one or more layers of vector data on top of it. In our case, we have a layer for state boundaries and another layer for county boundaries. When a portion of the map becomes visible (as a result of the page initially loading or the visitor scrolling or zooming to a new area) Polymaps asks the server for the geographic data contained in that region. The server responds with a chunk of GeoJSON that contains any features that should be drawn on the map.
The “raw” data for both layers can be found on the US Census Bureau’s website (county boundaries, state boundaries) in the shapefile format. Shapefiles are binary files that contain property data (name, identifier code, etc.) and geometric data (points, lines, polygons) about geographic features. The geometric data is either unprojected – that is, it represents real longitude/latitude positions of points on the globe – or projected – where it has been passed through a transform and now represents points on a plane. Projected data must be displayed on maps that use the same projection, otherwise features won’t line up. You can read more about map projections here.
The shapefiles provided by the US Census Bureau are unprojected. Polymaps also expects the GeoJSON it renders to be in unprojected longitude/latitude format – it handles transforming the data into the map’s spherical mercator projection before drawing SVG polygons. So we’re all set to go, right? Let’s just find a utility to convert our shapefile into GeoJSON, make it available as a static file on our server, and add it as a layer in Polymaps. Done!
Except now our visitor’s browser has to download 30 megabytes of JSON before drawing anything on our map. Damn.
As I mentioned in the previous post, tiling is used to avoid having to download huge map images, and we can tile our GeoJSON as well. Enter TileStache, an excellent Python package that was originally built to serve image tiles and can serve GeoJSON tiles too. There may be other options for serving GeoJSON tiles out there, but I chose TileStache because its WSGI entry point made it easy to call from within my existing Pylons application. The TileStache server listens for requests for a URL like /counties/5/7/9.json, where 5 is the zoom level and 7 and 9 are the column and row of the tile, and returns a GeoJSON response describing any geographical features in that region:
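A county tile response might look something like this (a hand-written illustration – real tiles carry the Census Bureau’s actual property names and far more coordinates):

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "id": 1234,
      "properties": {"NAME": "Example County"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-93.51, 41.86], [-93.02, 41.86],
                         [-93.02, 41.51], [-93.51, 41.51],
                         [-93.51, 41.86]]]
      }
    }
  ]
}
```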
TileStache also caches these results so it doesn’t have to perform the database lookup next time someone asks for the same URL.
At the time of this writing, TileStache can query either a PostGIS database or Solr for its geographic data. It uses a bounding box query to ask the database, “What do you have for me in this region?” and doesn’t know how to search a raw GeoJSON file or shapefile. So I had to get my shapefiles into one of these databases.
I chose PostGIS because it seems like a widely-used standard for this type of thing. PostGIS is an add-on for PostgreSQL, and getting the two installed on my Ubuntu Lucid server was as easy as
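something like the following (package names from the Lucid era; newer releases bundle PostGIS differently):

```
sudo apt-get install postgresql-8.4 postgresql-8.4-postgis
```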
For my Mac OS X development environment, I had good luck with William Kyngesburye’s packages.
Now, this is where my lack of experience with GIS really started to hurt. Massaging geographic data from one format into another is a confusing process if you aren’t familiar with the different types of projections, file formats, and cryptic option flags for the command line tools that come with PostgreSQL and PostGIS. Here’s what finally worked for me to get my unprojected shapefiles imported into PostGIS with the proper projections and indexes.
Transform shapefile to spherical mercator projection
Yes, TileStache serves up unprojected GeoJSON, but for some reason it requires the data in PostGIS to be in the Google Spherical Mercator projection. So before we load our shapefiles in, we transform them with ogr2ogr, a utility from GDAL:
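The command looks something like this (note ogr2ogr’s convention of taking the output file before the input file):

```
ogr2ogr -s_srs EPSG:4326 -t_srs EPSG:900913 states_900913.shp st99_d00.shp
```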
Here, EPSG:4326 is the spatial reference identifier (SRID) for the source’s longitude/latitude projection, EPSG:900913 is the SRID for the spherical mercator projection we need, st99_d00.shp is our input shapefile, and states_900913.shp is our output shapefile.
Convert shapefile to SQL file
Now we can use shp2pgsql, a utility in PostGIS, to create a SQL file that can be loaded into the database.
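For the state boundaries, the command looks something like this (the target table name, here states, is whatever you choose on the command line):

```
shp2pgsql -s 900913 -d -I -W LATIN1 states_900913.shp states > states.sql
```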
We tell shp2pgsql the SRID of the projection we’ll be using (-s 900913), that we want to drop the table and recreate it when this file is loaded (-d), that it should create an index on our data (-I), and that the input file’s encoding is “LATIN1” (-W LATIN1, yours may be different). The table the data goes into is named on the command line, and states.sql is our output file.
Create the database and load the files
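A sketch of the commands, assuming a database named weightmap and a counties.sql file produced the same way as states.sql; the PostGIS script locations vary by version and platform:

```
createdb weightmap
createlang plpgsql weightmap
psql -d weightmap -f /usr/share/postgresql/8.4/contrib/postgis.sql
psql -d weightmap -f /usr/share/postgresql/8.4/contrib/spatial_ref_sys.sql
psql -d weightmap -f states.sql
psql -d weightmap -f counties.sql
```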
This creates the database, installs the plpgsql language on it, sets it up to use PostGIS, and loads in our state and county data. Depending on how you have your PostgreSQL users and permissions set up, you may need to use -U postgres or something similar for these commands.
Great! Now we have our county and state boundaries in a form that’s queryable by TileStache. Let’s get TileStache configured to serve up GeoJSON tiles from this database.
The TileStache documentation tells us how to configure our tile server to serve GeoJSON. Specifically, we can use the PostGeoJSON Provider to respond to requests by searching PostGIS. Here’s what my configuration looks like:
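A configuration along these lines would define the two layers – the cache path, dsn, queries, column names, and the module path for the custom provider shown here are all assumptions, not the app’s actual values:

```json
{
  "cache": {"name": "Disk", "path": "/tmp/stache"},
  "layers": {
    "states": {
      "provider": {
        "class": "weightmap.providers:SimplifyingGeoJSONProvider",
        "kwargs": {
          "dsn": "dbname=weightmap user=postgres",
          "query": "SELECT gid, name, the_geom FROM states WHERE the_geom && !bbox!",
          "id_column": "gid",
          "geometry_column": "the_geom"
        }
      }
    },
    "counties": {
      "provider": {
        "class": "weightmap.providers:SimplifyingGeoJSONProvider",
        "kwargs": {
          "dsn": "dbname=weightmap user=postgres",
          "query": "SELECT gid, name, the_geom FROM counties WHERE the_geom && !bbox!",
          "id_column": "gid",
          "geometry_column": "the_geom"
        }
      }
    }
  }
}
```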
You’ll notice that I have two layers set up, one for county boundaries and the other for state boundaries. Instead of the PostGeoJSON.Provider, however, I’m using SimplifyingGeoJSONProvider, which is a subclass I wrote to handle polygon simplification for different zoom levels. Why? Because when PostGeoJSON.Provider gets geographic features back from PostGIS, it converts them directly to GeoJSON and sends them along with every little detail intact. If we’re at a low zoom level (we’re viewing a large area of the map), those little details won’t be visible and will just increase the size of the file that has to be transferred. We only want to show those details as we increase our zoom level. So the SimplifyingGeoJSONProvider checks what zoom level we’re requesting and performs more aggressive polygon simplification for lower zoom levels, keeping more detail intact for higher zoom levels.
Here’s the source code for the original PostGeoJSON provider, and here are the relevant bits of my SimplifyingGeoJSONProvider:
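A minimal sketch of the idea (the tolerance values are illustrative, and the real class subclasses TileStache’s PostGeoJSON provider rather than standing alone):

```python
# Coarser simplification (bigger tolerance, in projected-map units) at
# low zoom levels; full detail once we're zoomed in far enough.
ZOOM_TOLERANCES = {
    0: 50000.0, 1: 25000.0, 2: 12000.0, 3: 6000.0,
    4: 3000.0, 5: 1500.0, 6: 700.0, 7: 300.0, 8: 100.0,
}

def tolerance_for_zoom(zoom):
    """Return the simplification tolerance for a tile's zoom level."""
    return ZOOM_TOLERANCES.get(zoom, 0.0)  # 0.0 = keep every vertex

# Inside the provider, each geometry coming back from PostGIS would be
# run through Shapely before its coordinates are written out as GeoJSON:
#   shape = shape.simplify(tolerance_for_zoom(coord.zoom))
```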
So we look up the tolerance for the requested zoom level and then call simplify() (part of the Shapely package) on the geometry with that tolerance before writing the coordinates out to JSON.
The final step is to call TileStache within my maps controller:
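A sketch of what that controller action could look like – the function name, config filename, and the shape of the handoff are assumptions; TileStache’s requestHandler takes a parsed configuration and a PATH_INFO-style path:

```python
def tiles_action(environ, tilestache_config='tilestache.cfg'):
    """Hypothetical Pylons controller action that delegates to TileStache."""
    import TileStache  # imported lazily so this sketch stays importable
    config = TileStache.parseConfigfile(tilestache_config)
    # TileStache resolves the layer/zoom/column/row from the path and
    # returns the tile's mimetype and body (serving from cache if it can)
    mimetype, content = TileStache.requestHandler(config, environ['PATH_INFO'])
    return mimetype, content
```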
And to get the URL paths working nicely, I had to add a route in my Pylons routing configuration:
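Something like this Routes mapping (the exact pattern syntax and action name here are assumptions):

```python
map.connect('/maps/tiles/{path_info:.*}', controller='maps', action='tiles')
```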
So requests to /maps/tiles/counties/6/23/18.json will get routed to the maps controller, but TileStache will see the request’s PATH_INFO as if it were for just /counties/6/23/18.json.
Now we’re ready to serve GeoJSON tiles to Polymaps. I’ve shown how I use TileStache to serve GeoJSON tiles of US county and state boundary data for the map, and touched a bit on how Polymaps requests and assembles these tiles and displays them as polygons on top of standard image tiles. Now I’ll delve more into Polymaps and the rest of the client-side code. I’ll also show how I collect and parse weight loss goals from Twitter and display them on the map.
Setting up Polymaps
First we need to include Polymaps and a couple of other scripts on our map page:
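The includes look something like this (the filenames come from the list below; the paths are assumptions):

```html
<script type="text/javascript" src="/js/polymaps.min.js"></script>
<script type="text/javascript" src="/maps/weightdata"></script>
<script type="text/javascript" src="/js/protovis.min.js"></script>
<script type="text/javascript" src="/js/weightmap.js"></script>
```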
- polymaps.min.js is the Polymaps library,
- weightdata provides the average weight loss goal by county ID (more on this later),
- protovis.min.js is the Protovis library, used to calculate quantiles for the data, and
- weightmap.js is the actual code used to create the map.
We also need a placeholder element that will contain our map once it’s built:
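A bare element is enough (the id is an assumption):

```html
<div id="map"></div>
```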
The Polymaps code is pretty straightforward. Here are the relevant bits from weightmap.js:
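A hedged reconstruction using the standard Polymaps API – the CloudMade API key, style id, center, zoom, and URL paths are placeholders, and the layer callback is defined further below:

```javascript
var po = org.polymaps;

// Create the map inside our placeholder element and make it interactive
var map = po.map()
    .container(document.getElementById("map").appendChild(po.svg("svg")))
    .center({lat: 39, lon: -96})
    .zoom(4)
    .add(po.interact());

// Base layer: gray image tiles from CloudMade's servers
map.add(po.image()
    .url(po.url("http://{S}tile.cloudmade.com/YOUR-API-KEY/998/256/{Z}/{X}/{Y}.png")
        .hosts(["a.", "b.", "c.", ""])));

// Vector layers: GeoJSON tiles served by TileStache
map.add(po.geoJson()
    .url("/maps/tiles/counties/{Z}/{X}/{Y}.json")
    .on("load", onload_counties));

map.add(po.geoJson()
    .url("/maps/tiles/states/{Z}/{X}/{Y}.json"));
```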
This sets up our map and adds the image and SVG layers to it.
It also references a global object, weightdata, that contains a mapping of county ID to the average weight loss goal for that county. This creates a map that we can zoom and pan around in, but we aren’t yet showing weight loss data for each county. This gets set up in the
onload_counties function, which gets called each time we load new county boundaries from GeoJSON. Here’s onload_counties and a couple of other functions we use to show the tooltip when the mouse hovers over a county:
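A hedged reconstruction – quantile() (bucketing a goal into 0..8, e.g. via a Protovis quantile scale), the county ID field, and the tooltip element id are all assumptions:

```javascript
function onload_counties(e) {
  // Polymaps hands us the tile's features: each has the parsed GeoJSON
  // (feature.data) and the SVG path it was rendered to (feature.element)
  e.features.forEach(function(feature) {
    var id = feature.data.id;          // county identifier (assumed field)
    var goal = weightdata[id];
    var el = feature.element;
    if (goal != null) {
      el.setAttribute("class", "q" + quantile(goal));  // q0 .. q8
    } else {
      el.setAttribute("class", "no-quantile");
    }
    el.addEventListener("mouseover", function() { countyDetail(goal); }, false);
    el.addEventListener("mouseout", hideDetail, false);
  });
}

function countyDetail(goal) {
  // Show a tooltip-like dialog with the area's average goal
  var text = (goal != null)
      ? "Average goal: " + goal.toFixed(1) + " lbs"
      : "No data for this county";
  $("#county-detail").text(text).show();
}

function hideDetail() {
  $("#county-detail").hide();
}
```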
Pretty self-explanatory. For each county polygon that we load, we assign a few event listeners that handle mouse events over that shape and style it according to the average weight loss goal in the region. If we have data for it, we give it a CSS class of
q0 for the lowest weight loss goals and
q8 for the highest. Counties with no data get a
no-quantile class. When the mouse enters a county (see
countyDetail), we look up the average weight loss goal for the area and display it in a tooltip-like dialog. I’m using a bit of jQuery here because I already have it included on the rest of my site, but this could be rewritten without it.
Twitter Weight Loss Goals
The actual data that we care about and want to show on this map is in the /maps/weightdata script. It looks like this:
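Roughly like this, that is – the county IDs and values below are made up for illustration:

```javascript
var weightdata = {
  "1001": 14.2,
  "1003": 11.8,
  "6075": 9.5
};
```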
A first cut at the search process would (if you were using Python, at least) look something like this:
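A minimal sketch of that first cut – the names and structure are assumptions, with `search` standing in for the Twitter Search API request and `store` for the MongoDB insert:

```python
def scrape_tweets(terms, locations, search, store):
    """For every search term and query location, fetch matching tweets
    from the Search API and persist each one, tagged with the term and
    location it was found for."""
    saved = 0
    for term in terms:
        for location in locations:
            for tweet in search(term, location):
                tweet['term'] = term
                tweet['location'] = location
                store(tweet)
                saved += 1
    return saved
```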
scrape_tweets constructs a query for the Search API, makes the request, parses the JSON results, and stores them in Mongo. At the highest level that’s exactly what my script does, but there are some refinements that I needed to make to get things running efficiently and to keep Twitter from hating me.
First, the locations. Since I wanted data for each individual county, the temptation was to make
locations a list of the geographic center of every single county in the United States. According to the county details I pulled from the US Geological Survey’s site, this would have resulted in 3219 requests for each search term I was interested in. With the Twitter Search API’s rate limit, this would have taken forever. I needed to cut down the number of locations I was searching.
Fortunately, this was relatively easy to do. It’s obvious that there are lots of Twitter users in metropolitan centers like New York and San Francisco and that Twitter users in rural Montana are relatively sparse. Consequently, most of the searches near locations in sparsely populated areas would return only a few results at most. Where population density was low, I could chunk counties together into larger regions and search across all of them. Where population density was high, I could fall back to searching individual counties. At the expense of reduced resolution in sparsely populated areas, this would cut the number of requests I had to make.
So that’s what I did.
QueryLocation is a class that encapsulates a location and radius to search near. It could consist of a single county or a group of counties. In
load_county_data we start out with a list of one
QueryLocation for every county in the US. Then we look at each one, and if its total population is below our threshold we find the nearest neighboring
QueryLocation and add it to this one. We repeat this process, combining neighboring regions, until all locations have a population above our threshold. Now we have a list (
new_locations) of regions to search. For a location that consists of a single county, we use its geographic center as our search location and approximate our search radius as if the county were a perfect circle. For a location consisting of a group of counties, we calculate the average geographic center by looking at the coordinates for each county in the group. Our search radius is approximated by finding the total area of the group and assuming we’re dealing with a circle with that area.
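The merging pass above can be sketched like this – the class and function names are assumptions, and distances are naive planar ones, which is fine for picking a nearest neighbor:

```python
import math

class QueryLocation(object):
    """A region to search near: a center point, an area, and a population.
    Starts as a single county and can absorb its neighbors."""
    def __init__(self, lat, lon, area, population):
        self.lat, self.lon = lat, lon
        self.area = area
        self.population = population
        self.count = 1  # number of counties merged into this region

    def distance_to(self, other):
        return math.hypot(self.lat - other.lat, self.lon - other.lon)

    def absorb(self, other):
        """Merge a neighboring region into this one, keeping the running
        average of the county centers as our search location."""
        n = self.count + other.count
        self.lat = (self.lat * self.count + other.lat * other.count) / n
        self.lon = (self.lon * self.count + other.lon * other.count) / n
        self.area += other.area
        self.population += other.population
        self.count = n

    @property
    def radius(self):
        # approximate the (merged) region as a circle with the same area
        return math.sqrt(self.area / math.pi)

def combine_sparse_locations(locations, threshold):
    """Repeatedly merge any under-threshold region with its nearest
    neighbor until every region's population meets the threshold."""
    locations = list(locations)
    while True:
        small = next((l for l in locations if l.population < threshold), None)
        if small is None or len(locations) == 1:
            return locations
        nearest = min((l for l in locations if l is not small),
                      key=small.distance_to)
        nearest.absorb(small)
        locations.remove(small)
```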
Obviously, assuming each region is a circle is not perfect. When we search for tweets near the center of the region and within a certain radius, we may miss tweets that are in some corners not covered by the bounding circle, and we may find tweets that belong to other
QueryLocations. To solve the first problem, I added
radius_feather, which increases the search radii by a little bit in order to reach corners and other geographic features that extend far away from a region’s center. I ignored the second problem by simply noting every
QueryLocation that found a tweet; tweets found inside two regions count toward the average weight loss goal in both.
With this technique I was able to cut the number of requests I needed to make by 70%, while still retaining good resolution in more densely populated areas.
Intelligent Search Timing
Since this script is running on a daily basis, most of the results for each query will be the same, with only a few new tweets that happened since we last looked. So we don’t need to perform the same set of searches every single day. To further limit the number of requests this script makes every day, it saves the last time it searched for a particular term near a particular location. The next time the script is run, it checks to see if the current request was made within the last n days. If it was, the script skips the request. This is how it looks:
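A sketch of that check (names are assumptions; the dict here stands in for the SearchTime collection, keyed by term and location):

```python
import datetime

def should_search(search_times, term, location_id, refresh_days, now=None):
    """Skip a term/location query if we already ran it within the last
    refresh_days days; otherwise record the new search time."""
    now = now or datetime.datetime.utcnow()
    last = search_times.get((term, location_id))
    if last is not None and (now - last).days < refresh_days:
        return False  # searched recently; skip this request
    search_times[(term, location_id)] = now
    return True
```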
SearchTime is a minimongo collection that saves the last time we searched for a term near a location.
refresh_days, in my case, is 3, but this can be adjusted lower for searches with a higher rate of new tweets.
You’ll also notice another parameter, since_id. This is the ID of the last tweet we found for this search term and location. We pass this along in the API request so that we only get new tweets that we haven’t seen yet.
Calculating Average Weight Loss Goals
Now we have a collection of tweets sitting in MongoDB, categorized by search term and the county (or counties) each tweet came from. Next we need to determine how much weight the author of each tweet wants to lose. When a visitor loads the map page and the browser requests
/maps/weightdata, the Pylons controller calls this:
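A sketch of the per-county averaging (field names are assumptions, and find_pounds is injected so the sketch stays self-contained):

```python
from collections import defaultdict

def average_goals(tweets, find_pounds):
    """Sum the parsed weight loss goal for every tweet in each county,
    then divide to get the per-county average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tweet in tweets:
        pounds = find_pounds(tweet)
        if pounds is None:
            continue  # tweet couldn't be parsed definitively
        for county_id in tweet['counties']:
            totals[county_id] += pounds
            counts[county_id] += 1
    return dict((cid, totals[cid] / counts[cid]) for cid in totals)
```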
We look at each tweet in the collection, determine how many pounds the author wants to lose (
find_pounds(tweet)), and keep a running sum for every county. Then we find the average for each county and return the result as JSON. Finally, the result of the calculation gets cached using Beaker’s
cache_region decorator. Subsequent requests will use the cached value rather than calculating the averages every time. Of course, when the Twitter search script runs and adds new tweets to the collection it invalidates the cache so the averages get refreshed.
The find_pounds() function is a bit of special sauce, and I won’t get into too many details there. I initially went down the road of using NLTK to parse, tag, and stem the text content of each tweet and use that to find subjects and direct objects that quantified weight. Unfortunately I couldn’t get this natural language processing-based approach to recognize even half of the weight loss goals. Twitter users do all sorts of crazy stuff that confused NLTK: they shorten words and phrases to stay within 140 characters, sprinkle hashtags and URLs everywhere, and don’t bother to spell check. So I settled on a more brute-force approach. Roughly:
- Remove URLs, extraneous punctuation, and other non-English tokens
- Find every token in the tweet that could be interpreted as a number, such as “5”, “five”, “a few”, and so on.
- Mark locations of each part of the search term (i.e. “lose”, “pounds”) within the text
- Look for patterns such as “lose x pounds”, “I have x more pounds to lose”, and lots more
If a tweet can’t be definitively parsed, it’s skipped. This pattern-based approach does much better: it’s able to figure out the number of pounds an author is referring to about 90% of the time.
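The steps above can be sketched like this – the word list and the two patterns are illustrative stand-ins for the real “special sauce,” which handled many more phrasings:

```python
import re

# Tokens that can be read as numbers (the real list was longer)
WORD_NUMBERS = {
    'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
    'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10,
    'a few': 3,
}

def find_pounds(text):
    """Return the number of pounds a tweet's author wants to lose,
    or None if the tweet can't be parsed definitively."""
    text = re.sub(r'https?://\S+', '', text)   # strip URLs
    text = re.sub(r'[#@]\w+', '', text)        # strip hashtags and mentions
    text = text.lower()
    # normalize number-like words into digits
    for word, value in WORD_NUMBERS.items():
        text = re.sub(r'\b%s\b' % word, str(value), text)
    # a couple of illustrative patterns around "lose" and "pounds"
    patterns = [
        r'lose\s+(\d+)\s+(?:more\s+)?(?:pounds|lbs)',
        r'(\d+)\s+(?:more\s+)?(?:pounds|lbs)\s+to\s+lose',
    ]
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return int(m.group(1))
    return None
```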
So that’s that. I covered pretty much every layer and component used to create an interactive map entirely without Flash. I had a lot of fun with this project, so I’ll be building more of this type of interactive map in the future. Hopefully the details in this series of posts will help future developers and hackers get up to speed on similar projects as well. If you have any questions about this project or your own efforts, don’t hesitate to get in touch. Thanks again for reading!