Mapping Public Opinion: A Tutorial

At the upcoming 2012 summer meeting of the Society of Political Methodology, I will be presenting a poster on Isarithmic Maps of Public Opinion. Since last posting on the topic, I have made major improvements to the code and robustness of the modeling approach, and written a tutorial that illustrates the production of such maps.

This tutorial is in a very rough draft form, but I will post it here when it is finalized. (An earlier draft had some errors, and so I have taken it down.)

Isarithmic Maps of Public Opinion Data

As a follow-up to my isarithmic maps of county electoral data, I have experimented with extending the technique in two ways. First, where the electoral maps are based on data aggregated to the county level, I have sought to generalize the method to accept individual responses for which only zip code data is known. Further, since survey respondents are not distributed uniformly across the geographic area of the United States (tending to be concentrated in more populous states and around cities), I have attempted to convey a sense of uncertainty or data sparsity through transparency. Some early products of this experimentation can be seen below.

Party Identification

Isarithmic map of party identification from the 2008 CCES.


Choropleth tutorial and regression coefficient plots

About two weeks ago, I gave a short talk at Duke, wherein I presented a brief tutorial on creating choropleth maps in R using ggplot2. Since the code is already written, and the data and shapefiles already hosted online, I thought I would share the tutorial more widely.

A .ZIP file containing all the files necessary to follow the tutorial is available at:

The script goes very briefly through the loading of shapefiles and presidential election returns, and ends with the production of the choropleth below.


I don’t get into further customization of the map, as there are other more authoritative and complete sources for that. Further, much more detailed instruction on reading shapefiles is available from CSISS and NCEAS.

Included at the very end of the script is a brief example of a regression coefficient plot, something like a ggplot2 version of the coefplot() in Andrew Gelman’s arm package.

I decided to develop the example into a function that takes as input a list() of model objects, and returns a ggplot2 object, which can be further modified by the user if so desired.

A coefficient plot comparing three models.

The script for the above plot can be found here. I also wrote a function that eschews arbitrarily discrete confidence bounds, instead attempting to suggest a sense of our confidence in the estimate without choosing a specific interval, since the difference between significance and insignificance is not itself significant. Code for the function is available here, and an example can be seen below.

Smoothed standard error coefficient plots

Electoral Marimekko Plots

To be reductive, visual displays of quantitative information might be reasonably categorized on a continuum between “data display” and “statistical graphics.” By statistical graphics, I mean a plot that displays some summary of or relationship amongst several variables, likely having undergone some processing or analysis. This may be as simple as a scatterplot of a primary independent variable and the dependent variable, a boxplot, or a graphical regression table.

In this reductive scheme, then, “data displays” present variables in raw form — for use in exploratory data analysis, or perhaps just to offer the viewer access to all of the data. Where “statistical graphics” might be best served by simplicity and minimalism in design, such that a single idea might be conveyed clearly, “data displays” will tend to be inherently complex, and require effort from both the creator and viewer to parse meaning from the available information.

Where statistical graphics are ideal for presenting conclusions, data displays are useful for generating ideas, and optimally, permitting the relatively rapid identification of relationships between multiple variables. On top of this, I might add that many of the more well-regarded data displays of recent note offer macro-level insight as well as the opportunity to ascertain specific details (for this, interactivity is often valuable, as in the internet-classic New York Times box office visualization).

As several recent posts suggest, I am interested in finding ways to successfully and clearly convey multidimensional data, and have been focusing on political data as it varies across geopolitical units and time. Here I offer an approach which departs from the spatial basis of other recent efforts in favor of allowing the position of graphical objects to convey other variables.

County Vote Spinogram (Turnout), 1992

County Vote Marimekko Plot, 1992, sorted by votes cast.

This type of plot is called, variously, a spinogram, a mosaic plot, or a marimekko, and is not dissimilar from a treemap with a different organizational structure (other examples). The utility of this plot type is that it can spatially convey four numeric variables (x position, y position, height, width), and color can be added to incorporate up to three additional variables (R, G, B). Further, there is a straightforward geometric interpretation of each cell: the areas of the cells (in this case, width × height, or state turnout × county proportion of state turnout) are directly comparable.
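The cell geometry described above can be sketched in a few lines. This is a minimal Python illustration, not the R code behind the plots; the function name marimekko_cells and the toy turnout figures are hypothetical. Column width is a state's share of all votes cast, and cell height is a county's share of its state's votes, so each cell's area equals that county's share of national turnout.

```python
def marimekko_cells(turnout):
    """Lay out marimekko cells in the unit square.

    turnout maps state -> {county: votes cast} (hypothetical input shape).
    Returns a list of (state, county, x0, x1, y0, y1) rectangles.
    """
    total = sum(sum(counties.values()) for counties in turnout.values())
    cells = []
    x0 = 0.0
    for state, counties in turnout.items():
        state_total = sum(counties.values())
        width = state_total / total          # state share of all votes
        y0 = 0.0
        for county, votes in counties.items():
            height = votes / state_total     # county share of state votes
            cells.append((state, county, x0, x0 + width, y0, y0 + height))
            y0 += height
        x0 += width
    return cells


# Toy data, invented for illustration only.
toy = {"A": {"a1": 10, "a2": 30}, "B": {"b1": 60}}
cells = marimekko_cells(toy)
areas = {c[1]: (c[3] - c[2]) * (c[5] - c[4]) for c in cells}
```

In the toy data, county a2 holds 30 of 100 votes, and its cell covers 0.4 × 0.75 = 0.3 of the plot area.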

Unlike a stacked bar plot, the width of each column conveys information, permitting height to convey proportion rather than count. Further, columns and cells within columns can be sorted to express the ordering of variables of interest. In some ways, these can be seen as extreme reinterpretations of (Dorling) cartograms, in which not only the size and shape of political boundaries, but also their position, are distorted by other variables.

County Vote Spinogram (Dem 2PV), 1924

County Vote Marimekko Plot, 1924, sorted by Democratic share of the two-party vote.

In the plots above, cells are colored according to the strength of Democratic (blue), Republican (red), and other party (green) support, and counties whose turnout represents greater than 1% of the total turnout in an election are labeled.

I present two different layouts for the cells in each plot. The first arrays states left-to-right in order of the number of votes cast in an election, and sorts counties bottom-to-top in the same order. Thus, more populous states are on the right, and more populous counties are at the top of the plot. This arrangement allows the viewer to observe the effects of population density both within and across states, and may better facilitate tracking changes in county or state politics over time.

The second layout sorts states left-to-right, and counties bottom-to-top, in order of the Democratic share of the two-party vote (Dem Votes / (Dem Votes + Rep Votes)). Thus, more Democratic-leaning (relative to Republican) states are on the right, and counties that were more supportive of Democratic candidates are at the top. I believe that this arrangement makes it easier to discern overall trends in partisanship across time, as the total “sum” of red within a diagram is relatively easy to compare to the total “sum” of blue (and green).
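The ordering used in the second layout follows directly from the formula in parentheses. Below is a small Python sketch of that sort; the names and vote counts are made up for illustration, and third-party votes are simply ignored, as the two-party share implies.

```python
def dem_two_party_share(dem_votes, rep_votes):
    """Democratic share of the two-party vote: Dem / (Dem + Rep)."""
    return dem_votes / (dem_votes + rep_votes)


def sort_by_dem_share(units):
    """Order unit names from least to most Democratic.

    units maps a name to a (dem_votes, rep_votes) pair (hypothetical shape).
    """
    return sorted(units, key=lambda name: dem_two_party_share(*units[name]))


# Invented counts; left-to-right (or bottom-to-top) order is X, Z, Y.
toy_units = {"X": (40, 60), "Y": (70, 30), "Z": (50, 50)}
ordered = sort_by_dem_share(toy_units)
```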

I have attempted to make my R code fairly general, and it is available for download here, although it will obviously require some modifications for other applications. Our approaches differ, but another instructive example can be found at Learning R.

Isarithmic History of the Two-Party Vote

A few weeks ago, I shared a series of choropleth maps of U.S. presidential election returns, illustrating the relative support for Democratic, Republican, and third-party candidates since 1920. The granularity of these county-level results led me to wonder whether it would be possible to develop an isarithmic map of presidential voting using the same data.

Isarithmic maps are essentially topographic or contour maps, wherein a third variable is represented in two dimensions by color, or by contour lines, indicating gradations. I had never seen such maps depicting political data, certainly not election returns, and thus sought to create them.

There is a trade-off between an isarithmic depiction and a choroplethic depiction, in which a third variable is shown within discrete political boundaries. Though a politically delineated presentation better facilitates the connection of the variable of interest to the level at which it was measured, the superimposition of geographically arbitrary political boundaries may cloud the existence of more general regional patterns.

Election-year maps can be seen in a slideshow here (and compared to the three-color choropleth maps here). The isarithmic depiction does an excellent job of highlighting several broad patterns in modern U.S. political history.

2008 Isarithmic Map

First, it does a good job of depicting local “peaks” and “valleys” of partisan support clustered around urban areas. In the 2008 map, for example, Salt Lake City, Denver, Chicago, Miami, Memphis, and many other cities stand apart from their surrounding environs, highlighted by a relatively intense concentration of voters with distinct partisan leanings. In 1980, this method shows that though Reagan enjoyed broad support in California, the revolution was not felt in the Bay Area.

Comparison of these maps across time also underscores well-known political trends, but offers more resolution than state-level choropleths and greater clarity than county-level choropleths. Note the nearly inverted maps for 1924 and 2004, between which elections the Solid South went from solidly Democratic to solidly Republican. Interestingly, though that particular regional pattern has been remarkably consistent since 1984, the South favored a Democratic candidate as recently as 1980.

These patterns over time are even better observed in motion. Interpolating support between elections, I have generated a video in which these maps shift smoothly from one election year to the next. The result is the story of 20th-century presidential politics on a grand scale, condensed into a little over a minute of data visualization.

The video can also be seen at YouTube (I recommend the “expanded” or “full screen” view), or at Vimeo. The images were rendered at 1280 x 720 pixels, to allow the video to be seen in HD.

This animated interpretation accentuates certain phenomena: the breadth and duration of support for Roosevelt, the shift from a Democratic to a Republican South, the move from an ostensibly east-west division to the contemporary coasts-versus-heartland division, and the stability of the latter.

More broadly, this video is a reminder that what constitutes “politics as usual” is always in flux, shifting sometimes abruptly. The landscape of American politics is constantly evolving, as members of the two great parties battle for electoral supremacy.

Appendix on creating the visualization

Using county-level presidential returns from the CQ Press Voting and Elections Collection, I associated each county’s support in a given election year for the Democratic and Republican candidates with an approximation of that county’s centroid in degrees latitude and longitude, using the shapefiles loaded with the package mapdata.

I then used simple linear interpolation to create a smoothed transition from election to election, creating 99 inter-electoral estimates of partisanship for each county. Using a custom function and the interp function from akima, I created a spatially smoothed image of interpolated partisanship at points other than the county centroids.
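The temporal half of that step is easy to illustrate. The following is a minimal Python sketch of linear interpolation between consecutive elections for a single county, with 99 intermediate estimates per gap; the function name and the sample values are hypothetical, and the spatial smoothing done with akima is not reproduced here.

```python
def interpolate_frames(series, steps_between=99):
    """Linearly interpolate between consecutive election values.

    series holds one value per election for a single county.
    Returns every frame: each election plus steps_between
    intermediate estimates in each gap.
    """
    frames = []
    for start, end in zip(series, series[1:]):
        for k in range(steps_between + 1):
            t = k / (steps_between + 1)       # 0 at the earlier election
            frames.append(start + t * (end - start))
    frames.append(series[-1])                 # close with the final election
    return frames


# 23 elections (1920 through 2008) give 22 gaps of 100 frames, plus one.
n_frames = len(interpolate_frames([0.5] * 23))
small = interpolate_frames([0.4, 0.6], steps_between=1)
```

With 23 elections this yields 22 × 100 + 1 = 2,201 frames, which agrees with the frame count mentioned later in this appendix.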

This resulted in inferred votes over the Gulf of Mexico, the Atlantic and Pacific Oceans, the Great Lakes, Canada and Mexico — so I had to clip any interpolated points outside of the U.S. border using the very handy pinpoly function from the spatialkernel package.
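For readers unfamiliar with this kind of clipping, here is a generic point-in-polygon test in Python using ray casting. It is only a stand-in to show the idea; it is not the spatialkernel implementation, and the square "border" below is a toy polygon rather than the U.S. outline.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon?

    polygon is a list of (x, y) vertices in order; the last vertex
    is implicitly joined back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside       # one more crossing to the right
    return inside


def clip_to_border(points, polygon):
    """Keep only the interpolated points that fall inside the border."""
    return [p for p in points if point_in_polygon(p[0], p[1], polygon)]


square = [(0, 0), (4, 0), (4, 4), (0, 4)]     # toy border
kept = clip_to_border([(2, 2), (5, 2), (1, 3), (-1, 1)], square)
```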

Finally, I created a custom color palette, a modification of the RdBu scheme from Colorbrewer, using colorRampPalette(), and plotted the interpolated data along with state borders using the excellent ggplot2.
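A color ramp of this kind amounts to linear interpolation between a few anchor colors. The Python sketch below works in the spirit of colorRampPalette() but is not that function; the red, white, and blue anchors are placeholders chosen for illustration, not the modified RdBu palette used for the maps.

```python
def color_ramp(anchors, n):
    """Return n hex colors linearly interpolated through the anchors.

    anchors is a list of at least two (r, g, b) tuples in 0..255.
    """
    if n == 1:
        return ["#%02x%02x%02x" % tuple(anchors[0])]
    colors = []
    last_segment = len(anchors) - 2
    for i in range(n):
        pos = i * (len(anchors) - 1) / (n - 1)   # position along the ramp
        lo = min(int(pos), last_segment)         # index of the left anchor
        t = pos - lo                             # blend weight in [0, 1]
        rgb = tuple(
            round(a + t * (b - a))
            for a, b in zip(anchors[lo], anchors[lo + 1])
        )
        colors.append("#%02x%02x%02x" % rgb)
    return colors


# Placeholder anchors (assumption): pure red, white, pure blue.
palette = color_ramp([(255, 0, 0), (255, 255, 255), (0, 0, 255)], 5)
```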

I would like to note that I would have preferred using the Albers Equal Area Conic projection, but settled on the default Mercator projection, as drawing the Albers map with ggplot2 was prohibitively time-consuming, given that I was generating 2,201 individual frames.

Choropleth Maps of Presidential Voting

Having always appreciated the red and blue cartograms and cartographs of geographic electoral preferences, such as those made available by Mark Newman, I sought to produce similar maps, but include information about support for non-“state-sponsored” parties, and to extend the coverage back in time.

I was able to find county-level presidential election returns going as far back as 1920, thanks to the CQ Press Voting and Elections Collection (gated). I converted the proportion of the vote garnered by Democratic, Republican, and “Other” parties’ candidates to coordinates in three-dimensional RGB color space, and used shapefiles from the mapdata package to plot these results as choropleth maps with ggplot.
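The conversion from vote proportions to a color can be sketched as follows. This Python snippet assumes the simplest possible mapping, with each party's share scaled straight onto one channel (Republican to red, other parties to green, Democratic to blue); the original R code may rescale or adjust the shares differently, and the function name is hypothetical.

```python
def vote_to_rgb(dem, rep, other):
    """Map vote counts to a hex color via each party's share of the total.

    Assumed channel mapping: red = Republican, green = other, blue = Democratic.
    """
    total = dem + rep + other
    red = round(255 * rep / total)
    green = round(255 * other / total)
    blue = round(255 * dem / total)
    return "#%02x%02x%02x" % (red, green, blue)


# Invented counts: an evenly split county comes out purple.
even_split = vote_to_rgb(dem=500, rep=500, other=0)
```

Under this mapping, a county that splits evenly between the two major parties is purple, and third-party support pulls the color toward green.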


It is interesting to observe these maps in a series, which gives historical context to the Red State/Blue State narrative. Most obviously, there is a significant shift in the geographic center of Democratic support, from a concentration in the southeast to the present equilibrium, localized on each coast and near the Great Lakes.

Among these 23 elections, landslide victories, such as Roosevelt over Landon in 1936, Johnson over Goldwater in 1964, Nixon over McGovern in 1972, and Reagan over Mondale in 1984, tend to stand out for their monochromaticity.

Also intriguing are the elections featuring substantial support for third-party candidates. Most of these are individuals who had a strong support base in a specific region of the country, such as La Follette in the northwest, and Thurmond and Wallace in the deep south. Ross Perot’s run in 1992 is unique here, as his relatively broad geographic base of support results in a map that runs the gamut to a greater degree than any others.

Click on the image below to see a full screen version of the slideshow above, or to download any of the individual maps as PNGs.

Goldwater Click for slideshow/download

K-Means Redistricting

U.S. Congressional districts are today drawn with the aim of maximizing the electoral advantage of the state’s majority party, subject to some constraints, including compactness (which can be measured in numerous ways) and a “one person, one vote” standard. What if, instead of minimizing population variance across districts, we aimed to minimize the mean distance between each resident and their district center?

To do so would be to employ something very much like k-means clustering, and produces some interesting results.

Using the population and latitude and longitude coordinates of the centroid of each (2000) census tract (a block-level reproduction was deemed too computationally intensive for the present purposes), I produced a geospatial k-means clustering for several states. Each tract was represented by its centroid as a point, weighted by population (which required a custom function, as the default kmeans() function in R does not appear to permit weighted points).
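A population-weighted k-means can be sketched compactly. The Python version below is an illustration under stated assumptions, not the custom R function mentioned above: it treats latitude and longitude as planar coordinates, uses squared Euclidean distance, and moves each center to the population-weighted mean of its assigned tracts. Strictly speaking, this minimizes the weighted sum of squared distances, a close relative of mean distance. All names and the toy data are hypothetical.

```python
import random


def weighted_kmeans(points, weights, k, centers=None, iters=100, seed=0):
    """Lloyd's algorithm with weighted points.

    points: list of (x, y); weights: one population per point.
    centers: optional starting centers; otherwise k random points.
    Returns (labels, centers).
    """
    rng = random.Random(seed)
    centers = list(centers) if centers is not None else rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared distance.
        new_labels = [
            min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break                          # assignments stable: converged
        labels = new_labels
        # Update step: weighted mean of each cluster's points.
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            w = sum(weights[i] for i in members)
            if w > 0:
                centers[j] = (
                    sum(weights[i] * points[i][0] for i in members) / w,
                    sum(weights[i] * points[i][1] for i in members) / w,
                )
    return labels, centers


# Toy tracts: two obvious groups, one heavily populated point.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
pops = [1, 3, 1, 1]
labels, centers = weighted_kmeans(pts, pops, k=2,
                                  centers=[(0.0, 0.0), (10.0, 10.0)])
```

Passing explicit starting centers makes a run reproducible; omitting them gives the random initialization that motivates the repeated runs described next.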

Since each run of the k-means algorithm begins with a random set of points, I replicated the function several thousand times, attempting to find a maximum inverse Herfindahl-Hirschman index of district population — the “effective number of districts,” as it were. For North Carolina, as shown below, I was able to find a maximum END of 12.17 for thirteen districts, which is a fairly even distribution of population.


Interestingly, there is still substantially wider variation in population than would be permitted under the current system. The least populous district houses fewer than 400,000 individuals, and the most populous, nearly a million. These figures are much more extreme than the extant least- (Wyoming) and most- (Montana) populous districts.

Population by district:

#  Population
1  398492
4  398896
8  423710
10 525860
2  533812
13 537417
3  618040
6  662092
11 676221
12 767249
7  785668
5  786448
9  935408
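The "effective number of districts" mentioned earlier is the inverse of the Herfindahl-Hirschman index of district population shares. As a check, this short Python sketch (the function name is hypothetical) applies that definition to the thirteen populations in the table.

```python
def effective_number(populations):
    """Inverse Herfindahl-Hirschman index of population shares."""
    total = sum(populations)
    return 1.0 / sum((p / total) ** 2 for p in populations)


# District populations copied from the table above.
nc_districts = [398492, 398896, 423710, 525860, 533812, 537417, 618040,
                662092, 676221, 767249, 785668, 786448, 935408]
end_nc = effective_number(nc_districts)
```

For these figures the result is roughly 12.17, in line with the value quoted for North Carolina; thirteen perfectly equal districts would give exactly 13.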

However, the district boundaries (here hastily drawn by use of chull()) are not characterized by the ragged edges and elongated shapes often seen in the existing plans.
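Drawing a district as the convex hull of its tract centroids can be illustrated with a standard algorithm. The Python sketch below uses Andrew's monotone chain; it stands in for R's chull() only conceptually (chull() returns indices, this returns the hull vertices), and the sample points are invented.

```python
def convex_hull(points):
    """Convex hull by Andrew's monotone chain.

    points: list of (x, y) tuples. Returns hull vertices in
    counter-clockwise order, starting from the leftmost point.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop each chain's last point; it repeats the other chain's first.
    return lower[:-1] + upper[:-1]


# Invented centroids: the interior point (1, 1) is not part of the outline.
hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

Because hulls of neighboring districts can overlap, outlines drawn this way are only approximate boundaries.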

I was interested in what the k-means-based plan would do to district partisanship, and decided to use population density as a rough proxy for local party affiliation. The distribution of population per square mile for each North Carolina census tract is shown below, with a vertical line indicating the median.

I decided to characterize any tract with greater-than-median population density as Democratic, and less-dense tracts as Republican. This resulted in the following proportion of Democrats residing in each district as plotted above:

#  Prop. Dem.
1  0.253
4  0.265
10 0.336
8  0.350
6  0.383
7  0.474
13 0.510
3  0.589
11 0.615
9  0.628
12 0.671
2  0.673
5  0.837

As the table indicates, full turnout under such a plan would result in the election of 6 Republicans and 7 Democrats. Below, I plot “Democratic” tracts in blue and “Republican” tracts in red, scaled according to their population. Urban centers are easily identifiable. Note the difference between this plan and the current actual plan, which draws a single elongated district (the twelfth) parallel to Interstate 85.
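The density proxy and the resulting seat count can be made concrete with a short Python sketch. It assumes that district shares are population-weighted, which the text suggests but does not state outright; the function names and the three-tract example are hypothetical, while the district shares are copied from the table above.

```python
from statistics import median


def classify_tracts(densities):
    """Label tracts above the median density 'D', all others 'R'."""
    cut = median(densities)
    return ["D" if d > cut else "R" for d in densities]


def district_dem_share(labels, populations):
    """Share of a district's residents living in 'D' tracts (assumed weighting)."""
    dem = sum(p for lab, p in zip(labels, populations) if lab == "D")
    return dem / sum(populations)


# District shares copied from the table above.
shares = {1: 0.253, 4: 0.265, 10: 0.336, 8: 0.350, 6: 0.383, 7: 0.474,
          13: 0.510, 3: 0.589, 11: 0.615, 9: 0.628, 12: 0.671, 2: 0.673,
          5: 0.837}
dem_seats = sum(1 for s in shares.values() if s > 0.5)
rep_seats = len(shares) - dem_seats

# Invented three-tract district for illustration.
toy_labels = classify_tracts([10, 200, 3000])
toy_share = district_dem_share(toy_labels, [100, 100, 200])
```

Counting districts with a share above one half reproduces the 7 to 6 split described above.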


Below, I replicate the same process for the state of Texas, generating 32 districts. One problem with the k-means algorithm is that larger states, or those with greater variance in population density, tend to generate districts with wide variations in population and inequalities of representation. The Texas plan below depicts a district with fewer than 200,000 residents and one with over 2 million. The Effective Number of Districts (maximum after 100 attempts) is a mere 21.58. Interestingly, the district “partisanship” split is 22/10 majority Republican/Democrat, not far from the current 20/12 split. In this simulated redistricting, there are 10 districts in which the majority of residents live in higher-than-the-state-median density areas: four each in Houston and Dallas-Fort Worth, one each around San Antonio and Austin.


The slideshow below depicts the incremental steps of the weighted k-means algorithm toward convergence around alternate districts for Ohio, beginning with a set of random centers, and eventually minimizing collective distances from local centroids.

Finally, I used the same algorithm to investigate what the continental United States would look like if states were partitioned according to the k-means rule. Clicking on the image below will bring you to an interactive, scalable map of the U.S. with 48 alternate states and inferred partisanship. Instead of initializing with random centers, I started the k-means algorithm with the population centroids of the actual states, and allowed the algorithm to converge to a minimizing partition. Many of these alternative states are more compact but familiar versions of the originals, although this new plan does realize Plunkitt’s Fondest Dream.


Regionalization via network-constrained clustering

I was interested in applications for a clustering algorithm that works along a network, identifying contiguous partitions, and thought that a good place to start would be identifying regional patterns in electoral preferences.

This project represents the early products of this inquiry. I chose county-level data, as counties are small enough to make “interesting” regions, and the presidential vote data was available back to 1920.

The poster linked below was given at the 2010 Political Networks Conference at Duke University, and describes the project in somewhat greater detail.


I am particularly enamored of the Obama/McCain color-coded network graph, as an abstracted version of the red/blue/purple cartograms produced in the wake of recent U.S. national elections. I also like the 12-cluster solution (middle left), as the regions produced are large enough to be considered general, but appear to cluster around recognizable politico-geographic features. In general, I have been very pleased with the results produced by this network-constrained clustering algorithm.