What should every data scientist know when working with ZIP Codes?

Datasets often contain ZIP code fields, making it tempting for data scientists to organize data and develop models based on ZIP codes. However, ZIP codes present a significant challenge. When considering a ZIP code, many think of a well-bounded area contained perfectly within another geographic space (such as a city, congressional district, or census tract). This is often not the case.

To comprehend the complexities, one must first understand what ZIP codes are and how they work. Modern ZIP codes were implemented in 1963, when the United States Post Office Department (reorganized as the United States Postal Service, or USPS, in 1971) adopted the Zoning Improvement Plan (ZIP) to expand the postal zones established in 1943. Like an automotive VIN, the five-digit ZIP code encodes meaningful information. The first digit indicates a region, or group of states, that makes up a zone; the United States is separated into ten such zones. The following two digits represent the sectional center facility, a mail sorting facility located within one of the ten zones. The last two digits represent the post office or delivery area. In 1983, the USPS added four additional digits, commonly referred to as plus four (stylized as +4), to identify specific streets and street directions. All of this was done to improve mail delivery, not to create convenient boundaries for data analysis. In fact, if plotted correctly, ZIP codes would appear as lines (representing delivery networks) and points (representing large buildings, campuses, and post office boxes).
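The digit structure described above can be sketched in a few lines of code. This is purely illustrative: the field names below are invented labels for the positional parts of the code, and real ZIP semantics come from USPS routing tables, not from the digits alone.

```python
def parse_zip(code: str) -> dict:
    """Split a 5-digit ZIP or ZIP+4 string into its structural parts.

    Field names are descriptive labels, not official USPS terminology.
    """
    zip5, _, plus4 = code.partition("-")
    if len(zip5) != 5 or not zip5.isdigit():
        raise ValueError(f"not a valid ZIP code: {code!r}")
    return {
        "national_area": zip5[0],      # one of the ten regional zones
        "sectional_center": zip5[:3],  # sectional center facility prefix
        "delivery_area": zip5[3:],     # post office or delivery area
        "plus4": plus4 or None,        # street-level segment, if present
    }

print(parse_zip("96620-2820"))
```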

Data scientists have encountered other challenges related to ZIP codes. For instance, large portions of the United States, particularly in Alaska and Nevada, have no assigned ZIP code; the USPS does not assign ZIP codes to remote regions that do not receive mail. Another challenge is that ZIP codes may change, particularly in response to new construction and new delivery routes. To make things even more confusing (if they weren't already), some ZIP codes are not stationary. For instance, 96620-2820 is the ZIP+4 for the 5,500+ crew (ship's company and air wing) aboard the nuclear-powered supercarrier USS Nimitz.

Data scientists should know that ZIP codes do not always fall within state boundaries (or even within the group of states that make up a zone). There are over 100 cases where ZIP codes cross state lines. Even if ZIP codes could be plotted with well-defined boundaries, these would not align with other political boundaries such as county or municipal borders. And ZIP codes certainly do not align with United States Census tracts, block groups, or blocks.

To help relate census data to ZIP codes, the United States Census Bureau created ZIP Code Tabulation Areas (ZCTAs), which are therefore often included in census data products. However, ZCTAs are far from precise. A census block is assigned, as its ZCTA, the most frequent ZIP code (the mode) among its mailing addresses. If no single most frequent ZIP code can be identified, the block is assigned the ZCTA of the neighboring block with which it shares the longest border. ZCTAs have other limitations as well: they do not include all ZIP codes, particularly those assigned to large buildings, campuses, or post office boxes. Also, keep in mind that ZCTAs are built from five-digit ZIP codes only, not ZIP+4 codes.
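The assignment rule above can be sketched as follows. This is a simplified sketch of the described logic, not the Census Bureau's implementation; the block identifiers, ZIP code lists, and border lengths in the example are invented.

```python
from collections import Counter

def assign_zcta(addresses_zips, neighbors):
    """Sketch of the ZCTA rule: a block takes the mode ZIP code of its
    addresses; a block with no clear mode inherits the ZCTA of the
    neighbor sharing the longest border.

    addresses_zips: block_id -> list of ZIP codes for addresses in the block
    neighbors: block_id -> list of (neighbor_block_id, shared_border_length)
    """
    zctas = {}
    unresolved = []
    for block, zips in addresses_zips.items():
        counts = Counter(zips).most_common()
        # a unique most-frequent ZIP code exists
        if counts and (len(counts) == 1 or counts[0][1] > counts[1][1]):
            zctas[block] = counts[0][0]
        else:
            unresolved.append(block)
    for block in unresolved:
        # fall back to the neighbor with the longest shared border
        for nbr, _ in sorted(neighbors.get(block, []), key=lambda t: -t[1]):
            if nbr in zctas:
                zctas[block] = zctas[nbr]
                break
    return zctas

# invented example: one clear mode, one empty block, one tie
blocks = {
    "block_A": ["37201", "37201", "37203"],
    "block_B": [],
    "block_C": ["37201", "37203"],
}
borders = {
    "block_B": [("block_A", 120.0), ("block_C", 40.0)],
    "block_C": [("block_A", 80.0)],
}
print(assign_zcta(blocks, borders))
```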

Data scientists with access to address data are advised to geocode the addresses into geographic coordinates; if the data already include coordinates, start there. A vector overlay operation can then determine the relevant census tract or block, congressional district, or political jurisdiction, enabling more precise analyses. Unfortunately, if no addresses are associated with the data, ZCTAs may be the best option for crosswalking from ZIP codes to more meaningful boundaries. Some municipalities and for-profit groups also provide demographic data collected (or aggregated) at the ZIP code level.
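The core of the vector overlay step is a point-in-polygon test, sketched below with the classic ray-casting algorithm. In practice one would use a library such as GeoPandas (e.g., its spatial join, `sjoin`) against authoritative boundary files; the rectangle here is an invented stand-in for a census tract boundary.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) falls inside polygon (a list of vertices),
    using ray casting: count edge crossings of a ray cast to the right."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does the horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# hypothetical tract boundary (lon, lat) -- invented for illustration
tract = [(-86.80, 36.15), (-86.75, 36.15), (-86.75, 36.18), (-86.80, 36.18)]
print(point_in_polygon(-86.78, 36.16, tract))  # a geocoded address inside
```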

If you encounter a situation that requires linking data with ZIP codes to data without ZIP codes, proceed cautiously and be aware of your limitations.

What Makes Spatial Data Special Data?

Data scientists work with a wide variety of data. Some of that data likely includes street addresses or coordinates (e.g., latitude and longitude). However, most data scientists have not explored spatial data's true capabilities (and complexities). Spatial data models offer the benefits of a simplified view of reality. Let's consider the two primary spatial data models, discrete and continuous, and learn when to use each. Discrete spatial data describe known locations with a known boundary (such as the political border of the State of Tennessee). In contrast, continuous spatial data are estimated and do not have a known border (such as the area where the ocean temperature is 59 degrees).

Discrete data is stored using vectors (e.g., points, lines, or polygons). Point data can represent where a soil test sample originated, the precise location of study trees, or where an animal was tagged. Line data often represent streams, delivery routes, wildlife migration paths, or streets. Polygon data are closed shapes and frequently represent lakes, forests, or cities.

Vector data are commonly stored as ESRI Shapefiles, consisting of a .shp file and several sidecar files (i.e., .dbf, .shx, .prj) that must be kept together in the same folder. While ESRI Shapefiles are an old and somewhat outdated standard, most public data sources provide geospatial data in this format, so a data scientist is very likely to encounter them. Alternatively, vector data may be stored in proprietary ESRI File Geodatabases (.gdb). More recently, vector data have become available in an open, non-proprietary file type known as a GeoPackage (.gpkg). All of these file types can be read and explored with the Python GeoPandas library, which uses the Fiona file handler powered by GDAL (the Geospatial Data Abstraction Library).

While discrete data are intuitive, working with continuous data adds complexity. Continuous (or thematic) spatial data can represent noise pollution, terrain elevation, precipitation, or wind speed. Because data-collecting sensors cannot easily be placed on a perfect grid, values between the discrete data points must be estimated. Continuous data are frequently represented using raster files. Much like a digital image, where each pixel represents a color, the 'pixels' of a raster file contain data values such as water temperature.
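One common way to estimate values between discrete sensor readings is inverse distance weighting (IDW), sketched below. The water-temperature readings and coordinates are invented; production work would interpolate onto a full raster grid rather than a single point, and might prefer methods such as kriging.

```python
import math

def idw(x, y, samples, power=2):
    """Estimate a value at (x, y) from (xi, yi, value) samples,
    weighting each sample by 1 / distance**power."""
    num = den = 0.0
    for xi, yi, value in samples:
        d = math.hypot(x - xi, y - yi)
        if d == 0:
            return value  # the point sits exactly on a sensor
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# invented water-temperature readings (x, y, degrees Fahrenheit)
readings = [(0.0, 0.0, 58.0), (10.0, 0.0, 60.0), (0.0, 10.0, 59.0)]
print(round(idw(5.0, 0.0, readings), 2))
```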

Common raster file types include Erdas Imagine files (consisting of an .img file and an .xml sidecar file), open and non-proprietary GeoPackage (.gpkg) geodatabases, or open standard GeoTIFF (.tif) files. Public data sources often share raster data using Erdas Imagine files, while up-to-date satellite-based optical and radar imagery is now more frequently available in GeoTIFF formats. Data scientists can explore these raster file types using the Rasterio Python library, which relies on GDAL.
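Conceptually, a raster is a grid of values plus georeferencing that maps world coordinates to grid cells, much like the affine transform Rasterio exposes for GeoTIFF and Erdas Imagine files. The sketch below shows that mapping with a plain list-of-lists grid; the elevations, origin, and cell size are invented.

```python
# georeferencing: a top-left origin plus a cell size (projected meters)
origin_x, origin_y = 500000.0, 4100000.0
cell_size = 30.0  # 30 m pixels, as in Landsat imagery

# invented elevation values in meters; rows run north to south
grid = [
    [120.0, 121.5, 123.0],
    [119.0, 120.5, 122.0],
    [118.0, 119.5, 121.0],
]

def sample(x, y):
    """Return the grid value covering world coordinate (x, y)."""
    col = int((x - origin_x) // cell_size)
    row = int((origin_y - y) // cell_size)  # y decreases down the rows
    return grid[row][col]

print(sample(500075.0, 4099955.0))
```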

We hope that this simplified overview encourages data scientists to go beyond the traditional bounds and focus on a new world of possibilities available through the exploration of spatial data. Once familiar with vector and raster data, data scientists can explore indoor mapping spatial file types (e.g., Apple Venue Format or Revit BIM), three-dimensional spatial files (Collada or Trimble Sketchup), or multitemporal spatial file formats (Network Common Data Form or Hierarchical Data Format).

Spatial data science is truly special data science.