Zoning codes contain an enormous amount of information about local land use and housing regulations, with clear implications for racial equity, housing affordability, economic development, and environmental impacts. Yet zoning documents are often long, complex, unstandardized, and sometimes handwritten, making manual extraction of zoning data difficult and time-consuming. We implemented a pilot to gauge the extent to which tools such as text analysis and natural language processing (a branch of machine learning focused on written and spoken language) can automate this data collection process. Using data from the Connecticut Zoning Atlas as a benchmark against which to compare our results, we developed the following four-step methodology:
- Gather zoning codes and maps for all jurisdictions in Connecticut and process these documents to enable searchability.
- Algorithmically identify the zoning district names that appear within those documents.
- Build text datasets from identified districts that serve as model inputs.
- Use machine learning and natural language processing to generate relevant information about each zoning district.
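To make the second and third steps concrete, the following is a minimal Python sketch of how district names might be detected in searchable text and how surrounding passages might be collected as model inputs. It is an illustration under stated assumptions, not the pilot's actual implementation: the regular expression, the function names `find_district_names` and `build_snippets`, and the `window` parameter are all hypothetical, and real zoning codes name districts in far more varied ways than this pattern covers.

```python
import re

# Hypothetical pattern (an assumption, not the pilot's rule): a short code such
# as "R-1" or "C2", optionally followed by up to three capitalized words, then
# the word "District" or "Zone".
DISTRICT_PATTERN = re.compile(
    r"\b([A-Z]{1,3}-?\d{1,2}[A-Z]?)\s+(?:[A-Z][a-z]+\s+){0,3}(?:District|Zone)\b"
)


def find_district_names(text):
    """Return the sorted, de-duplicated district codes matched in the text."""
    return sorted({match.group(1) for match in DISTRICT_PATTERN.finditer(text)})


def build_snippets(text, names, window=80):
    """Map each district code to the text passages surrounding its mentions.

    `window` is the number of characters kept on each side of a mention.
    """
    snippets = {}
    for name in names:
        # Word boundaries keep "R-1" from matching inside "R-10".
        mention = re.compile(r"\b" + re.escape(name) + r"\b")
        snippets[name] = [
            text[max(0, m.start() - window): m.end() + window]
            for m in mention.finditer(text)
        ]
    return snippets
```

Calling `find_district_names` on the extracted text of a zoning code and passing the result to `build_snippets` yields one list of passages per district, which could then be fed to a downstream model. A fixed character window is the simplest choice for a sketch; splitting on section or sentence boundaries would likely produce cleaner inputs.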
Due in part to the lack of standardization in zoning documents and the high bar for accuracy that the task demands, we find that our pilot is limited in effectiveness and cannot fully automate the data collection process. However, as efforts to collect zoning data expand to other states, we identify several areas of promise for a hybrid methodology in which automation could reduce the human effort needed. We also highlight areas to prioritize in future efforts to fully automate the data collection pipeline.