
To retrieve criminal charges and conviction data made publicly available by the Washington, DC, city government, a recent Urban Institute study of criminal background checks employed over 1 million automated mouse clicks. The need for more and better data has long been a staple of empirical social science research, and automated data collection, or “web scraping,” is a powerful tool for meeting that need.
An ongoing court case in California has potentially deep ramifications for this sort of novel research. The case addresses a crucial gray area: can collecting publicly available data from the internet be considered criminal misconduct?
hiQ Labs Inc. v. LinkedIn Corporation
The plaintiff in the case, hiQ, automatically collects data made public on LinkedIn profiles and uses it to help companies predict when their employees are looking for new jobs. After hiQ had been operating for several years, LinkedIn sent it a cease and desist letter invoking the 1986 Computer Fraud and Abuse Act (CFAA).
The case hinges on whether continuing data collection by hiQ constitutes “access[ing] a computer without authorization,” the benchmark set by the CFAA. Monday’s decision preventing LinkedIn from blocking hiQ while the case progresses rests heavily on the work of Orin Kerr, a professor at the George Washington University School of Law.
The only private areas of the internet, according to Kerr, are password-protected systems. Otherwise, because of the open nature of the web, content is properly assumed to be public. This key fact differentiates the case from other recent automated data collection cases, such as Facebook Inc. v. Power Ventures Inc., where the data in question required passwords to access. The court draws this distinction clearly:
“But if a business displayed a sign in its storefront window visible to all on a public street and sidewalk, it could not ban an individual from looking at the sign and subject such person to trespass for violating such a ban. LinkedIn, here, essentially seeks to prohibit hiQ from viewing a sign publicly visible to all.”
The vague wording of the CFAA means a researcher collecting public data could wind up in court, a situation that a decision in favor of LinkedIn could make more likely.
Furthermore, although hiQ received a cease and desist letter in this case, in other circumstances, the language blocking scraping could be buried in, say, a large, potentially unread user agreement. Would that automatically constitute unauthorized access? This remains an open question that few researchers would want to test in court, creating a strong disincentive to experiment with data collection, much to the detriment of innovative public research.
A decision in favor of hiQ, particularly one that rests on Kerr’s reasoning, would draw a clear line: only restrictions on data in password-protected areas of a site would be backed by the force of legal sanctions under the CFAA.
This would not solve all problems related to web scraping. As Timothy Lee at Ars Technica points out, a decision that blocks the application of the CFAA to public data, but says nothing about creating technical roadblocks to scraping, could lead to a “technological arms race” between websites and web scrapers.
We should always remain cautious and considerate when applying mass data collection techniques to public websites, particularly when the volume could be sufficient to affect the site’s regular operation. Web scraping, however, holds great promise for research, particularly for social scientists. Blocking the CFAA from applying to publicly available data is an important step.
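The caution about request volume can be made concrete. What follows is a minimal Python sketch of one way a scraper might space out its requests so that it does not strain a site; it is not drawn from the Urban Institute study or from the court case. The class name `Throttle`, the `min_interval_seconds` parameter, and the two-second interval in the usage note are all illustrative assumptions, and an appropriate delay depends on the site being collected from.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests.

    Hypothetical helper for illustration only. The clock and sleep
    functions are injectable so the behavior can be checked without
    actually waiting.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Pause if the previous request was too recent.

        Returns the number of seconds spent sleeping.
        """
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self._min_interval - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record when this request is allowed to proceed.
        self._last_request = self._clock()
        return slept
```

A collection loop would call `throttle.wait()` immediately before each page request, for example with `Throttle(min_interval_seconds=2.0)`, so that requests never arrive faster than the chosen interval no matter how quickly each page is processed.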