The views expressed by contributors are their own and not the view of The Hill

Public records data must be off-limits for AI

by Vishnu S. Pendyala, opinion contributor 08/11/24 01:00 PM ET


Companies are looking for innovative ways to collect data to feed their data-hungry artificial intelligence systems and build new applications. Fortunately, some do not have to go too far.

Then, there are the companies that collect data from public records to share on the internet and run analytics.

There are many compelling reasons for public records to be off-limits for AI systems. Lawmakers must consider them and act soon, before the practice wreaks havoc.

First, public records are neither free from bias nor representative. Outcomes from systems trained on such data are unlikely to be entirely fair.

In some cases, such as court records, the data may not even be truthful. Perjury law is widely regarded as rudimentary and ineffective, particularly in family law, where once-close couples become each other’s worst adversaries. The warring couple can share each other’s most private secrets and also lie about them. Financial disclosures are common.

AI can be used to perform psychometric typing or financial risk profiling of the parties to a court case by combining the court documents with other openly available information. If such analysis is sold to potential employers, landlords and other providers, it can unfairly jeopardize competent applicants’ chances.

Scamsters who get hold of the monetary aspects of the analysis can easily victimize the parties. Repressive governments can use the analysis to the peril of their citizens.

The law, so far, has been attempting to address the current problems of technology. It should also consider the future impacts. We need technology visionaries without conflicts of interest to draft the laws regulating AI.

As the popular saying goes, data is the new oil. It is important for the law to substantially regulate the collection, use and retention of data in its various forms. With the rise of quantum computing, generative AI and ingenious hackers, data privacy and security will face increasingly formidable challenges.

I closed my account with AT&T in 2019, but still received a notification from them this April that some of my personal information was compromised, exposing me to identity theft. The law should not have permitted companies to retain personally identifiable information for so long — more than four years in this specific instance.

In my Big Data course, I teach graduate students that anonymity alone is not sufficient to protect privacy. Even if AI models are trained only on anonymous data from public records, stolen personally identifiable information can be cross-referenced with that anonymous data to re-identify individuals and divulge sensitive details about them. Also, it is not too difficult to train language models to extract such information from unstructured text.
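To make the re-identification risk concrete, here is a minimal sketch of such a cross-referencing (linkage) step. Everything in it is an assumption for illustration: the records are invented, the field names (`zip`, `birth_year`, `sex`, `case_type`) are hypothetical, and real attacks and real datasets are far messier. It only shows why removing names does not by itself make a record anonymous.

```python
from collections import defaultdict

# Hypothetical "anonymized" public records: names removed, but
# quasi-identifiers (ZIP code, birth year, sex) left in place.
anonymized_records = [
    {"zip": "95112", "birth_year": 1980, "sex": "F", "case_type": "divorce"},
    {"zip": "95112", "birth_year": 1975, "sex": "M", "case_type": "bankruptcy"},
    {"zip": "95014", "birth_year": 1980, "sex": "F", "case_type": "custody"},
    {"zip": "95014", "birth_year": 1980, "sex": "F", "case_type": "eviction"},
]

# Hypothetical identities exposed in an unrelated data breach.
leaked_identities = [
    {"name": "Person A", "zip": "95112", "birth_year": 1980, "sex": "F"},
    {"name": "Person B", "zip": "95014", "birth_year": 1980, "sex": "F"},
]

# Assumed set of fields shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def quasi_key(record):
    """Build a comparable key from the shared quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymized, leaked):
    """Return {name: case_type} for people who match exactly one record."""
    by_key = defaultdict(list)
    for record in anonymized:
        by_key[quasi_key(record)].append(record)

    reidentified = {}
    for person in leaked:
        matches = by_key.get(quasi_key(person), [])
        # A unique match ties the "anonymous" record to a named person.
        if len(matches) == 1:
            reidentified[person["name"]] = matches[0]["case_type"]
    return reidentified


result = link(anonymized_records, leaked_identities)
print(result)
```

In this toy data, one person is tied to a single court record because the combination of quasi-identifiers is unique, while the other stays ambiguous because two records share the same combination. That ambiguity is the idea behind protections such as k-anonymity, which on their own are still not a complete defense.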

There have been multiple instances in which AI models revealed personally identifiable information. Researchers have also shown that truly deleting sensitive information from large language models like ChatGPT is not easy.

The problem of internet data persistence, where data, once online, can remain indefinitely and affect privacy and security, is compounded by the rise of AI models.

Even when they are used for altruistic purposes, AI models are mostly black boxes, which makes it hard to explain the rationale behind their decisions, a requirement for most uses of government data. Individuals have very little control over their data once it gets into AI models, even to correct inaccuracies.

It is therefore imperative that the government limit its own collection and retention of personally identifiable information in addition to restricting its use to train AI models.

Vishnu S. Pendyala, PhD, MBA (Finance) teaches machine learning and other data science courses at San Jose State University and is a Public Voices Fellow of the OpEd Project.
