Public Data, Private Data, and “Good” AI: How is it All Connected?

Richard Zhang

Sep 23, 2025

Introduction

Mainstream artificial intelligence (AI) tools such as ChatGPT have marked a turning point across many industries, drawing attention for their human-like conversational skills and online search features. These qualities helped ChatGPT reach 100 million users only two months after launch (Reuters, 2023). Yet the rise of these tools has also caused problems: a considerable amount of AI use is malicious, which has prompted people to question whether these technologies are inherently “good.”

What makes an AI system or model “good”? For the purposes of this article, “good” AI is defined as an AI system or model that is created and used for public benefit (Iazzolino & Stremlau, 2024), trained on consented and accurate data (Azzalini et al., 2023), and useful in the real world (Iazzolino & Stremlau, 2024).

One polarizing factor that shapes perceptions of AI’s “goodness” is how it operates. AI systems cannot function on human prompts alone; they require massive amounts of data, which form the foundation of their outputs. This data can be split into two broad categories: public data and private data. Public data training is the use of information that is lawfully available to anyone without special permission to train AI systems and models (Open Data Institute, n.d.). Such information includes open-access journals, public GitHub repositories, and public datasets. Private data training, on the other hand, is the use of restricted and confidential information to train AI systems and models (European Union, 2016). Such information includes electronic health records, internal corporate documents, and classified intelligence. Neither type of training data automatically makes an AI good or bad; an AI system or model is good only when its data is used with consent, kept secure, checked for quality, and applied for real public benefit.

Consensual and Secure Data

(FlyD, 2021)

The first major determinant of “good” AI that relates to public and private data is whether data is used with consent and kept secure. Consent means that when an entity’s data is used, the entity has volunteered that data with knowledge of how it will be used (Dove & Chen, 2020). Security means that the data is never shared with unauthorized individuals, parties, or processes (Lundgren & Möller, 2017).
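The consent condition can be pictured as a simple gate in front of a training set. The following is a minimal sketch, assuming a hypothetical `Record` type in which each data owner lists the purposes they agreed to; the field names and purpose labels are illustrative and do not come from any of the cited sources.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One piece of data plus the purposes its owner agreed to (hypothetical schema)."""
    owner: str
    payload: str
    consented_purposes: frozenset = field(default_factory=frozenset)


def usable_for(records, purpose):
    """Keep only the records whose owner consented to this specific purpose."""
    return [r for r in records if purpose in r.consented_purposes]


records = [
    Record("alice", "forum post", frozenset({"ai_training", "research"})),
    Record("bob", "survey answer", frozenset({"research"})),
    Record("carol", "photo caption"),  # no consent recorded at all
]

training_set = usable_for(records, "ai_training")
print([r.owner for r in training_set])
```

In this sketch, a record with no recorded consent is excluded by default, which mirrors the idea that consent must be given knowingly rather than assumed.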

Public data is lawfully available to anyone without the need for special permission. This follows from how public datasets usually work: they are released under open licenses that allow anyone to access and use their contents, sometimes subject to minor conditions such as attribution requirements and restrictions on mass-scale scraping. For instance, OpenStreetMap states on its website that anyone may use its geographical data, as long as they credit the OpenStreetMap contributors and use bulk downloads or the API for large-scale access (OpenStreetMap, n.d.). Because these rules are simple and transparent, compliance is easy, which greatly reduces the risk of data-regulatory issues.

The most common example is public data found online, where users post information or interact with systems and consent to that information or those interactions being used as data. Much (but not all) of the information online is public data and can therefore be accessed and used easily by anyone. There are also many public data repositories on the internet, such as Zenodo, which allow users to publish information for others to use. Because of how public data is shared, its use in AI training generally comes with permission from the original publishers. Furthermore, the few conditions that must be followed pose little risk of data-regulatory issues, making public data well suited to consensual and secure AI training.

Private data, on the other hand, is held behind legal, contractual, or technical barriers and may be used only with explicit authorization. This significantly limits its availability, and because of the complex barriers that block access, those who do have access are generally researchers, educational institutions, firms, and governments (Lathe, 2022). While this limits the exposure of sensitive data, it also keeps important, non-sensitive information such as anonymized lab data away from those attempting to create systems for public benefit. Data owners may decline to share for many reasons, such as preventing others from creating similar AI systems or models. As a result, the potential for beneficial breakthroughs is reduced. For example, researchers working with UK Biobank were denied access to anonymised primary-care records for over a decade, which delayed AI-driven research and, consequently, breakthroughs in the early detection of Parkinson’s and dementia (McKie, 2024).

However, there are legitimate reasons for denying entities access to private data. A number of regulations govern its use, such as keeping detailed records of usage like processing logs (European Union, 2016), assessing bias introduced by dataset selection, such as unfairness toward particular groups (Hardt, Price, & Srebro, 2016), and meeting government audit requirements (Innovation, Science and Economic Development Canada, 2022) like annual compliance reports (European Union, 2024). Entities may miss some of these regulatory requirements, or even choose to ignore them entirely.

Disregarding regulations undermines the very barriers that keep sensitive information safe, which can cause severe data security issues such as breaches of private information. This was the finding in 2017, when the UK Information Commissioner’s Office ruled that the Royal Free London NHS Foundation Trust had failed to comply with the Data Protection Act when it shared 1.6 million identifiable patient records with Google DeepMind without patient consent (Hern, 2017). Private data can therefore be seen as a double-edged sword. It is important to keep sensitive information private and available only to trusted entities that develop AI systems for public benefit, but doing so creates a reliance on those entities, as only they have access to the data. People must trust them to have public benefit in mind, follow regulations, and produce results. The limited availability and strict regulation of private data for AI training have both benefits and drawbacks, and how beneficial they are depends on what is deemed more important: privacy or results.

Data Quality and Quantity

(Balan, 2022)

Another major determinant of “good” AI is the quality and quantity of its data: how much data is readily available for training AI systems, and how reliable that data is. Some fields have abundant data, which allows for more AI training, whereas others suffer from scarcity, which limits training potential.

While public data is plentiful and easily accessible, its credibility is not always guaranteed. Sources from researchers or governments may be factually accurate, but public data can also include blog posts or fabricated news. The accessibility of public forums leads to a higher risk of duplicate data and misinformation. A recent study illustrates the risk of using public data uncritically: MRI-reconstruction algorithms naively trained on public datasets that had undergone hidden preprocessing produced error metrics that appeared up to 48% better than warranted (Shimron et al., 2022).

However, filtered data is not perfect either. The filtering process takes time, can fail to remove inaccurate information, and can accidentally remove accurate data. This was the case with ImageNet, a widely used benchmark of labeled public images: despite its curation process, at least 6% of the images in its validation set were later found to be mislabeled (Northcutt et al., 2021).
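Both failure modes of filtering can be seen in a small sketch. It assumes a hypothetical list of records, each with a `text`, a `source`, and an `accurate` flag (ground truth that a real pipeline would not have), plus an assumed allowlist of trusted sources; none of these names come from the studies cited above. The sketch removes exact duplicates, keeps only trusted sources, and then counts what went wrong.

```python
import hashlib

# Hypothetical records. "accurate" is ground truth used only to score the filter.
RECORDS = [
    {"text": "Water boils at 100 C at sea level.", "source": "gov", "accurate": True},
    {"text": "Water boils at 100 C at sea level.", "source": "blog", "accurate": True},  # duplicate
    {"text": "Water boils at 50 C at sea level.", "source": "gov", "accurate": False},   # error from a trusted source
    {"text": "The Moon orbits the Earth.", "source": "blog", "accurate": True},          # accurate but untrusted
]

TRUSTED_SOURCES = {"gov", "journal"}  # assumed allowlist, for illustration only


def deduplicate(records):
    """Drop records whose normalized text has already been seen (first copy wins)."""
    seen = set()
    unique = []
    for r in records:
        digest = hashlib.sha256(r["text"].strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(r)
    return unique


def filter_trusted(records, trusted=TRUSTED_SOURCES):
    """Keep only records that come from an allowlisted source."""
    return [r for r in records if r["source"] in trusted]


def summarize(records):
    """Run dedup + filtering and count the filter's two kinds of mistakes."""
    unique = deduplicate(records)
    kept = filter_trusted(unique)
    kept_ids = {id(r) for r in kept}
    return {
        "input": len(records),
        "after_dedup": len(unique),
        "kept": len(kept),
        "inaccurate_kept": sum(1 for r in kept if not r["accurate"]),
        "accurate_dropped": sum(
            1 for r in unique if r["accurate"] and id(r) not in kept_ids
        ),
    }


print(summarize(RECORDS))
```

In this toy dataset the filter lets one inaccurate record through because it came from a trusted source, and it discards one accurate record because it came from an untrusted one. A source allowlist is only one possible filtering rule, chosen here because it is easy to read.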

Furthermore, although public data is abundant and spans many fields, there are key areas in which it is lacking. Data such as health records, law-enforcement incident reports, and standardized testing scores are often kept confidential under privacy regulations. In Ontario, Canada, for example, health records cannot be collected, used, or even disclosed until consent has been obtained (Information and Privacy Commissioner of Ontario, 2015). Public data is therefore strong in quantity but can be deficient in quality and coverage.

Private data, on the other hand, is difficult to access but can encompass vital sensitive information. A substantial amount of it exists, but because it is kept behind legal, contractual, and technical controls, the volume actually accessible for AI training is limited. However, private data is often of high quality: Upstart’s lending-risk model, which was trained on private data such as credit-bureau files and bank-transaction histories, was able to accurately assess the risk of lending to specific individuals (Chakravarty, 2024). These attributes can allow AI models trained on private data to perform well beyond what public data alone can achieve. Furthermore, the limited availability of private data means that fewer entities have used it for training, so more potential improvement remains available for AI systems. Take NYUTron, a clinical language model trained on private data, which outperformed traditional models by 5.36–14.7% (Jiang, Liu, Nejatian, et al., 2023), showing the untapped potential of private data.

Because private data is in high demand and highly valuable, its owners are unlikely to allow just anyone to use it. As a consequence, an illicit market has emerged in which groups steal and sell valuable private data. A recent case was the HCA Healthcare breach in 2023, in which an attacker stole the personal details of 11 million patients and posted the data for sale on the dark web (Whittaker, 2023). Thus, while private data is often of high quality, its unavailability to the public makes it difficult to harness for AI training.

Conclusion

Choosing between public and private data is not an afterthought. It determines who benefits, what the risks are, and how far AI can safely progress. In a world where AI and data are becoming increasingly important, understanding both types of data and their characteristics is crucial. It helps policymakers create balanced rules, organizations build consent-respecting and secure systems, and individuals judge and use AI properly.

Both public data and private data have benefits and drawbacks for AI training, and neither leads directly to good AI. Public data offers an abundance of available information, low entry costs, and minimal data regulation, which allow for the rapid development and improvement of AI models. However, its uneven quality and its absence from sensitive domains leave critical knowledge gaps. In contrast, private data supplies depth, accuracy, and more domain-specific insight, enabling the training of more powerful and specialized AI models. Yet its lack of availability to the general population and the high risk of regulatory violations inhibit its potential.

Thus, good AI cannot rely on only one type of data, and neither type can directly produce good AI. Instead, combining public and private data for AI training can lead to better results and can help AI systems become both powerful in real-world applications and ethically responsible.
