Data Gathering

Lifecycle - Design

Classification

Data Management and Lifecycle

Overview

Data gathering is the process of collecting data from various sources, including structured formats (such as databases and spreadsheets) and unstructured formats (such as text, images, or video). The data may be static, collected at a single point in time, or streaming, continuously generated and ingested, such as from IoT sensors or real-time user interactions. Effective data gathering is foundational for AI development, because the quality, diversity, and relevance of collected data directly affect model performance and fairness. Its limitations include biases introduced at the collection stage, the risk of collecting irrelevant or excessive data, and the difficulty of verifying data provenance and accuracy. Integrating data from multiple sources adds further complications, since each source may follow different standards, formats, and privacy requirements.

Governance Context

Data gathering is subject to regulatory and organizational controls intended to ensure responsible data use. For example, the EU General Data Protection Regulation (GDPR) requires organizations to limit the personal data they collect to what is necessary for the stated purpose and to establish a lawful basis for processing, such as informed consent from data subjects. The NIST AI Risk Management Framework (AI RMF) recommends implementing controls such as data provenance tracking and data minimization. Organizations must also establish clear data retention policies and conduct Data Protection Impact Assessments (DPIAs) when the planned processing is likely to pose a high risk to individuals' rights and freedoms. These obligations help mitigate risks related to privacy, security, and bias, and help keep data gathering practices aligned with ethical and legal standards. Two concrete controls are: (1) data minimization (collecting only the data strictly necessary for the intended purpose), and (2) detailed data provenance records that trace the origin and handling of data throughout its lifecycle.
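As an illustration only, the two controls named above (data minimization and provenance records) could be sketched in Python roughly as follows. The field allow-list, the ProvenanceRecord structure, and the gather function are hypothetical names chosen for this sketch; they are not prescribed by GDPR or the NIST AI RMF, and a real system would need to reflect its own purposes and legal advice.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical allow-list: the only fields needed for the stated purpose.
ALLOWED_FIELDS = {"age_band", "region", "purchase_total"}


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry stored alongside each gathered record."""
    source: str            # where the record came from
    purpose: str           # why it was collected
    collected_at: str      # ISO 8601 timestamp in UTC
    fields_kept: tuple     # field names retained after minimization
    fields_dropped: tuple  # field names discarded (names only, never values)
    content_hash: str      # SHA-256 of the minimized record


def minimize(record: dict, allowed: set) -> dict:
    """Data minimization: keep only the allow-listed fields."""
    return {k: v for k, v in record.items() if k in allowed}


def gather(record: dict, source: str, purpose: str):
    """Minimize one raw record and build its provenance entry."""
    kept = minimize(record, ALLOWED_FIELDS)
    # Hash a canonical JSON form so later tampering can be detected.
    payload = json.dumps(kept, sort_keys=True).encode("utf-8")
    provenance = ProvenanceRecord(
        source=source,
        purpose=purpose,
        collected_at=datetime.now(timezone.utc).isoformat(),
        fields_kept=tuple(sorted(kept)),
        fields_dropped=tuple(sorted(set(record) - set(kept))),
        content_hash=hashlib.sha256(payload).hexdigest(),
    )
    return kept, provenance


raw = {
    "name": "A. Person",
    "email": "a@example.com",
    "age_band": "30-39",
    "region": "EU",
    "purchase_total": 10.5,
}
kept, provenance = gather(raw, source="crm_export", purpose="sales analytics")
```

In this sketch the name and email fields are discarded before storage, and the provenance entry records only that they were dropped, not their values.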

Ethical & Societal Implications

Data gathering raises significant ethical and societal concerns, including privacy violations, inadequate consent, unclear data ownership, and the risk of amplifying biases present in source data. Collecting sensitive or personal information without adequate safeguards can erode public trust and disproportionately impact vulnerable groups. There is also a risk of excluding minority groups if their data is underrepresented, leading to biased or unfair AI systems. Additionally, the scope and purpose of data collection must be transparent to avoid misuse or function creep. Responsible data gathering practices are essential to uphold individual rights and foster equitable AI outcomes.
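The underrepresentation risk mentioned in this section can be checked in a simple way before training. The following is a minimal sketch, assuming each record carries a group label; the underrepresented_groups function, the "group" key, and the 10% threshold are hypothetical choices made for illustration, not an established standard. Group labels may themselves be sensitive data and would fall under the same collection safeguards.

```python
from collections import Counter


def underrepresented_groups(records, group_key, min_share=0.10):
    """Return {group: share} for groups whose share of records is below min_share.

    min_share is an illustrative threshold, not a recognized fairness criterion.
    """
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.items() if c / total < min_share}


# Toy dataset: 19 records from group "a" and 1 record from group "b".
records = [{"group": "a"}] * 19 + [{"group": "b"}]
flagged = underrepresented_groups(records, group_key="group")
```

A flagged group is a prompt for review (for example, targeted collection or reweighting), not proof that the resulting model will be unfair.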

Key Takeaways

Data gathering is foundational to AI development and impacts model quality.
Structured and unstructured data require different collection and management approaches.
Regulatory frameworks like GDPR impose strict obligations on data collection and consent.
Controls such as data minimization and provenance tracking are essential for governance.
Ethical considerations include privacy, transparency, and avoiding bias amplification.
Integrating data from multiple sources increases complexity and risk of inconsistency.
Clear data retention and deletion policies are necessary to comply with regulations.
