Innovating means operating on unfamiliar ground. Particularly when it comes to (generative) AI, laws and regulations may still be lacking and best practices are still limited, yet we have to make responsible choices about the development of the model and the project process. The commitments are therefore not set in stone. After a year of GPT-NL, we have a better understanding of where we stand and of what is and is not possible. That is why, after one year of GPT-NL, we carried out a review* of our commitments: this lets us show in a structured way which goals we have achieved, and where we have had to adjust commitments because the framework of laws and regulations unfortunately leaves us no other option.
Our commitments, regarding:
1. Project process
1.1 Publishing this commitments document. We will be reviewing this commitment list on a regular basis to incorporate feedback and publicly report on any changes to the commitments.
1.2 Publishing a document that describes how we have built our dataset. We will publish this when the dataset is finished.
2. The end-product of the GPT-NL project
i. A blueprint for ethical and responsible development of LLMs,
ii. A research facility,
iii. A first version of a large language model, named GPT-NL, with a performance comparable to Llama 2 7B and GPT-3 175B, to be further divided into the following elements:
a. The training dataset;
b. The source code;
c. The model weights.
2.1 Publishing a definition of success for the GPT-NL project (describing when we consider the project a success). This should be published no later than the end of the data collection milestone.
2.2 Publishing an overview of the end-products intended by the project; including a definition of success for each end-product, a description of intended goal, openness, and licensing. Our intention is that the end-product will be as open as possible, but as this is dependent on agreements with data providers, we cannot guarantee this yet.
2.3 Providing a clearly described license with each end-product.
3. Transparency and accessibility
3.1 Publishing all code publicly under an open-source license.
3.2 Publishing datasheets and model-cards for all datasets and models according to industry best practices.
3.3 The ambition to release and publish the datasets used to train GPT-NL by default. However, some datasets might be protected by copyright law, which would limit our ability to publish them. For those datasets we will give explicit attention to creating other mechanisms of transparency.
3.4 Investigating, in cooperation with the Content Board, whether there are ways to give researchers and/or auditors secured access to the training dataset of GPT-NL.
3.5 Making GPT-NL as accessible as possible, meaning that:
1) we try to give free access to researchers and/or auditors; and
2) we create licenses that suit all stakeholders of GPT-NL.
4. The use of data (content)
4.1 We will only use content for training GPT-NL if we have the appropriate rights. This means that we only use data that is in the public domain or openly licensed (CC-0 and CC-BY), or data from content providers that grant us a license to do so.
4.2 We do not train GPT-NL on any information subject to regulatory or contractual confidentiality requirements (such as information covered by patient confidentiality or business-confidential data).
4.3 We place a dedicated focus on detecting, filtering, and removing personal information from the training data. We support data providers in removing personal information from the dataset.
4.4 We place a dedicated focus on detecting, filtering, and removing harmful content, such as violent or criminal content, discriminatory content, or hate speech, from our training data.
5. Diversity and inclusion
5.1 Mitigating bias in the model to the best of our ability by creating a diverse foundation dataset that represents as many groups as possible. We focus on minority groups in the categories below.
a) Gender identity
b) Age
c) Ethnicity
d) Religion
e) Sexual orientation
f) Disabilities
g) Socio-economic status
5.2 Involving underrepresented groups to help us improve the model in the fine-tuning stage. We look at the same underrepresented groups as listed in 5.1.
6. Stakeholders and communication
6.1 We proactively communicate and regularly share our progress with the general public:
a) with explanatory articles
b) via external media
c) with progress reports
6.2 Publishing regular (public) reports on decisions made within the project, including reporting on legal and ethical dilemmas and decisions.
6.3 Reporting on the conclusions of our stakeholder consultations (see 6.4).
6.4 We will organise public stakeholder consultations for at least the following:
6.4a Involvement in preparation of the fine-tuning stage.
6.4b Consultation on methods of evaluating model performance (on technical and societal benchmarks).
Stakeholder consultations will be announced publicly on the GPT-NL website and social media channels.
6.5 We give stakeholders who are data providers (Content Board) and/or users (User Board) a structured say in the project, to facilitate co-creation of GPT-NL.
*The commitment review was completed and published on 7 April 2025.