{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using DaRUS via API\n", "\n", "This notebook introduces how DaRUS can be accessed through its Application Programming Interfaces (APIs). This is helpful when work with DaRUS needs to be automated, and it allows other programs and scripts to be connected to the data repository.\n", "\n", "We will look at two ways of working with the APIs: one using [curl](https://curl.se/) and the other using a Python library called [pyDataverse](https://github.com/gdcc/pyDataverse)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n", "\n", "Assuming that curl is already installed on the system, we need some environment variables for later use. If you don't have an [API token](https://guides.dataverse.org/en/latest/api/auth.html) yet, you can create one by clicking on your account name in the navbar (after logging in to DemoDaRUS / DaRUS) and selecting \"API Token\" from the dropdown menu. In this tab, click \"Create Token\". A common mistake is that the server and the API token do not match, so make sure you pick the API token that belongs to the DaRUS instance you work with.\n", "\n", "We also set these variables in Python, and since pyDataverse is not part of this Jupyter installation, it has to be installed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "curl" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: SERVER_URL=https://demodarus.izus.uni-stuttgart.de\n", "env: API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\n" ] } ], "source": [ "# for production use https://darus.uni-stuttgart.de\n", "%env SERVER_URL=https://demodarus.izus.uni-stuttgart.de\n", "# the API token represents your login (password), so keep it secret\n", "%env API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pyDataverse" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "SERVER_URL=\"https://demodarus.izus.uni-stuttgart.de\"\n", "API_TOKEN=\"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting pyDataverse\n", " Downloading pyDataverse-0.3.1-py3-none-any.whl (32 kB)\n", "Requirement already satisfied: jsonschema>=3.2.0 in /opt/conda/lib/python3.9/site-packages (from pyDataverse) (4.2.1)\n", "Requirement already satisfied: requests>=2.12.0 in /opt/conda/lib/python3.9/site-packages (from pyDataverse) (2.26.0)\n", "Requirement already satisfied: attrs>=17.4.0 in /opt/conda/lib/python3.9/site-packages (from jsonschema>=3.2.0->pyDataverse) (21.2.0)\n", "Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /opt/conda/lib/python3.9/site-packages (from jsonschema>=3.2.0->pyDataverse) (0.18.0)\n", "Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.9/site-packages (from requests>=2.12.0->pyDataverse) (2.0.0)\n", "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests>=2.12.0->pyDataverse) (3.1)\n", "Requirement already satisfied: certifi>=2017.4.17 in 
/opt/conda/lib/python3.9/site-packages (from requests>=2.12.0->pyDataverse) (2021.10.8)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.9/site-packages (from requests>=2.12.0->pyDataverse) (1.26.7)\n", "Installing collected packages: pyDataverse\n", "Successfully installed pyDataverse-0.3.1\n" ] } ], "source": [ "! pip install pyDataverse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating new datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 1: Based on a simple example JSON file\n", "The Dataverse documentation provides a simple JSON file that can serve as a basis for creating datasets via the API. You can download the file as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-05-30 07:53:04-- https://guides.dataverse.org/en/5.5/_downloads/fc56af1c414df69fd4721ce3629f0c03/dataset-finch1.json\n", "Resolving guides.dataverse.org (guides.dataverse.org)... 18.213.227.1\n", "Connecting to guides.dataverse.org (guides.dataverse.org)|18.213.227.1|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 2346 (2.3K) [application/json]\n", "Saving to: ‘dataset-finch1.json’\n", "\n", "dataset-finch1.json 100%[===================>] 2.29K --.-KB/s in 0s \n", "\n", "2022-05-30 07:53:05 (359 MB/s) - ‘dataset-finch1.json’ saved [2346/2346]\n", "\n" ] } ], "source": [ "! wget https://guides.dataverse.org/en/5.5/_downloads/fc56af1c414df69fd4721ce3629f0c03/dataset-finch1.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspect the file with an editor and change it as you like. For automation you can use, e.g., [jq](https://stedolan.github.io/jq/) in bash or the json library in Python.\n", "Once the file is ready, upload it to the server. For that, you need the id of the dataverse in which the dataset should reside. 
You can find it in the URL of the dataverse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "curl" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"status\":\"OK\",\"data\":{\"id\":8038,\"persistentId\":\"doi:10.15770/darus-1315\"}}" ] }, { "name": "stderr", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 2380 100 75 100 2305 83 2567 --:--:-- --:--:-- --:--:-- 2650\n" ] } ], "source": [ "%%bash\n", "export PARENT=fokus_hod\n", "curl -H \"X-Dataverse-key:$API_TOKEN\" -H \"Content-Type: application/json\" -X POST \"$SERVER_URL/api/dataverses/$PARENT/datasets\" --upload-file dataset-finch1.json\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pyDataverse" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset with pid 'doi:10.15770/darus-1316' created.\n" ] }, { "data": { "text/plain": [ "{'status': 'OK',\n", " 'data': {'id': 8039, 'persistentId': 'doi:10.15770/darus-1316'}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyDataverse.api import NativeApi\n", "# read the dataset description as a JSON string\n", "with open(\"dataset-finch1.json\") as f:\n", "    dataset_example = f.read()\n", "# id of the dataverse in which the dataset is created\n", "PARENT=\"fokus_hod\"\n", "api = NativeApi(SERVER_URL, API_TOKEN)\n", "resp = api.create_dataset(PARENT, dataset_example)\n", "resp.json()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset with pid 'doi:10.15770/darus-1317' created.\n" ] }, { "data": { "text/plain": [ "{'status': 'OK',\n", " 'data': {'id': 8040, 'persistentId': 'doi:10.15770/darus-1317'}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# an example of changing 
the title programmatically\n", "import json\n", "with open(\"dataset-finch1.json\") as f:\n", "    metadata = json.load(f)\n", "# the title is one of the fields of the citation metadata block\n", "fields = metadata[\"datasetVersion\"][\"metadataBlocks\"][\"citation\"][\"fields\"]\n", "for field in fields:\n", "    if field[\"typeName\"] == 'title':\n", "        field[\"value\"] = 'The title of a dataset should be as specific as possible'\n", "        break\n", "resp = api.create_dataset(PARENT, json.dumps(metadata))\n", "resp.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 2: Based on an existing dataset\n", "If a well-described dataset already exists that you want to use as a basis, you can start from there. Download its JSON representation, modify it, and create a new dataset as done above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "curl" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 1614 100 1614 0 0 24437 0 --:--:-- --:--:-- --:--:-- 24830\n" ] } ], "source": [ "%%bash\n", "export PERSISTENT_IDENTIFIER=doi:10.15770/darus-1312\n", "curl -H \"X-Dataverse-key:$API_TOKEN\" -H \"Content-Type: application/json\" \"$SERVER_URL/api/datasets/:persistentId/?persistentId=$PERSISTENT_IDENTIFIER\" | json_pp > \"existing_dataset.json\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract the \"latestVersion\" element from the downloaded JSON file, rename it to \"datasetVersion\", and delete every child element except \"license\", \"metadataBlocks\" and \"termsOfUse\". **Note** that the contact email is removed during export. 
Since it is a required metadata field for every dataset, you have to add it again to the contact field within the citation metadata block.\n", "\n", "```json\n", "\"datasetContactEmail\" : {\n", "    \"multiple\" : false,\n", "    \"typeClass\" : \"primitive\",\n", "    \"typeName\" : \"datasetContactEmail\",\n", "    \"value\" : \"your-email@example.com\"\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"status\":\"OK\",\"data\":{\"id\":8041,\"persistentId\":\"doi:10.15770/darus-1318\"}}" ] }, { "name": "stderr", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 3284 100 75 100 3209 69 2982 0:00:01 0:00:01 --:--:-- 3054\n" ] } ], "source": [ "%%bash\n", "export PARENT=fokus_hod\n", "curl -H \"X-Dataverse-key:$API_TOKEN\" -X POST \"$SERVER_URL/api/dataverses/$PARENT/datasets\" --upload-file existing_dataset.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pyDataverse" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset with pid 'doi:10.15770/darus-1319' created.\n" ] }, { "data": { "text/plain": [ "{'status': 'OK',\n", " 'data': {'id': 8042, 'persistentId': 'doi:10.15770/darus-1319'}}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "with open(\"existing_dataset.json\") as f:\n", "    metadata = json.load(f)\n", "latestVersion = metadata[\"data\"][\"latestVersion\"]\n", "# keep only the elements needed to create a new dataset\n", "new_metadata = {\"datasetVersion\":\n", "    {\"license\": latestVersion[\"license\"],\n", "     \"metadataBlocks\": latestVersion[\"metadataBlocks\"],\n", "     \"termsOfUse\": latestVersion[\"termsOfUse\"]}}\n", "fields = new_metadata[\"datasetVersion\"][\"metadataBlocks\"][\"citation\"][\"fields\"]\n", "for field in fields:\n", "    if field[\"typeName\"] == 
'datasetContact':\n", "        # re-add the contact email that was removed during export\n", "        field[\"value\"][0][\"datasetContactEmail\"] = {\n", "            \"multiple\" : False,\n", "            \"typeClass\" : \"primitive\",\n", "            \"typeName\" : \"datasetContactEmail\",\n", "            \"value\" : \"your-email@example.com\"}\n", "    elif field[\"typeName\"] == 'title':\n", "        field[\"value\"] = 'Example dataset for joint simulation and experimental data created with pyDataverse'\n", "\n", "resp = api.create_dataset(PARENT, json.dumps(new_metadata))\n", "resp.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add metadata\n", "\n", "It is also possible to add or update the metadata of an existing dataset. This can be useful when several people are working on a dataset. For example, one person might create the dataset in the web interface using a template, while another adds specific metadata programmatically based on log file information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The JSON file you need has a simpler structure than the full representation of a dataset. 
Include only the fields you want to add to the dataset, e.g.,\n", "```json\n", "{\"fields\": [\n", "  {\"typeName\" : \"processSoftware\",\n", "   \"value\" : [\n", "     {\"processSoftwareName\" : {\n", "        \"typeName\" : \"processSoftwareName\",\n", "        \"value\" : \"Acquisition software X\"}\n", "     }]\n", "  }]\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "curl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "export PERSISTENT_IDENTIFIER=doi:10.15770/darus-XXXX\n", "curl -H \"X-Dataverse-key: $API_TOKEN\" -X PUT \"$SERVER_URL/api/datasets/:persistentId/editMetadata/?persistentId=$PERSISTENT_IDENTIFIER\" --upload-file dataset-edit-metadata-sample.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pyDataverse" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "# only the fields to be added, not the full dataset representation\n", "additional_metadata = {\"fields\": [\n", "  {\"typeName\" : \"processSoftware\",\n", "   \"value\" : [\n", "     {\"processSoftwareName\" : {\n", "        \"typeName\" : \"processSoftwareName\",\n", "        \"value\" : \"Simulation software X\"}\n", "     }]\n", "  }]\n", "}\n", "PID=\"doi:10.15770/darus-XXXX\"\n", "resp = api.edit_dataset_metadata(PID, json.dumps(additional_metadata))\n", "resp.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload files\n", "\n", "Finally, let's add data to a dataset. The method shown here may not work for files larger than 10 GB. 
If you run into problems, get in touch with the DaRUS team.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "curl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "export PERSISTENT_ID=doi:10.15770/darus-xxxx\n", "export FILENAME=simulation.dat\n", "curl -H \"X-Dataverse-key:$API_TOKEN\" -X POST -F \"file=@$FILENAME\" -F 'jsonData={\"description\":\"Simulation raw data of a random process using the os.urandom function in Python\",\"directoryLabel\":\"data/\",\"categories\":[\"Simulation\"], \"restrict\":\"false\"}' \"$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pyDataverse" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "PID = \"doi:10.15770/darus-xxxx\"\n", "filename=\"random-structure.png\"\n", "metadata = {\"description\": \"By jaacker on Pixabay\", \"restrict\": False}\n", "resp = api.upload_datafile(PID, filename, json_str=json.dumps(metadata), is_pid=True)\n", "resp.json()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pyDataverse does not send the correct MIME type, so let Dataverse redetect it using the dataFile id from the previous output\n", "resp = api.redetect_file_type(identifier='7609', is_pid=False, dry_run=False)\n", "resp.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further reading\n", "The topics and examples in this tutorial are only the tip of the iceberg of what you can do with Dataverse's APIs. You should now be familiar with how they work in principle and can study the [official documentation](https://guides.dataverse.org/en/5.5/api/index.html) further. Make sure that the version of the documentation matches the version of DaRUS. 
You can find the current version in the footer, on the right-hand side of every DaRUS page.\n", "\n", "Besides further API calls, there is also a list of other [client libraries](https://guides.dataverse.org/en/5.5/api/client-libraries.html) that might be of interest to you." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }