BigData - Handling large amounts of data with DaRUS

For a smooth upload of very large files to DaRUS we recommend using the API instead of the web interface and uploading directly to the data backend. The DVUploader implements the individual steps.

Upload of Files

Upload via Web-Interface

DaRUS, or rather the underlying software Dataverse, offers a convenient web interface for creating records and uploading files. It works very well for files up to 2 GB in size.

After uploading, additional information (description, tags, folder structure) can be added to the files.

Upload via API

For files up to about 100 GB the DaRUS API can be used. This requires an API key and the ID of the record to which the file is to be added.

$API_TOKEN: Your API token which you can create in your user account.
$SERVER_URL: (production system) or (test system)
$PERSISTENT_ID: ID of the record, to be found within the URL of the record page (e.g. doi:10.18419/darus-444)

curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"<description>", "directoryLabel":"<directory>", "categories":<categories>, "restrict": "false"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

At the same time, information about the file can (but does not have to) be transmitted:

  • <description>: Textual description of the file
  • <directory>: Folder path under which the file is placed in the record, e.g. /images/subdir
  • <categories>: List of tags, e.g. ["Data", "Documentation", "Code"]
  • "restrict": Set to "true" to restrict access to the file

As an alternative to addressing the API directly, libraries such as pyDataverse or tools such as the DaRUS app can also be used.
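
The curl call above can be wrapped in a small shell script. The following is a minimal sketch, not an official DaRUS tool: the function names build_json_data and upload_file are hypothetical, and the script assumes that API_TOKEN, SERVER_URL and PERSISTENT_ID are set in the environment as described above.

```shell
#!/bin/sh
# Sketch only: build_json_data and upload_file are hypothetical helper names.
# Values are inserted verbatim, so they must not contain double quotes,
# backslashes or semicolons (curl -F treats ';' as a parameter separator).

# Print the jsonData payload for the add-file request.
# $1 description, $2 directory label, $3 categories as a JSON list, $4 restrict (true/false)
build_json_data() {
    printf '{"description":"%s","directoryLabel":"%s","categories":%s,"restrict":"%s"}' \
        "$1" "$2" "$3" "$4"
}

# Upload one file to the record identified by $PERSISTENT_ID.
# $1 path to the file, $2 jsonData payload
upload_file() {
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
        -F "file=@$1" \
        -F "jsonData=$2" \
        "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
}

# Example call (commented out so that nothing is uploaded by accident):
# upload_file results.csv "$(build_json_data "Raw results" "/data" '["Data"]' false)"
```

Keeping the payload construction separate from the upload makes it possible to inspect the JSON before anything is sent to the server.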

Direct Upload to the Data Backend

To completely avoid timeouts of the Dataverse server, very large files can be uploaded directly to the S3 backend. However, this requires that direct upload is activated for the corresponding dataverse. Please contact your local administrator if you are not sure.

The upload then takes place in three steps:

$API_TOKEN: Your API token, which you can create in your user account.
$SERVER_URL: (production system) or (test system)
$PERSISTENT_ID: ID of the record, to be found within the URL of the record page (e.g. doi:10.18419/darus-444)

  1. Generate a one-time upload URL:
    curl -H "X-Dataverse-key: $API_TOKEN" -X GET "$SERVER_URL/api/datasets/:persistentId/uploadsid/?persistentId=$PERSISTENT_ID"
    From the JSON response, you need the "url" and the "storageIdentifier"
  2. Upload the file directly to the S3 backend:
    curl -X PUT -H "x-amz-tagging: dv-state=temp" --upload-file <path-to-file> "<url>"
    This uploads the file to the data backend, where it is initially marked as temporary and not yet linked to the record. Depending on the file size, the upload can take a long time.
  3. Register the file with Dataverse:
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST -F 'jsonData={"description":"<description>", "storageIdentifier": "<storageIdentifier>", "fileName": "<filename>", "mimeType": "<mimeType>", "md5Hash": "<md5sum>", "fileSize": "<filesize>"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID" 
    Since the file information can no longer be determined by Dataverse itself, it must now be passed as metadata:
    • <description>: textual description of the file
    • <storageIdentifier>: Returned when creating the one-time upload URL
    • <mimeType>: File type as a MIME type (list of common MIME types, complete list of all MIME types)
    • <md5sum>: MD5 hash of the file; under Linux you can get it with md5sum <filename>
    • <filesize>: File size in bytes, under Linux you can determine it with ls -l <filename>
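
The metadata for step 3 can be collected with a few shell helpers. This is a minimal sketch under stated assumptions: the function names file_md5, file_size and build_register_json are hypothetical, md5sum is assumed to be available (as on Linux), and the MIME type is passed in by the caller rather than detected.

```shell
#!/bin/sh
# Sketch only: file_md5, file_size and build_register_json are hypothetical helper names.
# Values are inserted verbatim, so they must not contain double quotes or backslashes.

# MD5 checksum of a file (first field of the md5sum output).
file_md5() {
    md5sum "$1" | cut -d ' ' -f 1
}

# File size in bytes; wc -c avoids parsing the columns of ls -l.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Print the jsonData payload for registering the file in step 3.
# $1 path to the file, $2 description, $3 storageIdentifier from step 1, $4 MIME type
build_register_json() {
    printf '{"description":"%s","storageIdentifier":"%s","fileName":"%s","mimeType":"%s","md5Hash":"%s","fileSize":"%s"}' \
        "$2" "$3" "$(basename "$1")" "$4" "$(file_md5 "$1")" "$(file_size "$1")"
}

# Example (the storageIdentifier must be the value returned in step 1):
# build_register_json bigfile.h5 "Simulation output" "<storageIdentifier>" application/x-hdf5
```

The printed string can then be passed as the jsonData form field of the curl call in step 3.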

If the manual upload is too cumbersome, you can use the DVUploader with the -directupload option instead.

Using the DVUploader

The DVUploader (Dataverse Uploader) is a Java-based command-line tool for uploading many files or very large files. It also supports direct upload to our data backend. It is developed by the Dataverse community and can be downloaded as a JAR file.

To use it, an API token must be generated and the ID of the dataset to which the file(s) are to be uploaded must be known.

java -jar DVUploader-v1.0.0.jar -server=<ServerUrl> -did=<Dataset DOI> -key=<User's API Key> <file or directory list>

<User's API Key>: Your API token that you can create in your user account.
<ServerUrl>: (production system) or (test system)
<Dataset DOI>: ID of the dataset, to be found within the URL of the dataset page (e.g. doi:10.18419/darus-444)

For direct upload to the data backend, add the option -directupload:

java -jar DVUploader-v1.0.0.jar -server=<ServerUrl> -did=<Dataset DOI> -key=<User's API Key> -directupload <file or directory list>
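
To check a DVUploader call before starting a long upload, the command line can be assembled and printed first. The following is a minimal sketch: dvuploader_cmd is a hypothetical helper name, it uses only the options shown above, and it assumes the JAR file is named DVUploader-v1.0.0.jar as in the examples.

```shell
#!/bin/sh
# Sketch only: dvuploader_cmd is a hypothetical helper name.
# It prints the DVUploader command line instead of running it.
# File names containing spaces are not quoted in the printed line.
# $1 server URL, $2 dataset DOI, $3 API key,
# $4 "direct" to add -directupload (anything else omits it),
# remaining arguments: files or directories to upload
dvuploader_cmd() {
    server=$1
    did=$2
    key=$3
    mode=$4
    shift 4
    direct=""
    if [ "$mode" = "direct" ]; then
        direct=" -directupload"
    fi
    echo "java -jar DVUploader-v1.0.0.jar -server=$server -did=$did -key=$key$direct $*"
}

# Example:
# dvuploader_cmd "$SERVER_URL" doi:10.18419/darus-444 "$API_TOKEN" direct data/
```

Because the printed line contains the API key, it should not be written to shared logs.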

