BigData - Handling large amounts of data with DaRUS

For a smooth upload of very large files to DaRUS we recommend using the API instead of the web interface and uploading directly to the data backend. The DVUploader implements the individual steps.

Upload of Files

Upload via Web-Interface

DaRUS, or rather the underlying software Dataverse, offers a convenient web interface for creating records and uploading files, which works very well for files up to 2 GB in size.

After uploading, additional information (description, tags, folder structure) can be added to the files.

Upload via API

For files up to about 100 GB the DaRUS API can be used. This requires an API token and the ID of the record to which the file is to be added.

$API_TOKEN: Your API token which you can create in your user account.
$SERVER_URL: https://darus.uni-stuttgart.de (production system) or https://demodarus.izus.uni-stuttgart.de (test system)
$PERSISTENT_ID: ID of the record, to be found within the URL of the record page (e.g. doi:10.18419/darus-444)

curl -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"<description>", "directoryLabel":"<directory>", "categories":<categories>, "restrict": "false"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

At the same time, information about the file can (but does not have to) be transmitted:

  • <description>: Textual description of the file
  • <directory>: Folder path under which the file is displayed in the record, e.g. /images/subdir
  • <categories>: List of tags, e.g. ["Data", "Documentation", "Code"]
  • "restrict": "true" restricts access to the file
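
Quoting the jsonData value by hand is error-prone, so it can help to assemble it in a small shell function. The following is only a sketch: build_json_data is a hypothetical helper name, and no JSON escaping is performed, so the values must not contain double quotes or backslashes.

```shell
# build_json_data (hypothetical helper) prints the jsonData value for the
# "add file" call shown above.
# Arguments: description, directory, categories (as JSON list), restrict.
# Assumption: no value contains double quotes or backslashes.
build_json_data() {
    printf '{"description":"%s", "directoryLabel":"%s", "categories":%s, "restrict":"%s"}' \
        "$1" "$2" "$3" "$4"
}

# Usage (not executed here):
# JSON_DATA=$(build_json_data "Raw images" "/images/subdir" '["Data"]' "false")
# curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
#      -F "file=@$FILENAME" -F "jsonData=$JSON_DATA" \
#      "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
```

Keeping the JSON in one place makes it easier to reuse the same metadata for several files.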

As an alternative to addressing the API directly, libraries such as pyDataverse or tools such as the DaRUS app can also be used.

Direct Upload to the Data Backend

To completely avoid timeouts of the Dataverse server, very large files can be uploaded directly to the S3 backend. However, this requires that direct upload is activated for the corresponding dataverse. Please contact your local administrator if you are not sure.

The upload then takes place in three steps:

$API_TOKEN: Your API token, which you can create in your user account.
$SERVER_URL: https://darus.uni-stuttgart.de (production system) or https://demodarus.izus.uni-stuttgart.de (test system)
$PERSISTENT_ID: ID of the record, to be found within the URL of the record page (e.g. doi:10.18419/darus-444)

$SIZE: size of the file in bytes, can be determined under Linux with ls -l <filename>
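
In a script, the output of ls -l would have to be parsed to get the byte count. As a possible alternative, wc -c prints just the number. The function name file_size below is a hypothetical helper, not part of DaRUS or Dataverse.

```shell
# file_size (hypothetical helper) prints the size of a file in bytes.
# wc -c counts bytes; tr removes the padding some wc versions add.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Usage (not executed here):
# SIZE=$(file_size "$FILENAME")
```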

  1. Generate one-time upload URL:
    curl -H "X-Dataverse-key: $API_TOKEN" -X GET "$SERVER_URL/api/datasets/:persistentId/uploadurls/?persistentId=$PERSISTENT_ID&size=$SIZE"
    From the JSON response, you need the "url" and the "storageIdentifier".

    If the file is larger than the single-part limit, i.e. the maximum size that can be uploaded in one piece (currently 100 GB), several "url" values are returned together with a "complete" URL. After all parts have been uploaded, the "complete" URL must be called with the eTags returned for the parts (multi-part upload).

  2. Direct upload of the file to the S3 backend:
    curl -i -X PUT -H "x-amz-tagging:dv-state=temp" -T <path-to-file> "<url>"
    This uploads the file to the data backend, where it is initially marked as temporary and not yet linked to the record. Depending on the file size, this upload can take a long time.

    Multi-part upload: For large files, this command must be executed once for each URL returned in step 1, each time with the corresponding section of the data. (To split the file, you can use the Linux command split, for example.) An eTag is returned after each upload. All these eTags must be sent to the "complete" URL returned in step 1 to complete the multi-part upload:

    curl -X PUT "$SERVER_URL/api/datasets/mpload?..." -d '{"1":"<eTag1 string>","2":"<eTag2 string>","3":"<eTag3 string>","4":"<eTag4 string>","5":"<eTag5 string>"}'

  3. Register the file with Dataverse:
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST -F 'jsonData={"description":"<description>", "storageIdentifier":"<storageIdentifier>", "fileName":"<filename>", "mimeType":"<mimeType>", "checksum": {"@type":"MD5", "@value":"<md5sum>"}}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
    Since Dataverse can no longer determine the file information itself, it must now be passed as metadata:
    • <description>: textual description of the file
    • <storageIdentifier>: returned when generating the one-time upload URL
    • <mimeType>: file type as MIME type (list of common MIME types, complete list of all MIME types)
    • <md5sum>: MD5 hash of the file; under Linux you can get it with md5sum <filename>

Optionally, you can also provide the following additional information about the file:

    • "directoryLabel": "/foldername/subfolder", the path to the file, in case the file is to be displayed within the data set in a specific folder structure
    • "categories": [<category>], in case the file is to be given a tag of the category <category> (e.g. Data, Code, Documentation)
    • "restrict": "true", in case access to the file is to be restricted
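
For the multi-part case described in step 2, splitting the file and assembling the eTag list can also be scripted. The following is only a sketch under assumptions: the helper names are hypothetical, the part size passed to split_into_parts is a placeholder for whatever part size applies to the URLs returned in step 1, and the eTags are assumed to be passed without surrounding double quotes.

```shell
# split_into_parts (hypothetical helper) cuts a file into pieces of at most
# <part size> bytes. Arguments: file, part size in bytes, output prefix.
# Parts are named <prefix>aaa, <prefix>aab, ... so that alphabetical order
# equals upload order.
split_into_parts() {
    split -b "$2" -a 3 "$1" "$3"
}

# build_etag_json (hypothetical helper) turns the eTags, given in upload
# order, into the body for the "complete" call: {"1":"...","2":"..."}
build_etag_json() {
    n=0
    body=""
    for etag in "$@"; do
        n=$((n + 1))
        # Add a comma before every entry except the first.
        body="$body${body:+,}\"$n\":\"$etag\""
    done
    printf '{%s}' "$body"
}

# Usage (not executed here):
# split_into_parts bigfile.bin "$PART_SIZE" part_
# ... upload part_aaa, part_aab, ... to the URLs from step 1, note the eTags ...
# curl -X PUT "<complete URL from step 1>" -d "$(build_etag_json "$ETAG1" "$ETAG2")"
```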

For more information about uploading via API, see the Dataverse guides.

If the manual upload feels too complex for you, you can also use the DVUploader with the option -directupload.

 

 

Using the DVUploader

The DVUploader (Dataverse Uploader) is a Java-based command-line tool for uploading large numbers of files or very large files. It also supports direct upload to our data backend. It is developed by the Dataverse community and can be downloaded as a JAR file.

To use it, an API token must be generated and the ID of the dataset to which the file(s) are to be uploaded must be known.

java -jar DVUploader-v1.0.0.jar -server=<ServerUrl> -did=<Dataset DOI> -key=<User's API Key> <file or directory list>

<User's API Key>: Your API token that you can create in your user account.
<ServerUrl>: https://darus.uni-stuttgart.de (production system) or https://demodarus.izus.uni-stuttgart.de (test system)
<Dataset DOI>: ID of the dataset, to be found within the URL of the dataset page (e.g. doi:10.18419/darus-444)

For direct upload to the data backend, add the option -directupload:

java -jar DVUploader-v1.0.0.jar -server=<ServerUrl> -did=<Dataset DOI> -key=<User's API Key> -directupload <file or directory list>
