Set up a standard job (headless)

NOTICE: The guide below only pertains to DataSync versions 1.0 and higher.

NOTICE: Before using DataSync in headless mode, we recommend familiarizing yourself with DataSync through the UI. For information on using DataSync’s UI, please see the Setup a Standard Job (GUI) guide.

DataSync’s command line interface, or “headless mode,” enables easy integration of DataSync into ETL code or other software systems. DataSync jobs can be run from the command line in one of two ways: (1) passing job parameters as command-line arguments/flags or (2) running an .sij file that was previously saved using the user interface. This guide focuses on (1).

Step 1: Establish your configuration (e.g. authentication details)

Information about your domain, username, password, and app token is required for all DataSync jobs. Note that the user running the job must have publisher rights on the dataset. A number of other global settings, such as logging and emailing preferences, can also be configured. Please refer to the configuration guide to establish your credentials and preferences.

Step 2: Configure job details

For general help using DataSync in headless/command-line mode run:

java -jar <DATASYNC_JAR> --help

To run a job, execute the following command, replacing <..> with the appropriate values (flags are explained below):

java -jar <DATASYNC_JAR> -c <CONFIG.json FILE> -f <FILE TO PUBLISH> -h <HAS HEADER ROW> -i <DATASET ID> -m <PUBLISH METHOD> -pf <PUBLISH VIA FTP> -ph <PUBLISH VIA HTTP> -cf <FTP CONTROL.json FILE>

Explanation of flags (* = required flag):

-c, --config (e.g. /Users/home/config.json)
    Points to the config.json file you created in Step 1.

-f *, --fileToPublish (e.g. /Users/home/data_file.csv)
    CSV or TSV file to publish.

-h, --fileToPublishHasHeaderRow (e.g. true)
    Set this to true if the file to publish has a header row; otherwise set it to false.

-i *, --datasetID (e.g. m985-ywaw)
    The identifier of the dataset to publish to.

-m, --publishMethod (e.g. replace)
    Specifies the publish method to use (replace, upsert, append, and delete are the only acceptable values). For details on the publishing methods, refer to Step 3 of the Setup a Standard Job (GUI) guide.

-ph, --publishViaHttp (e.g. true)
    Set this to true to use HTTP (rather than FTP or Soda2). This is the preferred method because it is highly efficient and can reliably handle very large files (1 million+ rows). If this and --publishViaFTP are both false, the dataset update is performed using Soda2. (false is the default value)

-pf, --publishViaFTP (e.g. true)
    Set this to true to use FTP (currently only works for replace). If this and --publishViaHttp are both false, the dataset update is performed using Soda2. (false is the default value)

-cf, --pathToControlFile (e.g. /Users/home/control.json)
    Specifies a control file that configures HTTP and ‘replace via FTP’ jobs. Only required when --publishViaHttp or --publishViaFTP is set to true. When this flag is set, the --fileToPublishHasHeaderRow and --publishMethod flags are overridden by the settings in the supplied control file.

-t *, --jobType (e.g. LoadPreferences)
    Specifies the type of job to run (IntegrationJob, LoadPreferences, and PortJob are the only acceptable values).

Step 3: Job Output

Information about the status of the job is written to STDOUT. If the job runs successfully, a ‘Success’ message is written to STDOUT and the job exits with a normal status code (0). If there was a problem running the job, a detailed error message is written to STDERR and the program exits with an error status code (1). You can capture the exit code to configure error-handling logic within your ETL process.
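The exit-code behavior above can be used from a wrapper script. The sketch below is illustrative, not part of DataSync: `run_and_report` is a hypothetical helper name, and the command you pass it would be your actual `java -jar` invocation.

```shell
# Hypothetical wrapper illustrating exit-code handling around a DataSync run.
# "$@" is the full command to execute, for example:
#   run_and_report java -jar datasync.jar -c config.json -f data.csv -i m985-ywaw -m replace
run_and_report() {
  "$@"
  status=$?
  if [ "$status" -eq 0 ]; then
    # DataSync exited with 0: the job succeeded
    echo "DataSync job succeeded"
  else
    # Non-zero exit code: the job failed; details were written to STDERR by DataSync
    echo "DataSync job failed with exit code $status" >&2
  fi
  return "$status"
}
```

Because the wrapper propagates the exit code, it can be chained with `&&`/`||` or checked in a scheduler the same way as the raw `java -jar` command.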

Complete example job

java -jar <DATASYNC_JAR> -c config.json -f business_licenses_2014-02-10.csv -h true -i 7tgi-grrk -m replace -pf true -cf control.json

config.json contents:

{
    "domain": "https://opendata.socrata.com",
    "username": "publisher@opendata.socrata.com",
    "password": "secret_password",
    "appToken": "fPsJQRDYN9KqZOgEZWyjoa1SG",
    "adminEmail": "",
    "emailUponError": "false",
    "logDatasetID": "",
    "outgoingMailServer": "",
    "smtpPort": "",
    "sslPort": "",
    "smtpUsername": "",
    "smtpPassword": ""
}

control.json contents:

{
  "action" : "Replace",
  "csv" :
    {
      "useSocrataGeocoding" : true,
      "columns" : null,
      "skip" : 0,
      "fixedTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy"],
      "floatingTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy"],
      "timezone" : "UTC",
      "separator" : ",",
      "quote" : "\"",
      "encoding" : "utf-8",
      "emptyTextIsNull" : true,
      "trimWhitespace" : true,
      "trimServerWhitespace" : true,
      "overrides" : {}
    }
}
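Since a malformed config.json or control.json will cause the job to fail, it can be useful to check the JSON syntax of both files before running. The helper below is not part of DataSync; `json_ok` is a hypothetical name, and it simply leans on Python's standard-library json.tool module.

```shell
# Hypothetical helper (not part of DataSync): succeed only if the given file
# parses as valid JSON, using Python's standard-library json.tool module.
json_ok() { python3 -m json.tool "$1" > /dev/null 2>&1; }

# Usage:
# json_ok config.json && json_ok control.json && echo "both files parse"
```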

Running a previously saved job file (.sij file)

Simply run:

java -jar <DATASYNC_JAR> <.sij FILE TO RUN>

For example:

java -jar <DATASYNC_JAR> /Users/john/Desktop/business_licenses.sij

NOTE: you can also create an .sij file directly (rather than saving a job using the DataSync UI) which stores the job details in JSON format. Here is an example:

{
    "datasetID" : "2bw7-dr67",
    "fileToPublish" : "/Users/john/Desktop/building_permits_2014-12-05.csv",
    "publishMethod" : "replace",
    "fileToPublishHasHeaderRow" : true,
    "publishViaFTP" : true,
    "pathToFTPControlFile" : "/Users/john/Desktop/building_permits_control.json"
}
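Since each .sij file runs with a single command, a set of saved jobs can be scripted in sequence. The sketch below is an assumption-laden example, not part of DataSync: `run_sij_jobs` and `DATASYNC_CMD` are hypothetical names, with `DATASYNC_CMD` standing in for your actual `java -jar <DATASYNC_JAR>` invocation.

```shell
# Hypothetical batch runner (not part of DataSync): run each saved .sij job in
# turn, stopping on the first failure. DATASYNC_CMD is assumed to hold the
# command that runs a single job, e.g. "java -jar datasync.jar".
run_sij_jobs() {
  for job in "$@"; do
    # Run one saved job; on a non-zero exit code, report which file failed
    $DATASYNC_CMD "$job" || { echo "failed: $job" >&2; return 1; }
  done
  echo "all jobs succeeded"
}

# Usage:
# DATASYNC_CMD="java -jar datasync.jar"
# run_sij_jobs /Users/john/Desktop/*.sij
```

Stopping on the first failure keeps a broken upstream job from cascading into later ones; drop the `return 1` if you prefer best-effort execution of the remaining jobs.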