Set up a standard job (headless)

NOTICE: The guide below only pertains to DataSync versions 1.0 and higher.

NOTICE: Before using DataSync in headless mode, we recommend familiarizing yourself with DataSync through the UI. For information on using DataSync’s UI, please see the Setup a Standard Job (GUI) guide.

DataSync’s command line interface, or “headless mode,” enables easy integration of DataSync into ETL code or other software systems. DataSync jobs can be run from the command line in one of two ways: (1) passing job parameters as command-line arguments/flags or (2) running an .sij file that was previously saved using the user interface. This guide focuses on (1).

Step 1: Establish your configuration (e.g. authentication details)

Information about your domain, username, password, and app token is required for all DataSync jobs. Note that the user running the job must have publisher rights on the dataset. A number of other global settings, such as logging and emailing preferences, can also be configured. Please refer to the configuration guide to establish your credentials and preferences.

Step 2: Configure job details

For general help using DataSync in headless/command-line mode run:

java -jar <DATASYNC_JAR> --help

To run a job, execute the following command, replacing <..> with the appropriate values (flags are explained below):

java -jar <DATASYNC_JAR> -c <CONFIG.json FILE> -f <FILE TO PUBLISH> -h <HAS HEADER ROW> -i <DATASET ID> -m <PUBLISH METHOD> -pf <PUBLISH VIA FTP> -ph <PUBLISH VIA HTTP> -cf <FTP CONTROL.json FILE>

Explanation of flags (* = required flag):

-c, --config (e.g. /Users/home/config.json)
    Points to the config.json file you created in Step 1.

-f *, --fileToPublish (e.g. /Users/home/data_file.csv)
    CSV or TSV file to publish.

-h, --fileToPublishHasHeaderRow (e.g. true)
    Set this to true if the file to publish has a header row; otherwise set it to false.

-i *, --datasetID (e.g. m985-ywaw)
    The identifier of the dataset to publish to.

-m, --publishMethod (e.g. replace)
    Specifies the publish method to use (replace, upsert, append, and delete are the only acceptable values). For details on the publishing methods, refer to Step 3 of the Setup a Standard Job (GUI) guide.

-ph, --publishViaHttp (e.g. true)
    Set this to true to use HTTP (rather than FTP or Soda2). This is the preferred method because it is highly efficient and can reliably handle very large files (1 million+ rows). If this and --publishViaFTP are both false, the dataset update is performed using Soda2. (false is the default value)

-pf, --publishViaFTP (e.g. true)
    Set this to true to use FTP (currently only works for replace). If this and --publishViaHttp are both false, the dataset update is performed using Soda2. (false is the default value)

-cf, --pathToControlFile (e.g. /Users/home/control.json)
    Specifies a control file that configures HTTP and ‘replace via FTP’ jobs. Only required when --publishViaHttp or --publishViaFTP is set to true. When this flag is set, the --fileToPublishHasHeaderRow and --publishMethod flags are overridden by the settings in the supplied control file.

-t *, --jobType (e.g. LoadPreferences)
    Specifies the type of job to run (IntegrationJob, LoadPreferences, and PortJob are the only acceptable values).

Step 3: Job Output

Information about the status of the job is written to STDOUT. If the job runs successfully, a ‘Success’ message is written to STDOUT and the job exits with a normal status code (0). If there was a problem running the job, a detailed error message is written to STDERR and the program exits with an error status code (1). You can capture the exit code to configure error-handling logic within your ETL process.
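The exit-code behavior above can be used from a wrapper script. The sketch below is illustrative, not part of DataSync: `run_and_report` is a hypothetical helper name, and the command you pass it would be your actual `java -jar` invocation.

```shell
# Hypothetical wrapper illustrating exit-code handling around a DataSync run.
# "$@" is the full command to execute, for example:
#   run_and_report java -jar datasync.jar -c config.json -f data.csv -i m985-ywaw -m replace
run_and_report() {
  "$@"
  status=$?
  if [ "$status" -eq 0 ]; then
    # DataSync exited with 0: the job succeeded
    echo "DataSync job succeeded"
  else
    # Non-zero exit code: the job failed; details were written to STDERR by DataSync
    echo "DataSync job failed with exit code $status" >&2
  fi
  return "$status"
}
```

Because the wrapper propagates the exit code, it can be chained with `&&`/`||` or checked in a scheduler the same way as the raw `java -jar` command.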

Complete example job

java -jar <DATASYNC_JAR> -c config.json -f business_licenses_2014-02-10.csv -h true -i 7tgi-grrk -m replace -pf true -cf control.json

config.json contents:

{
    "domain": "https://opendata.socrata.com",
    "username": "publisher@opendata.socrata.com",
    "password": "secret_password",
    "appToken": "fPsJQRDYN9KqZOgEZWyjoa1SG",
    "adminEmail": "",
    "emailUponError": "false",
    "logDatasetID": "",
    "outgoingMailServer": "",
    "smtpPort": "",
    "sslPort": "",
    "smtpUsername": "",
    "smtpPassword": ""
}

control.json contents:

{
  "action" : "Replace",
  "csv" :
    {
      "useSocrataGeocoding" : true,
      "columns" : null,
      "skip" : 0,
      "fixedTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy"],
      "floatingTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy"],
      "timezone" : "UTC",
      "separator" : ",",
      "quote" : "\"",
      "encoding" : "utf-8",
      "emptyTextIsNull" : true,
      "trimWhitespace" : true,
      "trimServerWhitespace" : true,
      "overrides" : {}
    }
}
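Since a malformed config.json or control.json will cause the job to fail, it can be useful to check the JSON syntax of both files before running. The helper below is not part of DataSync; `json_ok` is a hypothetical name, and it simply leans on Python's standard-library json.tool module.

```shell
# Hypothetical helper (not part of DataSync): succeed only if the given file
# parses as valid JSON, using Python's standard-library json.tool module.
json_ok() { python3 -m json.tool "$1" > /dev/null 2>&1; }

# Usage:
# json_ok config.json && json_ok control.json && echo "both files parse"
```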

Running a previously saved job file (.sij file)

Simply run:

java -jar <DATASYNC_JAR> <.sij FILE TO RUN>

For example:

java -jar <DATASYNC_JAR> /Users/john/Desktop/business_licenses.sij

NOTE: you can also create an .sij file directly (rather than saving a job using the DataSync UI) which stores the job details in JSON format. Here is an example:

{
    "datasetID" : "2bw7-dr67",
    "fileToPublish" : "/Users/john/Desktop/building_permits_2014-12-05.csv",
    "publishMethod" : "replace",
    "fileToPublishHasHeaderRow" : true,
    "publishViaFTP" : true,
    "pathToFTPControlFile" : "/Users/john/Desktop/building_permits_control.json"
}
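Since each .sij file runs with a single command, a set of saved jobs can be scripted in sequence. The sketch below is an assumption-laden example, not part of DataSync: `run_sij_jobs` and `DATASYNC_CMD` are hypothetical names, with `DATASYNC_CMD` standing in for your actual `java -jar <DATASYNC_JAR>` invocation.

```shell
# Hypothetical batch runner (not part of DataSync): run each saved .sij job in
# turn, stopping on the first failure. DATASYNC_CMD is assumed to hold the
# command that runs a single job, e.g. "java -jar datasync.jar".
run_sij_jobs() {
  for job in "$@"; do
    # Run one saved job; on a non-zero exit code, report which file failed
    $DATASYNC_CMD "$job" || { echo "failed: $job" >&2; return 1; }
  done
  echo "all jobs succeeded"
}

# Usage:
# DATASYNC_CMD="java -jar datasync.jar"
# run_sij_jobs /Users/john/Desktop/*.sij
```

Stopping on the first failure keeps a broken upstream job from cascading into later ones; drop the `return 1` if you prefer best-effort execution of the remaining jobs.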