Setup a Port Job (Headlessly)

Port jobs are used for copying datasets that are already on the Socrata platform. Port jobs allow users with publishing rights to copy both dataset schemas (metadata and columns) and data (rows). Port jobs also allow users to port derived views as stand-alone datasets. This guide shows how to set up and run a Port Job using the command-line interface.

Step 1: Set up your configuration

Information about your domain, username, password and app token is required for all DataSync jobs. Note that the user running the job must have publisher rights on the dataset, and that the domain used here must be the site hosting the dataset to be ported. A number of other global settings, such as logging and emailing preferences, can also be configured. Please refer to the configuration guide to establish your credentials and preferences.

Step 2: Configure job details

For general help using DataSync in headless/command-line mode run:

java -jar <DATASYNC_JAR> --help

To run a job, execute the following command, replacing <..> with the appropriate values (flags explained below):

java -jar <DATASYNC_JAR> -c <CONFIG FILE> -t PortJob -pm copy_all -pd1 <SOURCE DOMAIN> -pi1 <SOURCE DATASET ID> -pd2 <DESTINATION DOMAIN>  -pdt <TITLE OF NEW DATASET> -pp true

Explanation of flags: * = required flag

Explanation of flags (* = required flag):

-t, --jobType *
    Example: PortJob
    Specifies the type of job to run.

-c, --config
    Example: /Users/home/config.json
    Points to the config.json file you created in Step 1, if you chose to do so.

-pm, --portMethod *
    Example: copy_all
    One of copy_all, copy_schema or copy_data.

-pd1, --sourceDomain *
    The scheme and domain to which the source dataset belongs.

-pi1, --sourceDatasetId *
    Example: m985-ywaw
    The dataset identifier of the source dataset.

-pd2, --destinationDomain *
    The scheme and domain where the destination dataset should be copied.

-pi2, --destinationDatasetId
    Example: ax36-bgg2
    The dataset identifier of the destination dataset; only relevant if copy_data is chosen for the --portMethod.

-pdt, --destinationDatasetTitle
    Example: "Crimes 2014"
    The title to give the destination dataset; only relevant if the destination dataset is being created, i.e. if copy_all or copy_schema is chosen for the --portMethod.

-pp, --publishDestinationDataset
    Example: true
    Set this to true to have the destination dataset published before the Port Job completes; only relevant if the destination dataset is being created by choosing copy_all or copy_schema for the --portMethod. If false (the default value), the destination dataset will be left as a working copy.

-ppm, --portPublishMethod
    Example: replace
    Specifies the publish method to use (replace or upsert). For details on the publishing methods, refer to Step 5 of the Setup a Port Job (GUI) guide.
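As a complement to the copy_all command above, the following sketch assembles a copy_data invocation, where rows from the source dataset are copied into an existing destination dataset, so -pi2 (the destination dataset id) is required and the title/publish flags are not used. The jar name, domains and dataset ids below are placeholders, not real endpoints.

```shell
# Hedged sketch: build a copy_data command line. All values are placeholders.
DATASYNC_JAR=datasync.jar

build_port_cmd() {
  # $1 = source domain, $2 = source dataset id,
  # $3 = destination domain, $4 = destination dataset id
  echo "java -jar $DATASYNC_JAR -c config.json -t PortJob -pm copy_data" \
       "-pd1 $1 -pi1 $2 -pd2 $3 -pi2 $4 -ppm upsert"
}

# Print the command that would be run (a dry run; remove the echo to execute)
build_port_cmd https://source.example.com m985-ywaw \
               https://dest.example.com ax36-bgg2
```

Printing the assembled command first is a convenient way to review the flags before actually executing the job.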

Step 3: Job output

Information about the status of the job will be output to STDOUT. If the job runs successfully, a 'Success' message will be output to STDOUT, the destination dataset id will be printed out and the program will exit with a normal status code (0). If there was a problem running the job, a detailed error message will be output to STDERR and the program will exit with an error status code (1). You can capture the exit code to configure error handling logic within your ETL process.
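Since DataSync exits 0 on success and 1 on failure, a small wrapper can branch on the exit code. The sketch below is one possible approach; the DataSync invocation in the comment is illustrative and uses the same placeholder convention as the commands above.

```shell
# Hedged sketch: run a command and branch on its exit code, mirroring how
# DataSync signals job status (0 = success, 1 = error).
run_job() {
  "$@"
  status=$?
  if [ "$status" -eq 0 ]; then
    echo "port job succeeded"
  else
    echo "port job failed with exit code $status" >&2
  fi
  return "$status"
}

# Example (placeholders as in the commands above):
# run_job java -jar <DATASYNC_JAR> -c config.json -t PortJob -pm copy_all \
#   -pd1 <SOURCE DOMAIN> -pi1 <SOURCE DATASET ID> -pd2 <DESTINATION DOMAIN> \
#   -pdt "Ported copy" -pp true
```

The failure branch could instead trigger a retry or an alert, depending on how your ETL process handles errors.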

Complete example job

java -jar <DATASYNC_JAR> -c config.json -t PortJob -pm copy_schema -pd1 <SOURCE DOMAIN> -pi1 97wa-y6ff -pd2 <DESTINATION DOMAIN> -pdt 'Port Job Test Title' -pp true

config.json contents:

    "domain": "",
    "username": "",
    "password": "secret_password",
    "appToken": "fPsJQRDYN9KqZOgEZWyjoa1SG",

Running a previously saved job file (.spj file)

Simply run:

java -jar <DATASYNC_JAR> <.spj FILE TO RUN>

For example:

java -jar <DATASYNC_JAR> /Users/john/Desktop/business_licenses.spj

NOTE: you can also create an .spj file directly (rather than saving a job using the DataSync UI) which stores the job details in JSON format. Here is an example:

  "portMethod": "copy_all",
  "sourceSiteDomain": "",
  "sourceSetID": "w8e5-buaa",
  "sinkSiteDomain": "",
  "sinkSetID": "",
  "publishMethod": "upsert",
  "publishDataset": "publish",
  "portResult": "",
  "jobFilename": "job_saved_v0.3.spj",
  "fileVersionUID": 1,
  "pathToSavedJobFile": "/home/louis/Socrata/Github/datasync/src/test/resources/job_saved_v0.3.spj"