Set up a Port Job (Headlessly)

Port jobs copy datasets that are already on the Socrata platform. Users with publishing rights can copy both dataset schemas (metadata and columns) and data (rows), and can port derived views as stand-alone datasets. This guide shows how to set up and run a Port Job using the command-line interface.

Step 1: Set up your configuration

All DataSync jobs require your domain, username, password and app token. The user running the job must have publisher rights on the dataset, and the domain used here must be the site hosting the dataset to be ported. A number of other global settings, such as logging and emailing preferences, can also be configured. Please refer to the configuration guide to establish your credentials and preferences.

Step 2: Configure job details

For general help using DataSync in headless/command-line mode, run:

java -jar <DATASYNC_JAR> --help

To run a job, execute the following command, replacing each <..> placeholder with the appropriate value (flags explained below):

java -jar <DATASYNC_JAR> -c <CONFIG FILE> -t PortJob -pm copy_all -pd1 <SOURCE DOMAIN> -pi1 <SOURCE DATASET ID> -pd2 <DESTINATION DOMAIN> -pdt <TITLE OF NEW DATASET> -pp true

Explanation of flags: * = required flag

Flag - Short Name	Flag - Long Name	Example Values	Description
-t `*`	--jobType	PortJob	Specifies the type of job to run.
-c	--config	/Users/home/config.json	Points to the config.json file you created in Step 1, if you chose to do so.
-pm `*`	--portMethod	copy_all	One of `copy_all`, `copy_schema` or `copy_data`
-pd1 `*`	--sourceDomain	https://opendata.socrata.com	The scheme and domain to which the source dataset belongs.
-pi1 `*`	--sourceDatasetId	m985-ywaw	The dataset identifier of the source dataset.
-pd2 `*`	--destinationDomain	https://opendata.socrata.com	The scheme and domain to which the destination dataset should be copied.
-pi2	--destinationDatasetId	ax36-bgg2	The dataset identifier of the destination dataset; only relevant if choosing `copy_data` for the --portMethod.
-pdt	--destinationDatasetTitle	"Crimes 2014"	The title to give the destination dataset; only relevant if the destination dataset is being created by choosing either `copy_all` or `copy_schema` for the --portMethod.
-pp	--publishDestinationDataset	true	Set this to `true` to have the destination dataset published before the Port Job completes; only relevant if the destination dataset is being created by choosing either `copy_all` or `copy_schema` for the --portMethod. If `false`, the destination dataset will be left as a working copy (`false` is the default value).
-ppm	--portPublishMethod	replace	Specifies the publish method to use (`replace` or `upsert`). For details on the publishing methods, refer to Step 5 of the Setup a Port Job (GUI) guide.
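
As an illustration of how the flags above combine, here is a minimal sketch of a `copy_data` port into an existing destination dataset. The function name `copy_data_job`, the `DATASYNC_JAR` variable and its default path are assumptions made for this example. The domain and config file name are the example values used in this guide. Whether `-ppm` is needed for your job may depend on your DataSync version.

```shell
# Assumed location of the DataSync jar; override by setting DATASYNC_JAR.
DATASYNC_JAR="${DATASYNC_JAR:-datasync.jar}"

# Hypothetical helper: copy rows from a source dataset into an existing
# destination dataset, replacing the destination's current rows.
#   $1 = source dataset ID, $2 = destination dataset ID
copy_data_job() {
    java -jar "$DATASYNC_JAR" -c config.json -t PortJob \
        -pm copy_data \
        -pd1 https://opendata.socrata.com -pi1 "$1" \
        -pd2 https://opendata.socrata.com -pi2 "$2" \
        -ppm replace
}
```

With the example IDs from the table, this would be called as `copy_data_job m985-ywaw ax36-bgg2`.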

Step 3: Job output

Information about the status of the job is written to STDOUT. If the job runs successfully, a ‘Success’ message and the destination dataset ID are printed to STDOUT and the program exits with a normal status code (0). If there was a problem running the job, a detailed error message is written to STDERR and the program exits with an error status code (1). You can capture the exit code to configure error-handling logic within your ETL process.
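
The exit-code handling described above could be sketched as a small shell wrapper. The function name `run_port_job` and its messages are hypothetical. The only behavior taken from this guide is that 0 means success and 1 means failure; the sketch treats any non-zero code as a failure.

```shell
# Hypothetical wrapper: run the command passed as arguments and report
# the outcome based on its exit code (0 = success, non-zero = failure).
run_port_job() {
    if "$@"; then
        echo "Port job succeeded"
        return 0
    else
        status=$?   # exit code of the command that just failed
        echo "Port job failed with exit code $status" >&2
        return "$status"
    fi
}
```

In an ETL script, you might prefix the full DataSync command line with `run_port_job` and follow it with `|| exit 1` so that later steps do not run after a failed port.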

Complete example job

java -jar <DATASYNC_JAR> -c config.json -t PortJob -pm copy_schema -pd1 https://opendata.socrata.com -pi1 97wa-y6ff -pd2 https://opendata.socrata.com -pdt 'Port Job Test Title' -pp true

config.json contents:

{
    "domain": "https://opendata.socrata.com",
    "username": "publisher@socrata.com",
    "password": "secret_password",
    "appToken": "fPsJQRDYN9KqZOgEZWyjoa1SG"
}

Running a previously saved job file (.spj file)

Simply run:

java -jar <DATASYNC_JAR> <.spj FILE TO RUN>

For example:

java -jar <DATASYNC_JAR> /Users/john/Desktop/business_licenses.spj
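
If you keep several saved jobs in one directory, a loop along the following lines could run them in sequence and stop at the first failure. This is only a sketch: the function name `run_saved_jobs` and the `DATASYNC_JAR` variable are assumptions for this example, and it relies only on the exit codes described in Step 3.

```shell
# Assumed location of the DataSync jar; override by setting DATASYNC_JAR.
DATASYNC_JAR="${DATASYNC_JAR:-datasync.jar}"

# Hypothetical helper: run every .spj file in a directory, in glob order,
# stopping at the first job that exits with a non-zero status.
#   $1 = directory containing .spj files
run_saved_jobs() {
    for job in "$1"/*.spj; do
        [ -e "$job" ] || continue   # skip the literal pattern if nothing matches
        java -jar "$DATASYNC_JAR" "$job" || {
            echo "Job failed: $job" >&2
            return 1
        }
    done
}
```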

NOTE: You can also create an .spj file directly rather than saving a job from the DataSync UI. The file stores the job details in JSON format. Here is an example:

{
  "portMethod": "copy_all",
  "sourceSiteDomain": "https://louis.demo.socrata.com",
  "sourceSetID": "w8e5-buaa",
  "sinkSiteDomain": "https://louis.demo.socrata.com",
  "sinkSetID": "",
  "publishMethod": "upsert",
  "publishDataset": "publish",
  "portResult": "",
  "jobFilename": "job_saved_v0.3.spj",
  "fileVersionUID": 1,
  "pathToSavedJobFile": "/home/louis/Socrata/Github/datasync/src/test/resources/job_saved_v0.3.spj"
}