Control File Configuration

Contents

NOTICE: This guide in only applicable when using the FTP or HTTP methods (i.e. it is not applicable when using Soda2).

Basic control file setup

The control file is a JSON-formatted file that is used to configure a Standard DataSync job that uses the FTP or HTTP means of transfering data. Control files are specific to the dataset you are updating.

An example of a typical control file:

{
  "action" : "Replace",
  "csv" :
    {
      "useSocrataGeocoding" : true,
      "columns" : null,
      "skip" : 0,
      "fixedTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy"],
      "floatingTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy"],
      "timezone" : "UTC",
      "separator" : ",",
      "quote" : "\"",
      "encoding" : "utf-8",
      "emptyTextIsNull" : true,
      "trimWhitespace" : true,
      "trimServerWhitespace" : true,
      "overrides" : {}
    }
}

This guide will describe how to use the different options within the control file.

Header row/column list

The columns and skip options enable configuration of how the columns within the CSV/TSV align with those of the dataset.

columns: List of column names in the following format ["col_id1","col_id2",..]. If null, the first line of the CSV/TSV after any skipped records is assumed to contain the column names. If specified, columns must be an array of strings, and must not contain nulls.

IMPORTANT NOTE: the column names, whether provided in “columns” or in the first row of the CSV/TSV, must be the API field names, not the display name of the columns.

skip: Specifies the number of rows to skip before reaching the header.

Common combinations of columns and skip:

If the first line of the CSV/TSV is the list of column identifiers:

"columns": null,
"skip": 0,

If the first line of the CSV is the columns incorrectly formatted, for example with human-readable names instead of column identifiers:

"columns": ["first_name","last_name","age"],
"skip": 1,

If the first line of the CSV/TSV is data (there is no header row):

"columns": ["first_name","last_name","age"],
"skip": 0,

Date/time formatting

Timestamp Format Options

The floatingTimestampFormat and fixedTimestampFormat options specify how date/time data is formatted in the CSV/TSV file. floatingTimestampFormat applies to (“Date & Time” datatype columns). fixedTimestampFormat functions similarly but applies (“Date & Time (with timezone)” datatype columns). If the format does not specify a time zone, the zone may be given via the timezone option. If no zone information is provided, UTC is assumed.

Both floatingTimestampFormat and fixedTimestampFormat accept a string (e.g. “ISO8601”) or a JSON-formatted list of formats including “ISO8601” and any date/time “Joda time” format-string. Joda time syntax is documented in detail here: http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html. Note that time specification “z” cannot be used; it is an output only formatter.

Example syntax to accept four of the most common date/time formats:

"fixedTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy","dd-MMM-yyyy"],
"floatingTimestampFormat" : ["ISO8601","MM/dd/yy","MM/dd/yyyy","dd-MMM-yyyy"],

This would accept any of the following example date/time data values: “2014-04-22”, “2014-04-22T05:44:38”, “04/22/2014”, “4/22/2014”, “4/22/14”, and “22-Apr-2014”.

If you want to allow a date with or without a time value (e.g. both “2014-04-22” and “2014-04-22 9:30:00”), you would use:

"fixedTimestampFormat" : ["yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss"],
"floatingTimestampFormat" : ["yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss"],
Timezone option

timezone specifies the timezones for FixedTimestamps (“Date & Time (with timezone)” columns). This only has an effect if the timestamp format does not specify a time zone.

You can set this to one of the following:

  1. “UTC”
  2. A timezone name (e.g. “US/Pacific”). The list of accepted names can be found at http://joda-time.sourceforge.net/timezones.html. Please avoid the 3-letter variants as these are ambiguous (e.g. MST is both Mountain Standard Time and Malaysia Standard Time).

Location column and geocoding configuration

The syntheticLocations option allows configuring a Location datatype column to populate from address, city, state, zipcode or latitude/longitude data within existing columns of the CSV/TSV.

For example:

"syntheticLocations" : {
  "location_col_id" : {
    "address" : "address_col_id",
    "city" : "city_col_id",
    "state" : "state_col_id",
    "zip" : "zipcode_col_id",
    "latitude" : "lat_col_id",
    "longitude" : "lng_col_id"
  }
}

The following fields are optional: “address”, “city”, “state”, “zip”, “latitude”, and “longitude”. Those that are are not provided are omitted from the generated location. The column field names must exist in the CSV. In the above example, a Location datatype column with the identifier location_col_id would pull in the “address” from the column with identifier address_col_id and the “city” from column with identifier city_col_id.

The synthetic location location_col_id should not be present in the CSV. If it is, you can ignore this column using the ignoreColumns option.

When you provide any combination of location information but do not fill in latitude or longitude then Socrata geocodes the information automatically to generate the latitude and longitude values. For more information on Socrata geocoding, please see this guide.

IMPORTANT: If you are providing the latitude and longitude values as inputs to the Location column (i.e. you are NOT using Socrata’s geocoding), you should set the useSocrataGeocoding option to false. If you are not providing the latitude and longitude, you should to set this to true. This will minimize the number of perceived changes to the dataset, decreasing the time it takes to complete your job. If you are constructing multiple Location columns and they require different useSocrataGeocoding settings, you may use the overrides option.

Ignoring Columns

The ignoreColumns option allows you to exclude columns within the CSV/TSV. This may be necessary if the dataset lacks a column within the CSV or if a synthetic location is provided in the CSV, but you would still like it constructed from individual address fields.

ignoreColumns: List of column names in the following format ["col_id1","col_id2",..]. These must be present in columns.

Complete control file settings

The control file is comprised of the Action setting and the CSV or TSV settings.

Actions

The action is given by one of the following strings:

Option Explanation
Replace Use if the the CSV/TSV represents the desired new state for the dataset. DataSync will calculate the minimal set of changes required, updating the dataset accordingly.
Append Deprecated. See Upsert.
Upsert Use if the CSV/TSV contains updates to the dataset, rather than the complete dataset. If the dataset does not have a RowID, then all rows in the CSV are appended, even if they duplicate existing rows. If the dataset does have a RowID, then matching row IDs will be updated and new row IDs will be appended.
Delete Use if the CSV/TSV contains row IDs of rows to delete. This option requires that a row identifier be set on the dataset.

CSV or TSV Settings

The following are options available to both CSV files or TSV files within the csv or tsv object:

Option Explanation
encoding "utf-8", or any other encoding that the JVM understands. The list is available at /datasync/charsets.json from your socrata domain (e.g. https://opendata.socrata.com/datasync/charsets.json).
separator Field separator. Typically "," or "\t".
quote Used to quote values which contain the separator character. Separators between quotes will be treated as part of the value. Typical values are "\"" for double-quotes, "'" for single-quotes and "\u0000" for no quote character.
escape Used to specify the escape character. Typically this is "\\", a single backslash.
columns JSON list of column names. If null then the first line of the CSV after any skipped records is used. If specified, it must be an array of strings and must not contain nulls. Note that the column names, whether provided in “columns” or in the first row of the CSV, must match the API field names, not the display name of the columns.
ignoreColumns Specifies any columns in the CSV/TSV file that are to be ignored. These must be given as an array of strings, each of which must be listed within `columns`.
skip Specifies the number of rows to skip. The first row that will be read is `skip` + 1 and will be read as a header or a data row depending on how `columns` is set. If `columns` is null, the first row read will be treated as the header; otherwise it is treated as data.
trimWhitespace Trims leading and trailing whitespace before inserting the data into the dataset. Also trims quoted values (e.g. " Foo" would be converted to "Foo").
trimServerWhitespace Trims leading and trailing whitespace that already exists in the dataset. This flag is generally only necessary if data was previously added to the dataset with whitespace (due to trimWhitespace being false).
useSocrataGeocoding Relevant only to Location columns. Controls how comparisons are made between values in the CSV/TSV file and the data we have stored in our servers. If you are not providing the latitude and longitude in the synthetic location (e.g. Socrata will geocode the location), set this to "true" to minimize perceived changes to your data. If you are providing the latitude and longitude in the synthetic location, set this to "false".
emptyTextIsNull For old backend datasets, set this to “true”. The old backend converts empty strings to null and if “false”, every empty string will be viewed as a change to your dataset, slowing down the upsert time considerably. For new backend datasets, this will affect how data is imported into text fields. If true, then empty text (not whitespace) will be treated as NULL. If false, it will be treated as the empty string.
floatingTimestampFormat Specifies how Floating Timestamps (“Date & Time” columns) are interpreted. Typical values are "ISO8601" or "yyyy-MM-dd". Any joda-formated string is acceptable. If you want to allow multiple formats to be accepted, then you can specify a list of values rather than a single value (e.g. ["ISO8601", "MM.dd.yyyy"]).
fixedTimestampFormat Same as floatingTimestampFormat but for Fixed Timestamps (“Date & Time (with timezone)”). If the format does not specify a time zone, the zone named by the `timezone` field is used.
timezone Specifies the timezones for FixedTimestamps (“Date & Time (with timezone)” columns). This only has an effect if the timestamp format does not specify a time zone. Typical values are "UTC" or "US/Pacific". A list of accepted names is available at this Joda documentation page. *Please avoid the 3-letter variants as these are ambiguous (e.g. MST is both Mountain Standard Time and Malaysia Standard Time)*.
syntheticLocations Allows transformation of multiple columns into one or more Location columns during insert. See See the Location column and geocoding configuration section for an example.
setAsideErrors When set to true, bad data will be ignored, allowing the job to complete while setting aside any rows that contain errors. As with other upload jobs, the status of the job can be monitored in the job status page. If the upload job completed (with partial success), a link to the CSV with bad data will be available in the job details page to download and inspect.
overrides A map whose keys are field names, and whose values are objects containing per-column overrides for the `timestampFormat`, `timezone`, `emptyTextIsNull`, `trimWhitespace`, `trimServerWhitespace` and `useSocrataGeocoding` settings. Note that “timestampFormat” applies to both fixed and floating timestamps. For an example, see below.

Example of using column-level overrides:

{
  "action" : "Replace",
  "csv" : {
    "useSocrataGeocoding" : true,
    "fixedTimestampFormat" : "ISO8601",
    "floatingTimestampFormat" : "ISO8601",
    "timezone" : "UTC",
    "overrides" : {
      "my_time_column" : {
        "timestampFormat" : "YYYY-MM-dd HH:mm:ss",
        "timezone" : "US/Central"
      },
      "my_text_column" : {
        "emptyTextIsNull" : true
      }
      "my_location_column" {
        "useSocrataGeocoding" : false
      }
    }
  }
}