This workflow demonstrates the basic functionality of Pentaho Kettle Data Integration in order to clean a file, pivot columns to rows, and publish to Socrata.
To publish to Socrata using Pentaho is a two-step process. 1 workflow runs the transformations and the 2nd workflow runs datasync on the updated and transformed csv.
The Transformations
tranformation.ktr workflow:
- CSV Input
- Row Normaliser in order to Pivot Columns into flat rows
- Filter Rows to remove unwated “NA” characters
- Select Values (renames to schema, selects fields to write to output file)in order to Rename Columns
- Formula in order to change decimal rate to whole number and comply with Socrata Percent Data Type
- Replace String in order to remove unwanted text from cells
- Text Output (force quote enclosure, comma separator, ISO8601 date format and trims all fields)
datasyncrunner.kjb:
- Job Scheduling to iniate workflow
- Transformation runs the transformation.ktr workflow
- Execute Shell Script runs datasync
Quickstart
Follow these steps to get the example workflow working.
- Download Pentaho Kettle Community Edition and Datasync
- Download the workflow and associated files using the Download link above and open it from Spoon. NOTE: You may have to right-click > Save Link As… to get it to download.
- Update the Paths to the Raw Inspection and Violation files so they point to the correct directory.[^2]
- Update the path for the Text file output.
- Run the workflow and ensure that the file transformed_data.csv was created in the directory with current timestamp.