# `rq` DataSpecs

*[back to index](./README.md)*

DataSpec is a bespoke format developed for `rq` to make it easy to specify per-file options concisely when writing shell commands. It can be used to provide options for input files, output files, and data files, and is also used by some builtins such as `rq.run()`.

A DataSpec includes:

* An optional format
* Zero or more options
* A (possibly empty) file path

A DataSpec takes the form: `format:option1=value1;option2=value2:/some/path`.

The format, options, and file path are separated from one another using `:` symbols. If a format is specified, it is always the first component of the DataSpec. Options always contain a `=` symbol delineating the option key from the value. Options are separated from one another using `;` symbols. If the file path is provided, it is always the last component.

If the format is omitted, then the extension of the file path is typically used to infer it. If there is none, then `json` is typically assumed. ([see also](./gotchas.md#input-format-inference-from-file-extensions))

For situations where paths or options need to contain special characters, it is also possible to specify a DataSpec in JSON format. A DataSpec is assumed to be JSON formatted if the first and last characters are `{` and `}` respectively. For example, the following are equivalent:

* `{"format": "csv", "file_path": "/foo/bar", "options": {"bool": true, "int": 7}}`
* `csv:bool=true;int=7:/foo/bar`
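As a rough illustration of choosing between the two representations, the following Python sketch builds a DataSpec string from its parts. `build_dataspec` is a hypothetical helper and not part of `rq`. It assumes that falling back to JSON whenever any component contains `;`, `:`, or `=` is acceptable (a deliberately conservative choice), and that non-string option values can be rendered as their JSON text.

```python
import json

# Characters that act as delimiters in the concise form.
SPECIAL = set(";:=")


def _text(value):
    """Render an option value as a string (assumption: JSON text for non-strings)."""
    return value if isinstance(value, str) else json.dumps(value)


def build_dataspec(fmt="", options=None, file_path=""):
    """Hypothetical helper: prefer the concise form, fall back to JSON."""
    opts = {key: _text(value) for key, value in (options or {}).items()}
    pieces = [fmt, file_path, *opts.keys(), *opts.values()]
    needs_json = any(SPECIAL & set(piece) for piece in pieces)

    if not needs_json:
        head = [fmt] if fmt else []
        if opts:
            head.append(";".join(f"{k}={v}" for k, v in opts.items()))
        # The file path is always the last component, even when empty.
        concise = ":".join(head + [file_path])
        # A string wrapped in braces would be mistaken for the JSON form.
        if not (concise.startswith("{") and concise.endswith("}")):
            return concise

    return json.dumps({"format": fmt, "file_path": file_path, "options": opts})


print(build_dataspec("csv", {"bool": True, "int": 7}, "/foo/bar"))
print(build_dataspec("csv", {}, "C:/data.csv"))
```

The first call yields the concise form shown in the list above; the second falls back to JSON because the path contains a `:`.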

Note that DataSpec option values are always stringly typed. If non-string values are used, they are silently converted to strings during parsing. Many options interpret strings as other types, such as numbers or booleans, so forcing every value to be a string ensures that the concise and JSON representations of a DataSpec always have their values interpreted consistently.
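The string conversion described above can be sketched in Python as follows. This is not `rq`'s implementation: `parse_json_dataspec` is a hypothetical name, rendering non-string values as their JSON text is an assumption chosen to match the equivalence example earlier, and treating missing keys as empty is also an assumption.

```python
import json


def stringify(value):
    """Keep strings as-is; render anything else as JSON text (assumption)."""
    return value if isinstance(value, str) else json.dumps(value)


def parse_json_dataspec(spec):
    """Hypothetical parser for the JSON form of a DataSpec."""
    doc = json.loads(spec)
    if not isinstance(doc, dict):
        raise ValueError("a JSON-formatted DataSpec must encode an object")
    raw_options = doc.get("options", {})
    return {
        "format": doc.get("format", ""),
        "file_path": doc.get("file_path", ""),
        # Every option value ends up as a string, whatever its JSON type was.
        "options": {key: stringify(value) for key, value in raw_options.items()},
    }


spec = '{"format": "csv", "file_path": "/foo/bar", "options": {"bool": true, "int": 7}}'
print(parse_json_dataspec(spec)["options"])
```

With this normalization, the options from the JSON example come out as `"true"` and `"7"`, the same strings the concise form `bool=true;int=7` would produce.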

## Supported Options

Note that the `type` column in the following tables indicates how the value string will be interpreted.

### Options for Input Formats

| format   | option           | type    | default |description |
|----------|------------------|---------|---------|------------|
| `csv`    | `csv.comma`      | rune    | `,`     | The delimiter to be used for parsing the CSV. |
| `csv`    | `csv.comment`    | rune    |         | Lines beginning with this symbol will be skipped. |
| `csv`    | `csv.skip_lines` | integer | 0       | Skip this many leading lines. |
| `csv`    | `csv.headers`    | boolean | `false` | If true, assume the first row of the CSV contains headers. |
| `csv`    | `csv.infer`      | boolean | `true`  | If true, automatically infer the type of the value in each cell, otherwise keep all values as strings. |
| `base64` | `base64.data`    | string  |         | Rather than reading data from the file path, the value of this option is base64 decoded and then parsed as JSON to generate the input. |
| all      | `rego.path`      | string  | [file path basename] | Path under the `data` package that the data should be loaded into. For example, setting `rego.path` to `hello` would cause the loaded data to appear in `data.hello`. |
| `raw`    | `raw.fs`         | regex   |         | If provided, defines a regex along which records are split into fields. Only valid if `raw.rs` is specified too. |
| `raw`    | `raw.rs`         | regex   |         | If provided, defines a regex along which the input is split into records. |
| `raw`    | `raw.lcutset`    | string  |         | If provided, apply [`strings.TrimLeft()`](https://pkg.go.dev/strings#TrimLeft) to each record with this value as the cutset. |
| `raw`    | `raw.rcutset`    | string  |         | If provided, apply [`strings.TrimRight()`](https://pkg.go.dev/strings#TrimRight) to each record with this value as the cutset. |
| `raw`    | `raw.cutset`     | string  |         | If provided, apply [`strings.Trim()`](https://pkg.go.dev/strings#Trim) to each record with this value as the cutset. |
| `raw`    | `raw.headers`    | boolean | `false` | If true, assume the first row of records contains headers. Only valid when both `raw.fs` and `raw.rs` are set. |
| `raw`    | `raw.infer`      | boolean | `false` | If true, automatically infer the type of the value in each cell, otherwise keep all values as strings. |
| `raw`    | `raw.coalesce`   | integer | 0       | If set to a value n > 0, stop splitting after the (n-1)th column, coalescing the unsplit remainder into column n. When both `raw.fs` and `raw.rs` are set, this only applies to field splitting. |
| all      | `strict`         | boolean | `true`  | If `false`, disable error handling for the input file. ([see also](./gotchas.md#strict-mode-amp-dealing-with-malformed-input-files)) |

### Options for Output Formats

| format                          | option            | type    | default      |description |
|---------------------------------|-------------------|---------|--------------|------------|
| `csv`                           | `csv.comma`       | rune    | `,`          | The delimiter to be used when writing the CSV. |
| `csv`                           | `csv.headers`     | boolean | `true`       | If true, generate a header row where possible. |
| `json`, `ndjson`, `yaml`, `xml`, `hcl` | `output.colorize` | boolean | *            | If true, syntax highlight the output using Chroma. |
| `json`, `xml`                   | `output.pretty`   | boolean | `true`       | If true, pretty-print the output. |
| `json`, `xml`                   | `output.indent`   | string  | `\t`         | Prefix to indent with when pretty printing output. |
| `json`, `ndjson`, `yaml`, `xml`, `hcl` | `output.style`    | string  | `native`     | Chroma style to use for syntax highlighting, see [here](https://xyproto.github.io/splash/docs/all.html). |
| `json`                          | `json.canonical`  | boolean | `false`      | Format the output as [RFC8785](https://datatracker.ietf.org/doc/html/rfc8785) compliant canonical JSON. Overrides all other formatting options. |
| `xml`                           | `xml.root-tag`    | string  | `doc`        | Override the XML root tag for the generated XML document. |
| `xml`                           | `xml.list-style`  | string  | `enumerated` | If `enumerated`, top-level lists are marshaled with separate node types for each index (`list-element<n>`); if `common`, a single common `list` node type is shared by all top-level list elements. |
| `template`                      | `output.template` | string  | `no template provided (HINT: did you forget the output.template option on your output dataspec?)` | Specify the template to populate using the result of the query. |
| `template`                      | `output.sep`      | string  | `\n`         | Specify the line separator to use for array shaped outputs. Note that a trailing `\n` is always appended for arrays, irrespective of this setting. |
| `raw`                           | `raw.fl`          | string  | ``           | "field left", a string to prefix each output field with. |
| `raw`                           | `raw.fs`          | string  | `\t`         | "field separator", a string to delimit output fields with. |
| `raw`                           | `raw.fr`          | string  | ``           | "field right", a string to suffix each output field with. |
| `raw`                           | `raw.rl`          | string  | ``           | "record left", a string to prefix each output record with. |
| `raw`                           | `raw.rs`          | string  | `\n`         | "record separator", a string to delimit output records with. |
| `raw`                           | `raw.rr`          | string  | ``           | "record right", a string to suffix each output record with. |

## DataSpec Specification

A DataSpec is a UTF-8 encoded string. If the string begins and ends with `{` and `}` respectively, then it is said to be *JSON-formatted*, and **must** encode an object with the following keys and types:

* `format`, string
* `file_path`, string
* `options`, object

If the DataSpec does not begin and end with `{` and `}`, then it is said to be *concise-formatted*, and **must** encode data according to the following pseudo-BNF:

```plain
DATASPEC -> FILEPATH |
            FORMAT ":" FILEPATH |
            OPTIONS ":" FILEPATH |
            FORMAT ":" OPTIONS ":" FILEPATH

OPTIONS ->  KEY "=" VALUE |
            KEY "=" VALUE ";" OPTIONS

KEY -> SEGMENT

VALUE -> SEGMENT

FORMAT -> SEGMENT

FILEPATH -> SEGMENT

SEGMENT -> /[^;:=]+/
```

In the above grammar, the `DATASPEC` production lists its possibilities in descending order of priority, noting that `OPTIONS` productions can be distinguished from `FORMAT` by the presence of the `=` symbol.

Some examples of DataSpecs include:

| DataSpec string | format | options | file path  |
|-----------------|--------|---------|------------|
| `/foo/bar`      |        |         | `/foo/bar` |
| `json:/foo/bar` | `json` |         | `/foo/bar` |
| `csv:csv.headers=true;csv.comma=,;rego.path=foo:/foo/bar` | `csv` | `{"csv.headers": "true", "csv.comma": ",", "rego.path": "foo"}` | `/foo/bar` |
| `csv.headers=true;csv.comma=,;rego.path=foo:` | | `{"csv.headers": "true", "csv.comma": ",", "rego.path": "foo"}` | |
| `{"format": "csv", "options": {"csv.headers": "true", "csv.comma": ",", "rego.path": "foo"}, "file_path": "/foo/bar"}` | `csv` | `{"csv.headers": "true", "csv.comma": ",", "rego.path": "foo"}` | `/foo/bar` |
| `json:option=value` | `json` | | `option=value` |
| `json:option=value:` | `json` | `{"option": "value"}` | |

**NOTE**: As demonstrated in the last two rows of the table above, if exactly one `:` symbol is present, then the second component is always assumed to be the file path even if a `=` symbol is present. This is an intentional decision, because otherwise a path to a file containing an `=` symbol could be misinterpreted as an option. If an empty file path is desired, then a trailing `:` symbol is needed, or the JSON representation of the DataSpec can be used. This is pertinent because some builtins like `rq.encode()`, `rq.decode()`, and `rq.run()` use DataSpecs to specify formats and options, but ignore the file path component.
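The concise-format rules above can be sketched as a small Python parser. This is an illustrative reading of the grammar and the note, not `rq`'s actual implementation: `parse_concise` and `parse_options` are hypothetical names, and splitting on at most two `:` symbols (so any further `:` stays in the file path) is an assumption that the grammar does not confirm.

```python
def parse_options(raw):
    """Split 'k1=v1;k2=v2' into a dict of string keys and string values."""
    options = {}
    for pair in raw.split(";"):
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"option {pair!r} has no '=' symbol")
        options[key] = value
    return options


def parse_concise(spec):
    """Hypothetical parser for a concise-formatted DataSpec."""
    # Assumption: split at most twice, leaving later ':' symbols in the path.
    parts = spec.split(":", 2)
    fmt, options, path = "", {}, ""
    if len(parts) == 1:
        # No ':' at all: the whole string is the file path.
        path = parts[0]
    elif len(parts) == 2:
        # Exactly one ':': the second component is always the file path,
        # even if it contains '='. The first is options only if it has '='.
        head, path = parts
        if "=" in head:
            options = parse_options(head)
        else:
            fmt = head
    else:
        fmt, raw_options, path = parts
        options = parse_options(raw_options)
    return {"format": fmt, "options": options, "file_path": path}


print(parse_concise("json:option=value"))
print(parse_concise("json:option=value:"))
```

The two calls at the end correspond to the last two rows of the table: without the trailing `:`, `option=value` is treated as the file path; with it, `option=value` is parsed as an option and the file path is empty.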

