DataGen SQL Connector #
Scan Source: Bounded Scan Source: UnBounded
The DataGen connector allows for creating tables based on in-memory data generation. This is useful when developing queries locally without access to external systems such as Kafka. Tables can include Computed Column syntax which allows for flexible record generation.
The DataGen connector is built-in, no additional dependencies are required.
Usage #
By default, a DataGen table will create an unbounded number of rows with a random value for each column. For variable sized types, char/varchar/binary/varbinary/string/array/map/multiset, the length can be specified. Additionally, a total number of rows can be specified, resulting in a bounded table.
There also exists a sequence generator, where users specify a sequence of start and end values. If any column in a table is a sequence type, the table will be bounded and end with the first sequence completes.
Time types are always the local machines current system time.
CREATE TABLE Orders (
order_number BIGINT,
price DECIMAL(32,2),
buyer ROW<first_name STRING, last_name STRING>,
order_time TIMESTAMP(3)
) WITH (
'connector' = 'datagen'
)
Often, the data generator connector is used in conjunction with the LIKE
clause to mock out physical tables.
CREATE TABLE Orders (
order_number BIGINT,
price DECIMAL(32,2),
buyer ROW<first_name STRING, last_name STRING>,
order_time TIMESTAMP(3)
) WITH (...)
-- create a bounded mock table
CREATE TEMPORARY TABLE GenOrders
WITH (
'connector' = 'datagen',
'number-of-rows' = '10'
)
LIKE Orders (EXCLUDING ALL)
Types #
Type | Supported Generators | Notes |
---|---|---|
BOOLEAN | random | |
CHAR | random / sequence | |
VARCHAR | random / sequence | |
BINARY | random / sequence | |
VARBINARY | random / sequence | |
STRING | random / sequence | |
DECIMAL | random / sequence | |
TINYINT | random / sequence | |
SMALLINT | random / sequence | |
INT | random / sequence | |
BIGINT | random / sequence | |
FLOAT | random / sequence | |
DOUBLE | random / sequence | |
DATE | random | Always resolves to the current date of the local machine. |
TIME | random | Always resolves to the current time of the local machine. |
TIMESTAMP | random | Resolves a past timestamp relative to the current timestamp of the local machine. The max past can be specified by the 'max-past' option. |
TIMESTAMP_LTZ | random | Resolves a past timestamp relative to the current timestamp of the local machine. The max past can be specified by the 'max-past' option. |
INTERVAL YEAR TO MONTH | random | |
INTERVAL DAY TO MONTH | random | |
ROW | random | Generates a row with random subfields. |
ARRAY | random | Generates an array with random entries. |
MAP | random | Generates a map with random entries. |
MULTISET | random | Generates a multiset with random entries. |
Connector Options #
Option | Required | Default | Type | Description |
---|---|---|---|---|
connector |
required | (none) | String | Specify what connector to use, here should be 'datagen'. |
rows-per-second |
optional | 10000 | Long | Rows per second to control the emit rate. |
number-of-rows |
optional | (none) | Long | The total number of rows to emit. By default, the table is unbounded. |
fields.#.kind |
optional | random | String | Generator of this '#' field. Can be 'sequence' or 'random'. |
fields.#.min |
optional | (Minimum value of type) | (Type of field) | Minimum value of random generator, work for numeric types. |
fields.#.max |
optional | (Maximum value of type) | (Type of field) | Maximum value of random generator, work for numeric types. |
fields.#.max-past |
optional | 0 | Duration | Maximum past of timestamp random generator, only works for timestamp types. |
fields.#.length |
optional | 100 | Integer | Size or length of the collection for generating char/varchar/binary/varbinary/string/array/map/multiset types. |
fields.#.start |
optional | (none) | (Type of field) | Start value of sequence generator. |
fields.#.end |
optional | (none) | (Type of field) | End value of sequence generator. |
fields.#.null-rate |
optional | (none) | (Type of field) | The proportion of null values. |