Kafka Producer
Supports the ClickHouse documentation on Kafka.
This is a simple Kafka producer written in Python for ndjson data. It ensures a schema is set for the JSON, either by generating one or by using a schema you specify.
Schemas are required for tools such as the Kafka JDBC sink.
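For context, sinks such as the Kafka Connect JDBC sink (when used with the JsonConverter and `schemas.enable=true`) expect each message to carry an inline schema envelope alongside the payload. A minimal illustration, using field names from the sample GitHub dataset (the exact fields depend on your data and schema):

```json
{
  "schema": {
    "type": "struct",
    "optional": false,
    "fields": [
      {"field": "actor_login", "type": "string", "optional": true},
      {"field": "created_at", "type": "string", "optional": true}
    ]
  },
  "payload": {
    "actor_login": "alexey-milovidov",
    "created_at": "2019-09-23 11:00:00"
  }
}
```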
Requirements
- Python 3.8.10+
- Kafka instance v7+. The easiest solution is to create a Kafka cluster in Confluent Cloud, which offers an adequate free tier.
- Ndjson file. A sample GitHub ndjson file can be found here, with an accompanying config for the script here. See Large Datasets if a larger test file is required.
Setup
pip install -r requirements.txt
Usage
- Prepare a configuration. See github.config for examples; an illustrative sketch also follows these steps. Any target topic will be automatically created if it doesn't exist.
- (Optional) Prepare a JSON schema file for your ndjson and specify this in the config from (1) via `input.schema`. To infer a schema automatically, do not set this parameter; the schema will then be inferred from the first 100 lines. This is best effort only (but works for the GitHub dataset)! A sketch of the inference idea follows these steps.
- Run it!
python producer.py -c <config_file>
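A hypothetical config sketch is shown below, assuming a JSON-style config file. Only `input.schema` is named by this README; every other key here (connection settings, `topic`, `input.file`) is an assumption, shown purely to illustrate the dotted-key convention described under Not in scope. Refer to github.config for the authoritative format:

```json
{
  "bootstrap.servers": "<broker_host>:<port>",
  "topic": "github",
  "input.file": "github_all_columns.ndjson",
  "input.schema": "github.schema"
}
```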
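Conceptually, the best-effort inference described above amounts to sampling the first 100 lines and mapping JSON value types to schema types. A minimal sketch of the idea (not the actual producer.py implementation; the real type names and rules may differ):

```python
import json

# Illustrative type mapping; the real producer's type names may differ.
TYPE_MAP = {str: "string", bool: "boolean", int: "int64", float: "double"}

def infer_schema(path, sample_size=100):
    """Best-effort schema from the first `sample_size` ndjson lines."""
    fields = {}
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= sample_size:
                break
            for key, value in json.loads(line).items():
                # First non-null value wins; unknown types fall back to string.
                if key not in fields and value is not None:
                    fields[key] = TYPE_MAP.get(type(value), "string")
    return {
        "type": "struct",
        "optional": False,
        "fields": [{"field": k, "type": t, "optional": True}
                   for k, t in fields.items()],
    }
```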
Not in scope
Whilst all producer configuration parameters supported by the Kafka Python client can be used (replace `_` with `.` in the configuration), no work has been done on testing these settings for optimal performance.
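For example, assuming the kafka-python package, parameters such as `KafkaProducer(linger_ms=100, compression_type="gzip", acks="all")` would appear in the config with dots instead of underscores. A hypothetical snippet (untested for performance, as noted above):

```json
{
  "linger.ms": 100,
  "compression.type": "gzip",
  "acks": "all"
}
```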
Large Datasets
The sample GitHub dataset consists of events on the ClickHouse GitHub repository. This static file covers the period 2019-09-23 to 2022-01-05.
Specifically, this file was generated from the following command executed against the ClickHouse play site:
clickhouse-client --secure --host play.clickhouse.com --port 9440 --user explorer --query "SELECT file_time, event_type, actor_login, repo_name, created_at, updated_at, action, comment_id, path, ref, ref_type, creator_user_login, number, title, labels, state, assignee, assignees, closed_at, merged_at, merge_commit_sha, requested_reviewers, merged_by, review_comments, member_login FROM github_events WHERE repo_name = 'ClickHouse/ClickHouse' ORDER BY created_at ASC LIMIT 200000 FORMAT JSONEachRow" > github_all_columns.ndjson
Note the upper limit of 200k rows and the restriction to the `ClickHouse/ClickHouse` repository. Feel free to use this command to generate larger datasets for testing, potentially exploring other repositories; a variation is sketched below. If you experience quota limits, instructions for downloading and transforming the data can be found here.
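As a hypothetical variation, the same query against a different repository with a higher row limit (the repository name and limit here are arbitrary examples, subject to play.clickhouse.com quotas):

```bash
clickhouse-client --secure --host play.clickhouse.com --port 9440 --user explorer --query "SELECT file_time, event_type, actor_login, repo_name, created_at, updated_at, action, comment_id, path, ref, ref_type, creator_user_login, number, title, labels, state, assignee, assignees, closed_at, merged_at, merge_commit_sha, requested_reviewers, merged_by, review_comments, member_login FROM github_events WHERE repo_name = 'apache/kafka' ORDER BY created_at ASC LIMIT 1000000 FORMAT JSONEachRow" > github_larger.ndjson
```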