Awesome-omni-skill seatunnel-skill

Apache SeaTunnel - A multimodal, high-performance, distributed data integration tool for massive data synchronization across 100+ connectors

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/seatunnel-skill" ~/.claude/skills/diegosouzapw-awesome-omni-skill-seatunnel-skill && rm -rf "$T"
manifest: skills/development/seatunnel-skill/SKILL.md
source content

Apache SeaTunnel OpenCode Skill

Apache SeaTunnel is a multimodal, high-performance, distributed data integration tool capable of synchronizing vast amounts of data daily. It connects hundreds of evolving data sources with support for real-time, CDC (Change Data Capture), and full database synchronization.

Quick Start

Prerequisites

  • Java: JDK 8 or higher
  • Maven: 3.6.0 or higher (for building from source)
  • Python: 3.7+ (optional, for Python API)
  • Git: For cloning the repository

Installation

1. Clone Repository

git clone https://github.com/apache/seatunnel.git
cd seatunnel

2. Build from Source

# Build entire project
mvn clean install -DskipTests

# Build specific module
mvn clean install -pl seatunnel-core -DskipTests

3. Download Pre-built Binary (Recommended)

Visit the official download page and select your version:

VERSION=2.3.12
wget https://archive.apache.org/dist/seatunnel/${VERSION}/apache-seatunnel-${VERSION}-bin.tar.gz
tar -xzf apache-seatunnel-${VERSION}-bin.tar.gz
cd apache-seatunnel-${VERSION}

4. Basic Configuration

# Set JAVA_HOME
export JAVA_HOME=/path/to/java

# Add to PATH
export PATH=$PATH:/path/to/seatunnel/bin

5. Verify Installation

seatunnel.sh --version

Hello World Example

Create a simple data integration job:

config/hello_world.conf

env {
  job.mode = "BATCH"
  job.name = "Hello World"
}

source {
  FakeSource {
    row.num = 100
    schema = {
      fields {
        id = "bigint"
        name = "string"
        age = "int"
      }
    }
  }
}

sink {
  Console {
    format = "json"
  }
}

Run the job:

seatunnel.sh -c config/hello_world.conf -e spark
# or with flink
seatunnel.sh -c config/hello_world.conf -e flink

Overview

Purpose and Target Users

SeaTunnel is designed for:

  • Data Engineers: Building large-scale data pipelines with minimal complexity
  • DevOps Teams: Managing data integration infrastructure
  • Enterprise Platforms: Handling 100+ billion data records daily
  • Real-time Analytics: Supporting streaming data synchronization
  • Legacy System Migration: CDC-based incremental sync from transactional databases

Core Capabilities

  1. Multimodal Support

    • Structured data (databases, data warehouses)
    • Unstructured data (video, images, binaries)
    • Semi-structured data (JSON, logs, binlog streams)
  2. Multiple Synchronization Methods (see the sketch after this list)

    • Batch: Full historical data transfer
    • Streaming: Real-time data pipeline
    • CDC: Incremental capture from databases
    • Full + Incremental: Combined approach
  3. 100+ Pre-built Connectors

    • Databases: MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, DB2, OceanBase
    • Data Warehouses: Snowflake, BigQuery, Redshift, Iceberg
    • Data Lakes: Hive, Iceberg, Hudi, Paimon
    • Cloud SaaS: Salesforce, Shopify, Google Sheets
    • Message Queues: Kafka, RabbitMQ, Pulsar, RocketMQ, ActiveMQ
    • Search Engines: Elasticsearch, OpenSearch, Easysearch
    • OLAP Engines: ClickHouse, StarRocks, Doris, Druid
    • Time-series Databases: IoTDB, TDengine, InfluxDB
    • Vector Databases: Milvus, Qdrant
    • Graph Databases: Neo4j
    • Object Storage: S3, GCS, HDFS, OssFile, CosFile
  4. Multi-Engine Support

    • Zeta Engine: Lightweight, standalone deployment (no Spark/Flink required)
    • Apache Flink: Distributed streaming engine
    • Apache Spark: Distributed batch and micro-batch processing
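
The synchronization methods above boil down to a couple of config options. A minimal sketch (job.mode as in the examples below; startup.mode values per the MySQL-CDC connector, so treat exact names as version-dependent):

env {
  job.mode = "STREAMING"   # BATCH = one-time full transfer; STREAMING = continuous
}

source {
  MySQL-CDC {
    # "initial" = full snapshot first, then incremental changes (Full + Incremental)
    # "latest"  = incremental only, from the current binlog position
    startup.mode = "initial"
  }
}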

Features

High-Performance

  • Distributed Snapshot Algorithm: Ensures data consistency without locks
  • JDBC Multiplexing: Minimizes database connections for real-time sync
  • Log Parsing: Efficient CDC implementation with binary log analysis
  • Resource Optimization: Reduces computing resources and I/O overhead

Data Quality & Reliability

  • Real-time Monitoring: Track synchronization progress and data metrics
  • Data Loss Prevention: Transactional guarantees (exactly-once semantics; sketched below)
  • Deduplication: Prevents duplicate records during reprocessing
  • Error Handling: Graceful failure recovery and retry logic
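
How these guarantees surface in a job config, as a minimal sketch reusing options shown later in this guide (checkpoint.interval, semantic):

env {
  job.mode = "STREAMING"
  checkpoint.interval = 60000   # periodic snapshots enable loss-free recovery
}

sink {
  Kafka {
    semantic = "EXACTLY_ONCE"   # transactional writes; no duplicates on replay
  }
}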

Developer-Friendly

  • SQL-like Configuration: Intuitive HOCON job definition syntax
  • Visual Web UI: Drag-and-drop job builder (SeaTunnel Web Project)
  • Extensive Documentation: Comprehensive guides with i18n (English + Chinese)
  • Community Support: Active community via Slack and mailing lists

Production Ready

  • Proven at Scale: Used in enterprises processing billions of records daily
  • Version Stability: Regular releases with backward compatibility
  • Enterprise Features: Multi-tenancy, RBAC, audit logging
  • Cloud Native: Kubernetes-ready deployment

Installation

Installation Methods

Method 1: Binary Download (Recommended for Quick Start)

# 1. Download binary
VERSION=2.3.12
wget https://archive.apache.org/dist/seatunnel/${VERSION}/apache-seatunnel-${VERSION}-bin.tar.gz

# 2. Extract
tar -xzf apache-seatunnel-${VERSION}-bin.tar.gz
cd apache-seatunnel-${VERSION}

# 3. Verify
./bin/seatunnel.sh --version

Method 2: Build from Source

# 1. Clone repository
git clone https://github.com/apache/seatunnel.git
cd seatunnel

# 2. Build with Maven
mvn clean install -DskipTests

# 3. Navigate to distribution
cd seatunnel-dist/target/apache-seatunnel-*-bin/apache-seatunnel-*

# 4. Verify
./bin/seatunnel.sh --version

Method 3: Docker

# Pull official Docker image
docker pull apache/seatunnel:latest

# Run container
docker run -it apache/seatunnel:latest /bin/bash

# Or run a job directly
docker run -v /path/to/config:/config \
  apache/seatunnel:latest \
  seatunnel.sh -c /config/job.conf -e spark

Environment Setup

Set Java Home

# Bash/Zsh
export JAVA_HOME=/path/to/java
export PATH=$JAVA_HOME/bin:$PATH

# Verify
java -version

Configure for Spark Engine

# Set Spark Home (if using Spark engine)
export SPARK_HOME=/path/to/spark

# Verify Spark installation
$SPARK_HOME/bin/spark-submit --version

Configure for Flink Engine

# Set Flink Home (if using Flink engine)
export FLINK_HOME=/path/to/flink

# Verify Flink installation
$FLINK_HOME/bin/flink --version

System Requirements

| Requirement | Version/Spec |
|-------------|--------------|
| Java        | JDK 1.8+ |
| Memory      | 2GB+ (minimum), 8GB+ (recommended) |
| Disk        | 500MB (binary) + job storage |
| Network     | Connectivity to source/sink systems |
| Scala       | 2.12.15 (for Spark/Flink integration) |

Usage

1. Job Configuration (HOCON Format)

SeaTunnel uses HOCON (Human-Optimized Config Object Notation) for job configuration.

Basic Structure:

env {
  job.mode = "BATCH"  # or STREAMING
  job.name = "My Job"
  parallelism = 4
}

source {
  SourceConnector {
    option1 = value1
    option2 = value2
    schema = {
      fields {
        column1 = "type"
        column2 = "type"
      }
    }
  }
}

# Optional: Transform data
transform {
  TransformName {
    option = value
  }
}

sink {
  SinkConnector {
    option1 = value1
    option2 = value2
  }
}

2. Common Use Cases

Use Case 1: MySQL to PostgreSQL (Batch)

config/mysql_to_postgres.conf

env {
  job.mode = "BATCH"
  job.name = "MySQL to PostgreSQL"
}

source {
  Jdbc {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://mysql-host:3306/mydb"
    user = "root"
    password = "password"
    query = "SELECT * FROM users"
    connection_check_timeout_sec = 100
  }
}

sink {
  Jdbc {
    driver = "org.postgresql.Driver"
    url = "jdbc:postgresql://pg-host:5432/mydb"
    user = "postgres"
    password = "password"
    database = "mydb"
    table = "users"
    primary_keys = ["id"]
    connection_check_timeout_sec = 100
  }
}

Run:

seatunnel.sh -c config/mysql_to_postgres.conf -e spark

Use Case 2: Kafka Streaming to Elasticsearch

config/kafka_to_es.conf

env {
  job.mode = "STREAMING"
  job.name = "Kafka to Elasticsearch"
  parallelism = 2
}

source {
  Kafka {
    bootstrap.servers = "kafka-host:9092"
    topic = "events"
    patterns = "event.*"
    consumer.group = "seatunnel-group"
    format = "json"
    schema = {
      fields {
        event_id = "bigint"
        event_name = "string"
        timestamp = "bigint"
        payload = "string"
      }
    }
  }
}

transform {
  Sql {
    sql = "SELECT event_id, event_name, FROM_UNIXTIME(timestamp/1000) as ts, payload FROM source"
  }
}

sink {
  Elasticsearch {
    hosts = ["es-host:9200"]
    index = "events"
    index_type = "_doc"
    username = "elastic"
    password = "password"
  }
}

Run:

seatunnel.sh -c config/kafka_to_es.conf -e flink

Use Case 3: CDC from MySQL to Kafka

config/mysql_cdc_kafka.conf

env {
  job.mode = "STREAMING"
  job.name = "MySQL CDC to Kafka"
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://mysql-host:3306/mydb"
    username = "root"
    password = "password"
    database-names = ["mydb"]
    table-names = ["mydb.users", "mydb.orders"]
    startup.mode = "initial"
    server-id = 5400
    snapshot.split.size = 8096
    snapshot.fetch.size = 1024
    server-time-zone = "UTC"
  }
}

sink {
  Kafka {
    bootstrap.servers = "kafka-host:9092"
    topic = "mysql_cdc"
    format = "canal_json"
    semantic = "EXACTLY_ONCE"
  }
}

Run:

seatunnel.sh -c config/mysql_cdc_kafka.conf -e flink

3. Running Jobs

Local Mode (Single Machine)

# Using Spark
seatunnel.sh -c config/job.conf -e spark

# Using Flink
seatunnel.sh -c config/job.conf -e flink

# Using Zeta (lightweight, the default engine in recent releases)
seatunnel.sh -c config/job.conf -e zeta

Cluster Mode

# Submit to Spark cluster
seatunnel.sh -c config/job.conf -e spark -m cluster -n hadoop-master:7077

# Submit to Flink cluster
seatunnel.sh -c config/job.conf -e flink -m remote -s localhost:8081

Verbose Output

seatunnel.sh -c config/job.conf -e spark -l DEBUG

Check Status

# View running jobs (Spark Cluster)
spark-submit --status <driver-id>

# View running jobs (Flink Cluster)
$FLINK_HOME/bin/flink list

4. SQL API (Advanced)

Use SQL for complex transformations:

source {
  Jdbc {  # e.g. a MySQL source
    # Source config...
  }
}

transform {
  Sql {
    # Multi-line SQL via a HOCON triple-quoted string
    sql = """
      SELECT
        id,
        UPPER(name) as name,
        age + 10 as age_plus_10,
        CURRENT_TIMESTAMP as created_at
      FROM source
      WHERE age > 18
    """
  }
}

sink {
  Jdbc {  # e.g. a PostgreSQL sink
    # Sink config...
  }
}
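
When a job contains several plugins, they are chained by table name using result_table_name / source_table_name (option names vary across 2.3.x releases; check the config concept docs for your version). A minimal self-contained sketch with FakeSource and Console:

env {
  job.mode = "BATCH"
}

source {
  FakeSource {
    result_table_name = "raw"
    row.num = 10
    schema = { fields { id = "bigint", name = "string" } }
  }
}

transform {
  Sql {
    source_table_name = "raw"
    result_table_name = "clean"
    sql = "SELECT id, UPPER(name) AS name FROM raw"
  }
}

sink {
  Console {
    source_table_name = "clean"
  }
}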

API Reference

Core Connector Types

Source Connectors

  • Jdbc: Generic JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server, DB2, OceanBase)
  • Kafka: Apache Kafka topics
  • MySQL-CDC: MySQL change data capture
  • MongoDB: MongoDB collections
  • Postgres-CDC: PostgreSQL change data capture
  • S3/OssFile/CosFile/HdfsFile: Object storage / file systems
  • Http: HTTP/HTTPS endpoints
  • Hive/Iceberg/Hudi/Paimon: Data lake formats
  • Elasticsearch/Easysearch: Search engines
  • Redis/HBase/Cassandra: NoSQL databases
  • Pulsar/RocketMQ/RabbitMQ/ActiveMQ: Message queues
  • ClickHouse/StarRocks/Doris/Druid: OLAP engines
  • IoTDB/TDengine/InfluxDB: Time-series databases
  • Milvus/Qdrant: Vector databases
  • Neo4j: Graph databases
  • FakeSource: For testing and development

Transform Connectors

  • Sql: Execute SQL transformations
  • Dummy: Pass-through (testing)
  • FieldMapper: Rename/map columns (see the sketch after this list)
  • JsonPath: Extract data from JSON
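
A minimal FieldMapper sketch (the field_mapper option per the transform-v2 docs; table names here are placeholders):

transform {
  FieldMapper {
    source_table_name = "source"
    result_table_name = "mapped"
    field_mapper = {
      id = id             # keep column as-is
      name = user_name    # rename name -> user_name
    }
  }
}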

Sink Connectors

  • Jdbc: Write to JDBC-compatible databases
  • Kafka: Publish to Kafka topics
  • Elasticsearch: Write to Elasticsearch indices
  • S3: Write to S3 buckets
  • Redis: Write to Redis
  • HBase: Write to HBase tables
  • StarRocks: Write to StarRocks tables
  • ClickHouse/Doris/Druid: Write to OLAP engines
  • Hive/Iceberg/Hudi/Paimon: Write to data lakes
  • Console: Output to console (testing)

All source connectors above also have corresponding sink implementations where applicable.

Configuration Options

Common Source Options

source {
  ConnectorName {
    # Connection
    hostname = "host"
    port = 3306
    username = "user"
    password = "pass"

    # Data selection
    database = "db_name"
    table = "table_name"

    # Performance
    fetch_size = 1000
    connection_check_timeout_sec = 100
    split_size = 10000

    # Schema
    schema = {
      fields {
        id = "bigint"
        name = "string"
      }
    }
  }
}

Common Sink Options

sink {
  ConnectorName {
    # Connection
    hostname = "host"
    port = 3306
    username = "user"
    password = "pass"

    # Write behavior
    database = "db_name"
    table = "table_name"
    primary_keys = ["id"]
    batch_size = 500

    # Error handling
    max_retries = 3
    retry_wait_time_ms = 1000
    on_duplicate_key_update_column_names = ["field1"]
  }
}

Engine Options

env {
  # Execution mode
  job.mode = "BATCH"  # or STREAMING

  # Job identity
  job.name = "My Job"

  # Parallelism
  parallelism = 4

  # Checkpoint (streaming)
  checkpoint.interval = 60000

  # Restart strategy
  restart_strategy = "fixed_delay"
  restart_strategy.fixed_delay.attempts = 3
  restart_strategy.fixed_delay.delay = 10000
}

Debugging and Monitoring

View Job Metrics

# During execution, monitor logs
tail -f logs/seatunnel.log

# Check specific log level
grep ERROR logs/seatunnel.log

Enable Debug Logging

# Set log level to DEBUG
seatunnel.sh -c config/job.conf -e spark -l DEBUG

Use Test Sources

source {
  FakeSource {
    row.num = 1000
    schema = {
      fields {
        id = "bigint"
        name = "string"
      }
    }
  }
}

sink {
  Console {
    format = "json"
  }
}

Configuration

Project Structure

seatunnel/
├── bin/                          # Executable scripts
│   ├── seatunnel.sh             # Main entry point
│   └── seatunnel-submit.sh      # Spark submission script
├── config/                       # Configuration examples
│   ├── flink-conf.yaml          # Flink configuration
│   └── spark-conf.yaml          # Spark configuration
├── connectors/                  # Pre-built connectors
├── lib/                         # JAR dependencies
├── logs/                        # Runtime logs
└── plugin/                      # Plugin directory

Environment Variables

# Java configuration
export JAVA_HOME=/path/to/java
export JVM_OPTS="-Xms1G -Xmx4G"

# Spark configuration
export SPARK_HOME=/path/to/spark
export SPARK_MASTER=spark://master:7077

# Flink configuration
export FLINK_HOME=/path/to/flink

# SeaTunnel configuration
export SEATUNNEL_HOME=/path/to/seatunnel

Key Configuration Files

seatunnel-env.sh

Configure runtime environment:

JAVA_HOME=/path/to/java
JVM_OPTS="-Xms1G -Xmx4G -XX:+UseG1GC"
SPARK_HOME=/path/to/spark
FLINK_HOME=/path/to/flink

flink-conf.yaml
(for Flink engine)

taskmanager.memory.process.size: 2g
taskmanager.memory.jvm-overhead.fraction: 0.1
parallelism.default: 4

spark-conf.yaml
(for Spark engine)

driver-memory: 2g
executor-memory: 4g
num-executors: 4

Performance Tuning

For Batch Jobs

env {
  job.mode = "BATCH"
  parallelism = 8  # Increase for larger clusters
}

source {
  Jdbc {
    # Split large tables for parallel reads
    split_size = 100000
    fetch_size = 5000
  }
}

sink {
  Jdbc {
    # Batch inserts for better throughput
    batch_size = 1000
    # Use connection pooling
    max_retries = 3
  }
}

For Streaming Jobs

env {
  job.mode = "STREAMING"
  parallelism = 4
  checkpoint.interval = 30000  # 30 seconds
}

source {
  Kafka {
    # Consumer group for parallel reads
    consumer.group = "seatunnel-consumer"
    # Batch reading
    max_poll_records = 500
  }
}

Website & Documentation System

The official documentation site (https://seatunnel.apache.org) is managed through the apache/seatunnel-website repository, built with Docusaurus 2.4.3.

How Documentation Works

Documentation content lives in the main codebase (the apache/seatunnel repo, docs/ directory). The website repo fetches docs via npm run sync (or tools/build-docs.js) before building.

The sync process:

  1. Creates a swap/ directory and clones the apache/seatunnel codebase
  2. Copies docs/en to the website's docs/ directory
  3. Copies docs/zh to i18n/zh-CN/ for Chinese translations
  4. Copies docs/images to static/image_en and static/image_zh
  5. Copies docs/sidebars.js to the root level
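
Roughly, those steps amount to the following shell commands (illustrative only; the real logic lives in tools/build-docs.js):

git clone https://github.com/apache/seatunnel.git swap/seatunnel
cp -r swap/seatunnel/docs/en/* docs/
cp -r swap/seatunnel/docs/zh/* i18n/zh-CN/
cp -r swap/seatunnel/docs/images static/image_en
cp -r swap/seatunnel/docs/images static/image_zh
cp swap/seatunnel/docs/sidebars.js sidebars.js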

Website Repository Structure

seatunnel-website/
├── blog/                          # Blog posts (EN)
├── community/                     # Community docs
│   ├── contribution_guide/        # Contribution guidelines
│   │   ├── contribute.md
│   │   ├── committer.md
│   │   ├── code-review.md
│   │   ├── release.md
│   │   └── subscribe.md
│   └── submit_guide/             # Submission guidelines
│       ├── document.md
│       ├── license.md
│       └── submit-code.md
├── docs/                          # Current version docs (synced from main repo)
├── docusaurus.config.js           # Site configuration
├── i18n/zh-CN/                    # Chinese translations
│   ├── docusaurus-plugin-content-blog/
│   ├── docusaurus-plugin-content-docs/
│   └── docusaurus-plugin-content-docs-community/
├── package.json
├── sidebars.js                    # Doc sidebar config (synced)
├── src/
│   ├── components/
│   ├── css/
│   ├── pages/
│   │   ├── home/                  # Homepage
│   │   ├── team/                  # Team page (with Base64 avatars)
│   │   ├── user/                  # User showcase
│   │   └── versions/             # Version selector
│   └── styles/
├── static/                        # Static assets
│   ├── doc/image/, image_en/, image_zh/
│   ├── home/, image/, user/
│   └── js/google_translate_init.js
├── tools/
│   ├── build-docs.js              # Doc sync script
│   ├── common.js                  # Shared constants
│   ├── version.js                 # Version management
│   ├── image-copy.js              # Image processing
│   └── fetch-team-avatars.js      # Team avatar updater
├── versioned_docs/                # Historical version docs (20 versions: 1.x ~ 2.3.12)
├── versioned_sidebars/            # Historical sidebars
├── versions.json                  # Version registry
└── seatunnel_web_versions.json    # Web UI version registry

Documentation Categories (per version)

Each version's docs contain:

  • about.md - Project overview
  • command/ - CLI commands (e.g., connector-check)
  • concept/ - Core concepts (config, schema-evolution, speed-limit, SQL config, event-listener)
  • connector-v2/ - Connector documentation
    • source/ - Source connector docs (100+ connectors)
    • sink/ - Sink connector docs
    • changelog/ - Per-connector changelogs
    • formats/ - Data formats (Avro, Canal JSON, Debezium JSON, OGG JSON, Protobuf)
  • contribution/ - Contributor guides
  • faq.md - Frequently asked questions
  • other-engine/ - Flink/Spark engine specifics
  • seatunnel-engine/ - Zeta engine docs + telemetry
  • start-v2/ - Getting started guides
    • docker/ - Docker deployment
    • kubernetes/ - K8s deployment
    • locally/ - Local setup
  • transform-v2/ - Transform documentation (SQL, etc.)

Versioned Documentation

20 historical versions maintained: 1.x, 2.1.0 - 2.1.3, 2.2.0-beta, 2.3.0-beta - 2.3.12

Total: ~3691 versioned doc files + ~1652 i18n files

Website Development

# Clone
git clone git@github.com:apache/seatunnel-website.git
cd seatunnel-website

# Sync docs from main repo
npm run sync
# Or with SSH: export PROTOCOL_MODE=ssh && npm run sync

# Install dependencies
npm install

# Dev server (English)
npm run start

# Dev server (Chinese)
npm run start-zh

# Production build (needs ~10GB heap)
npm run build
# Or faster parallel build
npm run build:fast

# Serve built site
npm run serve

Adding a New Version

# 1. Create versioned snapshot
npm run version <target_version>

# 2. Update download links
# Edit src/pages/download/st_data.json

Branching Strategy

| Branch      | Purpose |
|-------------|---------|
| main        | Default development branch |
| asf-site    | Production (https://seatunnel.apache.org) |
| asf-staging | Staging (https://seatunnel.staged.apache.org) |

CI/CD

GitHub Actions workflow (.github/workflows/deploy.yml):

  • Triggers: push to main, PRs to main, daily cron (5:00 AM UTC)
  • Node.js 18.20.7
  • Steps: npm install -> npm run sync -> npm run build -> deploy to the asf-site branch

Key Conventions

  • Directory names: lowercase, underscore-separated, plural (e.g., scripts, components)
  • JS/static files: lowercase, dash-separated (e.g., render-dom.js)
  • Images: stored under static/{module_name}
  • Styles: placed in src/css/
  • Most pages have an "Edit this page" link pointing to the GitHub source

Development

Project Architecture

SeaTunnel follows a modular architecture:

seatunnel/
├── seatunnel-api/              # Core connector & table APIs
├── seatunnel-common/           # Shared utilities
├── seatunnel-core/             # CLI starters for Zeta/Spark/Flink
├── seatunnel-engine/           # Zeta engine implementation
├── seatunnel-translation/      # Spark/Flink translation layers
├── seatunnel-connectors-v2/    # 100+ connector implementations
│   └── connector-*/            # One module per connector
├── seatunnel-transforms-v2/    # Transform plugins
└── seatunnel-dist/             # Distribution package

Building from Source

Full Build

# Clone repository
git clone https://github.com/apache/seatunnel.git
cd seatunnel

# Build all modules
mvn clean install -DskipTests

# Build with tests (slower)
mvn clean install

Build Specific Module

# Build only the Kafka connector (module paths may vary by version)
mvn clean install -pl seatunnel-connectors-v2/connector-kafka -am -DskipTests

# Build only the Zeta engine
mvn clean install -pl seatunnel-engine -am -DskipTests

Build Distribution

# Create binary distribution
mvn clean install -DskipTests
cd seatunnel-dist
tar -tzf target/apache-seatunnel-*-bin.tar.gz | head

Running Tests

Unit Tests

# Run all tests
mvn test

# Run specific test class
mvn test -Dtest=MySqlConnectorTest

# Run with coverage
mvn test jacoco:report

Integration Tests

# Run integration tests
mvn verify

# Run with Docker containers
mvn verify -Pintegration-tests

Development Setup

IDE Configuration (IntelliJ IDEA)

  1. Import Project

    • File → Open → Select seatunnel directory
    • Choose Maven as build system
  2. Configure JDK

    • Project Settings → Project → JDK
    • Select JDK 1.8 or higher
  3. Enable Annotation Processing

    • Project Settings → Build → Compiler → Annotation Processors
    • Enable annotation processing
  4. Run Configuration

    • Run → Edit Configurations
    • Add new "Application" configuration
    • Set main class:
      org.apache.seatunnel.core.starter.command.CommandExecuteRunner

Code Style

SeaTunnel follows Apache project conventions:

  • 4-space indentation (not tabs)
  • Line length max 120 characters
  • Standard Java naming conventions
  • Organize imports alphabetically

Use the provided

.editorconfig
:

# Install EditorConfig plugin (IntelliJ)
# Then your IDE will auto-format code
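
For reference, a minimal .editorconfig matching the conventions above (illustrative; defer to the file shipped in the repo):

root = true

[*.java]
indent_style = space
indent_size = 4
max_line_length = 120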

Creating Custom Connectors

1. Implement the SeaTunnelSource Interface

// Hedged sketch against the connector-v2 API; generics and signatures vary by
// SeaTunnel version. CustomSplit/CustomState/CustomSourceReader are your own types.
import org.apache.seatunnel.api.source.Boundedness;
import org.apache.seatunnel.api.source.SeaTunnelSource;
import org.apache.seatunnel.api.source.SourceReader;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;

public class CustomSource
        implements SeaTunnelSource<SeaTunnelRow, CustomSplit, CustomState> {

    @Override
    public String getPluginName() {
        return "Custom";  // the name used in job config: source { Custom { ... } }
    }

    @Override
    public Boundedness getBoundedness() {
        return Boundedness.BOUNDED;  // UNBOUNDED for streaming sources
    }

    @Override
    public SourceReader<SeaTunnelRow, CustomSplit> createReader(
            SourceReader.Context context) throws Exception {
        return new CustomSourceReader(context);  // emits rows for assigned splits
    }

    // createEnumerator(...)/restoreEnumerator(...) assign splits to readers and
    // resume from checkpointed state; omitted here for brevity.
}

2. Create Configuration Class

import org.apache.seatunnel.api.configuration.Option;
import org.apache.seatunnel.api.configuration.Options;

public class CustomSourceOptions {
    public static final Option<String> HOST =
        Options.key("host")
            .stringType()
            .noDefaultValue()
            .withDescription("Source hostname");

    public static final Option<Integer> PORT =
        Options.key("port")
            .intType()
            .defaultValue(9200)
            .withDescription("Source port");
}

3. Register Connector

META-INF/services/org.apache.seatunnel.api.source.SeaTunnelSource
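
The file lives on the connector's classpath and lists the fully qualified implementation class, e.g. (hypothetical package name):

com.example.seatunnel.CustomSource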

Contributing Guide

  1. Fork and Clone

    git clone https://github.com/YOUR_USERNAME/seatunnel.git
    cd seatunnel
    git remote add upstream https://github.com/apache/seatunnel.git
    
  2. Create Feature Branch

    git checkout -b feature/my-feature
    
  3. Make Changes

    • Follow code style guide
    • Add tests for new features
    • Update documentation
  4. Commit and Push

    git add .
    git commit -m "feat: add new feature"
    git push origin feature/my-feature
    
  5. Create Pull Request

    • Go to GitHub repository
    • Create PR with clear description
    • Link any related issues
    • Wait for review
  6. Code Review Process

    • Address feedback from maintainers
    • Update PR with changes
    • After approval, maintainers will merge

Troubleshooting

Common Issues and Solutions

Issue 1: "ClassNotFoundException: com.mysql.jdbc.Driver"

Cause: JDBC driver JAR not in classpath

Solution:

# 1. Download MySQL JDBC driver
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.33.jar

# 2. Copy to lib directory
cp mysql-connector-java-8.0.33.jar $SEATUNNEL_HOME/lib/

# 3. Restart job
seatunnel.sh -c config/job.conf -e spark

Issue 2: "OutOfMemoryError: Java heap space"

Cause: Insufficient JVM heap memory

Solution:

# Increase JVM memory
export JVM_OPTS="-Xms2G -Xmx8G"

# Or set in seatunnel-env.sh
echo 'JVM_OPTS="-Xms2G -Xmx8G"' >> $SEATUNNEL_HOME/bin/seatunnel-env.sh

Issue 3: "Connection refused: connect"

Cause: Source/sink service not reachable

Solution:

# 1. Verify connectivity
ping source-host
telnet source-host 3306

# 2. Check credentials
mysql -h source-host -u root -p

# 3. Check firewall rules
# Ensure port 3306 is open

Issue 4: "Table not found" during CDC

Cause: Binlog not enabled on MySQL

Solution:

-- 1. Check binlog status
SHOW VARIABLES LIKE 'log_bin';

-- 2. Enable binlog in my.cnf (then restart MySQL):
--    [mysqld]
--    log_bin = mysql-bin
--    binlog_format = row   -- row format is required for CDC

-- 3. Verify after restart
SHOW MASTER STATUS;

Issue 5: Slow Job Performance

Cause: Suboptimal configuration

Solutions:

# 1. Increase parallelism
env {
  parallelism = 8  # or higher based on cluster
}

# 2. Increase batch sizes
source {
  Jdbc {
    fetch_size = 5000
    split_size = 100000
  }
}

sink {
  Jdbc {
    batch_size = 2000
  }
}

# 3. Enable connection pooling
source {
  Jdbc {
    pool_size = 10
    max_idle_time = 300
  }
}

Issue 6: "Kafka topic offset out of range"

Cause: Offset doesn't exist in topic

Solution:

source {
  Kafka {
    # Reset to earliest or latest
    auto.offset.reset = "earliest"  # or "latest"

    # Or specify explicit offsets
    start_mode = "earliest"
  }
}

FAQ

Q: What's the difference between BATCH and STREAMING mode?

A:

  • BATCH: One-time execution, suitable for full database migration
  • STREAMING: Continuous execution, suitable for real-time data sync and CDC

Q: How do I handle schema changes during CDC?

A: SeaTunnel's CDC connectors support schema evolution, but it must be enabled in the source (option name per the MySQL-CDC docs; availability varies by version):

source {
  MySQL-CDC {
    schema-changes.enabled = true
  }
}

Q: Can I transform data during synchronization?

A: Yes, use SQL transform:

transform {
  Sql {
    sql = "SELECT id, UPPER(name) as name FROM source"
  }
}

Q: What's the maximum throughput?

A: Depends on:

  • Hardware (CPU, RAM, Network)
  • Source/sink database configuration
  • Data size per record
  • Network latency

Typical throughput: 100K - 1M records/second per executor

Q: How do I handle errors in production?

A: Configure restart strategy:

env {
  restart_strategy = "exponential_delay"
  restart_strategy.exponential_delay.initial_delay = 1000
  restart_strategy.exponential_delay.max_delay = 30000
  restart_strategy.exponential_delay.multiplier = 2.0
  restart_strategy.exponential_delay.attempts_unlimited = true
}

Q: Is there a web UI for job management?

A: Yes! Use SeaTunnel Web Project:

# Check out web UI project
git clone https://github.com/apache/seatunnel-web.git
cd seatunnel-web
mvn clean install

# Run web UI
java -jar target/seatunnel-web-*.jar
# Access at http://localhost:8080

Resources

Official Links

  • Website & documentation: https://seatunnel.apache.org
  • Main repository: https://github.com/apache/seatunnel
  • Releases: https://archive.apache.org/dist/seatunnel/

Community

  • Slack and Apache mailing lists (invite links on the website)

Related Projects

  • apache/seatunnel-web - Visual job management UI
  • apache/seatunnel-website - Documentation site source

Learning Resources

  • Versioned documentation on the website (20 versions, 1.x ~ 2.3.12)

Version History

  • 2.3.12 (Latest Stable) - Current recommended version
  • 2.3.13-SNAPSHOT (Development)
  • 20 historical versions maintained in documentation (1.x ~ 2.3.12)
  • All releases: https://archive.apache.org/dist/seatunnel/

Additional Notes

License

Apache License 2.0 - See LICENSE file

Security

  • Report security issues via Apache Security
  • Do NOT create public issues for security vulnerabilities

Support & Contribution

  • Join the community Slack for support
  • Submit feature requests on GitHub Issues
  • Contribute code via Pull Requests
  • Follow Contributing Guide

Last Updated: 2026-02-26 · Skill Version: 2.3.13 · Sources: apache/seatunnel + apache/seatunnel-website