建立开发环境

在本部分,我们将向您展示如何为SeaTunnel设置开发环境,然后在您的JetBrains IntelliJ IDEA中运行一个简单的示例。

您可以在任何您喜欢的开发环境中开发或测试SeaTunnel代码,但在这里,我们以JetBrains IDEA作为示例,逐步教您设置环境。

准备工作

在我们开始讨论如何设置环境之前,我们需要进行一些准备工作。确保您已经安装了以下软件:

  1. 安装了Git。
  2. 安装了Java(目前支持JDK8/JDK11)并设置了JAVA_HOME。
  3. 安装了Scala(目前仅支持Scala 2.11.12)。
  4. 安装了JetBrains IDEA。

设置

克隆源代码

首先,您需要从GitHub克隆SeaTunnel的源代码。

  [bash]
1
git clone git@github.com:apache/seatunnel.git

本地安装子项目

在克隆源代码之后,您应该运行./mvnw命令将子项目安装到Maven本地仓库。

否则,在JetBrains IntelliJ IDEA中,您的代码可能无法正确启动。

  [bash]
1
./mvnw install -Dmaven.test.skip

或者使用 maven 命令:

  [bash]
1
mvn clean install -DskipTests -U

编译报错 1

  [plaintext]
1
2
3
4
5
6
7
8
9
10
[ERROR] Failed to execute goal on project seatunnel-flink-15-starter: Could not resolve dependencies for project org.apache.seatunnel:seatunnel-flink-15-starter:jar:2.3.4-SNAPSHOT: Could not find artifact org.apache.seatunnel:seatunnel-flink-starter-common:jar:2.3.3-SNAPSHOT in apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :seatunnel-flink-15-starter

[Bug] [seatunnel-translation-flink-15] failed compile

in 2.3.3 release source code, just update revisionto{project.version} like the picture below.

release

PS: 还是把这个构建一个自己的源码分支。

编译报错 2

  [plaintext]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) on project connector-iceberg: Execution process-resource-bundles of goal org.apache.maven.plugins:maven-remote-resources-plugin:1.7.0:process failed: Failed to resolve dependencies for one or more projects in the reactor. Reason: Unable to get dependency information for org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to retrieve POM for org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to conjars (http://conjars.org/repo): Connect to conjars.org:80 [conjars.org/54.235.127.59] failed: Connection timed out: connect [ERROR] org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde [ERROR] [ERROR] from the specified remote repositories: [ERROR] apache.snapshots (https://repository.apache.org/snapshots, releases=false, snapshots=true), [ERROR] repo1 (https://repo1.maven.org/maven2, releases=true, snapshots=false), [ERROR] conjars (http://conjars.org/repo, releases=true, snapshots=true) [ERROR] Path to dependency: [ERROR] 1) org.apache.seatunnel:connector-iceberg:jar:2.3.4-SNAPSHOT [ERROR] 2) org.apache.hive:hive-exec:jar:core:2.3.9 [ERROR] [ERROR] [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException [ERROR]

[Maven try to download a pom</package> pom as a jar file and cannot find it](https://stackoverflow.com/questions/42536845/maven-try-to-download-a-packaingpom-package-pom-as-a-jar-file-and-cannot-fin)

包下载失败

下载失败

1)手动下载

https://mvnrepository.com/artifact/org.pentaho/pentaho-aggdesigner-algorithm/5.1.4-jhyde

网址:https://developer.aliyun.com/mvn/search

将jar和pom放到maven对应路径下即可。

pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar pentaho-aggdesigner-algorithm-5.1.5-jhyde.pom

也可以手动 jar,放入到对应的目录:

  [plaintext]
1
~/.m2/repository/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar

我本地是:

  [plaintext]
1
C:\Users\dh\.m2\repository\org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/

2)添加 aliyun 的仓库

用这种方式更方便些。

Could not find artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde

是因为这个包不在阿里云公共maven镜像仓库上,需要添加一个新的镜像仓库,修改maven的settings.xml

  [xml]
1
2
3
4
5
6
7
8
9
10
11
12
13
<!-- 添加这个镜像仓库在阿里云公共仓库前面 --> <mirror> <id>aliyunmaven</id> <mirrorOf>*</mirrorOf> <name>spring-plugin</name> <url>https://maven.aliyun.com/repository/spring-plugin</url> </mirror> <mirror> <id>aliyunmaven</id> <mirrorOf>*</mirrorOf> <name>阿里云公共仓库</name> <url>https://maven.aliyun.com/repository/public</url> </mirror>

在这个包下载完后,可以把新增的阿里云spring-plugin镜像仓库注释掉,依旧优先使用阿里云公共仓库。

个人做法:

  [xml]
1
2
3
4
5
6
7
8
9
10
11
12
<mirror> <id>repo1</id> <mirrorOf>central</mirrorOf> <url>https://repo1.maven.org/maven2</url> </mirror> <mirror> <id>aliyunmaven</id> <mirrorOf>*</mirrorOf> <name>spring-plugin</name> <url>https://maven.aliyun.com/repository/spring-plugin</url> </mirror>

把 aliyun spring-plugin 放在后面。这样主仓库没有的,才去后面下载。省的删除。

编译报错3

  [plaintext]
1
2
3
4
5
6
7
8
9
10
[ERROR] Failed to execute goal on project connector-file-jindo-oss: Could not resolve dependencies for project org.apache.seatunnel:connector-file-jindo-oss:jar:2.3.4-SNAPSHOT: The following artifacts could not be resolved: com.aliyun.jindodata:jindo-core:jar:4.6.1, com.aliyun.jindodata:jindosdk:jar:4.6.1: Could not find artifact com.aliyun.jindodata:jindo-core:jar:4.6.1 in aliyunmaven (https://maven.aliyun.com/repository/public) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :connector-file-jindo-oss

这是因为上面的 aliyun mirror 了所有的 maven,但是阿里云却没有这个包。。。

我们在 https://help.aliyun.com/zh/emr/emr-on-ecs/user-guide/download-jindodata

下载 v4.6.1

https://emr-public-sh.oss-cn-shanghai.aliyuncs.com/emrjindodata%2Fv4.6.1%2Fjindosdk-4.6.1.tar.gz

下载完成后,解压,在 lib 中找到对应的jar,放在目录 D:\doc\seatunnel-resources\other

在此目录打开命令行,通过 maven 的方式安装:

  [plaintext]
1
2
3
mvn install:install-file -DgroupId=com.aliyun.jindodata -DartifactId=jindo-core -Dversion=4.6.1 -Dpackaging=jar -Dfile=jindo-core-4.6.1.jar mvn install:install-file -DgroupId=com.aliyun.jindodata -DartifactId=jindosdk -Dversion=4.6.1 -Dpackaging=jar -Dfile=jindo-sdk-4.6.1.jar

从源代码构建SeaTunnel

在安装Maven之后,您可以使用以下命令进行编译和打包。

  [bash]
1
mvn clean package -pl seatunnel-dist -am -Dmaven.test.skip=true

解释

这个 Maven 命令具有以下选项和参数:

  • mvn: Maven 的执行命令。
  • clean: 清理命令,用于清除先前构建生成的文件和目录。
  • package: 打包命令,用于将项目代码编译、测试,并将其打包成可部署的格式(通常是 JAR 文件)。
  • -pl seatunnel-dist: 使用 -pl(或 --projects)指定要构建的项目或模块。在这里,只构建名为 seatunnel-dist 的项目。
  • -am: 使用 -am(或 --also-make)选项,将指定项目及其依赖一起构建。在这里,它将构建 seatunnel-dist 项目及其依赖。
  • -Dmaven.test.skip=true: 使用系统属性 maven.test.skip 设置为 true,跳过测试阶段,不执行测试。

综合起来,这个命令的作用是清理先前的构建、构建指定项目 seatunnel-dist 及其依赖,将其打包,并跳过测试阶段。

这在一些情况下很有用,比如在构建过程中不需要执行测试,或者为了加速构建过程。

构建子模块

如果您想单独构建子模块,可以使用以下命令进行编译和打包。

  [bash]
1
2
# 这是构建Redis连接器的示例 mvn clean package -pl seatunnel-connectors-v2/connector-redis -am -DskipTests -T 1C

安装JetBrains IntelliJ IDEA Scala插件

现在,您可以打开您的JetBrains IntelliJ IDEA并浏览源代码,但是要在IDEA中构建Scala代码,您还应该安装JetBrains IntelliJ IDEA的Scala插件。

scala plugin

如果需要,请参阅IDEA安装插件。

安装JetBrains IntelliJ IDEA Lombok插件

在运行以下示例之前,您还应该安装JetBrains IntelliJ IDEA的Lombok插件。如果需要,请参阅IDEA安装插件。

lombok plugin

代码风格

Apache SeaTunnel使用Spotless进行代码风格和格式检查。您可以运行以下命令,Spotless将自动为您修复代码风格和格式错误。

  [bash]
1
./mvnw spotless:apply

您可以将预提交挂钩文件/tools/spotless_check/pre-commit.sh复制到您的.git/hooks/目录,这样每次使用git commit提交代码时,Spotless将自动为您修复问题。

运行简单示例

完成上述所有步骤后,您已经完成了环境设置,并可以运行我们提供给您的示例。

所有示例都在seatunnel-examples模块中,您可以选择您感兴趣的一个,在IDEA中运行或调试它。

在这里,我们以seatunnel-examples/seatunnel-flink-connector-v2-example/src/main/java/org/apache/seatunnel/example/flink/v2/SeaTunnelApiExample.java作为示例。当您成功运行时,您可以看到以下输出:

  [plaintext]
1
2
3
4
5
6
+I[Ricky Huo, 71] +I[Gary, 12] +I[Ricky Huo, 93] ... ... +I[Ricky Huo, 83]

运行报错1

启动时,发现报错如下:

  [plaintext]
1
2
3
4
5
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:469) ~[hadoop-common-3.1.4.jar:?] at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:440) ~[hadoop-common-3.1.4.jar:?] at org.apache.hadoop.util.Shell.<clinit>(Shell.java:517) ~[hadoop-common-3.1.4.jar:?] ... 20 more

https://github.com/apache/seatunnel/issues/5892

Windows 环境,编译这个比较痛苦。

1)debug 定位到 checkpoint 的配置文件。

  [plaintext]
1
//file:/D:/_my/seatunnel-2.3.3-release-slim/seatunnel-engine/seatunnel-engine-common/target/classes/seatunnel.yaml

2) 修改对应的配置文件内容为:

把默认的 HDFS 调整为 localfile

  [yaml]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
seatunnel: engine: backup-count: 1 queue-type: blockingqueue print-execution-info-interval: 60 slot-service: dynamic-slot: true checkpoint: interval: 300000 timeout: 10000 storage: type: localfile max-retained: 3 plugin-config: namespace: C:\ProgramData\seatunnel\checkpoint\ # storage: # type: hdfs # max-retained: 3 # plugin-config: # namespace: /tmp/seatunnel/checkpoint_snapshot/ # storage.type: hdfs # fs.defaultFS: file:///tmp/

3) 重新打包

  [plaintext]
1
mvn clean install -DskipTests=true

重新确认上述路径的文件是否已经按照预期修改。

4)重新执行。

首先启动 exampleServer

然后启动对应的 example

测试效果

日志:

  [plaintext]
1
2
3
4
5
6
7
8
2024-01-12 18:03:21,229 DEBUG org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceReader - reader 0 add splits [FakeSourceSplit(splitId=0, rowNum=5)] 2024-01-12 18:03:21,235 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceReader - 5 rows of data have been generated in split(0). Generation time: 1705053801232 2024-01-12 18:03:21,235 INFO org.apache.seatunnel.connectors.seatunnel.fake.source.FakeSourceReader - Closed the bounded fake source 2024-01-12 18:03:21,236 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=1: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : WnRWU, 1011154944 2024-01-12 18:03:21,236 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=2: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : CAChl, 1231359126 2024-01-12 18:03:21,236 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=3: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : Bzdjj, 729673747 2024-01-12 18:03:21,236 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=4: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : YKWUP, 41852707 2024-01-12 18:03:21,236 INFO org.apache.seatunnel.connectors.seatunnel.console.sink.ConsoleSinkWriter - subtaskIndex=0 rowIndex=5: SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : Ikbzt, 1583951001

默认的测试配置为:

  [yaml]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
###### ###### This config file is a demonstration of streaming processing in seatunnel config ###### env { # You can set engine configuration here execution.parallelism = 1 job.mode = "BATCH" checkpoint.interval = 5000 #execution.checkpoint.data-uri = "hdfs://localhost:9000/checkpoint" } source { # This is a example source plugin **only for test and demonstrate the feature source plugin** FakeSource { result_table_name = "fake" parallelism = 1 schema = { fields { name = "string" age = "int" } } } } transform { } sink { console { source_table_name="fake" } }

更多信息

所有我们的示例都使用简单的源和汇聚,以减少依赖并使其易于运行。

您可以在resources/examples中更改示例配置。

如果您想要使用PostgreSQL作为源并将其作为汇聚输出到控制台,可以按照以下配置更改您的配置。

  [plaintext]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
env { execution.parallelism = 1 } source { JdbcSource { driver = org.postgresql.Driver url = "jdbc:postgresql://host:port/database" username = postgres query = "select * from test" } } sink { ConsoleSink {} }

参考资料

安装说明

Seatunnel实践及相关报错总结

https://seatunnel.apache.org/docs/2.3.3/contribution/setup

alibabacloud-jindodata