From 15c736844309bcddf47bb4f52b7def2865445a80 Mon Sep 17 00:00:00 2001
From: Raghav Aggarwal
Date: Fri, 10 Apr 2026 20:19:41 +0530
Subject: [PATCH 1/2] TEZ-4703: Append BUILDING.txt to root README.md

---
 BUILDING.txt | 182 -----------------------------------------
 README.md    | 227 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 218 insertions(+), 191 deletions(-)
 delete mode 100644 BUILDING.txt

diff --git a/BUILDING.txt b/BUILDING.txt
deleted file mode 100644
index cd8623a784..0000000000
--- a/BUILDING.txt
+++ /dev/null
@@ -1,182 +0,0 @@
-Build instructions for Tez
-
-For instructions on how to contribute to Tez, refer to:
-https://cwiki.apache.org/confluence/display/TEZ
-
-----------------------------------------------------------------------------------
-Requirements:
-
-* JDK 1.8+
-* Maven 3.6.3 or later
-* spotbugs 4.9.3 or later (if running spotbugs)
-* ProtocolBuffer 3.21.1
-* Internet connection for first build (to fetch all dependencies)
-* Hadoop version should be 2.7.0 or higher.
-
-----------------------------------------------------------------------------------
-Maven main modules:
-
- tez................................(Main Tez project)
- - tez-api .....................(Tez api)
- - tez-common ..................(Tez common)
- - tez-runtime-internals .......(Tez runtime internals)
- - tez-runtime-library .........(Tez runtime library)
- - tez-mapreduce ...............(Tez mapreduce)
- - tez-dag .....................(Tez dag)
- - tez-examples ................(Tez examples)
- - tez-plugins .................(Tez plugins)
- - tez-tests ...................(Tez tests and additional test examples)
- - tez-dist ....................(Tez dist)
- - tez-ui ......................(Tez web user interface)
-
-----------------------------------------------------------------------------------
-Maven build goals:
-
- * Clean : mvn clean
- * Compile : mvn compile
- * Run tests : mvn test
- * Create JAR : mvn package
- * Run spotbugs : mvn compile spotbugs:spotbugs
- * Run checkstyle : mvn compile checkstyle:checkstyle
- * Install JAR in M2 cache : mvn install
- * Deploy JAR to Maven repo : mvn deploy
- * Run jacoco : mvn test -Pjacoco
- * Run Rat : mvn apache-rat:check
- * Build javadocs : mvn javadoc:javadoc
- * Build distribution : mvn package[-Dhadoop.version=2.7.0]
- * Visualize state machines : mvn compile -Pvisualize -DskipTests=true
-
-Build options:
-
- * Use -Dpackage.format to create distributions with a format other than .tar.gz (mvn-assembly-plugin formats).
- * Use -Dhadoop.version to specify the version of hadoop to build tez against
- * Use -Dprotoc.path to specify the path to protoc
- * Use -Dallow.root.build to root build tez-ui components
-
-Tests options:
-
- * Use -DskipTests to skip tests when running the following Maven goals:
- 'package', 'install', 'deploy' or 'verify'
- * -Dtest=,,....
- * -Dtest.exclude=
- * -Dtest.exclude.pattern=**/.java,**/.java
-
-----------------------------------------------------------------------------------
-Building against a specific version of hadoop:
-
-Tez runs on top of Apache Hadoop YARN and requires hadoop version 2.7.0 or higher.
-
-By default, it can be compiled against other compatible hadoop versions by just
-specifying the hadoop.version. For example, to build tez against hadoop 3.0.0-SNAPSHOT
-
- $ mvn package -Dhadoop.version=3.0.0-SNAPSHOT
-
-To skip Tests and java docs
-
- $ mvn package -Dhadoop.version=3.0.0-SNAPSHOT -DskipTests -Dmaven.javadoc.skip=true
-
-However, to build against hadoop versions higher than 2.7.0, you will need to do the
-following:
-
-For Hadoop version X where X >= 2.8.0
-
- $ mvn package -Dhadoop.version=${X} -Phadoop28 -P\!hadoop27
-
-For recent versions of Hadoop (which do not bundle aws and azure by default),
-you can bundle AWS-S3 (2.7.0+) or Azure (2.7.0+) support:
-
- $ mvn package -Dhadoop.version=${X} -Paws -Pazure
-
-Tez also has some shims to provide version-specific implementations for various APIs.
-For more details, please refer to https://cwiki.apache.org/confluence/display/TEZ/HadoopShims
-
-----------------------------------------------------------------------------------
-UI build issues:
-
-In case of issue with UI build, please clean the UI cache.
-
- $ mvn clean -PcleanUICache
-
-Issue with PhantomJS on building in PowerPC.
-
- Official PhantomJS binaries were not available for Power platform. Hence if the build fails in PPC
- please try installing PhantomJS manually and rerun. Refer https://github.com/ibmsoe/phantomjs-1/blob/v2.1.1-ppc64/README.md
- and install it globally for the build to work.
-
-----------------------------------------------------------------------------------
-Skip UI build:
-
-In case you want to completely skip UI build, you can use 'noui' profile.
-For instance, a full build without tests and tez-ui looks like:
-
- $ mvn clean install -DskipTests -Pnoui
-
-It's important to note that maven will still include tez-ui project, but all of the maven plugins are skipped.
-
-----------------------------------------------------------------------------------
-Protocol Buffer compiler:
-
-The version of Protocol Buffer compiler, protoc, can be defined on-the-fly as:
- $ mvn clean install -DskipTests -pl ./tez-api -Dprotobuf.version=3.7.1
-
-The default version is defined in the root pom.xml.
-
-If you have multiple versions of protoc in your system, you can set in your
-build shell the PROTOC_PATH environment variable to point to the one you
-want to use for the Tez build. If you don't define this environment variable then the
-embedded protoc compiler will be used with the version defined in ${protobuf.version}.
-It detects the platform and executes the corresponding protoc binary at build time.
-
-You can also specify the path to protoc while building using -Dprotoc.path
-
- $ mvn package -DskipTests -Dprotoc.path=/usr/local/bin/protoc
-
-----------------------------------------------------------------------------------
-Building the docs:
-
-The following commands will build a local copy of the Apache Tez website under docs
- $ cd docs; mvn site
-
-----------------------------------------------------------------------------------
-Building components separately:
-
-If you are building a submodule directory, all the Tez dependencies this
-submodule has will be resolved as all other 3rd party dependencies. This is,
-from the Maven cache or from a Maven repository (if not available in the cache
-or the SNAPSHOT 'timed out').
-An alternative is to run 'mvn install -DskipTests' from Tez source top
-level once; and then work from the submodule. Keep in mind that SNAPSHOTs
-time out after a while, using the Maven '-nsu' will stop Maven from trying
-to update SNAPSHOTs from external repos.
-
-----------------------------------------------------------------------------------
-Visualize the State Machines used in Tez internals:
-
-Use -Pvisualize to generate a graphviz file named Tez.gv which can then be
-converted into a state machine diagram that represents the state transitions of
-the state machine for the classses provided.
-
-Optional parameters:
- * -Dtez.dag.state.classes=
- - By default, all 4 state machines - DAG, Vertex, Task and TaskAttempt are generated.
- * -Dtez.graphviz.title
- - Title for the Graph ( Default is Tez )
- * -Dtez.graphviz.output.file
- - Output file to be generated with the state machines ( Default is Tez.gv )
-
-For example, to generate the state machine graphviz file for DAGImpl, run:
-
- $ mvn compile -Pvisualize -Dtez.dag.state.classes=org.apache.tez.dag.app.dag.impl.DAGImpl -DskipTests=true
-
-To generate the diagram, you can use a Graphviz application or something like:
-
- $ dot -Tpng -o Tez.png Tez.gv'
-
-----------------------------------------------------------------------------------
-Building contrib tools under tez-tools :
-
-Use -Ptools to build various contrib tools present under tez-tools. For example, run:
-
- $ mvn package -Ptools
-
-----------------------------------------------------------------------------------
diff --git a/README.md b/README.md
index 23a6d9ec32..5c35b0074d 100644
--- a/README.md
+++ b/README.md
@@ -15,18 +15,227 @@
 Apache Tez
 ==========
 
-Apache Tez is a generic data-processing pipeline engine envisioned as a low-level engine for higher abstractions
-such as Apache Hadoop Map-Reduce, Apache Pig, Apache Hive etc.
+Apache Tez is a generic data-processing pipeline engine envisioned as a
+low-level engine for higher abstractions such as Apache Hadoop Map-Reduce,
+Apache Pig, Apache Hive etc.
 
 At its heart, tez is very simple and has just two components:
 
-* The data-processing pipeline engine where-in one can plug-in input, processing and output implementations to
- perform arbitrary data-processing. Every 'task' in tez has the following:
- - Input to consume key/value pairs from.
- - Processor to process them.
- - Output to collect the processed key/value pairs.
+* The data-processing pipeline engine where-in one can plug-in input,
+ processing and output implementations to perform arbitrary data-processing.
+ Every 'task' in tez has the following:
+* Input to consume key/value pairs from.
+* Processor to process them.
+* Output to collect the processed key/value pairs.
 
-* A master for the data-processing application, where-by one can put together arbitrary data-processing 'tasks'
- described above into a task-DAG to process data as desired.
+* A master for the data-processing application, where-by one can put together
+ arbitrary data-processing 'tasks' described above into a task-DAG to process
+ data as desired.
 
 The generic master is implemented as a Apache Hadoop YARN ApplicationMaster.
+
+Building Tez
+------------
+
+For instructions on how to contribute to Tez, refer to:
+[Tez Wiki - How to Contribute](https://cwiki.apache.org/confluence/display/TEZ)
+
+Requirements
+------------
+
+* JDK 21+
+* Maven 3.6.3 or later
+* spotbugs 4.9.3 or later (if running spotbugs)
+* ProtocolBuffer 3.25.5
+* Internet connection for first build (to fetch all dependencies)
+* Hadoop 3.x
+
+Maven Modules
+-------------
+
+* **tez** (Main Tez project)
+ * **tez-api**: Tez API
+ * **tez-common**: Tez common
+ * **tez-runtime-internals**: Tez runtime internals
+ * **tez-runtime-library**: Tez runtime library
+ * **tez-mapreduce**: Tez mapreduce
+ * **tez-dag**: Tez dag
+ * **tez-examples**: Tez examples
+ * **tez-plugins**: Tez plugins
+ * **tez-tests**: Tez tests and additional test examples
+ * **tez-dist**: Tez dist
+ * **tez-ui**: Tez web user interface
+
+Maven Build Goals
+-----------------
+
+* Clean: `mvn clean`
+* Compile: `mvn compile`
+* Run tests: `mvn test`
+* Create JAR: `mvn package`
+* Run spotbugs: `mvn compile spotbugs:spotbugs`
+* Run checkstyle: `mvn compile checkstyle:checkstyle`
+* Install JAR in M2 cache: `mvn install`
+* Deploy JAR to Maven repo: `mvn deploy`
+* Run jacoco: `mvn test -Pjacoco`
+* Run Rat: `mvn apache-rat:check`
+* Build javadocs: `mvn javadoc:javadoc`
+* Build distribution: `mvn package -Dhadoop.version=3.4.2`
+* Visualize state machines: `mvn compile -Pvisualize -DskipTests=true`
+
+Build Options
+-------------
+
+* Use `-Dpackage.format` to create distributions with a format other than
+ .tar.gz (mvn-assembly-plugin formats).
+* Use `-Dhadoop.version` to specify the version of Hadoop to build Tez against.
+* Use `-Dprotoc.path` to specify the path to `protoc`.
+* Use `-Dallow.root.build` to root build `tez-ui` components.
+
+Test Options
+------------
+
+* Use `-DskipTests` to skip tests when running Maven goals like `package`,
+ `install`, `deploy`, or `verify`.
+* Specific tests: `-Dtest=,,....`
+* Exclude tests: `-Dtest.exclude=`
+* Exclude pattern:
+ `-Dtest.exclude.pattern=**/.java,**/.java`
+
+Building against a Specific Version of Hadoop
+----------------------------------------------
+
+Tez runs on top of Apache Hadoop YARN and requires Hadoop 3.x.
+
+By default, it can be compiled against other compatible Hadoop versions by
+specifying `hadoop.version`:
+
+```bash
+mvn package -Dhadoop.version=3.4.2
+```
+
+To skip tests and Javadocs:
+
+```bash
+mvn package -Dhadoop.version=3.4.2 -DskipTests -Dmaven.javadoc.skip=true
+```
+
+For recent versions of Hadoop (which do not bundle AWS and Azure by default),
+you can bundle AWS-S3 or Azure support:
+
+```bash
+mvn package -Dhadoop.version=3.4.2 -Paws -Pazure
+```
+
+Tez also has shims to provide version-specific implementations for various APIs.
+For more details, refer to
+[Hadoop Shims](https://cwiki.apache.org/confluence/display/TEZ/HadoopShims).
+
+UI Build Issues
+---------------
+
+In case of issues with the UI build, please clean the UI cache:
+
+```bash
+mvn clean -PcleanUICache
+```
+
+Issue with PhantomJS on building in PowerPC
+-------------------------------------------
+
+Official PhantomJS binaries were not available for the Power platform. If the
+build fails on PPC, try installing PhantomJS manually and rerun. Refer to
+[PhantomJS README](https://github.com/ibmsoe/phantomjs-1/blob/v2.1.1-ppc64/README.md)
+and install it globally.
+
+Skip UI Build
+-------------
+
+To skip the UI build, use the `noui` profile:
+
+```bash
+mvn clean install -DskipTests -Pnoui
+```
+
+Maven will still include the `tez-ui` project, but all related plugins will be
+skipped.
+
+Protocol Buffer Compiler
+------------------------
+
+The version of the Protocol Buffer compiler (`protoc`) can be defined
+on-the-fly:
+
+```bash
+mvn clean install -DskipTests -pl ./tez-api -Dprotobuf.version=3.25.5
+```
+
+The default version is defined in the root `pom.xml`.
+
+If you have multiple versions of `protoc`, set the `PROTOC_PATH` environment
+variable to point to the desired binary. If not defined, the embedded `protoc`
+compiler corresponding to `${protobuf.version}` will be used.
+
+Alternatively, specify the path during the build:
+
+```bash
+mvn package -DskipTests -Dprotoc.path=/usr/local/bin/protoc
+```
+
+Building the Docs
+-----------------
+
+Build a local copy of the Apache Tez website:
+
+```bash
+cd docs
+mvn site
+```
+
+Building Components Separately
+------------------------------
+
+If you are building a submodule directory, dependencies will be resolved from
+the Maven cache or remote repositories. Alternatively, run
+`mvn install -DskipTests` from the Tez top level once and then work from the
+submodule.
+
+Visualize State Machines
+------------------------
+
+Use `-Pvisualize` to generate a Graphviz file (`Tez.gv`) representing state
+transitions:
+
+```bash
+mvn compile -Pvisualize -DskipTests=true
+```
+
+Optional parameters:
+
+* `-Dtez.dag.state.classes=`
+ (Default: DAG, Vertex, Task, TaskAttempt)
+* `-Dtez.graphviz.title` (Default: Tez)
+* `-Dtez.graphviz.output.file` (Default: Tez.gv)
+
+Example for `DAGImpl`:
+
+```bash
+mvn compile -Pvisualize \
+ -Dtez.dag.state.classes=org.apache.tez.dag.app.dag.impl.DAGImpl \
+ -DskipTests=true
+```
+
+Convert the `.gv` file to an image:
+
+```bash
+dot -Tpng -o Tez.png Tez.gv
+```
+
+Building Contrib Tools
+----------------------
+
+Use `-Ptools` to build tools under `tez-tools`:
+
+```bash
+mvn package -Ptools
+```

From 09eefc9396860388b6bbc94a0021b1ccc5e2597b Mon Sep 17 00:00:00 2001
From: Raghav Aggarwal
Date: Tue, 14 Apr 2026 14:00:01 +0530
Subject: [PATCH 2/2] Review comments

---
 README.md | 95 +++++++++++++++++++------------------------------------
 1 file changed, 32 insertions(+), 63 deletions(-)

diff --git a/README.md b/README.md
index 5c35b0074d..2bdee64d78 100644
--- a/README.md
+++ b/README.md
@@ -1,23 +1,26 @@
 Apache Tez
 ==========
 
 Apache Tez is a generic data-processing pipeline engine envisioned as a
-low-level engine for higher abstractions such as Apache Hadoop Map-Reduce,
-Apache Pig, Apache Hive etc.
+low-level engine for higher abstractions such as Apache Hive, Apache Pig etc.
 
 At its heart, tez is very simple and has just two components:
 
@@ -44,28 +47,11 @@ Requirements
 ------------
 
 * JDK 21+
-* Maven 3.6.3 or later
+* Maven 3.9.14 or later
 * spotbugs 4.9.3 or later (if running spotbugs)
 * ProtocolBuffer 3.25.5
-* Internet connection for first build (to fetch all dependencies)
 * Hadoop 3.x
 
-Maven Modules
--------------
-
-* **tez** (Main Tez project)
- * **tez-api**: Tez API
- * **tez-common**: Tez common
- * **tez-runtime-internals**: Tez runtime internals
- * **tez-runtime-library**: Tez runtime library
- * **tez-mapreduce**: Tez mapreduce
- * **tez-dag**: Tez dag
- * **tez-examples**: Tez examples
- * **tez-plugins**: Tez plugins
- * **tez-tests**: Tez tests and additional test examples
- * **tez-dist**: Tez dist
- * **tez-ui**: Tez web user interface
-
 Maven Build Goals
 -----------------
 
@@ -92,16 +78,6 @@ Build Options
 * Use `-Dprotoc.path` to specify the path to `protoc`.
 * Use `-Dallow.root.build` to root build `tez-ui` components.
 
-Test Options
-------------
-
-* Use `-DskipTests` to skip tests when running Maven goals like `package`,
- `install`, `deploy`, or `verify`.
-* Specific tests: `-Dtest=,,....`
-* Exclude tests: `-Dtest.exclude=`
-* Exclude pattern:
- `-Dtest.exclude.pattern=**/.java,**/.java`
-
 Building against a Specific Version of Hadoop
 ----------------------------------------------
 
@@ -114,12 +90,6 @@ specifying `hadoop.version`:
 mvn package -Dhadoop.version=3.4.2
 ```
 
-To skip tests and Javadocs:
-
-```bash
-mvn package -Dhadoop.version=3.4.2 -DskipTests -Dmaven.javadoc.skip=true
-```
-
 For recent versions of Hadoop (which do not bundle AWS and Azure by default),
 you can bundle AWS-S3 or Azure support:
 
@@ -131,34 +101,34 @@ Tez also has shims to provide version-specific implementations for various APIs.
 For more details, refer to
 [Hadoop Shims](https://cwiki.apache.org/confluence/display/TEZ/HadoopShims).
 
-UI Build Issues
----------------
+Tez UI
+------
 
-In case of issues with the UI build, please clean the UI cache:
+* **UI Build Issues**
 
-```bash
-mvn clean -PcleanUICache
-```
+ In case of issues with the UI build, please clean the UI cache:
 
-Issue with PhantomJS on building in PowerPC
--------------------------------------------
+ ```bash
+ mvn clean -PcleanUICache
+ ```
 
-Official PhantomJS binaries were not available for the Power platform. If the
-build fails on PPC, try installing PhantomJS manually and rerun. Refer to
-[PhantomJS README](https://github.com/ibmsoe/phantomjs-1/blob/v2.1.1-ppc64/README.md)
-and install it globally.
+* **Skip UI Build**
 
-Skip UI Build
--------------
+ To skip the UI build, use the `noui` profile:
 
-To skip the UI build, use the `noui` profile:
+ ```bash
+ mvn clean install -DskipTests -Pnoui
+ ```
 
-```bash
-mvn clean install -DskipTests -Pnoui
-```
+ Maven will still include the `tez-ui` project, but all related plugins will be
+ skipped.
+
+* **Issue with PhantomJS on building in PowerPC**
 
-Maven will still include the `tez-ui` project, but all related plugins will be
-skipped.
+ Official PhantomJS binaries were not available for the Power platform. If the
+ build fails on PPC, try installing PhantomJS manually and rerun. Refer to
+ [PhantomJS README](https://github.com/ibmsoe/phantomjs-1/blob/v2.1.1-ppc64/README.md)
+ and install it globally.
 
 Protocol Buffer Compiler
 ------------------------
@@ -188,8 +158,7 @@ Building the Docs
 Build a local copy of the Apache Tez website:
 
 ```bash
-cd docs
-mvn site
+mvn site -pl docs
 ```
 
 Building Components Separately