A dynamic query routing framework that parses structured DSL queries, decomposes them into dependency-aware DAGs, extracts a 22-dimensional feature vector, and routes execution to either SQL (Spark SQL / pandas) or Graph (GraphFrames) backends.
Repository: https://github.com/DataScience-ArtificialIntelligence/Hybrid-SQL-Graph-Query-Routing-System
2-Minute Demo Video: https://drive.google.com/file/d/1PmTvewqPUMJ_vOrPGdnUg_BhPUDZGSv3/view?usp=sharing
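To make the routing idea concrete, here is a minimal sketch (names and signatures are illustrative only; the real logic lives in `router/hybrid_router.py` and `router/baselines.py`): a traversal rule sends any plan node whose features contain traversal operations to the graph engine, and a hybrid router overrides that rule with an ML probability when a trained model is available.

```python
# Illustrative sketch of rule + ML hybrid routing (hypothetical names,
# NOT the project's actual API).

def traversal_rule(features: dict) -> str:
    """Baseline: route to the graph engine whenever the plan node
    contains traversal operations, otherwise use the SQL engine."""
    return "GRAPH" if features.get("op_count_traversal", 0) > 0 else "SQL"

def hybrid_route(features: dict, model=None, threshold: float = 0.5) -> str:
    """Prefer the model's predicted probability when a model is loaded;
    fall back to the traversal rule otherwise."""
    if model is not None:
        # Assumes a scikit-learn-style classifier with predict_proba.
        p_graph = model.predict_proba([list(features.values())])[0][1]
        return "GRAPH" if p_graph >= threshold else "SQL"
    return traversal_rule(features)

print(hybrid_route({"op_count_traversal": 2, "table_count": 1}))  # GRAPH
print(hybrid_route({"op_count_traversal": 0, "table_count": 3}))  # SQL
```

In the actual system the feature dict is the 22-dimensional vector produced by `features/feature_extractor.py`, and the decision is made per node of the dependency DAG rather than per query.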
- Strict curated source is available and quality-gated:
  - `training_data/real_labeled_runs_strict_curated.csv`
  - `training_data/fixed_train_base_strict.csv`
  - `training_data/fixed_eval_set_strict.csv`
  - `training_data/dataset_quality_report_strict_runtime.json`
- Publishable strict runtime artifacts are generated under `experiments/results/`:
  - `relevance_eval_strict_runtime.json`
  - `dataset_shift_eval_strict_runtime.json`
  - `strict_robustness_eval_runtime.json`
  - `ablation_strict_runtime.json`
- Dashboard supports:
  - `strict` profile for publication-grade artifacts
  - `fast` profile for quick iteration artifacts (`*_fast_runtime.*`)
| Name | Department | Roll No. | Institute | Email |
|---|---|---|---|---|
| Piyush Prashant | Data Science & AI | 24BDS055 | IIIT Dharwad | 24bds055@iiitdwd.ac.in |
| Priyanshu Mittal | Data Science & AI | 24BDS058 | IIIT Dharwad | 24bds058@iiitdwd.ac.in |
| Harshitha M S | Data Science & AI | 24BDS038 | IIIT Dharwad | 24bds038@iiitdwd.ac.in |
| J. Sameer Karthikeya | Data Science & AI | 24BDS026 | IIIT Dharwad | 24bds026@iiitdwd.ac.in |
```
HIFUN-Router/
│
├── config/                          # Configuration files
│   ├── feature_schema.json          # 22-dimensional feature schema definition
│   ├── paths.py                     # Filesystem path constants
│   └── spark_config.py              # PySpark session & cluster configuration
│
├── data/
│   ├── graphs/                      # Pre-built graph parquet files (SNB, synthetic)
│   │   ├── snb/                     # LDBC Social Network Benchmark graph
│   │   │   ├── vertices.parquet
│   │   │   └── edges.parquet
│   │   └── synthetic/               # Synthetically generated graph data
│   ├── scripts/                     # Dataset ingestion & conversion scripts
│   │   ├── ldbc_snb_to_parquet.py
│   │   ├── ogb_to_parquet.py
│   │   ├── job_to_parquet.py
│   │   ├── tpcds_to_parquet.py
│   │   ├── generate_synthetic.py
│   │   └── compute_stats.py
│   └── stats/                       # JSON column statistics per table
│
├── dsl/
│   └── sample_queries/              # Example JSON DSL query files (TPC-H, SNB, etc.)
│
├── experiments/                     # Evaluation & benchmarking scripts
│   ├── relevance_evaluation.py          # Model vs baseline accuracy/F1
│   ├── dataset_shift_evaluation.py      # Cross-family domain shift analysis
│   ├── strict_robustness_evaluation.py  # Bootstrap CI + permutation tests
│   └── results/                     # JSON/MD output artifacts from evaluations
│
├── features/
│   └── feature_extractor.py         # Builds the R^22 feature vector per routing node
│
├── model/
│   ├── trainer.py                   # XGBoost/LogReg model training & cross-validation
│   ├── predictor.py                 # Inference wrapper for routing-time prediction
│   ├── feature_analysis.py          # VIF, correlation, collinearity analysis
│   ├── feature_importance.py        # SHAP-based feature importance plots
│   └── artifacts/                   # Saved model files & analysis outputs
│       ├── classifier_v1.pkl
│       ├── classifier_v1_dt.pkl
│       ├── feature_schema_v1.json
│       ├── training_results.json
│       └── analysis/                # SHAP bar/summary plots + values
│
├── notebooks/                       # Jupyter exploration & reporting notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_analysis.ipynb
│   ├── 03_model_training.ipynb
│   ├── 04_results_visualization.ipynb
│   └── project_report.ipynb         # Full project report notebook
│
├── parser/
│   ├── dsl_parser.py                # JSON DSL → AST with jsonschema validation
│   └── ast_nodes.py                 # Typed operation node representations
│
├── report/
│   ├── hifun_router_ieee_report.tex # LaTeX source for the IEEE-format paper
│   └── decision_boundary.pdf        # Decision boundary visualization
│
├── router/
│   ├── hybrid_router.py             # Core routing engine (rule + ML hybrid)
│   └── baselines.py                 # AlwaysSQL, AlwaysGRAPH, TraversalRule baselines
│
├── execution/
│   ├── sql_generator.py             # Translates plan nodes → Spark SQL / pandas
│   ├── graph_generator.py           # Translates plan nodes → GraphFrames operations
│   └── result_composer.py           # Merges cross-engine outputs into final result
│
├── training_data/                   # Curated train/eval CSV splits & generation scripts
│   ├── fixed_train_balanced_strict.csv
│   ├── fixed_eval_set_strict.csv
│   ├── fixed_split_manifest_strict.json
│   ├── real_labeled_runs_strict_curated.csv
│   ├── query_generator.py
│   ├── real_collection_script.py
│   ├── real_query_generator.py
│   ├── fix_dataset_splits.py
│   └── dataset_quality_gate.py
│
├── tests/                           # Unit + integration test suite (pytest)
│   ├── test_parser.py
│   ├── test_decomposer.py
│   ├── test_features.py
│   ├── test_model.py
│   ├── test_execution.py
│   ├── test_correctness.py
│   ├── test_experiments.py
│   └── reference_executor.py
│
├── decomposer/
│   └── query_decomposer.py          # Builds dependency DAG + schedulable blocks
│
├── streamlit_app.py                 # Interactive evaluation dashboard (demo UI)
├── test_setup.py                    # Environment sanity check
├── run_project_strict.sh            # End-to-end strict pipeline runner
├── Dockerfile                       # Container image definition
├── Makefile                         # Convenience command shortcuts
├── requirements.txt                 # Python dependency list
├── SETUP_AND_RUN.md                 # Detailed Linux setup guide
└── RUN_STREAMLIT_DEMO.md            # Demo recording script and guide
```
Install system packages (Ubuntu/Debian):

```bash
sudo apt update
sudo apt install -y python3 python3-pip python3-venv openjdk-17-jdk \
  docker.io docker-compose-plugin make
```

Enable Docker for your user:

```bash
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"
newgrp docker
```

Clone the repository:

```bash
cd ~
git clone https://github.com/PriyanshuMittal0310/HIFUN-Router.git HIFUN-Router-clone
cd HIFUN-Router-clone
```

Create a virtual environment and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
```

Set the required environment variables:

```bash
export PYTHONPATH="$PWD"
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
mkdir -p /tmp/spark-events
export HIFUN_HISTORY_SERVER=/tmp/spark-events
```

Verify the environment:

```bash
python test_setup.py
```

A successful run prints confirmation that Spark, GraphFrames, and the ML stack are importable.
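The pattern behind such a sanity check is simple; here is a generic sketch (this is not the repo's `test_setup.py`, just an illustration of the import-probe idea):

```python
# Generic environment sanity-check sketch: report which modules import.
import importlib

def check_imports(modules):
    """Return {module_name: True/False} for each module, True if importable."""
    status = {}
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

# For this project you would probe e.g.:
#   check_imports(["pyspark", "graphframes", "xgboost", "sklearn", "shap"])
print(check_imports(["json", "no_such_module"]))
# {'json': True, 'no_such_module': False}
```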
Use this for reproducible strict artifacts without manually chaining commands:

```bash
source .venv/bin/activate
export PYTHONPATH="$PWD"
./run_project_strict.sh smoke
```

Use this for the full strict bundle:

```bash
./run_project_strict.sh all
```

If you want to run step-by-step manually:

```bash
source .venv/bin/activate
export PYTHONPATH="$PWD"
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
python -m model.trainer
python -m experiments.relevance_evaluation
python -m experiments.dataset_shift_evaluation \
  --source training_data/real_labeled_runs_strict_curated.csv
pytest -q
```

Build the datasets:
```bash
# LDBC SNB via Docker datagen
mkdir -p data/raw/ldbc_snb
docker run --rm \
  --mount type=bind,source="$(pwd)/data/raw/ldbc_snb",target=/out \
  ldbc/datagen-standalone:latest \
  --parallelism 1 -- --format csv --scale-factor 1 --mode raw --output-dir /out
python -m data.scripts.ldbc_snb_to_parquet \
  --input data/raw/ldbc_snb \
  --parquet-dir data/parquet/snb \
  --graph-dir data/graphs/snb

# OGB graph dataset
python -m data.scripts.ogb_to_parquet \
  --dataset ogbn-arxiv --root data/raw/ogb --graph-dir data/graphs

# JOB/IMDB (if available)
python -m data.scripts.job_to_parquet \
  --input data/raw/job --output data/parquet/job

# TPC-DS (if available)
python -m data.scripts.tpcds_to_parquet \
  --input data/raw/tpcds --output data/parquet/tpcds
```

Compute column statistics:

```bash
python -m data.scripts.compute_stats
```

Generate queries, collect labeled runs, and fix the splits:

```bash
python -m training_data.real_query_generator --scale aggressive --focus-mode all
python -m training_data.real_collection_script \
  --queries_dir dsl/sample_queries \
  --output training_data/real_labeled_runs.csv \
  --n_warmup 2 --n_measure 3 --repeat 3
python -m training_data.fix_dataset_splits \
  --source training_data/real_labeled_runs.csv
```

Train the routing model:

```bash
python -m model.trainer
```
Run the evaluations:

```bash
# Relevance evaluation (accuracy, F1, PR-AUC)
python -m experiments.relevance_evaluation

# Domain shift evaluation (strict curated source)
python -m experiments.dataset_shift_evaluation \
  --source training_data/real_labeled_runs_strict_curated.csv

# Strict robustness (bootstrap CI + permutation tests)
python -m experiments.strict_robustness_evaluation \
  --train training_data/fixed_train_base_strict.csv \
  --eval training_data/fixed_eval_set_strict.csv \
  --transfer_source training_data/real_labeled_runs_strict_curated.csv \
  --n_bootstrap 1000 --n_perm_labels 100 --n_perm_features 20 \
  --out_json experiments/results/strict_robustness_eval_runtime.json \
  --out_md experiments/results/strict_robustness_eval_runtime.md
```

For broad all-data transfer analysis, you can still run:

```bash
python -m experiments.dataset_shift_evaluation \
  --source training_data/real_labeled_runs.csv
```

Run the test suite:

```bash
# All tests
pytest -q

# Specific suites
pytest tests/test_execution.py -v
pytest tests/test_correctness.py -v
pytest tests/test_experiments.py -v
```

Makefile shortcuts:

```bash
make help      # Show all available commands
make setup     # Set up environment
make test-env  # Verify environment
make data-all  # Run full dataset pipeline
make train     # Train routing model
make evaluate  # Run all evaluations
make test-all  # Run full test suite
make clean     # Remove generated outputs
```
```bash
# Start all services
docker compose up -d
docker compose ps

# Monitor Spark master
docker compose logs -f spark-master

# Stop services
docker compose down
```

Service UIs once running:
| Service | URL |
|---|---|
| Spark Master | http://localhost:8080 |
| Spark Worker 1 | http://localhost:8081 |
| Spark Worker 2 | http://localhost:8082 |
| Spark History Server | http://localhost:18080 |
| HDFS NameNode | http://localhost:9870 |
| YARN Resource Manager | http://localhost:8088 |
The project includes an interactive Streamlit dashboard for exploring routing decisions and evaluation metrics.
```bash
source .venv/bin/activate
export PYTHONPATH="$PWD"
streamlit run streamlit_app.py --server.port 8501 --server.headless true
```

Open your browser at: http://localhost:8501
The dashboard features four tabs: Dataset and Quality, Relevance Evaluation, Robustness Evaluation, and Cross-Dataset Generalization. Use the sidebar to switch between fast (rapid iteration) and strict (publication-grade) run profiles.
Demo Link: https://drive.google.com/file/d/1PmTvewqPUMJ_vOrPGdnUg_BhPUDZGSv3/view?usp=sharing

Also available:
- 4-minute detailed recording script: `RUN_STREAMLIT_DEMO.md`
| Policy | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| AlwaysSQL | 0.5345 | 0.0000 | 0.0000 | 0.0000 |
| AlwaysGRAPH | 0.4655 | 0.6353 | 0.4655 | 1.0000 |
| TraversalRule | 0.9655 | 0.9643 | 0.9310 | 1.0000 |
| LogReg (Balanced) | 0.9655 | 0.9643 | 0.9310 | 1.0000 |
| XGBoost (Balanced) | 0.9655 | 0.9643 | 0.9310 | 1.0000 |
Evaluated on a strict template-disjoint split of N=58 evaluation cases. Bootstrap 95% CI (F1): [0.913, 1.000]. Top permutation-importance feature: op_count_traversal.
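The reported 95% CI is a bootstrap interval; the idea can be sketched with a generic percentile bootstrap over resampled (label, prediction) pairs. This is toy data and a from-scratch F1, not the project's `strict_robustness_evaluation` code:

```python
# Percentile-bootstrap 95% CI for F1 (generic sketch, toy data).
import random

def f1(y_true, y_pred):
    """Binary F1 from scratch, with 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Resample pairs with replacement; return the alpha/2 and
    1 - alpha/2 quantiles of the bootstrapped F1 scores."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi

y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
print(bootstrap_ci(y_true, y_pred))  # (lower, upper) bounds in [0, 1]
```

The permutation tests work analogously: shuffle labels (or one feature column) many times and compare the resulting score distribution against the observed score.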
See also:
- `experiments/results/relevance_eval_strict_runtime.json`
- `experiments/results/strict_robustness_eval_runtime.json`
- `experiments/results/dataset_shift_eval_strict_runtime.json`
| Layer | Technology |
|---|---|
| Query Parsing | Python, jsonschema |
| Distributed Compute | Apache Spark 3.4.2, PySpark |
| Graph Engine | GraphFrames 0.8.3 |
| Storage | Apache Parquet, HDFS |
| Cluster Management | Apache YARN |
| ML Models | XGBoost 2.0.3, LightGBM 4.2.0, scikit-learn 1.4.0 |
| Explainability | SHAP 0.44.0 |
| Dashboard | Streamlit 1.44.1 |
| Notebooks | JupyterLab 4.1.2 |
| Containerization | Docker Compose |
| Testing | pytest 7.4.4 |
**`JAVA_HOME` is not set**

```bash
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
```

**`ModuleNotFoundError: No module named 'config'`**

```bash
cd ~/HIFUN-Router-clone
export PYTHONPATH="$PWD"
```

**Spark fails downloading packages on first run**

```bash
python test_setup.py   # retry; Spark caches JARs on the second run
```

- L. Libkin et al., "HIFUN: A Framework for Hybrid Query Processing," Journal of Intelligent Information Systems, 2022.
- M. Armbrust et al., "Spark SQL: Relational Data Processing in Spark," SIGMOD, 2015.
- A. Dave et al., "GraphFrames: An Integrated API for Mixing Graph and Relational Queries," GRADES, 2016.
- R. Marcus et al., "Neo: A Learned Query Optimizer," VLDB, 2019.
- W. Hu et al., "Open Graph Benchmark: Datasets for Machine Learning on Graphs," NeurIPS, 2020.
- LDBC Council, "LDBC Social Network Benchmark," 2015.
IIIT Dharwad · Department of Data Science and AI · BDA Project, 2025-2026