HIFUN Router - Hybrid SQL/Graph Query Routing System

A dynamic query routing framework that parses structured DSL queries, decomposes them into dependency-aware DAGs, extracts a 22-dimensional feature vector, and routes execution to either SQL (Spark SQL / pandas) or Graph (GraphFrames) backends.
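To make the flow concrete, here is a minimal sketch of a DSL-style query and a rule-of-thumb routing decision. The operation names and fields below are illustrative assumptions, not the project's actual schema (that lives in config/feature_schema.json and parser/dsl_parser.py); the function mimics the spirit of the TraversalRule baseline rather than the trained model.

```python
# Illustrative sketch only: field and op names are assumptions, not the
# repository's real DSL schema (see config/feature_schema.json).
query = {
    "ops": [
        {"id": "n1", "op": "scan", "table": "person", "deps": []},
        {"id": "n2", "op": "traverse", "edge": "knows", "hops": 2, "deps": ["n1"]},
        {"id": "n3", "op": "aggregate", "fn": "count", "deps": ["n2"]},
    ]
}

def route_node_rule_based(ops):
    """Toy stand-in for a TraversalRule-style baseline: send plans that
    contain any traversal operation to the graph backend."""
    has_traversal = any(o["op"] == "traverse" for o in ops)
    return "GRAPH" if has_traversal else "SQL"

print(route_node_rule_based(query["ops"]))  # -> GRAPH
```

The real router replaces this single rule with a classifier over the 22-dimensional feature vector extracted per routing node.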

Current Status (April 2026)

  • Strict curated source is available and quality-gated:
    • training_data/real_labeled_runs_strict_curated.csv
    • training_data/fixed_train_base_strict.csv
    • training_data/fixed_eval_set_strict.csv
    • training_data/dataset_quality_report_strict_runtime.json
  • Publishable strict runtime artifacts are generated under experiments/results/:
    • relevance_eval_strict_runtime.json
    • dataset_shift_eval_strict_runtime.json
    • strict_robustness_eval_runtime.json
    • ablation_strict_runtime.json
  • Dashboard supports:
    • strict profile for publication-grade artifacts
    • fast profile for quick iteration artifacts (*_fast_runtime.*)

👥 Team Members

| Name | Department | Roll No. | Institute | Email |
| --- | --- | --- | --- | --- |
| Piyush Prashant | Data Science & AI | 24BDS055 | IIIT Dharwad | 24bds055@iiitdwd.ac.in |
| Priyanshu Mittal | Data Science & AI | 24BDS058 | IIIT Dharwad | 24bds058@iiitdwd.ac.in |
| Harshitha M S | Data Science & AI | 24BDS038 | IIIT Dharwad | 24bds038@iiitdwd.ac.in |
| J. Sameer Karthikeya | Data Science & AI | 24BDS026 | IIIT Dharwad | 24bds026@iiitdwd.ac.in |

πŸ“ Repository Structure

HIFUN-Router/
│
├── config/                          # Configuration files
│   ├── feature_schema.json          # 22-dimensional feature schema definition
│   ├── paths.py                     # Filesystem path constants
│   └── spark_config.py              # PySpark session & cluster configuration
│
├── data/
│   ├── graphs/                      # Pre-built graph parquet files (SNB, synthetic)
│   │   ├── snb/                     # LDBC Social Network Benchmark graph
│   │   │   ├── vertices.parquet
│   │   │   └── edges.parquet
│   │   └── synthetic/               # Synthetically generated graph data
│   ├── scripts/                     # Dataset ingestion & conversion scripts
│   │   ├── ldbc_snb_to_parquet.py
│   │   ├── ogb_to_parquet.py
│   │   ├── job_to_parquet.py
│   │   ├── tpcds_to_parquet.py
│   │   ├── generate_synthetic.py
│   │   └── compute_stats.py
│   └── stats/                       # JSON column statistics per table
│
├── dsl/
│   └── sample_queries/              # Example JSON DSL query files (TPC-H, SNB, etc.)
│
├── experiments/                     # Evaluation & benchmarking scripts
│   ├── relevance_evaluation.py      # Model vs baseline accuracy/F1
│   ├── dataset_shift_evaluation.py  # Cross-family domain shift analysis
│   ├── strict_robustness_evaluation.py  # Bootstrap CI + permutation tests
│   └── results/                     # JSON/MD output artifacts from evaluations
│
├── features/
│   └── feature_extractor.py         # Builds the R^22 feature vector per routing node
│
├── model/
│   ├── trainer.py                   # XGBoost/LogReg model training & cross-validation
│   ├── predictor.py                 # Inference wrapper for routing-time prediction
│   ├── feature_analysis.py          # VIF, correlation, collinearity analysis
│   ├── feature_importance.py        # SHAP-based feature importance plots
│   └── artifacts/                   # Saved model files & analysis outputs
│       ├── classifier_v1.pkl
│       ├── classifier_v1_dt.pkl
│       ├── feature_schema_v1.json
│       ├── training_results.json
│       └── analysis/                # SHAP bar/summary plots + values
│
├── notebooks/                       # Jupyter exploration & reporting notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_analysis.ipynb
│   ├── 03_model_training.ipynb
│   ├── 04_results_visualization.ipynb
│   └── project_report.ipynb         # Full project report notebook
│
├── parser/
│   ├── dsl_parser.py                # JSON DSL → AST with jsonschema validation
│   └── ast_nodes.py                 # Typed operation node representations
│
├── report/
│   ├── hifun_router_ieee_report.tex # LaTeX source for the IEEE-format paper
│   └── decision_boundary.pdf        # Decision boundary visualization
│
├── router/
│   ├── hybrid_router.py             # Core routing engine (rule + ML hybrid)
│   └── baselines.py                 # AlwaysSQL, AlwaysGRAPH, TraversalRule baselines
│
├── execution/
│   ├── sql_generator.py             # Translates plan nodes → Spark SQL / pandas
│   ├── graph_generator.py           # Translates plan nodes → GraphFrames operations
│   └── result_composer.py           # Merges cross-engine outputs into final result
│
├── training_data/                   # Curated train/eval CSV splits & generation scripts
│   ├── fixed_train_balanced_strict.csv
│   ├── fixed_eval_set_strict.csv
│   ├── fixed_split_manifest_strict.json
│   ├── real_labeled_runs_strict_curated.csv
│   ├── query_generator.py
│   ├── real_collection_script.py
│   ├── real_query_generator.py
│   ├── fix_dataset_splits.py
│   └── dataset_quality_gate.py
│
├── tests/                           # Unit + integration test suite (pytest)
│   ├── test_parser.py
│   ├── test_decomposer.py
│   ├── test_features.py
│   ├── test_model.py
│   ├── test_execution.py
│   ├── test_correctness.py
│   ├── test_experiments.py
│   └── reference_executor.py
│
├── decomposer/
│   └── query_decomposer.py          # Builds dependency DAG + schedulable blocks
│
├── streamlit_app.py                 # Interactive evaluation dashboard (demo UI)
├── test_setup.py                    # Environment sanity check
├── run_project_strict.sh            # End-to-end strict pipeline runner
├── Dockerfile                       # Container image definition
├── Makefile                         # Convenience command shortcuts
├── requirements.txt                 # Python dependency list
├── SETUP_AND_RUN.md                 # Detailed Linux setup guide
└── RUN_STREAMLIT_DEMO.md            # Demo recording script and guide

βš™οΈ Installation & Setup

Prerequisites

Install system packages (Ubuntu/Debian):

sudo apt update
sudo apt install -y python3 python3-pip python3-venv openjdk-17-jdk \
                   docker.io docker-compose-plugin make

Enable Docker for your user:

sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"
newgrp docker

1. Clone the Repository

cd ~
git clone https://github.com/PriyanshuMittal0310/HIFUN-Router.git HIFUN-Router-clone
cd HIFUN-Router-clone

2. Create Python Virtual Environment

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt

3. Set Environment Variables

export PYTHONPATH="$PWD"
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
mkdir -p /tmp/spark-events
export HIFUN_HISTORY_SERVER=/tmp/spark-events

4. Verify Environment

python test_setup.py

A successful run prints confirmation that Spark, GraphFrames, and the ML stack are importable.
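A generic version of such a check can probe for the required packages without actually importing the heavy ones. This sketch is not the repository's test_setup.py (whose actual checks live in the repo); the module names are the usual PyPI import names.

```python
# Generic environment sanity-check sketch (assumption: not the real
# test_setup.py). find_spec reports availability without importing.
import importlib.util

def check_modules(names):
    """Return a {module: available} map for the given import names."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

status = check_modules(["pyspark", "graphframes", "xgboost", "sklearn"])
for name, ok in status.items():
    print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```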


🚀 Running the Project

Quick Start (Recommended Strict Runner)

Use this for reproducible strict artifacts without manually chaining commands:

source .venv/bin/activate
export PYTHONPATH="$PWD"
./run_project_strict.sh smoke

Use this for the full strict bundle:

./run_project_strict.sh all

Quick Start (Manual)

If you want to run step-by-step manually:

source .venv/bin/activate
export PYTHONPATH="$PWD"
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"

python -m model.trainer
python -m experiments.relevance_evaluation
python -m experiments.dataset_shift_evaluation \
    --source training_data/real_labeled_runs_strict_curated.csv
pytest -q

Full Data Pipeline (Rebuild from Raw Data)

Step 1: Generate Datasets

# LDBC SNB via Docker datagen
mkdir -p data/raw/ldbc_snb
docker run --rm \
  --mount type=bind,source="$(pwd)/data/raw/ldbc_snb",target=/out \
  ldbc/datagen-standalone:latest \
  --parallelism 1 -- --format csv --scale-factor 1 --mode raw --output-dir /out

python -m data.scripts.ldbc_snb_to_parquet \
    --input data/raw/ldbc_snb \
    --parquet-dir data/parquet/snb \
    --graph-dir data/graphs/snb

# OGB graph dataset
python -m data.scripts.ogb_to_parquet \
    --dataset ogbn-arxiv --root data/raw/ogb --graph-dir data/graphs

# JOB/IMDB (if available)
python -m data.scripts.job_to_parquet \
    --input data/raw/job --output data/parquet/job

# TPC-DS (if available)
python -m data.scripts.tpcds_to_parquet \
    --input data/raw/tpcds --output data/parquet/tpcds

Step 2: Compute Column Statistics

python -m data.scripts.compute_stats

Step 3: Generate Query Workloads

python -m training_data.real_query_generator --scale aggressive --focus-mode all

Step 4: Collect Runtime Labels

python -m training_data.real_collection_script \
  --queries_dir dsl/sample_queries \
  --output training_data/real_labeled_runs.csv \
  --n_warmup 2 --n_measure 3 --repeat 3
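The labeling idea behind this step can be sketched as follows. This is an assumed simplification, not the code of real_collection_script.py: time each query on both backends with warmup runs discarded, keep the median of the measured runs, and label the query with the faster engine.

```python
# Assumed sketch of runtime labeling; the real logic is in
# training_data/real_collection_script.py.
import statistics
import time

def median_runtime(run_fn, n_warmup=2, n_measure=3):
    for _ in range(n_warmup):          # warm caches; timings discarded
        run_fn()
    times = []
    for _ in range(n_measure):
        t0 = time.perf_counter()
        run_fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

def label_query(run_sql, run_graph):
    """Label a query with whichever backend ran faster (median time)."""
    t_sql = median_runtime(run_sql)
    t_graph = median_runtime(run_graph)
    return {"label": "GRAPH" if t_graph < t_sql else "SQL",
            "t_sql": t_sql, "t_graph": t_graph}
```

In the real script, the two callables would execute the generated Spark SQL and GraphFrames plans for the same query.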

Step 5: Create Train/Eval Splits

python -m training_data.fix_dataset_splits \
    --source training_data/real_labeled_runs.csv
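The results below are reported on a template-disjoint split, which can be sketched like this. The real logic is in training_data/fix_dataset_splits.py; the "template" field name here is an assumption. The key property: every query template lands wholly in train or wholly in eval, so the eval set never contains a template seen during training.

```python
# Assumed sketch of a template-disjoint split (not fix_dataset_splits.py).
import random

def template_disjoint_split(rows, eval_frac=0.2, seed=42):
    """Assign whole templates to train or eval so no template leaks across."""
    templates = sorted({r["template"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(templates)
    n_eval = max(1, int(len(templates) * eval_frac))
    eval_templates = set(templates[:n_eval])
    train = [r for r in rows if r["template"] not in eval_templates]
    eval_rows = [r for r in rows if r["template"] in eval_templates]
    return train, eval_rows
```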

Step 6: Train the Router Model

python -m model.trainer

Running Evaluations

# Relevance evaluation (accuracy, F1, PR-AUC)
python -m experiments.relevance_evaluation

# Domain shift evaluation (strict curated source)
python -m experiments.dataset_shift_evaluation \
    --source training_data/real_labeled_runs_strict_curated.csv

# Strict robustness (bootstrap CI + permutation tests)
python -m experiments.strict_robustness_evaluation \
    --train training_data/fixed_train_base_strict.csv \
    --eval training_data/fixed_eval_set_strict.csv \
    --transfer_source training_data/real_labeled_runs_strict_curated.csv \
    --n_bootstrap 1000 --n_perm_labels 100 --n_perm_features 20 \
    --out_json experiments/results/strict_robustness_eval_runtime.json \
    --out_md experiments/results/strict_robustness_eval_runtime.md

For broad all-data transfer analysis, you can still run:

python -m experiments.dataset_shift_evaluation \
    --source training_data/real_labeled_runs.csv

Running Tests

# All tests
pytest -q

# Specific suites
pytest tests/test_execution.py -v
pytest tests/test_correctness.py -v
pytest tests/test_experiments.py -v

Makefile Shortcuts

make help          # Show all available commands
make setup         # Set up environment
make test-env      # Verify environment
make data-all      # Run full dataset pipeline
make train         # Train routing model
make evaluate      # Run all evaluations
make test-all      # Run full test suite
make clean         # Remove generated outputs

Optional: Docker Big-Data Stack (Spark + HDFS + YARN)

# Start all services
docker compose up -d
docker compose ps

# Monitor Spark master
docker compose logs -f spark-master

# Stop services
docker compose down

Service UIs once running:

| Service | URL |
| --- | --- |
| Spark Master | http://localhost:8080 |
| Spark Worker 1 | http://localhost:8081 |
| Spark Worker 2 | http://localhost:8082 |
| Spark History Server | http://localhost:18080 |
| HDFS NameNode | http://localhost:9870 |
| YARN Resource Manager | http://localhost:8088 |

🖥️ Launching the Demo Dashboard

The project includes an interactive Streamlit dashboard for exploring routing decisions and evaluation metrics.

source .venv/bin/activate
export PYTHONPATH="$PWD"

streamlit run streamlit_app.py --server.port 8501 --server.headless true

Open your browser at: http://localhost:8501

The dashboard features four tabs: Dataset and Quality, Relevance Evaluation, Robustness Evaluation, and Cross-Dataset Generalization. Use the sidebar to switch between fast (rapid iteration) and strict (publication-grade) run profiles.


🎥 2-Minute Demo Video

🔗 Demo Link: https://drive.google.com/file/d/1PmTvewqPUMJ_vOrPGdnUg_BhPUDZGSv3/view?usp=sharing

Also available:

  • 4-minute detailed recording script: RUN_STREAMLIT_DEMO.md

📊 Key Results

| Policy | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| AlwaysSQL | 0.5345 | 0.0000 | 0.0000 | 0.0000 |
| AlwaysGRAPH | 0.4655 | 0.6353 | 0.4655 | 1.0000 |
| TraversalRule | 0.9655 | 0.9643 | 0.9310 | 1.0000 |
| LogReg (Balanced) | 0.9655 | 0.9643 | 0.9310 | 1.0000 |
| XGBoost (Balanced) | 0.9655 | 0.9643 | 0.9310 | 1.0000 |

Evaluated on a strict template-disjoint split of N=58 evaluation cases. Bootstrap 95% CI (F1): [0.913, 1.000]. Top permutation-importance feature: op_count_traversal.
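A bootstrap confidence interval of this kind can be computed by resampling the evaluation set with replacement and taking percentiles of the metric. This is a generic sketch, not the code of experiments/strict_robustness_evaluation.py:

```python
# Generic percentile-bootstrap CI sketch (assumption: the repo's actual
# procedure lives in experiments/strict_robustness_evaluation.py).
import random

def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000, alpha=0.05, seed=0):
    """95% CI (default) for a metric via resampling with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_bootstrap)]
    hi = scores[int((1 - alpha / 2) * n_bootstrap) - 1]
    return lo, hi

def accuracy(yt, yp):
    return sum(a == b for a, b in zip(yt, yp)) / len(yt)
```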

See also:

  • experiments/results/relevance_eval_strict_runtime.json
  • experiments/results/strict_robustness_eval_runtime.json
  • experiments/results/dataset_shift_eval_strict_runtime.json

πŸ› οΈ Tech Stack

| Layer | Technology |
| --- | --- |
| Query Parsing | Python, jsonschema |
| Distributed Compute | Apache Spark 3.4.2, PySpark |
| Graph Engine | GraphFrames 0.8.3 |
| Storage | Apache Parquet, HDFS |
| Cluster Management | Apache YARN |
| ML Models | XGBoost 2.0.3, LightGBM 4.2.0, scikit-learn 1.4.0 |
| Explainability | SHAP 0.44.0 |
| Dashboard | Streamlit 1.44.1 |
| Notebooks | JupyterLab 4.1.2 |
| Containerization | Docker Compose |
| Testing | pytest 7.4.4 |

🩺 Troubleshooting

JAVA_HOME is not set

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"

ModuleNotFoundError: No module named 'config'

cd ~/HIFUN-Router-clone
export PYTHONPATH="$PWD"

Spark fails downloading packages on first run

python test_setup.py   # retry - Spark caches JARs on the second run

📚 References

  1. L. Libkin et al., "HIFUN: A Framework for Hybrid Query Processing," Journal of Intelligent Information Systems, 2022.
  2. M. Armbrust et al., "Spark SQL: Relational Data Processing in Spark," SIGMOD, 2015.
  3. A. Dave et al., "GraphFrames: An Integrated API for Mixing Graph and Relational Queries," GRADES, 2016.
  4. R. Marcus et al., "Neo: A Learned Query Optimizer," VLDB, 2019.
  5. W. Hu et al., "Open Graph Benchmark: Datasets for Machine Learning on Graphs," NeurIPS, 2020.
  6. LDBC Council, "LDBC Social Network Benchmark," 2015.

IIIT Dharwad, Department of Data Science and AI, BDA Project, 2025-2026
