This site is hosted via GitHub Pages for the repository HossamTabana/terminal_storage_kiro.
An advanced AI-powered system for extracting and processing terminal storage rates from PDF contracts and image documents. Built on Databricks with Delta Lake, this system uses multi-method extraction techniques combined with AI models to parse complex terminal service agreements and extract structured storage tariff data.
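To make the goal concrete, a single extracted tariff record might look like the example below. The field names are purely illustrative; the actual gold-table schema is defined in the notebook.

```python
# Hypothetical example of one structured tariff record (field names are illustrative only).
example_tariff = {
    "terminal_name": "Example Terminal B.V.",
    "product": "Gasoil",
    "storage_rate": 4.25,
    "rate_unit": "EUR per cbm per month",
    "effective_date": "2025-01-01",
    "confidence_level": "high",
    "source_file": "contract.pdf",
}
```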
AI models:

- databricks-gemma-3-12b for data extraction and OCR processing
- databricks-claude-3-7-sonnet for complex reasoning and contract analysis

The system automatically installs the required dependencies:
# Core PDF processing
PyPDF2, PyMuPDF, pdfplumber
# OCR and image processing
Pillow, pytesseract, opencv-python-headless
# Data processing
pyspark, delta-spark, pandas, numpy
# AI and utilities
langdetect
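If you ever need to install these manually in an interactive Databricks notebook, a %pip cell along these lines should suffice; pyspark and delta-spark already ship with the Databricks runtime, so only the extraction libraries are listed.

```python
# Manual install sketch; the main notebook installs dependencies automatically.
%pip install PyPDF2 PyMuPDF pdfplumber Pillow pytesseract opencv-python-headless langdetect
```

Note that pytesseract only wraps the Tesseract engine, so the tesseract binary itself must be available on the cluster (for example via an init script) for OCR to work, and you may need to restart the Python process after the install before importing the new packages.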
Import the Terminal_Storage_AI_V00.py notebook to your Databricks workspace, then run the pipeline steps in order:

# Initialize the system
setup_unity_catalog_v6()
# Ingest PDF files
pdf_ingestion_v6()
# Extract text and tables
text_extraction_v6()
# Run AI extraction
enhanced_ai_extraction_v6()
# For PDF files
with open("contract.pdf", "rb") as f:
    pdf_bytes = f.read()

result = advanced_pdf_extraction_v6(pdf_bytes)
# For image files
result = extract_from_image_file_v6("contract_scan.png")
PDF/Image Files → Bronze Layer → Silver Layer → AI Processing → Gold Layer → Reports
graph TD
A[PDF/Image Files] --> B[File Ingestion]
B --> C[Bronze Layer - File Tracking]
C --> D[Multi-Method Extraction]
D --> E[Silver Layer - Processed Text]
E --> F[AI Analysis]
F --> G[Gold Layer - Structured Data]
G --> H[Flattened Layer - Reports]
D --> D1[PDF Text Extraction]
D --> D2[OCR Processing]
D --> D3[Table Extraction]
F --> F1[Claude 3.7 Sonnet]
F --> F2[Gemma 3-12B]
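The stages above can be driven end to end with a thin wrapper. The sketch below is illustrative only and assumes the v6 functions from the Quick Start can simply be called in sequence within the notebook session.

```python
# Illustrative end-to-end run of the medallion pipeline shown in the diagram.
# Assumes the v6 functions from the notebook are already defined in this session.
def run_terminal_storage_pipeline():
    setup_unity_catalog_v6()        # create catalog, schema, volume and tables
    pdf_ingestion_v6()              # Bronze: register raw PDF/image files
    text_extraction_v6()            # Silver: multi-method text/table extraction and OCR
    enhanced_ai_extraction_v6()     # Gold: AI analysis into structured tariff rows

run_terminal_storage_pipeline()
```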
The pipeline writes to four Delta tables:

- Bronze file log (bronze_file_log)
- Silver processed data (processed_data)
- Gold storage tariffs (gold_storage_tariffs)
- Flattened reporting table (gold_tariffs_flattened)

# Version and catalog configuration
VERSION = "V6.1"
CATALOG_NAME = "pg_ba_output_dev"
SCHEMA_NAME = "terminal_contract_v3"
# AI Model Configuration
AI_EXTRACTION_MODEL = "databricks-gemma-3-12b" # For extraction
AI_PROMPT_MODEL = "databricks-claude-3-7-sonnet" # For reasoning
AI_MAX_RETRIES = 3
AI_TIMEOUT_SECONDS = 120
# Quality Thresholds
MIN_CONFIDENCE_THRESHOLD = 0.7
HIGH_CONFIDENCE_THRESHOLD = 0.9
OCR_CONFIDENCE_THRESHOLD = 0.6
BRONZE_VOLUME_PATH = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/bronze_contracts"
SILVER_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.processed_data"
GOLD_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.gold_storage_tariffs"
BRONZE_LOG_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.bronze_file_log"
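As one illustration of how the thresholds above can be applied, the helper below buckets a score into confidence levels. It is a hypothetical sketch, not the logic shipped in the notebook.

```python
# Hypothetical helper: bucket an extraction score using the configured thresholds.
def classify_confidence(score: float) -> str:
    if score >= HIGH_CONFIDENCE_THRESHOLD:   # 0.9 and above
        return "high"
    if score >= MIN_CONFIDENCE_THRESHOLD:    # 0.7 up to 0.9
        return "medium"
    return "low"                             # below 0.7: candidate for manual review

print(classify_confidence(0.85))  # -> medium
```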
# Ingest files from volume
pdf_ingestion_v6()
# Check ingestion status
spark.sql(f"""
SELECT processing_status, COUNT(*)
FROM {BRONZE_LOG_TABLE}
GROUP BY processing_status
""").show()
# Extract text from ingested files
text_extraction_v6()
# View extraction quality
spark.sql(f"""
SELECT extraction_quality, AVG(quality_score)
FROM {SILVER_TABLE_NAME}
GROUP BY extraction_quality
""").show()
# Run AI extraction on processed text
enhanced_ai_extraction_v6()
# Check AI processing results
spark.sql(f"""
SELECT confidence_level, COUNT(*)
FROM {GOLD_TABLE_NAME}
GROUP BY confidence_level
""").show()
# Enhanced OCR for image-based PDFs
result = enhanced_ocr_extraction_v6(image_bytes)
# Image file processing
result = extract_from_image_file_v6("scanned_contract.png")
# Comprehensive extraction with fallbacks
result = comprehensive_fallback_chain_v6(file_bytes, file_format)
# Quality assessment
quality_score = calculate_extraction_quality_v6(result)
The project includes comprehensive testing with a 100% success rate across all components:
# Run all tests
python run_comprehensive_tests.py
# Individual test suites
python test_standalone_comprehensive_suite.py
python test_comprehensive_validation_suite.py
python test_task13_comprehensive_testing.py
python test_v6_1_improvements.py
advanced_pdf_extraction_v6(pdf_bytes)
Multi-method PDF text and table extraction.
Parameters:
- pdf_bytes: Binary PDF content
Returns:

enhanced_ocr_extraction_v6(image_bytes)
Enhanced OCR processing with image preprocessing.
Parameters:
- image_bytes: Binary image content
Returns:

enhanced_ai_extraction_v6()
AI-powered contract analysis and data extraction.
Returns:

calculate_extraction_quality_v6(extraction_result)
Calculate quality score based on terminal-specific metrics.

detect_image_based_pdf_v6(pdf_bytes)
Detect if PDF contains primarily image content.

comprehensive_fallback_chain_v6(file_bytes, file_format)
Execute comprehensive extraction with multiple fallback methods.
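The fallback chain is the main resilience mechanism: when native PDF text extraction yields little or low-quality text, processing falls back to OCR. The snippet below sketches that idea; the real comprehensive_fallback_chain_v6 may order and score the methods differently.

```python
# Illustrative sketch of the fallback idea behind comprehensive_fallback_chain_v6.
def fallback_extract(file_bytes: bytes, file_format: str) -> dict:
    if file_format == "pdf" and not detect_image_based_pdf_v6(file_bytes):
        result = advanced_pdf_extraction_v6(file_bytes)       # native text and tables
        if calculate_extraction_quality_v6(result) >= MIN_CONFIDENCE_THRESHOLD:
            return result
    # Image files, image-based PDFs, or low-quality text fall back to OCR.
    # (Assumes the OCR path handles rendering PDF pages to images internally.)
    return enhanced_ocr_extraction_v6(file_bytes)
```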
# Check extraction methods used
spark.sql(f"""
SELECT extraction_method, AVG(quality_score)
FROM {SILVER_TABLE_NAME}
GROUP BY extraction_method
""").show()
# Verify OCR configuration
result = enhanced_ocr_extraction_v6(image_bytes)
print(f"OCR Confidence: {result.get('confidence', 0)}")
# Check AI processing status
spark.sql(f"""
SELECT ai_processing_status, error_message
FROM {GOLD_TABLE_NAME}
WHERE ai_processing_status = 'failed'
""").show()
# Optimize for large files
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# Process files in batches
text_extraction_v6(batch_size=10)
enhanced_ai_extraction_v6(max_concurrent=5)
This project is proprietary software developed for terminal storage contract processing.
For technical support and questions:
- Version: V6.1
- Last Updated: January 2025
- Databricks Runtime: 13.0+
- Python Version: 3.8+