Terminal Storage – Kiro

Terminal Storage AI V6.1

An advanced AI-powered system for extracting and processing terminal storage rates from PDF contracts and image documents. Built on Databricks with Delta Lake, this system uses multi-method extraction techniques combined with AI models to parse complex terminal service agreements and extract structured storage tariff data.

🚀 Key Features

Multi-Format Document Processing – extracts text and tables from native PDF contracts and scanned image documents, with OCR for image-based files.

AI-Powered Analysis – pairs Claude 3.7 Sonnet (reasoning) with Gemma 3-12B (extraction) on Databricks to turn contract text into structured tariff data.

Enterprise-Grade Architecture – built on Databricks with Delta Lake and Unity Catalog, organized into bronze, silver, gold, and flattened layers.

📋 Table of Contents

Installation
Quick Start
Architecture
Configuration
Usage
Data Pipeline
Testing
API Reference
Troubleshooting
Contributing
License
Support

🛠 Installation

Prerequisites

A Databricks workspace on Runtime 13.0 or later, with Unity Catalog enabled and Python 3.8+.

Required Libraries

The system automatically installs required dependencies:

# Core PDF processing
PyPDF2, PyMuPDF, pdfplumber

# OCR and image processing
Pillow, pytesseract, opencv-python-headless

# Data processing
pyspark, delta-spark, pandas, numpy

# AI and utilities
langdetect
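
These can also be installed manually in a notebook cell. A minimal sketch using Databricks' notebook-scoped %pip magic (the exact package set the notebook installs may differ):

# Notebook-scoped install of the extraction dependencies listed above;
# note that pytesseract also needs the Tesseract binary available on the cluster
%pip install PyPDF2 PyMuPDF pdfplumber Pillow pytesseract opencv-python-headless langdetect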

Setup

  1. Upload the Terminal_Storage_AI_V00.py notebook to your Databricks workspace
  2. Configure the catalog and schema settings (see Configuration)
  3. Run the setup function to initialize Unity Catalog tables

🚀 Quick Start

Basic Setup

# Initialize the system
setup_unity_catalog_v6()

# Ingest PDF files
pdf_ingestion_v6()

# Extract text and tables
text_extraction_v6()

# Run AI extraction
enhanced_ai_extraction_v6()

Processing a Single File

# For PDF files
with open("contract.pdf", "rb") as f:
    pdf_bytes = f.read()
    result = advanced_pdf_extraction_v6(pdf_bytes)

# For image files
result = extract_from_image_file_v6("contract_scan.png")

๐Ÿ— Architecture

System Overview

PDF/Image Files → Bronze Layer → Silver Layer → AI Processing → Gold Layer → Reports

Data Flow Architecture

graph TD
    A[PDF/Image Files] --> B[File Ingestion]
    B --> C[Bronze Layer - File Tracking]
    C --> D[Multi-Method Extraction]
    D --> E[Silver Layer - Processed Text]
    E --> F[AI Analysis]
    F --> G[Gold Layer - Structured Data]
    G --> H[Flattened Layer - Reports]
    
    D --> D1[PDF Text Extraction]
    D --> D2[OCR Processing]
    D --> D3[Table Extraction]
    
    F --> F1[Claude 3.7 Sonnet]
    F --> F2[Gemma 3-12B]

Storage Layers

Bronze Layer (bronze_file_log) – tracks ingested files and their processing status.

Silver Layer (processed_data) – extracted text and tables with extraction-quality scores.

Gold Layer (gold_storage_tariffs) – AI-extracted, structured storage tariff records with confidence levels.

Flattened Layer (gold_tariffs_flattened) – denormalized gold data for reporting.
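
A report query against the flattened layer might look like this; the table name follows the Table Paths pattern below, and the column names are hypothetical:

# Hypothetical report over the flattened layer (illustrative column names)
FLATTENED_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.gold_tariffs_flattened"

spark.sql(f"""
    SELECT terminal_name, storage_rate, currency
    FROM {FLATTENED_TABLE_NAME}
    ORDER BY terminal_name
""").show()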

⚙️ Configuration

Core Settings

# Version and catalog configuration
VERSION = "V6.1"
CATALOG_NAME = "pg_ba_output_dev"
SCHEMA_NAME = "terminal_contract_v3"

# AI Model Configuration
AI_EXTRACTION_MODEL = "databricks-gemma-3-12b"      # For extraction
AI_PROMPT_MODEL = "databricks-claude-3-7-sonnet"    # For reasoning
AI_MAX_RETRIES = 3
AI_TIMEOUT_SECONDS = 120

# Quality Thresholds
MIN_CONFIDENCE_THRESHOLD = 0.7
HIGH_CONFIDENCE_THRESHOLD = 0.9
OCR_CONFIDENCE_THRESHOLD = 0.6
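
As an illustration of how these thresholds could map a score to the confidence_level values queried in the Usage section (a sketch, not necessarily the notebook's actual logic):

# Hypothetical mapping from a quality score to a confidence level
def confidence_level(score: float) -> str:
    if score >= HIGH_CONFIDENCE_THRESHOLD:   # 0.9
        return "high"
    if score >= MIN_CONFIDENCE_THRESHOLD:    # 0.7
        return "medium"
    return "low"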

Table Paths

BRONZE_VOLUME_PATH = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/bronze_contracts"
SILVER_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.processed_data"
GOLD_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.gold_storage_tariffs"
BRONZE_LOG_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.bronze_file_log"

📖 Usage

File Processing Pipeline

1. File Ingestion

# Ingest files from volume
pdf_ingestion_v6()

# Check ingestion status
spark.sql(f"""
    SELECT processing_status, COUNT(*) 
    FROM {BRONZE_LOG_TABLE} 
    GROUP BY processing_status
""").show()

2. Text Extraction

# Extract text from ingested files
text_extraction_v6()

# View extraction quality
spark.sql(f"""
    SELECT extraction_quality, AVG(quality_score) 
    FROM {SILVER_TABLE_NAME} 
    GROUP BY extraction_quality
""").show()

3. AI Processing

# Run AI extraction on processed text
enhanced_ai_extraction_v6()

# Check AI processing results
spark.sql(f"""
    SELECT confidence_level, COUNT(*) 
    FROM {GOLD_TABLE_NAME} 
    GROUP BY confidence_level
""").show()

Advanced Processing Options

OCR Processing

# Enhanced OCR for image-based PDFs
result = enhanced_ocr_extraction_v6(image_bytes)

# Image file processing
result = extract_from_image_file_v6("scanned_contract.png")

Multi-Method Extraction

# Comprehensive extraction with fallbacks
result = comprehensive_fallback_chain_v6(file_bytes, file_format)

# Quality assessment
quality_score = calculate_extraction_quality_v6(result)

🔄 Data Pipeline

Processing Workflow

  1. File Detection: Automatic format detection (PDF vs image)
  2. Method Selection: Choose optimal extraction strategy
  3. Text Extraction: Multi-method approach with fallbacks
  4. Quality Assessment: Confidence scoring and validation
  5. AI Analysis: Structured data extraction using AI models
  6. Data Storage: Store results in appropriate layer
  7. Error Handling: Recovery and retry mechanisms
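
Put together, a minimal orchestration sketch of this workflow using the documented entry points (the error handling shown is illustrative):

# Illustrative end-to-end run; each call is a documented pipeline function
def run_pipeline_sketch():
    setup_unity_catalog_v6()         # one-time: create Unity Catalog tables
    pdf_ingestion_v6()               # steps 1-2: detect formats, register files (bronze)
    text_extraction_v6()             # steps 3-4: multi-method extraction + quality scoring (silver)
    try:
        enhanced_ai_extraction_v6()  # steps 5-6: AI analysis into the gold layer
    except Exception as err:
        # step 7: recovery hook; the notebook's own retry handling may differ
        print(f"AI extraction failed, inspect {BRONZE_LOG_TABLE}: {err}")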

Extraction Methods

PDF Processing – layered text and table extraction using PyPDF2, PyMuPDF, and pdfplumber, with fallbacks between libraries.

OCR Processing – Tesseract OCR (pytesseract) with Pillow/OpenCV image preprocessing for scans and image-based PDFs.
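
A minimal sketch of this preprocess-then-OCR approach (not the notebook's exact implementation):

# Grayscale + autocontrast are common cleanups before running Tesseract
from io import BytesIO
from PIL import Image, ImageOps
import pytesseract

def ocr_sketch(image_bytes: bytes) -> str:
    img = Image.open(BytesIO(image_bytes))
    img = ImageOps.autocontrast(ImageOps.grayscale(img))
    return pytesseract.image_to_string(img)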

AI Processing – model-served extraction (Gemma 3-12B) and reasoning (Claude 3.7 Sonnet) over the extracted contract text.
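
On Databricks, a served model can also be prompted from SQL via the ai_query function. A hedged sketch against the silver table (the prompt and the extracted_text column name are assumptions):

# Hypothetical ai_query call; 'extracted_text' is an assumed silver-table column
spark.sql(f"""
    SELECT ai_query(
        '{AI_EXTRACTION_MODEL}',
        CONCAT('Extract storage tariff rates as JSON: ', extracted_text)
    ) AS tariff_json
    FROM {SILVER_TABLE_NAME}
    LIMIT 1
""").show()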

🧪 Testing

Test Suite Overview

The project includes a comprehensive test suite with a 100% success rate across all components:

# Run all tests
python run_comprehensive_tests.py

# Individual test suites
python test_standalone_comprehensive_suite.py
python test_comprehensive_validation_suite.py
python test_task13_comprehensive_testing.py
python test_v6_1_improvements.py

Test Coverage

Key Test Areas

📚 API Reference

Core Functions

advanced_pdf_extraction_v6(pdf_bytes)

Multi-method PDF text and table extraction.

Parameters:

pdf_bytes (bytes) – raw PDF file content.

Returns:

An extraction result containing the extracted text and tables; it can be scored with calculate_extraction_quality_v6.

enhanced_ocr_extraction_v6(image_bytes)

Enhanced OCR processing with image preprocessing.

Parameters:

image_bytes (bytes) – raw image file content.

Returns:

An OCR result including the recognized text and a confidence value (see result.get('confidence') under Troubleshooting).

enhanced_ai_extraction_v6()

AI-powered contract analysis and data extraction.

Returns:

Structured extraction results, written to the gold layer (gold_storage_tariffs).

Utility Functions

calculate_extraction_quality_v6(extraction_result)

Calculate quality score based on terminal-specific metrics.

detect_image_based_pdf_v6(pdf_bytes)

Detect if PDF contains primarily image content.
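
A common heuristic for this, sketched with PyMuPDF (not necessarily the notebook's method): treat the PDF as image-based when its pages yield little extractable text.

import fitz  # PyMuPDF

# If pages average fewer than min_chars_per_page extractable characters,
# the PDF is likely scanned images and should be routed to OCR instead
def detect_image_based_pdf_sketch(pdf_bytes: bytes, min_chars_per_page: int = 50) -> bool:
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    total_chars = sum(len(page.get_text()) for page in doc)
    return total_chars < min_chars_per_page * doc.page_count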

comprehensive_fallback_chain_v6(file_bytes, file_format)

Execute comprehensive extraction with multiple fallback methods.
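
The general pattern behind such a chain (a sketch; the real chain's method list, ordering, and scoring live in the notebook):

# Try extractors in order, keep the best-scoring result, and stop early once
# quality clears HIGH_CONFIDENCE_THRESHOLD; the method list here is illustrative
def fallback_chain_sketch(file_bytes: bytes):
    methods = [advanced_pdf_extraction_v6, enhanced_ocr_extraction_v6]
    best, best_score = None, -1.0
    for method in methods:
        try:
            result = method(file_bytes)
        except Exception:
            continue  # fall through to the next method
        score = calculate_extraction_quality_v6(result)
        if score > best_score:
            best, best_score = result, score
        if best_score >= HIGH_CONFIDENCE_THRESHOLD:
            break
    return best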

🔧 Troubleshooting

Common Issues

Low Quality Scores

# Check extraction methods used
spark.sql(f"""
    SELECT extraction_method, AVG(quality_score) 
    FROM {SILVER_TABLE_NAME} 
    GROUP BY extraction_method
""").show()

OCR Processing Issues

# Verify OCR configuration
result = enhanced_ocr_extraction_v6(image_bytes)
print(f"OCR Confidence: {result.get('confidence', 0)}")

AI Processing Failures

# Check AI processing status
spark.sql(f"""
    SELECT ai_processing_status, error_message 
    FROM {GOLD_TABLE_NAME} 
    WHERE ai_processing_status = 'failed'
""").show()

Performance Optimization

Spark Configuration

# Optimize for large files
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

Batch Processing

# Process files in batches
text_extraction_v6(batch_size=10)
enhanced_ai_extraction_v6(max_concurrent=5)

🤝 Contributing

Development Setup

  1. Clone the repository
  2. Set up Databricks development environment
  3. Configure Unity Catalog access
  4. Run test suite to verify setup

Code Standards

Testing Requirements

📄 License

This project is proprietary software developed for terminal storage contract processing.

📞 Support

For technical support and questions:


Version: V6.1
Last Updated: January 2025
Databricks Runtime: 13.0+
Python Version: 3.8+