This site is hosted via GitHub Pages for the repository HossamTabana/terminal_storage_kiro.
An advanced AI-powered system for extracting and processing terminal storage rates from PDF contracts and image documents. Built on Databricks with Delta Lake, this system uses multi-method extraction techniques combined with AI models to parse complex terminal service agreements and extract structured storage tariff data.
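To make the goal concrete, a single extracted tariff record might look like the example below. The field names are purely illustrative; the actual gold-table schema is defined in the notebook.

```python
# Hypothetical example of one structured tariff record (field names are illustrative only).
example_tariff = {
    "terminal_name": "Example Terminal B.V.",
    "product": "Gasoil",
    "storage_rate": 4.25,
    "rate_unit": "EUR per cbm per month",
    "effective_date": "2025-01-01",
    "confidence_level": "high",
    "source_file": "contract.pdf",
}
```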
AI models:

- databricks-gemma-3-12b for data extraction and OCR processing
- databricks-claude-3-7-sonnet for complex reasoning and contract analysis

The system automatically installs the required dependencies:
# Core PDF processing
PyPDF2, PyMuPDF, pdfplumber
# OCR and image processing
Pillow, pytesseract, opencv-python-headless
# Data processing
pyspark, delta-spark, pandas, numpy
# AI and utilities
langdetect
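If you ever need to install these manually in an interactive Databricks notebook, a %pip cell along these lines should suffice; pyspark and delta-spark already ship with the Databricks runtime, so only the extraction libraries are listed.

```python
# Manual install sketch; the main notebook installs dependencies automatically.
%pip install PyPDF2 PyMuPDF pdfplumber Pillow pytesseract opencv-python-headless langdetect
```

Note that pytesseract only wraps the Tesseract engine, so the tesseract binary itself must be available on the cluster (for example via an init script) for OCR to work, and you may need to restart the Python process after the install before importing the new packages.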
Import the Terminal_Storage_AI_V00.py notebook to your Databricks workspace, then run the pipeline steps in order:

# Initialize the system
setup_unity_catalog_v6()
# Ingest PDF files
pdf_ingestion_v6()
# Extract text and tables
text_extraction_v6()
# Run AI extraction
enhanced_ai_extraction_v6()
# For PDF files
with open("contract.pdf", "rb") as f:
    pdf_bytes = f.read()

result = advanced_pdf_extraction_v6(pdf_bytes)
# For image files
result = extract_from_image_file_v6("contract_scan.png")
PDF/Image Files → Bronze Layer → Silver Layer → AI Processing → Gold Layer → Reports
graph TD
A[PDF/Image Files] --> B[File Ingestion]
B --> C[Bronze Layer - File Tracking]
C --> D[Multi-Method Extraction]
D --> E[Silver Layer - Processed Text]
E --> F[AI Analysis]
F --> G[Gold Layer - Structured Data]
G --> H[Flattened Layer - Reports]
D --> D1[PDF Text Extraction]
D --> D2[OCR Processing]
D --> D3[Table Extraction]
F --> F1[Claude 3.7 Sonnet]
F --> F2[Gemma 3-12B]
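The stages above can be driven end to end with a thin wrapper. The sketch below is illustrative only and assumes the v6 functions from the Quick Start can simply be called in sequence within the notebook session.

```python
# Illustrative end-to-end run of the medallion pipeline shown in the diagram.
# Assumes the v6 functions from the notebook are already defined in this session.
def run_terminal_storage_pipeline():
    setup_unity_catalog_v6()        # create catalog, schema, volume and tables
    pdf_ingestion_v6()              # Bronze: register raw PDF/image files
    text_extraction_v6()            # Silver: multi-method text/table extraction and OCR
    enhanced_ai_extraction_v6()     # Gold: AI analysis into structured tariff rows

run_terminal_storage_pipeline()
```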
The pipeline writes to four Delta tables:

- Bronze file log (bronze_file_log)
- Silver processed data (processed_data)
- Gold storage tariffs (gold_storage_tariffs)
- Flattened reporting table (gold_tariffs_flattened)

# Version and catalog configuration
VERSION = "V6.1"
CATALOG_NAME = "pg_ba_output_dev"
SCHEMA_NAME = "terminal_contract_v3"
# AI Model Configuration
AI_EXTRACTION_MODEL = "databricks-gemma-3-12b" # For extraction
AI_PROMPT_MODEL = "databricks-claude-3-7-sonnet" # For reasoning
AI_MAX_RETRIES = 3
AI_TIMEOUT_SECONDS = 120
# Quality Thresholds
MIN_CONFIDENCE_THRESHOLD = 0.7
HIGH_CONFIDENCE_THRESHOLD = 0.9
OCR_CONFIDENCE_THRESHOLD = 0.6
BRONZE_VOLUME_PATH = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/bronze_contracts"
SILVER_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.processed_data"
GOLD_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.gold_storage_tariffs"
BRONZE_LOG_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.bronze_file_log"
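As one illustration of how the thresholds above can be applied, the helper below buckets a score into confidence levels. It is a hypothetical sketch, not the logic shipped in the notebook.

```python
# Hypothetical helper: bucket an extraction score using the configured thresholds.
def classify_confidence(score: float) -> str:
    if score >= HIGH_CONFIDENCE_THRESHOLD:   # 0.9 and above
        return "high"
    if score >= MIN_CONFIDENCE_THRESHOLD:    # 0.7 up to 0.9
        return "medium"
    return "low"                             # below 0.7: candidate for manual review

print(classify_confidence(0.85))  # -> medium
```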
# Ingest files from volume
pdf_ingestion_v6()
# Check ingestion status
spark.sql(f"""
SELECT processing_status, COUNT(*)
FROM {BRONZE_LOG_TABLE}
GROUP BY processing_status
""").show()
# Extract text from ingested files
text_extraction_v6()
# View extraction quality
spark.sql(f"""
SELECT extraction_quality, AVG(quality_score)
FROM {SILVER_TABLE_NAME}
GROUP BY extraction_quality
""").show()
# Run AI extraction on processed text
enhanced_ai_extraction_v6()
# Check AI processing results
spark.sql(f"""
SELECT confidence_level, COUNT(*)
FROM {GOLD_TABLE_NAME}
GROUP BY confidence_level
""").show()
# Enhanced OCR for image-based PDFs
result = enhanced_ocr_extraction_v6(image_bytes)
# Image file processing
result = extract_from_image_file_v6("scanned_contract.png")
# Comprehensive extraction with fallbacks
result = comprehensive_fallback_chain_v6(file_bytes, file_format)
# Quality assessment
quality_score = calculate_extraction_quality_v6(result)
The project includes comprehensive testing with a 100% success rate across all components:
# Run all tests
python run_comprehensive_tests.py
# Individual test suites
python test_standalone_comprehensive_suite.py
python test_comprehensive_validation_suite.py
python test_task13_comprehensive_testing.py
python test_v6_1_improvements.py
advanced_pdf_extraction_v6(pdf_bytes)
Multi-method PDF text and table extraction.
Parameters:
- pdf_bytes: Binary PDF content
Returns:

enhanced_ocr_extraction_v6(image_bytes)
Enhanced OCR processing with image preprocessing.
Parameters:
- image_bytes: Binary image content
Returns:

enhanced_ai_extraction_v6()
AI-powered contract analysis and data extraction.
Returns:

calculate_extraction_quality_v6(extraction_result)
Calculate quality score based on terminal-specific metrics.

detect_image_based_pdf_v6(pdf_bytes)
Detect if PDF contains primarily image content.

comprehensive_fallback_chain_v6(file_bytes, file_format)
Execute comprehensive extraction with multiple fallback methods.
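The fallback chain is the main resilience mechanism: when native PDF text extraction yields little or low-quality text, processing falls back to OCR. The snippet below sketches that idea; the real comprehensive_fallback_chain_v6 may order and score the methods differently.

```python
# Illustrative sketch of the fallback idea behind comprehensive_fallback_chain_v6.
def fallback_extract(file_bytes: bytes, file_format: str) -> dict:
    if file_format == "pdf" and not detect_image_based_pdf_v6(file_bytes):
        result = advanced_pdf_extraction_v6(file_bytes)       # native text and tables
        if calculate_extraction_quality_v6(result) >= MIN_CONFIDENCE_THRESHOLD:
            return result
    # Image files, image-based PDFs, or low-quality text fall back to OCR.
    # (Assumes the OCR path handles rendering PDF pages to images internally.)
    return enhanced_ocr_extraction_v6(file_bytes)
```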
# Check extraction methods used
spark.sql(f"""
SELECT extraction_method, AVG(quality_score)
FROM {SILVER_TABLE_NAME}
GROUP BY extraction_method
""").show()
# Verify OCR configuration
result = enhanced_ocr_extraction_v6(image_bytes)
print(f"OCR Confidence: {result.get('confidence', 0)}")
# Check AI processing status
spark.sql(f"""
SELECT ai_processing_status, error_message
FROM {GOLD_TABLE_NAME}
WHERE ai_processing_status = 'failed'
""").show()
# Optimize for large files
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# Process files in batches
text_extraction_v6(batch_size=10)
enhanced_ai_extraction_v6(max_concurrent=5)
This project is proprietary software developed for terminal storage contract processing.
For technical support and questions:
- Version: V6.1
- Last Updated: January 2025
- Databricks Runtime: 13.0+
- Python Version: 3.8+