y-shi23<y-shi23tsinghua@outlook.com>
feat: Add diabetes case generation configuration and templates

Medical Case Generation System

A medical case generation system built with Python that uses OpenAI's API to generate structured medical case studies based on disease information from CSV files. The system specializes in generating both single disease cases and comorbidity cases with detailed medical information.

Features

  • Generate structured medical cases from CSV disease data using OpenAI API
  • Support for both single disease and comorbidity case generation
  • Pydantic-based data validation and structured output
  • Five-part structured case format (basic info, examination data, positive findings, inquiry data, diagnosis)
  • Concurrent processing for batch generation with ThreadPoolExecutor
  • Comprehensive error handling and fallback mechanisms
  • JSON-based output for easy integration and analysis
  • Detailed logging for debugging and monitoring
  • Multiple API key support with load balancing rotation

Installation

Using Uv (Recommended)

    # Create virtual environment
    uv venv
    # Activate virtual environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install dependencies
    uv pip install -r requirements.txt

Using Pip

pip install -r requirements.txt

Project Structure

The system follows a modular design with several key components:

Data Models

  • models.py: Pydantic models defining structured data formats for medical cases
    • StructuredCaseData: Main data structure containing all case information
    • MedicalCase: Complete medical case with metadata
    • BatchProcessingResult: Results for batch processing operations

Case Generation & Processing

  • Generators: Four main generator scripts for different use cases:
    • single_test_case_generator.py: Test single disease case generation
    • single_batch_generator.py: Batch process single disease cases
    • comorbidity_test_case_generator.py: Test comorbidity case generation
    • comorbidity_batch_generator.py: Batch process comorbidity cases
  • case_parser.py: Parses AI-generated case content into structured Pydantic models
  • prompt_manager.py: Loads profile configuration, resolves prompt templates, and provides dataset metadata
  • profiles/: YAML profiles describing prompt templates, dataset paths, and placeholder variables (default: hepatic.yml)
  • prompts/: Markdown-based prompt templates referenced by profiles
  • generate.py: Utility for generating disease combination data

Data Sources

  • single_disease_cases.csv: Single disease information
  • comorbidity_combinations.csv: Comorbidity disease combinations
  • data.csv: Additional medical data reference

Web Interface

  • server.py: FastAPI web server providing REST API and web interface
    • RESTful API endpoints for case generation
    • Web-based UI for interactive case generation
    • Profile management and configuration
    • Real-time streaming generation
    • Batch processing with progress tracking
  • web/: Frontend static files for the web interface

Web Server Setup

To run the web interface with the FastAPI server:

1. Configuration

Create a config.json file with your OpenAI API settings:

    {
      "api": {
        "key": "your_openai_api_key_here",
        "base_url": "https://api.openai.com/v1"
      },
      "model": "gpt-4",
      "temperature": 0.7,
      "max_tokens": 2000,
      "top_p": 1.0,
      "frequency_penalty": 0.0,
      "presence_penalty": 0.0
    }

2. Start the Web Server

Important: server.py is a FastAPI application module, not a standalone script. You must use an ASGI server like uvicorn to run it.

    # Using Uv (Recommended)
    uv run uvicorn server:app --host 0.0.0.0 --port 8000 --reload
    # Using Pip/Python directly
    uvicorn server:app --host 0.0.0.0 --port 8000 --reload

3. Access the Web Interface

Once the server is running, open your browser and navigate to http://localhost:8000 (adjust the port if you changed --port).

Server Options Explained

  • --host 0.0.0.0: Allows external access (not just localhost)
  • --port 8000: Specifies the port number (change if 8000 is occupied)
  • --reload: Auto-restart on code changes (development mode)
    • Monitors Python files for changes
    • Automatically restarts the server when you save modifications
    • Very useful during development
    • Remove this flag for production environments

Production Deployment

For production environments, run without --reload:

uv run uvicorn server:app --host 0.0.0.0 --port 8000

Usage

0. Profile-Based Prompt Configuration

Prompt selection and dataset metadata are now driven by profiles stored in profiles/. Each profile specifies:

  • The system prompt templates used for single-disease and comorbidity cases (prompts/<template>.md)
  • User-message templates (how the disease information is injected)
  • Dataset paths and column names for both case types
  • Optional placeholder values (e.g., disease labels, specialist name) substituted into the templates

Available Profiles

Hepatic Profile (Default) - hepatic.yml

  • Focus: Liver diseases (MASLD, MASH, HBV, HCV, AIH, PBC, cirrhosis, HCC, etc.)
  • Datasets: single_disease_cases.csv, comorbidity_combinations.csv
  • Templates: prompts/single_case_template.md, prompts/comorbidity_case_template.md

Diabetes Profile - diabetes.yml

  • Focus: Diabetes and its complications (T1DM, T2DM, GDM, prediabetes, LADA, MODY, etc.)
  • Datasets: TNB/tnb_single_disease_cases.csv (45 cases), TNB/tnb_comorbidity_combinations.csv (1049 cases)
  • Templates: prompts/tnb_single_case_template.md, prompts/tnb_comorbidity_case_template.md
  • Coverage:
    • Diabetes types: Prediabetes, T1DM, T2DM, GDM, LADA, MODY, secondary diabetes
    • Microvascular complications: Diabetic kidney disease, retinopathy, neuropathy, diabetic foot
    • Macrovascular complications: Coronary artery disease, heart failure, stroke, peripheral artery disease
    • Metabolic syndrome: Hypertension, dyslipidemia, obesity, MASLD/MASH
    • Other comorbidities: Thyroid diseases, PCOS, OSA, infections, mental health

The default profile is hepatic. To switch to diabetes cases, pass --profile diabetes to the generator scripts. To create custom profiles, add a new YAML file under profiles/ and specify the prompt templates and dataset paths. Prompt templates use $placeholder syntax (via string.Template) for safe substitution.
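
As a rough illustration of the $placeholder substitution described above, the sketch below uses Python's string.Template. The template text and the placeholder names (specialist_name, disease_label) are hypothetical and not taken from the project's prompt files, and it assumes safe_substitute is the intended "safe" behavior.

```python
from string import Template

# Hypothetical prompt template; the real ones live under prompts/*.md.
template_text = "You are a $specialist_name. Write one case for: $disease_label."

# Hypothetical placeholder values, as a profile might supply them.
placeholders = {"specialist_name": "hepatologist", "disease_label": "MASLD"}


def render_prompt(text: str, values: dict) -> str:
    """Fill $placeholders; unknown placeholders are left untouched."""
    # safe_substitute never raises on a missing key, unlike substitute().
    return Template(text).safe_substitute(values)


prompt = render_prompt(template_text, placeholders)
print(prompt)
```

With safe_substitute, a template that mentions a placeholder the profile does not define is passed through unchanged instead of raising KeyError, which may or may not be what a given profile author wants.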

1. Configuration

The system reads configuration from a config.json file with OpenAI API settings:

Single API Key Configuration

    {
      "api": {
        "key": "your_openai_api_key",
        "base_url": "https://api.openai.com/v1"
      },
      "model": "gpt-4",
      "temperature": 0.7,
      "max_tokens": 4000,
      "top_p": 0.9,
      "frequency_penalty": 0.0,
      "presence_penalty": 0.0,
      "timeout": 60
    }
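
One way such a file could be read and checked is sketched below. The function names (load_config, validate_config, request_options) are illustrative assumptions, not the project's actual API; only the field names come from the configuration shown above.

```python
import json
from pathlib import Path


def validate_config(config: dict) -> dict:
    """Check that at least one API key is present and return the config."""
    api = config.get("api", {})
    # Accept either the single "key" or the "key1"/"key2" pair.
    if not any(api.get(name) for name in ("key", "key1", "key2")):
        raise ValueError("config.json must define api.key or api.key1/api.key2")
    return config


def load_config(path: str = "config.json") -> dict:
    """Read the JSON file as UTF-8 and validate it."""
    return validate_config(json.loads(Path(path).read_text(encoding="utf-8")))


def request_options(config: dict) -> dict:
    """Collect the sampling settings into keyword arguments for a request."""
    names = ("model", "temperature", "max_tokens", "top_p",
             "frequency_penalty", "presence_penalty")
    return {name: config[name] for name in names if name in config}
```

Keeping validation separate from file reading makes the check easy to exercise with an in-memory dictionary.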

Multiple API Keys Configuration (Load Balancing)

For batch processing, you can configure multiple API keys for load balancing:

    {
      "api": {
        "key1": "your_first_openai_api_key",
        "key2": "your_second_openai_api_key",
        "base_url": "https://api.openai.com/v1"
      },
      "model": "gpt-4",
      "temperature": 0.7,
      "max_tokens": 4000,
      "top_p": 0.9,
      "frequency_penalty": 0.0,
      "presence_penalty": 0.0,
      "timeout": 60
    }

API Key Configuration Options:

  • api.key: Single API key for all requests (required if not using key1/key2)
  • api.key1 and api.key2: Two API keys for load balancing in batch processing (optional)
  • The system automatically rotates between multiple keys to distribute load
  • Single key fallback is supported for test generators
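
The rotation described above can be pictured as a simple round-robin. The following is a minimal sketch under that assumption; the KeyRotator class is hypothetical and the project's real rotation logic may differ.

```python
import itertools
import threading


class KeyRotator:
    """Hand out API keys in round-robin order; safe to call from worker threads."""

    def __init__(self, api_section: dict):
        # Prefer key1/key2 when present, otherwise fall back to the single key.
        keys = [api_section[n] for n in ("key1", "key2") if api_section.get(n)]
        if not keys and api_section.get("key"):
            keys = [api_section["key"]]
        if not keys:
            raise ValueError("no API key configured")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        # The lock keeps concurrent callers from advancing the cycle at once.
        with self._lock:
            return next(self._cycle)


rotator = KeyRotator({"key1": "first-key", "key2": "second-key"})
print([rotator.next_key() for _ in range(3)])
```

Each worker would call next_key() just before issuing a request, so consecutive requests alternate between the configured keys.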

2. Running the System

Single Disease Cases

    # Test single disease case generation (random sampling) - Hepatic (default)
    python single_test_case_generator.py --profile hepatic
    # Test single disease case generation - Diabetes
    python single_test_case_generator.py --profile diabetes
    # Test single disease case generation with a specific row (1-based index)
    python single_test_case_generator.py --profile hepatic --row-index 5
    # Test single disease case generation with a custom description string
    python single_test_case_generator.py --profile hepatic --disease-info "单纯疾病: ..."
    # Set a seed for random sampling so that runs are reproducible
    python single_test_case_generator.py --profile hepatic --seed 2025
    # Batch generate single disease cases
    python single_batch_generator.py --profile hepatic
    # Batch generate diabetes single disease cases
    python single_batch_generator.py --profile diabetes
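
To make the --row-index and --seed semantics concrete, here is a small sketch of how a row might be selected. The pick_row helper and the disease_info column name are assumptions for illustration; the generator scripts may implement this differently.

```python
import csv
import io
import random
from typing import Optional


def pick_row(rows: list, row_index: Optional[int] = None,
             seed: Optional[int] = None) -> dict:
    """Return the row at a 1-based index, or a (seeded) random row."""
    if row_index is not None:
        if not 1 <= row_index <= len(rows):
            raise IndexError(f"row index must be between 1 and {len(rows)}")
        return rows[row_index - 1]
    # A private Random instance keeps the seed from affecting other code.
    return random.Random(seed).choice(rows)


# Stand-in for a disease CSV; the column name is hypothetical.
sample_csv = "disease_info\nCase A\nCase B\nCase C\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(pick_row(rows, row_index=2))
print(pick_row(rows, seed=2025) == pick_row(rows, seed=2025))
```

Passing the same seed twice selects the same row, which is what makes a randomly sampled test case reproducible.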

Comorbidity Cases

    # Test comorbidity case generation (random sampling) - Hepatic (default)
    python comorbidity_test_case_generator.py --profile hepatic
    # Test comorbidity case generation - Diabetes
    python comorbidity_test_case_generator.py --profile diabetes
    # Test comorbidity case generation with a specific row (1-based index)
    python comorbidity_test_case_generator.py --profile hepatic --row-index 3
    # Test comorbidity case generation with a custom description string
    python comorbidity_test_case_generator.py --profile hepatic --disease-info "并发症: ..."
    # Set a seed for random sampling so that runs are reproducible
    python comorbidity_test_case_generator.py --profile hepatic --seed 2025
    # Batch generate comorbidity cases
    python comorbidity_batch_generator.py --profile hepatic
    # Batch generate diabetes comorbidity cases
    python comorbidity_batch_generator.py --profile diabetes
    # Generate disease combinations data
    python generate.py
    # End-to-end case generation with CSV augmentation
    python disease_case_generator.py --profile hepatic --case-type comorbidity
    # End-to-end diabetes case generation
    python disease_case_generator.py --profile diabetes --case-type comorbidity

Output Structure

The system generates structured JSON output in the output/ directory:

  • Individual case files: single_case_YYYYMMDD_HHMMSS.json or comorbidity_case_YYYYMMDD_HHMMSS.json
  • Batch summary files: single_batch_YYYYMMDD_HHMMSS.json or comorbidity_batch_YYYYMMDD_HHMMSS.json
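
The naming scheme above maps directly onto a strftime pattern. The output_path helper below is a hypothetical sketch of that mapping, not a function from the project.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def output_path(case_type: str, kind: str = "case",
                now: Optional[datetime] = None,
                output_dir: str = "output") -> Path:
    """Build e.g. output/single_case_YYYYMMDD_HHMMSS.json."""
    if case_type not in ("single", "comorbidity"):
        raise ValueError(f"unknown case type: {case_type}")
    if kind not in ("case", "batch"):
        raise ValueError(f"unknown file kind: {kind}")
    # %Y%m%d_%H%M%S yields the YYYYMMDD_HHMMSS stamp used in the file names.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{case_type}_{kind}_{stamp}.json"


print(output_path("single", now=datetime(2023, 12, 1, 14, 30, 22)).name)
```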

Each case contains five structured sections:

  1. Basic user information (name, age, gender, occupation, contact)
  2. Medical examination data (lab tests, results, reference ranges)
  3. Positive findings (abnormal results and clinical significance)
  4. Medical inquiry data (symptoms, medical history, family history, personal habits)
  5. Health diagnosis and evidence (diagnosis, basis, severity)

JSON Structure Example

    {
      "original_disease_info": "Single disease: Simple fatty liver (Steatosis)",
      "generation_time": "2023-12-01T14:30:22",
      "structured_data": {
        "basic_info": {
          "name": "张三",
          "gender": "男",
          "age": "45岁"
        },
        "inquiry_data": {
          "main_symptoms": "乏力、纳差1月余",
          "past_history": "既往体健,无肝炎病史",
          "family_history": "父亲有肝硬化病史",
          "personal_history": "吸烟20年,每日10支,偶有饮酒"
        },
        "examination_data": {
          "laboratory_tests": [
            {
              "item_name": "ALT",
              "result": "85 U/L (参考范围: 9-50)"
            }
          ]
        },
        "positive_findings": {
          "positive_conclusions": [
            "ALT升高:提示肝细胞损伤"
          ]
        },
        "health_diagnosis": {
          "preliminary_diagnosis": "非酒精性脂肪性肝病",
          "diagnosis_basis": [
            {
              "category": "实验室检查",
              "content": "ALT升高,提示肝细胞损伤"
            }
          ]
        }
      }
    }
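
When consuming these files downstream, a quick completeness check can be useful. The sketch below relies only on the section keys visible in the example above; the missing_sections helper itself is hypothetical.

```python
import json

# The five section keys as they appear under "structured_data" in the example.
REQUIRED_SECTIONS = ("basic_info", "examination_data", "positive_findings",
                     "inquiry_data", "health_diagnosis")


def missing_sections(case_json: str) -> list:
    """Return the names of the five sections absent from a case document."""
    case = json.loads(case_json)
    structured = case.get("structured_data", {})
    return [name for name in REQUIRED_SECTIONS if name not in structured]


example = '{"structured_data": {"basic_info": {"name": "张三"}, "inquiry_data": {}}}'
print(missing_sections(example))
```

An empty list means all five sections are present; it says nothing about whether their contents are clinically plausible.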

System Prompts

The system uses specialized system prompts optimized for different case types:

Single Disease Cases

System prompts for single disease cases focus on:

  • Typical disease presentation and characteristic findings
  • Standard diagnostic workflows
  • Targeted treatment approaches
  • Compliance with medical guidelines
  • Strict output formatting with the five-section structure

Comorbidity Cases

System prompts for comorbidity cases emphasize:

  • Disease interactions and correlations
  • Treatment conflicts and drug interactions
  • Multidisciplinary diagnostic approaches
  • Comprehensive treatment planning
  • Complex clinical reasoning for multiple conditions

Both prompt types enforce strict output formatting with the five-section structure and prohibit additional metadata or explanatory text.

Key Implementation Details

  • Concurrent Processing: Batch generators use ThreadPoolExecutor for parallel API calls
  • Error Handling: Graceful fallback to default structured data when parsing fails
  • API Client Management: Support for multiple API keys with automatic rotation for load balancing in batch processing
  • Data Validation: Pydantic models ensure data structure integrity
  • Logging: Comprehensive logging for debugging and monitoring
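
The combination of parallel calls and graceful fallback can be sketched as follows. Here generate_case is a stand-in for the real API call, and the fallback record format is an assumption; the project falls back to default structured data when parsing fails, which this example only approximates.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def generate_case(disease_info: str) -> dict:
    """Placeholder for the real API call; fails on empty input to show the fallback."""
    if not disease_info:
        raise ValueError("empty disease description")
    return {"original_disease_info": disease_info, "status": "ok"}


def run_batch(items: list, max_workers: int = 4) -> list:
    """Generate cases in parallel, keeping results in input order."""
    results = [{} for _ in items]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_case, item): i
                   for i, item in enumerate(items)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                # Fall back to a default record instead of aborting the batch.
                results[i] = {"original_disease_info": items[i],
                              "status": "failed", "error": str(exc)}
    return results


batch = run_batch(["MASLD", "", "HBV + cirrhosis"])
print([r["status"] for r in batch])
```

Threads suit this workload because each task spends most of its time waiting on network I/O, so one failed request does not block or cancel the others.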

Disease-Specific Data Generators

TNB (Diabetes) Data Generators

The TNB/ directory contains specialized generators for diabetes-related cases:

Generate Diabetes Case Data

    # Generate single-disease diabetes cases (45 cases)
    python TNB/tnb_single_data_generator.py
    # Generate diabetes comorbidity cases (1049 cases)
    python TNB/tnb_comorbidity_data_generator.py

Diabetes Coverage

Single Disease Cases (45 total):

  • Prediabetes (IFG, IGT, HbA1c 5.7-6.4%): 5 cases
  • Type 1 Diabetes (T1DM): 10 cases
  • Type 2 Diabetes (T2DM): 14 cases
  • Gestational Diabetes Mellitus (GDM): 6 cases
  • LADA: 2 cases
  • MODY: 2 cases
  • Secondary diabetes: 3 cases
  • Other rare types: 3 cases

Comorbidity Cases (1049 total):

  • 2-disease combinations: 353 cases
  • 3-disease combinations: 368 cases
  • 4-disease combinations: 186 cases
  • 5+ disease combinations: 142 cases
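
As a quick sanity check, the per-category counts listed above can be tallied against the stated totals. The dictionaries below simply restate the numbers from this section.

```python
# Per-category counts copied from the lists above.
single_counts = {
    "Prediabetes": 5, "T1DM": 10, "T2DM": 14, "GDM": 6,
    "LADA": 2, "MODY": 2, "Secondary diabetes": 3, "Other rare types": 3,
}
comorbidity_counts = {"2": 353, "3": 368, "4": 186, "5+": 142}

single_total = sum(single_counts.values())
comorbidity_total = sum(comorbidity_counts.values())
print(single_total, comorbidity_total)  # prints: 45 1049
```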

Covered Complications:

  • Microvascular: Diabetic kidney disease (CKD stages), retinopathy (NPDR/PDR), neuropathy, diabetic foot (Wagner grades)
  • Macrovascular: Coronary artery disease, heart failure (HFrEF/HFpEF), stroke, peripheral artery disease
  • Metabolic syndrome: Hypertension, dyslipidemia, obesity (BMI categories), MASLD/MASH, liver disease
  • Other: Thyroid disorders, PCOS, OSA, infections, mental health disorders

For detailed diabetes data information, see TNB/README.md.

File Naming Conventions

  • Generator files follow pattern: {single|comorbidity}_{test|batch}_generator.py
  • Output files use timestamps: {type}_case_YYYYMMDD_HHMMSS.json for individual cases and {type}_batch_YYYYMMDD_HHMMSS.json for batch summaries
  • Backup files use _backup suffix

Dependencies

Core dependencies are defined in requirements.txt:

  • openai>=1.0.0: OpenAI API client
  • pandas>=1.5.0: Data manipulation
  • pydantic>=2.0.0: Data validation and modeling
  • PyYAML>=6.0: Profile and prompt configuration loading
  • fastapi>=0.112.0: Web framework for REST API
  • uvicorn>=0.30.0: ASGI server for running FastAPI applications
  • sse-starlette>=1.6.1: Server-sent events support for streaming responses

Troubleshooting

Web Server Issues

  1. Server Not Starting: Remember to use uvicorn server:app instead of python server.py
  2. Port Already in Use: Change port with --port 8001 or kill the existing process
  3. Access Denied: Use --host 0.0.0.0 to allow external access
  4. Auto-reload Not Working: Ensure you're using the --reload flag in development

General Issues

  1. API Key Issues: Verify that api.key (or api.key1 and api.key2) is set correctly in config.json
  2. Connection Problems: Check network connectivity and api.base_url settings
  3. Encoding Issues: Ensure CSV files use UTF-8 encoding to avoid Chinese character problems
  4. API Quota: Monitor API usage to avoid rate limiting
  5. Logging: Check console output for detailed error messages and processing status