# Hetionet Drug Analysis Pipeline

An ETL pipeline and interactive dashboard for analyzing the Hetionet v1.0 biomedical knowledge graph. The pipeline processes relationships between diseases, genes, drugs, and symptoms to support drug repurposing, polypharmacy risk assessment, and biomedical research.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Data Analyses](#data-analyses)
- [Dashboard Features](#dashboard-features)
- [Output Files](#output-files)
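- [Technical Details](#technical-details)
- [Troubleshooting](#troubleshooting)
- [Future Enhancements](#future-enhancements)
- [References](#references)
- [License](#license)
- [Contact](#contact)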

## Overview

This project implements a complete data pipeline for the Hetionet knowledge graph, consisting of:

1. **ETL Pipeline**: Extracts, transforms, and loads Hetionet data into structured CSV files
2. **Analytics Engine**: Performs 8 different biomedical analyses
3. **Interactive Dashboard**: Streamlit-based web interface for data exploration and visualization

### What is Hetionet?

Hetionet is a biomedical knowledge graph containing 47,031 nodes of 11 types (genes, diseases, compounds, and others) and 2,250,197 relationships of 24 types. This project analyzes these connections to identify:

- Genes with the most disease associations ("hotspot genes")
- Drug repurposing opportunities
- Polypharmacy risks
- Disease-symptom relationships
- Drug-drug interaction conflicts

## Features

### ETL Pipeline

- Processes 2.2M+ edges and 47K+ nodes
- Generates analysis-ready CSV files
- Optimized for performance with pre-filtering and indexing
- Handles complex data transformations

### Analyses

1. **Hotspot Genes**: Identifies genes associated with multiple diseases
2. **Drug Repurposing**: Finds existing drugs that could treat new diseases
3. **Polypharmacy Risk**: Calculates risk scores based on side effects
4. **Symptom Triangle**: Maps symptom-disease-drug relationships
5. **Super Drug Score**: Ranks drugs by benefit/risk ratio
6. **Drug Conflicts**: Identifies drugs with overlapping side effects
7. **Network Visualization**: Generates graph data for disease-gene-drug networks
8. **Disease Symptom Diversity**: Analyzes symptom complexity across diseases

### Dashboard

- Interactive visualizations with Plotly
- Global search functionality
- Chart export (PNG/SVG)
- CSV data downloads
- Real-time filtering
- Network graph visualization
- Drug comparison tool

## Prerequisites

### System Requirements

- Python 3.8 or higher
- 4GB+ RAM recommended
- ~500MB disk space for data files

### Required Libraries

```text
pandas>=1.5.0
streamlit>=1.25.0
plotly>=5.15.0
networkx>=3.0
```

## Installation

### 1. Clone or Download Project Files

```bash
# Create project directory
mkdir hetionet_analysis
cd hetionet_analysis
```

### 2. Set Up Python Environment

```bash
# Create virtual environment
python -m venv etl_projekt

# Activate environment
# On macOS/Linux:
source etl_projekt/bin/activate
# On Windows:
# etl_projekt\Scripts\activate

# Install dependencies
pip install "pandas>=1.5.0" "streamlit>=1.25.0" "plotly>=5.15.0" "networkx>=3.0"
```

### 3. Download Hetionet Data

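Download `hetionet-v1.0.json` from the [Hetionet GitHub repository](https://github.com/hetio/hetionet) and place it in the project directory. If the repository provides the file only as a compressed archive (for example `.bz2`), decompress it first so that the file name matches `hetionet-v1.0.json`.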

### 4. Add Project Files

Place the following files in your project directory:

- `hetionet_etl_final.py` - Main ETL script
- `dashboard.py` - Streamlit dashboard

## Usage

### Step 1: Run ETL Pipeline

Execute the ETL pipeline to process the Hetionet data:

```bash
python hetionet_etl_final.py
```

**Expected Runtime**: 1-2 minutes

**Output**: Creates `neo4j_csv/` directory with 20 CSV files

### Step 2: Launch Dashboard

Start the interactive dashboard:

```bash
streamlit run dashboard.py
```

The dashboard opens automatically in your web browser at `http://localhost:8501`.

### Step 3: Explore Data

Navigate through the dashboard using the sidebar menu:

- **Overview**: Summary statistics and key metrics
- **Hotspot Genes**: Top genes by disease associations
- **Drug Repurposing**: Repurposing opportunities
- **Polypharmacy Risk**: Drugs ranked by side effects
- **Symptom Triangle**: Symptom-disease-drug connections
- **Super Drugs**: Best benefit/risk ratios
- **Drug Conflicts**: Overlapping side effects
- **Network Graph**: Interactive visualization
- **Compare Drugs**: Side-by-side drug comparison

## Project Structure

```text
hetionet_analysis/
├── hetionet-v1.0.json          # Input data (download separately)
├── hetionet_etl_final.py       # ETL pipeline
├── dashboard.py                # Streamlit dashboard
├── neo4j_csv/                  # Generated output directory
│   ├── nodes_*.csv             # Node files by type (11 files)
│   ├── edges_all.csv           # All relationships
│   ├── analysis_*.csv          # Analysis results (6 files)
│   ├── network_nodes.csv       # Network visualization nodes
│   └── network_edges.csv       # Network visualization edges
└── README.md                   # This file
```

## Data Analyses

### 1. Hotspot Genes

**Purpose**: Identify genes associated with multiple diseases for potential therapeutic targets.

**Method**: Counts disease associations (via `associates`, `regulates`, `upregulates`, `downregulates`, `binds` relationships) for each gene.

**Key Findings**:

- TNF: 48 disease associations
- TP53: 47 disease associations
- IL6: 41 disease associations
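
The counting step can be sketched as follows. This is a minimal illustration, not the code in `hetionet_etl_final.py`: the function name, the tuple layout of `edges`, and the toy identifiers are assumptions, and only the relationship names come from the Method above.

```python
from collections import defaultdict

# Relationship types named in the Method above.
GENE_EDGE_TYPES = {"associates", "regulates", "upregulates", "downregulates", "binds"}


def count_hotspot_genes(edges, disease_ids):
    """Return (gene, disease_count) pairs, most-connected genes first.

    `edges` is an iterable of (source, target, edge_type) tuples in which
    the disease is the source (hypothetical layout for this sketch).
    """
    diseases_by_gene = defaultdict(set)
    for source, target, edge_type in edges:
        if edge_type in GENE_EDGE_TYPES and source in disease_ids:
            diseases_by_gene[target].add(source)  # a set ignores duplicate edges
    counts = ((gene, len(found)) for gene, found in diseases_by_gene.items())
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Toy data with made-up identifiers, for illustration only.
toy_edges = [
    ("D1", "G1", "associates"),
    ("D2", "G1", "upregulates"),
    ("D2", "G1", "associates"),    # same disease again: counted once
    ("D1", "G2", "downregulates"),
    ("C1", "G2", "binds"),         # source is not a disease: skipped
]
hotspots = count_hotspot_genes(toy_edges, disease_ids={"D1", "D2"})
print(hotspots)  # [('G1', 2), ('G2', 1)]
```

Collecting diseases in a set per gene means a gene linked to the same disease through two relationship types is counted once.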

### 2. Drug Repurposing Opportunities

**Purpose**: Discover existing drugs that could treat new diseases based on shared genetic mechanisms.

**Method**:

1. Identify genes associated with each disease
2. Find other diseases sharing those genes
3. Identify drugs treating the related diseases
4. Exclude drugs already treating the target disease

**Output**: Disease-drug pairs with shared gene counts
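
The four steps above can be sketched on plain dictionaries. The function and argument names are hypothetical, and one detail is an assumption of this sketch: when a drug is reachable through several related diseases, the largest shared-gene count is kept.

```python
from collections import defaultdict


def find_repurposing_candidates(genes_by_disease, drugs_by_disease):
    """Return {(disease, drug): shared_gene_count} for candidate pairs."""
    # Invert the mapping so genes can be walked back to their diseases.
    diseases_by_gene = defaultdict(set)
    for disease, genes in genes_by_disease.items():
        for gene in genes:
            diseases_by_gene[gene].add(disease)

    candidates = {}
    for target, genes in genes_by_disease.items():
        already_used = drugs_by_disease.get(target, set())
        # Steps 1-2: count the genes shared with every other disease.
        shared = defaultdict(int)
        for gene in genes:
            for other in diseases_by_gene[gene]:
                if other != target:
                    shared[other] += 1
        # Steps 3-4: drugs of related diseases, minus known treatments.
        for other, n_shared in shared.items():
            for drug in drugs_by_disease.get(other, set()) - already_used:
                key = (target, drug)
                candidates[key] = max(candidates.get(key, 0), n_shared)
    return candidates


# Toy data with made-up identifiers, for illustration only.
toy_genes = {"D1": {"G1", "G2"}, "D2": {"G1", "G2", "G3"}, "D3": {"G3"}}
toy_drugs = {"D1": {"C1"}, "D2": {"C1", "C2"}, "D3": {"C3"}}
candidates = find_repurposing_candidates(toy_genes, toy_drugs)
print(candidates[("D1", "C2")])  # 2
```

Here `C2` is suggested for `D1` because `D1` shares two genes with `D2`, and `C2` treats `D2` but not yet `D1`.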

### 3. Polypharmacy Risk Score

**Purpose**: Assess safety profiles of drugs based on documented side effects.

**Method**:

- Counts side effects per drug
- Calculates risk score: `side_effects / (diseases_treated + 1)`

**Key Metrics**:

- Higher score = higher risk per disease treated
- Enables comparison of drug safety profiles
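
A worked example of the formula, using made-up counts. The function name is hypothetical; the formula itself is the one given in the Method above.

```python
def polypharmacy_risk(side_effects: int, diseases_treated: int) -> float:
    """Side effects per disease treated; the +1 avoids division by zero."""
    return side_effects / (diseases_treated + 1)


# Made-up counts, for illustration only.
print(polypharmacy_risk(30, 2))  # 10.0
print(polypharmacy_risk(30, 0))  # 30.0
```

The `+ 1` in the denominator keeps the score defined for drugs with no recorded treated disease, which then score exactly their side-effect count.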

### 4. Symptom-Disease-Drug Triangle

**Purpose**: Map relationships between symptoms, diseases, and treatments.

**Method**:

1. Count diseases presenting each symptom
2. Identify drugs treating those diseases
3. Calculate impact score: `diseases × treating_drugs`

**Applications**:

- Symptom-based drug discovery
- Understanding disease complexity
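
A minimal sketch of the impact score. It assumes that `treating_drugs` means the number of distinct drugs treating at least one disease that presents the symptom; the function name and the toy identifiers are hypothetical.

```python
def symptom_impact(diseases_by_symptom, drugs_by_disease):
    """Return (symptom, n_diseases, n_drugs, impact) rows, highest impact first."""
    rows = []
    for symptom, diseases in diseases_by_symptom.items():
        drugs = set()
        for disease in diseases:
            drugs |= drugs_by_disease.get(disease, set())  # distinct drugs only
        rows.append((symptom, len(diseases), len(drugs), len(diseases) * len(drugs)))
    return sorted(rows, key=lambda row: (-row[3], row[0]))


# Toy data with made-up identifiers, for illustration only.
toy_symptoms = {"S1": {"D1", "D2"}, "S2": {"D3"}}
toy_treatments = {"D1": {"C1", "C2"}, "D2": {"C2", "C3"}}
triangle = symptom_impact(toy_symptoms, toy_treatments)
print(triangle)  # [('S1', 2, 3, 6), ('S2', 1, 0, 0)]
```

A symptom whose diseases have no recorded treatment scores 0, regardless of how many diseases present it.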

### 5. Super Drug Score

**Purpose**: Identify drugs with optimal benefit/risk ratios.

**Method**: `score = diseases_treated / (1 + side_effects)`

**Interpretation**:

- Higher score = better benefit/risk ratio
- Useful for first-line treatment selection
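
The ranking can be sketched as below, with made-up counts. The function names and the input layout are assumptions; only the formula comes from the Method above.

```python
def super_drug_score(diseases_treated: int, side_effects: int) -> float:
    """Benefit/risk ratio as defined in the Method above."""
    return diseases_treated / (1 + side_effects)


def rank_drugs(stats):
    """stats maps drug -> (diseases_treated, side_effects); best score first."""
    scored = [(drug, super_drug_score(*counts)) for drug, counts in stats.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Made-up counts, for illustration only.
toy_stats = {"drug_a": (4, 1), "drug_b": (4, 7), "drug_c": (0, 0)}
ranking = rank_drugs(toy_stats)
print(ranking)  # [('drug_a', 2.0), ('drug_b', 0.5), ('drug_c', 0.0)]
```

Note that this score is not the reciprocal of the polypharmacy risk score: the `+ 1` is applied to side effects here and to diseases treated there.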

### 6. Drug Conflicts

**Purpose**: Identify drugs with overlapping side effects that may compound when combined.

**Method**:

1. Build drug-to-side-effects mapping
2. Compare all drug pairs
3. Calculate overlap percentage
4. Flag pairs with 10+ shared side effects

**Critical for**: Polypharmacy safety assessment
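
The pairwise comparison can be sketched as follows. The README does not define the overlap percentage, so this sketch assumes shared effects divided by the union of both drugs' effects (Jaccard overlap); the function name and toy data are also hypothetical.

```python
from itertools import combinations


def find_conflicts(side_effects_by_drug, min_shared=10):
    """Return (drug_a, drug_b, n_shared, overlap_pct) for flagged pairs."""
    conflicts = []
    for a, b in combinations(sorted(side_effects_by_drug), 2):
        effects_a, effects_b = side_effects_by_drug[a], side_effects_by_drug[b]
        shared = effects_a & effects_b
        if len(shared) >= min_shared:
            # Assumed definition: shared effects as a share of the union.
            overlap_pct = 100 * len(shared) / len(effects_a | effects_b)
            conflicts.append((a, b, len(shared), round(overlap_pct, 1)))
    return conflicts


# Toy data with made-up identifiers, for illustration only.
toy_effects = {
    "A": {f"SE{i}" for i in range(12)},      # SE0..SE11
    "B": {f"SE{i}" for i in range(2, 14)},   # SE2..SE13
    "C": {"SE0", "SE1"},
}
conflicts = find_conflicts(toy_effects)
print(conflicts)  # [('A', 'B', 10, 71.4)]
```

Comparing all pairs grows quadratically with the number of drugs, so filtering out drugs with fewer than `min_shared` side effects before pairing is a cheap way to shrink the search.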

### 7. Network Visualization

**Purpose**: Provide graph-based view of disease-gene-drug relationships.

**Method**:

- Selects top 20 diseases by symptom count
- Includes connected genes (up to 150)
- Includes drugs treating those diseases (up to 50)

**Format**: NetworkX-compatible node/edge lists
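
The selection logic can be sketched with plain dictionaries. How the caps are applied is an assumption of this sketch (first encountered, in sorted order), as are the function name and the `associates` and `treats` edge labels.

```python
def build_network(symptoms_by_disease, genes_by_disease, drugs_by_disease,
                  max_diseases=20, max_genes=150, max_drugs=50):
    """Return (nodes, edges) as lists of tuples, respecting the caps."""
    # Top diseases by symptom count; ties broken by identifier.
    top = sorted(symptoms_by_disease,
                 key=lambda d: (-len(symptoms_by_disease[d]), d))[:max_diseases]
    genes, drugs = {}, {}  # dicts keep insertion order and de-duplicate
    edges = []
    for disease in top:
        for gene in sorted(genes_by_disease.get(disease, ())):
            if gene in genes or len(genes) < max_genes:
                genes[gene] = True
                edges.append((disease, gene, "associates"))
        for drug in sorted(drugs_by_disease.get(disease, ())):
            if drug in drugs or len(drugs) < max_drugs:
                drugs[drug] = True
                edges.append((drug, disease, "treats"))
    nodes = ([(d, "Disease") for d in top]
             + [(g, "Gene") for g in genes]
             + [(c, "Compound") for c in drugs])
    return nodes, edges


# Toy data with made-up identifiers and small caps, for illustration only.
toy_symptoms = {"D1": {"S1", "S2", "S3"}, "D2": {"S1"}, "D3": {"S1", "S2"}}
toy_genes = {"D1": {"G1", "G2"}, "D2": {"G4"}, "D3": {"G2", "G3"}}
toy_drugs = {"D1": {"C1"}, "D3": {"C1"}}
nodes, edges = build_network(toy_symptoms, toy_genes, toy_drugs,
                             max_diseases=2, max_genes=2)
print(len(nodes), len(edges))  # 5 5
```

Edges to a node that was cut by a cap are dropped as well, so every edge endpoint appears in the node list.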

### 8. Disease Symptom Diversity

**Purpose**: Quantify disease complexity by symptom count.

**Method**: Counts unique symptoms per disease via `presents` relationships.

**Insights**:

- Germ cell cancer: 116 symptoms
- Brain cancer: 88 symptoms
- Head and neck cancer: 79 symptoms

## Dashboard Features

### Global Search

Search across genes, diseases, or drugs by name. Results show top 5 matches with key metrics.

### Interactive Charts

All visualizations built with Plotly:

- Hover for detailed information
- Zoom and pan
- Export to PNG/SVG
- Responsive design

### Data Export

Download filtered data as CSV:

- Custom date ranges
- Filtered subsets
- Complete analysis results

### Statistics Boxes

Overview page displays key metrics:

- Average diseases per gene
- Average symptoms per disease
- Average side effects per drug

### Network Graph

Interactive force-directed graph showing:

- Red nodes: Diseases
- Blue nodes: Genes
- Green nodes: Drugs (Compounds)
- Edges: Relationships

### Drug Comparison

Side-by-side comparison of two drugs:

- Diseases treated
- Side effects
- Super score
- Recommendation based on benefit/risk ratio

## Output Files

### Node Files (11 files)

- `nodes_Gene.csv` - Gene entities with disease counts
- `nodes_Disease.csv` - Disease entities with symptom counts
- `nodes_Compound.csv` - Drug/compound entities
- `nodes_Symptom.csv` - Symptom entities
- `nodes_Side_Effect.csv` - Side effect entities
- Plus 6 additional node type files

### Edge File

- `edges_all.csv` - All 2.2M relationships with source, target, and type

### Analysis Files

- `analysis_drug_repurposing.csv` - Repurposing opportunities
- `analysis_polypharmacy_risk.csv` - Drug risk scores
- `analysis_symptom_triangle.csv` - Symptom connections
- `analysis_super_drugs.csv` - Drug rankings
- `analysis_drug_conflicts.csv` - Drug interaction warnings

### Network Files

- `network_nodes.csv` - Graph nodes for visualization
- `network_edges.csv` - Graph edges for visualization

## Technical Details

### Performance Optimizations

**Pre-filtering**: Edges filtered by type once, then reused across analyses

**Set Operations**: Uses Python sets for membership testing (average O(1) per lookup, versus O(n) for a list scan)

**Defaultdict**: Builds indices using defaultdict for efficient lookups

**Batch Processing**: Processes edges in batches for memory efficiency
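
The pre-filtering and indexing ideas can be combined in a single pass over the edges. This is a sketch rather than the pipeline's actual code: `build_edge_index` and the `treats` and `causes` type labels are assumptions.

```python
from collections import defaultdict


def build_edge_index(edges):
    """Group edges by type once: {edge_type: {source: {targets}}}."""
    index = defaultdict(lambda: defaultdict(set))
    for source, target, edge_type in edges:
        index[edge_type][source].add(target)
    return index


# Toy data with made-up identifiers, for illustration only.
toy_edges = [
    ("C1", "D1", "treats"),
    ("C1", "D2", "treats"),
    ("C1", "SE1", "causes"),
    ("D1", "G1", "associates"),
]
index = build_edge_index(toy_edges)
treats = index["treats"]             # filtered once, reused by every analysis
print(sorted(treats["C1"]))          # ['D1', 'D2']
print("D2" in treats["C1"])          # True
```

One caveat with `defaultdict`: reading a missing key creates an empty entry, so lookups for keys that may be absent are better written with `.get()`.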

### Data Type Handling

All IDs converted to strings for consistency:

```python
edges_df['source'] = edges_df['source'].astype(str)
edges_df['target'] = edges_df['target'].astype(str)
```

This prevents type mismatch errors when joining dataframes.

### Memory Management

Peak memory usage: ~2GB during ETL processing

**Optimization strategies**:

- Process data in chunks where possible
- Drop intermediate dataframes after use
- Use generators for large iterations
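
A generator that yields fixed-size batches is one way to apply the chunking strategy. The sketch below reads from an in-memory sample with the `source`, `target`, and `type` columns described for `edges_all.csv`; the function name and batch size are assumptions.

```python
import csv
import io
from itertools import islice


def iter_edge_batches(lines, batch_size=100_000):
    """Yield lists of edge rows (dicts) holding at most `batch_size` rows."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# In-memory stand-in for an open CSV file, for illustration only.
sample = io.StringIO(
    "source,target,type\n"
    "C1,D1,treats\n"
    "C1,SE1,causes\n"
    "D1,G1,associates\n"
)
batch_sizes = [len(batch) for batch in iter_edge_batches(sample, batch_size=2)]
print(batch_sizes)  # [2, 1]
```

Only one batch is held in memory at a time, so peak usage depends on `batch_size` rather than on the total number of edges. With a real file, pass the object returned by `open(path, newline="")`.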

### Edge Direction Conventions

Hetionet uses directional relationships. Key conventions:

- `Disease -> Gene` for associations
- `Disease -> Symptom` for presentations
- `Compound -> Disease` for treatments
- `Compound -> Side Effect` for adverse effects
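
These conventions can be enforced with a small helper that reorients an edge when its endpoints arrive reversed. The helper and the `kind_of` lookup are hypothetical; `associates` and `presents` are relationship names used elsewhere in this README, while `treats` and `causes` are assumed labels for the treatment and adverse-effect relationships.

```python
# Expected (source kind, target kind) per relationship type.
EXPECTED_KINDS = {
    "associates": ("Disease", "Gene"),
    "presents": ("Disease", "Symptom"),
    "treats": ("Compound", "Disease"),
    "causes": ("Compound", "Side Effect"),
}


def orient_edge(source, target, edge_type, kind_of):
    """Return (source, target) in the documented direction.

    `kind_of` maps a node ID to its node type. Edge types without a
    documented convention are returned unchanged.
    """
    expected = EXPECTED_KINDS.get(edge_type)
    if expected is None:
        return (source, target)
    kinds = (kind_of[source], kind_of[target])
    if kinds == expected:
        return (source, target)
    if kinds == expected[::-1]:
        return (target, source)  # endpoints were reversed: swap them
    raise ValueError(f"{edge_type!r} edge has unexpected node kinds {kinds}")


# Toy data with made-up identifiers, for illustration only.
toy_kinds = {"D1": "Disease", "G1": "Gene", "C1": "Compound"}
print(orient_edge("G1", "D1", "associates", toy_kinds))  # ('D1', 'G1')
print(orient_edge("C1", "D1", "treats", toy_kinds))      # ('C1', 'D1')
```

Normalizing direction once during the ETL step lets each analysis read `source` and `target` without rechecking node types.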

### Scalability Considerations

Current implementation handles Hetionet v1.0 (47K nodes, 2.2M edges).

For larger datasets:

- Implement chunked CSV reading
- Use database backend (PostgreSQL, Neo4j)
- Parallelize analyses with multiprocessing

## Troubleshooting

### Common Issues

- **Issue**: `FileNotFoundError: hetionet-v1.0.json`
  - **Solution**: Download `hetionet-v1.0.json` and place it in the project directory.
- **Issue**: `ModuleNotFoundError` ("Module not found")
  - **Solution**: Make sure the virtual environment is activated and the dependencies are installed.
- **Issue**: Dashboard shows "No data available"
  - **Solution**: Run the ETL pipeline first to generate the CSV files.
- **Issue**: `MemoryError` during ETL
  - **Solution**: Close other applications or run on a machine with more RAM.

### Data Quality

The analyses depend on Hetionet data quality. Known limitations:

- Not all drugs have documented side effects
- Gene-disease associations vary in evidence strength
- Network is not exhaustive of all biomedical knowledge

## Future Enhancements

Potential extensions to this project:

1. **Neo4j Integration**: Direct graph database storage for complex queries
2. **Machine Learning**: Predictive models for drug efficacy
3. **Temporal Analysis**: Track knowledge graph changes over time
4. **API Development**: REST API for programmatic access
5. **Cloud Deployment**: AWS/GCP hosting for web access
6. **Additional Data Sources**: Integrate DrugBank, KEGG, etc.

## References

- Himmelstein, D. S. et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. *eLife*, 6, e26726.
- Hetionet GitHub: [https://github.com/hetio/hetionet](https://github.com/hetio/hetionet)
- Neo4j Graph Database: [https://neo4j.com](https://neo4j.com)
- Streamlit Documentation: [https://docs.streamlit.io](https://docs.streamlit.io)
  
## License

This project processes publicly available Hetionet data. Refer to Hetionet licensing for data usage terms.

## Contact

For questions or issues, please refer to the project repository or documentation.

---

**Last Updated**: January 2026

**Version**: 1.0.0