Hetionet Drug Analysis Pipeline

A comprehensive ETL pipeline and interactive dashboard for analyzing biomedical knowledge graph data from Hetionet v1.0. This project processes complex relationships between diseases, genes, drugs, and symptoms to generate actionable insights for drug repurposing, polypharmacy risk assessment, and biomedical research.

Overview
Features
Prerequisites
Installation
Project Structure
Data Analyses
Dashboard Features
Output Files
Technical Details
Neo4j ETL & Analysis Pipeline

Overview

This project implements a complete data pipeline for the Hetionet knowledge graph, consisting of:

ETL Pipeline: Extracts, transforms, and loads Hetionet data into structured CSV files
Analytics Engine: Performs 8 different biomedical analyses
Interactive Dashboard: Streamlit-based web interface for data exploration and visualization

What is Hetionet?

Hetionet is a biomedical knowledge graph containing 47,031 nodes (genes, diseases, drugs, etc.) and 2,250,197 relationships. This project analyzes these connections to identify:

Genes with the most disease associations ("hotspot genes")
Drug repurposing opportunities
Polypharmacy risks
Disease-symptom relationships
Drug-drug interaction conflicts

Features

ETL Pipeline

Processes 2.2M+ edges and 47K+ nodes
Generates analysis-ready CSV files
Optimized for performance with pre-filtering and indexing
Handles complex data transformations

Analyses

Hotspot Genes: Identifies genes associated with multiple diseases
Drug Repurposing: Finds existing drugs that could treat new diseases
Polypharmacy Risk: Calculates risk scores based on side effects
Symptom Triangle: Maps symptom-disease-drug relationships
Super Drug Score: Ranks drugs by benefit/risk ratio
Drug Conflicts: Identifies drugs with overlapping side effects
Network Visualization: Generates graph data for disease-gene-drug networks
Disease Symptom Diversity: Analyzes symptom complexity across diseases

Dashboard

Interactive visualizations with Plotly
Global search functionality
Chart export (PNG/SVG)
CSV data downloads
Real-time filtering
Network graph visualization
Drug comparison tool

Prerequisites

System Requirements

Python 3.8 or higher
4GB+ RAM recommended
~500MB disk space for data files

Required Libraries

pandas>=1.5.0
streamlit>=1.25.0
plotly>=5.15.0
networkx>=3.0

Installation

Clone or Download Project Files

# Create project directory
mkdir hetionet_analysis
cd hetionet_analysis

(Optional) Docker Setup for Dashboard

You can run the Streamlit dashboard in a Docker container for easier deployment.

Dockerfile example (already present in project root)

FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

WORKDIR /app

COPY . /app

RUN pip install --upgrade pip \
    && pip install streamlit pandas plotly networkx neo4j

EXPOSE 8501

CMD ["bash", "-c", "python etl.py || true && exec streamlit run dashboard.py --server.port=8501 --server.address=0.0.0.0"]

Build Docker image

docker build -t etl-dashboard .

Run ETL and dashboard in Docker

docker run -p 8501:8501 etl-dashboard

Non-Docker Usage

1. Set Up Python Environment

# Create virtual environment
python -m venv etl_projekt

# Activate environment
# On macOS/Linux:
source etl_projekt/bin/activate
# On Windows:
# etl_projekt\Scripts\activate

# Install dependencies
pip install pandas streamlit plotly networkx

2. Run ETL Pipeline

Execute the ETL pipeline to process the Hetionet data:

python etl.py

Expected Runtime: ~1 minute

Output: Creates neo4j_csv/ directory with CSV files
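
The core of the transform step could look roughly like the sketch below. It is not the code in etl.py. It assumes the Hetionet JSON layout in which `nodes` entries carry `kind`, `identifier`, and `name`, and `edges` entries carry `source_id` and `target_id` as `[kind, identifier]` pairs plus a `kind`. The helper names and the sample IDs are hypothetical, and the CSV text is built in memory rather than written to neo4j_csv/.

```python
import csv
import io
from collections import defaultdict


def nodes_by_kind(graph):
    """Group (id, name) rows by node kind, e.g. 'Gene' or 'Side Effect'."""
    grouped = defaultdict(list)
    for node in graph["nodes"]:
        grouped[node["kind"]].append((str(node["identifier"]), node["name"]))
    return dict(grouped)


def edge_rows(graph):
    """Flatten edges to (source, target, type) rows with string IDs."""
    return [
        (str(edge["source_id"][1]), str(edge["target_id"][1]), edge["kind"])
        for edge in graph["edges"]
    ]


def node_file_name(kind):
    """Build a file name such as nodes_Side_Effect.csv from a node kind."""
    return f"nodes_{kind.replace(' ', '_')}.csv"


def to_csv_text(header, rows):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Tiny stand-in for hetionet-v1.0.json; the IDs and names are placeholders.
sample_graph = {
    "nodes": [
        {"kind": "Gene", "identifier": 1, "name": "example gene"},
        {"kind": "Disease", "identifier": "DOID:0000001", "name": "example disease"},
    ],
    "edges": [
        {"source_id": ["Disease", "DOID:0000001"],
         "target_id": ["Gene", 1],
         "kind": "associates"},
    ],
}

edges_csv = to_csv_text(("source", "target", "type"), edge_rows(sample_graph))
print(edges_csv)
```

Converting every identifier to a string at this stage keeps numeric gene IDs and prefixed disease IDs comparable in later joins.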

3. Launch Dashboard

Directly with Python

streamlit run dashboard.py

The dashboard will automatically open in your web browser at http://localhost:8501

Explore Data

Navigate through the dashboard using the sidebar menu:

Overview: Summary statistics and key metrics
Hotspot Genes: Top genes by disease associations
Drug Repurposing: Repurposing opportunities
Polypharmacy Risk: Drugs ranked by side effects
Symptom Triangle: Symptom-disease-drug connections
Super Drugs: Best benefit/risk ratios
Drug Conflicts: Overlapping side effects
Network Graph: Interactive visualization
Compare Drugs: Side-by-side drug comparison

Project Structure

hetionet_analysis/
├── hetionet-v1.0.json          # Input data (download separately)
├── hetionet_etl_final.py       # ETL pipeline
├── dashboard.py                # Streamlit dashboard
├── neo4j_csv/                  # Generated output directory
│   ├── nodes_*.csv             # Node files by type (11 files)
│   ├── edges_all.csv           # All relationships
│   ├── edges_*.csv             # Split relationships for Neo4j import
│   ├── analysis_*.csv          # Analysis results (6 files)
│   ├── network_nodes.csv       # Network visualization nodes
│   └── network_edges.csv       # Network visualization edges
└── README.md                   # This file

Data Analyses

1. Hotspot Genes

Purpose: Identify genes associated with multiple diseases for potential therapeutic targets.

Method: Counts disease associations (via associates, regulates, upregulates, downregulates, binds relationships) for each gene.
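
As a minimal sketch of this counting step, assuming edges are available as (source, target, type) tuples with the disease as the source, the function below collects the distinct diseases per gene. The function name and the sample IDs are hypothetical and not taken from the pipeline.

```python
from collections import defaultdict

# Relationship types named in the method description above.
GENE_DISEASE_TYPES = {"associates", "regulates", "upregulates", "downregulates", "binds"}


def count_diseases_per_gene(edges, disease_ids):
    """Return (gene, disease_count) pairs, most-connected gene first.

    Only edges of the listed types whose source is a disease are counted,
    and each disease counts once per gene.
    """
    diseases_by_gene = defaultdict(set)
    for source, target, edge_type in edges:
        if edge_type in GENE_DISEASE_TYPES and source in disease_ids:
            diseases_by_gene[target].add(source)
    counts = {gene: len(diseases) for gene, diseases in diseases_by_gene.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample_edges = [
    ("D1", "G1", "associates"),
    ("D2", "G1", "upregulates"),
    ("D1", "G1", "downregulates"),  # same disease again: not double-counted
    ("D1", "G2", "associates"),
    ("C1", "G1", "binds"),          # source is not a disease: ignored
    ("D2", "S1", "presents"),       # other relationship type: ignored
]
hotspots = count_diseases_per_gene(sample_edges, disease_ids={"D1", "D2"})
print(hotspots)  # [('G1', 2), ('G2', 1)]
```

Using a set per gene means a disease linked to the same gene through several relationship types is counted only once.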

Key Findings:

TNF: 48 disease associations
TP53: 47 disease associations
IL6: 41 disease associations

2. Drug Repurposing Opportunities

Purpose: Discover existing drugs that could treat new diseases based on shared genetic mechanisms.

Method:

Identify genes associated with each disease
Find other diseases sharing those genes
Identify drugs treating the related diseases
Exclude drugs already treating the target disease

Output: Disease-drug pairs with shared gene counts
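
The four steps above can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: the inputs are plain dictionaries of sets, all names are hypothetical, and the shared gene count for a disease-drug pair is taken to be the number of distinct target-disease genes that also belong to some disease the drug already treats.

```python
from collections import defaultdict


def find_repurposing_candidates(disease_genes, drug_treats):
    """Map (target_disease, drug) to a shared gene count.

    disease_genes: disease -> set of associated genes
    drug_treats:   drug -> set of diseases it already treats
    """
    diseases_by_gene = defaultdict(set)
    for disease, genes in disease_genes.items():
        for gene in genes:
            diseases_by_gene[gene].add(disease)

    drugs_by_disease = defaultdict(set)
    for drug, diseases in drug_treats.items():
        for disease in diseases:
            drugs_by_disease[disease].add(drug)

    shared_genes = defaultdict(set)
    for target, genes in disease_genes.items():
        for gene in genes:                                      # step 1
            for related in diseases_by_gene[gene]:              # step 2
                if related == target:
                    continue
                for drug in drugs_by_disease.get(related, ()):  # step 3
                    if target not in drug_treats[drug]:         # step 4
                        shared_genes[(target, drug)].add(gene)
    return {pair: len(genes) for pair, genes in shared_genes.items()}


disease_genes = {
    "disease_a": {"G1", "G2"},
    "disease_b": {"G1", "G2", "G3"},
    "disease_c": {"G3"},
}
drug_treats = {
    "drug_x": {"disease_b"},
    "drug_y": {"disease_a", "disease_b"},
    "drug_z": {"disease_c"},
}
candidates = find_repurposing_candidates(disease_genes, drug_treats)
print(candidates)
```

In this toy data, drug_x is proposed for disease_a with two shared genes, while drug_y is never proposed for disease_a or disease_b because it already treats both.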

3. Polypharmacy Risk Score

Purpose: Assess safety profiles of drugs based on documented side effects.

Method:

Counts side effects per drug
Calculates risk score: side_effects / (diseases_treated + 1)

Key Metrics:

Higher score = higher risk per disease treated
Enables comparison of drug safety profiles
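
A small worked example of the formula, with made-up counts for three hypothetical drugs, may help when reading the scores:

```python
def polypharmacy_risk(side_effects, diseases_treated):
    """Risk score from the formula above: side_effects / (diseases_treated + 1)."""
    return side_effects / (diseases_treated + 1)


# (side_effects, diseases_treated) per drug; the numbers are illustrative only.
drug_counts = {
    "drug_x": (30, 2),
    "drug_y": (30, 0),
    "drug_z": (0, 5),
}
risk_scores = {
    drug: polypharmacy_risk(side_effects, diseases)
    for drug, (side_effects, diseases) in drug_counts.items()
}
ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)

print(risk_scores)  # {'drug_x': 10.0, 'drug_y': 30.0, 'drug_z': 0.0}
print(ranked)       # ['drug_y', 'drug_x', 'drug_z']
```

The + 1 in the denominator keeps the score defined for drugs with no recorded treatments, such as drug_y here.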

4. Symptom-Disease-Drug Triangle

Purpose: Map relationships between symptoms, diseases, and treatments.

Method:

Count diseases presenting each symptom
Identify drugs treating those diseases
Calculate impact score: diseases × treating_drugs
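
One way to sketch these steps is shown below. It assumes `presents` edges as (disease, symptom) pairs and `treats` edges as (drug, disease) pairs, and it counts each treating drug once per symptom even if it treats several of the diseases. That counting choice and all names are assumptions.

```python
from collections import defaultdict


def symptom_impact(presents_edges, treats_edges):
    """Return (symptom, n_diseases, n_drugs, impact) rows, highest impact first."""
    diseases_by_symptom = defaultdict(set)
    for disease, symptom in presents_edges:
        diseases_by_symptom[symptom].add(disease)

    drugs_by_disease = defaultdict(set)
    for drug, disease in treats_edges:
        drugs_by_disease[disease].add(drug)

    rows = []
    for symptom, diseases in diseases_by_symptom.items():
        drugs = set()
        for disease in diseases:
            drugs |= drugs_by_disease.get(disease, set())
        rows.append((symptom, len(diseases), len(drugs), len(diseases) * len(drugs)))
    return sorted(rows, key=lambda row: (-row[3], row[0]))


presents = [("d1", "s1"), ("d2", "s1"), ("d1", "s2"), ("d3", "s3")]
treats = [("c1", "d1"), ("c2", "d1"), ("c2", "d2")]
triangle = symptom_impact(presents, treats)
print(triangle)  # [('s1', 2, 2, 4), ('s2', 1, 2, 2), ('s3', 1, 0, 0)]
```

A symptom whose diseases have no treating drug, like s3 here, gets an impact score of 0.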

Applications:

Symptom-based drug discovery
Understanding disease complexity

5. Super Drug Score

Purpose: Identify drugs with optimal benefit/risk ratios.

Method: score = diseases_treated / (1 + side_effects)

Interpretation:

Higher score = better benefit/risk ratio
Useful for first-line treatment selection
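
To illustrate how the score orders drugs, the sketch below ranks four hypothetical drugs with made-up counts. Breaking ties by name is an assumption.

```python
def super_drug_score(diseases_treated, side_effects):
    """Benefit/risk score from the formula above."""
    return diseases_treated / (1 + side_effects)


# (diseases_treated, side_effects) per drug; the numbers are illustrative only.
drug_counts = {
    "drug_w": (0, 3),
    "drug_x": (4, 1),
    "drug_y": (4, 7),
    "drug_z": (1, 0),
}
super_scores = {
    drug: super_drug_score(diseases, side_effects)
    for drug, (diseases, side_effects) in drug_counts.items()
}
# Highest score first, ties broken alphabetically.
ranking = sorted(super_scores, key=lambda drug: (-super_scores[drug], drug))

print(super_scores)  # {'drug_w': 0.0, 'drug_x': 2.0, 'drug_y': 0.5, 'drug_z': 1.0}
print(ranking)       # ['drug_x', 'drug_z', 'drug_y', 'drug_w']
```

Note that drug_z, which treats one disease with no recorded side effects, ranks above drug_y, which treats four diseases but has seven side effects.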

6. Drug Conflicts

Purpose: Identify drugs with overlapping side effects that may compound when combined.

Method:

Build drug-to-side-effects mapping
Compare all drug pairs
Calculate overlap percentage
Flag pairs with 10+ shared side effects

Critical for: Polypharmacy safety assessment
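
The pairwise comparison could be sketched as below. The text does not define the overlap percentage, so this example assumes it is the shared count divided by the size of the smaller side-effect set. The function name and sample data are hypothetical.

```python
from itertools import combinations


def find_conflicts(side_effects_by_drug, min_shared=10):
    """Return (drug_a, drug_b, shared_count, overlap_pct) for flagged pairs."""
    if min_shared < 1:
        raise ValueError("min_shared must be at least 1")
    conflicts = []
    # combinations() visits every unordered pair exactly once.
    for drug_a, drug_b in combinations(sorted(side_effects_by_drug), 2):
        effects_a = side_effects_by_drug[drug_a]
        effects_b = side_effects_by_drug[drug_b]
        shared = effects_a & effects_b
        if len(shared) >= min_shared:
            smaller = min(len(effects_a), len(effects_b))
            overlap_pct = round(100 * len(shared) / smaller, 1)
            conflicts.append((drug_a, drug_b, len(shared), overlap_pct))
    return sorted(conflicts, key=lambda row: -row[2])


side_effects_by_drug = {
    "drug_a": {f"se_{i}" for i in range(0, 15)},
    "drug_b": {f"se_{i}" for i in range(5, 20)},
    "drug_c": {"se_0", "se_1"},
}
conflicts = find_conflicts(side_effects_by_drug)
print(conflicts)  # [('drug_a', 'drug_b', 10, 66.7)]
```

Comparing all pairs costs n(n-1)/2 set intersections for n drugs, so for large inputs it may be worth skipping drugs with fewer than `min_shared` side effects before pairing.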

7. Network Visualization

Purpose: Provide graph-based view of disease-gene-drug relationships.

Method:

Selects top 20 diseases by symptom count
Includes connected genes (up to 150)
Includes drugs treating those diseases (up to 50)

Format: NetworkX-compatible node/edge lists
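
A possible shape for this selection is sketched below. The text does not say which genes and drugs are kept once a cap is reached, so the example keeps them in the order they are first encountered. All names are hypothetical. The returned lists use the (node, attributes) and (source, target, attributes) tuple shapes that NetworkX's `add_nodes_from` and `add_edges_from` accept.

```python
def build_network(symptom_counts, disease_genes, disease_drugs,
                  top_n=20, max_genes=150, max_drugs=50):
    """Return (nodes, edges) lists for the top diseases and their neighbors."""
    top_diseases = sorted(
        symptom_counts, key=lambda d: (-symptom_counts[d], d)
    )[:top_n]

    genes, drugs = {}, {}  # dicts used as insertion-ordered sets
    edges = []
    for disease in top_diseases:
        for gene in sorted(disease_genes.get(disease, ())):
            if gene in genes or len(genes) < max_genes:
                genes[gene] = True
                edges.append((disease, gene, {"type": "associates"}))
        for drug in sorted(disease_drugs.get(disease, ())):
            if drug in drugs or len(drugs) < max_drugs:
                drugs[drug] = True
                edges.append((drug, disease, {"type": "treats"}))

    nodes = (
        [(d, {"type": "Disease"}) for d in top_diseases]
        + [(g, {"type": "Gene"}) for g in genes]
        + [(c, {"type": "Compound"}) for c in drugs]
    )
    return nodes, edges


# Small caps so the limits are visible on toy data.
nodes, edges = build_network(
    symptom_counts={"d1": 5, "d2": 3, "d3": 1},
    disease_genes={"d1": {"g1", "g2"}, "d2": {"g2", "g3"}, "d3": {"g4"}},
    disease_drugs={"d1": {"c1"}, "d2": {"c1", "c2"}},
    top_n=2, max_genes=2, max_drugs=1,
)
print([node for node, _ in nodes])  # ['d1', 'd2', 'g1', 'g2', 'c1']
print(len(edges))                   # 5
```

Edges to a gene or drug that is already in the network are still added after the cap is reached, so shared neighbors stay connected to every selected disease.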

8. Disease Symptom Diversity

Purpose: Quantify disease complexity by symptom count.

Method: Counts unique symptoms per disease via presents relationships.

Insights:

Germ cell cancer: 116 symptoms
Brain cancer: 88 symptoms
Head and neck cancer: 79 symptoms

Dashboard Features

Global Search

Search across genes, diseases, or drugs by name. Results show top 5 matches with key metrics.

Interactive Charts

All visualizations built with Plotly:

Hover for detailed information
Zoom and pan
Export to PNG/SVG
Responsive design

Data Export

Download filtered data as CSV:

Custom date ranges
Filtered subsets
Complete analysis results

Statistics Boxes

Overview page displays key metrics:

Average diseases per gene
Average symptoms per disease
Average side effects per drug

Network Graph

Interactive force-directed graph showing:

Red nodes: Diseases
Blue nodes: Genes
Green nodes: Drugs (Compounds)
Edges: Relationships

Drug Comparison

Side-by-side comparison of two drugs:

Diseases treated
Side effects
Super score
Recommendation based on benefit/risk ratio

Output Files

Node Files (11 files)

nodes_Gene.csv - Gene entities with disease counts
nodes_Disease.csv - Disease entities with symptom counts
nodes_Compound.csv - Drug/compound entities
nodes_Symptom.csv - Symptom entities
nodes_Side_Effect.csv - Side effect entities
Plus 6 additional node type files

Edge File

edges_all.csv - All 2.2M relationships with source, target, and type

Analysis Files

analysis_drug_repurposing.csv - Repurposing opportunities
analysis_polypharmacy_risk.csv - Drug risk scores
analysis_symptom_triangle.csv - Symptom connections
analysis_super_drugs.csv - Drug rankings
analysis_drug_conflicts.csv - Drug interaction warnings

Network Files

network_nodes.csv - Graph nodes for visualization
network_edges.csv - Graph edges for visualization

Technical Details

Performance Optimizations

Pre-filtering: Edges filtered by type once, then reused across analyses

Set Operations: Uses Python sets for fast membership testing (O(1) vs O(n))

Defaultdict: Builds indices using defaultdict for efficient lookups

Batch Processing: Processes edges in batches for memory efficiency
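
These four techniques can be combined in a few lines. The sketch below is illustrative and does not reproduce the project's code: edges are assumed to be (source, target, type) tuples, the helper names are hypothetical, and the `causes` edge type follows Hetionet's naming for Compound to Side Effect links.

```python
from collections import defaultdict
from itertools import islice


def group_edges_by_type(edges):
    """Pre-filter: one pass over all edges, grouped by relationship type."""
    by_type = defaultdict(list)
    for source, target, edge_type in edges:
        by_type[edge_type].append((source, target))
    return by_type


def build_index(pairs):
    """Index source -> set of targets for constant-time membership tests."""
    index = defaultdict(set)
    for source, target in pairs:
        index[source].add(target)
    return index


def in_batches(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


sample_edges = [
    ("c1", "d1", "treats"),
    ("c1", "se1", "causes"),
    ("c2", "d1", "treats"),
    ("d1", "s1", "presents"),
    ("c1", "d2", "treats"),
]
by_type = group_edges_by_type(sample_edges)   # filtered once, reused below
treats_index = build_index(by_type["treats"])
batch_sizes = [len(batch) for batch in in_batches(sample_edges, 2)]

print("d2" in treats_index["c1"])  # True
print(batch_sizes)                 # [2, 2, 1]
```

Each analysis can then read `by_type["treats"]` or `by_type["presents"]` directly instead of scanning the full edge list again.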

Data Type Handling

All IDs converted to strings for consistency:

edges_df['source'] = edges_df['source'].astype(str)
edges_df['target'] = edges_df['target'].astype(str)

This prevents type mismatch errors when joining dataframes.

Edge Direction Conventions

Hetionet uses directional relationships. Key conventions:

Disease -> Gene for associations
Disease -> Symptom for presentations
Compound -> Disease for treatments
Compound -> Side Effect for adverse effects
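
Because an analysis that reads an edge in the wrong direction silently returns nothing, a small check like the following can be useful. It is a sketch with a hypothetical function name. The edge type names are those Hetionet uses for these four relationships, with `causes` standing for Compound to Side Effect.

```python
# Expected (source kind, target kind) for the conventions listed above.
EXPECTED_DIRECTION = {
    "associates": ("Disease", "Gene"),
    "presents": ("Disease", "Symptom"),
    "treats": ("Compound", "Disease"),
    "causes": ("Compound", "Side Effect"),
}


def direction_status(source_kind, target_kind, edge_type):
    """Classify an edge as 'ok', 'reversed', 'unexpected', or 'unchecked'."""
    expected = EXPECTED_DIRECTION.get(edge_type)
    if expected is None:
        return "unchecked"   # no convention recorded for this type
    if (source_kind, target_kind) == expected:
        return "ok"
    if (target_kind, source_kind) == expected:
        return "reversed"    # right node kinds, wrong orientation
    return "unexpected"


print(direction_status("Compound", "Disease", "treats"))  # ok
print(direction_status("Disease", "Compound", "treats"))  # reversed
```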

Scalability Considerations

The current implementation handles Hetionet v1.0 (47K nodes, 2.2M edges). Loading the CSV files with the extracted relationships into Neo4j takes a significant amount of time, even when each relationship type is loaded separately. I have not yet found a suitable solution, so the Neo4j import currently covers only the relationship types with fewer edges.

Data Quality

The analyses depend on Hetionet data quality. Known limitations:

Not all drugs have documented side effects
Gene-disease associations vary in evidence strength
Network is not exhaustive of all biomedical knowledge

Neo4j ETL & Analysis Pipeline

This repository includes a script for executing analysis queries on the dataset in a Neo4j database.

Neo4j Prerequisites

Ensure that the following components are installed and ready to use:

Neo4j Desktop: A local database instance must be created.
Python 3.x: Installed on your system.
Python Driver: Install the official Neo4j driver via pip in the virtual environment you created earlier:
```
pip install neo4j
```

Workflow Steps

Follow these steps exactly in the order provided:

1. Start Neo4j Database

Open Neo4j Desktop. Select your project and click Start on the corresponding database. The database must be active before proceeding to the next steps.

2. Copy CSV Files

After the ETL process has generated the CSV files, copy them to the Neo4j import directory. To locate that directory, click Open Folder -> Import in Neo4j Desktop, then copy all CSV files from your ETL output into the folder that opens.

3. Data Import via Cypher

Navigate to the folder neo4jqueries/loadingQueriesNeo4j and execute the Cypher scripts it contains in the Neo4j Browser. These scripts load the data from the import folder and create the nodes and relationships in the graph.

4. Execute Python Analysis

Start the analysis script via your terminal:

python neo4j_etl.py

Input: The script prompts for your database username (default: neo4j) and password.

Processing: The script reads and executes the Cypher queries in the neo4jqueries/analysis_queries directory.

Output: The analysis results are printed to the terminal.
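
A runner of this kind could be structured as in the sketch below. It is not the contents of neo4j_etl.py. It assumes one Cypher statement per file, a `.cypher` file extension, and hypothetical function names. The driver calls used (`GraphDatabase.driver`, `driver.session`, `session.run`, `record.data`) are part of the official neo4j Python driver.

```python
from pathlib import Path


def load_queries(directory):
    """Return (file name, query text) pairs for each .cypher file, sorted by name."""
    return [
        (path.name, path.read_text(encoding="utf-8").strip())
        for path in sorted(Path(directory).glob("*.cypher"))
    ]


def run_queries(queries, uri, user, password):
    """Execute each query and print its records to the terminal."""
    # Imported here so the file-loading helper works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for name, query in queries:
                print(f"--- {name} ---")
                for record in session.run(query):
                    print(record.data())
```

A call such as `run_queries(load_queries("neo4jqueries/analysis_queries"), "bolt://localhost:7687", user, password)` would then run everything against a local instance, where `bolt://localhost:7687` is Neo4j's default Bolt address and `user` and `password` come from prompts such as `input()` and `getpass.getpass()`.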

Structure

Directory / File	Functionality
`neo4j_etl.py`	Python script for executing the analysis queries.
`neo4jqueries/loadingQueriesNeo4j/`	Contains all Cypher files for the initial data import.
`neo4jqueries/analysis_queries/`	Contains the Cypher files for the analyses.

Future Enhancements

Potential extensions to this project:

Machine Learning: Predictive models for drug efficacy
Temporal Analysis: Track knowledge graph changes over time
API Development: REST API for programmatic access
Cloud Deployment: AWS/GCP hosting for web access
Additional Data Sources: Integrate DrugBank, KEGG, etc.

References

Himmelstein, D. S. et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6, e26726.
Hetionet GitHub: https://github.com/hetio/hetionet
Neo4j Graph Database: https://neo4j.com
Streamlit Documentation: https://docs.streamlit.io

License

This project processes publicly available Hetionet data. Refer to Hetionet licensing for data usage terms.

Contact

For questions or issues, please refer to the project repository or documentation.

Last Updated: January 2026
Version: 1.0.0
