Hetionet Drug Analysis Pipeline
A comprehensive ETL pipeline and interactive dashboard for analyzing biomedical knowledge graph data from Hetionet v1.0. This project processes complex relationships between diseases, genes, drugs, and symptoms to generate actionable insights for drug repurposing, polypharmacy risk assessment, and biomedical research.
Table of Contents
- Overview
- Features
- Prerequisites
- Installation
- Project Structure
- Data Analyses
- Dashboard Features
- Output Files
- Technical Details
- Neo4j ETL & Analysis Pipeline
Overview
This project implements a complete data pipeline for the Hetionet knowledge graph, consisting of:
- ETL Pipeline: Extracts, transforms, and loads Hetionet data into structured CSV files
- Analytics Engine: Performs 8 different biomedical analyses
- Interactive Dashboard: Streamlit-based web interface for data exploration and visualization
What is Hetionet?
Hetionet is a biomedical knowledge graph containing 47,031 nodes (genes, diseases, drugs, etc.) and 2,250,197 relationships. This project analyzes these connections to identify:
- Genes with the most disease associations ("hotspot genes")
- Drug repurposing opportunities
- Polypharmacy risks
- Disease-symptom relationships
- Drug-drug interaction conflicts
Features
ETL Pipeline
- Processes 2.2M+ edges and 47K+ nodes
- Generates analysis-ready CSV files
- Optimized for performance with pre-filtering and indexing
- Handles complex data transformations
Analyses
- Hotspot Genes: Identifies genes associated with multiple diseases
- Drug Repurposing: Finds existing drugs that could treat new diseases
- Polypharmacy Risk: Calculates risk scores based on side effects
- Symptom Triangle: Maps symptom-disease-drug relationships
- Super Drug Score: Ranks drugs by benefit/risk ratio
- Drug Conflicts: Identifies drugs with overlapping side effects
- Network Visualization: Generates graph data for disease-gene-drug networks
- Disease Symptom Diversity: Analyzes symptom complexity across diseases
Dashboard
- Interactive visualizations with Plotly
- Global search functionality
- Chart export (PNG/SVG)
- CSV data downloads
- Real-time filtering
- Network graph visualization
- Drug comparison tool
Prerequisites
System Requirements
- Python 3.8 or higher
- 4GB+ RAM recommended
- ~500MB disk space for data files
Required Libraries
pandas>=1.5.0
streamlit>=1.25.0
plotly>=5.15.0
networkx>=3.0
Installation
Clone or Download Project Files
# Create project directory
mkdir hetionet_analysis
cd hetionet_analysis
(Optional) Docker Setup for Dashboard
You can run the Streamlit dashboard in a Docker container for easier deployment.
Dockerfile example (already present in project root)
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
WORKDIR /app
COPY . /app
RUN pip install --upgrade pip \
&& pip install streamlit pandas plotly networkx neo4j
EXPOSE 8501
CMD ["bash", "-c", "python etl.py || true && exec streamlit run dashboard.py --server.port=8501 --server.address=0.0.0.0"]
Build Docker image
docker build -t etl-dashboard .
Run ETL and dashboard in Docker
docker run -p 8501:8501 etl-dashboard
Non-Docker Usage
1. Set Up Python Environment
# Create virtual environment
python -m venv etl_projekt
# Activate environment
# On macOS/Linux:
source etl_projekt/bin/activate
# On Windows:
# etl_projekt\Scripts\activate
# Install dependencies
pip install pandas streamlit plotly networkx
2. Run ETL Pipeline
Execute the ETL pipeline to process the Hetionet data:
python etl.py
Expected Runtime: ~ 1 minute
Output: Creates neo4j_csv/ directory with CSV files
3. Launch Dashboard
Directly with Python
streamlit run dashboard.py
The dashboard will automatically open in your web browser at http://localhost:8501
Explore Data
Navigate through the dashboard using the sidebar menu:
- Overview: Summary statistics and key metrics
- Hotspot Genes: Top genes by disease associations
- Drug Repurposing: Repurposing opportunities
- Polypharmacy Risk: Drugs ranked by side effects
- Symptom Triangle: Symptom-disease-drug connections
- Super Drugs: Best benefit/risk ratios
- Drug Conflicts: Overlapping side effects
- Network Graph: Interactive visualization
- Compare Drugs: Side-by-side drug comparison
Project Structure
hetionet_analysis/
├── hetionet-v1.0.json # Input data (download separately)
├── hetionet_etl_final.py # ETL pipeline
├── dashboard.py # Streamlit dashboard
├── neo4j_csv/ # Generated output directory
│ ├── nodes_*.csv # Node files by type (11 files)
│ ├── edges_all.csv # All relationships
│ ├── edges_*.csv # Split relationships for Neo4j import
│ ├── analysis_*.csv # Analysis results (6 files)
│ ├── network_nodes.csv # Network visualization nodes
│ └── network_edges.csv # Network visualization edges
└── README.md # This file
Data Analyses
1. Hotspot Genes
Purpose: Identify genes associated with multiple diseases for potential therapeutic targets.
Method: Counts disease associations (via associates, regulates, upregulates, downregulates, binds relationships) for each gene.
Key Findings:
- TNF: 48 disease associations
- TP53: 47 disease associations
- IL6: 41 disease associations
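The counting step in the Method line can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the tuple layout of the edges, and the choice to count each disease once per gene are assumptions, and the toy identifiers are made up.

```python
from collections import defaultdict

# Relationship types counted as gene-disease links (from the Method line above).
GENE_DISEASE_TYPES = {"associates", "regulates", "upregulates", "downregulates", "binds"}

def count_diseases_per_gene(edges):
    """Return {gene: number of distinct diseases linked to it}.

    `edges` is an iterable of (source, target, edge_type) tuples with the
    disease as source and the gene as target (an assumed layout).
    """
    diseases_by_gene = defaultdict(set)
    for disease, gene, edge_type in edges:
        if edge_type in GENE_DISEASE_TYPES:
            diseases_by_gene[gene].add(disease)  # the set drops duplicate links
    return {gene: len(found) for gene, found in diseases_by_gene.items()}

# Toy edges with made-up identifiers, not real Hetionet rows.
toy_edges = [
    ("D1", "G1", "associates"),
    ("D2", "G1", "upregulates"),
    ("D1", "G1", "downregulates"),  # same disease again: counted once
    ("D1", "G2", "presents"),       # not a gene-disease type: ignored
]
hotspots = count_diseases_per_gene(toy_edges)
```

Sorting the resulting dictionary by value in descending order would give a ranking of the kind listed under Key Findings.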
2. Drug Repurposing Opportunities
Purpose: Discover existing drugs that could treat new diseases based on shared genetic mechanisms.
Method:
- Identify genes associated with each disease
- Find other diseases sharing those genes
- Identify drugs treating the related diseases
- Exclude drugs already treating the target disease
Output: Disease-drug pairs with shared gene counts
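The four steps above can be sketched with plain dictionaries of sets. The function name and input shapes are hypothetical, chosen only for illustration; the real ETL script may organise this differently.

```python
from collections import defaultdict

def repurposing_candidates(disease_genes, disease_drugs):
    """Return {(disease, drug): shared_gene_count}.

    disease_genes: {disease: set of associated genes}
    disease_drugs: {disease: set of drugs already treating it}
    Both input shapes are assumptions made for this sketch.
    """
    # Reverse index: which diseases are linked to each gene.
    diseases_by_gene = defaultdict(set)
    for disease, genes in disease_genes.items():
        for gene in genes:
            diseases_by_gene[gene].add(disease)

    supporting_genes = defaultdict(set)  # (disease, drug) -> shared genes
    for disease, genes in disease_genes.items():
        known = disease_drugs.get(disease, set())
        for gene in genes:
            for other in diseases_by_gene[gene] - {disease}:
                # Skip drugs that already treat the target disease.
                for drug in disease_drugs.get(other, set()) - known:
                    supporting_genes[(disease, drug)].add(gene)
    return {pair: len(genes) for pair, genes in supporting_genes.items()}

# Toy data with made-up identifiers.
toy_genes = {"D1": {"G1", "G2"}, "D2": {"G1", "G2", "G3"}, "D3": {"G3"}}
toy_drugs = {"D1": {"drugA"}, "D2": {"drugA", "drugB"}, "D3": {"drugC"}}
candidates = repurposing_candidates(toy_genes, toy_drugs)
```

In the toy data, drugB treats D2 but not D1, and D1 shares two genes with D2, so the pair (D1, drugB) gets a shared gene count of 2.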
3. Polypharmacy Risk Score
Purpose: Assess safety profiles of drugs based on documented side effects.
Method:
- Counts side effects per drug
- Calculates risk score:
side_effects / (diseases_treated + 1)
Key Metrics:
- Higher score = higher risk per disease treated
- Enables comparison of drug safety profiles
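As a small worked example of the risk formula, assuming made-up counts for three hypothetical drugs:

```python
def polypharmacy_risk(side_effects, diseases_treated):
    """Side effects per treated disease; the +1 avoids division by zero."""
    return side_effects / (diseases_treated + 1)

# Hypothetical counts: (side_effects, diseases_treated)
toy_drugs = {"drugA": (30, 2), "drugB": (30, 0), "drugC": (5, 4)}
risk = {name: polypharmacy_risk(*counts) for name, counts in toy_drugs.items()}
ranked = sorted(risk, key=risk.get, reverse=True)  # riskiest first
```

Here drugA and drugB have the same number of side effects, but drugB treats no diseases and therefore ranks as the riskier of the two.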
4. Symptom-Disease-Drug Triangle
Purpose: Map relationships between symptoms, diseases, and treatments.
Method:
- Count diseases presenting each symptom
- Identify drugs treating those diseases
- Calculate impact score:
diseases × treating_drugs
Applications:
- Symptom-based drug discovery
- Understanding disease complexity
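A minimal sketch of the impact score, assuming that treating_drugs means the number of distinct drugs treating at least one disease that presents the symptom. The function name and input shapes are hypothetical.

```python
def symptom_impact(symptom_diseases, disease_drugs):
    """Return one row per symptom, sorted by impact (highest first).

    symptom_diseases: {symptom: set of diseases presenting it}
    disease_drugs:    {disease: set of drugs treating it}
    """
    rows = []
    for symptom, diseases in symptom_diseases.items():
        drugs = set()
        for disease in diseases:
            drugs |= disease_drugs.get(disease, set())  # distinct drugs only
        rows.append({
            "symptom": symptom,
            "diseases": len(diseases),
            "treating_drugs": len(drugs),
            "impact": len(diseases) * len(drugs),
        })
    return sorted(rows, key=lambda row: row["impact"], reverse=True)

# Toy data with made-up identifiers.
toy_symptoms = {"S1": {"D1", "D2"}, "S2": {"D3"}}
toy_treatments = {"D1": {"a", "b"}, "D2": {"b", "c"}, "D3": set()}
triangle = symptom_impact(toy_symptoms, toy_treatments)
```

S1 appears in two diseases treated by three distinct drugs, giving an impact of 6, while S2 has no treating drugs and scores 0.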
5. Super Drug Score
Purpose: Identify drugs with optimal benefit/risk ratios.
Method: score = diseases_treated / (1 + side_effects)
Interpretation:
- Higher score = better benefit/risk ratio
- Useful for first-line treatment selection
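The score can be tried on a few hypothetical drugs; the counts below are invented for illustration.

```python
def super_drug_score(diseases_treated, side_effects):
    """Benefit/risk ratio: diseases treated per (1 + side effects)."""
    return diseases_treated / (1 + side_effects)

# Hypothetical counts: (diseases_treated, side_effects)
toy_drugs = {"drugA": (4, 1), "drugB": (4, 7), "drugC": (0, 0)}
scores = {name: super_drug_score(*counts) for name, counts in toy_drugs.items()}
best = max(scores, key=scores.get)
```

Note that a drug treating no diseases scores 0 regardless of how few side effects it has, so a low side-effect count alone does not produce a high rank.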
6. Drug Conflicts
Purpose: Identify drugs with overlapping side effects that may compound when combined.
Method:
- Build drug-to-side-effects mapping
- Compare all drug pairs
- Calculate overlap percentage
- Flag pairs with 10+ shared side effects
Critical for: Polypharmacy safety assessment
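The pairwise comparison can be sketched with set intersections. The README does not define the overlap percentage, so this sketch assumes shared side effects divided by the size of the smaller side-effect set; the function name and `min_shared` parameter are likewise hypothetical.

```python
from itertools import combinations

def find_conflicts(drug_side_effects, min_shared=10):
    """Return (drug_a, drug_b, shared_count, overlap_pct) for flagged pairs.

    drug_side_effects: {drug: set of side effects}
    overlap_pct is relative to the smaller of the two sets (an assumption).
    """
    conflicts = []
    for drug_a, drug_b in combinations(sorted(drug_side_effects), 2):
        effects_a = drug_side_effects[drug_a]
        effects_b = drug_side_effects[drug_b]
        shared = effects_a & effects_b
        if not shared or len(shared) < min_shared:
            continue
        overlap_pct = 100 * len(shared) / min(len(effects_a), len(effects_b))
        conflicts.append((drug_a, drug_b, len(shared), round(overlap_pct, 1)))
    return conflicts

# Toy data; the threshold is lowered to 2 so the small example flags a pair.
toy_effects = {
    "drugA": {"nausea", "headache", "rash"},
    "drugB": {"nausea", "headache"},
    "drugC": {"rash"},
}
conflicts = find_conflicts(toy_effects, min_shared=2)
```

Comparing all pairs is quadratic in the number of drugs, which is one reason to restrict the comparison to drugs that have documented side effects.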
7. Network Visualization
Purpose: Provide graph-based view of disease-gene-drug relationships.
Method:
- Selects top 20 diseases by symptom count
- Includes connected genes (up to 150)
- Includes drugs treating those diseases (up to 50)
Format: NetworkX-compatible node/edge lists
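A sketch of the selection logic under stated assumptions: the README does not say which genes or drugs are kept once a cap is reached, so this version keeps them in first-seen order. The function name, input shapes, and node labels are hypothetical.

```python
def select_network(symptoms_by_disease, genes_by_disease, drugs_by_disease,
                   top_diseases=20, max_genes=150, max_drugs=50):
    """Return (nodes, edges) as plain lists for a capped subgraph."""
    # Rank diseases by symptom count; ties are broken by name for determinism.
    ranked = sorted(symptoms_by_disease,
                    key=lambda d: (-len(symptoms_by_disease[d]), d))
    diseases = ranked[:top_diseases]
    genes, drugs, edges = [], [], []  # lists keep first-seen order
    for disease in diseases:
        for gene in sorted(genes_by_disease.get(disease, ())):
            if gene not in genes and len(genes) < max_genes:
                genes.append(gene)
            if gene in genes:  # only link genes that made it under the cap
                edges.append((disease, gene))
        for drug in sorted(drugs_by_disease.get(disease, ())):
            if drug not in drugs and len(drugs) < max_drugs:
                drugs.append(drug)
            if drug in drugs:
                edges.append((drug, disease))
    nodes = ([(d, "Disease") for d in diseases]
             + [(g, "Gene") for g in genes]
             + [(c, "Compound") for c in drugs])
    return nodes, edges

# Toy data with made-up identifiers and small caps.
toy_symptoms = {"D1": {"s1", "s2", "s3"}, "D2": {"s1"}, "D3": {"s1", "s2"}}
toy_genes = {"D1": {"G1", "G2"}, "D2": {"G4"}, "D3": {"G2", "G3"}}
toy_drugs = {"D1": {"drugA"}, "D3": {"drugA"}}
nodes, edges = select_network(toy_symptoms, toy_genes, toy_drugs,
                              top_diseases=2, max_genes=2)
```

An edge list of this shape can be loaded into NetworkX with `Graph.add_edges_from(edges)`.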
8. Disease Symptom Diversity
Purpose: Quantify disease complexity by symptom count.
Method: Counts unique symptoms per disease via presents relationships.
Insights:
- Germ cell cancer: 116 symptoms
- Brain cancer: 88 symptoms
- Head and neck cancer: 79 symptoms
Dashboard Features
Global Search
Search across genes, diseases, or drugs by name. Results show top 5 matches with key metrics.
Interactive Charts
All visualizations built with Plotly:
- Hover for detailed information
- Zoom and pan
- Export to PNG/SVG
- Responsive design
Data Export
Download filtered data as CSV:
- Custom date ranges
- Filtered subsets
- Complete analysis results
Statistics Boxes
Overview page displays key metrics:
- Average diseases per gene
- Average symptoms per disease
- Average side effects per drug
Network Graph
Interactive force-directed graph showing:
- Red nodes: Diseases
- Blue nodes: Genes
- Green nodes: Drugs (Compounds)
- Edges: Relationships
Drug Comparison
Side-by-side comparison of two drugs:
- Diseases treated
- Side effects
- Super score
- Recommendation based on benefit/risk ratio
Output Files
Node Files (11 files)
- nodes_Gene.csv: Gene entities with disease counts
- nodes_Disease.csv: Disease entities with symptom counts
- nodes_Compound.csv: Drug/compound entities
- nodes_Symptom.csv: Symptom entities
- nodes_Side_Effect.csv: Side effect entities
- Plus 6 additional node type files
Edge File
- edges_all.csv: All 2.2M relationships with source, target, and type
Analysis Files
- analysis_drug_repurposing.csv: Repurposing opportunities
- analysis_polypharmacy_risk.csv: Drug risk scores
- analysis_symptom_triangle.csv: Symptom connections
- analysis_super_drugs.csv: Drug rankings
- analysis_drug_conflicts.csv: Drug interaction warnings
Network Files
- network_nodes.csv: Graph nodes for visualization
- network_edges.csv: Graph edges for visualization
Technical Details
Performance Optimizations
Pre-filtering: Edges filtered by type once, then reused across analyses
Set Operations: Uses Python sets for fast membership testing (average O(1) per lookup vs O(n) for a list scan)
Defaultdict: Builds indices using defaultdict for efficient lookups
Batch Processing: Processes edges in batches for memory efficiency
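The first three optimizations can be illustrated together. This is a sketch rather than the project's code: the helper names are hypothetical, and the edge type names in the toy data are only illustrative.

```python
from collections import defaultdict

def group_edges_by_type(edges):
    """Single pass over all edges; each analysis then reuses its own group."""
    by_type = defaultdict(list)
    for source, target, edge_type in edges:
        by_type[edge_type].append((source, target))
    return by_type

def build_index(pairs):
    """Map each source to the set of its targets for O(1) average lookups."""
    index = defaultdict(set)
    for source, target in pairs:
        index[source].add(target)
    return index

# Toy edges with made-up identifiers.
toy_edges = [
    ("drugA", "D1", "treats"),
    ("drugA", "SE1", "causes"),
    ("drugB", "D1", "treats"),
    ("drugA", "D2", "treats"),
]
by_type = group_edges_by_type(toy_edges)       # filter once
treats_index = build_index(by_type["treats"])  # reuse the filtered group
drug_a_treats_d2 = "D2" in treats_index["drugA"]  # set membership test
```

Grouping once avoids rescanning all 2.2M edges for every analysis; each analysis only touches the edge types it needs.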
Data Type Handling
All IDs converted to strings for consistency:
edges_df['source'] = edges_df['source'].astype(str)
edges_df['target'] = edges_df['target'].astype(str)
This prevents type mismatch errors when joining dataframes.
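The kind of mismatch this guards against can be reproduced without pandas. The sketch below assumes, for illustration only, that one source delivers an ID as a string and another as an integer.

```python
# Lookup table keyed by string IDs, as they would arrive from a CSV file.
gene_names = {"7124": "TNF"}

# The same ID parsed elsewhere as an integer (assumed scenario).
edge_targets = [7124]

# Without normalisation the integer key does not match the string key.
unmatched = [gene_names.get(target) for target in edge_targets]

# Converting to str first makes both sides comparable.
matched = [gene_names.get(str(target)) for target in edge_targets]
```

The same reasoning applies to dataframe joins: keys only match when both columns share a type, which is why both `source` and `target` are cast to strings.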
Edge Direction Conventions
Hetionet uses directional relationships. Key conventions:
- Disease -> Gene for associations
- Disease -> Symptom for presentations
- Compound -> Disease for treatments
- Compound -> Side Effect for adverse effects
Scalability Considerations
The current implementation handles Hetionet v1.0 (47K nodes, 2.2M edges). Loading the CSV files with the extracted relationships into Neo4j takes a significant amount of time, even when the files are loaded separately. No suitable solution was found in time, so only the relationship types with fewer edges are currently imported.
Data Quality
The analyses depend on Hetionet data quality. Known limitations:
- Not all drugs have documented side effects
- Gene-disease associations vary in evidence strength
- Network is not exhaustive of all biomedical knowledge
Neo4j ETL & Analysis Pipeline
This repository includes a script for executing analysis queries on the dataset in a Neo4j database.
Neo4j Prerequisites
Ensure that the following components are installed and ready to use:
- Neo4j Desktop: A local database instance must be created.
- Python 3.x: Installed on your system.
- Python Driver: Install the official Neo4j driver via pip in the virtual environment you created earlier:
pip install neo4j
Workflow Steps
Follow these steps exactly in the order provided:
1. Start Neo4j Database
Open Neo4j Desktop. Select your project and click Start on the corresponding database. The database must be active before proceeding to the next steps.
2. Copy CSV Files
After your ETL process has generated the CSV files, they must be moved to the Neo4j import directory.
Locating the path: In Neo4j Desktop, click on Open Folder -> Import.
Copy all CSV files from your ETL output into this folder.
3. Data Import via Cypher
Navigate to the folder neo4jqueries/loadingQueriesNeo4j.
Execute the Cypher scripts contained there within the Neo4j Browser.
These scripts load the data from the import folder and create the nodes and relationships in the graph.
4. Execute Python Analysis
Start the analysis script via your terminal:
python neo4j_etl.py
Input: The script prompts for your database username (default: neo4j) and password.
Processing: The script automatically reads and executes the Cypher queries in the neo4jqueries/analysis_queries directory.
Output: Results of the analysis will be displayed on the terminal.
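The read-and-execute loop described above might look roughly like the sketch below. It is not the contents of neo4j_etl.py: the helper names, the `.cypher` file suffix, and the one-statement-per-file layout are assumptions. The session is passed in as a parameter, so the example runs with a stand-in object and needs no database.

```python
import tempfile
from pathlib import Path

def load_queries(directory, suffix=".cypher"):
    """Return [(file name, query text)] for every query file, sorted by name."""
    files = sorted(Path(directory).glob(f"*{suffix}"))
    return [(f.name, f.read_text(encoding="utf-8").strip()) for f in files]

def run_queries(session, queries):
    """Run each query on `session` and return {file name: list of row dicts}."""
    results = {}
    for name, text in queries:
        if text:  # skip empty files
            results[name] = [dict(record) for record in session.run(text)]
    return results

class EchoSession:
    """Stand-in for a database session: returns one row holding the query text."""
    def run(self, query):
        return [{"query": query}]

# Demo with a temporary directory instead of neo4jqueries/analysis_queries.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "01_count.cypher").write_text("MATCH (n) RETURN count(n) AS n")
    Path(tmp, "notes.txt").write_text("ignored")  # wrong suffix: not loaded
    demo = run_queries(EchoSession(), load_queries(tmp))

# With the official driver, a real session could be obtained like this
# (the URI is an assumed local default):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687", auth=(user, password)) as driver:
#     with driver.session() as session:
#         run_queries(session, load_queries("neo4jqueries/analysis_queries"))
```

Passing the session in keeps the query-loading logic testable on its own and leaves connection handling to the caller.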
Structure
| Directory / file | Functionality |
|---|---|
| neo4j_etl.py | Python script for executing analysis queries. |
| neo4jqueries/loadingQueriesNeo4j/ | Contains all Cypher files for the initial data import. |
| neo4jqueries/analysis_queries/ | Includes Cypher files for analysis. |
Future Enhancements
Potential extensions to this project:
- Machine Learning: Predictive models for drug efficacy
- Temporal Analysis: Track knowledge graph changes over time
- API Development: REST API for programmatic access
- Cloud Deployment: AWS/GCP hosting for web access
- Additional Data Sources: Integrate DrugBank, KEGG, etc.
References
- Himmelstein, D. S. et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6, e26726.
- Hetionet GitHub: https://github.com/hetio/hetionet
- Neo4j Graph Database: https://neo4j.com
- Streamlit Documentation: https://docs.streamlit.io
License
This project processes publicly available Hetionet data. Refer to Hetionet licensing for data usage terms.
Contact
For questions or issues, please refer to the project repository or documentation.
Last Updated: January 2026
Version: 1.0.0