finishes setup

Philipp Jacoby
2026-02-10 17:43:26 +01:00
parent 4f1a5c311f
commit 3003310be0
39 changed files with 2251611 additions and 1188 deletions

README.md

@@ -8,13 +8,13 @@ A comprehensive ETL pipeline and interactive dashboard for analyzing biomedical
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Data Analyses](#data-analyses)
- [Dashboard Features](#dashboard-features)
- [Output Files](#output-files)
- [Technical Details](#technical-details)
- [Neo4j ETL & Analysis Pipeline](#neo4j-etl--analysis-pipeline)
## Overview
This project implements a complete data pipeline for the Hetionet knowledge graph, consisting of:
@@ -82,7 +82,7 @@ networkx>=3.0
## Installation
### Clone or Download Project Files
```bash
# Create project directory
@@ -90,7 +90,48 @@ mkdir hetionet_analysis
cd hetionet_analysis
```
### (Optional) Docker Setup for Dashboard
You can run the Streamlit dashboard in a Docker container for easier deployment.
Dockerfile example (already present in the project root):
```dockerfile
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
WORKDIR /app
COPY . /app
RUN pip install --upgrade pip \
&& pip install streamlit pandas plotly networkx neo4j
EXPOSE 8501
# Run the ETL once at startup (ignore failures), then hand off to the dashboard
CMD ["bash", "-c", "python etl.py || true && exec streamlit run dashboard.py --server.port=8501 --server.address=0.0.0.0"]
```
Build the Docker image:
```bash
docker build -t etl-dashboard .
```
Run the ETL and dashboard in Docker:
```bash
docker run -p 8501:8501 etl-dashboard
```
## Non-Docker Usage
### 1. Set Up Python Environment
```bash
# Create virtual environment
@@ -106,34 +147,21 @@ source etl_projekt/bin/activate
pip install pandas streamlit plotly networkx
```
### 3. Download Hetionet Data
Download `hetionet-v1.0.json` from [Hetionet GitHub](https://github.com/hetio/hetionet) and place it in the project directory.
### 4. Add Project Files
Place the following files in your project directory:
- `hetionet_etl_final.py` - Main ETL script
- `dashboard.py` - Streamlit dashboard
## Usage
### 2. Run ETL Pipeline
Execute the ETL pipeline to process the Hetionet data:
```bash
python etl.py
```
**Expected Runtime**: ~1 minute
**Output**: Creates `neo4j_csv/` directory with CSV files
### 3. Launch Dashboard
Start the interactive dashboard:
Run it directly with Python:
```bash
streamlit run dashboard.py
@@ -141,7 +169,7 @@ streamlit run dashboard.py
The dashboard will automatically open in your web browser at `http://localhost:8501`
## Explore Data
Navigate through the dashboard using the sidebar menu:
@@ -165,6 +193,7 @@ hetionet_analysis/
├── neo4j_csv/ # Generated output directory
│ ├── nodes_*.csv # Node files by type (11 files)
│ ├── edges_all.csv # All relationships
│   ├── edges_*.csv          # Split relationship files for Neo4j import
│ ├── analysis_*.csv # Analysis results (6 files)
│ ├── network_nodes.csv # Network visualization nodes
│ └── network_edges.csv # Network visualization edges
@@ -375,16 +404,6 @@ edges_df['target'] = edges_df['target'].astype(str)
This prevents type mismatch errors when joining dataframes.
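The cast-before-merge pattern can be sketched as follows (the frame and column names mirror the snippet above, but the sample data is invented for illustration):

```python
import pandas as pd

# Node IDs may load as int64 in one frame and as strings in another;
# casting both sides to str before joining avoids silent mismatches.
nodes_df = pd.DataFrame({"id": [1, 2, 3], "name": ["TP53", "EGFR", "BRCA1"]})
edges_df = pd.DataFrame({"source": ["1", "2"], "target": ["3", "1"]})

nodes_df["id"] = nodes_df["id"].astype(str)
edges_df["source"] = edges_df["source"].astype(str)
edges_df["target"] = edges_df["target"].astype(str)

# Without the casts, "1" (str) would never match 1 (int) and rows would drop.
merged = edges_df.merge(nodes_df, left_on="source", right_on="id")
```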
### Memory Management
Peak memory usage: ~2GB during ETL processing
**Optimization strategies**:
- Process data in chunks where possible
- Drop intermediate dataframes after use
- Use generators for large iterations
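As one example of the chunking strategy (file contents and chunk size are illustrative; the real pipeline would read `neo4j_csv/edges_all.csv`):

```python
import io

import pandas as pd

# Stand-in for a large edge file on disk.
csv_data = io.StringIO("source,target\n" + "\n".join(f"n{i},n{i+1}" for i in range(10)))

total_edges = 0
# chunksize yields DataFrames of at most 4 rows, so only one chunk
# is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_edges += len(chunk)  # aggregate, then let the chunk be freed
```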
### Edge Direction Conventions
Hetionet uses directional relationships. Key conventions:
@@ -398,28 +417,6 @@ Hetionet uses directional relationships. Key conventions:
Current implementation handles Hetionet v1.0 (47K nodes, 2.2M edges).
For larger datasets:
- Implement chunked CSV reading
- Use database backend (PostgreSQL, Neo4j)
- Parallelize analyses with multiprocessing
## Troubleshooting
### Common Issues
**Issue**: "FileNotFoundError: hetionet-v1.0.json"
**Solution**: Download Hetionet data and place in project directory
**Issue**: "Module not found"
**Solution**: Ensure virtual environment is activated and dependencies installed
**Issue**: Dashboard shows "No data available"
**Solution**: Run ETL pipeline first to generate CSV files
**Issue**: "Memory Error" during ETL
**Solution**: Close other applications or increase system RAM
### Data Quality
The analyses depend on Hetionet data quality. Known limitations:
@@ -428,16 +425,79 @@ The analyses depend on Hetionet data quality. Known limitations:
- Gene-disease associations vary in evidence strength
- Network is not exhaustive of all biomedical knowledge
## Neo4j ETL & Analysis Pipeline
This repository includes a script for executing analysis queries on the dataset in a Neo4j database.
### Neo4j Prerequisites
Ensure that the following components are installed and ready to use:
- **Neo4j Desktop:** A local database instance must be created.
- **Python 3.x:** Installed on your system.
- **Python Driver:** Install the official Neo4j driver via pip in the virtual environment you created earlier:
```bash
pip install neo4j
```
---
## Workflow Steps
Follow these steps exactly in the order provided:
### 1. Start Neo4j Database
Open **Neo4j Desktop**.
Select your project and click **Start** on the corresponding database.
The database must be active before proceeding to the next steps.
### 2. Copy CSV Files
After your ETL process has generated the CSV files, they must be moved to the Neo4j import directory.
**Locating the path:** In Neo4j Desktop, click on `Open Folder` -> `Import`.
Copy all CSV files from your ETL output into this folder.
### 3. Data Import via Cypher
Navigate to the folder `neo4jqueries/loadingQueriesNeo4j`.
Execute the Cypher scripts contained there in the **Neo4j Browser**.
These scripts load the data from the import folder and create the nodes and relationships in the graph.
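A loading statement of this kind typically looks like the following (the file name, label, and columns here are illustrative; the scripts in `neo4jqueries/loadingQueriesNeo4j` are the authoritative versions):

```cypher
// Illustrative only -- adjust the file name, label, and columns to your CSVs.
LOAD CSV WITH HEADERS FROM 'file:///nodes_Gene.csv' AS row
CREATE (:Gene {id: row.id, name: row.name});
```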
### 4. Execute Python Analysis
Start the analysis script via your terminal:
```bash
python neo4j_etl.py
```
**Input:** The script will prompt you for your **database username** (default: `neo4j`) and then your **password**.
**Processing:** The script automatically reads all queries from the `neo4jqueries/analysis_queries` directory.
**Output:** The results of the analysis queries are printed directly to the console.
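The core of such a script can be sketched as follows (the connection URI and driver calls are assumptions based on the official `neo4j` Python driver, not the repository's actual implementation):

```python
import getpass
from pathlib import Path

QUERY_DIR = Path("neo4jqueries/analysis_queries")


def load_queries(directory):
    # Collect every Cypher file in the directory, sorted for a
    # reproducible execution order.
    return {p.name: p.read_text() for p in sorted(directory.glob("*.cypher"))}


def main():
    # Lazy import so load_queries() works even without the driver installed.
    from neo4j import GraphDatabase

    user = input("Database username [neo4j]: ") or "neo4j"
    password = getpass.getpass("Password: ")
    with GraphDatabase.driver("bolt://localhost:7687", auth=(user, password)) as driver:
        for name, query in load_queries(QUERY_DIR).items():
            records, _summary, _keys = driver.execute_query(query)
            print(f"--- {name} ---")
            for record in records:
                print(record.data())


if __name__ == "__main__":
    main()
```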
---
### Project Structure
| Directory / File | Purpose |
| :---------------------------------- | :--------------------------------------------------------- |
| `neo4j_etl.py` | The Python script that executes the analysis queries. |
| `neo4jqueries/loadingQueriesNeo4j/` | Contains all Cypher files for the initial data import. |
| `neo4jqueries/analysis_queries/` | Contains Cypher files for the statistical analyses. |
---
## Future Enhancements
Potential extensions to this project:
1. **Machine Learning**: Predictive models for drug efficacy
2. **Temporal Analysis**: Track knowledge graph changes over time
3. **API Development**: REST API for programmatic access
4. **Cloud Deployment**: AWS/GCP hosting for web access
5. **Additional Data Sources**: Integrate DrugBank, KEGG, etc.
## References