finishes setup

Philipp Jacoby
2026-02-10 17:43:26 +01:00
parent 4f1a5c311f
commit 3003310be0
39 changed files with 2251611 additions and 1188 deletions

README.md

@@ -8,13 +8,13 @@ A comprehensive ETL pipeline and interactive dashboard for analyzing biomedical
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Data Analyses](#data-analyses)
- [Dashboard Features](#dashboard-features)
- [Output Files](#output-files)
- [Technical Details](#technical-details)
- [Neo4j ETL & Analysis Pipeline](#neo4j-etl--analysis-pipeline)
## Overview
This project implements a complete data pipeline for the Hetionet knowledge graph, consisting of:
@@ -82,7 +82,7 @@ networkx>=3.0
## Installation
### Clone or Download Project Files
```bash
# Create project directory
@@ -90,7 +90,48 @@ mkdir hetionet_analysis
cd hetionet_analysis
```
### (Optional) Docker Setup for Dashboard
You can run the Streamlit dashboard in a Docker container for easier deployment.
Dockerfile example (already present in the project root):
```dockerfile
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1
WORKDIR /app
COPY . /app
RUN pip install --upgrade pip \
&& pip install streamlit pandas plotly networkx neo4j
EXPOSE 8501
# Run the ETL once at startup (ignore failures), then hand off to the dashboard
CMD ["bash", "-c", "python etl.py || true && exec streamlit run dashboard.py --server.port=8501 --server.address=0.0.0.0"]
```
Build the Docker image:
```bash
docker build -t etl-dashboard .
```
Run the ETL and dashboard in Docker:
```bash
docker run -p 8501:8501 etl-dashboard
```
## Non-Docker Usage
### 1. Set Up Python Environment
```bash
# Create virtual environment
@@ -106,34 +147,21 @@ source etl_projekt/bin/activate
pip install pandas streamlit plotly networkx
```
### 3. Download Hetionet Data
Download `hetionet-v1.0.json` from [Hetionet GitHub](https://github.com/hetio/hetionet) and place it in the project directory.
### 4. Add Project Files
Place the following files in your project directory:
- `hetionet_etl_final.py` - Main ETL script
- `dashboard.py` - Streamlit dashboard
## Usage
### 2. Run ETL Pipeline
Execute the ETL pipeline to process the Hetionet data:
```bash
python etl.py
```
**Expected Runtime**: ~1 minute
**Output**: Creates `neo4j_csv/` directory with CSV files
### 3. Launch Dashboard
Start the interactive dashboard:
Run it directly with Python:
```bash
streamlit run dashboard.py
@@ -141,7 +169,7 @@ streamlit run dashboard.py
The dashboard will automatically open in your web browser at `http://localhost:8501`
## Explore Data
Navigate through the dashboard using the sidebar menu:
@@ -165,6 +193,7 @@ hetionet_analysis/
├── neo4j_csv/ # Generated output directory
│ ├── nodes_*.csv # Node files by type (11 files)
│ ├── edges_all.csv # All relationships
│   ├── edges_*.csv          # Split relationship files for Neo4j import
│ ├── analysis_*.csv # Analysis results (6 files)
│ ├── network_nodes.csv # Network visualization nodes
│ └── network_edges.csv # Network visualization edges
@@ -375,16 +404,6 @@ edges_df['target'] = edges_df['target'].astype(str)
This prevents type mismatch errors when joining dataframes.
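The cast-before-merge pattern can be sketched as follows (the frame and column names mirror the snippet above, but the sample data is invented for illustration):

```python
import pandas as pd

# Node IDs may load as int64 in one frame and as strings in another;
# casting both sides to str before joining avoids silent mismatches.
nodes_df = pd.DataFrame({"id": [1, 2, 3], "name": ["TP53", "EGFR", "BRCA1"]})
edges_df = pd.DataFrame({"source": ["1", "2"], "target": ["3", "1"]})

nodes_df["id"] = nodes_df["id"].astype(str)
edges_df["source"] = edges_df["source"].astype(str)
edges_df["target"] = edges_df["target"].astype(str)

# Without the casts, "1" (str) would never match 1 (int) and rows would drop.
merged = edges_df.merge(nodes_df, left_on="source", right_on="id")
```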
### Memory Management
Peak memory usage: ~2GB during ETL processing
**Optimization strategies**:
- Process data in chunks where possible
- Drop intermediate dataframes after use
- Use generators for large iterations
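As one example of the chunking strategy (file contents and chunk size are illustrative; the real pipeline would read `neo4j_csv/edges_all.csv`):

```python
import io

import pandas as pd

# Stand-in for a large edge file on disk.
csv_data = io.StringIO("source,target\n" + "\n".join(f"n{i},n{i+1}" for i in range(10)))

total_edges = 0
# chunksize yields DataFrames of at most 4 rows, so only one chunk
# is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_edges += len(chunk)  # aggregate, then let the chunk be freed
```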
### Edge Direction Conventions
Hetionet uses directional relationships. Key conventions:
@@ -398,28 +417,6 @@ Hetionet uses directional relationships. Key conventions:
Current implementation handles Hetionet v1.0 (47K nodes, 2.2M edges).
For larger datasets:
- Implement chunked CSV reading
- Use database backend (PostgreSQL, Neo4j)
- Parallelize analyses with multiprocessing
## Troubleshooting
### Common Issues
**Issue**: "FileNotFoundError: hetionet-v1.0.json"
**Solution**: Download Hetionet data and place in project directory
**Issue**: "Module not found"
**Solution**: Ensure virtual environment is activated and dependencies installed
**Issue**: Dashboard shows "No data available"
**Solution**: Run ETL pipeline first to generate CSV files
**Issue**: "Memory Error" during ETL
**Solution**: Close other applications or increase system RAM
### Data Quality
The analyses depend on Hetionet data quality. Known limitations:
@@ -428,16 +425,79 @@ The analyses depend on Hetionet data quality. Known limitations:
- Gene-disease associations vary in evidence strength
- Network is not exhaustive of all biomedical knowledge
## Neo4j ETL & Analysis Pipeline
This repository includes a script for executing analysis queries on the dataset in a Neo4j database.
### Neo4j Prerequisites
Ensure that the following components are installed and ready to use:
- **Neo4j Desktop:** A local database instance must be created.
- **Python 3.x:** Installed on your system.
- **Python Driver:** Install the official Neo4j driver via pip in the virtual environment you created earlier:
```bash
pip install neo4j
```
---
## Workflow Steps
Follow these steps exactly in the order provided:
### 1. Start Neo4j Database
Open **Neo4j Desktop**.
Select your project and click **Start** on the corresponding database.
The database must be active before proceeding to the next steps.
### 2. Copy CSV Files
After your ETL process has generated the CSV files, they must be moved to the Neo4j import directory.
**Locating the path:** In Neo4j Desktop, click on `Open Folder` -> `Import`.
Copy all CSV files from your ETL output into this folder.
### 3. Data Import via Cypher
Navigate to the folder `neo4jqueries/loadingQueriesNeo4j`.
Execute the Cypher scripts contained there in the **Neo4j Browser**.
These scripts load the data from the import folder and create the nodes and relationships in the graph.
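A loading statement of this kind typically looks like the following (the file name, label, and columns here are illustrative; the scripts in `neo4jqueries/loadingQueriesNeo4j` are the authoritative versions):

```cypher
// Illustrative only -- adjust the file name, label, and columns to your CSVs.
LOAD CSV WITH HEADERS FROM 'file:///nodes_Gene.csv' AS row
CREATE (:Gene {id: row.id, name: row.name});
```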
### 4. Execute Python Analysis
Start the analysis script via your terminal:
```bash
python neo4j_etl.py
```
**Input:** The script will prompt you for your **database username** (default: `neo4j`) and then your **password**.
**Processing:** The script automatically reads all queries from the `neo4jqueries/analysis_queries` directory.
**Output:** The results of the analysis queries are printed directly to the console.
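The core of such a script can be sketched as follows (the connection URI and driver calls are assumptions based on the official `neo4j` Python driver, not the repository's actual implementation):

```python
import getpass
from pathlib import Path

QUERY_DIR = Path("neo4jqueries/analysis_queries")


def load_queries(directory):
    # Collect every Cypher file in the directory, sorted for a
    # reproducible execution order.
    return {p.name: p.read_text() for p in sorted(directory.glob("*.cypher"))}


def main():
    # Lazy import so load_queries() works even without the driver installed.
    from neo4j import GraphDatabase

    user = input("Database username [neo4j]: ") or "neo4j"
    password = getpass.getpass("Password: ")
    with GraphDatabase.driver("bolt://localhost:7687", auth=(user, password)) as driver:
        for name, query in load_queries(QUERY_DIR).items():
            records, _summary, _keys = driver.execute_query(query)
            print(f"--- {name} ---")
            for record in records:
                print(record.data())


if __name__ == "__main__":
    main()
```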
---
### Project Structure
| Directory / File | Purpose |
| :---------------------------------- | :--------------------------------------------------------- |
| `neo4j_etl.py` | The Python script that executes the analysis queries. |
| `neo4jqueries/loadingQueriesNeo4j/` | Contains all Cypher files for the initial data import. |
| `neo4jqueries/analysis_queries/` | Contains Cypher files for the statistical analyses. |
---
## Future Enhancements
Potential extensions to this project:
1. **Machine Learning**: Predictive models for drug efficacy
2. **Temporal Analysis**: Track knowledge graph changes over time
3. **API Development**: REST API for programmatic access
4. **Cloud Deployment**: AWS/GCP hosting for web access
5. **Additional Data Sources**: Integrate DrugBank, KEGG, etc.
## References