A comprehensive ETL pipeline and interactive dashboard for analyzing the Hetionet biomedical knowledge graph.
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Data Analyses](#data-analyses)
- [Dashboard Features](#dashboard-features)
- [Output Files](#output-files)
- [Technical Details](#technical-details)
- [Neo4j ETL & Analysis Pipeline](#neo4j-etl--analysis-pipeline)
## Overview

This project implements a complete data pipeline for the Hetionet knowledge graph, consisting of:
## Installation

### Clone or Download Project Files
```bash
# Create project directory
mkdir hetionet_analysis
cd hetionet_analysis
```
### (Optional) Docker Setup for Dashboard

You can run the Streamlit dashboard in a Docker container for easier deployment.

A Dockerfile like the following is already present in the project root:
```dockerfile
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

WORKDIR /app

COPY . /app

RUN pip install --upgrade pip \
    && pip install streamlit pandas plotly networkx neo4j

EXPOSE 8501

CMD ["bash", "-c", "python etl.py || true && exec streamlit run dashboard.py --server.port=8501 --server.address=0.0.0.0"]
```
Build the Docker image:

```bash
docker build -t etl-dashboard .
```

Run the ETL and dashboard in Docker:

```bash
docker run -p 8501:8501 etl-dashboard
```

## Non-Docker Usage
### 1. Set Up Python Environment

```bash
# Create virtual environment
python3 -m venv etl_projekt
source etl_projekt/bin/activate

# Install dependencies
pip install pandas streamlit plotly networkx
```
### 2. Download Hetionet Data

Download `hetionet-v1.0.json` from [Hetionet GitHub](https://github.com/hetio/hetionet) and place it in the project directory.

### 3. Add Project Files

Place the following files in your project directory:

- `etl.py` - Main ETL script
- `dashboard.py` - Streamlit dashboard
## Usage

### 4. Run ETL Pipeline

Execute the ETL pipeline to process the Hetionet data:
```bash
python etl.py
```

**Expected Runtime**: ~1 minute

**Output**: Creates a `neo4j_csv/` directory with CSV files
### 5. Launch Dashboard

Run the dashboard directly with Python:

```bash
streamlit run dashboard.py
```
The dashboard will automatically open in your web browser at `http://localhost:8501`.

## Explore Data

Navigate through the dashboard using the sidebar menu:
```
hetionet_analysis/
├── neo4j_csv/              # Generated output directory
│   ├── nodes_*.csv         # Node files by type (11 files)
│   ├── edges_all.csv       # All relationships
│   ├── edges_*.csv         # Relationships split by type for the Neo4j import
│   ├── analysis_*.csv      # Analysis results (6 files)
│   ├── network_nodes.csv   # Network visualization nodes
│   └── network_edges.csv   # Network visualization edges
```
```python
edges_df['target'] = edges_df['target'].astype(str)
```

This prevents type mismatch errors when joining dataframes.
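A minimal illustration with toy frames (the ids and names below are invented for the example, not taken from the project's data):

```python
import pandas as pd

# Toy frames: node ids are strings, edge endpoints start out as ints
nodes = pd.DataFrame({"id": ["1", "2"], "name": ["TP53", "GPX1"]})
edges = pd.DataFrame({"source": [1, 2], "target": [2, 1]})

# Casting both endpoint columns to str lets them join cleanly against
# the string node ids; mixed int/str keys would otherwise match nothing
# or raise, depending on the pandas version.
edges["source"] = edges["source"].astype(str)
edges["target"] = edges["target"].astype(str)

merged = edges.merge(nodes, left_on="source", right_on="id")
print(len(merged))  # 2
```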
### Memory Management

Peak memory usage: ~2GB during ETL processing

**Optimization strategies**:

- Process data in chunks where possible
- Drop intermediate dataframes after use
- Use generators for large iterations
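The chunk/generator strategy can be sketched as follows (an illustrative helper, not code from the project files):

```python
def iter_chunks(items, size):
    """Yield successive chunks so only one chunk is held in memory at a time."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Stream 10 items in chunks of 4 -> chunk sizes 4, 4, 2
sizes = [len(c) for c in iter_chunks(range(10), 4)]
print(sizes)  # [4, 4, 2]
```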
### Edge Direction Conventions

Hetionet uses directional relationships. Key conventions:
The current implementation handles Hetionet v1.0 (47K nodes, 2.2M edges).

For larger datasets:

- Implement chunked CSV reading
- Use a database backend (PostgreSQL, Neo4j)
- Parallelize analyses with multiprocessing
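Chunked CSV reading can be sketched with pandas' `chunksize` parameter, shown here on a small in-memory sample rather than the real `edges_all.csv`:

```python
import io

import pandas as pd

# Stand-in for neo4j_csv/edges_all.csv; the real file has ~2.2M rows
sample = io.StringIO("source,target\na,b\na,c\nb,c\n")

# Stream the file in fixed-size chunks instead of loading it whole
out_degree = {}
for chunk in pd.read_csv(sample, chunksize=2):
    for s in chunk["source"]:
        out_degree[s] = out_degree.get(s, 0) + 1

print(out_degree)  # {'a': 2, 'b': 1}
```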
## Troubleshooting

### Common Issues

**Issue**: "FileNotFoundError: hetionet-v1.0.json"
**Solution**: Download the Hetionet data and place it in the project directory

**Issue**: "Module not found"
**Solution**: Ensure the virtual environment is activated and dependencies are installed

**Issue**: Dashboard shows "No data available"
**Solution**: Run the ETL pipeline first to generate the CSV files

**Issue**: "Memory Error" during ETL
**Solution**: Close other applications or increase system RAM

### Data Quality

The analyses depend on Hetionet data quality. Known limitations:
- Gene-disease associations vary in evidence strength
- Network is not exhaustive of all biomedical knowledge
## Neo4j ETL & Analysis Pipeline

This repository includes a script for executing analysis queries on the dataset in a Neo4j database.

### Neo4j Prerequisites

Ensure that the following components are installed and ready to use:

- **Neo4j Desktop:** A local database instance must be created.
- **Python 3.x:** Installed on your system.
- **Python Driver:** Install the official Neo4j driver via pip in the virtual environment you created earlier:

```bash
pip install neo4j
```
---

## Workflow Steps

Follow these steps exactly in the order provided:
### 1. Start Neo4j Database

Open **Neo4j Desktop**.
Select your project and click **Start** on the corresponding database.
The database must be active before proceeding to the next steps.
### 2. Copy CSV Files

After your ETL process has generated the CSV files, they must be copied into the Neo4j import directory.
**Locating the path:** In Neo4j Desktop, click `Open Folder` -> `Import`.
Copy all CSV files from your ETL output into this folder.
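The copy step can also be scripted; the sketch below assumes a helper named `copy_csvs`, and the actual import path depends on your Neo4j Desktop installation:

```python
import shutil
from pathlib import Path

def copy_csvs(src_dir, import_dir):
    """Copy every CSV from the ETL output into the Neo4j import folder."""
    src, dst = Path(src_dir), Path(import_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for csv_file in sorted(src.glob("*.csv")):
        shutil.copy(csv_file, dst / csv_file.name)
    return len(list(dst.glob("*.csv")))

# e.g. copy_csvs("neo4j_csv", "/path/to/neo4j/import")
```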
### 3. Data Import via Cypher

Navigate to the folder `neo4jqueries/loadingQueriesNeo4j`.
Execute the Cypher scripts contained there in the **Neo4j Browser**.
These scripts load the data from the import folder and create the nodes and relationships in the graph.
### 4. Execute Python Analysis

Start the analysis script from your terminal:

```bash
python neo4j_etl.py
```

**Input:** The script will prompt you in turn for your **database username** (default: `neo4j`) and your **password**.
**Processing:** The script automatically reads all queries from the `neo4jqueries/analysis_queries` directory.
**Output:** The results of the analysis queries are printed directly to the console.
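How the script collects the query files might look like the sketch below; `load_queries` and the `.cypher` extension are assumptions, and the actual `neo4j_etl.py` may differ:

```python
from pathlib import Path

def load_queries(directory):
    """Return the text of every .cypher file in the directory, sorted by file name."""
    return [p.read_text() for p in sorted(Path(directory).glob("*.cypher"))]

# e.g. for query in load_queries("neo4jqueries/analysis_queries"): session.run(query)
```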
---

### Project Structure

| Directory / File                    | Purpose                                                |
| :---------------------------------- | :----------------------------------------------------- |
| `neo4j_etl.py`                      | The Python script that runs the analysis queries.      |
| `neo4jqueries/loadingQueriesNeo4j/` | Contains all Cypher files for the initial data import. |
| `neo4jqueries/analysis_queries/`    | Contains Cypher files for the statistical analyses.    |

---
## Future Enhancements

Potential extensions to this project:

1. **Machine Learning**: Predictive models for drug efficacy
2. **Temporal Analysis**: Track knowledge graph changes over time
3. **API Development**: REST API for programmatic access
4. **Cloud Deployment**: AWS/GCP hosting for web access
5. **Additional Data Sources**: Integrate DrugBank, KEGG, etc.