DuckDB 1.4.4 + Data Science Stack (Ubuntu 24.04) AMI Administrator Guide
1. Quick Start Information
Connection Methods:
- Access the instance via SSH using the
ubuntuuser. Usesudoto run commands requiring root privileges. To switch to the root user, usesudo su - root.
Install Information:
- OS: Ubuntu 24.04 LTS
- DuckDB version: 1.4.4
- Python version: 3.12 (system default)
- Virtual Environment Location:
/opt/duckdb-env - Environment Configuration:
/etc/profile.d/duckdb_env.sh - Python Binary Path:
/opt/duckdb-env/bin/python - pip Binary Path:
/opt/duckdb-env/bin/pip
Installed Python Packages:
- DuckDB:
duckdb(v1.4.4) - In-process analytical database - Pandas:
pandas- Data manipulation and analysis library - NumPy:
numpy- Numerical computing library
Quick Verification Commands:
- Check Python path:
which python - Check DuckDB version:
python -c "import duckdb; print(duckdb.__version__)" - Check Pandas version:
python -c "import pandas; print(pandas.__version__)" - Check NumPy version:
python -c "import numpy; print(numpy.__version__)"
Firewall Configuration:
- Please allow SSH port 22.
- For security, it is recommended to limit SSH access to trusted IPs only.
2. First Launch & Verification
Step 1: Connect to Your Instance
- Launch your instance in your cloud provider's console (e.g., AWS EC2)
- Ensure SSH port 22 is allowed in your security group
- Connect via SSH:
ssh -i your-key.pem ubuntu@YOUR_PUBLIC_IP
Step 2: Verify Python Environment
Check the Python path (should point to virtual environment):
which python
Expected Output:
/opt/duckdb-env/bin/python
Check Python version:
python --version
Expected Output:
Python 3.12.x
Step 3: Verify DuckDB Installation
Test DuckDB import and version:
python -c "import duckdb; print(f'Success! DuckDB version: {duckdb.__version__}')"
Expected Output:
Success! DuckDB version: 1.4.4
Step 4: Verify Data Science Libraries
Test Pandas:
python -c "import pandas; print(f'Pandas version: {pandas.__version__}')"
Test NumPy:
python -c "import numpy; print(f'NumPy version: {numpy.__version__}')"
Step 5: Test DuckDB Query
Run a simple SQL query:
python << 'EOF'
import duckdb
# Create an in-memory database
conn = duckdb.connect(':memory:')
# Create a sample table
conn.execute("CREATE TABLE users (id INT, name VARCHAR, age INT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 30), (2, 'Bob', 25), (3, 'Charlie', 35)")
# Query the table
result = conn.execute("SELECT * FROM users WHERE age > 25").fetchall()
print("Query Results:")
for row in result:
print(row)
conn.close()
EOF
Expected Output:
Query Results:
(1, 'Alice', 30)
(3, 'Charlie', 35)
How This Works:
- DuckDB creates an in-memory database (
:memory:) - SQL commands work just like in PostgreSQL or MySQL
- DuckDB is optimized for analytical queries (OLAP)
- No separate database server needed - DuckDB runs in-process
3. Architecture & Detailed Configuration
This AMI uses a Python virtual environment approach to isolate DuckDB and data science packages from the system Python installation, ensuring stability and ease of maintenance.
Installation Architecture:
[System Python 3.12]
↓
[python3 -m venv /opt/duckdb-env]
↓
[Virtual Environment Created]
↓
├── /opt/duckdb-env/bin/python → Isolated Python interpreter
├── /opt/duckdb-env/bin/pip → Isolated package manager
└── /opt/duckdb-env/lib/python3.12/site-packages/
├── duckdb/ → DuckDB library
├── pandas/ → Pandas library
└── numpy/ → NumPy library
↓
[Environment Configuration]
↓
/etc/profile.d/duckdb_env.sh → Adds /opt/duckdb-env/bin to PATH
↓
[Automatic Activation]
↓
All users automatically use DuckDB environment on login
Key Design Decisions:
- Virtual Environment Isolation: Packages installed in
/opt/duckdb-envdon't affect system Python - Automatic PATH Configuration: Environment activates automatically via
/etc/profile.d/ - System-Wide Availability: All users (root, ubuntu, etc.) can use DuckDB without additional setup
- No conda/miniconda: Using venv (built-in to Python) keeps the AMI lightweight
3.1. Environment Configuration File
File Location: /etc/profile.d/duckdb_env.sh
File Content:
# === AWS Marketplace AMI Configuration ===
# Automatically add the Data Science environment to PATH
export PATH="/opt/duckdb-env/bin:$PATH"
How This Works:
- This file is automatically sourced during user login
- Prepends
/opt/duckdb-env/binto the PATH environment variable - Ensures
pythonandpipcommands resolve to the virtual environment binaries - Works for all users (root, ubuntu, custom users)
Why This Approach?
- Transparent: Users don't need to manually activate the environment
- Persistent: Configuration survives reboots and SSH reconnections
- Standard: Uses Linux standard directory
/etc/profile.d/for environment setup
3.2. Virtual Environment Structure
Directory Layout:
/opt/duckdb-env/
├── bin/
│ ├── python → Python 3.12 interpreter
│ ├── python3 → Symlink to python
│ ├── pip → Package installer
│ └── activate → Manual activation script (optional)
├── include/
│ └── python3.12/ → C header files
├── lib/
│ └── python3.12/
│ └── site-packages/
│ ├── duckdb/ → DuckDB library
│ ├── pandas/ → Pandas library
│ ├── numpy/ → NumPy library
│ └── ... → Other dependencies
└── pyvenv.cfg → Virtual environment configuration
Installed Packages:
| Package | Version | Purpose |
|---|---|---|
| duckdb | 1.4.4 | In-process analytical SQL database |
| pandas | Latest | Data manipulation (DataFrames) |
| numpy | Latest | Numerical arrays and math operations |
| pip | Latest | Python package installer |
4. How-To-Create: Reproduce This Environment
This section explains how this AMI was built, allowing you to reproduce the installation on any Ubuntu 24.04 system.
Step 1: Install System Dependencies
Purpose: Install Python virtual environment module and pip (not included by default in Ubuntu 24.04).
# Update package index
sudo apt update
# Install Python venv module and pip
sudo apt install -y python3-venv python3-pip
How This Works:
python3-venv: Provides thevenvmodule for creating virtual environmentspython3-pip: Installs pip (Python package installer) for the system Python
Why These Dependencies?
Ubuntu 24.04 minimal installation does not include venv or pip by default to keep the base image small. These packages are required to create isolated Python environments.
Step 2: Create Virtual Environment
Purpose: Create an isolated Python environment at /opt/duckdb-env.
# Create virtual environment
sudo python3 -m venv /opt/duckdb-env
How This Works:
python3 -m venv: Invokes the venv module to create a new virtual environment/opt/duckdb-env: Target directory for the virtual environmentsudo: Required because/opt/is a system directory
What Gets Created:
- Independent Python interpreter
- Isolated site-packages directory
- Activation scripts (though we don't use them due to auto-activation)
Directory Location Rationale:
/opt/: Standard location for third-party software on Linux systemsduckdb-env: Descriptive name indicating it's a DuckDB environment
Step 3: Upgrade pip and Install Packages
Purpose: Install DuckDB and data science libraries into the virtual environment.
# Upgrade pip to latest version
sudo /opt/duckdb-env/bin/pip install --upgrade pip
# Install DuckDB, Pandas, and NumPy
sudo /opt/duckdb-env/bin/pip install duckdb pandas numpy
How This Works:
/opt/duckdb-env/bin/pip: Uses the virtual environment's pip (not system pip)--upgrade pip: Ensures we have the latest pip versionduckdb pandas numpy: Installs all three packages in one commandsudo: Required to write to/opt/duckdb-env/
Why Upgrade pip First?
Older pip versions may have compatibility issues or security vulnerabilities. Upgrading ensures smooth installation of modern packages.
Package Dependencies:
duckdb: Self-contained, no heavy dependenciespandas: Depends on numpy, so numpy is automatically updated if needednumpy: Core numerical library, used by pandas and duckdb
Step 4: Configure Automatic Environment Activation
Purpose: Make the DuckDB environment available to all users automatically.
# Create environment configuration script
sudo tee /etc/profile.d/duckdb_env.sh > /dev/null << 'EOF'
# === AWS Marketplace AMI Configuration ===
# Automatically add the Data Science environment to PATH
export PATH="/opt/duckdb-env/bin:$PATH"
EOF
# Make the script executable
sudo chmod +x /etc/profile.d/duckdb_env.sh
How This Works:
tee /etc/profile.d/duckdb_env.sh: Creates a new file with specified content/etc/profile.d/: Standard directory for system-wide environment configurationexport PATH="...": Prepends virtual environment binaries to PATHchmod +x: Makes the script executable (required for sourcing)
Why /etc/profile.d/?
- Files in this directory are automatically sourced during login
- Applies to all users (root, ubuntu, custom users)
- Survives system reboots and package updates
Alternative Approaches (Not Used):
- Manual activation with
source activate: Requires user action every login - User-level
.bashrc: Only affects specific users, not system-wide - System Python site-packages: Pollutes system Python, risky on upgrades
5. Using the DuckDB Environment
5.1. Interactive Python REPL
Start Python REPL:
python
Example Session:
>>> import duckdb
>>> conn = duckdb.connect(':memory:')
>>> conn.execute("SELECT 'Hello from DuckDB!' AS message").fetchone()
('Hello from DuckDB!',)
>>> conn.close()
>>> exit()
5.2. Querying CSV Files Directly
DuckDB can query CSV files without loading them into a table:
# Create a sample CSV file
cat > data.csv << 'EOF'
id,name,age,city
1,Alice,30,New York
2,Bob,25,San Francisco
3,Charlie,35,Seattle
4,Diana,28,Boston
EOF
# Query the CSV directly
python << 'EOF'
import duckdb
conn = duckdb.connect(':memory:')
# Query CSV file directly
result = conn.execute("""
SELECT city, COUNT(*) as count, AVG(age) as avg_age
FROM 'data.csv'
GROUP BY city
ORDER BY avg_age DESC
""").fetchall()
print("City Statistics:")
for row in result:
print(f"City: {row[0]}, Count: {row[1]}, Avg Age: {row[2]:.1f}")
conn.close()
EOF
Expected Output:
City Statistics:
City: Seattle, Count: 1, Avg Age: 35.0
City: New York, Count: 1, Avg Age: 30.0
City: Boston, Count: 1, Avg Age: 28.0
City: San Francisco, Count: 1, Avg Age: 25.0
5.3. Working with Pandas DataFrames
Convert between DuckDB and Pandas:
python << 'EOF'
import duckdb
import pandas as pd
# Create a Pandas DataFrame
df = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'price': [1200, 25, 75, 300],
'quantity': [5, 50, 30, 10]
})
# Query Pandas DataFrame with SQL
conn = duckdb.connect(':memory:')
result = conn.execute("""
SELECT product, price, quantity, (price * quantity) as total_value
FROM df
WHERE price > 50
ORDER BY total_value DESC
""").fetchdf()
print("High-Value Products:")
print(result)
conn.close()
EOF
Expected Output:
High-Value Products:
product price quantity total_value
0 Laptop 1200 5 6000
1 Monitor 300 10 3000
2 Keyboard 75 30 2250
5.4. Persistent Database Files
Create a persistent database:
python << 'EOF'
import duckdb
# Create a persistent database file
conn = duckdb.connect('my_database.duckdb')
# Create table and insert data
conn.execute("CREATE TABLE sales (date DATE, amount DECIMAL(10,2), product VARCHAR)")
conn.execute("INSERT INTO sales VALUES ('2026-01-01', 150.50, 'Widget')")
conn.execute("INSERT INTO sales VALUES ('2026-01-02', 200.75, 'Gadget')")
# Query and display
result = conn.execute("SELECT * FROM sales ORDER BY date").fetchall()
for row in result:
print(row)
conn.close()
print("Database saved to my_database.duckdb")
EOF
# Verify database file was created
ls -lh my_database.duckdb
Database File Features:
- Portable: Can be copied to other systems
- Compressed: Efficient storage format
- Multi-threaded: Supports concurrent reads
- ACID compliant: Transactional safety
5.5. Exporting Query Results
Export to CSV:
python << 'EOF'
import duckdb
conn = duckdb.connect(':memory:')
# Create sample data
conn.execute("CREATE TABLE employees (name VARCHAR, salary INT, department VARCHAR)")
conn.execute("INSERT INTO employees VALUES ('Alice', 80000, 'Engineering')")
conn.execute("INSERT INTO employees VALUES ('Bob', 65000, 'Marketing')")
conn.execute("INSERT INTO employees VALUES ('Charlie', 90000, 'Engineering')")
# Export to CSV
conn.execute("COPY (SELECT * FROM employees WHERE department = 'Engineering') TO 'engineers.csv' (HEADER, DELIMITER ',')")
conn.close()
print("Exported to engineers.csv")
EOF
# View exported file
cat engineers.csv
Expected Output:
name,salary,department
Alice,80000,Engineering
Charlie,90000,Engineering
5.6. Installing Additional Packages
Add more Python libraries:
# Install a new package (e.g., matplotlib for plotting)
sudo /opt/duckdb-env/bin/pip install matplotlib
# Verify installation
python -c "import matplotlib; print('Matplotlib installed successfully')"
Recommended Data Science Packages:
scikit-learn: Machine learning librarymatplotlib/seaborn: Data visualizationsqlalchemy: SQL toolkit (can use DuckDB as backend)jupyter: Interactive notebooks
6. Important File Locations
| File Path | Purpose |
|---|---|
/opt/duckdb-env/ | Virtual environment root directory |
/opt/duckdb-env/bin/python | Python interpreter (virtual environment) |
/opt/duckdb-env/bin/pip | Package installer (virtual environment) |
/opt/duckdb-env/lib/python3.12/site-packages/ | Installed Python packages directory |
/etc/profile.d/duckdb_env.sh | Environment activation script |
~/.cache/pip/ | User-specific pip cache (safe to delete) |
/root/.cache/pip/ | Root user pip cache (safe to delete) |
*.duckdb | DuckDB database files (user-created) |
*.csv | Data files (user-created) |
7. Troubleshooting
Issue 1: Python Command Not Found
Symptoms:
$ python --version
-bash: python: command not found
Diagnosis:
Environment configuration not loaded.
Solution:
Reload environment configuration:
source /etc/profile.d/duckdb_env.sh
Or log out and log back in:
exit
ssh -i your-key.pem ubuntu@YOUR_PUBLIC_IP
Verify:
which python
# Should output: /opt/duckdb-env/bin/python
Issue 2: ImportError: No module named 'duckdb'
Symptoms:
$ python -c "import duckdb"
ImportError: No module named 'duckdb'
Diagnosis:
Using wrong Python interpreter (system Python instead of virtual environment).
Solution:
Check Python path:
which python
# Expected: /opt/duckdb-env/bin/python
# If different: /usr/bin/python or /usr/bin/python3
If using wrong Python:
# Reload environment
source /etc/profile.d/duckdb_env.sh
# Verify again
which python
python -c "import duckdb; print('DuckDB works!')"
Issue 3: Permission Denied When Installing Packages
Symptoms:
$ pip install some-package
error: could not create '/opt/duckdb-env/lib/python3.12/site-packages/...': Permission denied
Diagnosis:
Virtual environment is owned by root, regular users cannot write to it.
Solution:
Use sudo for installing system-wide packages:
sudo /opt/duckdb-env/bin/pip install some-package
Alternative (User-Level Install):
pip install --user some-package
Note: User-level installs go to ~/.local/lib/python3.12/site-packages/, not the virtual environment.
Issue 4: DuckDB Database File Locked
Symptoms:
$ python -c "import duckdb; duckdb.connect('my_database.duckdb')"
duckdb.IOException: Could not set lock on file "my_database.duckdb": Resource temporarily unavailable
Diagnosis:
Another process is using the database file.
Solution:
Find processes using the database:
lsof | grep my_database.duckdb
Kill the process:
kill <PID>
Or wait for the other process to release the lock.
Prevention:
Always close connections:
conn = duckdb.connect('my_database.duckdb')
# ... do work ...
conn.close() # Always close!
Issue 5: Out of Memory Errors
Symptoms:
$ python -c "import duckdb; ..."
MemoryError: std::bad_alloc
Diagnosis:
Query requires more memory than available.
Solution 1: Check Available Memory:
free -h
Solution 2: Limit DuckDB Memory Usage:
import duckdb
conn = duckdb.connect(':memory:')
conn.execute("SET memory_limit='2GB'") # Limit to 2 GB
Solution 3: Use Persistent Database:
# Use disk-based database instead of :memory:
conn = duckdb.connect('my_database.duckdb')
Solution 4: Process Data in Chunks:
# Use LIMIT and OFFSET to process data in batches
for offset in range(0, 1000000, 10000):
result = conn.execute(f"SELECT * FROM large_table LIMIT 10000 OFFSET {offset}").fetchall()
# Process batch
Issue 6: Pandas Version Conflict
Symptoms:
$ python -c "import duckdb, pandas"
ImportError: ... incompatible with this version of pandas
Diagnosis:
DuckDB and Pandas versions are incompatible.
Solution:
Upgrade both packages:
sudo /opt/duckdb-env/bin/pip install --upgrade duckdb pandas
Verify versions:
python -c "import duckdb, pandas; print(f'DuckDB: {duckdb.__version__}, Pandas: {pandas.__version__}')"
8. Final Notes
Key Takeaways
- DuckDB 1.4.4 is installed in an isolated Python virtual environment at
/opt/duckdb-env - Automatic activation via
/etc/profile.d/duckdb_env.shensures seamless user experience - Data science stack includes Pandas and NumPy for comprehensive data analysis
- System Python is untouched, ensuring system stability and upgrade safety
- The installation is production-ready and AMI-optimized (cleaned caches)
DuckDB Use Cases
- Data Analysis: Query CSV, Parquet, JSON files without loading into memory
- ETL Pipelines: Transform data with SQL before loading into data warehouses
- Embedded Analytics: In-process database for applications (no separate server)
- Data Science Workflows: Combine SQL queries with Pandas DataFrames
- Prototyping: Fast analytical queries for data exploration
Additional Resources
- Official DuckDB Documentation: https://duckdb.org/docs/
- DuckDB Python API: https://duckdb.org/docs/api/python/overview
- Pandas Documentation: https://pandas.pydata.org/docs/
- NumPy Documentation: https://numpy.org/doc/
- DuckDB SQL Reference: https://duckdb.org/docs/sql/introduction
Support
If you encounter issues not covered in this guide:
- Check the troubleshooting section above
- Review DuckDB official documentation
- Verify your security group and firewall settings
- Check system logs:
sudo journalctl -xe
For support or questions, please contact the Easycloud team.