DuckDB 1.4.4 + Data Science Stack (Ubuntu 24.04) AMI Administrator Guide

1. Quick Start Information

Connection Methods:

  • Access the instance via SSH using the ubuntu user. Use sudo to run commands requiring root privileges. To switch to the root user, use sudo su - root.

Install Information:

  • OS: Ubuntu 24.04 LTS
  • DuckDB version: 1.4.4
  • Python version: 3.12 (system default)
  • Virtual Environment Location: /opt/duckdb-env
  • Environment Configuration: /etc/profile.d/duckdb_env.sh
  • Python Binary Path: /opt/duckdb-env/bin/python
  • pip Binary Path: /opt/duckdb-env/bin/pip

Installed Python Packages:

  • DuckDB: duckdb (v1.4.4) - In-process analytical database
  • Pandas: pandas - Data manipulation and analysis library
  • NumPy: numpy - Numerical computing library

Quick Verification Commands:

  • Check Python path: which python
  • Check DuckDB version: python -c "import duckdb; print(duckdb.__version__)"
  • Check Pandas version: python -c "import pandas; print(pandas.__version__)"
  • Check NumPy version: python -c "import numpy; print(numpy.__version__)"

Firewall Configuration:

  • Allow inbound SSH traffic on TCP port 22.
  • For security, restrict SSH access to trusted IP addresses only.

2. First Launch & Verification

Step 1: Connect to Your Instance

  1. Launch your instance in your cloud provider's console (e.g., AWS EC2)
  2. Ensure SSH port 22 is allowed in your security group
  3. Connect via SSH:
    ssh -i your-key.pem ubuntu@YOUR_PUBLIC_IP

Step 2: Verify Python Environment

Check the Python path (should point to virtual environment):

which python

Expected Output:

/opt/duckdb-env/bin/python

Check Python version:

python --version

Expected Output:

Python 3.12.x

Step 3: Verify DuckDB Installation

Test DuckDB import and version:

python -c "import duckdb; print(f'Success! DuckDB version: {duckdb.__version__}')"

Expected Output:

Success! DuckDB version: 1.4.4

Step 4: Verify Data Science Libraries

Test Pandas:

python -c "import pandas; print(f'Pandas version: {pandas.__version__}')"

Test NumPy:

python -c "import numpy; print(f'NumPy version: {numpy.__version__}')"

Step 5: Test DuckDB Query

Run a simple SQL query:

python << 'EOF'
import duckdb

# Create an in-memory database
conn = duckdb.connect(':memory:')

# Create a sample table
conn.execute("CREATE TABLE users (id INT, name VARCHAR, age INT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 30), (2, 'Bob', 25), (3, 'Charlie', 35)")

# Query the table
result = conn.execute("SELECT * FROM users WHERE age > 25").fetchall()
print("Query Results:")
for row in result:
    print(row)

conn.close()
EOF

Expected Output:

Query Results:
(1, 'Alice', 30)
(3, 'Charlie', 35)

How This Works:

  • DuckDB creates an in-memory database (:memory:)
  • SQL commands work just like in PostgreSQL or MySQL
  • DuckDB is optimized for analytical queries (OLAP)
  • No separate database server needed - DuckDB runs in-process

3. Architecture & Detailed Configuration

This AMI uses a Python virtual environment approach to isolate DuckDB and data science packages from the system Python installation, ensuring stability and ease of maintenance.

Installation Architecture:

[System Python 3.12]

[python3 -m venv /opt/duckdb-env]

[Virtual Environment Created]

├── /opt/duckdb-env/bin/python → Isolated Python interpreter
├── /opt/duckdb-env/bin/pip → Isolated package manager
└── /opt/duckdb-env/lib/python3.12/site-packages/
    ├── duckdb/ → DuckDB library
    ├── pandas/ → Pandas library
    └── numpy/ → NumPy library

[Environment Configuration]

/etc/profile.d/duckdb_env.sh → Adds /opt/duckdb-env/bin to PATH

[Automatic Activation]

All users automatically use the DuckDB environment on login

Key Design Decisions:

  1. Virtual Environment Isolation: Packages installed in /opt/duckdb-env don't affect system Python
  2. Automatic PATH Configuration: Environment activates automatically via /etc/profile.d/
  3. System-Wide Availability: All users (root, ubuntu, etc.) can use DuckDB without additional setup
  4. No conda/miniconda: Using venv (built into Python) keeps the AMI lightweight
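
A minimal sketch to confirm decision 1 on a running instance: inside a virtual environment, sys.prefix points at the environment root while sys.base_prefix still points at the system Python.

python << 'EOF'
import sys

# Interpreter actually in use (expected: /opt/duckdb-env/bin/python)
print("executable:", sys.executable)

# Inside a venv, prefix (the venv root) differs from base_prefix (system Python)
print("venv root:     ", sys.prefix)
print("system python: ", sys.base_prefix)
print("inside venv:   ", sys.prefix != sys.base_prefix)
EOF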

3.1. Environment Configuration File

File Location: /etc/profile.d/duckdb_env.sh

File Content:

# === AWS Marketplace AMI Configuration ===
# Automatically add the Data Science environment to PATH
export PATH="/opt/duckdb-env/bin:$PATH"

How This Works:

  • This file is automatically sourced during user login
  • Prepends /opt/duckdb-env/bin to the PATH environment variable
  • Ensures python and pip commands resolve to the virtual environment binaries
  • Works for all users (root, ubuntu, custom users)

Why This Approach?

  • Transparent: Users don't need to manually activate the environment
  • Persistent: Configuration survives reboots and SSH reconnections
  • Standard: Uses Linux standard directory /etc/profile.d/ for environment setup
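
The following sketch shows the mechanism at work and assumes only the file locations documented above. shutil.which resolves a command the same way the shell does, so it should return the virtual environment binaries once the profile script has been sourced.

python << 'EOF'
import os
import shutil

# /opt/duckdb-env/bin should appear at the front of PATH
print("PATH head:", os.environ["PATH"].split(":")[:3])

# Commands resolve to the virtual environment binaries
print("python resolves to:", shutil.which("python"))
print("pip resolves to:   ", shutil.which("pip"))
EOF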

3.2. Virtual Environment Structure

Directory Layout:

/opt/duckdb-env/
├── bin/
│   ├── python → Python 3.12 interpreter
│   ├── python3 → Symlink to python
│   ├── pip → Package installer
│   └── activate → Manual activation script (optional)
├── include/
│   └── python3.12/ → C header files
├── lib/
│   └── python3.12/
│       └── site-packages/
│           ├── duckdb/ → DuckDB library
│           ├── pandas/ → Pandas library
│           ├── numpy/ → NumPy library
│           └── ... → Other dependencies
└── pyvenv.cfg → Virtual environment configuration

Installed Packages:

Package    Version    Purpose
duckdb     1.4.4      In-process analytical SQL database
pandas     Latest     Data manipulation (DataFrames)
numpy      Latest     Numerical arrays and math operations
pip        Latest     Python package installer
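
The pandas, numpy, and pip versions depend on when the AMI was built. A small sketch using only the standard library prints the versions actually installed in the environment:

python << 'EOF'
from importlib.metadata import version

# Print the installed version of each package shipped with the AMI
for pkg in ("duckdb", "pandas", "numpy", "pip"):
    print(f"{pkg}: {version(pkg)}")
EOF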

4. How-To-Create: Reproduce This Environment

This section explains how this AMI was built, allowing you to reproduce the installation on any Ubuntu 24.04 system.

Step 1: Install System Dependencies

Purpose: Install Python virtual environment module and pip (not included by default in Ubuntu 24.04).

# Update package index
sudo apt update

# Install Python venv module and pip
sudo apt install -y python3-venv python3-pip

How This Works:

  • python3-venv: Provides the venv module for creating virtual environments
  • python3-pip: Installs pip (Python package installer) for the system Python

Why These Dependencies?

Ubuntu 24.04 minimal installation does not include venv or pip by default to keep the base image small. These packages are required to create isolated Python environments.

Step 2: Create Virtual Environment

Purpose: Create an isolated Python environment at /opt/duckdb-env.

# Create virtual environment
sudo python3 -m venv /opt/duckdb-env

How This Works:

  • python3 -m venv: Invokes the venv module to create a new virtual environment
  • /opt/duckdb-env: Target directory for the virtual environment
  • sudo: Required because /opt/ is a system directory

What Gets Created:

  • Independent Python interpreter
  • Isolated site-packages directory
  • Activation scripts (though we don't use them due to auto-activation)

Directory Location Rationale:

  • /opt/: Standard location for third-party software on Linux systems
  • duckdb-env: Descriptive name indicating it's a DuckDB environment
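
To check that Step 2 produced a valid environment, you can inspect pyvenv.cfg, which venv writes into the environment root. A minimal sketch, assuming the default location used above:

python << 'EOF'
from pathlib import Path

# pyvenv.cfg records the base interpreter and the venv options
cfg = Path("/opt/duckdb-env/pyvenv.cfg")
if cfg.exists():
    print(cfg.read_text())
else:
    print("Virtual environment not found at /opt/duckdb-env")
EOF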

Step 3: Upgrade pip and Install Packages

Purpose: Install DuckDB and data science libraries into the virtual environment.

# Upgrade pip to latest version
sudo /opt/duckdb-env/bin/pip install --upgrade pip

# Install DuckDB, Pandas, and NumPy
sudo /opt/duckdb-env/bin/pip install duckdb pandas numpy

How This Works:

  • /opt/duckdb-env/bin/pip: Uses the virtual environment's pip (not system pip)
  • --upgrade pip: Ensures we have the latest pip version
  • duckdb pandas numpy: Installs all three packages in one command
  • sudo: Required to write to /opt/duckdb-env/

Why Upgrade pip First?

Older pip versions may have compatibility issues or security vulnerabilities. Upgrading ensures smooth installation of modern packages.

Package Dependencies:

  • duckdb: Self-contained, no heavy dependencies
  • pandas: Depends on numpy, so numpy is installed automatically if it is not already present
  • numpy: Core numerical library, used by pandas and by DuckDB's DataFrame integration
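
As a quick smoke test that all three packages work together after Step 3 (a sketch, not part of the original build), convert a DuckDB result into a Pandas DataFrame, which exercises NumPy under the hood:

python << 'EOF'
import duckdb

# .df() materializes the query result as a Pandas DataFrame backed by NumPy arrays
df = duckdb.sql("SELECT 42 AS answer").df()
print(type(df))
print(df.dtypes)
EOF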

Step 4: Configure Automatic Environment Activation

Purpose: Make the DuckDB environment available to all users automatically.

# Create environment configuration script
sudo tee /etc/profile.d/duckdb_env.sh > /dev/null << 'EOF'
# === AWS Marketplace AMI Configuration ===
# Automatically add the Data Science environment to PATH
export PATH="/opt/duckdb-env/bin:$PATH"
EOF

# Make the script executable
sudo chmod +x /etc/profile.d/duckdb_env.sh

How This Works:

  • tee /etc/profile.d/duckdb_env.sh: Creates a new file with the specified content
  • /etc/profile.d/: Standard directory for system-wide environment configuration
  • export PATH="...": Prepends the virtual environment binaries to PATH
  • chmod +x: Marks the script executable; not strictly required for sourcing, but conventional for scripts in /etc/profile.d/

Why /etc/profile.d/?

  • Files in this directory are automatically sourced during login
  • Applies to all users (root, ubuntu, custom users)
  • Survives system reboots and package updates

Alternative Approaches (Not Used):

  • Manual activation with source /opt/duckdb-env/bin/activate: Requires user action at every login
  • User-level .bashrc: Only affects specific users, not system-wide
  • System Python site-packages: Pollutes system Python, risky on upgrades

5. Using the DuckDB Environment

5.1. Interactive Python REPL

Start Python REPL:

python

Example Session:

>>> import duckdb
>>> conn = duckdb.connect(':memory:')
>>> conn.execute("SELECT 'Hello from DuckDB!' AS message").fetchone()
('Hello from DuckDB!',)
>>> conn.close()
>>> exit()

5.2. Querying CSV Files Directly

DuckDB can query CSV files without loading them into a table:

# Create a sample CSV file
cat > data.csv << 'EOF'
id,name,age,city
1,Alice,30,New York
2,Bob,25,San Francisco
3,Charlie,35,Seattle
4,Diana,28,Boston
EOF

# Query the CSV directly
python << 'EOF'
import duckdb

conn = duckdb.connect(':memory:')

# Query CSV file directly
result = conn.execute("""
SELECT city, COUNT(*) as count, AVG(age) as avg_age
FROM 'data.csv'
GROUP BY city
ORDER BY avg_age DESC
""").fetchall()

print("City Statistics:")
for row in result:
print(f"City: {row[0]}, Count: {row[1]}, Avg Age: {row[2]:.1f}")

conn.close()
EOF

Expected Output:

City Statistics:
City: Seattle, Count: 1, Avg Age: 35.0
City: New York, Count: 1, Avg Age: 30.0
City: Boston, Count: 1, Avg Age: 28.0
City: San Francisco, Count: 1, Avg Age: 25.0

5.3. Working with Pandas DataFrames

Convert between DuckDB and Pandas:

python << 'EOF'
import duckdb
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'quantity': [5, 50, 30, 10]
})

# Query Pandas DataFrame with SQL
conn = duckdb.connect(':memory:')
result = conn.execute("""
SELECT product, price, quantity, (price * quantity) as total_value
FROM df
WHERE price > 50
ORDER BY total_value DESC
""").fetchdf()

print("High-Value Products:")
print(result)

conn.close()
EOF

Expected Output:

High-Value Products:
    product  price  quantity  total_value
0    Laptop   1200         5         6000
1   Monitor    300        10         3000
2  Keyboard     75        30         2250

5.4. Persistent Database Files

Create a persistent database:

python << 'EOF'
import duckdb

# Create a persistent database file
conn = duckdb.connect('my_database.duckdb')

# Create table and insert data
conn.execute("CREATE TABLE sales (date DATE, amount DECIMAL(10,2), product VARCHAR)")
conn.execute("INSERT INTO sales VALUES ('2026-01-01', 150.50, 'Widget')")
conn.execute("INSERT INTO sales VALUES ('2026-01-02', 200.75, 'Gadget')")

# Query and display
result = conn.execute("SELECT * FROM sales ORDER BY date").fetchall()
for row in result:
    print(row)

conn.close()
print("Database saved to my_database.duckdb")
EOF

# Verify database file was created
ls -lh my_database.duckdb

Database File Features:

  • Portable: Can be copied to other systems
  • Compressed: Efficient storage format
  • Multi-threaded: Supports concurrent reads
  • ACID compliant: Transactional safety
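
To illustrate the persistence and concurrent-read points above, a short sketch reopens the file created earlier in read-only mode (it assumes my_database.duckdb still exists in the current directory):

python << 'EOF'
import duckdb

# Multiple read_only connections can share the same database file
conn = duckdb.connect('my_database.duckdb', read_only=True)
print(conn.execute("SELECT COUNT(*) AS row_count FROM sales").fetchone())
conn.close()
EOF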

5.5. Exporting Query Results

Export to CSV:

python << 'EOF'
import duckdb

conn = duckdb.connect(':memory:')

# Create sample data
conn.execute("CREATE TABLE employees (name VARCHAR, salary INT, department VARCHAR)")
conn.execute("INSERT INTO employees VALUES ('Alice', 80000, 'Engineering')")
conn.execute("INSERT INTO employees VALUES ('Bob', 65000, 'Marketing')")
conn.execute("INSERT INTO employees VALUES ('Charlie', 90000, 'Engineering')")

# Export to CSV
conn.execute("COPY (SELECT * FROM employees WHERE department = 'Engineering') TO 'engineers.csv' (HEADER, DELIMITER ',')")

conn.close()
print("Exported to engineers.csv")
EOF

# View exported file
cat engineers.csv

Expected Output:

name,salary,department
Alice,80000,Engineering
Charlie,90000,Engineering

5.6. Installing Additional Packages

Add more Python libraries:

# Install a new package (e.g., matplotlib for plotting)
sudo /opt/duckdb-env/bin/pip install matplotlib

# Verify installation
python -c "import matplotlib; print('Matplotlib installed successfully')"

Recommended Data Science Packages:

  • scikit-learn: Machine learning library
  • matplotlib / seaborn: Data visualization
  • sqlalchemy: SQL toolkit (can use DuckDB as backend)
  • jupyter: Interactive notebooks
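
As an example of combining an added package with DuckDB, the sketch below assumes matplotlib has been installed as shown above and renders query results to plot.png; the Agg backend avoids the need for a display on a headless server.

python << 'EOF'
import duckdb
import matplotlib
matplotlib.use("Agg")  # headless server: render to a file instead of a window
import matplotlib.pyplot as plt

# Generate a small dataset with DuckDB's range() table function
df = duckdb.sql("SELECT range AS x, range * range AS y FROM range(10)").df()

plt.plot(df["x"], df["y"])
plt.title("y = x^2 (DuckDB + matplotlib)")
plt.savefig("plot.png")
print("Saved plot.png")
EOF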

6. Important File Locations

File Path                                        Purpose
/opt/duckdb-env/                                 Virtual environment root directory
/opt/duckdb-env/bin/python                       Python interpreter (virtual environment)
/opt/duckdb-env/bin/pip                          Package installer (virtual environment)
/opt/duckdb-env/lib/python3.12/site-packages/    Installed Python packages directory
/etc/profile.d/duckdb_env.sh                     Environment activation script
~/.cache/pip/                                    User-specific pip cache (safe to delete)
/root/.cache/pip/                                Root user pip cache (safe to delete)
*.duckdb                                         DuckDB database files (user-created)
*.csv                                            Data files (user-created)
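
A small sketch that checks the fixed paths from the table exist on a running instance (the user-created patterns are omitted):

python << 'EOF'
from pathlib import Path

paths = [
    "/opt/duckdb-env",
    "/opt/duckdb-env/bin/python",
    "/opt/duckdb-env/bin/pip",
    "/opt/duckdb-env/lib/python3.12/site-packages",
    "/etc/profile.d/duckdb_env.sh",
]
for p in paths:
    print(f"{p}: {'OK' if Path(p).exists() else 'MISSING'}")
EOF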

7. Troubleshooting

Issue 1: Python Command Not Found

Symptoms:

$ python --version
-bash: python: command not found

Diagnosis:

Environment configuration not loaded.

Solution:

Reload environment configuration:

source /etc/profile.d/duckdb_env.sh

Or log out and log back in:

exit
ssh -i your-key.pem ubuntu@YOUR_PUBLIC_IP

Verify:

which python
# Should output: /opt/duckdb-env/bin/python

Issue 2: ImportError: No module named 'duckdb'

Symptoms:

$ python -c "import duckdb"
ImportError: No module named 'duckdb'

Diagnosis:

Using wrong Python interpreter (system Python instead of virtual environment).

Solution:

Check Python path:

which python
# Expected: /opt/duckdb-env/bin/python
# If different: /usr/bin/python or /usr/bin/python3

If using wrong Python:

# Reload environment
source /etc/profile.d/duckdb_env.sh

# Verify again
which python
python -c "import duckdb; print('DuckDB works!')"

Issue 3: Permission Denied When Installing Packages

Symptoms:

$ pip install some-package
error: could not create '/opt/duckdb-env/lib/python3.12/site-packages/...': Permission denied

Diagnosis:

The virtual environment is owned by root, so regular users cannot write to it.

Solution:

Use sudo for installing system-wide packages:

sudo /opt/duckdb-env/bin/pip install some-package

Alternative (User-Level Install):

pip install --user some-package

Note: User-level installs go to ~/.local/lib/python3.12/site-packages/, not the virtual environment.


Issue 4: DuckDB Database File Locked

Symptoms:

$ python -c "import duckdb; duckdb.connect('my_database.duckdb')"
duckdb.IOException: Could not set lock on file "my_database.duckdb": Resource temporarily unavailable

Diagnosis:

Another process is using the database file.

Solution:

Find processes using the database:

lsof | grep my_database.duckdb

Kill the process:

kill <PID>

Or wait for the other process to release the lock.

Prevention:

Always close connections:

conn = duckdb.connect('my_database.duckdb')
# ... do work ...
conn.close() # Always close!
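
The DuckDB Python connection can also be used as a context manager, which releases the file lock even if the code in between raises an exception. A minimal sketch:

import duckdb

# The connection is closed (and the lock released) when the block exits
with duckdb.connect('my_database.duckdb') as conn:
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())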

Issue 5: Out of Memory Errors

Symptoms:

$ python -c "import duckdb; ..."
MemoryError: std::bad_alloc

Diagnosis:

Query requires more memory than available.

Solution 1: Check Available Memory:

free -h

Solution 2: Limit DuckDB Memory Usage:

import duckdb
conn = duckdb.connect(':memory:')
conn.execute("SET memory_limit='2GB'") # Limit to 2 GB

Solution 3: Use Persistent Database:

# Use disk-based database instead of :memory:
conn = duckdb.connect('my_database.duckdb')

Solution 4: Process Data in Chunks:

# Use LIMIT and OFFSET to process data in batches
for offset in range(0, 1000000, 10000):
    result = conn.execute(f"SELECT * FROM large_table LIMIT 10000 OFFSET {offset}").fetchall()
    # Process each batch here

Issue 6: Pandas Version Conflict

Symptoms:

$ python -c "import duckdb, pandas"
ImportError: ... incompatible with this version of pandas

Diagnosis:

DuckDB and Pandas versions are incompatible.

Solution:

Upgrade both packages:

sudo /opt/duckdb-env/bin/pip install --upgrade duckdb pandas

Verify versions:

python -c "import duckdb, pandas; print(f'DuckDB: {duckdb.__version__}, Pandas: {pandas.__version__}')"

8. Final Notes

Key Takeaways

  1. DuckDB 1.4.4 is installed in an isolated Python virtual environment at /opt/duckdb-env
  2. Automatic activation via /etc/profile.d/duckdb_env.sh ensures seamless user experience
  3. Data science stack includes Pandas and NumPy for comprehensive data analysis
  4. System Python is untouched, ensuring system stability and upgrade safety
  5. The installation is production-ready and AMI-optimized (cleaned caches)

DuckDB Use Cases

  • Data Analysis: Query CSV, Parquet, JSON files without loading into memory
  • ETL Pipelines: Transform data with SQL before loading into data warehouses
  • Embedded Analytics: In-process database for applications (no separate server)
  • Data Science Workflows: Combine SQL queries with Pandas DataFrames
  • Prototyping: Fast analytical queries for data exploration
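
To illustrate the file-querying use case, a short sketch (reusing the hypothetical data.csv from section 5.2) converts the CSV to Parquet and then queries the Parquet file directly:

python << 'EOF'
import duckdb

# Convert the CSV from section 5.2 to Parquet, then query the Parquet file in place
duckdb.sql("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)")
print(duckdb.sql("SELECT city, COUNT(*) AS n FROM 'data.parquet' GROUP BY city").fetchall())
EOF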

Additional Resources

  • DuckDB official documentation: https://duckdb.org/docs/

Support

If you encounter issues not covered in this guide:

  1. Check the troubleshooting section above
  2. Review DuckDB official documentation
  3. Verify your security group and firewall settings
  4. Check system logs: sudo journalctl -xe

For support or questions, please contact the Easycloud team.