DuckDB 1.4.4 + Data Science Stack (Ubuntu 24.04) AMI Administrator Guide

1. Quick Start Information

Connection Methods:

  • Access the instance via SSH using the ubuntu user. Use sudo to run commands requiring root privileges. To switch to the root user, use sudo su - root.

Install Information:

  • OS: Ubuntu 24.04 LTS
  • DuckDB version: 1.4.4
  • Python version: 3.12 (system default)
  • Virtual Environment Location: /opt/duckdb-env
  • Environment Configuration: /etc/profile.d/duckdb_env.sh
  • Python Binary Path: /opt/duckdb-env/bin/python
  • pip Binary Path: /opt/duckdb-env/bin/pip

Installed Python Packages:

  • DuckDB: duckdb (v1.4.4) - In-process analytical database
  • Pandas: pandas - Data manipulation and analysis library
  • NumPy: numpy - Numerical computing library

Quick Verification Commands:

  • Check Python path: which python
  • Check DuckDB version: python -c "import duckdb; print(duckdb.__version__)"
  • Check Pandas version: python -c "import pandas; print(pandas.__version__)"
  • Check NumPy version: python -c "import numpy; print(numpy.__version__)"

Firewall Configuration:

  • Allow inbound SSH traffic on TCP port 22.
  • For security, restrict SSH access to trusted IP addresses only.

2. First Launch & Verification

Step 1: Connect to Your Instance

  1. Launch your instance in your cloud provider's console (e.g., AWS EC2)
  2. Ensure SSH port 22 is allowed in your security group
  3. Connect via SSH:
    ssh -i your-key.pem ubuntu@YOUR_PUBLIC_IP

Step 2: Verify Python Environment

Check the Python path (should point to virtual environment):

which python

Expected Output:

/opt/duckdb-env/bin/python

Check Python version:

python --version

Expected Output:

Python 3.12.x

Step 3: Verify DuckDB Installation

Test DuckDB import and version:

python -c "import duckdb; print(f'Success! DuckDB version: {duckdb.__version__}')"

Expected Output:

Success! DuckDB version: 1.4.4

Step 4: Verify Data Science Libraries

Test Pandas:

python -c "import pandas; print(f'Pandas version: {pandas.__version__}')"

Test NumPy:

python -c "import numpy; print(f'NumPy version: {numpy.__version__}')"

Step 5: Test DuckDB Query

Run a simple SQL query:

python << 'EOF'
import duckdb

# Create an in-memory database
conn = duckdb.connect(':memory:')

# Create a sample table
conn.execute("CREATE TABLE users (id INT, name VARCHAR, age INT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 30), (2, 'Bob', 25), (3, 'Charlie', 35)")

# Query the table
result = conn.execute("SELECT * FROM users WHERE age > 25").fetchall()
print("Query Results:")
for row in result:
    print(row)

conn.close()
EOF

Expected Output:

Query Results:
(1, 'Alice', 30)
(3, 'Charlie', 35)

How This Works:

  • DuckDB creates an in-memory database (:memory:)
  • SQL commands work just like in PostgreSQL or MySQL
  • DuckDB is optimized for analytical queries (OLAP)
  • No separate database server needed - DuckDB runs in-process

3. Architecture & Detailed Configuration

This AMI uses a Python virtual environment approach to isolate DuckDB and data science packages from the system Python installation, ensuring stability and ease of maintenance.

Installation Architecture:

[System Python 3.12]

[python3 -m venv /opt/duckdb-env]

[Virtual Environment Created]

├── /opt/duckdb-env/bin/python → Isolated Python interpreter
├── /opt/duckdb-env/bin/pip → Isolated package manager
└── /opt/duckdb-env/lib/python3.12/site-packages/
    ├── duckdb/ → DuckDB library
    ├── pandas/ → Pandas library
    └── numpy/ → NumPy library

[Environment Configuration]

/etc/profile.d/duckdb_env.sh → Adds /opt/duckdb-env/bin to PATH

[Automatic Activation]

All users automatically use the DuckDB environment on login

Key Design Decisions:

  1. Virtual Environment Isolation: Packages installed in /opt/duckdb-env don't affect system Python
  2. Automatic PATH Configuration: Environment activates automatically via /etc/profile.d/
  3. System-Wide Availability: All users (root, ubuntu, etc.) can use DuckDB without additional setup
  4. No conda/miniconda: Using venv (built into Python) keeps the AMI lightweight
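
A minimal sketch to confirm decision 1 on a running instance: inside a virtual environment, sys.prefix points at the environment root while sys.base_prefix still points at the system Python.

python << 'EOF'
import sys

# Interpreter actually in use (expected: /opt/duckdb-env/bin/python)
print("executable:", sys.executable)

# Inside a venv, prefix (the venv root) differs from base_prefix (system Python)
print("venv root:     ", sys.prefix)
print("system python: ", sys.base_prefix)
print("inside venv:   ", sys.prefix != sys.base_prefix)
EOF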

3.1. Environment Configuration File

File Location: /etc/profile.d/duckdb_env.sh

File Content:

# === AWS Marketplace AMI Configuration ===
# Automatically add the Data Science environment to PATH
export PATH="/opt/duckdb-env/bin:$PATH"

How This Works:

  • This file is automatically sourced during user login
  • Prepends /opt/duckdb-env/bin to the PATH environment variable
  • Ensures python and pip commands resolve to the virtual environment binaries
  • Works for all users (root, ubuntu, custom users)

Why This Approach?

  • Transparent: Users don't need to manually activate the environment
  • Persistent: Configuration survives reboots and SSH reconnections
  • Standard: Uses Linux standard directory /etc/profile.d/ for environment setup
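
The following sketch shows the mechanism at work and assumes only the file locations documented above. shutil.which resolves a command the same way the shell does, so it should return the virtual environment binaries once the profile script has been sourced.

python << 'EOF'
import os
import shutil

# /opt/duckdb-env/bin should appear at the front of PATH
print("PATH head:", os.environ["PATH"].split(":")[:3])

# Commands resolve to the virtual environment binaries
print("python resolves to:", shutil.which("python"))
print("pip resolves to:   ", shutil.which("pip"))
EOF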

3.2. Virtual Environment Structure

Directory Layout:

/opt/duckdb-env/
├── bin/
│   ├── python → Python 3.12 interpreter
│   ├── python3 → Symlink to python
│   ├── pip → Package installer
│   └── activate → Manual activation script (optional)
├── include/
│   └── python3.12/ → C header files
├── lib/
│   └── python3.12/
│       └── site-packages/
│           ├── duckdb/ → DuckDB library
│           ├── pandas/ → Pandas library
│           ├── numpy/ → NumPy library
│           └── ... → Other dependencies
└── pyvenv.cfg → Virtual environment configuration

Installed Packages:

Package    Version    Purpose
duckdb     1.4.4      In-process analytical SQL database
pandas     Latest     Data manipulation (DataFrames)
numpy      Latest     Numerical arrays and math operations
pip        Latest     Python package installer
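
The pandas, numpy, and pip versions depend on when the AMI was built. A small sketch using only the standard library prints the versions actually installed in the environment:

python << 'EOF'
from importlib.metadata import version

# Print the installed version of each package shipped with the AMI
for pkg in ("duckdb", "pandas", "numpy", "pip"):
    print(f"{pkg}: {version(pkg)}")
EOF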

4. How-To-Create: Reproduce This Environment

This section explains how this AMI was built, allowing you to reproduce the installation on any Ubuntu 24.04 system.

Step 1: Install System Dependencies

Purpose: Install Python virtual environment module and pip (not included by default in Ubuntu 24.04).

# Update package index
sudo apt update

# Install Python venv module and pip
sudo apt install -y python3-venv python3-pip

How This Works:

  • python3-venv: Provides the venv module for creating virtual environments
  • python3-pip: Installs pip (Python package installer) for the system Python

Why These Dependencies?

Ubuntu 24.04 minimal installation does not include venv or pip by default to keep the base image small. These packages are required to create isolated Python environments.

Step 2: Create Virtual Environment

Purpose: Create an isolated Python environment at /opt/duckdb-env.

# Create virtual environment
sudo python3 -m venv /opt/duckdb-env

How This Works:

  • python3 -m venv: Invokes the venv module to create a new virtual environment
  • /opt/duckdb-env: Target directory for the virtual environment
  • sudo: Required because /opt/ is a system directory

What Gets Created:

  • Independent Python interpreter
  • Isolated site-packages directory
  • Activation scripts (though we don't use them due to auto-activation)

Directory Location Rationale:

  • /opt/: Standard location for third-party software on Linux systems
  • duckdb-env: Descriptive name indicating it's a DuckDB environment
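
To check that Step 2 produced a valid environment, you can inspect pyvenv.cfg, which venv writes into the environment root. A minimal sketch, assuming the default location used above:

python << 'EOF'
from pathlib import Path

# pyvenv.cfg records the base interpreter and the venv options
cfg = Path("/opt/duckdb-env/pyvenv.cfg")
if cfg.exists():
    print(cfg.read_text())
else:
    print("Virtual environment not found at /opt/duckdb-env")
EOF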

Step 3: Upgrade pip and Install Packages

Purpose: Install DuckDB and data science libraries into the virtual environment.

# Upgrade pip to latest version
sudo /opt/duckdb-env/bin/pip install --upgrade pip

# Install DuckDB, Pandas, and NumPy
sudo /opt/duckdb-env/bin/pip install duckdb pandas numpy

How This Works:

  • /opt/duckdb-env/bin/pip: Uses the virtual environment's pip (not system pip)
  • --upgrade pip: Ensures we have the latest pip version
  • duckdb pandas numpy: Installs all three packages in one command
  • sudo: Required to write to /opt/duckdb-env/

Why Upgrade pip First?

Older pip versions may have compatibility issues or security vulnerabilities. Upgrading ensures smooth installation of modern packages.

Package Dependencies:

  • duckdb: Self-contained, no heavy dependencies
  • pandas: Depends on numpy, so numpy is installed automatically if it is not already present
  • numpy: Core numerical library, used by pandas and by DuckDB's DataFrame integration
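
As a quick smoke test that all three packages work together after Step 3 (a sketch, not part of the original build), convert a DuckDB result into a Pandas DataFrame, which exercises NumPy under the hood:

python << 'EOF'
import duckdb

# .df() materializes the query result as a Pandas DataFrame backed by NumPy arrays
df = duckdb.sql("SELECT 42 AS answer").df()
print(type(df))
print(df.dtypes)
EOF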

Step 4: Configure Automatic Environment Activation

Purpose: Make the DuckDB environment available to all users automatically.

# Create environment configuration script
sudo tee /etc/profile.d/duckdb_env.sh > /dev/null << 'EOF'
# === AWS Marketplace AMI Configuration ===
# Automatically add the Data Science environment to PATH
export PATH="/opt/duckdb-env/bin:$PATH"
EOF

# Make the script executable
sudo chmod +x /etc/profile.d/duckdb_env.sh

How This Works:

  • tee /etc/profile.d/duckdb_env.sh: Creates a new file with the specified content
  • /etc/profile.d/: Standard directory for system-wide environment configuration
  • export PATH="...": Prepends the virtual environment binaries to PATH
  • chmod +x: Marks the script executable; not strictly required for sourcing, but conventional for scripts in /etc/profile.d/

Why /etc/profile.d/?

  • Files in this directory are automatically sourced during login
  • Applies to all users (root, ubuntu, custom users)
  • Survives system reboots and package updates

Alternative Approaches (Not Used):

  • Manual activation with source /opt/duckdb-env/bin/activate: Requires user action at every login
  • User-level .bashrc: Only affects specific users, not system-wide
  • System Python site-packages: Pollutes system Python, risky on upgrades

5. Using the DuckDB Environment

5.1. Interactive Python REPL

Start Python REPL:

python

Example Session:

>>> import duckdb
>>> conn = duckdb.connect(':memory:')
>>> conn.execute("SELECT 'Hello from DuckDB!' AS message").fetchone()
('Hello from DuckDB!',)
>>> conn.close()
>>> exit()

5.2. Querying CSV Files Directly

DuckDB can query CSV files without loading them into a table:

# Create a sample CSV file
cat > data.csv << 'EOF'
id,name,age,city
1,Alice,30,New York
2,Bob,25,San Francisco
3,Charlie,35,Seattle
4,Diana,28,Boston
EOF

# Query the CSV directly
python << 'EOF'
import duckdb

conn = duckdb.connect(':memory:')

# Query CSV file directly
result = conn.execute("""
SELECT city, COUNT(*) as count, AVG(age) as avg_age
FROM 'data.csv'
GROUP BY city
ORDER BY avg_age DESC
""").fetchall()

print("City Statistics:")
for row in result:
print(f"City: {row[0]}, Count: {row[1]}, Avg Age: {row[2]:.1f}")

conn.close()
EOF

Expected Output:

City Statistics:
City: Seattle, Count: 1, Avg Age: 35.0
City: New York, Count: 1, Avg Age: 30.0
City: Boston, Count: 1, Avg Age: 28.0
City: San Francisco, Count: 1, Avg Age: 25.0

5.3. Working with Pandas DataFrames

Convert between DuckDB and Pandas:

python << 'EOF'
import duckdb
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'quantity': [5, 50, 30, 10]
})

# Query Pandas DataFrame with SQL
conn = duckdb.connect(':memory:')
result = conn.execute("""
SELECT product, price, quantity, (price * quantity) as total_value
FROM df
WHERE price > 50
ORDER BY total_value DESC
""").fetchdf()

print("High-Value Products:")
print(result)

conn.close()
EOF

Expected Output:

High-Value Products:
    product  price  quantity  total_value
0    Laptop   1200         5         6000
1   Monitor    300        10         3000
2  Keyboard     75        30         2250

5.4. Persistent Database Files

Create a persistent database:

python << 'EOF'
import duckdb

# Create a persistent database file
conn = duckdb.connect('my_database.duckdb')

# Create table and insert data
conn.execute("CREATE TABLE sales (date DATE, amount DECIMAL(10,2), product VARCHAR)")
conn.execute("INSERT INTO sales VALUES ('2026-01-01', 150.50, 'Widget')")
conn.execute("INSERT INTO sales VALUES ('2026-01-02', 200.75, 'Gadget')")

# Query and display
result = conn.execute("SELECT * FROM sales ORDER BY date").fetchall()
for row in result:
    print(row)

conn.close()
print("Database saved to my_database.duckdb")
EOF

# Verify database file was created
ls -lh my_database.duckdb

Database File Features:

  • Portable: Can be copied to other systems
  • Compressed: Efficient storage format
  • Multi-threaded: Supports concurrent reads
  • ACID compliant: Transactional safety
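
To illustrate the persistence and concurrent-read points above, a short sketch reopens the file created earlier in read-only mode (it assumes my_database.duckdb still exists in the current directory):

python << 'EOF'
import duckdb

# Multiple read_only connections can share the same database file
conn = duckdb.connect('my_database.duckdb', read_only=True)
print(conn.execute("SELECT COUNT(*) AS row_count FROM sales").fetchone())
conn.close()
EOF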

5.5. Exporting Query Results

Export to CSV:

python << 'EOF'
import duckdb

conn = duckdb.connect(':memory:')

# Create sample data
conn.execute("CREATE TABLE employees (name VARCHAR, salary INT, department VARCHAR)")
conn.execute("INSERT INTO employees VALUES ('Alice', 80000, 'Engineering')")
conn.execute("INSERT INTO employees VALUES ('Bob', 65000, 'Marketing')")
conn.execute("INSERT INTO employees VALUES ('Charlie', 90000, 'Engineering')")

# Export to CSV
conn.execute("COPY (SELECT * FROM employees WHERE department = 'Engineering') TO 'engineers.csv' (HEADER, DELIMITER ',')")

conn.close()
print("Exported to engineers.csv")
EOF

# View exported file
cat engineers.csv

Expected Output:

name,salary,department
Alice,80000,Engineering
Charlie,90000,Engineering

5.6. Installing Additional Packages

Add more Python libraries:

# Install a new package (e.g., matplotlib for plotting)
sudo /opt/duckdb-env/bin/pip install matplotlib

# Verify installation
python -c "import matplotlib; print('Matplotlib installed successfully')"

Recommended Data Science Packages:

  • scikit-learn: Machine learning library
  • matplotlib / seaborn: Data visualization
  • sqlalchemy: SQL toolkit (can use DuckDB as backend)
  • jupyter: Interactive notebooks
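
As an example of combining an added package with DuckDB, the sketch below assumes matplotlib has been installed as shown above and renders query results to plot.png; the Agg backend avoids the need for a display on a headless server.

python << 'EOF'
import duckdb
import matplotlib
matplotlib.use("Agg")  # headless server: render to a file instead of a window
import matplotlib.pyplot as plt

# Generate a small dataset with DuckDB's range() table function
df = duckdb.sql("SELECT range AS x, range * range AS y FROM range(10)").df()

plt.plot(df["x"], df["y"])
plt.title("y = x^2 (DuckDB + matplotlib)")
plt.savefig("plot.png")
print("Saved plot.png")
EOF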

6. Important File Locations

File Path                                        Purpose
/opt/duckdb-env/                                 Virtual environment root directory
/opt/duckdb-env/bin/python                       Python interpreter (virtual environment)
/opt/duckdb-env/bin/pip                          Package installer (virtual environment)
/opt/duckdb-env/lib/python3.12/site-packages/    Installed Python packages directory
/etc/profile.d/duckdb_env.sh                     Environment activation script
~/.cache/pip/                                    User-specific pip cache (safe to delete)
/root/.cache/pip/                                Root user pip cache (safe to delete)
*.duckdb                                         DuckDB database files (user-created)
*.csv                                            Data files (user-created)
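
A small sketch that checks the fixed paths from the table exist on a running instance (the user-created patterns are omitted):

python << 'EOF'
from pathlib import Path

paths = [
    "/opt/duckdb-env",
    "/opt/duckdb-env/bin/python",
    "/opt/duckdb-env/bin/pip",
    "/opt/duckdb-env/lib/python3.12/site-packages",
    "/etc/profile.d/duckdb_env.sh",
]
for p in paths:
    print(f"{p}: {'OK' if Path(p).exists() else 'MISSING'}")
EOF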

7. Troubleshooting

Issue 1: Python Command Not Found

Symptoms:

$ python --version
-bash: python: command not found

Diagnosis:

Environment configuration not loaded.

Solution:

Reload environment configuration:

source /etc/profile.d/duckdb_env.sh

Or log out and log back in:

exit
ssh -i your-key.pem ubuntu@YOUR_PUBLIC_IP

Verify:

which python
# Should output: /opt/duckdb-env/bin/python

Issue 2: ImportError: No module named 'duckdb'

Symptoms:

$ python -c "import duckdb"
ImportError: No module named 'duckdb'

Diagnosis:

Using wrong Python interpreter (system Python instead of virtual environment).

Solution:

Check Python path:

which python
# Expected: /opt/duckdb-env/bin/python
# If different: /usr/bin/python or /usr/bin/python3

If using wrong Python:

# Reload environment
source /etc/profile.d/duckdb_env.sh

# Verify again
which python
python -c "import duckdb; print('DuckDB works!')"

Issue 3: Permission Denied When Installing Packages

Symptoms:

$ pip install some-package
error: could not create '/opt/duckdb-env/lib/python3.12/site-packages/...': Permission denied

Diagnosis:

The virtual environment is owned by root, so regular users cannot write to it.

Solution:

Use sudo for installing system-wide packages:

sudo /opt/duckdb-env/bin/pip install some-package

Alternative (User-Level Install):

pip install --user some-package

Note: User-level installs go to ~/.local/lib/python3.12/site-packages/, not the virtual environment.


Issue 4: DuckDB Database File Locked

Symptoms:

$ python -c "import duckdb; duckdb.connect('my_database.duckdb')"
duckdb.IOException: Could not set lock on file "my_database.duckdb": Resource temporarily unavailable

Diagnosis:

Another process is using the database file.

Solution:

Find processes using the database:

lsof | grep my_database.duckdb

Kill the process:

kill <PID>

Or wait for the other process to release the lock.

Prevention:

Always close connections:

conn = duckdb.connect('my_database.duckdb')
# ... do work ...
conn.close() # Always close!
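
The DuckDB Python connection can also be used as a context manager, which releases the file lock even if the code in between raises an exception. A minimal sketch:

import duckdb

# The connection is closed (and the lock released) when the block exits
with duckdb.connect('my_database.duckdb') as conn:
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())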

Issue 5: Out of Memory Errors

Symptoms:

$ python -c "import duckdb; ..."
MemoryError: std::bad_alloc

Diagnosis:

Query requires more memory than available.

Solution 1: Check Available Memory:

free -h

Solution 2: Limit DuckDB Memory Usage:

import duckdb
conn = duckdb.connect(':memory:')
conn.execute("SET memory_limit='2GB'") # Limit to 2 GB

Solution 3: Use Persistent Database:

# Use disk-based database instead of :memory:
conn = duckdb.connect('my_database.duckdb')

Solution 4: Process Data in Chunks:

# Use LIMIT and OFFSET to process data in batches
for offset in range(0, 1000000, 10000):
    result = conn.execute(f"SELECT * FROM large_table LIMIT 10000 OFFSET {offset}").fetchall()
    # Process each batch here

Issue 6: Pandas Version Conflict

Symptoms:

$ python -c "import duckdb, pandas"
ImportError: ... incompatible with this version of pandas

Diagnosis:

DuckDB and Pandas versions are incompatible.

Solution:

Upgrade both packages:

sudo /opt/duckdb-env/bin/pip install --upgrade duckdb pandas

Verify versions:

python -c "import duckdb, pandas; print(f'DuckDB: {duckdb.__version__}, Pandas: {pandas.__version__}')"

8. Final Notes

Key Takeaways

  1. DuckDB 1.4.4 is installed in an isolated Python virtual environment at /opt/duckdb-env
  2. Automatic activation via /etc/profile.d/duckdb_env.sh ensures seamless user experience
  3. Data science stack includes Pandas and NumPy for comprehensive data analysis
  4. System Python is untouched, ensuring system stability and upgrade safety
  5. The installation is production-ready and AMI-optimized (cleaned caches)

DuckDB Use Cases

  • Data Analysis: Query CSV, Parquet, JSON files without loading into memory
  • ETL Pipelines: Transform data with SQL before loading into data warehouses
  • Embedded Analytics: In-process database for applications (no separate server)
  • Data Science Workflows: Combine SQL queries with Pandas DataFrames
  • Prototyping: Fast analytical queries for data exploration
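
To illustrate the file-querying use case, a short sketch (reusing the hypothetical data.csv from section 5.2) converts the CSV to Parquet and then queries the Parquet file directly:

python << 'EOF'
import duckdb

# Convert the CSV from section 5.2 to Parquet, then query the Parquet file in place
duckdb.sql("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)")
print(duckdb.sql("SELECT city, COUNT(*) AS n FROM 'data.parquet' GROUP BY city").fetchall())
EOF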

Additional Resources

  • DuckDB official documentation: https://duckdb.org/docs/

Support

If you encounter issues not covered in this guide:

  1. Check the troubleshooting section above
  2. Review DuckDB official documentation
  3. Verify your security group and firewall settings
  4. Check system logs: sudo journalctl -xe

For support or questions, please contact the Easycloud team.