
Efficiently Uploading Large Datasets to AWS S3: A Practical Guide

John O'Sullivan
Senior Full Stack Engineer
& DevOps Practitioner

Recently, I needed to upload a 140GB dataset to AWS S3—a collection of video files and documents that originally lived on a network drive. This turned into a valuable learning experience about the most efficient ways to handle large-scale S3 uploads. Here's what I discovered.

Note: The approach I'm about to outline got the job done and worked reliably for my needs. However, I've since learned there are more optimal methods available—particularly AWS S3 Transfer Acceleration and AWS CLI performance tuning—which I'll cover later in this article. If you're doing this regularly or need maximum speed, skip ahead to the "Optimization Strategies" section.

The Challenge

When faced with uploading large datasets to S3, you have several options:

  • Upload directly via the AWS Console web interface
  • Upload from a network drive using AWS CLI
  • Copy to local storage first, then upload using AWS CLI

So which approach is fastest and most reliable? Let me walk you through what I learned.

The Clear Winner: AWS CLI with Local Storage

After testing different approaches, copying data to local storage and using the AWS CLI proved to be significantly faster and more reliable than other methods. Here's why:

Why AWS CLI beats the browser console:

  • ✅ Automatic multipart uploads for large files
  • ✅ Parallel transfers of multiple files
  • ✅ Resumable uploads if your connection drops
  • ✅ Superior throughput and performance
  • ✅ Built-in progress tracking and error handling
  • ❌ Browser console: Single-threaded, no optimization, prone to timeouts

Why copying to local storage first matters:

  • Better upload speeds from your machine to AWS
  • Avoids network drive bandwidth bottlenecks
  • More reliable connection during long transfers

Step-by-Step Process

Step 1: Copy Data to Local Storage

First, copy your dataset from the network drive to your local machine. Ensure you have sufficient disk space (in my case, 140GB+).

# Example: Copy from network drive to local directory
# Adjust paths based on your setup
cp -r /mnt/network-drive/dataset /home/user/local-dataset

Pro tip: Use an SSD if available—it'll significantly improve both the copy and upload speeds.
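
If the initial copy itself is at risk of being interrupted (network drives can be flaky), rsync is a handy alternative to cp because re-running it skips files that already copied in full. A minimal sketch using the same example paths:

# Resumable copy: re-running this command picks up with the files
# that haven't been fully copied yet
rsync -avh --progress /mnt/network-drive/dataset/ /home/user/local-dataset/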

Step 2: Navigate to Your Dataset Directory

Open your terminal and navigate to the directory containing your files:

cd /home/user/local-dataset

Step 3: Do a Dry Run

Before committing to a multi-hour upload, do a dry run to see what will be transferred:

aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp" \
  --dryrun

This shows you exactly what files will be uploaded without actually transferring them.

Step 4: Execute the Upload

Now run the actual sync command:

aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp"

Command breakdown:

  • aws s3 sync - Synchronizes your local directory with the S3 bucket
  • . - Current directory (includes all subdirectories)
  • --region - Specify your AWS region
  • --exclude - Skip unnecessary files (metadata, temp files)
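
One small addition I'd suggest (my own habit, not something the CLI requires): pipe the output through tee so you keep a log of what was uploaded and any failures to review afterwards. Note that the live progress counter may display differently when output is piped.

# Keep a log of the transfer alongside the live output
aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp" 2>&1 | tee sync-upload.log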

The Beauty of Resumable Uploads

Here's where things get really interesting. About halfway through my 140GB upload, I needed to close my laptop. I was worried I'd have to start over, but here's the great news: you don't!

The aws s3 sync command is idempotent and resumable. Simply re-run the exact same command:

aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp"

What happens automatically:

  • ✅ Skips files already successfully uploaded (by comparing size and timestamp)
  • ✅ Retries any failed uploads
  • ✅ Uploads only remaining files
  • ✅ No need to manually track what succeeded

You will NOT re-upload the entire dataset! This was a game-changer for me.
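
One hedged caveat from my side: because the comparison uses size and timestamp, re-copying files locally (which resets their modification times) can make sync think they changed. If you run into that, the --size-only flag tells sync to compare by file size alone:

# Compare by size only, ignoring timestamps
aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp" \
  --size-only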

Understanding the Progress Output

When your upload is running, you'll see output like this:

upload: videos/training-session.mp4 to s3://your-bucket-name/videos/training-session.mp4
Completed 6.1 GiB/~14.7 GiB (2.1 MiB/s) with ~55 file(s) remaining (calculating...)

What this tells you:

  • 6.1 GiB/~14.7 GiB - Progress: 6.1 GiB uploaded out of the ~14.7 GiB the CLI has queued so far (this estimate keeps growing as sync discovers more files, which is why it can sit well below the full dataset size)
  • (2.1 MiB/s) - Current transfer speed (this means it's working!)
  • ~55 file(s) remaining - Estimated files left to upload
  • (calculating...) - AWS CLI is recalculating totals (normal for large files)

Is My Upload Working or Hanging?

Signs your upload is healthy:

  • ✅ Speed shown (e.g., 2.1 MiB/s)
  • ✅ Progress numbers increasing
  • ✅ New file names appearing periodically

Signs your upload might be stuck:

  • ❌ Speed showing 0 bytes/s
  • ❌ Progress frozen for 10+ minutes
  • ❌ No new file names appearing

The "(calculating...)" message initially concerned me, but it's completely normal. The CLI recalculates totals as it processes files, especially with large video files.

Handling Common Issues

SSL Validation Errors

During my upload, I encountered several SSL validation errors:

upload failed: videos/large-file.mp4
SSL validation failed for https://... EOF occurred in violation of protocol

Solution: These are temporary network issues. Simply re-run the sync command—the failed files will be automatically retried. Don't panic; your data isn't corrupted.
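
If you'd rather not babysit the retries, a simple shell loop (my own sketch, not an official AWS pattern) keeps re-running the same sync until it exits cleanly:

# Retry the sync until it succeeds (exit code 0)
until aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp"; do
  echo "Sync failed, retrying in 60 seconds..."
  sleep 60
done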

Verification Commands

After your upload completes, verify everything transferred correctly:

Check what's in your bucket:

aws s3 ls s3://your-bucket-name/ \
  --recursive \
  --human-readable \
  --summarize

Find a specific file:

aws s3 ls s3://your-bucket-name/ \
  --recursive | grep "filename.mp4"

Calculate total bucket size:

aws s3 ls s3://your-bucket-name/ \
  --recursive \
  --summarize
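
As a rough sanity check (my own habit, not an AWS feature), compare the local file count and total size against the "Total Objects" and "Total Size" lines that --summarize prints:

# Local side, run from the dataset directory
find . -type f | wc -l   # file count
du -sh .                 # approximate total size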

Performance Tips

Optimizing Upload Speed

Here's what made the biggest difference for me:

  1. Use an SSD - Dramatically faster than spinning HDDs
  2. Wired connection - Ethernet beats WiFi for reliability
  3. Close bandwidth-heavy apps - Pause those software updates
  4. Off-peak hours - Upload overnight when network is quieter

Time Estimation

Based on typical speeds, here's what to expect for a 140GB upload:

  • 1 MiB/s: ~40 hours
  • 2 MiB/s: ~20 hours
  • 5 MiB/s: ~8 hours
  • 10 MiB/s: ~4 hours

My upload averaged around 2-3 MiB/s, taking about 18 hours total (with interruptions).
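
These estimates come from straightforward arithmetic, so you can plug in your own numbers. For example, 140GB at 2 MiB/s:

# hours = (GB x 1024 MiB per GB) / (MiB/s) / 3600 seconds per hour
awk 'BEGIN { printf "%.1f hours\n", (140 * 1024) / 2 / 3600 }'
# => 19.9 hours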

Long-Running Uploads: Use Screen or Tmux

For uploads taking many hours, use screen or tmux to keep the process running even if you disconnect:

# Start a screen session
screen -S s3upload

# Run your upload command
aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp"

# Detach: Press Ctrl+A, then D
# Reattach later: screen -r s3upload

This saved me multiple times when I needed to close my laptop or switch networks.
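
If you prefer tmux, the equivalent workflow looks like this:

# Start a tmux session
tmux new -s s3upload

# Run your upload command inside the session
# Detach: Press Ctrl+B, then D
# Reattach later: tmux attach -t s3upload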

IAM Permissions

Ensure your IAM user or role has the necessary S3 permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
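
Before kicking off a multi-hour transfer, it's worth a quick check that your credentials and permissions actually work. A minimal sketch (the test file name is just an example):

# Confirm which identity the CLI is using
aws sts get-caller-identity

# Test list, write, and delete against the bucket
aws s3 ls s3://your-bucket-name/
echo "permission test" > /tmp/s3-perm-test.txt
aws s3 cp /tmp/s3-perm-test.txt s3://your-bucket-name/s3-perm-test.txt
aws s3 rm s3://your-bucket-name/s3-perm-test.txt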

Optimization Strategies

After completing my initial upload, I discovered several ways to significantly improve upload performance. If I had known about these beforehand, my 18-hour upload could have been cut down considerably.

AWS CLI Performance Tuning

The AWS CLI has configuration options that most people don't know about. By default, it's configured conservatively, but you can tune it for better performance on fast connections.

Edit your AWS configuration file (~/.aws/config) and add these settings:

[default]
s3 =
    max_concurrent_requests = 20
    max_queue_size = 10000
    multipart_threshold = 64MB
    multipart_chunksize = 16MB
    use_accelerate_endpoint = true

What these settings do:

  • max_concurrent_requests = 20 - Increases parallel transfers from default 10 to 20
  • max_queue_size = 10000 - Increases task queue size for better throughput
  • multipart_threshold = 64MB - Files larger than 64MB use multipart upload (default 8MB)
  • multipart_chunksize = 16MB - Size of each multipart chunk (default 8MB)
  • use_accelerate_endpoint = true - Routes requests through the S3 Transfer Acceleration endpoint (only works once acceleration is enabled on the bucket; see the next section)

Performance impact: On my 100Mbps connection, these settings increased my upload speed from ~2 MiB/s to ~5 MiB/s—a 150% improvement!

Optional bandwidth limiting:

If you want to limit how much of your connection the sync uses, add this line under the same s3 block:

max_bandwidth = 10MB/s
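
If you'd rather not edit the file by hand, the same values can be set from the command line; aws configure set writes them into ~/.aws/config for you:

aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.max_queue_size 10000
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB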

AWS S3 Transfer Acceleration

This is the game-changer I wish I'd known about from the start. S3 Transfer Acceleration uses Amazon CloudFront's globally distributed edge locations to accelerate uploads.

How fast is it?

  • Can be 50-500% faster depending on your distance from the S3 region
  • The further you are from your target region, the more dramatic the improvement
  • AWS provides a speed comparison tool to test your potential gains

How to enable it:

  1. Enable Transfer Acceleration on your bucket:
aws s3api put-bucket-accelerate-configuration \
    --bucket your-bucket-name \
    --accelerate-configuration Status=Enabled
  2. Use the accelerated endpoint in your sync command:
aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp" \
  --endpoint-url https://s3-accelerate.amazonaws.com

Or simply add use_accelerate_endpoint = true to your ~/.aws/config (as shown above) and use your normal sync command.
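
To confirm acceleration is actually active on the bucket before relying on it, query its configuration; it should report Status: Enabled:

aws s3api get-bucket-accelerate-configuration \
    --bucket your-bucket-name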

Cost consideration:

Transfer Acceleration costs an additional $0.04 per GB for uploads routed through edge locations in the US, Europe, and Japan (higher rates apply for other edge locations). For my 140GB upload:

  • Additional cost: ~$5.60
  • Time saved: Potentially 6-12 hours
  • Absolutely worth it for time-sensitive migrations

When to use Transfer Acceleration:

  • ✅ Uploading from far away from your S3 region
  • ✅ Time-sensitive migrations
  • ✅ International transfers
  • ✅ When upload speed is inconsistent
  • ❌ Skip it if you're in the same region as your bucket (negligible benefit)

Combined Approach

For maximum performance, combine both strategies:

# 1. Configure AWS CLI for better performance
vim ~/.aws/config
# (Add the performance settings shown above)

# 2. Enable Transfer Acceleration on bucket
aws s3api put-bucket-accelerate-configuration \
    --bucket your-bucket-name \
    --accelerate-configuration Status=Enabled

# 3. Run your optimized upload
aws s3 sync . s3://your-bucket-name/ \
  --region us-east-1 \
  --exclude ".DS_Store" \
  --exclude "*.tmp"

With both optimizations, my upload speed improved from 2-3 MiB/s to 8-10 MiB/s, roughly a 3-4x speedup. This would have reduced my 18-hour upload to just 4-5 hours.

Key Takeaways

After going through this process, here are my main recommendations:

  1. Always use AWS CLI over the browser for large uploads
  2. Tune your AWS CLI configuration for 2-4x faster uploads (easy win!)
  3. Enable S3 Transfer Acceleration if speed matters more than a few dollars
  4. Copy to local storage first if transferring from network drives
  5. Don't worry about interruptions - aws s3 sync is fully resumable
  6. The "(calculating...)" message is normal - your upload isn't hanging
  7. Use screen/tmux for multi-hour transfers
  8. SSL errors are temporary - just re-run the command

Quick Reference Checklist

  • Configure AWS CLI performance settings in ~/.aws/config
  • (Optional) Enable S3 Transfer Acceleration on bucket
  • Copy data to local machine with sufficient disk space
  • Navigate to dataset directory in terminal
  • (Optional) Run --dryrun to preview
  • Execute aws s3 sync command
  • Monitor progress (speed > 0 MiB/s = working)
  • If interrupted, simply re-run the same command
  • Verify with aws s3 ls --summarize

Conclusion

Uploading large datasets to S3 doesn't have to be painful. With the right approach—AWS CLI, local storage, and understanding that interruptions are okay—you can reliably transfer hundreds of gigabytes without starting over or babysitting the process.

The resumable nature of aws s3 sync is particularly elegant. It removes the anxiety of long-running uploads and makes the whole process remarkably forgiving.

Have you dealt with large-scale S3 uploads? I'd love to hear about your experiences and any additional tips you've discovered.


Additional Resources