Efficiently Uploading Large Datasets to AWS S3: A Practical Guide

Recently, I needed to upload a 140GB dataset to AWS S3—a collection of video files and documents that originally lived on a network drive. This turned into a valuable learning experience about the most efficient ways to handle large-scale S3 uploads. Here's what I discovered.
Note: The approach I'm about to outline got the job done and worked reliably for my needs. However, I've since learned there are more optimal methods available—particularly AWS S3 Transfer Acceleration and AWS CLI performance tuning—which I'll cover later in this article. If you're doing this regularly or need maximum speed, skip ahead to the "Optimization Strategies" section.
The Challenge
When faced with uploading large datasets to S3, you have several options:
- Upload directly via the AWS Console web interface
- Upload from a network drive using AWS CLI
- Copy to local storage first, then upload using AWS CLI
So which approach is fastest and most reliable? Let me walk you through what I learned.
The Clear Winner: AWS CLI with Local Storage
After testing different approaches, copying data to local storage and using the AWS CLI proved to be significantly faster and more reliable than other methods. Here's why:
Why AWS CLI beats the browser console:
- ✅ Automatic multipart uploads for large files
- ✅ Parallel transfer of multiple files simultaneously
- ✅ Resumable uploads if your connection drops
- ✅ Superior throughput and performance
- ✅ Built-in progress tracking and error handling
- ❌ Browser console: Single-threaded, no optimization, prone to timeouts
Why copying to local storage first matters:
- Better upload speeds from your machine to AWS
- Avoids network drive bandwidth bottlenecks
- More reliable connection during long transfers
Step-by-Step Process
Step 1: Copy Data to Local Storage
First, copy your dataset from the network drive to your local machine. Ensure you have sufficient disk space (in my case, 140GB+).
# Example: Copy from network drive to local directory
# Adjust paths based on your setup
cp -r /mnt/network-drive/dataset /home/user/local-dataset
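If rsync is available, it can be a sturdier choice than cp for this step: it shows progress and, if the copy is interrupted, re-running it skips files that already made it across. A minimal sketch, assuming the same illustrative paths as above:
# Copy with progress; re-running skips files that were already copied
rsync -ah --progress /mnt/network-drive/dataset/ /home/user/local-dataset/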
Pro tip: Use an SSD if available—it'll significantly improve both the copy and upload speeds.
Step 2: Navigate to Your Dataset Directory
Open your terminal and navigate to the directory containing your files:
cd /home/user/local-dataset
Step 3: Preview Your Upload (Optional but Recommended)
Before committing to a multi-hour upload, do a dry run to see what will be transferred:
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp" \
--dryrun
This shows you exactly what files will be uploaded without actually transferring them.
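If you also want a rough count of how many files the sync will touch, you can pipe the dry-run output through standard shell tools. A quick sketch (the bucket name is a placeholder, as everywhere in this post):
# Count the files the sync would upload
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp" \
--dryrun | wc -l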
Step 4: Execute the Upload
Now run the actual sync command:
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp"
Command breakdown:
- aws s3 sync - Synchronizes your local directory with the S3 bucket
- . - The current directory (includes all subdirectories)
- --region - Specify your AWS region
- --exclude - Skip unnecessary files (metadata, temp files)
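For a transfer this long, it's also worth capturing everything the CLI prints so you can grep for failures afterwards. A sketch using tee (the log filename is just an example; the progress formatting is plainer when output is piped):
# Save upload and error messages to a log while the sync runs
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp" 2>&1 | tee "sync-$(date +%Y%m%d).log"
# Afterwards, check the log for anything that failed
grep -i "failed" "sync-$(date +%Y%m%d).log"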
The Beauty of Resumable Uploads
Here's where things get really interesting. About halfway through my 140GB upload, I needed to close my laptop. I was worried I'd have to start over, but here's the great news: you don't!
The aws s3 sync command is idempotent and resumable. Simply re-run the exact same command:
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp"
What happens automatically:
- ✅ Skips files already successfully uploaded (by comparing size and timestamp)
- ✅ Retries any failed uploads
- ✅ Uploads only remaining files
- ✅ No need to manually track what succeeded
You will NOT re-upload the entire dataset! This was a game-changer for me.
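One caveat: sync decides what to skip by comparing size and modification time, so if the copy to local storage reset your timestamps, it may try to re-upload files that are already in the bucket. The --size-only flag tells it to compare sizes alone; a cautious sketch, previewed with --dryrun first:
# Preview what sync would skip if only file sizes are compared
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--size-only \
--dryrun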
Understanding the Progress Output
When your upload is running, you'll see output like this:
upload: videos/training-session.mp4 to s3://your-bucket-name/videos/training-session.mp4
Completed 6.1 GiB/~14.7 GiB (2.1 MiB/s) with ~55 file(s) remaining (calculating...)
What this tells you:
- 6.1 GiB/~14.7 GiB - Progress: 6.1 GiB uploaded out of an estimated 14.7 GiB total
- (2.1 MiB/s) - Current transfer speed (this means it's working!)
- ~55 file(s) remaining - Estimated files left to upload
- (calculating...) - AWS CLI is recalculating totals (normal for large files)
Is My Upload Working or Hanging?
Signs your upload is healthy:
- ✅ Speed shown (e.g., 2.1 MiB/s)
- ✅ Progress numbers increasing
- ✅ New file names appearing periodically
Signs your upload might be stuck:
- ❌ Speed showing 0 bytes/s
- ❌ Progress frozen for 10+ minutes
- ❌ No new file names appearing
The "(calculating...)" message initially concerned me, but it's completely normal. The CLI recalculates totals as it processes files, especially with large video files.
Handling Common Issues
SSL Validation Errors
During my upload, I encountered several SSL validation errors:
upload failed: videos/large-file.mp4
SSL validation failed for https://... EOF occurred in violation of protocol
Solution: These are temporary network issues. Simply re-run the sync command—the failed files will be automatically retried. Don't panic; your data isn't corrupted.
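Since the fix is simply "run it again," you can automate the retries instead of babysitting the terminal. A rough sketch of a retry loop around the same sync command; the attempt count and delay are arbitrary:
# Re-run the sync until it exits cleanly, up to 5 attempts
for attempt in 1 2 3 4 5; do
  echo "Sync attempt $attempt"
  if aws s3 sync . s3://your-bucket-name/ \
    --region us-east-1 \
    --exclude ".DS_Store" \
    --exclude "*.tmp"; then
    echo "Sync completed successfully"
    break
  fi
  echo "Sync failed, retrying in 30 seconds..."
  sleep 30
done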
Verification Commands
After your upload completes, verify everything transferred correctly:
Check what's in your bucket:
aws s3 ls s3://your-bucket-name/ \
--recursive \
--human-readable \
--summarize
Find a specific file:
aws s3 ls s3://your-bucket-name/ \
--recursive | grep "filename.mp4"
Calculate total bucket size:
aws s3 ls s3://your-bucket-name/ \
--recursive \
--summarize
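A quick sanity check I found reassuring: compare the local file count and size against the bucket summary. A sketch (the exclude patterns mirror the sync command; du counts the excluded files too, so expect a small discrepancy):
# Local side: count files and total size
find . -type f ! -name ".DS_Store" ! -name "*.tmp" | wc -l
du -sh .
# S3 side: object count and total size
aws s3 ls s3://your-bucket-name/ --recursive --summarize | tail -n 2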
Performance Tips
Optimizing Upload Speed
Here's what made the biggest difference for me:
- Use an SSD - Dramatically faster than spinning HDDs
- Wired connection - Ethernet beats WiFi for reliability
- Close bandwidth-heavy apps - Pause those software updates
- Off-peak hours - Upload overnight when network is quieter
Time Estimation
Based on typical speeds, here's what to expect for a 140GB upload:
- 1 MiB/s: ~40 hours
- 2 MiB/s: ~20 hours
- 5 MiB/s: ~8 hours
- 10 MiB/s: ~4 hours
My upload averaged around 2-3 MiB/s, taking about 18 hours total (with interruptions).
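These figures are just arithmetic: size in MiB divided by speed in MiB/s gives seconds, and dividing by 3600 gives hours. A quick sketch for estimating your own transfer:
# Hours for a 140 GiB upload at 2 MiB/s: (140 * 1024 MiB) / 2 MiB/s / 3600 s per hour
echo "scale=1; 140 * 1024 / 2 / 3600" | bc
# => 19.9, in line with the ~20 hour estimate above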
Long-Running Uploads: Use Screen or Tmux
For uploads taking many hours, use screen or tmux to keep the process running even if you disconnect:
# Start a screen session
screen -S s3upload
# Run your upload command
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp"
# Detach: Press Ctrl+A, then D
# Reattach later: screen -r s3upload
This saved me multiple times when I needed to close my laptop or switch networks.
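If you prefer tmux, the equivalent workflow looks like this (the session name is arbitrary); nohup is a further fallback if no multiplexer is installed:
# tmux equivalent
tmux new -s s3upload          # start a named session, then run the sync inside it
# Detach: press Ctrl+B, then D
tmux attach -t s3upload       # reattach later

# Without a multiplexer: keep the job alive after logout and log to a file
nohup aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp" > sync.log 2>&1 &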
IAM Permissions
Ensure your IAM user or role has the necessary S3 permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
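If you manage permissions from the CLI, a policy like this can be attached as an inline policy. A sketch assuming a hypothetical IAM user named dataset-uploader and the JSON above saved as policy.json:
# Attach the policy document above to an IAM user
aws iam put-user-policy \
--user-name dataset-uploader \
--policy-name s3-dataset-upload \
--policy-document file://policy.json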
Optimization Strategies
After completing my initial upload, I discovered several ways to significantly improve upload performance. If I had known about these beforehand, my 18-hour upload could have been cut down considerably.
AWS CLI Performance Tuning
The AWS CLI has configuration options that most people don't know about. By default, it's configured conservatively, but you can tune it for better performance on fast connections.
Edit your AWS configuration file (~/.aws/config) and add these settings:
[default]
s3 =
  max_concurrent_requests = 20
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB
  use_accelerate_endpoint = true
What these settings do:
- max_concurrent_requests = 20 - Increases parallel transfers from the default 10 to 20
- max_queue_size = 10000 - Increases the task queue size for better throughput
- multipart_threshold = 64MB - Files larger than 64MB use multipart upload (default 8MB)
- multipart_chunksize = 16MB - Size of each multipart chunk (default 8MB)
- use_accelerate_endpoint = true - Automatically uses S3 Transfer Acceleration if enabled
Performance impact: On my 100Mbps connection, these settings increased my upload speed from ~2 MiB/s to ~5 MiB/s—a 150% improvement!
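If you'd rather not edit the config file by hand, the same values can be applied with aws configure set; a sketch against the default profile:
# Equivalent to editing ~/.aws/config directly
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.max_queue_size 10000
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB
aws configure set default.s3.use_accelerate_endpoint true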
Optional bandwidth limiting:
If you want to avoid saturating your connection, add this line under the same s3 block:
  max_bandwidth = 10MB/s
AWS S3 Transfer Acceleration
This is the game-changer I wish I'd known about from the start. S3 Transfer Acceleration uses Amazon CloudFront's globally distributed edge locations to accelerate uploads.
How fast is it?
- Can be 50-500% faster depending on your distance from the S3 region
- The further you are from your target region, the more dramatic the improvement
- AWS provides a speed comparison tool to test your potential gains
How to enable it:
- Enable Transfer Acceleration on your bucket:
aws s3api put-bucket-accelerate-configuration \
--bucket your-bucket-name \
--accelerate-configuration Status=Enabled
- Use the accelerated endpoint in your sync command:
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp" \
--endpoint-url https://s3-accelerate.amazonaws.com
Or simply add use_accelerate_endpoint = true to your ~/.aws/config (as shown above) and use your normal sync command.
Cost consideration:
Transfer Acceleration costs an additional $0.04 per GB for uploads (on top of standard data transfer costs). For my 140GB upload:
- Additional cost: ~$5.60
- Time saved: Potentially 6-12 hours
- Absolutely worth it for time-sensitive migrations
When to use Transfer Acceleration:
- ✅ Uploading from far away from your S3 region
- ✅ Time-sensitive migrations
- ✅ International transfers
- ✅ When upload speed is inconsistent
- ❌ Skip it if you're in the same region as your bucket (negligible benefit)
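Before kicking off an accelerated upload, you can confirm the setting actually took effect on the bucket:
# Returns {"Status": "Enabled"} once acceleration is active
aws s3api get-bucket-accelerate-configuration \
--bucket your-bucket-name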
Combined Approach
For maximum performance, combine both strategies:
# 1. Configure AWS CLI for better performance
vim ~/.aws/config
# (Add the performance settings shown above)
# 2. Enable Transfer Acceleration on bucket
aws s3api put-bucket-accelerate-configuration \
--bucket your-bucket-name \
--accelerate-configuration Status=Enabled
# 3. Run your optimized upload
aws s3 sync . s3://your-bucket-name/ \
--region us-east-1 \
--exclude ".DS_Store" \
--exclude "*.tmp"
With both optimizations, my upload speed improved from 2-3 MiB/s to 8-10 MiB/s, roughly a three- to four-fold speedup. That would have reduced my 18-hour upload to just 4-5 hours.
Key Takeaways
After going through this process, here are my main recommendations:
- Always use AWS CLI over the browser for large uploads
- Tune your AWS CLI configuration for 2-4x faster uploads (easy win!)
- Enable S3 Transfer Acceleration if speed matters more than a few dollars
- Copy to local storage first if transferring from network drives
- Don't worry about interruptions - aws s3 sync is fully resumable
- The "(calculating...)" message is normal - your upload isn't hanging
- Use screen/tmux for multi-hour transfers
- SSL errors are temporary - just re-run the command
Quick Reference Checklist
- Configure AWS CLI performance settings in ~/.aws/config
- (Optional) Enable S3 Transfer Acceleration on the bucket
- Copy data to local machine with sufficient disk space
- Navigate to the dataset directory in your terminal
- (Optional) Run with --dryrun to preview
- Execute the aws s3 sync command
- Monitor progress (speed > 0 MiB/s = working)
- If interrupted, simply re-run the same command
- Verify with aws s3 ls --summarize
Conclusion
Uploading large datasets to S3 doesn't have to be painful. With the right approach—AWS CLI, local storage, and understanding that interruptions are okay—you can reliably transfer hundreds of gigabytes without starting over or babysitting the process.
The resumable nature of aws s3 sync is particularly elegant. It removes the anxiety of long-running uploads and makes the whole process remarkably forgiving.
Have you dealt with large-scale S3 uploads? I'd love to hear about your experiences and any additional tips you've discovered.
Additional Resources
- AWS CLI S3 Sync Documentation
- AWS S3 Transfer Acceleration - For even faster uploads across long distances
- AWS CLI Installation Guide