Imagine having a tireless little helper who wakes up every hour and checks a shelf to see what’s already there. That shelf is your S3 bucket. Now, suppose this helper also receives a fresh list of packages—like new documents or files—from an external source.
Here’s the clever part: the helper compares the new packages with what’s already on the shelf and only adds the ones that are truly new. Once that job is done, they press a big friendly button to call in a second helper. This next helper’s only task is to update a colorful, ever-growing book of knowledge with the newly added items.
In essence, it’s a simple, reliable routine: one helper fetches and filters, the other updates and organizes. And the beauty is—it’s all automated. Once set up, it just works, hour after hour, without you lifting a finger.
Introduction#
Welcome to this comprehensive guide on automating knowledge base updates using AWS Lambda and S3. In this article, we’ll explore:
- The core concepts behind automating document sync pipelines
- A visual model for intuitive understanding
- Step-by-step implementation using AWS Lambda, S3, and EventBridge
- A practical runbook for deploying it in production
What You’ll Learn#
By the end of this guide, you’ll have a solid understanding of:
- Fundamentals: Automating incremental data ingestion using serverless architecture
- Implementation: Building a two-step Lambda pipeline with dependency management, version control, and event triggers
- Applications: When and where to apply this architecture (e.g., document sync, chatbot ingestion, or indexing flows)
- Best Practices: Efficient file deduplication, cost control, and reliable error handling
Prerequisites#
To follow along, you should have:
- Basic Python programming experience
- Familiarity with AWS services (Lambda, S3, IAM, EventBridge)
- An AWS account with permission to create and manage Lambda functions, EventBridge rules, and S3 buckets
Let’s dive in!
Conceptual Overview (Pencil Sketch)#
Before diving into code, here’s a simple sketch of our pipeline (a Mermaid flowchart):
flowchart TD
    A["EventBridge Schedule Trigger"] --> B["First Lambda: Fetch & Upload"]
    B -->|Check existing files| C["S3 Bucket"]
    B -->|Upload new files| C
    B -->|Trigger| D["Second Lambda: Sync Knowledge"]
    D -->|Sync new files| C

    %% Styling
    classDef lambda fill:#f3f7ff,stroke:#4a90e2,stroke-width:2px;
    classDef service fill:#fef6e4,stroke:#f5a623,stroke-width:2px;
    classDef storage fill:#eafbea,stroke:#2ecc71,stroke-width:2px;
    class A,B,D lambda;
    class C storage;
Explanation of Flow#
- EventBridge (A) triggers the First Lambda (B) on a defined schedule.
- First Lambda checks what files already exist in S3 (C).
- It fetches new data, compares, and uploads only the new files.
- Once done, First Lambda triggers the Second Lambda (D).
- Second Lambda reads the new files from S3 and resyncs them into the knowledge base.
Sequence Diagram#
For a more structured and professional illustration, here’s the same flow visualized:
sequenceDiagram
    autonumber
    participant EB as EventBridge (Scheduled Trigger)
    participant Lambda1 as First Lambda (Fetch & Upload)
    participant S3 as S3 Bucket (Document Store)
    participant KB as Bedrock Knowledge Base (Vector Reindexing)
    EB->>Lambda1: (1) Trigger every hour
    Lambda1->>S3: (2) Check existing files
    Lambda1->>S3: (3) Upload only new files
    S3->>KB: (4) Bedrock polls & reindexes new documents
Sequence Steps#
1. Scheduled Trigger (EventBridge): Runs every hour (or as configured) and invokes the first Lambda.
2. Check Existing Files (Lambda1): The Lambda lists existing objects in the S3 bucket and compares them with the incoming document list (e.g., via filenames, hashes, or metadata).
3. Upload Only New Files to S3: The Lambda uploads only new or changed documents to a pre-configured S3 folder connected to the Bedrock Knowledge Base.
4. Bedrock Knowledge Base Polls & Reindexes Automatically: Once new files land in the target folder, the Bedrock Knowledge Base performs the following steps asynchronously (a quick status-check sketch follows this list):
   - Fetches the new documents
   - Splits them into chunks
   - Embeds each chunk into vector format
   - Persists both raw content and vector embeddings in its internal vector store
   - Updates the vector index used for retrieval
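If you want to confirm that reindexing actually ran, you can inspect the ingestion job history from boto3’s bedrock-agent client. This is a minimal sketch, not part of the pipeline itself; kb_id and ds_id are placeholders you would look up in the Bedrock console.

import boto3

# Placeholders: substitute your own knowledge base and data source IDs
kb_id = 'KB_ID'
ds_id = 'DS_ID'

bedrock_agent = boto3.client('bedrock-agent')

# List recent ingestion jobs for the S3 data source and print their status
jobs = bedrock_agent.list_ingestion_jobs(knowledgeBaseId=kb_id, dataSourceId=ds_id)
for job in jobs.get('ingestionJobSummaries', []):
    print(job['ingestionJobId'], job['status'])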
How AWS Bedrock Knowledge Base Handles Ingestion and Indexing#
When you configure a Knowledge Base in AWS Bedrock with an S3 data source, here’s what happens under the hood:
1. Document Ingestion#
- The Knowledge Base service periodically polls the specified S3 path for new or updated files.
- Supported formats: .txt, .md, .csv, .html, .pdf, etc.
- You can configure file types and optionally specify prefixes (folders) for organization.
2. Chunking (Document Splitting)#
- Each file is split into smaller semantic chunks based on token count, paragraph boundaries, or configurable delimiters.
- This chunking improves retrieval granularity and relevance.
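Chunking behavior is set when you create (or update) the data source. The sketch below shows one possible configuration via boto3, assuming placeholder IDs, bucket name, and prefix; fixed-size chunking with a token budget and overlap is one of the available strategies.

import boto3

bedrock_agent = boto3.client('bedrock-agent')

# Placeholders: knowledge base ID, bucket ARN, and prefix are examples only
response = bedrock_agent.create_data_source(
    knowledgeBaseId='KB_ID',
    name='docs-s3-source',
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::my-docs-bucket',
            'inclusionPrefixes': ['daily/'],  # optional folder filter
        },
    },
    # Fixed-size chunking: ~300-token chunks with 20% overlap
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE',
            'fixedSizeChunkingConfiguration': {'maxTokens': 300, 'overlapPercentage': 20},
        }
    },
)
print(response['dataSource']['dataSourceId'])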
Runbook: Deploying the Two-Lambda Pipeline#
Step 1: Prepare Your Python Script Locally#
mkdir lambda_fetcher && cd lambda_fetcher
touch main.py requirements.txt
- main.py: Your Python code for fetching JSON files from a URL and uploading to S3.
- requirements.txt: Lists dependencies (e.g., requests, boto3).
Example:
requests
boto3
Install dependencies locally:
pip install -r requirements.txt -t ./package
cp main.py package/
cd package && zip -r ../lambda_fetcher.zip .
Step 2: Create and Deploy the First Lambda Function#
- Go to AWS Lambda Console → Create Function → Author from scratch
- Runtime: Python 3.11
- Upload your lambda_fetcher.zip
- Set environment variables in the Lambda console:
  - BUCKET_NAME
  - TARGET_URL
  - SECOND_LAMBDA_NAME (optional, if invoking manually)
- Assign an IAM role with permissions (see the policy sketch below):
  - s3:ListBucket, s3:PutObject
  - lambda:InvokeFunction (for the second Lambda)
- Set a timeout of ~2 minutes
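For the IAM bullet above, here is a minimal sketch of the inline policy attached with boto3. The role name, bucket name, and second Lambda ARN are placeholders for your own values; you can just as easily paste the equivalent JSON into the IAM console.

import json
import boto3

iam = boto3.client('iam')

# Placeholders: role name, bucket, and second Lambda ARN from your account
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": "arn:aws:s3:::my-docs-bucket"},
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::my-docs-bucket/*"},
        {"Effect": "Allow", "Action": "lambda:InvokeFunction",
         "Resource": "arn:aws:lambda:us-east-1:123456789012:function:lambda_syncer"},
    ],
}

iam.put_role_policy(
    RoleName='lambda_fetcher-role',
    PolicyName='fetcher-permissions',
    PolicyDocument=json.dumps(policy),
)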
Step 3: Schedule It with EventBridge#
- Go to EventBridge → Rules → Create Rule
- Rule type: Scheduled (cron or rate expression, e.g., rate(1 hour))
- Target: Lambda → select the first Lambda function
- Save
Now your first Lambda runs automatically on a schedule.
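If you prefer to script this step instead of clicking through the console, a minimal boto3 sketch looks like the following; the rule name, function name, and ARN are placeholders.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder ARN: adjust to your account and region
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:lambda_fetcher'

# 1. Create (or update) an hourly schedule rule
rule_arn = events.put_rule(
    Name='hourly-fetch',
    ScheduleExpression='rate(1 hour)',
    State='ENABLED',
)['RuleArn']

# 2. Point the rule at the first Lambda
events.put_targets(Rule='hourly-fetch', Targets=[{'Id': 'fetcher', 'Arn': function_arn}])

# 3. Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='lambda_fetcher',
    StatementId='allow-eventbridge-hourly-fetch',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)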
Step 4: Writing the Logic to Avoid Duplicate Uploads#
In main.py, use the following logic pattern:
import os
import boto3
import requests

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')
bucket = os.environ['BUCKET_NAME']
url = os.environ['TARGET_URL']

def lambda_handler(event, context):
    # Fetch the incoming document list (assumes JSON with file names + content)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    files = response.json()
    # Collect keys already in the bucket (paginate past the 1,000-key page limit)
    existing_files = set()
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        existing_files.update(obj['Key'] for obj in page.get('Contents', []))
    # Upload only files that are not already in the bucket
    uploaded = 0
    for file in files:
        key = file['filename']
        if key not in existing_files:
            s3.put_object(Bucket=bucket, Key=key, Body=file['content'])
            uploaded += 1
    # Optional: trigger the next Lambda asynchronously
    second_lambda = os.environ.get('SECOND_LAMBDA_NAME')
    if second_lambda:
        lambda_client.invoke(FunctionName=second_lambda, InvocationType='Event')
    return {'uploaded': uploaded}
Step 5: Create the Second Lambda (Sync Logic)#
- Create another Python Lambda (e.g., lambda_syncer)
- Upload logic to:
  - Read recently uploaded files from S3
  - Push data into your knowledge base / database / chatbot system
- Permissions: s3:GetObject
- Tip: You can use S3 prefixes or tags to isolate “new” data.
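If your target is a Bedrock Knowledge Base backed by this bucket, the “push” step can be as simple as kicking off an ingestion job. The sketch below assumes KNOWLEDGE_BASE_ID and DATA_SOURCE_ID environment variables (placeholders you would set yourself) and the bedrock:StartIngestionJob permission; if you sync into a different system, replace the Bedrock call with your own logic.

import os
import boto3

bedrock_agent = boto3.client('bedrock-agent')

def lambda_handler(event, context):
    # Ask Bedrock to re-ingest the S3 data source so new files get chunked,
    # embedded, and indexed. Both IDs are assumed to be set as env vars.
    job = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=os.environ['KNOWLEDGE_BASE_ID'],
        dataSourceId=os.environ['DATA_SOURCE_ID'],
    )
    return {'ingestionJobId': job['ingestionJob']['ingestionJobId']}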
Best Practices#
- Use prefixes or naming conventions in S3 to track versions (e.g., daily/2025-10-16/...)
- Add error handling, retries, and logging via CloudWatch
- Use AWS Parameter Store or Secrets Manager for sensitive config values
- Always test in dev with dry-run mode before deploying to prod
- Use zip deployment for Lambdas with dependencies < 250MB. For more, use container packaging.
Conclusion#
You now have a complete, hands-off solution for syncing new data files from an external source to S3, and updating your knowledge base automatically with minimal duplication or overhead. This is a simple but powerful serverless pattern for real-world AI, data pipelines, or content refresh flows.
Let the bots do the work—on time, every time.