Instagram Crawler Deployment Using AWS: Design

하리하링웹 2024. 8. 15. 13:32

Overview

For a side project currently in progress, I was tasked with deploying an Instagram crawler that uses the GPT API and ensuring it operates stably.

 

The crawler itself has been fully developed by the PM I'm collaborating with, and my responsibility is to deploy it in a stable and scalable manner.

 

We plan to use AWS for the overall deployment, with the following requirements:

  • A list of specific food influencer IDs will be provided, and the crawler needs to search the feeds for each of these IDs.
  • Each account may have more than 1000 posts, and each post may contain over 10 videos or photos.
  • The crawler uses the Rapid API, and the retrieved data is then normalized using the GPT API.
  • To minimize costs, the GPT API's batch feature will be utilized: requests are submitted together as a single batch job, and GPT makes the results available as a batch-processed output file at some point within 24 hours.
  • Based on the normalized data, the system will search for restaurant information through the Naver Map API. Multiple candidate restaurants may be returned, and the system must identify and select the most relevant one.
  • Once identified, the restaurant information will be uploaded to MongoDB.
  • The focus here is solely on the deployment process; the API implementation, normalization, and identification of the most relevant restaurant have already been programmed.
  • The deployment should prioritize cost-efficiency, and it is acceptable if the process is interrupted as long as it automatically retries.
  • The system should be configured to run at least once every few days.
  • Scalability should be considered in the deployment strategy.

Implementation

Instagram Crawling and Batch Requests

Deploy the crawler code using AWS Lambda. Each Lambda function operates on influencer IDs, sending Rapid API requests to collect feeds and then forwarding the collected data to the GPT API for processing. Since each invocation runs independently of the others, this approach is both scalable and cost-effective. For batch processing, multiple predefined influencers are grouped together into a single GPT batch request.
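
As a rough sketch of what one such crawler invocation might look like, assuming the OpenAI Batch API on the GPT side — the Rapid API host, endpoint, response fields, model name, and event shape below are all placeholders, not the project's actual code:

```python
import json
import os

import requests
from openai import OpenAI

# Placeholder Rapid API product; the real host and endpoint will differ.
RAPID_API_HOST = "instagram-scraper.p.rapidapi.com"

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def fetch_feed(influencer_id: str) -> list:
    """Fetch one influencer's posts through the Rapid API."""
    resp = requests.get(
        f"https://{RAPID_API_HOST}/user/feed",
        headers={
            "X-RapidAPI-Key": os.environ["RAPID_API_KEY"],
            "X-RapidAPI-Host": RAPID_API_HOST,
        },
        params={"user_id": influencer_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]


def handler(event, context):
    """Crawl a slice of influencer IDs and submit all posts as one GPT batch job."""
    lines = []
    for influencer_id in event["influencer_ids"]:
        for post in fetch_feed(influencer_id):
            lines.append(json.dumps({
                "custom_id": f"{influencer_id}-{post['id']}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system",
                         "content": "Extract restaurant mentions from the caption as JSON."},
                        {"role": "user", "content": post.get("caption", "")},
                    ],
                },
            }))

    # Upload the JSONL request file, then open a batch with a 24-hour window.
    batch_file = client.files.create(
        file=("batch_input.jsonl", "\n".join(lines).encode("utf-8")),
        purpose="batch",
    )
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return {"batch_id": batch.id, "request_count": len(lines)}
```

Grouping a whole slice of influencers into one batch job keeps the number of jobs to poll small, while the per-slice Lambda keeps invocations independent and individually retryable.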

Polling Batch Results

Use an EC2 Spot Instance to poll for the batch results. Choose an appropriate polling interval and compare the cost against Lambda; if Lambda turns out cheaper, use Lambda instead. Completed results are uploaded to S3. To ensure scalability, use Spot Fleet or Auto Scaling to grow the number of instances as needed. Since I haven't used this approach before, I may also explore ECS or other alternatives during development.
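
A minimal polling loop under those assumptions — the result bucket, interval, and the way pending batch IDs are handed off are all placeholders:

```python
import time

import boto3
from openai import OpenAI

client = OpenAI()
s3 = boto3.client("s3")

RESULT_BUCKET = "crawler-batch-results"  # placeholder bucket name
POLL_INTERVAL_SECONDS = 600  # ten minutes; tune this against Lambda pricing

TERMINAL_FAILURES = {"failed", "expired", "cancelled"}


def poll_once(batch_id: str) -> bool:
    """Check one batch job; on completion, copy its output file to S3.

    Returns True when the job no longer needs polling.
    """
    batch = client.batches.retrieve(batch_id)
    if batch.status in TERMINAL_FAILURES:
        return True  # in practice, log the failure and alert
    if batch.status != "completed":
        return False
    content = client.files.content(batch.output_file_id).read()
    s3.put_object(
        Bucket=RESULT_BUCKET,
        Key=f"batch-results/{batch_id}.jsonl",
        Body=content,
    )
    return True


def main():
    # Placeholder hand-off: in practice the crawler Lambdas would record
    # their batch IDs somewhere shared (S3, DynamoDB, SQS, ...).
    with open("batch_ids.txt") as f:
        pending = set(f.read().split())
    while pending:
        pending = {b for b in pending if not poll_once(b)}
        if pending:
            time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```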

Post-Processing After S3 Upload

Add a trigger to S3 that executes a Lambda function whenever a file is uploaded (the s3:ObjectCreated:* event, or s3:ObjectCreated:Put for uploads only). This Lambda function processes the GPT results and then uploads the processed data to MongoDB.
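
A sketch of that trigger handler, assuming the prompts ask GPT to reply with JSON — the environment variable, database/collection names, and document shape are illustrative:

```python
import json
import os
import urllib.parse

import boto3
from pymongo import MongoClient

s3 = boto3.client("s3")
# Created at module level so warm invocations reuse the connection.
mongo = MongoClient(os.environ["MONGODB_URI"])
collection = mongo["food"]["restaurants"]  # placeholder database/collection


def handler(event, context):
    """Fires on s3:ObjectCreated:*; parses GPT batch output and upserts into MongoDB."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        for line in body.splitlines():
            result = json.loads(line)  # one GPT response per JSONL line
            content = result["response"]["body"]["choices"][0]["message"]["content"]
            doc = json.loads(content)  # assumes the prompt requested JSON output
            # The Naver Map lookup and best-match selection (already implemented)
            # would run here before the upsert.
            collection.update_one(
                {"post_id": result["custom_id"]},
                {"$set": doc},
                upsert=True,
            )
```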

Batch Processing and Automation

Add a Lambda function at the beginning of the pipeline to slice the array of influencer IDs into batch-sized chunks and asynchronously trigger a crawler Lambda for each chunk. If issues arise, consider switching to an SQS-based approach, where the crawlers pull events from a queue instead. The entry-point function itself must run automatically; Amazon EventBridge or a cron job are possible solutions.
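
A minimal fan-out sketch — the crawler function name, batch size, and the source of the influencer ID list are placeholders:

```python
import json
import os

import boto3

lambda_client = boto3.client("lambda")

CRAWLER_FUNCTION = os.environ.get("CRAWLER_FUNCTION", "instagram-crawler")  # placeholder
BATCH_SIZE = 10  # influencers per crawler invocation; tune as needed


def handler(event, context):
    """Entry point on a schedule: fan out one async crawler invocation per slice."""
    influencer_ids = event["influencer_ids"]  # in practice, load from S3/DynamoDB
    for i in range(0, len(influencer_ids), BATCH_SIZE):
        lambda_client.invoke(
            FunctionName=CRAWLER_FUNCTION,
            InvocationType="Event",  # async; failed events are retried automatically
            Payload=json.dumps({"influencer_ids": influencer_ids[i:i + BATCH_SIZE]}),
        )
```

An EventBridge schedule rule such as rate(3 days) pointed at this function would satisfy the "at least once every few days" requirement.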

Error Handling and Permissions

Manage overall error handling with CloudWatch alarms. Store failure logs in S3, and set up IAM roles to control access between the different resources. Any additional issues will be addressed as they arise during development.
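
For instance, an alarm on the crawler's Lambda error metric could be created with boto3 like this — the function name and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert whenever the crawler Lambda reports any error in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="instagram-crawler-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "instagram-crawler"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-2:123456789012:crawler-alerts"],  # placeholder ARN
)
```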

 

Validation

After deployment, run several tests to verify that everything functions as intended: check how the system handles failures and confirm that it runs correctly on its schedule. Once testing is complete, deploy the system to production and keep watching it through monitoring dashboards and alarms.

Conclusion

In practice, even if development follows the design outlined above, many unforeseen issues are likely to occur, and adjustments to the design will inevitably be necessary. Since this is the initial design, development will proceed according to this flow, and any issues will be addressed as they occur. I plan to cover the actual implementation process in a follow-up post.
