Building a Scheduled Scraping Environment with SAM

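I'm 鎌田(義), currently in training in the Application Services Department.

In this post I use the AWS Serverless Application Model (SAM)
to build an application that runs a scraping job on a schedule.
I develop it locally, test it, and then deploy it to AWS.

At our company we mostly use the Serverless Framework,
but since I have no experience with either tool yet, I chose SAM this time.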

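Overview of the finished app
- Architecture diagram
- Folder structure
Prerequisites
Steps
Verification
- Changing the cron setting / deploying
- Checking the results
Cleanup
- sam delete
- Deleting the test resources
Closing thoughts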

Overview of the finished app

Architecture diagram

Folder structure

The final folder structure is as follows.

sam-app
├─ .aws-sam/
├─ events/event.json
├─ functions/
│    ├─ extract_save/
│    └─ scraping/
├─ layers/
│    ├─ headless/bin/
│    │    ├─ chromedriver
│    │    └─ headless-chromium
│    └─ selenium/lib/python3.7/site-packages/...
├─ tests/
├─ __init__.py
├─ .gitignore
├─ README.md
├─ samconfig.toml
└─ template.yaml

Prerequisites

This article uses the SAM CLI.
Local testing also requires Docker Desktop.
Install both by following the documentation below.
docs.aws.amazon.com

docs.aws.amazon.com

SAM itself has been introduced in an earlier post on our blog.
blog.serverworks.co.jp

Scraping is very convenient,
but some sites prohibit it.
Please comply with each site's terms of use, avoid putting load on a site with excessive access,
and use the techniques here at your own risk.

Steps

Preparing the SAM sample app

sam init

I could not find a way to run Selenium on Lambda with anything other than Python 3.7,
so I selected Python 3.7.

$ sam init

You can preselect a particular runtime or package type when using the `sam init` experience.
Call `sam init --help` to learn more.

Which template source would you like to use?
        1 - AWS Quick Start Templates
        2 - Custom Template Location
Choice: 1

Choose an AWS Quick Start application template
        1 - Hello World Example
        2 - Multi-step workflow
        3 - Serverless API
        4 - Scheduled task
        5 - Standalone function
        6 - Data processing
        7 - Hello World Example With Powertools
        8 - Infrastructure event management
        9 - Serverless Connector Hello World Example
        10 - Multi-step workflow with Connectors
        11 - Lambda EFS example
        12 - DynamoDB Example
        13 - Machine Learning
Template: 1

Use the most popular runtime and package type? (Python and zip) [y/N]:

Which runtime would you like to use?
        1 - aot.dotnet7 (provided.al2)
        2 - dotnet6
        3 - dotnet5.0
        4 - dotnetcore3.1
        5 - go1.x
        6 - go (provided.al2)
        7 - graalvm.java11 (provided.al2)
        8 - graalvm.java17 (provided.al2)
        9 - java11
        10 - java8.al2
        11 - java8
        12 - nodejs18.x
        13 - nodejs16.x
        14 - nodejs14.x
        15 - nodejs12.x
        16 - python3.9
        17 - python3.8
        18 - python3.7
        19 - python3.10
        20 - ruby2.7
        21 - rust (provided.al2)
Runtime: 18

What package type would you like to use?
        1 - Zip
        2 - Image
Package type: 1

Based on your selections, the only dependency manager available is pip.
We will proceed copying the template using pip.

Would you like to enable X-Ray tracing on the function(s) in your application?  [y/N]:

Would you like to enable monitoring using CloudWatch Application Insights?
For more info, please view https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-application-insights.html [y/N]:

Project name [sam-app]:

    -----------------------
    Generating application:
    -----------------------
    Name: sam-app
    Runtime: python3.7
    Architectures: x86_64
    Dependency Manager: pip
    Application Template: hello-world
    Output Directory: .
    Configuration file: sam-app/samconfig.toml

    Next steps can be found in the README file at sam-app/README.md


Commands you can use next
=========================
[*] Create pipeline: cd sam-app && sam pipeline init --bootstrap
[*] Validate SAM template: cd sam-app && sam validate
[*] Test Function in the Cloud: cd sam-app && sam sync --stack-name {stack-name} --watch


SAM CLI update available (1.82.0); (1.78.0 installed)
To download: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html

sam local invoke

Confirm that hello world is displayed.
Docker Desktop must be running.

$ cd sam-app
$ sam local invoke
Invoking app.lambda_handler (python3.7)
Local image is out of date and will be updated to the latest runtime. To skip this, pass in the parameter --skip-pull-image
Building image.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Using local image: public.ecr.aws/lambda/python:3.7-rapid-x86_64.

Mounting /**/sam_scraping_app/sam-app/hello_world as /var/task:ro,delegated, inside runtime container
START RequestId: ff3e93e8-d25b-4a5f-8c7f-fdd2b9d650a6 Version: $LATEST
END RequestId: ff3e93e8-d25b-4a5f-8c7f-fdd2b9d650a6
REPORT RequestId: ff3e93e8-d25b-4a5f-8c7f-fdd2b9d650a6  Init Duration: 0.10 ms  Duration: 91.95 ms      Billed Duration: 92 ms      Memory Size: 128 MB     Max Memory Used: 128 MB
{"statusCode": 200, "body": "{\"message\": \"hello world\"}"}

On Windows, if an error like the following appears,
SAM may be unable to connect to the container.

No response from invoke container for HelloWorldFunction

Adding the following option to "samconfig.toml",
or running "sam local invoke --container-host-interface 0.0.0.0", may resolve it.

[default.local_invoke.parameters]
container_host_interface = "0.0.0.0"

Preparing IAM for SAM

Grant permissions to the IAM user that SAM will use, referring to the document below.

docs.aws.amazon.com

For this article I created an IAM user named "sam_developer" and attached the following policies.

AWSCloudFormationFullAccess
IAMFullAccess
AWSLambda_FullAccess
AmazonS3FullAccess
CloudWatchEventsFullAccess
AmazonDynamoDBFullAccess

Create an access key and add the credentials to ~/.aws/config on your local machine.

[profile dev_sam]
aws_access_key_id = <ACCESS_KEY>
aws_secret_access_key = <SECRET_ACCESS_KEY>
region = ap-northeast-1

From here on, this profile is used for every connection to AWS.

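Creating a test S3 bucket

This article integrates with S3 and DynamoDB,
but instead of running S3 and DynamoDB locally, I create the resources in AWS and test against them.

Create a test S3 bucket with any name you like.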

$ aws s3 mb s3://<BUCKET_NAME> --profile dev_sam

Preparing the Lambda layers

Create a Selenium layer for the selenium package,
and a Headless layer for the headless-chromium binary and the Chrome driver that selenium uses.

# Create directories for the Lambda layers
$ mkdir -p layers/selenium
$ mkdir -p layers/headless/bin

# Install Selenium, pinned to a version that runs in the Lambda environment
$ pip install -t layers/selenium/lib/python3.7/site-packages/ selenium==3.141.0

# Install headless-chromium
$ curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
$ unzip headless-chromium.zip -d layers/headless/bin/
$ rm headless-chromium.zip

# Install the Chrome driver
# headless-chromium and the driver must be matching versions, so
# the driver for Chrome version 64.0.3282.186 is installed here.
$ curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
$ unzip chromedriver.zip -d layers/headless/bin/
$ rm chromedriver.zip

Creating the scraping function

# Create a directory for the Lambda functions
$ mkdir functions

# Move the initial directory generated while preparing the sample app
$ mv hello_world/ functions/scraping

Edit functions/scraping/app.py.

import logging
import os
  
import boto3
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
  
logger = logging.getLogger()
logger.setLevel(logging.INFO)
  
s3 = boto3.client('s3')
  
def set_driver_options():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--single-process')
    options.add_argument('--no-sandbox')
    options.binary_location = '/opt/python/bin/headless-chromium'
    return options
  
def put_object(bucket, key, html):
    logger.info('put_object. bucket=%s, key=%s', bucket, key)
  
    res = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=html,
    )
    logger.info('put_object. status=%s', res['ResponseMetadata']['HTTPStatusCode'])
    return res
  
def scraping(url, options):
    logger.info('scraping url %s', url)
    driver = webdriver.Chrome(
        executable_path='/opt/python/bin/chromedriver',
        options=options
    )
  
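    try:
        driver.get(url)
        # Wait up to 10 seconds for the JavaScript-rendered element to appear
        WebDriverWait(driver, 10).until(
            expected_conditions.visibility_of_element_located((By.CLASS_NAME, 'list-media-blog_inner'))
        )
        html = driver.page_source
    finally:
        # quit() also stops the chromedriver process, even if the wait times out
        driver.quit()
    logger.info('scraping success')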
  
    return html
  
def lambda_handler(event, context):
    bucket = os.environ.get('BUCKET')
    key = os.environ.get('OBJECT_KEY')
  
    html = scraping(os.environ.get('URL'), set_driver_options())
  
    res = put_object(bucket, key, html)
  
    return 'scraping done'

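The function opens our blog page (https://www.serverworks.co.jp/blog/),
takes the HTML source, and uploads it to S3.
Because I want an element (list-media-blog_inner) that only appears after the browser has executed JavaScript,
the WebDriverWait call waits up to 10 seconds for that element to become visible.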
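
To make the waiting behaviour concrete, here is a minimal, Selenium-free sketch of the idea behind an explicit wait: poll a condition until it returns a truthy value or a timeout elapses. The names wait_until, poll_interval, and fake_condition are hypothetical and exist only for this illustration. WebDriverWait's real implementation differs in detail.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Hypothetical helper that mimics the idea of an explicit wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Simulate an element that only "appears" on the third check.
calls = {'count': 0}


def fake_condition():
    calls['count'] += 1
    return 'element' if calls['count'] >= 3 else None


found = wait_until(fake_condition, timeout=2.0, poll_interval=0.01)
print(found, calls['count'])
```

The point of the pattern is that the function returns as soon as the element is available instead of always sleeping for the full 10 seconds.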

Edit template.yaml.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  scraping sample app
  
Parameters:
  Bucket:
    Type: String
  Env:
    Type: String
    AllowedValues:
      - local
      - prd
  Url:
    Type: String
    Default: https://www.serverworks.co.jp/blog/
  Key:
    Type: String
    Default: swx_blog.html
  
Resources:
  ScrapingFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/scraping/
      Handler: app.lambda_handler
      Runtime: python3.7
      Timeout: 60
      MemorySize: 512
      Environment:
        Variables:
          BUCKET: !Ref Bucket
          ENV: !Ref Env
          URL: !Ref Url
          OBJECT_KEY: !Sub "${Env}/${Key}"
      Layers:
        - !Ref SeleniumLayer
        - !Ref HeadlessLayer
      Policies:
        - S3CrudPolicy:
            BucketName: !Ref Bucket
  
  SeleniumLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: layers/selenium
      CompatibleRuntimes:
        - python3.7
    Metadata:
      BuildMethod: python3.7
  
  HeadlessLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: layers/headless
      CompatibleRuntimes:
        - python3.7
    Metadata:
      BuildMethod: python3.7

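The directories prepared in the previous section are specified
so that the Selenium and Headless layers are built from them.

Append the following to the end of samconfig.toml.
When sam local invoke runs, these values override the template parameters, which in turn set the function's environment variables.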

[default.local_invoke.parameters]
parameter_overrides = "Bucket=\"<BUCKET_NAME>\" Env=\"local\""

Change requirements.txt to the following.
requests==2.30.0 was released on 2023/05/03, but
it fails here because support for OpenSSL versions older than 1.1.1 was dropped, so 2.29.0 is pinned as a workaround.

requests==2.29.0

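Testing the scraping function

# Build based on template.yaml.
$ sam build --use-container

With the --use-container option, the build runs inside a Docker container that resembles the Lambda environment.
Because this project uses python3.7, omitting the option
causes an error if your local Python version differs.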

docs.aws.amazon.com

$ sam local invoke ScrapingFunction --profile dev_sam

If logs like the following are printed, it worked.

START RequestId: 71c0a75f-d433-48e3-a3c2-635d07e2b5bf Version: $LATEST
[INFO]  2023-05-08T06:46:09.443Z                Found credentials in environment variables.

[INFO]  2023-05-08T06:46:09.504Z        71c0a75f-d433-48e3-a3c2-635d07e2b5bf    scraping url https://www.serverworks.co.jp/blog/

[INFO]  2023-05-08T06:46:12.418Z        71c0a75f-d433-48e3-a3c2-635d07e2b5bf    scraping success

[INFO]  2023-05-08T06:46:12.418Z        71c0a75f-d433-48e3-a3c2-635d07e2b5bf    put_object. bucket=<BUCKET_NAME>, key=local/swx_blog.html

[INFO]  2023-05-08T06:46:12.615Z        71c0a75f-d433-48e3-a3c2-635d07e2b5bf    put_object. status=200

"scraping done"END RequestId: 71c0a75f-d433-48e3-a3c2-635d07e2b5bf
REPORT RequestId: 71c0a75f-d433-48e3-a3c2-635d07e2b5bf  Init Duration: 0.18 ms  Duration: 3419.34 ms    Billed Duration: 3420 ms    Memory Size: 512 MB     Max Memory Used: 512 MB

Confirm that the object has been saved in the S3 bucket created earlier.

$ aws s3 ls --recursive s3://<BUCKET_NAME> --profile dev_sam
2023-05-12 16:46:23      31159 local/swx_blog.html

Creating a test DynamoDB table

Create a DynamoDB table with any name you like.
Here the partition key is the posting date (posted_at) and the sort key is the article title (title).

aws dynamodb create-table \
--profile dev_sam \
--table-name <TABLE_NAME> \
--attribute-definitions \
  AttributeName=posted_at,AttributeType=S \
  AttributeName=title,AttributeType=S \
--key-schema AttributeName=posted_at,KeyType=HASH AttributeName=title,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1 \
--table-class STANDARD
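
For reference, the same table definition can be expressed as the keyword arguments that boto3's create_table call accepts. This is only a sketch: build_create_table_params and the table name example-blog-table are hypothetical, and the actual API call is left commented out so the snippet runs without AWS access.

```python
def build_create_table_params(table_name):
    """Return keyword arguments mirroring the CLI command above."""
    return {
        'TableName': table_name,
        'AttributeDefinitions': [
            {'AttributeName': 'posted_at', 'AttributeType': 'S'},
            {'AttributeName': 'title', 'AttributeType': 'S'},
        ],
        'KeySchema': [
            {'AttributeName': 'posted_at', 'KeyType': 'HASH'},  # partition key
            {'AttributeName': 'title', 'KeyType': 'RANGE'},     # sort key
        ],
        'ProvisionedThroughput': {'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
        'TableClass': 'STANDARD',
    }


params = build_create_table_params('example-blog-table')  # hypothetical table name
print(params['KeySchema'])

# With credentials configured, the table could then be created like this
# (commented out so the sketch runs without AWS access):
# import boto3
# session = boto3.Session(profile_name='dev_sam')
# session.client('dynamodb').create_table(**params)
```

Because the key is the pair (posted_at, title), writing the same article twice overwrites the existing item rather than creating a duplicate.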

Creating the extraction function

# Create a directory for the extraction function
$ mkdir -p functions/extract_save

Create functions/extract_save/app.py.

import logging
import os
import urllib.parse
  
import boto3
from bs4 import BeautifulSoup
  
logger = logging.getLogger()
logger.setLevel(logging.INFO)
  
s3 = boto3.client('s3')
dynamodb = boto3.client('dynamodb')
  
def fetch_object(bucket, key):
    logger.info('fetch_object. bucket=%s, key=%s', bucket, key)
    response = s3.get_object(Bucket=bucket, Key=key)
    logger.info('fetch_object. status=%s', response['ResponseMetadata']['HTTPStatusCode'])
    return response['Body'].read().decode('utf-8')
  
def delete_object(bucket, key):
    logger.info('delete_object. bucket=%s, key=%s', bucket, key)
    response = s3.delete_object(Bucket=bucket, Key=key)
    logger.info('delete_object. status=%s', response['ResponseMetadata']['HTTPStatusCode'])
  
def extract(html):
    soup = BeautifulSoup(html, 'html.parser')
    blog_list = soup.find(id='tech_entry_list').find_all(class_='list-media-blog_col')
  
    blogs = []
    for blog_elem in blog_list:
        posted_at, title = blog_elem.stripped_strings
        blogs.append({
            'posted_at': posted_at,
            'title': title,
        })
        logger.info('extract. blog_data={posted_at: %s, title: %s}', posted_at, title)
  
    return blogs
  
def batch_write(blogs):
    table_name = os.environ.get('TABLE_NAME')
    items = {table_name: []}
  
    logger.info('batch_write. table=%s, items=%s', table_name, blogs)
    for blog in blogs:
        items[table_name].append(
            {
                'PutRequest': {
                    'Item': {
                        'posted_at': {'S': blog['posted_at']},
                        'title': {'S': blog['title']}
                    }
                }
            }
        )
    options = {'RequestItems': items}
    response = dynamodb.batch_write_item(**options)
    logger.info('batch_write. status=%s', response['ResponseMetadata']['HTTPStatusCode'])
  
def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
  
    html = fetch_object(bucket, key)
    blogs = extract(html)
    batch_write(blogs)
    delete_object(bucket, key)
  
    return 'extract save done'

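Using the bucket name and object key from the event, the function fetches the target object from S3
and extracts only the posting date and article title from the HTML source.
After saving the extracted values to the DynamoDB table, it deletes the target object from S3.

Only the date and title of each article, the area outlined in red on the page, are extracted.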
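
One caveat worth keeping in mind: DynamoDB's BatchWriteItem accepts at most 25 put or delete requests per call. The page scraped here yields only a handful of entries, so the function above stays under that limit, but if the list grew, the requests would need to be split. The following is a minimal sketch of that splitting. chunk_put_requests and the sample data are hypothetical and not part of the original function.

```python
def chunk_put_requests(blogs, batch_size=25):
    """Yield lists of PutRequest entries, each no longer than batch_size."""
    put_requests = [
        {
            'PutRequest': {
                'Item': {
                    'posted_at': {'S': blog['posted_at']},
                    'title': {'S': blog['title']},
                }
            }
        }
        for blog in blogs
    ]
    for start in range(0, len(put_requests), batch_size):
        yield put_requests[start:start + batch_size]


# 60 made-up entries are split into batches of 25, 25 and 10.
sample = [{'posted_at': '2023.05.12', 'title': f'title {i}'} for i in range(60)]
batches = list(chunk_put_requests(sample))
print([len(batch) for batch in batches])

# Each batch could then be sent with:
# dynamodb.batch_write_item(RequestItems={table_name: batch})
```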

Add the following to template.yaml.

Parameters:
  Bucket:
    Type: String
  Env:
    Type: String
    AllowedValues:
      - local
      - prd
  Url:
    Type: String
    Default: https://www.serverworks.co.jp/blog/
  Key:
    Type: String
    Default: swx_blog.html
  TableName:
    Type: String
  
~~~ (omitted) ~~~
  
  ExtractSaveFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/extract_save/
      Handler: app.lambda_handler
      Runtime: python3.9
      Timeout: 5
      MemorySize: 256
      Events:
        S3Event:
          Type: S3
          Properties:
            Bucket: !Ref HtmlBucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: !Ref Env
                  - Name: suffix
                    Value: !Ref Key
      Environment:
        Variables:
          TABLE_NAME: !Ref TableName
      Policies:
        - S3CrudPolicy:
            BucketName: !Ref Bucket
        - DynamoDBCrudPolicy:
            TableName: !Ref TableName

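The S3 trigger is configured to start the function when an object whose key begins with the Env value
and ends with swx_blog.html (for example local/swx_blog.html) is created in the specified S3 bucket.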
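
The effect of combining a prefix rule and a suffix rule can be sketched in a few lines of Python: a key triggers the function only when both rules match. matches_s3_filter is a hypothetical helper written for this illustration, not an AWS API.

```python
def matches_s3_filter(key, prefix, suffix):
    """Return True when key satisfies both the prefix and the suffix rule."""
    return key.startswith(prefix) and key.endswith(suffix)


# Values taken from the template parameters: Env=local, Key=swx_blog.html
prefix, suffix = 'local', 'swx_blog.html'

for key in ['local/swx_blog.html', 'prd/swx_blog.html', 'local/other.html']:
    print(key, matches_s3_filter(key, prefix, suffix))
```

Only the first key matches, so uploads under another environment prefix or with a different file name do not start the function.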

Update samconfig.toml as well.

[default.local_invoke.parameters]
parameter_overrides = "Bucket=\"<BUCKET_NAME>\" Env=\"local\" TableName=\"<TABLE_NAME>\""

Create requirements.txt in functions/extract_save/.

beautifulsoup4==4.11.1
requests==2.29.0

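The extraction function is triggered by an S3 event,
and Lambda receives the event information as an argument,
so local tests use an event.json file.
Modify events/event.json for the S3 trigger as shown below.

Replace <BUCKET_NAME> with the name of the bucket you created.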

{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "1970-01-01T00:00:00.000Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "EXAMPLE"
      },
      "requestParameters": {
        "sourceIPAddress": "127.0.0.1"
      },
      "responseElements": {
        "x-amz-request-id": "EXAMPLE123456789",
        "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "testConfigRule",
        "bucket": {
          "name": "<BUCKET_NAME>",
          "ownerIdentity": {
            "principalId": "EXAMPLE"
          },
          "arn": "arn:aws:s3:::<BUCKET_NAME>"
        },
        "object": {
          "key": "local%2Fswx_blog.html",
          "size": 1024,
          "eTag": "0123456789abcdef0123456789abcdef",
          "sequencer": "0A1B2C3D4E5F678901"
        }
      }
    }
  ]
}
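
The object key in this event is URL-encoded (local%2Fswx_blog.html), which is why the handler passes it through urllib.parse.unquote_plus before calling S3. A short sketch of what that decoding does, using the key from the event above plus one made-up key containing a "+":

```python
import urllib.parse

encoded_key = 'local%2Fswx_blog.html'  # the value used in events/event.json
decoded_key = urllib.parse.unquote_plus(encoded_key, encoding='utf-8')
print(decoded_key)  # local/swx_blog.html

# unquote_plus also turns "+" into a space (the second key is a made-up example).
spaced_key = urllib.parse.unquote_plus('local/my+file.html')
print(spaced_key)  # local/my file.html
```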

Testing the extraction function

# Build based on template.yaml.
$ sam build --use-container

Test it locally.

$ sam local invoke ExtractSaveFunction --profile dev_sam --event events/event.json

If logs like the following are printed, it worked.

START RequestId: 8b52be7c-c248-4fff-8a8f-f3c812d8e3ae Version: $LATEST
[INFO]  2023-05-12T07:51:45.041Z                Found credentials in environment variables.
[INFO]  2023-05-12T07:51:45.128Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    fetch_object. bucket=<BUCKET_NAME>, key=local/swx_blog.html
[INFO]  2023-05-12T07:51:45.329Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    fetch_object. status=200
[INFO]  2023-05-12T07:51:45.348Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    extract. blog_data={posted_at: 2023.05.12, title: AWS Summitでご紹介した【Amazon Connect 自動化事例】を動画でご覧いただけるようになりました}
[INFO]  2023-05-12T07:51:45.349Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    extract. blog_data={posted_at: 2023.05.10, title: セッションマネージャーのアイドルタイムアウト設定を理解する。}
[INFO]  2023-05-12T07:51:45.349Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    extract. blog_data={posted_at: 2023.05.09, title: ALBのX-Forwarded-Forオプションの挙動を見てみる}
[INFO]  2023-05-12T07:51:45.349Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    extract. blog_data={posted_at: 2023.05.09, title: 【ビブリオラジオ#4】コミュニケーションとITの原理原則、あるいはイキり散らしていた僕たちの黒歴史『人を動かす』他}
[INFO]  2023-05-12T07:51:45.349Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    extract. blog_data={posted_at: 2023.05.09, title: AWSリソース間をインターネット経由で通信したらどこ通るか見てみる}
[INFO]  2023-05-12T07:51:45.349Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    batch_write. table=test-swx-blog-table, items=[{'posted_at': '2023.05.12', 'title': 'AWS Summitでご紹介した【Amazon Connect 自動化事例】を動画でご覧いただけるようになりました'}, {'posted_at': '2023.05.10', 'title': 'セッションマネージャーのアイドルタイムアウト設定を理解する。'}, {'posted_at': '2023.05.09', 'title': 'ALBのX-Forwarded-Forオプションの挙動を見てみる'}, {'posted_at': '2023.05.09', 'title': '【ビブリオラジオ#4】コミュニケーションとITの原理原則、あるいはイキり散らしていた僕たちの黒歴史『人を動かす』他'}, {'posted_at': '2023.05.09', 'title': 'AWSリソース間をインターネット経由で通信したらどこ通るか見てみる'}]
[INFO]  2023-05-12T07:51:45.478Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    batch_write. status=200
[INFO]  2023-05-12T07:51:45.478Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    delete_object. bucket=<BUCKET_NAME>, key=local/swx_blog.html
[INFO]  2023-05-12T07:51:45.522Z        8b52be7c-c248-4fff-8a8f-f3c812d8e3ae    delete_object. status=204
END RequestId: 8b52be7c-c248-4fff-8a8f-f3c812d8e3ae
REPORT RequestId: 8b52be7c-c248-4fff-8a8f-f3c812d8e3ae  Init Duration: 0.18 ms  Duration: 859.68 ms     Billed Duration: 860 ms     Memory Size: 256 MB     Max Memory Used: 256 MB

Confirm that the items have been saved in the DynamoDB table created earlier.
Because there are only a few items, a scan is used to display them.
Five items should be registered.

$ aws dynamodb scan --table-name <TABLE_NAME> --profile dev_sam

The function also deletes the S3 object at the end, so confirm that it is gone.

$ aws s3 ls --recursive s3://<BUCKET_NAME> --profile dev_sam

Preparing the template for deployment

Next, update template.yaml in preparation for deploying to AWS.

Final version of template.yaml

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  scraping sample app
  
Parameters:
  Bucket:
    Type: String
  Env:
    Type: String
    AllowedValues:
      - local
      - prd
  Url:
    Type: String
    Default: https://www.serverworks.co.jp/blog/
  Key:
    Type: String
    Default: swx_blog.html
  TableName:
    Type: String
  
Resources:
  HtmlBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref Bucket
  
  ScrapingFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/scraping/
      Handler: app.lambda_handler
      Runtime: python3.7
      Timeout: 60
      MemorySize: 512
      Events:
        ScrapeStartSchedule:
          Type: Schedule
          Properties:
            Schedule: 'cron(0 9 * * ? *)' # UTC timezone
            Name: DailyScrapingSchedule
            Description: Daily Scraping Schedule
            Enabled: true
      Environment:
        Variables:
          BUCKET: !Ref Bucket
          ENV: !Ref Env
          URL: !Ref Url
          OBJECT_KEY: !Sub "${Env}/${Key}"
      Layers:
        - !Ref SeleniumLayer
        - !Ref HeadlessLayer
      Policies:
        - S3CrudPolicy:
            BucketName: !Ref Bucket
  
  ExtractSaveFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/extract_save/
      Handler: app.lambda_handler
      Runtime: python3.9
      Timeout: 5
      MemorySize: 256
      Events:
        S3Event:
          Type: S3
          Properties:
            Bucket: !Ref HtmlBucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: !Ref Env
                  - Name: suffix
                    Value: !Ref Key
      Environment:
        Variables:
          TABLE_NAME: !Ref TableName
      Policies:
        - S3CrudPolicy:
            BucketName: !Ref Bucket
        - DynamoDBCrudPolicy:
            TableName: !Ref TableName
  
  SeleniumLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: layers/selenium
      CompatibleRuntimes:
        - python3.7
    Metadata:
      BuildMethod: python3.7
  
  HeadlessLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: layers/headless
      CompatibleRuntimes:
        - python3.7
    Metadata:
      BuildMethod: python3.7
  
  BlogTable:
    Type: AWS::DynamoDB::Table
    Properties: 
      TableName: !Ref TableName
      BillingMode: PROVISIONED
      ProvisionedThroughput:
        ReadCapacityUnits: 1
        WriteCapacityUnits: 1
      AttributeDefinitions:
        - AttributeName: posted_at
          AttributeType: S
        - AttributeName: title
          AttributeType: S
      KeySchema:
        - AttributeName: posted_at
          KeyType: HASH
        - AttributeName: title
          KeyType: RANGE

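A few points to note:
When S3 is used as a Lambda trigger event, the S3 resource must be defined in the same template.
EventBridge cron expressions cannot specify both day-of-month and day-of-week, so the one you do not specify must be "?".
The schedule is evaluated in UTC, so the example above, cron(0 9 * * ? *), runs every day at 09:00 UTC, which is 18:00 JST.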
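
Converting a JST time to the UTC cron expression by hand is easy to get wrong, so here is a minimal sketch of the conversion. daily_cron_from_jst is a hypothetical helper, and JST is modelled as a fixed UTC+9 offset.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), 'JST')


def daily_cron_from_jst(hour, minute=0):
    """Build a daily EventBridge cron expression (UTC) from a JST wall-clock time."""
    # The date is arbitrary; only the time of day matters for a daily schedule.
    local = datetime(2023, 5, 12, hour, minute, tzinfo=JST)
    utc = local.astimezone(timezone.utc)
    # Fields: minutes hours day-of-month month day-of-week year
    return f'cron({utc.minute} {utc.hour} * * ? *)'


print(daily_cron_from_jst(18))     # cron(0 9 * * ? *)
print(daily_cron_from_jst(8, 30))  # cron(30 23 * * ? *)
```

The second example shows a JST morning time landing on the previous day in UTC, which does not matter for a daily schedule but would for one tied to a specific day.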

Update the deploy parameters section of samconfig.toml.
Because the resources are newly created, specify a BUCKET_NAME and TABLE_NAME different from the test ones.

[default.deploy.parameters]
~~
region = "ap-northeast-1"
parameter_overrides = "Bucket=\"<PRD_BUCKET_NAME>\" Env=\"prd\" TableName=\"<PRD_TABLE_NAME>\""

Deploying

$ sam build --use-container

$ sam deploy -g --profile dev_sam
# If template.yaml has Parameters, you are prompted for their values, but
# when the deploy parameters in samconfig.toml already set them,
# the values are pre-filled, so just press Enter to continue.

Configuring SAM deploy
======================

        Looking for config file [samconfig.toml] :  Found
        Reading default arguments  :  Success

        Setting default arguments for 'sam deploy'
        =========================================
        Stack Name [sam-scraping-app]: [specify any name]
        AWS Region [ap-northeast-1]:
        Parameter Bucket [<BUCKET_NAME>]:
        Parameter Env [prd]:
        Parameter Url [https://www.serverworks.co.jp/blog/]:
        Parameter Key [swx_blog.html]:
        Parameter TableName [<TABLE_NAME>]:
        #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
        Confirm changes before deploy [Y/n]:
        #SAM needs permission to be able to create roles to connect to the resources in your template
        Allow SAM CLI IAM role creation [Y/n]:
        #Preserves the state of previously provisioned resources when an operation fails
        Disable rollback [y/N]:
        Save arguments to configuration file [Y/n]:
        SAM configuration file [samconfig.toml]:
        SAM configuration environment [default]:

~~~
~~~
# You are asked whether to apply the added AWS resources as shown below, so press "y".

CloudFormation stack changeset
-------------------------------------------------------------------------------------------------------------------------
Operation                      LogicalResourceId              ResourceType                   Replacement
-------------------------------------------------------------------------------------------------------------------------
+ Add                          BlogTable                      AWS::DynamoDB::Table           N/A
+ Add                          ExtractSaveFunctionRole        AWS::IAM::Role                 N/A
+ Add                          ExtractSaveFunctionS3EventPe   AWS::Lambda::Permission        N/A
                               rmission
+ Add                          ExtractSaveFunction            AWS::Lambda::Function          N/A
+ Add                          HeadlessLayeraedcfed2a8        AWS::Lambda::LayerVersion      N/A
+ Add                          HtmlBucket                     AWS::S3::Bucket                N/A
+ Add                          ScrapingFunctionRole           AWS::IAM::Role                 N/A
+ Add                          ScrapingFunctionScrapeStartS   AWS::Lambda::Permission        N/A
                               chedulePermission
+ Add                          ScrapingFunctionScrapeStartS   AWS::Events::Rule              N/A
                               chedule
+ Add                          ScrapingFunction               AWS::Lambda::Function          N/A
+ Add                          SeleniumLayercf30d3959a        AWS::Lambda::LayerVersion      N/A
-------------------------------------------------------------------------------------------------------------------------

Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]: y

~~~
~~~

Successfully created/updated stack - sam-scraping-app in ap-northeast-1

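Once "Successfully" is displayed, the deployment is complete.

Verification

Changing the cron setting / deploying

The function cannot be verified until the time set in the cron trigger arrives,
so, also as a way to try deploying a change, change the Schedule in template.yaml to a few minutes from now.
After making the change, run the deploy command as before.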

$ sam deploy -g --profile dev_sam

~~~
~~~
# The changes are displayed as shown below; if they look fine, press "y".

Waiting for changeset to be created..

CloudFormation stack changeset
-------------------------------------------------------------------------------------------------------------------------
Operation                      LogicalResourceId              ResourceType                   Replacement
-------------------------------------------------------------------------------------------------------------------------
* Modify                       ScrapingFunctionScrapeStartS   AWS::Lambda::Permission        True
                               chedulePermission
* Modify                       ScrapingFunctionScrapeStartS   AWS::Events::Rule              True
                               chedule
-------------------------------------------------------------------------------------------------------------------------

Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]: y
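
For the schedule change made at the start of this step, the "few minutes from now" value has to be written in UTC. A small sketch of generating such an expression follows. cron_minutes_from is a hypothetical helper, and a fixed example time is used so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone


def cron_minutes_from(now_utc, minutes=5):
    """Return a daily cron expression that first fires `minutes` after now_utc."""
    target = now_utc + timedelta(minutes=minutes)
    return f'cron({target.minute} {target.hour} * * ? *)'


# Fixed example time so the output is reproducible: 2023-05-12 07:58 UTC.
example_now = datetime(2023, 5, 12, 7, 58, tzinfo=timezone.utc)
print(cron_minutes_from(example_now))  # cron(3 8 * * ? *)

# In practice you would pass the current time instead:
# print(cron_minutes_from(datetime.now(timezone.utc)))
```

Leave enough margin for the deployment itself to finish before the scheduled minute arrives.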

Checking the results

Confirm that items have been added to the table.

$ aws dynamodb scan --table-name <TABLE_NAME> --profile dev_sam

Check from the console as well.

Confirm that the logs from the scraping run appear in CloudWatch Logs.

Cleanup

sam delete

# Delete the resources that were created.
$ sam delete
        Are you sure you want to delete the stack sam-scraping-app in the region ap-northeast-1 ? [y/N]: y
        Are you sure you want to delete the folder sam-scraping-app in S3 which contains the artifacts? [y/N]: y
        - Deleting S3 object with key sam-scraping-app/47eecadc9db8b31b8070c138914d98e1
        - Deleting S3 object with key sam-scraping-app/a5755f269711b8611f345af5f2851e22
        - Deleting S3 object with key sam-scraping-app/ea3b33cdd3853590a6e07b12106895f6
        - Deleting S3 object with key sam-scraping-app/ac06908570b3a2e7118b7146fc4029e1
        - Deleting S3 object with key sam-scraping-app/9d4e2e6472b944a04245e7d29e185793.template
        - Deleting Cloudformation stack sam-scraping-app

Deleted successfully

Deleting the test resources

Delete the resources created for testing manually.

$ aws s3 rb s3://<BUCKET_NAME> --profile dev_sam

$ aws dynamodb delete-table --table-name <TABLE_NAME> --profile dev_sam

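Closing thoughts

I was able to develop and test the Lambda functions and deploy them together with the related resources
from my local environment with relatively little effort.
I have not used the Serverless Framework yet,
so I hope to catch up on how it differs from SAM from here on.

This turned out to be a long post, but thank you for reading to the end.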
