Baidu Index Scraping 02: Scraping Annual Averages

Program requirement: scrape the Baidu Index annual average for each target city.

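  This program obtains the annual-average data for the target search terms by capturing the network requests made by the Baidu Index page. Observing the page's API calls shows that the request Baidu Index sends to its backend has the following form: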

http://index.baidu.com/api/SearchApi/index?area={area}&word={words}&startDate={startDate}&endDate={endDate}

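Here, area is the region the searches originate from, words holds the search keywords (at most five per request), and startDate and endDate are the start and end dates.
These variables are constructed as follows: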
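import json

# One single-element list per keyword, serialized as compact JSON.
# json.dumps keeps spaces and quotes inside keywords intact.
words = [[{"name": key, "wordType": 1}] for key in keys]
words = json.dumps(words, ensure_ascii=False, separators=(",", ":"))
startDate = f"{year}-01-01"
endDate = f"{year}-12-31"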

Here, keys is the list of search terms.
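Because each request carries at most five keywords, a longer keyword list has to be split into batches before the URL is built. The following is a minimal sketch of that step, not the original program: the helper names `chunk_keys` and `build_url` are hypothetical, and the area code is passed through unchanged since its meaning is not covered here.

```python
import json

# URL template taken from the API form shown above
API_TEMPLATE = (
    "http://index.baidu.com/api/SearchApi/index"
    "?area={area}&word={words}&startDate={startDate}&endDate={endDate}"
)
MAX_KEYS_PER_REQUEST = 5  # at most five keywords per request


def chunk_keys(keys, size=MAX_KEYS_PER_REQUEST):
    """Split the keyword list into batches of at most `size` items (hypothetical helper)."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def build_url(keys, year, area):
    """Fill the URL template for one batch of keywords and one calendar year (hypothetical helper)."""
    if not 1 <= len(keys) <= MAX_KEYS_PER_REQUEST:
        raise ValueError("expected between 1 and 5 keywords per request")
    words = json.dumps(
        [[{"name": key, "wordType": 1}] for key in keys],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return API_TEMPLATE.format(
        area=area,
        words=words,
        startDate=f"{year}-01-01",
        endDate=f"{year}-12-31",
    )
```

A list of seven keywords, for instance, would yield two URLs: one for the first five keywords and one for the remaining two.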

Build the request headers and fetch the data with the get method of the requests library:

import requests

# Headers copied from the captured browser request
headers = {
        "Connection": "keep-alive",
        "Accept": "application/json, text/plain, */*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Cipher-Text": "1698156005330_1698238860769_ZPrC2QTaXriysBT+5sgXcnbTX3/lW65av4zgu9uR1usPy82bArEg4m9deebXm7/O5g6QWhRxEd9/r/hqHad2WnVFVVWybHPFg3YZUUCKMTIYFeSUIn23C6HdTT1SI8mxsG5mhO4X9nnD6NGI8hF8L5/G+a5cxq+b21PADOpt/XB5eu/pWxNdwfa12krVNuYI1E8uHQ7TFIYjCzLX9MoJzPU6prjkgJtbi3v0X7WGKDJw9hwnd5Op4muW0vWKMuo7pbxUNfEW8wPRmSQjIgW0z5p7GjNpsg98rc3FtHpuhG5JFU0kZ6tHgU8+j6ekZW7+JljdyHUMwEoBOh131bGl+oIHR8vw8Ijtg8UXr0xZqcZbMEagEBzWiiKkEAfibCui59hltAgW5LG8IOtBDqp8RJkbK+IL5GcFkNaXaZfNMpI=",

        "Referer": "https://index.baidu.com/v2/main/index.html",

        "Accept-Language": "zh-CN,zh;q=0.9",

        'Cookie': cookie}

    res = requests.get(url, headers=headers)

    res_json = res.json()

The returned annual averages and their corresponding keywords are collected into the following two lists:

# generalRatio holds one entry per requested keyword
general_ratio = res_json['data']['generalRatio']
avg_list = [item['all']['avg'] for item in general_ratio]
destination_list = [item['word'][0]['name'] for item in general_ratio]
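Pairing each keyword with its average in a dictionary can be more convenient than keeping two parallel lists. The sketch below assumes only the response fields accessed above; the function name `extract_averages` is hypothetical, and the sample response and its numbers are made up for illustration rather than real Baidu Index data.

```python
def extract_averages(res_json):
    """Map each keyword to its annual average from one API response (hypothetical helper)."""
    ratios = res_json["data"]["generalRatio"]
    return {item["word"][0]["name"]: item["all"]["avg"] for item in ratios}


# Made-up response containing only the fields the extraction reads
sample = {
    "data": {
        "generalRatio": [
            {"word": [{"name": "北京", "wordType": 1}], "all": {"avg": 1200}},
            {"word": [{"name": "上海", "wordType": 1}], "all": {"avg": 950}},
        ]
    }
}

print(extract_averages(sample))
```

When the keywords are sent in several batches, the dictionaries from each response can be merged with `dict.update` to build one table for the year.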