I want to create a workflow that automatically scrapes reviews of an app from the Google Play Store at a scheduled time every day and stores them in a collection in MongoDB Atlas. So first, I created a Python script called `scraping_daily.py` that scrapes 5,000 new reviews and filters out any that were previously collected. When I tested it by running it manually, the script worked perfectly fine. Here's what the script looks like:
```python
# Import libraries
import numpy as np
import pandas as pd
from google_play_scraper import Sort, reviews, reviews_all, app
from pymongo import MongoClient

# Create a connection to MongoDB
client = MongoClient("mongodb+srv://<MY_USERNAME>:<MY_PASSWORD>@project1.lpu4kvx.mongodb.net/?retryWrites=true&w=majority")
db = client["vidio"]
collection = db["google_play_store_reviews"]

# Load the data from MongoDB
df = pd.DataFrame(list(collection.find()))
df = df.drop("_id", axis=1)
df = df.sort_values("at", ascending=False)

# Collect 5000 new reviews
result = reviews("com.vidio.android", lang="id", country="id", sort=Sort.NEWEST, count=5000)
new_reviews = pd.DataFrame(result[0])
new_reviews = new_reviews.fillna("empty")

# Filter the scraped reviews to exclude any that were previously collected
common = new_reviews.merge(df, on=["reviewId", "userName"])
new_reviews_sliced = new_reviews[
    (~new_reviews.reviewId.isin(common.reviewId))
    & (~new_reviews.userName.isin(common.userName))
]

# Update MongoDB with any new reviews that were not previously scraped
if len(new_reviews_sliced) > 0:
    new_reviews_sliced_dict = new_reviews_sliced.to_dict("records")
    batch_size = 1_000
    num_records = len(new_reviews_sliced_dict)
    num_batches = num_records // batch_size
    if num_records % batch_size != 0:
        num_batches += 1
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = min(start_idx + batch_size, num_records)
        batch = new_reviews_sliced_dict[start_idx:end_idx]
        if batch:
            collection.insert_many(batch)
```
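For context on the filtering step: a common alternative to the two `isin` checks is a single left anti-join on the `(reviewId, userName)` pair using `merge(..., indicator=True)`. A minimal sketch on made-up data (the review values are placeholders, not real scraped data):

```python
import pandas as pd

# Toy stand-ins for the existing collection and a freshly scraped batch.
existing = pd.DataFrame({
    "reviewId": ["a", "b"],
    "userName": ["u1", "u2"],
})
scraped = pd.DataFrame({
    "reviewId": ["b", "c"],
    "userName": ["u2", "u3"],
    "content": ["already stored", "brand new review"],
})

# Left anti-join: keep only scraped rows with no match in `existing`
# on the (reviewId, userName) key pair.
merged = scraped.merge(
    existing[["reviewId", "userName"]],
    on=["reviewId", "userName"],
    how="left",
    indicator=True,
)
new_only = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(new_only["reviewId"].tolist())  # → ['c']
```
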
Next, I want to schedule my script using GitHub Actions. Following some YouTube tutorials, I created an `actions.yml` file in the `.github/workflows` folder. Here's what the YAML file looks like:
```yaml
name: Scraping Google Play Reviews

on:
  schedule:
    - cron: '50 16 * * *'  # At 16:50 every day

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: check out the repository content
        uses: actions/checkout@v2
      - name: set up python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: install requirements
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: execute the script
        run: python -m scraping_daily.py
```
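(A side note on the setup: since the connection string in the script embeds credentials, the workflow could pass it in from a repository secret instead of hard-coding it. This is only a sketch; the secret name `MONGODB_URI` is my own placeholder, not something from the tutorial.)

```yaml
      - name: execute the script
        run: python scraping_daily.py
        env:
          # MONGODB_URI is a hypothetical repository secret holding the full
          # mongodb+srv connection string, set under Settings > Secrets.
          MONGODB_URI: ${{ secrets.MONGODB_URI }}
```

The script would then read it with `os.environ["MONGODB_URI"]` instead of embedding the username and password.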
However, it always throws an error when it executes my script. The error message is:
```
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/runner/work/vidio_google_play_store_reviews/vidio_google_play_store_reviews/scraping_daily.py", line 16, in <module>
    df = pd.DataFrame(list(collection.find()))
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/cursor.py", line 1248, in next
    if len(self.__data) or self._refresh():
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/cursor.py", line 1139, in _refresh
    self.__session = self.__collection.database.client._ensure_session()
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1740, in _ensure_session
    return self.__start_session(True, causal_consistency=False)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1685, in __start_session
    self._topology._check_implicit_session_support()
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/topology.py", line 538, in _check_implicit_session_support
    self._check_session_support()
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/topology.py", line 554, in _check_session_support
    self._select_servers_loop(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/topology.py", line 238, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: ac-dc8axn9-shard-00-01.lpu4kvx.mongodb.net:27017: connection closed,ac-dc8axn9-shard-00-02.lpu4kvx.mongodb.net:27017: connection closed,ac-dc8axn9-shard-00-00.lpu4kvx.mongodb.net:27017: connection closed, Timeout: 300.0s, Topology Description: <TopologyDescription id: 641dd5b78e0efba394e00ffc, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ac-dc8axn9-shard-00-00.lpu4kvx.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-dc8axn9-shard-00-00.lpu4kvx.mongodb.net:27017: connection closed')>, <ServerDescription ('ac-dc8axn9-shard-00-01.lpu4kvx.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-dc8axn9-shard-00-01.lpu4kvx.mongodb.net:27017: connection closed')>, <ServerDescription ('ac-dc8axn9-shard-00-02.lpu4kvx.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-dc8axn9-shard-00-02.lpu4kvx.mongodb.net:27017: connection closed')>]>
```

`Error: Process completed with exit code 1.`
I tried increasing the timeout by adding `serverSelectionTimeoutMS=300000` inside `MongoClient()` (which matches the `Timeout: 300.0s` shown in the traceback, so the option is being picked up), but it still gives me the same error. How can I solve this?
By the way, I'm working from a Windows machine locally (I'm not sure whether that's relevant).