Before getting hands-on, make sure you have a gcloud CLI client either on your local machine or in Cloud Shell and create a new project, as described in Chapter 4; store the PROJECT_ID in an environment variable using:
export PROJECT_ID=[PROJECT_ID]
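If the new project is already the active one in your gcloud configuration, you can capture its ID without typing it out; this assumes gcloud config get-value project returns the project you intend to use:

# Read the active project ID from the gcloud configuration
export PROJECT_ID=$(gcloud config get-value project)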
BigQuery
To start with, you can use the bq command-line tool to try out the query you will use to retrieve the Stack Overflow tag names from the public dataset. This tool is installed as part of the Google Cloud SDK:
bq query --max_rows=10 \
  --nouse_legacy_sql \
  --format=csv \
  "SELECT tag_name FROM bigquery-public-data.stackoverflow.tags order by tag_name"
Use the following flags:
--max_rows
Limits the number of results to 10 instead of the default 100
--nouse_legacy_sql
Uses Google Standard SQL for the query
--format=csv
Formats the result as a CSV file
The results of this command should show the first ten Stack Overflow tag names. You will use a similar query in your service. For now, let’s output all the tags to a file named tags.csv. There are approximately 63,000 Stack Overflow tags, so setting the maximum rows to 100,000 will retrieve them all.
Tip
It is good practice to always specify the maximum rows with BigQuery, as it is billed by the amount of data queried. One day, you may accidentally write a query that queries trillions of rows and be left with a hefty bill.
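If you want extra reassurance before running a query, bq can also report how much data a query would process without executing it, and it can refuse to run a query that would bill more than a given amount. The following is a sketch using the --dry_run and --maximum_bytes_billed flags; the 10 GB cap is only an example:

# Report the bytes the query would process, without running it
bq query --dry_run --nouse_legacy_sql \
  "SELECT tag_name FROM bigquery-public-data.stackoverflow.tags order by tag_name"

# Fail the query rather than bill more than roughly 10 GB (example cap)
bq query --maximum_bytes_billed=10000000000 --nouse_legacy_sql --format=csv \
  "SELECT tag_name FROM bigquery-public-data.stackoverflow.tags order by tag_name"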
Define an environment variable for the FILE_NAME (e.g., tags.csv); even when programming at the command line, it is good to follow the cloud native principle of externalizing configuration:
export FILE_NAME=[FILE_NAME]
Now you can issue the bq command to write the tags to the file:
bq query --max_rows=100000 \
  --nouse_legacy_sql \
  --format=csv \
  "SELECT tag_name FROM bigquery-public-data.stackoverflow.tags order by tag_name" > $FILE_NAME
You can check that it was successful by listing the number of lines in the file:
wc -l $FILE_NAME
If all goes well, the result should be approximately 63,654 lines, allowing one line for the CSV header.
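You can also spot-check the contents; with --format=csv, the first line should be the tag_name column header, followed by the tags in alphabetical order:

head $FILE_NAME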
Cloud Storage
You now need to create a Cloud Storage bucket to store the file you generate from your query. You can do that with the gsutil command, which is also included with the gcloud CLI.
First, create a BUCKET_NAME environment variable with the bucket to use. Like a project ID, the bucket name must be globally unique. As your project ID is unique, you can use that as a prefix to the bucket name (e.g., skillsmapper-tags):
export BUCKET_NAME="${PROJECT_ID}-[BUCKET_SUFFIX]"
Then use the gsutil command to create the new bucket:
gsutil mb gs://$BUCKET_NAME
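This creates the bucket with default settings, which places it in the US multi-region. If you would rather keep the data in a specific location, gsutil mb accepts a -l flag; the region below is only an example:

# Create the bucket in a specific location rather than the default; europe-west2 is just an example
gsutil mb -l europe-west2 gs://$BUCKET_NAME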
With the bucket created, you can then copy the file containing the list of tags to the bucket:
gsutil cp $FILE_NAME gs://$BUCKET_NAME/$FILE_NAME
You can check that the file has been created successfully by again counting the number of lines and making sure that matches the approximately 63,654 lines you had in the file you generated:
gsutil cat gs://$BUCKET_NAME/$FILE_NAME | wc -l
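If you would rather not stream the whole object just to count its lines, listing it with the -l flag shows its size and upload time instead:

gsutil ls -l gs://$BUCKET_NAME/$FILE_NAME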
Running commands like this would be one way of keeping the tags up to date. Alternatively, you could automate them with a bash script saved as a file named update-tags.sh:
The complete script is included with the example code as code/update-tags.sh.
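As a rough sketch only (the actual update-tags.sh may differ), a minimal version that takes the bucket name and file name as arguments could look like this:

#!/bin/bash
# update-tags.sh - minimal sketch; the script shipped as code/update-tags.sh may differ
# Usage: ./update-tags.sh BUCKET_NAME FILE_NAME
set -euo pipefail

BUCKET_NAME=$1
FILE_NAME=$2

# Retrieve all Stack Overflow tag names from the public dataset as CSV
bq query --max_rows=100000 \
  --nouse_legacy_sql \
  --format=csv \
  "SELECT tag_name FROM bigquery-public-data.stackoverflow.tags order by tag_name" > "$FILE_NAME"

# Upload the resulting file to the Cloud Storage bucket
gsutil cp "$FILE_NAME" "gs://$BUCKET_NAME/$FILE_NAME"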
Then run the script, passing the bucket name and file name as arguments:
./update-tags.sh $BUCKET_NAME $FILE_NAME
You could then run the script manually from time to time or schedule it with a cron job on a machine with Google Cloud credentials; however, there is a better cloud native approach: implementing the task programmatically using a Cloud Function.