How to Access and Query Your Google BigQuery Data Using Python and R
Connecting Google BigQuery with Python
To query your Google BigQuery data using Python, we need to connect the Python client to our BigQuery instance. We do so using a cloud client library for the Google BigQuery API. You can also use a third-party library to connect BigQuery with Python; the BigQuery-Python library by tylertreat is another solid choice.
We use the Google Cloud BigQuery library because it is stable and officially supported by Google.
For this post, we assume that you already have a Python development environment set up. If not, we highly recommend you refer to the Python Development Environment Setup Guide.
To install the library, run the following command from your terminal:
pip install --upgrade google-cloud-bigquery
Next, we connect the client to the database. To do this, you will need to download a JSON file that contains the BigQuery service account credentials. If you don’t have a service account, follow this guide to create one, and then proceed to download the JSON file to your local machine.
Now that we have everything set up, we proceed to initialize the connection. The following Python code is used to do so:
PYTHON
from google.cloud import bigquery
from google.oauth2 import service_account

# Load the service account credentials from the JSON key file.
credentials = service_account.Credentials.from_service_account_file('path/to/file.json')

project_id = 'my-bq'
client = bigquery.Client(credentials=credentials, project=project_id)
In the snippet above, you will need to specify project_id and the location of your JSON key file, replacing 'path/to/file.json' with the actual path to the locally stored file.
In Google BigQuery, the project is a top-level container that provides default access control across all of its datasets.
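If you prefer not to hard-code the key path, the client library can also discover credentials through the GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch, reusing the placeholder key path and project ID from above:
PYTHON
import os
from google.cloud import bigquery

# Point Application Default Credentials at the service account key file.
# (Same placeholder 'path/to/file.json' key as in the snippet above.)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/file.json'

# With the environment variable set, the client picks up the
# credentials automatically.
client = bigquery.Client(project='my-bq')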
Executing Queries on BigQuery Data with Python
Now that we have the BigQuery client set up and ready to use, we can execute queries on the BigQuery dataset.
For this, we use the query method, which inserts a query job into the BigQuery queue. The job executes asynchronously: the method returns a QueryJob instance immediately, and since we do not specify a timeout, calling its result() method simply waits for the job to complete and then returns the rows.
For more details on how this method works, please refer to the official BigQuery client library documentation.
The required Python code is as follows:
PYTHON
query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000
""")

results = query_job.result()  # Wait for the job to complete.
Please note that the above query uses standard SQL syntax by default. If you wish to use legacy SQL instead, use the code below:
PYTHON
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True

query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000
""", job_config=job_config)

results = query_job.result()  # Wait for the job to complete.
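Once result() returns, you can consume the rows directly, or convert them to a pandas DataFrame if you have pandas installed. A short sketch, assuming the query above has already run; the 'name' column is hypothetical:
PYTHON
# Option 1: iterate over the rows; each row behaves like a dict.
# ('name' is a hypothetical column used only for illustration.)
for row in results:
    print(row['name'])

# Option 2: convert the rows to a pandas DataFrame
# (requires the pandas package). Note that the row iterator can only
# be consumed once, so pick one of the two options per query.
df = results.to_dataframe()
print(df.head())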
Connecting to Google BigQuery with R
To install bigrquery, we run the following command from within the R console:
install.packages("bigrquery")
And that’s it! We are good to go.
As with Python, we will need to authorize our R client to access Google Cloud Services. As per the documentation for bigrquery, we will follow the prompt within the R console to open the authorization URL and copy the code into the console.
Note that this authorization needs to be done only once. The subsequent requests will automatically refresh the access credentials.
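If you need non-interactive authentication, for example on a server, bigrquery can also use the service account JSON key we downloaded for the Python setup. A sketch, assuming an older bigrquery release that still provides set_service_token() (newer releases expose bq_auth() for the same purpose):
R
library(bigrquery)

# Authenticate with the service account key file instead of the
# interactive browser flow. The path is the same placeholder
# 'path/to/file.json' used in the Python example.
set_service_token("path/to/file.json")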
Executing Queries on BigQuery Data with R
To execute queries on the BigQuery data with R, we will follow these steps:
- Specify the project ID from the Google Cloud Console, as we did with Python.
- Form your query string to query the data.
- Call query_exec with your project ID and query string.
The code to implement this is below:
R
# Import the library
library(bigrquery)

# Your project ID goes here
project_id <- "your-project-id"

# Sample query
sql_string <- "SELECT * FROM dataset.my_table LIMIT 1000"

# Execute the query and store the result
query_results <- query_exec(sql_string, project = project_id, useLegacySql = FALSE)
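query_exec returns the result as a data frame, so the standard R tools apply immediately:
R
# Peek at the first few rows of the result.
head(query_results)

# Quick summary statistics for each column.
summary(query_results)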
As with Python, if you wish to execute queries using legacy SQL, you can set useLegacySql to TRUE in your query_exec call.
Conclusion
In this post, we saw how easy and straightforward it is to access and manipulate the data stored in Google BigQuery using Python and R.
These two languages make it easy to build a statistical model on top of this data, which can serve many purposes; understanding customers' in-app behavior and predicting churn rate are just two of the possible use cases.
There is a significant advantage to using a database to store your data compared to other formats such as CSV files. Apart from the flexibility to store large volumes of data with varying data types, you can leverage the power of SQL to write complex queries that surface meaningful insights.
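As a quick illustration, here is the kind of aggregation that is tedious over raw CSV files but a one-liner in SQL, run through the Python client set up earlier; the events table and its user_id column are hypothetical:
PYTHON
# A hypothetical aggregation: count events per user.
# Table and column names are illustrative only.
query_job = client.query("""
    SELECT user_id, COUNT(*) AS event_count
    FROM dataset.events
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
""")

for row in query_job.result():
    print(row['user_id'], row['event_count'])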
While this post focuses on Google BigQuery, using any other database tool with R and Python is equally easy. The only difference will be the choice of Python/R library used to connect to the database.