Use Case #3: Accessing and Joining Metadata Files With Data
All data uploaded to the portal is associated with metadata—read more about that here.
This use case covers how to download the metadata files associated with a specific study, and how to join those metadata files programmatically with the data files of that study. Think of this use case as a follow-up to Use Case #1: Find and Download Files Associated With a Selected Study or Use Case #2: Download Files in Bulk Using the Command Line Client. We’ll use the same study, MC-CAA, to illustrate.
This example should be helpful if you’re trying to answer a question such as, “Where can I find sex, age, or tissue type among these metadata files?”
The following can be done using the R client or the Python client. Both sets of instructions are included below.
Instructions for using R client
How to find and download metadata files
To find and download the metadata files associated with MC-CAA, we’ll follow the steps as outlined here:
Go to Explore → Studies and search for MC-CAA using the search tool in the top right
The MC-CAA study should appear in the list below—click on it to access its Study Details page
On the resulting page, click the Study Data tab
From the table of contents on the left, click Study Metadata
Download the metadata files using the same instructions as found in Use Case #1, step 6
Once these files are downloaded, you can join them together and combine them with the data files. Instructions for doing this programmatically are below.
How to read metadata files
At this point, you should know that these metadata files are presented in three types: individual, biospecimen, and assay type. Let’s say you are looking for a specific subset of metadata information, such as sex or tissue type. How do you do that?
First, determine which file your information of interest would be stored in. For reference, see What is contained in each metadata file? Let’s use the examples of sex and tissue type. Sex is associated with the individual, so it would be located in the individual metadata file, while tissue type is associated with the specimen, so it would be located in the biospecimen metadata file. To find these values, you would open each of these downloaded CSV files, and find the column that represents the value (i.e., sex, tissue).
How to join multiple files
In short, metadata files of the individual, biospecimen, and assay type can be joined together on the keys individualID and specimenID. Below are instructions on how to use R software to join multiple files.
First, you need the metadata files—see above
Install and load the tidyverse package in R to perform data frame manipulations:
install.packages("tidyverse")
library(tidyverse)
While reading in each metadata file with read_csv, specify the column types as character with “c”. Consistent column types ensure that common variables can be joined. This code joins data frames using all variables in common across the individual, biospecimen, and assay metadata files. A right_join preserves only the individuals and biospecimens that are characterized in each assay type: RNA-seq and SNP array.
individual <- read_csv("MC-CAA_individual_human_metadata.csv", col_types = cols(.default = "c"))
biospecimen <- read_csv("MC-CAA_biospecimen_metadata.csv", col_types = cols(.default = "c"))
rnaseq_assay <- read_csv("MC-CAA_assay_RNAseq_metadata.csv", col_types = cols(.default = "c"))
snparray_assay <- read_csv("MC-CAA_assay_snpArray_metadata.csv", col_types = cols(.default = "c"))
RNASeq <- reduce(list(individual, biospecimen, rnaseq_assay), right_join)
snpArray <- reduce(list(individual, biospecimen, snparray_assay), right_join)
Instructions for using Python client
How to import libraries and log in to Synapse
Make sure the pandas and synapseclient Python libraries are installed. Import both libraries and create a Synapse object, syn.
You will need a Synapse account to access Synapse data with the synapseclient. You can supply a username and password, for example syn.login("username", "password"), or use a local .synapseConfig file to supply credentials.
# 1. Import libraries
import pandas as pd
import synapseclient
syn=synapseclient.Synapse()
# log in to Synapse
syn.login()
How to download the metadata files
The four files needed for this analysis are:
MC-CAA_individual_human_metadata.csv : syn10930250
MC-CAA_biospecimen_metadata.csv : syn21522653
MC-CAA_assay_RNAseq_metadata.csv : syn21499318
MC-CAA_assay_snpArray_metadata.csv : syn21499317
Files can be downloaded manually through the AD Knowledge Portal, or you can use the code below to download them with the Synapse Python client. These files contain controlled-access human data and should only be downloaded in a secure environment.
# We use the downloadLocation argument to specify the download directory.
# (https://python-docs.synapse.org/build/html/index.html?highlight=get#synapseclient.Synapse.get)
individual_human_metadata = syn.get("syn10930250", downloadLocation = "./metadata") # MC-CAA_individual_human_metadata.csv
biospecimen_metadata = syn.get("syn21522653", downloadLocation = "./metadata") # MC-CAA_biospecimen_metadata.csv
RNAseq_metadata = syn.get("syn21499318", downloadLocation = "./metadata") # MC-CAA_assay_RNAseq_metadata.csv
snpArray_metadata = syn.get("syn21499317", downloadLocation = "./metadata") # MC-CAA_assay_snpArray_metadata.csv
How to read data from CSVs into pandas dataframes
# File names must match the downloaded files exactly (paths are case-sensitive on many systems)
individual_human_metadata_df = pd.read_csv("metadata/MC-CAA_individual_human_metadata.csv", dtype=str)
biospecimen_metadata_df = pd.read_csv("metadata/MC-CAA_biospecimen_metadata.csv", dtype=str)
RNAseq_metadata_df = pd.read_csv("metadata/MC-CAA_assay_RNAseq_metadata.csv", dtype=str)
snpArray_metadata_df = pd.read_csv("metadata/MC-CAA_assay_snpArray_metadata.csv", dtype=str)
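To answer a question such as "which metadata file holds sex or tissue?", it can help to scan the column names of each data frame rather than opening every CSV by hand. The following is a minimal sketch, not part of the portal's instructions: find_columns is a hypothetical helper, and the small data frames and their values are made-up stand-ins for the real metadata files.

```python
import pandas as pd

def find_columns(frames, wanted):
    """Map each wanted column name to the names of the frames that contain it."""
    return {
        col: [name for name, df in frames.items() if col in df.columns]
        for col in wanted
    }

# Toy stand-ins for the downloaded metadata files (hypothetical values)
individual = pd.DataFrame(
    {"individualID": ["ind1", "ind2"], "sex": ["female", "male"]}
)
biospecimen = pd.DataFrame(
    {
        "individualID": ["ind1", "ind2"],
        "specimenID": ["sp1", "sp2"],
        "tissue": ["cerebellum", "temporal cortex"],
    }
)

frames = {"individual": individual, "biospecimen": biospecimen}
locations = find_columns(frames, ["sex", "tissue", "individualID"])
print(locations)
```

With the real files, the same helper could be called on a dictionary of the four data frames read above. Columns that appear in more than one frame, such as individualID here, are the candidates for join keys.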
How to join data
# Define function to right join multiple dfs
# Right join "preserves only the individuals and biospecimens that are characterized in each assay type: RNA-seq and SNP array."
# https://help.adknowledgeportal.org/apd/Use-Case-%233:-Working-with-File-Annotations-and-Metadata.2426208334.html
def right_join_multiple(left_df_list, right_df):
    for df in left_df_list:
        right_df = pd.merge(df, right_df, how='right')
    return right_df
# Use function to join biospecimen, individual and RNASeq metadata. Do the same with SNP array data
left_df_list = [biospecimen_metadata_df, individual_human_metadata_df]
Joined_RNAseq_metadata_df = right_join_multiple(left_df_list, RNAseq_metadata_df)
Joined_SNP_metadata_df = right_join_multiple(left_df_list, snpArray_metadata_df)
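To see what the right join keeps and drops, here is a small self-contained sketch on toy data. The IDs, values, and the platform column are invented for illustration. Unlike the function above, which joins on every column the two frames share, this sketch names the keys individualID and specimenID explicitly; that is a design choice made here for clarity, not the method the portal prescribes.

```python
import pandas as pd

# Three individuals, each with one specimen (hypothetical values)
individual = pd.DataFrame(
    {"individualID": ["ind1", "ind2", "ind3"], "sex": ["female", "male", "female"]}
)
biospecimen = pd.DataFrame(
    {
        "individualID": ["ind1", "ind2", "ind3"],
        "specimenID": ["sp1", "sp2", "sp3"],
        "tissue": ["cerebellum", "cerebellum", "temporal cortex"],
    }
)
# Only two of the three specimens were assayed
assay = pd.DataFrame({"specimenID": ["sp1", "sp3"], "platform": ["A", "A"]})

# Right joins keep every assay row and attach matching metadata
joined = pd.merge(biospecimen, assay, how="right", on="specimenID")
joined = pd.merge(individual, joined, how="right", on="individualID")
print(joined)
```

The result has one row per assayed specimen (sp1 and sp3); the specimen sp2 and its individual ind2 are dropped because no assay row refers to them.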
How to display joined data
pd.set_option("display.max_rows", 10) # Change the second argument to None if you wish to display all rows
Joined_RNAseq_metadata_df # Run this line to display individual and biospecimen metadata that have RNA-seq data
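Because a right join keeps every assay row even when no metadata matches, it can be worth checking for assay rows that came through without biospecimen information. The sketch below uses the indicator argument of pd.merge on toy data; the IDs and the platform column are hypothetical, and this check is a suggestion rather than a step from the original instructions.

```python
import pandas as pd

biospecimen = pd.DataFrame(
    {"specimenID": ["sp1", "sp2"], "tissue": ["cerebellum", "temporal cortex"]}
)
# sp9 has an assay record but no biospecimen record (hypothetical)
assay = pd.DataFrame({"specimenID": ["sp1", "sp9"], "platform": ["A", "A"]})

# indicator=True adds a _merge column describing where each row came from
checked = pd.merge(biospecimen, assay, how="right", on="specimenID", indicator=True)

# Rows marked "right_only" exist in the assay file but matched no biospecimen row
unmatched = checked.loc[checked["_merge"] == "right_only", "specimenID"].tolist()
print(unmatched)
```

An empty list would suggest that every assayed specimen found its biospecimen metadata; any IDs listed are worth a closer look before analysis.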