Prepare data from cBioPortal for Flexynesis integration

Overview
Questions:
  • How to download data from cBioPortal?

  • How to prepare omics data for Flexynesis integration?

Objectives:
  • Download breast cancer data from METABRIC through cBioPortal using Flexynesis

  • Clean and preprocess genomics data

  • Format data for downstream integration analysis

Time estimation: 1 hour
Published: Aug 10, 2025
Last modification: Aug 10, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

cBioPortal is an open-access web platform that provides intuitive access to large-scale cancer genomics datasets (Gao et al. 2013; cBioPortal community). Originally developed to make complex molecular profiling data more accessible to the broader research community, cBioPortal hosts data from thousands of cancer studies and tens of thousands of tumor samples.

In this tutorial, we will work with data from the METABRIC consortium (METABRIC community), one of the landmark breast cancer genomics studies available through cBioPortal. This dataset contains comprehensive molecular and clinical data from over 2,000 breast cancer patients, including gene expression profiles, copy number alterations, mutation data, and clinical outcomes.

Agenda

In this tutorial, we will cover:

  1. Import data from cBioPortal
  2. Data cleanup
    1. Clinical data
    2. Omics data
  3. Split data to train and test
  4. Conclusion

Import data from cBioPortal

Hands On: Import data from cBioPortal
  1. Flexynesis cBioPortal import ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “cBioPortal study ID”: brca_metabric
Question
  1. What modalities are imported to Galaxy?
    • Clinical data
    • Copy number alteration
    • Methylation
    • Gene expression
    • Mutation

Data cleanup

Now we need to clean up the data: remove comment lines from the tables, drop extra index and identifier columns, and remove duplicate entries.

Clinical data

Hands On: Prepare clinical data

Here we’ll first extract the clinical data from the collection, then remove the comment lines, and finally remove the extra index column.

  1. Extract dataset with the following parameters:
    • param-file “Input List”: datasets (output of Flexynesis cBioPortal import tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: data_clinical_patient
  2. Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:
    • “Input Single or Multiple Tables”: Single Table
      • param-file “Table”: data_clinical_patient (output of Extract dataset tool)
      • In “Advanced File Options”:
        • “Header begins at line N”: 4
      • “Type of table operation”: No operation (just reformat on output)
  3. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: table (output of Table Compute tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: c1:
  4. Rename the dataset to clinical data - cleaned
Question
  1. How many samples are there in the clinical data?
  1. Click on the data, you will see the data has 2510 lines, so 2509 samples.
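
For readers who want to see what this cleanup amounts to outside Galaxy, here is a minimal sketch in plain Python. It assumes the clinical file is tab-separated and that its leading metadata lines start with `#`, as is typical for cBioPortal clinical files; the function name `clean_clinical` and the toy data are hypothetical and are not part of Flexynesis or Galaxy.

```python
import csv

def clean_clinical(text):
    """Drop '#' comment lines and parse the rest as a tab-separated table."""
    kept = [line for line in text.splitlines() if not line.startswith("#")]
    rows = list(csv.reader(kept, delimiter="\t"))
    header, records = rows[0], rows[1:]
    return header, records

# Made-up example in the assumed cBioPortal clinical layout (not real METABRIC data)
example = (
    "#Patient Identifier\tAge at Diagnosis\n"
    "#Identifier to uniquely specify a patient.\tAge at diagnosis\n"
    "#STRING\tNUMBER\n"
    "#1\t1\n"
    "PATIENT_ID\tAGE_AT_DIAGNOSIS\n"
    "P-0001\t61.2\n"
    "P-0002\t43.7\n"
)

header, records = clean_clinical(example)
print(header)
print(len(records), "data rows")
```

The Galaxy steps reach the same result by telling Table Compute where the header starts and then discarding the index column that the tool adds on output.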

Omics data

Time to prepare our omics data. We are interested in the mutation and gene expression data.

Hands On: Prepare gene expression data

Here we’ll first extract the gene expression data from the collection, then remove duplicate genes from the matrix, and finally remove the Entrez gene IDs.

  1. Extract dataset with the following parameters:
    • param-file “Input List”: data (output of Flexynesis cBioPortal import tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: data_mrna_illumina_microarray
  2. Sort ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Sort Query”: data_mrna_illumina_microarray (output of Extract dataset tool)
    • “Number of header lines”: 1
    • In “Column selections”:
      • param-repeat “Insert Column selections”
        • “on column”: c1
    • “Output unique values”: Yes
  3. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: output (output of Sort tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: c2: Entrez_Gene_Id
  4. Rename the dataset to expression data - cleaned
Question
  1. How many samples and genes are there in the gene expression data?
  1. Click on the data, you will see the data has 20386 lines and 1981 columns, so 20385 genes and 1980 samples.
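
The Sort and Advanced Cut steps above can be pictured with a small plain-Python sketch. It assumes a tab-separated matrix whose first column holds gene symbols and whose second column is named `Entrez_Gene_Id`; the header name `Hugo_Symbol`, the function name `dedupe_and_drop`, and the toy values are illustrative assumptions. The tutorial does not say which duplicate row the Galaxy Sort tool keeps, so this sketch simply keeps the first one it encounters.

```python
import csv

def dedupe_and_drop(lines, key_col=0, drop_col="Entrez_Gene_Id"):
    """Sort rows by gene, keep one row per gene, and drop one named column."""
    rows = list(csv.reader(lines, delimiter="\t"))
    header, body = rows[0], rows[1:]
    drop_idx = header.index(drop_col)

    def strip(row):
        # Return the row without the dropped column
        return [v for i, v in enumerate(row) if i != drop_idx]

    out = [strip(header)]
    seen = set()
    # sorted() is stable, so the first occurrence of each gene is the one kept
    for row in sorted(body, key=lambda r: r[key_col]):
        if row[key_col] in seen:
            continue
        seen.add(row[key_col])
        out.append(strip(row))
    return out

# Made-up expression matrix (not real METABRIC values)
example = [
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2",
    "TP53\t7157\t1.0\t2.0",
    "BRCA1\t672\t0.5\t0.7",
    "TP53\t7157\t1.1\t2.1",
]

cleaned = dedupe_and_drop(example)
for row in cleaned:
    print("\t".join(row))
```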
Hands On: Prepare mutation data

Here we’ll first extract the mutation data from the collection, then remove comment lines from the table and the extra index column, and finally create a binarized matrix that indicates whether each gene is mutated in each sample.

  1. Extract dataset with the following parameters:
    • param-file “Input List”: data (output of Flexynesis cBioPortal import tool)
    • “How should a dataset be selected?”: Select by element identifier
      • “Element identifier:”: data_mutations
  2. Table Compute ( Galaxy version 1.2.4+galaxy2) with the following parameters:
    • “Input Single or Multiple Tables”: Single Table
      • param-file “Table”: data_mutations (output of Extract dataset tool)
      • In “Advanced File Options”:
        • “Header begins at line N”: 1
      • “Type of table operation”: No operation (just reformat on output)
  3. Advanced Cut ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to cut”: table (output of Table Compute tool)
    • “Operation”: Discard
    • “Cut by”: fields
      • “Is there a header for the data’s columns ?”: Yes
        • “List of Fields”: c1:
  4. Flexynesis utils ( Galaxy version 0.2.20+galaxy3) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis utils”: Binarize mutation data
      • param-file “Mutation data”: table (output of Advanced Cut tool)
      • “Column in the mutation file with genes”: Column: 1
      • “Column in the mutation file with samples”: Column: 17
  5. Rename the dataset to mutation data - cleaned
Question
  1. How many samples and genes are there in the mutation data?
  1. Click on the data, you will see the data has 174 lines and 2370 columns, so 174 genes and 2370 samples.
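
To illustrate the idea behind binarization, the following sketch turns a long-format mutation table (one row per mutation) into a gene-by-sample matrix of 0 and 1. It is not the Flexynesis implementation. The tutorial only gives column positions (1 for genes, 17 for samples), so the column names `Hugo_Symbol` and `Tumor_Sample_Barcode` used here are assumptions in the style of MAF files, and the function name `binarize` is hypothetical.

```python
import csv

def binarize(lines, gene_col="Hugo_Symbol", sample_col="Tumor_Sample_Barcode"):
    """Build a gene x sample matrix: 1 if the gene has any mutation in the sample."""
    reader = csv.DictReader(lines, delimiter="\t")
    mutated = {(rec[gene_col], rec[sample_col]) for rec in reader}
    genes = sorted({g for g, _ in mutated})
    samples = sorted({s for _, s in mutated})
    matrix = [[1 if (g, s) in mutated else 0 for s in samples] for g in genes]
    return genes, samples, matrix

# Made-up mutation records (not real METABRIC data); TP53 is hit twice in S1
example = [
    "Hugo_Symbol\tTumor_Sample_Barcode",
    "TP53\tS1",
    "TP53\tS1",
    "PIK3CA\tS2",
    "TP53\tS3",
]

genes, samples, matrix = binarize(example)
print("\t".join(["gene"] + samples))
for gene, row in zip(genes, matrix):
    print("\t".join([gene] + [str(v) for v in row]))
```

Note that two mutations of the same gene in the same sample still yield a single 1: the matrix records presence or absence, not counts.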

Split data to train and test

In the last step, we split our clinical and omics data into training and test sets with a ratio of 0.7 (70% for training and 30% for testing).

Hands On: Split data to train and test
  1. Flexynesis utils ( Galaxy version 0.2.20+galaxy2) with the following parameters:
    • “I certify that I am not using this tool for commercial purposes.”: Yes
    • “Flexynesis utils”: Split data to train and test
      • param-file “Clinical data”: clinical data - cleaned (output of Advanced Cut tool)
      • param-files “Omics data”: expression data - cleaned (output of Advanced Cut tool), mutation data - cleaned (output of Flexynesis utils tool)
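
As a rough picture of a 0.7 split, the sketch below shuffles a list of sample IDs with a fixed seed and cuts it at 70%. How the Flexynesis tool selects samples internally is not described in this tutorial, so the use of only the samples shared by all tables, the seed value, and the function name `split_samples` are assumptions made for illustration.

```python
import random

def split_samples(sample_ids, train_ratio=0.7, seed=42):
    """Shuffle sample IDs reproducibly and split them into train and test lists."""
    ids = sorted(sample_ids)          # sort first so the result does not depend on input order
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = round(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# Made-up sample IDs per table (not real METABRIC identifiers)
clinical = {f"S{i}" for i in range(1, 13)}
expression = {f"S{i}" for i in range(1, 11)}
mutation = {f"S{i}" for i in range(1, 12)}

# Assumption: only samples present in every table are split
shared = clinical & expression & mutation
train, test = split_samples(shared)
print(len(train), "training samples,", len(test), "test samples")
```

Each table would then be subset to the `train` and `test` IDs, so that the same samples land on the same side of the split in the clinical, expression, and mutation data.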

Conclusion

In this tutorial, we showed how to download and prepare multi-modal cancer genomics data from cBioPortal for integration using Flexynesis. Working with the METABRIC breast cancer dataset, we covered the complete data preparation pipeline from raw data access to analysis-ready formats.