Prepare Data from CbioPortal for Flexynesis Integration

Author(s): Amirhossein Naghsh Nilchi, Polina Polunina, Björn Grüning

Overview
Questions:

How to download data from cBioPortal?

How to prepare omics data for Flexynesis integration?

Objectives:

Download breast cancer data from METABRIC through cBioPortal using Flexynesis

Clean and preprocess genomics data

Format data for downstream integration analysis

Time estimation: 1 hour

Available on these Galaxies

Known working: UseGalaxy.eu

Possibly working: UseGalaxy.be

Published: Aug 10, 2025

Last modification: Aug 13, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00553

Revision: 2

The cBioPortal is an open-access web platform that provides intuitive access to large-scale cancer genomics datasets (Gao et al. 2013; cBioPortal community). Originally developed to make complex molecular profiling data more accessible to the broader research community, cBioPortal hosts data from thousands of cancer studies and tens of thousands of tumor samples.

In this tutorial, we will work with data from the METABRIC consortium (METABRIC community), one of the landmark breast cancer genomics studies available through cBioPortal. This dataset contains comprehensive molecular and clinical data from over 2,000 breast cancer patients, including gene expression profiles, copy number alterations, mutation data, and clinical outcomes.

Agenda

In this tutorial, we will cover:

Import data from cBioPortal

Data cleanup

Clinical data

Omics data

Split data to train and test

Conclusion

Import data from cBioPortal

Hands On: Import data from cBioPortal

Flexynesis cBioPortal import (Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“cBioPortal study ID”: brca_metabric

Question

What modalities are imported to Galaxy?

Clinical data

Copy number alteration

Methylation

Gene expression

Mutation

Data cleanup

Now we need to clean up our data. This means removing comment lines from the matrices, removing duplicate entries, and dropping columns we do not need.

Clinical data

Hands On: Prepare clinical data

Here we’ll first extract the clinical data from the collection, then remove the comment lines, and finally remove the extra index column.

Extract dataset with the following parameters:

param-file “Input List”: datasets (output of Flexynesis cBioPortal import tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: data_clinical_patient

Table Compute (Galaxy version 1.2.4+galaxy2) with the following parameters:

“Input Single or Multiple Tables”: Single Table

param-file “Table”: data_clinical_patient (output of Extract dataset tool)

In “Advanced File Options”:

“Header begins at line N”: 4

“Type of table operation”: No operation (just reformat on output)

Advanced Cut (Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: table (output of Table Compute tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: c1:

Rename the dataset to clinical data - cleaned

Question

How many samples are there in the clinical data?

Click on the dataset: it has 2510 lines. One of them is the header, so there are 2509 samples.
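For readers who want to see what this cleanup amounts to outside Galaxy, here is a minimal Python sketch. It assumes the clinical file is tab-separated and that comment lines start with `#`; the sample text, the column names, and the `clean_clinical` helper are made up for illustration and are not part of Flexynesis or Galaxy. It reproduces the outcome of the steps above (comment lines gone, one header line, one row per sample), not the tools themselves.

```python
import csv
import io


def clean_clinical(raw_text):
    """Drop '#' comment lines and parse the remaining tab-separated text.

    Returns (header, rows), where header is the list of column names
    and rows is a list of lists, one per sample.
    """
    data_lines = [
        line for line in raw_text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    header = next(reader)  # first non-comment line holds the column names
    rows = list(reader)
    return header, rows


# Made-up miniature file: four comment lines, then header, then data.
raw = (
    "#Patient Identifier\tAge at Diagnosis\n"
    "#Identifier of the patient\tAge at diagnosis\n"
    "#STRING\tNUMBER\n"
    "#1\t1\n"
    "PATIENT_ID\tAGE_AT_DIAGNOSIS\n"
    "P-0001\t61.2\n"
    "P-0002\t48.7\n"
)

header, rows = clean_clinical(raw)
n_samples = len(rows)  # total lines minus the single header line
print(header, n_samples)
```

The sample count is simply the number of remaining lines minus the header, which is the same arithmetic used in the answer above.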

Omics data

Time to prepare our omics data. We are interested in mutation and gene expression.

Hands On: Prepare gene expression data

Here we’ll first extract the gene expression data from the collection, then remove duplicate genes from the matrix, and finally remove the Entrez IDs.

Extract dataset with the following parameters:

param-file “Input List”: data (output of Flexynesis cBioPortal import tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: data_mrna_illumina_microarray

Sort (Galaxy version 9.5+galaxy2) with the following parameters:

param-file “Sort Query”: data_mrna_illumina_microarray (output of Extract dataset tool)

“Number of header lines”: 1

In “Column selections”:

param-repeat “Insert Column selections”

“on column”: c1

“Output unique values”: Yes

Advanced Cut (Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: output (output of Sort tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: c2: Entrez_Gene_Id

Rename the dataset to expression data - cleaned

Question

How many samples and genes are there in the gene expression data?

Click on the dataset: it has 20386 lines and 1981 columns. Excluding the header line and the gene name column, that gives 20385 genes and 1980 samples.
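The two operations in this step (keeping one row per gene, then dropping the Entrez ID column) can be sketched in plain Python as follows. This is an illustration under assumptions: the matrix is represented as a list of rows, the gene symbol is in the first column, the Entrez ID in the second, and when a gene appears more than once the first occurrence is kept. Which duplicate the Galaxy Sort tool retains is not stated in this tutorial, so treat that choice as hypothetical. The `dedup_and_drop_entrez` helper and the toy values are invented for the example.

```python
def dedup_and_drop_entrez(table):
    """Keep one row per gene (first column) and drop the second column.

    `table` is a list of rows; the first row is the header.
    Rows are returned sorted by gene name, with the header on top.
    """
    header, body = table[0], table[1:]
    seen = set()
    unique_rows = []
    for row in body:
        gene = row[0]
        if gene in seen:
            continue  # assumption: keep the first occurrence only
        seen.add(gene)
        unique_rows.append(row)
    unique_rows.sort(key=lambda row: row[0])
    # Drop column 2 (index 1), which holds the Entrez gene ID.
    cleaned = [[header[0]] + header[2:]]
    cleaned += [[row[0]] + row[2:] for row in unique_rows]
    return cleaned


# Toy expression matrix with one duplicated gene.
table = [
    ["Hugo_Symbol", "Entrez_Gene_Id", "S1", "S2"],
    ["TP53", "7157", "1.0", "2.0"],
    ["BRCA1", "672", "0.5", "0.1"],
    ["TP53", "7157", "9.9", "9.9"],
]

cleaned = dedup_and_drop_entrez(table)
n_genes = len(cleaned) - 1       # lines minus the header
n_samples = len(cleaned[0]) - 1  # columns minus the gene name column
print(n_genes, n_samples)
```

The last two lines mirror how the gene and sample counts in the answer above are derived from the line and column counts.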

Hands On: Prepare mutation data

Here we’ll first extract the mutation data from the collection, then remove comment lines from the matrix and the extra index column, and finally create a binarized matrix that indicates the mutation status of each gene in each sample.

Extract dataset with the following parameters:

param-file “Input List”: data (output of Flexynesis cBioPortal import tool)

“How should a dataset be selected?”: Select by element identifier

“Element identifier:”: data_mutations

Table Compute (Galaxy version 1.2.4+galaxy2) with the following parameters:

“Input Single or Multiple Tables”: Single Table

param-file “Table”: data_mutations (output of Extract dataset tool)

In “Advanced File Options”:

“Header begins at line N”: 1

“Type of table operation”: No operation (just reformat on output)

Advanced Cut (Galaxy version 9.5+galaxy2) with the following parameters:

param-file “File to cut”: table (output of Table Compute tool)

“Operation”: Discard

“Cut by”: fields

“Is there a header for the data’s columns ?”: Yes

“List of Fields”: c1:

Flexynesis utils (Galaxy version 0.2.20+galaxy3) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Flexynesis utils”: Binarize mutation data

param-file “Mutation data”: table (output of Advanced Cut tool)

“Column in the mutation file with genes”: Column: 1

“Column in the mutation file with samples”: Column: 17

Rename the dataset to mutation data - cleaned

Question

How many samples and genes are there in the mutation data?

Click on the dataset: it has 174 lines and 2370 columns, so 174 genes and 2370 samples.
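To give an idea of what binarizing a mutation table means in general, the sketch below turns a long list of mutation records (one row per mutation) into a gene-by-sample matrix of 0s and 1s. It is an illustration only and not the Flexynesis implementation, whose exact behaviour may differ. The `binarize_mutations` helper and the toy records are hypothetical, and the column positions are 0-based here, whereas the Galaxy form above counts columns from 1.

```python
def binarize_mutations(records, gene_col, sample_col):
    """Build a gene-by-sample 0/1 matrix from per-mutation records.

    `records` is a list of rows without the header. A cell is 1 when
    at least one record links that gene to that sample, else 0.
    Returns (genes, samples, matrix) with genes as matrix rows.
    """
    genes = sorted({row[gene_col] for row in records})
    samples = sorted({row[sample_col] for row in records})
    mutated = {(row[gene_col], row[sample_col]) for row in records}
    matrix = [
        [1 if (gene, sample) in mutated else 0 for sample in samples]
        for gene in genes
    ]
    return genes, samples, matrix


# Toy records: gene in column 0, sample in column 1.
# TP53 is listed twice for S1 but still yields a single 1.
records = [
    ["TP53", "S1"],
    ["TP53", "S1"],
    ["PIK3CA", "S2"],
    ["TP53", "S3"],
]

genes, samples, matrix = binarize_mutations(records, gene_col=0, sample_col=1)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Repeated mutations of the same gene in the same sample collapse to a single 1, which is what distinguishes a binarized matrix from a count matrix.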

Split data to train and test

In the last step, we split our clinical and omics data into training and test sets with a ratio of 0.7 (70% for training and 30% for testing).

Hands On: Split data to train and test

Flexynesis utils (Galaxy version 0.2.20+galaxy2) with the following parameters:

“I certify that I am not using this tool for commercial purposes.”: Yes

“Flexynesis utils”: Split data to train and test

param-file “Clinical data”: clinical data - cleaned (output of Advanced Cut tool)

param-files “Omics data”: expression data - cleaned (output of Advanced Cut tool), mutation data - cleaned (output of Flexynesis utils tool)
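The idea behind a 0.7 split can be sketched in a few lines of Python. This is a simplified illustration, assuming a plain random split over sample identifiers with a fixed seed; how Flexynesis actually selects samples (for example, whether it restricts to samples shared by all modalities) is not described in this tutorial. The `split_samples` helper, the seed, and the sample IDs are hypothetical.

```python
import random


def split_samples(sample_ids, train_ratio=0.7, seed=42):
    """Randomly split sample IDs into (train, test) lists.

    IDs are sorted before shuffling so the result depends only on
    the seed and the set of IDs, not on the input order.
    """
    ids = sorted(sample_ids)
    rng = random.Random(seed)  # local generator, leaves global state alone
    rng.shuffle(ids)
    n_train = int(round(len(ids) * train_ratio))
    return ids[:n_train], ids[n_train:]


# Ten made-up sample identifiers.
sample_ids = ["S{:02d}".format(i) for i in range(1, 11)]
train_ids, test_ids = split_samples(sample_ids, train_ratio=0.7)

# The same ID lists would then be used to subset the clinical table
# and every omics matrix, so that all modalities stay aligned.
print(len(train_ids), len(test_ids))
```

Splitting on sample identifiers first, and only then subsetting each table, keeps the clinical and omics data consistent between the training and test sets.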

Conclusion

In this tutorial, we showed how to download and prepare multi-modal cancer genomics data from cBioPortal for integration using Flexynesis. Working with the METABRIC breast cancer dataset, we covered the complete data preparation pipeline from raw data access to analysis-ready formats.

You've Finished the Tutorial

Key points

cBioPortal is a repository of accessible and interpretable cancer genomics data.

The Flexynesis tool suite can be used to prepare data for integration.

References

Gao, J., B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross et al., 2013 Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal. Science Signaling 6: 10.1126/scisignal.2004088
cBioPortal community, cBioPortal for Cancer Genomics. https://www.cbioportal.org/
METABRIC community, METABRIC. https://ega-archive.org/studies/EGAS00000000083

Citing this Tutorial

Amirhossein Naghsh Nilchi, Polina Polunina, Björn Grüning, Prepare Data from CbioPortal for Flexynesis Integration (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/flexynesis_cbio_import/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{statistics-flexynesis_cbio_import,
author = "Amirhossein Naghsh Nilchi and Polina Polunina and Björn Grüning",
	title = "Prepare Data from CbioPortal for Flexynesis Integration (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/statistics/tutorials/flexynesis_cbio_import/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/statistics/tutorials/flexynesis_cbio_import/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: flexynesis_cbioportal_import
  owner: bgruening
  revisions: e9a7cb5a3c63
  tool_panel_section_label: Machine Learning
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: flexynesis_cbioportal_import
  owner: bgruening
  revisions: 693011647a67
  tool_panel_section_label: Machine Learning
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: flexynesis_utils
  owner: bgruening
  revisions: 35f41ee7ca20
  tool_panel_section_label: Machine Learning
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: flexynesis_utils
  owner: bgruening
  revisions: f73ce81f7795
  tool_panel_section_label: Machine Learning
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: c41d78ae5fee
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: table_compute
  owner: iuc
  revisions: cd36d6e45e29
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
