BaseSpace Data Model Explained
Last updated
© 2023 Illumina, Inc. All rights reserved. All trademarks are the property of Illumina, Inc. or their respective owners. Trademark information: illumina.com/company/legal.html. Privacy policy: illumina.com/company/legal/privacy.html
NOTE: This information applies to the New BaseSpace model only. Classic BaseSpace utilizes a different data model and is not covered in this article.
The basic unit of data for analysis on BaseSpace is the biosample. When a run is uploaded onto BaseSpace, biosamples are generated during FASTQ generation according to the information specified on the sample sheet. This process, as well as the other data types utilized on BaseSpace, is explained in detail below.
Run: Runs contain all run files uploaded from the instrument to BaseSpace. This includes the metrics files in the InterOp folder if run monitoring is selected, as well as the base call files (BCLs) if run storage is also selected. Files produced from demultiplexing, including FASTQs, are not part of the run. FASTQs from runs demultiplexed locally remain on the instrument, while FASTQs from runs demultiplexed on BaseSpace are housed as data sets in the project associated with the FASTQ generation analysis. Thus, sharing or transferring a run does not share or transfer any FASTQs generated from it; only the run data itself is shared or transferred. When deleting a run, however, selecting "all run-related files" also deletes data from associated analyses.
Analysis/App Session: Analyses, or app sessions, receive input data and process it to produce output data. The first analysis performed on most data uploaded to BaseSpace is FASTQ generation, which is usually performed automatically on uploaded runs. Analyses output their data as data sets. These data sets are stored in projects; analyses do not store any data themselves. When a run is uploaded to BaseSpace with a valid sample sheet, an automatic FASTQ generation analysis is triggered. It outputs a FASTQ data set for each sample into the project specified for that sample, or creates a new project to house the data sets if no project is specified.
An analysis can output data into multiple projects. A user must own every project that an analysis has output data into in order to perform certain actions on that analysis, such as deletion. This is because deleting an analysis removes all of its output data sets, and ownership of the projects housing these data sets is necessary to delete them.
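The ownership rule above can be pictured with a minimal sketch. This is an illustrative model only, not BaseSpace code or its API; the `DataSet` class and the `can_delete_analysis` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str
    project: str  # name of the project that houses this data set


def can_delete_analysis(output_data_sets, owned_projects):
    """Return True only if the user owns every project housing an output data set."""
    return all(ds.project in owned_projects for ds in output_data_sets)


# Hypothetical analysis whose outputs landed in two different projects.
outputs = [
    DataSet("Sample1_FASTQ", "Project1"),
    DataSet("Sample4_FASTQ", "Project2"),
]

print(can_delete_analysis(outputs, {"Project1"}))              # False
print(can_delete_analysis(outputs, {"Project1", "Project2"}))  # True
```

Owning only one of the two projects is not enough in this sketch, which mirrors the reason given above: deleting the analysis would remove data sets in a project the user does not own.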
Data Set: Data sets are created as the outputs of analyses. The FASTQs produced by FASTQ generation are one type of data set, but other analysis output data, as well as data uploaded directly onto BaseSpace, are also considered data sets. Data sets are associated with biosamples, and when a biosample is selected as input for an analysis, all data sets that are marked as QC Passed, and are valid for the type of analysis selected, will be used.
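As a rough sketch of the selection rule described above, the following filter keeps only data sets that are QC Passed and of a type the chosen analysis accepts. The class, the function, and the type labels ("FASTQ", "AnalysisOutput") are assumptions made for illustration and do not reflect how BaseSpace implements this internally.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    kind: str               # hypothetical type label, e.g. "FASTQ"
    qc_passed: bool = True  # False stands for "QC Failed"


def select_inputs(biosample_data_sets, accepted_kinds):
    """Keep data sets that are QC Passed and valid for the chosen analysis."""
    return [
        ds for ds in biosample_data_sets
        if ds.qc_passed and ds.kind in accepted_kinds
    ]


# Data sets associated with one hypothetical biosample.
data_sets = [
    DataSet("run1_fastq", "FASTQ"),
    DataSet("run2_fastq", "FASTQ", qc_passed=False),
    DataSet("prior_analysis_output", "AnalysisOutput"),
]

selected = select_inputs(data_sets, accepted_kinds={"FASTQ"})
print([ds.name for ds in selected])  # ['run1_fastq']
```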
Project: Projects house the outputs of analyses, as well as other files the user uploads manually to them. They do not house any BCL or metrics files from runs. Projects can contain data from multiple analyses. Sharing or transferring a project will share or transfer all data sets within it, independent of the source runs used to generate them.
Biosample: Biosamples are the primary unit for analysis beyond FASTQ generation on BaseSpace. Once a run is demultiplexed, the generated FASTQ data sets are associated with biosamples corresponding to the listed sample IDs provided on the sample sheet.
Biosamples do not belong to a project themselves; however, they are associated with data sets that do belong to projects. This includes the project(s) that house the FASTQ data sets associated with the biosample, as well as the project(s) that house data sets produced by analyses performed with the biosample. Every biosample must have a default project, which is where analyses using the biosample output their data unless another output location is specified.
Libraries: If a sample sheet provides a sample ID that was already used in a previous run, BaseSpace then checks the sample name listed, reading it in as the library associated with the sample. If the library/sample ID combination has been used before, the data will be aggregated into the existing library under the biosample. Otherwise, a new library will be created under the sample ID.
Pools: When multiple libraries are specified in the sample name column in the same lane, these libraries will also be grouped together into a pool. If other lanes on the instrument also include the same combination of libraries, these will also be included in the pool. Pools are not shared across instruments, so if the same library combination is run on another instrument, a new pool will be created.
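One way to picture the pooling rule is to key each pool by the instrument and the exact combination of libraries in a lane. The sketch below is a simplified model under stated assumptions: the record layout, the instrument labels, and the function name are hypothetical, library names stand in for library identities, and a lane with a single library is assumed not to form a pool.

```python
from collections import defaultdict


def group_into_pools(lane_records):
    """Group lanes into pools.

    lane_records: iterable of (instrument, lane, library_name) tuples.
    Returns a dict mapping (instrument, frozenset of libraries) to the
    list of lanes that carry exactly that combination of libraries.
    """
    libraries_per_lane = defaultdict(set)
    for instrument, lane, library in lane_records:
        libraries_per_lane[(instrument, lane)].add(library)

    pools = defaultdict(list)
    for (instrument, lane), libraries in libraries_per_lane.items():
        if len(libraries) > 1:  # assumption: only multi-library lanes form a pool
            pools[(instrument, frozenset(libraries))].append(lane)
    return dict(pools)


records = [
    ("InstrumentA", 1, "LibraryA"), ("InstrumentA", 1, "LibraryB"),
    ("InstrumentA", 2, "LibraryA"), ("InstrumentA", 2, "LibraryB"),
    ("InstrumentA", 3, "LibraryC"),
    ("InstrumentB", 1, "LibraryA"), ("InstrumentB", 1, "LibraryB"),
]

pools = group_into_pools(records)
for (instrument, libraries), lanes in pools.items():
    print(instrument, sorted(libraries), lanes)
```

In this sketch, lanes 1 and 2 on InstrumentA fall into one pool because they carry the same library combination, while the same combination on InstrumentB produces a separate pool.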
Biosample Aggregation: When the same sample ID is used for multiple FASTQ generation analyses, FASTQ data sets from each demultiplexing will be aggregated in the same biosample. This will cause the biosample to lock, preventing further manipulation of the biosample until its status is changed. If the biosample is unlocked, all associated data sets will be used for downstream analyses; otherwise, if any data sets are marked as QC Failed, these will not be used when the biosample is selected for analysis. By default, when FASTQ generation for a run is requeued on BaseSpace, FASTQs from the prior analyses will be marked as QC Failed automatically so that only the results of the most recent requeue will be used for further analysis.
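The default requeue behavior described above can be sketched as follows. This is a toy model, not BaseSpace's implementation; `FastqDataSet`, `run_fastq_generation`, and the `analysis_number` counter are hypothetical names used only to show that earlier FASTQ data sets from the same run end up marked QC Failed.

```python
from dataclasses import dataclass


@dataclass
class FastqDataSet:
    sample_id: str
    run: str
    analysis_number: int    # hypothetical counter: 1 = first analysis, 2 = first requeue
    qc_passed: bool = True  # False stands for "QC Failed"


def run_fastq_generation(existing, run, sample_ids):
    """Add FASTQ data sets for a new analysis of `run`.

    Any FASTQ data sets already produced from the same run are marked
    QC Failed, so only the most recent results remain QC Passed.
    """
    prior = [ds for ds in existing if ds.run == run]
    next_number = max((ds.analysis_number for ds in prior), default=0) + 1
    for ds in prior:
        ds.qc_passed = False
    new = [FastqDataSet(sample_id, run, next_number) for sample_id in sample_ids]
    existing.extend(new)
    return new


data_sets = []
run_fastq_generation(data_sets, "Run 1", ["Sample1"])  # initial FASTQ generation
run_fastq_generation(data_sets, "Run 1", ["Sample1"])  # requeue of the same run

print([(ds.analysis_number, ds.qc_passed) for ds in data_sets])
# [(1, False), (2, True)]
```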
Users can mark a data set, library, pool, or lane as QC Failed, excluding that portion of the biosample from analysis.
In this example, all data sets have been aggregated, and as such all three sets of FASTQs will be used when this biosample is selected for analysis.
In this example, one library has been marked as QC Failed. This library and its two associated FASTQ data sets will not be used if the biosample is selected for analysis, and only the data set associated with the other library will be used.
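The two examples above can be approximated in a small sketch where a QC Failed mark on a library excludes every data set under it. The classes, the function, and the library and data set names are hypothetical; only the library and data set levels are modeled here, although pools and lanes can be marked as well.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    qc_passed: bool = True


@dataclass
class Library:
    name: str
    data_sets: list = field(default_factory=list)
    qc_passed: bool = True


def usable_data_sets(libraries):
    """Return names of data sets that are QC Passed and sit under a QC Passed library."""
    return [
        ds.name
        for lib in libraries if lib.qc_passed
        for ds in lib.data_sets if ds.qc_passed
    ]


# A biosample with two libraries and three FASTQ data sets in total.
library_1 = Library("Library1", [DataSet("FASTQ_1"), DataSet("FASTQ_2")])
library_2 = Library("Library2", [DataSet("FASTQ_3")])
biosample = [library_1, library_2]

print(usable_data_sets(biosample))  # ['FASTQ_1', 'FASTQ_2', 'FASTQ_3']

library_1.qc_passed = False         # mark one library as QC Failed
print(usable_data_sets(biosample))  # ['FASTQ_3']
```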
In this example, two runs are performed using each of the sample sheets below. In total six biosamples are created, and data is aggregated for each sample based on the specified information.
Run 1:
SampleID    SampleName    Project
Sample1     LibraryA      Project1
Sample2     LibraryB      Project1
Sample3     LibraryC      Project1
Sample4     LibraryC      Project2
Sample5     LibraryC      Project2
Run 2:
SampleID    SampleName    Project
Sample1     LibraryA      Project1
Sample6     LibraryB      Project1
Sample3     LibraryD      Project1
Sample4     LibraryC      Project1
Sample5     LibraryC      Project3
Sample1: One library (LibraryA). Has aggregated FASTQ data sets from both Run 1 and Run 2.
Sample2: One library (LibraryB). No aggregated data, only has a FASTQ data set from Run 1.
Sample3: Aggregated libraries (LibraryC and LibraryD); LibraryC has a FASTQ data set from Run 1 while LibraryD has a FASTQ data set from Run 2.
Sample4: One library (LibraryC). Has aggregated FASTQ data sets from both Run 1 and Run 2. Note that this LibraryC is unrelated to that of Sample3 and Sample5, as libraries are not shared between biosamples.
Sample5: One library (LibraryC). Has aggregated FASTQ data sets from both Run 1 and Run 2. Note that this LibraryC is unrelated to that of Sample3 and Sample4, as libraries are not shared between biosamples.
Sample6: One library (LibraryB). No aggregated data, only has a FASTQ data set from Run 2. Note that this LibraryB is unrelated to that of Sample2, as libraries are not shared between biosamples.
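The outcome listed above can be reproduced with a short sketch that groups the rows of both sample sheets by sample ID and then by sample name (library). This is an illustrative model of the aggregation rule only, not BaseSpace code; the `aggregate` function and the tuple layout are assumptions. Libraries are nested under their biosample, which is why the same library name under different sample IDs stays separate.

```python
from collections import defaultdict

# Rows are (SampleID, SampleName, Project), copied from the two sample sheets.
RUN_1 = [
    ("Sample1", "LibraryA", "Project1"),
    ("Sample2", "LibraryB", "Project1"),
    ("Sample3", "LibraryC", "Project1"),
    ("Sample4", "LibraryC", "Project2"),
    ("Sample5", "LibraryC", "Project2"),
]
RUN_2 = [
    ("Sample1", "LibraryA", "Project1"),
    ("Sample6", "LibraryB", "Project1"),
    ("Sample3", "LibraryD", "Project1"),
    ("Sample4", "LibraryC", "Project1"),
    ("Sample5", "LibraryC", "Project3"),
]


def aggregate(runs):
    """Map each sample ID to its libraries and the runs that contributed FASTQs."""
    biosamples = defaultdict(lambda: defaultdict(list))
    for run_label, sheet in runs.items():
        for sample_id, sample_name, _project in sheet:
            # The project decides where the FASTQ data set is housed;
            # it does not affect which biosample or library the data joins.
            biosamples[sample_id][sample_name].append(run_label)
    return {sample_id: dict(libs) for sample_id, libs in biosamples.items()}


result = aggregate({"Run 1": RUN_1, "Run 2": RUN_2})
for sample_id in sorted(result):
    print(sample_id, result[sample_id])
# Sample1 {'LibraryA': ['Run 1', 'Run 2']}
# Sample2 {'LibraryB': ['Run 1']}
# Sample3 {'LibraryC': ['Run 1'], 'LibraryD': ['Run 2']}
# Sample4 {'LibraryC': ['Run 1', 'Run 2']}
# Sample5 {'LibraryC': ['Run 1', 'Run 2']}
# Sample6 {'LibraryB': ['Run 2']}
```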
For more information, please refer to the BaseSpace Sequence Hub Help Center.
For any feedback or questions regarding this article (Illumina Knowledge Article #7009), contact Illumina Technical Support at techsupport@illumina.com.