Frequently Asked Questions
  • Questions about ProteomeHD (unsupervised machine learning)
    • What is ProteomeHD?
    • ProteomeHD is a database of protein abundance changes in response to biological perturbations.
      It is a data matrix for functional proteomics: proteins that are up- or downregulated to a similar extent under the same biological conditions probably have related cellular functions.

      ProteomeHD differs from other drafts of the human proteome in that it does not catalogue the proteome of specific tissues or subcellular compartments. Instead, ProteomeHD catalogues the transitions between different proteome states, i.e. changes in protein abundance or localization resulting from cellular perturbations. HD, or high-definition, refers to two aspects of the dataset. First, quantitation accuracy: all experiments are quantified using SILAC (stable isotope labelling by amino acids in cell culture). Second, HD refers to the number of observations ("pixels") available for each protein. As more perturbations are analysed, regulatory patterns become more refined and can be compared more accurately.
    • How many proteins and conditions does ProteomeHD cover?
    • ProteomeHD v1.0 contains 10,323 human proteins and 294 biological conditions. Not every protein has been detected in every experiment. On average, there are 112 SILAC measurements for each protein.
    • How do you measure protein co-regulation in ProteomeHD?
    • We use the unsupervised machine-learning algorithm treeClust by Buttrey and Whitaker. For all possible pairs of proteins, it determines how similar their abundance changes are across ProteomeHD. We find treeClust to be a strong improvement over the more commonly used Pearson correlation, in both sensitivity and robustness.
      TreeClust uses decision trees that can handle missing values, which are common in shotgun proteomics data. It outputs a dissimilarity metric that reflects how often two proteins land in the same decision tree leaves. For any two proteins this allows us to determine a “co-regulation score”, defined as (1 - treeClust dissimilarity).
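      As an illustration of the definition above, the conversion from dissimilarity to co-regulation score can be sketched in a few lines of Python. This is only a sketch: the protein identifiers and dissimilarity values are made up, and the function name is hypothetical rather than part of treeClust or ProteomeHD.

```python
def coregulation_score(dissimilarity: float) -> float:
    """Convert a treeClust dissimilarity into a co-regulation score.

    Assumes the dissimilarity lies between 0 and 1, so the score
    (1 - dissimilarity) also lies between 0 and 1.
    """
    if not 0.0 <= dissimilarity <= 1.0:
        raise ValueError("dissimilarity must lie between 0 and 1")
    return 1.0 - dissimilarity


# Hypothetical dissimilarities for three protein pairs (made-up values).
dissimilarities = {
    ("PROT_A", "PROT_B"): 0.25,
    ("PROT_A", "PROT_C"): 0.75,
    ("PROT_B", "PROT_C"): 0.5,
}

scores = {pair: coregulation_score(d) for pair, d in dissimilarities.items()}
for pair, score in scores.items():
    print(pair, score)
```

      A low dissimilarity therefore gives a high score: the pair with dissimilarity 0.25 scores 0.75.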
    • Why modify the co-regulation score cut-off?
    • We use the treeClust algorithm to determine a co-regulation score for all possible pairs of proteins. The co-regulation score ranges from 0 (completely unrelated) to 1 (perfect co-regulation). We define two proteins as “co-regulated” if their co-regulation score is above 0.5. However, this is an arbitrary cut-off. For some proteins with many co-regulation partners, you may wish to increase the cut-off, in order to focus only on the most strongly co-regulated proteins. Other proteins may not be co-regulated strongly with any other proteins. In such cases, lowering the cut-off may be necessary to bring up functional clues. In practice, we recommend trying cut-offs between 0.4 and 0.7.

      Note that some restrictions apply: the interactive plots on the website display up to 1,000 co-regulation partners (with a warning), but the full set can be downloaded as a CSV file using a score cut-off of 0.0. A maximum of 100 proteins can be transferred to STRING and subjected to GO or KEGG analysis (the 100 with the highest scores are selected automatically).
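      If you work with a downloaded list of co-regulation partners, the effect of changing the cut-off can be reproduced offline. The following Python sketch assumes the scores have already been loaded into a dictionary; the partner names and scores are invented, and `select_partners` is a hypothetical helper, not a function offered by the website.

```python
def select_partners(scores, cutoff=0.5, max_partners=None):
    """Return (protein, score) pairs scoring above `cutoff`, best first.

    `max_partners` optionally caps the list, mirroring the upper limit
    of 100 proteins used for STRING and GO/KEGG analysis.
    """
    hits = [(protein, s) for protein, s in scores.items() if s > cutoff]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits if max_partners is None else hits[:max_partners]


# Made-up co-regulation scores for one query protein.
partner_scores = {
    "PARTNER_A": 0.91,
    "PARTNER_B": 0.62,
    "PARTNER_C": 0.48,
    "PARTNER_D": 0.15,
}

default_hits = select_partners(partner_scores)               # cut-off 0.5
strict_hits = select_partners(partner_scores, cutoff=0.7)    # fewer, stronger hits
lenient_hits = select_partners(partner_scores, cutoff=0.4, max_partners=100)

print(default_hits)
print(strict_hits)
print(lenient_hits)
```

      Raising the cut-off to 0.7 keeps only PARTNER_A, while lowering it to 0.4 additionally brings in PARTNER_C.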
    • What is the proteome co-regulation map?
    • We determine a co-regulation score for all possible pairs of proteins. In other words, for every protein we measure how strongly (or weakly) it is co-regulated with any other protein. These are the data you can download; they are displayed in the tables and form the basis of all downstream analyses. However, this co-regulation dataset is also very complex, so we visualise it through t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a technique for dimensionality reduction, which captures relationships between proteins in the high-dimensional co-regulation dataset and preserves them in a two-dimensional map. The more similarly two proteins behave across ProteomeHD, the closer together they are plotted in the t-SNE map. In this way, complex relationships between thousands of proteins can be visualised in a simple, human-readable 2D plot.
    • Why are some co-regulated proteins far apart on the map?
    • In short, a protein may be far apart from its co-regulation partner if it is more strongly co-regulated with a different set of proteins. The co-regulation map is a two-dimensional representation of the complex (high-dimensional) co-regulation dataset. It is a simplification that enables us to show the general layout of the human proteome in a 2D plot. However, many proteins are multifunctional and may be partially co-regulated with proteins from distinct biological processes. In such cases the position in the 2D map needs to be a compromise, optimised by the t-SNE algorithm. In general, it is best to use the map as a visualization tool or to get a quick, general impression of a protein’s potential function.
      For detailed functional annotation it is recommended to explore the actual pairwise co-regulation scores.
    • Why are not all proteins covered by the map?
    • At the moment we restrict the co-regulation analysis to proteins which have been observed in at least 95 SILAC experiments, in order to increase robustness and accuracy.
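      The coverage filter described above can be sketched as follows. This is a minimal Python illustration, assuming the data are held as one list of SILAC ratios per protein with missing values stored as None or NaN; the identifiers, values, and function names are hypothetical and do not reflect how ProteomeHD is stored internally.

```python
import math

MIN_EXPERIMENTS = 95  # threshold stated in the answer above


def count_observations(ratios):
    """Count measured values; None and NaN are treated as missing."""
    return sum(
        1
        for r in ratios
        if r is not None and not (isinstance(r, float) and math.isnan(r))
    )


def filter_by_coverage(matrix, min_experiments=MIN_EXPERIMENTS):
    """Keep proteins observed in at least `min_experiments` experiments."""
    return {
        protein: ratios
        for protein, ratios in matrix.items()
        if count_observations(ratios) >= min_experiments
    }


# Toy matrix with 294 columns per protein (made-up values and identifiers).
toy_matrix = {
    "PROT_A": [0.1] * 120 + [None] * 174,
    "PROT_B": [0.1] * 94 + [float("nan")] * 200,
    "PROT_C": [0.1] * 95 + [None] * 199,
}

covered = filter_by_coverage(toy_matrix)
print(sorted(covered))
```

      In this toy example PROT_B has only 94 measurements and is therefore left out of the analysis.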
    • Does the Gene Ontology and KEGG enrichment take into account the selected score cut-off?
    • Yes, when the protein or score cut-off is changed, GO and KEGG enrichment analysis is repeated. Only the top 100 proteins will be used for the analysis. This is an upper cap and not a fixed number. This means that if your protein list has 10 members, only those 10 will be used for enrichment analysis.
    • Can I link to a particular protein and cut-off?
    • Yes, the “Copy shareable link to clipboard” button creates a link that contains both the protein ID and the currently selected score cut-off.
    • Which organisms does ProteomeHD cover?
    • Homo sapiens only. Other organisms may be added in the long-term future.
    • Why does the STRING network not show all my co-regulated proteins?
    • We transfer up to 100 proteins to STRING. If there are more than 100 co-regulated proteins at your chosen score cut-off, only the 100 highest scoring will be transferred to STRING.
      However, you can download the entire list of co-regulation partners and search STRING manually at https://string-db.org
    • Do you have any APIs for accessing the resources?
    • We offer two separate API calls for direct linking to our resources. You can link directly to a protein of interest with a given score cut-off by constructing a link: https://www.proteomehd.net/proteomehd/QUERY_PROTEIN_UNIPROT_ACC/_SCORE_CUTOFF_
      Example: https://www.proteomehd.net/proteomehd/O15213/0.99

      Make sure that the exact UniProt accession is present in our data. You can copy such a link to your clipboard by clicking a button on the query page.

      We also offer a way to link to a specific interaction, i.e. a query protein and the given target with a score above the chosen cut-off.
      To do so, construct a link of the following form:
      https://www.proteomehd.net/proteomehd/highlight/_FROM_UNIPROT_/_TO_UNIPROT_/_SCORE_CUTOFF_
      Example: https://www.proteomehd.net/proteomehd/highlight/O15213/Q8NEJ9-2/0.99

      At the moment, purely automated access (e.g. returning JSON) is not implemented.
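      The two link formats described above can be assembled programmatically. The Python sketch below only builds the URL strings following those patterns; the function names are hypothetical, the range check on the cut-off is our own assumption, and nothing here checks whether an accession is actually present in ProteomeHD.

```python
from urllib.parse import quote

BASE_URL = "https://www.proteomehd.net/proteomehd"


def _check_cutoff(score_cutoff):
    # Assumption: cut-offs follow the 0 to 1 range of the co-regulation score.
    if not 0.0 <= score_cutoff <= 1.0:
        raise ValueError("score cut-off must lie between 0 and 1")


def protein_link(uniprot_acc, score_cutoff):
    """Link to a query protein at a given score cut-off."""
    _check_cutoff(score_cutoff)
    return f"{BASE_URL}/{quote(uniprot_acc, safe='')}/{score_cutoff}"


def highlight_link(from_acc, to_acc, score_cutoff):
    """Link to a specific query/target pair at a given score cut-off."""
    _check_cutoff(score_cutoff)
    return (
        f"{BASE_URL}/highlight/"
        f"{quote(from_acc, safe='')}/{quote(to_acc, safe='')}/{score_cutoff}"
    )


print(protein_link("O15213", 0.99))
print(highlight_link("O15213", "Q8NEJ9-2", 0.99))
```

      With the accessions from the examples above, the two calls reproduce the example links exactly.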
    • How can I cite ProteomeHD?
    • You can’t yet. But please check here again soon, we are working on it!
  • Questions about Progulon Finder (supervised machine learning)
    • Which Random Forest score cut-off do you use to define a “protein regulon”?
    • From a statistical perspective, a Random Forest score above 0.5 signifies that a protein belongs to the “same class” as the uploaded protein list. In our case this would translate to being part of the same regulon. This should be taken as a rule of thumb only. We find it is often helpful to manually inspect the predictions and perhaps choose a biologically more relevant cut-off at a slightly higher (or lower) Random Forest score.
    • How do you calculate the area under the ROC curve (AUC) and which value do you consider sufficient?
    • RegulonFinder calculates a ROC curve based on cross-validated training proteins (learn more about the architecture of the workflow). An area under the curve (AUC) of 1 means that all uploaded training proteins could be perfectly separated from ~1,000 randomly chosen proteins. In practice, AUCs between 1 and 0.9 indicate very good predictions, while values between 0.9 and 0.8 may still be useful. Care should be taken when analysing experiments with AUCs below 0.85, as they may have a significant amount of false-positive hits.
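      For readers unfamiliar with the AUC, the Python sketch below computes it from two lists of prediction scores using the pairwise formulation: the AUC equals the fraction of (training protein, random protein) pairs in which the training protein scores higher, with ties counted as one half. This illustrates the statistic itself and is not necessarily how the tool computes it; all scores are made up.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equivalent to the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 0.5).
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up cross-validated scores for training proteins and random proteins.
training_scores = [0.9, 0.8, 0.7, 0.4]
random_scores = [0.6, 0.3, 0.2, 0.1, 0.5]

auc = roc_auc(training_scores, random_scores)
print(auc)
```

      Here 18 of the 20 pairs are ranked correctly, giving an AUC of 0.9; perfectly separated score lists would give 1.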
    • Why do you use cross-validation instead of reporting the out-of-bag error?
    • Random Forests have an in-built mechanism to estimate the test set error, known as the out-of-bag (oob) error estimate. Unfortunately, due to certain computational constraints, we cannot currently extract and report the oob error automatically, but we hope to implement this feature in the future. As an alternative error estimate, we cross-validate training data and calculate the area under a ROC curve. Please note that our main reason for cross-validating training proteins is not to assess performance, but to get unbiased prediction scores for these training proteins.
    • Why does the result file contain multiple UniProt ACs for the same protein?
    • In proteomics, proteins are identified by sequencing peptides, but normally only a few peptides of each protein are observed. If protein isoforms differ outside the observed region it is impossible to know which isoform was actually detected. Therefore, all isoforms that potentially fit the data are reported (separated by a semicolon).
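      When post-processing a result file, such semicolon-separated entries can be split into individual accessions. The Python sketch below assumes isoforms carry the usual UniProt "-N" suffix; the example field value and the helper names are hypothetical.

```python
def split_accessions(field):
    """Split a semicolon-separated accession field into a clean list."""
    return [acc.strip() for acc in field.split(";") if acc.strip()]


def base_accession(acc):
    """Drop an isoform suffix, e.g. 'Q8NEJ9-2' -> 'Q8NEJ9'."""
    return acc.split("-", 1)[0]


# Hypothetical field as it might appear in a result file.
field = "Q8NEJ9;Q8NEJ9-2"

accessions = split_accessions(field)
genes = sorted({base_accession(acc) for acc in accessions})

print(accessions)
print(genes)
```

      Collapsing isoforms to their base accession in this way shows that both entries refer to the same protein.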
    • Can I use specific protein isoforms for training?
    • No. When uploading a protein list, all available protein isoforms will be used for training. It would be difficult to use specific isoforms and exclude others, because often it is not clear which isoform precisely was observed in the experiment. Proteomics experiments usually report a range of isoforms that could all fit the observed data. If your protein of interest has isoforms with large functional differences (and these isoforms are found in ProteomeHD), you will need to omit that protein from your training list.
    • Is there a limit to how many predictions I can submit?
    • Yes. Due to limited resources, each user is limited to 10 submissions per day. For unlimited access to the pipeline, please refer to our Download section, where you can download the appropriate KNIME workflow.