Web Site User’s Guide for Pathway Tools-Based Web Sites
3 Searching Pathway/Genome Databases
This document describes how to use Web sites based on the Pathway Tools software from SRI International. Since multiple Web sites such as BioCyc, YeastCyc, AraCyc, and MouseCyc are all based on the same underlying software, the same usage instructions apply to all. (Note that differences in configuration and in software version may introduce some variability among sites.)
Please note that the desktop version of Pathway Tools that you can install locally provides some additional operations compared to the Web capabilities described here.
Unless otherwise indicated, all Pathway/Genome Database searches are restricted to a single database. In most cases, a database describes a single organism – although a small number of multi-organism Pathway/Genome Databases exist (examples include MetaCyc and PlantCyc). The database against which searches will be conducted is indicated below the Quick Search box in the page banner. To search a different database, click on the ‘change organism database’ link below the Quick Search box. In the dialog that pops up, you can search for the organism of interest by starting to type its name, by browsing the organism taxonomy, or by querying various properties.
If the site supports user accounts, and you are logged in, you may select one database as your preferred database. This database will be your default selection when starting a new web session.
Once you have selected the desired database from one of the tabs described below, click OK to exit the dialog. This will navigate to the page of summary statistics for the selected database.
Note that if you follow a link to a page for a different organism database, then the selected database for searching will change to match the organism of the currently displayed page.
By default, the By Name tab will be initially selected. If a small number of databases is available, a full scrollable list of databases is present to select from. When a large number of databases is available, you must start typing or select a starting letter from the alphabetical index to the left of the database list in order to see the list of matching databases. If you start typing an organism name or select a starting letter, the full list of databases (if available) will be replaced by a list of databases matching the typed string or starting with the selected letter — you can use the mouse or the up/down arrows on your keyboard to select the desired database. An organism name will match the string you type if any word in its name (i.e. genus, species or strain name) starts with the string you type.
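The matching rule described above (a name matches if any of its words starts with the typed string) can be sketched as follows. This is only an illustration, not the site's implementation: the function name `matches_organism` is hypothetical, and the case-insensitive comparison is an assumption.

```python
def matches_organism(name, typed):
    """Return True if any word of the organism name starts with the typed string.

    Assumption: matching ignores letter case.
    """
    typed = typed.strip().lower()
    return any(word.lower().startswith(typed) for word in name.split())


databases = [
    "Escherichia coli K-12 substr. MG1655",
    "Bacillus subtilis subtilis 168",
    "Saccharomyces cerevisiae S288c",
]

# "col" matches the species word "coli"; "oli" matches nothing because
# no word in any of the names starts with it.
print([db for db in databases if matches_organism(db, "col")])
print([db for db in databases if matches_organism(db, "oli")])
```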
In the list of matching databases, some database names may be displayed with a gray background – these indicate databases that have had some level of manual review and/or curation. Tier 1 databases, i.e. those that have received at least a year of literature-based curation, will have a dark gray background. Tier 2 databases, i.e. those with a lower level of manual curation, will have a light gray background. All others are Tier 3 databases, which means they have been computationally generated with little or no manual review. Lists of your recently used databases and the site’s most popular databases provide shortcuts for selecting those databases.
The By Taxonomy tab allows you to select an organism by browsing for it. The number of organism databases in each class of organisms is listed after the class name. The taxonomy tree does not include all taxonomy classes, only those that contain at least one organism database – if a particular taxon does not appear in the tree, it means there is no database available for it or its children. Clicking on a class name will show or hide its list of child taxa. Clicking on an organism name will select that database and show its name at the top.
You may search for any taxon by starting to type its name in the text box. If you select one of the options from the resulting auto-complete box, the taxonomy will automatically expand to show the selected taxon (you must still click on the organism name in the taxonomy to select that database, however).
The By Organism Properties tab allows you to query for all organisms that have (or do not have) some property. The types of properties that can be queried (known as the organism “metadata”) include such attributes as when and where and from what host the sample was collected, whether or not the organism is a pathogen, its relationship to oxygen (e.g. aerobic or anaerobic), etc. Not all organism databases contain data for each of these attributes. In the list of properties from which to select, the number of databases that have values for that property is listed in parentheses.
After selecting a property, you can constrain its value, or just select all databases that have (or do not have) any value for that property. To select from a list of all available values, click in the text box. In the resulting list of possibilities, the number in parentheses after each value is the total number of organisms that match that value. If you start to type, the list of visible options will be limited to those that match the string you have typed. Multiple options may be selected by clicking in the text box again after selecting a value – in that case, an organism will satisfy the constraint if it matches any of the selected values (i.e. the values are connected by an implicit OR). For properties whose values consist of free text, you may also query by substring. The first few values that match your substring are shown, but you are not obligated to select any of them. For properties whose values are numeric, a variety of numeric operators are available, as well as the option to select from all available values. If you specify an = constraint, an organism will satisfy the constraint if its value falls within a small range on either side of the specified value – the size of this range depends on the property, and is indicated below with the description of each property. To specify a different range, use a combination of < and > constraints.
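The three kinds of value constraint described above can be sketched as simple predicates. This is a minimal illustration only: the function names, the example property, and the tolerance of 1 are invented here, since the actual range used for an = constraint depends on the property.

```python
def satisfies_equals(value, target, tolerance):
    """An '=' constraint: the value lies within `tolerance` on either side of the target."""
    return abs(value - target) <= tolerance


def satisfies_between(value, lower, upper):
    """A custom range built from a '>' constraint combined with a '<' constraint."""
    return value > lower and value < upper


def satisfies_any(value, selected_values):
    """Several selected values are connected by an implicit OR."""
    return value in selected_values


# Hypothetical numeric property with an invented tolerance of 1.
print(satisfies_equals(37.5, 37, 1))
print(satisfies_between(37.5, 30, 37))
# Hypothetical set of selected values for an oxygen-relationship property.
print(satisfies_any("aerobic", {"aerobic", "facultative"}))
```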
Up to six different constraints may be specified (use the “Add Constraint” button to add a new constraint, up to the limit). These may be connected by either AND (an organism must satisfy both constraints) or OR (an organism may satisfy either constraint). Since there is no way to group constraints, ordering becomes very important if you are building a query that combines both ANDs and ORs. Queries are processed in a left-to-right order, so X AND Y OR P AND Q is interpreted as ((X AND Y) OR P) AND Q, which may not match what was intended. If the ordering of constraints does not allow for a desired query, you may be better off splitting your query into multiple queries and searching for the desired organism one part of the query at a time.
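To see why ordering matters, the left-to-right rule can be compared with the conventional precedence in which AND binds more tightly than OR. The sketch below is illustrative only; `evaluate_left_to_right` is a hypothetical helper, not part of the Web site.

```python
def evaluate_left_to_right(values, connectors):
    """Combine truth values strictly left to right, with no operator precedence."""
    if len(connectors) != len(values) - 1:
        raise ValueError("need exactly one connector between each pair of values")
    result = values[0]
    for connector, value in zip(connectors, values[1:]):
        if connector == "AND":
            result = result and value
        elif connector == "OR":
            result = result or value
        else:
            raise ValueError(f"unknown connector: {connector}")
    return result


# An organism that satisfies X and Y but neither P nor Q.
x, y, p, q = True, True, False, False

# ((X AND Y) OR P) AND Q, which is how the query form reads the constraints.
left_to_right = evaluate_left_to_right([x, y, p, q], ["AND", "OR", "AND"])

# (X AND Y) OR (P AND Q), which is what a user may have intended.
conventional = (x and y) or (p and q)

print(left_to_right, conventional)  # prints: False True
```

In this case the organism would be excluded by the query form even though it satisfies the first pair of constraints, which is the situation where splitting the query into parts helps.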
The following properties are available for searching:
Once you have specified the desired constraints, use the “Find Organisms” button to search for all matching organisms. In the resulting table, which includes all properties for which at least one of the matching organisms has a value, you may click on any column heading to sort by that column. Click on a row to select that organism.
The Quick Search box in the upper right hand corner of every page is useful if you know the name (or part of the name) or database identifier of the object you are searching for. You may use this box to search for genes, proteins, compounds, RNAs, reactions, pathways, operons, and GO terms. If the query string matches a single object, the page for that object will be displayed immediately. If there are multiple matches, the full list of matches will be shown, organized by the type of object (e.g. gene, protein, etc.). Some examples of what can be entered into the Quick Search box include:
A few additional rules govern searches:
The Search menu contains links to specialized search pages for Compounds, Genes/Proteins/RNAs, Reactions and Pathways. Each such page contains options for searching using a number of different criteria, either individually or in combination. When the page is initially loaded, only the name searches are active, but by clicking on the different search bars, you can enable or disable additional search criteria. If multiple search criteria are specified for a given search, then unless otherwise specified the results must satisfy all of them (that is, an AND connector is used to combine the different criteria). The result of every object search is a table containing the names of all objects that satisfy the search, with hyperlinks to their corresponding data pages, along with any additional columns relevant to the particular search. The table will initially be sorted alphabetically by name, but small triangles in the column headers allow the user to sort by any column, in either ascending or descending order. The sections below describe the different search criteria that are available for each object type.
Many databases include information about DNA or mRNA sites other than genes. The kinds of sites that can be searched here include transcription units, promoters, terminators, transcription-factor binding sites, riboswitches, REP elements, transposons, phage attachment sites, etc., although most databases will not include all of these site types.
Some databases may include sets of growth media, along with information about whether or not the organism can grow on a particular medium and under what conditions (for example, gene knockout studies can indicate whether the organism can grow on a particular medium in the absence of a particular gene). To see the full list of growth media for a database, including an indication of which media have associated knockout data, click on the All Growth Media for this Organism button. Use the other fields of this form to search for growth media that meet certain criteria.
Some databases include DNA or mRNA sites that are not genes, such as transcription-units, promoters, terminators, binding-sites, extragenic-sites, etc. This page includes a checklist of all types of such sites that are present in the current database. Select one or more types that you wish to search. The other fields of this form allow you to further constrain your search.
The Advanced Search tool facilitates generation of queries that are more complex than those supported by the object search tools described above. Using the Advanced Search tool, you can write queries that combine data from multiple organisms or multiple types of objects, and you can search fields that are not supported by the individual object search pages. Detailed instructions for using the Advanced Search tool to construct complex queries are available here.
The Cross Organism Search tool is only available on the BioCyc.org web servers. It enables queries across all the organisms on the BioCyc.org website.
Search results are presented sorted by relevance (or match strength) in a table with clickable links, which link to the details for each matched entity. Each column in the table can be used to sort the results, with the relevance being used as the default. Re-sorting the table re-sorts all of the results, and this sorting is preserved as you navigate through the results table, from one page to the next.
This facility (not available for MetaCyc) allows you to perform sequence-similarity searches using the BLAST program to compare your protein or nucleic acid sequence against the complete genome of the selected organism database.
The Search Menu → Google This Site command uses Google to perform a full text search over this entire Web site. Searches will not be restricted to the selected database, and can locate text strings found in page comments, help pages, and other page content not queryable by other means. Submitting this form will direct the user outside this Web site to a page generated by Google. A Google full text search is also offered as an option when a Quick Search fails to return any result (or does not return the desired result).
Textpresso is a package for indexing and searching a corpus of biological literature. Textpresso searches are available for searching a large Escherichia coli literature corpus only at the BioCyc Web site, and are available only when EcoCyc is the selected database.
An ontology is a carefully constructed vocabulary of terms, often called a controlled vocabulary. The terms are organized into a classification hierarchy (also called a taxonomy). Ontologies can be used to browse and search for objects by drilling down from more general categories to more specific ones. Each Pathway/Genome Database contains several ontologies. Those that can be searched are available from the Ontologies sub-menu in the Search menu. These ontologies can also be accessed from the object search page for their particular object type. The browseable ontologies are:
Pathway Tools Web accounts give users the ability to customize their experience when accessing PGDBs via the Web, and to store SmartTables of objects in their account.
Web site accounts provide several benefits. Through your account you can:
To create an account, click “Create New Account” at the top right of most Web pages. (If those words are missing it probably means that Web Accounts are not enabled for this Pathway Tools Web site. The Pathway Tools User Guide describes how to enable and configure Web Accounts for a Pathway Tools Web site.)
The genome browser can be used to examine one replicon (chromosome or plasmid) at a time. Its tracks capability can be used to visualize high-throughput datasets in a genome context.
The genome browser can be invoked by
At the top of the genome-browser page, the full length of the chromosome is shown at low resolution. A region of the chromosome can be selected for display at much higher magnification in the lower part of the screen. The selected region will be drawn using as many lines as will comfortably fit on the Web browser page. The full chromosome view at the very top indicates the magnified region by means of a red, rectangular cursor.
Selection of the magnified region can be achieved by the following methods:
The magnified section indicates the transcription direction of genes by rectangular blocks with an arrow at one end, pointing from the 5’ to the 3’ end. ORFs for actual or inferred proteins have symmetrical arrowheads (with the arrow apex in the center), whereas RNA genes have an asymmetrical arrowhead (with the apex at the top edge). Phantom- and pseudo-genes are crossed out with a big, diagonal X. When a gene wraps across more than one line, a zigzag at the end of the line indicates that the gene continues on the next line. Clicking on a gene brings up the corresponding gene description page.
Gene arrows filled with solid colors have transcription unit (operon) information available. All the adjacent genes that are part of a given operon are assigned the same color. Genes that have not been assigned to any transcription unit are not colored. Additionally, transcription-units are indicated by a gray background area behind the genes, spanning the entire region of the operon.
Moving the mouse-cursor over the genes reveals their product name and the length in base pairs of the intergenic region between the chosen gene and its neighboring genes to the left and right. If the number of base pairs carries a minus sign, the genes overlap by that many bases. As an example:
Gene: xdhB Product: putative xanthine dehydrogenase subunit, FAD-binding domain Intergenic distances (bp): xdhA< +11 xdhB -3 >xdhC
This means that there are 11 bp to the left of xdhB before xdhA is reached, but to the right, xdhC overlaps with xdhB by 3 bp.
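The sign convention can be illustrated with a small calculation. The sketch assumes the distance is the number of bases strictly between the end of the left gene and the start of the right gene. The coordinates are invented so as to reproduce the +11 and -3 values of the example; they are not the real positions of these genes.

```python
def intergenic_distance(left_end, right_start):
    """Bases between two neighboring genes; a negative result means they overlap."""
    return right_start - left_end - 1


def describe(left_gene, right_gene, distance):
    if distance < 0:
        return f"{left_gene} and {right_gene} overlap by {-distance} bp"
    return f"{distance} bp separate {left_gene} and {right_gene}"


# Invented coordinates: xdhA ends at 1000, xdhB spans 1012..2000, xdhC starts at 1998.
print(describe("xdhA", "xdhB", intergenic_distance(1000, 1012)))
print(describe("xdhB", "xdhC", intergenic_distance(2000, 1998)))
```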
If the overlap between adjacent genes is more than a small amount, the shorter gene is drawn above the longer gene to avoid visual clashes.
When zooming in to a great level of detail, transcription start sites and terminators are drawn. Transcription start sites are indicated by small arrows that point toward the 3’ end of the transcript. Moving the mouse-cursor over a transcription start site reveals the operon it is part of. The transcription factors controlling the operon are also shown, with a plus sign meaning activation and a minus sign meaning inhibition. Clicking on a transcription start site brings up the corresponding transcription unit description page.
External datasets can be shown alongside the display of a replicon region, in form of additional tracks that are uploaded by the user. The supported tracks file format is GFF, version 2. A short description of this format can be found on the help page, reached by clicking on the green icon containing a question mark, on the far right side of the genome browser’s navigational controls.
The GFF file allows definition of segments on the chromosome that are denoted by a start and stop base-pair position. In an attribute field of the file, a name can be assigned to the segment, and in a score field, a numerical value (such as an expression value) can be supplied. This allows a broad range of different data types to be shown in the genome browser, aligned with the genes and transcription units that a PGDB already describes. This could include alternate gene predictions, or the results of expression experiments. Each specified segment can state a source and feature value, allowing different segment types to be supplied in one file. The external track mode of the genome browser will display different combinations of source/feature values grouped together. If in these groups some of the shown segments overlap due to their base-pair positions, such horizontal segments will be displayed on separate lines, to avoid visual clashes.
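As a rough illustration of the fields described above, the sketch below reads tab-separated GFF version 2 lines (seqname, source, feature, start, end, score, strand, frame, attributes) and groups the segments by their source/feature combination. It is not the parser used by the genome browser; the names `Segment`, `parse_gff2`, and `group_by_track` are hypothetical, and the sample data are invented.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column holds "."
    attributes: str


def parse_gff2(text):
    """Parse GFF2 text into Segment records, skipping comments and short lines."""
    segments = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete GFF2 record
        seqname, source, feature, start, end, score = fields[:6]
        attributes = fields[8] if len(fields) > 8 else ""
        segments.append(Segment(seqname, source, feature, int(start), int(end),
                                None if score == "." else float(score), attributes))
    return segments


def group_by_track(segments):
    """Group segments by (source, feature), one group per displayed track."""
    tracks = defaultdict(list)
    for segment in segments:
        tracks[(segment.source, segment.feature)].append(segment)
    return dict(tracks)


# Invented sample data, for illustration only.
sample = "\n".join([
    "# invented example",
    "\t".join(["chr1", "array", "expression", "100", "250", "1.5", "+", ".", "probe1"]),
    "\t".join(["chr1", "array", "expression", "300", "420", "-0.7", "+", ".", "probe2"]),
    "\t".join(["chr1", "predictor", "gene", "90", "500", ".", "-", ".", "alt_gene1"]),
])

for key, members in group_by_track(parse_gff2(sample)).items():
    print(key, len(members))
```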
To view data from such a GFF file in an external track, first open the genome browser. Next click the “Show Tracks” button to the right of the gene name dialog box. This will enter the external tracks mode, in which the magnified genome region will no longer wrap to fill the screen, instead making room for external tracks that will be displayed underneath. Vertical hair lines will be shown for easier visual alignment of features in external tracks with the magnified region. Next, add tracks data from an external data file using the controls at the bottom of the page. The data file can be specified through a Web site URL (click the “Add Track” button to the right of “Load track data from GFF file via URL”), or from a file on your computer’s hard disk (click “Browse...” to find the file, then click its associated “Add Track” button). Depending upon the size of your GFF file, the upload can take several minutes. During this time, the page will not respond, and you should not click any other controls. Once the file has been uploaded and parsed successfully, the page refreshes.
The external tracks display will show the feature name on the left, the sequence name if one is included, and the appropriate color to match the feature’s score, if a score value was found in the GFF file. Following the display of a track, you can continue to browse the genome normally, using the standard Left, Right, Zoom Out, and Zoom In controls, and the Gene Name box.
You can display data from more than one GFF file at the same time. Load each file individually using the procedure described above. Tracks from the first file loaded will appear just below the gene line. Tracks from the second file loaded will appear below those from the first, and so on. The order of the tracks can be changed, by left-clicking on the underlined track titles on the left side, which name the feature type. The popup menu allows the chosen track to be moved up or down by one step relative to the current ordering.
The horizontal bars represent the feature data found in the GFF track file. These are arranged in rows distributed vertically, so as to help prevent overlapping features from running into each other and being indistinguishable. The number of distributed rows may vary with the zoom scale, so that features can fit; there is no other meaning to the number of lines. The length of each horizontal bar shows the extent of each individual feature reading. The color is drawn from a spectrum that shows the magnitude of a score. In order to get a better feel for this magnitude, a graph of the same track feature data is also plotted above the horizontal bars. In the default graph mode, each feature score is represented by a horizontal line spanning the feature’s start and end base-pair coordinates. The magnitude of the score is represented as the height on the graph. This offers an intuitive method of viewing trends and anomalies in the data at a glance.
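The arrangement of overlapping horizontal bars into separate rows can be approximated with a greedy interval layout. The sketch below is one common way to do this and is offered only as an assumption about the general idea; the browser's actual layout algorithm is not documented here, and `assign_rows` is a hypothetical name.

```python
def assign_rows(segments):
    """Assign each (start, end) segment to the first row where it does not overlap.

    Coordinates are inclusive base-pair positions. Returns one row index per
    segment, in the same order as the input.
    """
    row_ends = []                # last occupied position in each row
    rows = [0] * len(segments)
    for index in sorted(range(len(segments)), key=lambda i: segments[i]):
        start, end = segments[index]
        for row, last_end in enumerate(row_ends):
            if start > last_end:         # fits after the last bar in this row
                row_ends[row] = end
                rows[index] = row
                break
        else:                            # overlaps every existing row
            row_ends.append(end)
            rows[index] = len(row_ends) - 1
    return rows


# Invented feature coordinates: the second bar overlaps the first, and the
# fourth overlaps the third, so two rows are needed.
print(assign_rows([(1, 10), (5, 15), (11, 20), (16, 30)]))  # prints: [0, 1, 0, 1]
```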
In the bar graph mode, the rectangular area between the feature’s horizontal line and the baseline (corresponding to a score of zero) is filled by a solid color. This is useful for features that tend to be very short, which may otherwise be hard to see.
It is possible to show or hide the horizontal bars, the graph plot, or both, independently for each of multiple tracks viewed simultaneously. Use the pull-down selector next to the listing of the track at the bottom of the page, which switches between “Show both graph and horizontal”, “Show both bar graph and horizontal”, “Show only graph”, “Show only bar graph”, “Show only horizontal”, and “Both invisible”. This control allows you to stack graphs from different tracks close to each other, so that you can compare them and see fine differences between them.
It is also possible to shift the plotted range of this graph for each track file viewed. Beside the listing of the track there is also a line saying “graph Y range from [ ] to [ ]” with a “Set” button. Fill in the desired lower and upper Y coordinates of the range, press the “Set” button, and that particular graph will be redisplayed with that setting. Entries may be in integers or decimals. The lower range must be less than the upper range coordinate. Score values that fall outside the range will result in the display of a horizontal line just a little bit outside the graph range, to visually indicate this over- or underflow condition.
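The effect of the Y range on plotted scores can be sketched as a mapping from a score to a relative height, where 0 is the bottom of the graph and 1 is the top. This is an illustration under assumptions: the function name and the size of the overflow margin (5% of the graph height) are invented, since the text says only that out-of-range values are drawn a little outside the range.

```python
def plot_height(score, lower, upper, overflow_margin=0.05):
    """Map a score to a relative height within the Y range [lower, upper].

    Scores outside the range are pinned just beyond the graph edge to signal
    overflow or underflow. The margin value is an assumption.
    """
    if lower >= upper:
        raise ValueError("the lower bound must be less than the upper bound")
    if score > upper:
        return 1.0 + overflow_margin
    if score < lower:
        return -overflow_margin
    return (score - lower) / (upper - lower)


print(plot_height(5, 0, 10))    # a score in the middle of the range
print(plot_height(12, 0, 10))   # overflow: drawn slightly above the graph
print(plot_height(-1, 0, 10))   # underflow: drawn slightly below the graph
```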
In graph mode, the entire track is assigned a color from a predefined set of colors. However, it is possible for the user to choose the color of a track, by adding a new header comment line close to the top of the GFF file, before uploading the file. An example line looks like this:
Several common color names can be substituted for "green".
The comparative genome browser can be used to examine several replicons (chromosomes or plasmids) simultaneously, side by side. This view facilitates comparison of related organisms to observe similarities and differences in their gene arrangements. For the alignment to work, ortholog links must exist among genes of the organisms to be compared. The comparative genome browser is usually entered from a page describing a gene. To invoke it, select Align in Multi-Genome Browser from the operations box on the right side of the page. You will first be asked to specify the organisms whose genome regions you wish to compare. The selected set of organisms is remembered for some time by the Web browser. If you wish to change them, use the command Change organisms/databases for comparison operations.
When the comparative genome browser is invoked from a gene page, that gene and its organism orchestrate the rest of the alignment. In the display, the top-most replicon is the reference, against which the comparisons are made by following the ortholog links for every gene of the top replicon in its visible section. The selected gene that is the focus of the comparison is highlighted on each replicon by a thick outline and a slanted hashed background. These selected genes are lined up at the center position of their lengths. The magnified region can be adjusted by the following methods:
Genes with solid colors have links to orthologs. Corresponding orthologs are assigned the same color, out of a set of a dozen colors that will be reused repeatedly. Genes for which no ortholog links were found in the PGDB are not colored. The other display features are the same as described for the regular genome browser.
A SmartTable is a collection of PGDB objects, such as genes or pathways, together with associated data, that can be displayed in tabular form. SmartTables (formerly called “Web Groups”) allow you to store experimental results (e.g., a set of genes of interest from an experimental study), analyze those results (e.g., perform an enrichment analysis to learn if those genes share common biological processes, or paint those genes into a metabolic map diagram), and share SmartTables with colleagues. SmartTables can be created from tabular data files, and from query results, and SmartTables can be exported to files. Transformations, filtering, and set operations on SmartTables can be performed. Example transformations include:
Web SmartTables are stored in a user’s web account, so to create SmartTables you must have an account and be logged in. Users who aren’t logged in can view and download SmartTables that others have made public. Each SmartTable has a persistent URL, so SmartTables can be used as a data publishing and sharing platform. SmartTables can be private, public, or shared with a selected set of users.
Firefox is the recommended browser to use with SmartTables. Other browsers will work but have not been as thoroughly tested with SmartTables and thus minor issues may arise. Use of Internet Explorer is discouraged, but, for the most part, will work as well.
A number of SmartTables operations can also be invoked via web services.
Some terminology: A SmartTable consists of a set of rows and columns. A cell is the intersection of a row and a column, and can contain one or more values, which may be Pathway Tools objects (such as genes or pathways), numbers, or strings.
A SmartTable is displayed on its own web page (see the figure below). The URL of this page is persistent and may be bookmarked or shared. At the top of this page are some metadata about the SmartTable, such as its title and a textual description (these can both be edited by clicking on them). Information about the SmartTable’s contents and sharing status is also displayed.
In this example, we started with a SmartTable of genes (in the first column after the checkboxes), and added some properties.
Typically the first column of a SmartTable will be a set of PGDB frames (e.g., a set of genes from a search or from an experimental result) and other columns will be properties or other values derived from the first column (e.g., the products of the genes in the first column). The blue column headings are clickable and can be used to select individual columns for certain operations. A SmartTable must always contain at least one column.
If a SmartTable has more elements than will fit on a page, paging controls will be displayed above the column headings. All rows can also be displayed on one page.
The checkboxes on the left are used to select subsets of the SmartTable’s rows for deleting or copying to a new SmartTable. Note that checkboxes work properly over multiple pages: you can check some rows, navigate to a new page and check some more, and the rows checked on the first page will still be considered checked. Checking/unchecking the checkbox in the header will check or uncheck all rows in the SmartTable (not just the ones on the current page). This checkbox behavior also applies to any lists of SmartTables.
The SmartTable directory page provides a list of accessible SmartTables. It may be accessed via any of the items under the SmartTables menu. The directory is composed of several tabs:
By default the SmartTable directory is ordered by update time (most recently changed first), but it can be resorted using the sort arrows in column headings.
There are a number of ways to create a SmartTable. To create a saved SmartTable you must be logged-in to the PGDB website; otherwise the SmartTable will be temporary.
The results of web searches (e.g., from the Search → Search compounds page) can be converted to a SmartTable by means of the “Turn into a SmartTable” button.
An empty SmartTable can be created and filled in by hand. To do this:
A SmartTable can be created by importing a text file in tab-separated value format.
Unless “Try to make objects” is selected in the upload menu, values in uploaded files are initially just strings. To turn them into recognized database objects (e.g., genes) after importing, select the appropriate column and use the Column → Set Type… action.
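A tab-separated file suitable for this kind of import can be produced with a few lines of code. The sketch below is only an example: the helper name `to_tsv` is hypothetical and the numeric values are invented. No header row is written, because the text does not say whether the import expects one.

```python
import csv
import io


def to_tsv(rows):
    """Serialize rows (lists of strings) as tab-separated text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Gene names in the first column, an invented measurement in the second.
rows = [
    ["trpA", "2.1"],
    ["trpB", "1.8"],
]

# Write the text to a file with open(..., "w") before uploading it.
print(to_tsv(rows), end="")
```

After the upload, the first column would still hold plain strings unless “Try to make objects” was selected, so the Column → Set Type… action would be the next step.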
A SmartTable can be created by importing a text file that specifies the coordinates of replicon regions, and associated sequence variants, in a tab-separated file format. A special transformation supports further analysis and interpretation of sequence-variant data; see Section 6.5.2.
To perform an import via a file of replicon coordinates, do the following:
The input file format is as follows (an example file is available at http://brg.ai.sri.com/ptools/replicon-coords.dat):
Replicons can be specified in the file by either frame name or common name. Nucleotide coordinates for the start and end positions are relative to the replicon specified. If only a start position or only an end position is given, the region is defined as a single nucleotide. Invalid data may result in a row containing “NIL”, and may produce other unexpected results for that row.
The resulting SmartTable will contain either one or two columns — the first column will contain the specified regions; the second column will contain region comments, if supplied; see example below. Clicking on a cell in the first column will open the genome browser around that region.
There are a number of ways to create new SmartTables from existing SmartTables. A SmartTable can be copied via the New → Copy of this SmartTable action. Additionally, if the SmartTable can only be viewed but not edited, such as “Special SmartTables”, a message will appear prompting the user to create a writeable copy of the SmartTable.
The contents of a SmartTable column can be turned into a new SmartTable, using the + icon that appears in column headings, or using the New → SmartTable from Column action (these are equivalent operations).
Rows of a SmartTable can be used to create a new SmartTable that shares the same column headings by selecting the desired rows using the checkboxes at the beginning of each row, then using the New → SmartTable from Selected Rows action.
See also the Filtering operation which has the option of creating a new SmartTable based on a filtered subset of rows.
SmartTables can be manipulated in a large number of ways, both at a fine level of granularity (such as editing individual cells), and by applying transformations to an entire SmartTable.
Property columns show attributes (slot values) of an object, such as the molecular weight of a compound or the pI of a protein. The most common situation is to add a property column for the objects listed in the first column of the SmartTable, but the Add Property Column dropdown menu will list available properties to show for the currently selected column. Frequently used properties include Common-Name, Comment, Citations, and Creation-Date. The ability to create a property column or an enrichment column from another property column may not be available.
Columns can be added to a SmartTable from the Add → Column action (which creates an empty editable column), or by using the transform and property selectors (see below).
Editable columns (which are those that are not defined by a transform or other computation) can be edited by clicking the edit icon in the column header. This changes the cells to editable fields. Clicking the icon a second time will turn off editing for that column.
A row can be added by means of the link at the bottom of a SmartTable, or using the Add → Row action (they are equivalent). Any editable cells in the new row are displayed in edit mode, so values can be entered.
Additionally, certain object pages, such as those for a gene or protein, have an “Add to SmartTable” button, which places the object in an existing SmartTable.
Rows can be deleted by selecting them using the checkboxes on the left of the display, then choosing the Delete → Delete checked rows action.
Columns can be rearranged with the Column → Move … menu items. They can be deleted with the Columns → Delete menu item. These operations apply to the selected column. A column can also be deleted by clicking on the “–” icon in the column header. This icon will not be present if deleting the column is not currently a valid action, such as when the SmartTable has only one column.
SmartTables can be resorted on the values of any column by means of the sorting controls (triangles) in column headers.
Filtering means selecting a subset of rows from a SmartTable according to some criterion. The options offered in the filter menu depend on the column type. For example, numeric columns offer conditions on the value, such as greater than, equal to, less than, and so on. Likewise, string columns offer various substring conditions. To filter, select the appropriate column and choose the Filter action. A dialog appears in which you specify the filtering criterion.
The filter can either modify the SmartTable in place or create a new SmartTable with a specified name. In either case, if the resulting SmartTable is empty, an error is displayed instead of completing the operation.
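The filtering behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the site's implementation: the function name `filter_rows`, the dictionary-based rows, and the sample values are all hypothetical.

```python
from typing import Callable

Row = dict[str, object]


def filter_rows(rows: list[Row], column: str,
                keep: Callable[[object], bool]) -> list[Row]:
    """Return the rows whose value in `column` satisfies `keep`.

    Raises ValueError when nothing matches, mirroring the documented
    behavior of reporting an error instead of producing an empty SmartTable.
    """
    kept = [row for row in rows if keep(row[column])]
    if not kept:
        raise ValueError("filter would produce an empty table")
    return kept


# Illustrative rows only; the numbers are approximate placeholders.
compounds = [
    {"name": "L-tryptophan", "mol_weight": 204.2},
    {"name": "spermidine", "mol_weight": 145.2},
    {"name": "pyruvate", "mol_weight": 87.1},
]

# Numeric column: a "greater than" condition.
heavy = filter_rows(compounds, "mol_weight", lambda v: v > 100)

# String column: a substring condition.
named_pyr = filter_rows(compounds, "name", lambda v: "pyr" in v)
```

Returning a new list corresponds to creating a new SmartTable; assigning the result back to `compounds` would correspond to filtering in place.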
The values in cells have a type, which may be either a Pathway Tools object (e.g., a gene) or a string or number. Generally values in a single column will all be of the same type, but this is not required. The type can be controlled by means of the Column → Set Type… action. In general this is used after importing data from a file, to turn string values into Pathway Tools objects.
Under the Set Operations… action, set operations such as union, intersection, and difference can be performed between the current SmartTable and a second SmartTable. The result can either be placed in a new SmartTable or replace the contents of the current SmartTable. For example, these operations can compute the intersection (items common to both) of two SmartTables.
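As a small illustration of what these three operations compute, here is a sketch using Python sets. The gene names stand in for the first columns of two hypothetical SmartTables; nothing here reflects how the site stores SmartTables.

```python
# First-column contents of two hypothetical SmartTables.
table_a = {"trpA", "trpB", "phoA"}
table_b = {"trpB", "phoA", "lacZ"}

union = table_a | table_b          # items in either table
intersection = table_a & table_b   # items common to both tables
difference = table_a - table_b     # items in the first table only

print(sorted(union))         # ['lacZ', 'phoA', 'trpA', 'trpB']
print(sorted(intersection))  # ['phoA', 'trpB']
print(sorted(difference))    # ['trpA']
```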
Transformations apply a computational procedure to all cells within a selected SmartTable column to generate a new column in that SmartTable. To perform a transformation, select a column, then click on the Transformations drop-down menu. Depending on the type of objects contained within the selected column, different transformations will be available; e.g., different transformations are available for genes than for metabolites. Overall, the difference between properties and transformations is that properties of an object are stored in the database containing that object, whereas transformations are computed by the software.
The easiest way to see what transformations are available for a column type in question is to view a SmartTable containing that type of column and examine the transformations drop-down menu.
Example transformations include: transforming a column of genes to their upstream binding sites, to their promoters, to their Gene Ontology terms, to their orthologous genes within another PGDB, or to the set of genes regulated by those genes; transforming a column of pathways to the genes within the pathways, to the metabolites within the pathways, or to the reactions within the pathways. The following subsections present transformations on metabolites, and a transformation for analyzing sequence variant information.
The menu below shows the transformations available when a column of metabolites is selected. For example, the “Pathways of compound” transformation will generate a new column in which each cell contains the set of metabolic pathways in which the compound in the selected cell of the same row occurs. Imagine that we want to create a new SmartTable consisting of all pathways in which the metabolites of the preceding SmartTable occur, that is, a new SmartTable consisting of the result of the preceding transformation. We can do so by clicking the “+” at the top of the column containing the pathways. That operation will create a new SmartTable with two columns: Column 1 contains a non-duplicative list of all pathways in the preceding column; Column 2 lists the metabolites from Column 1 of the previous SmartTable that are present in each pathway.
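The two-column result described above is essentially an inversion of the metabolite-to-pathways mapping. A minimal sketch of that inversion follows; the function name and the sample pathway names are hypothetical and chosen only for illustration.

```python
def pathways_to_metabolites(rows):
    """Invert (metabolite, [pathways]) rows into {pathway: [metabolites]}.

    Each pathway appears once (the non-duplicative Column 1); its value
    lists the metabolites from the original table found in that pathway
    (Column 2).
    """
    result = {}
    for metabolite, pathways in rows:
        for pathway in pathways:
            members = result.setdefault(pathway, [])
            if metabolite not in members:
                members.append(metabolite)
    return result


# Hypothetical output of a "Pathways of compound" transformation.
transformed = [
    ("L-tryptophan", ["tryptophan biosynthesis", "tryptophan degradation"]),
    ("chorismate", ["tryptophan biosynthesis"]),
    ("pyruvate", []),  # a metabolite with no pathways contributes no rows
]

new_table = pathways_to_metabolites(transformed)
```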
The transformation “Compare – remove objects present in other species PGDB” will generate a new column containing those metabolites not present in another specified PGDB. The transformation “Compounds – proteins that bind compound” will generate a new column containing all proteins known to bind each corresponding metabolite (e.g., as an enzyme activator or transcription-factor ligand).
This transformation takes as its starting point a SmartTable of genome regions and sequence substitutions within those regions, as described in Section 6.3.4. The transformation “Sequence – nearest gene to DNA region” adds several new computed columns to such a SmartTable, shown here:
Column 3 lists the gene whose coding region is nearest to the DNA region in the first column.
Columns 4 and 5: If the coding region of the nearest gene overlaps the DNA region in the first column, then Column 4 says “intragenic” followed by the DNA strand from which the gene is transcribed; Column 5 lists the amino-acid change caused by the substitution at the given region (the column is empty for RNA-coding genes). If the coding region of the nearest gene does not overlap the region in the first column, Column 4 states the distance from the region in the first column to the coding region of the nearest gene, and Column 5 is blank.
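The logic behind Columns 3 and 4 can be sketched as follows. This is a simplified illustration under several assumptions that the guide does not state: coordinates are inclusive integers, distance is measured between the closest ends of the region and the coding region, and the output strings are invented. The amino-acid change in Column 5 is not computed here.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # first base of the coding region (inclusive)
    end: int     # last base of the coding region (inclusive)
    strand: str  # "+" or "-"


def distance(region_start, region_end, gene):
    """Gap in base pairs between a region and a coding region; 0 if they overlap."""
    if region_end < gene.start:
        return gene.start - region_end
    if region_start > gene.end:
        return region_start - gene.end
    return 0


def nearest_gene(region_start, region_end, genes):
    """Return (gene name, location description) for the closest gene."""
    gene = min(genes, key=lambda g: distance(region_start, region_end, g))
    gap = distance(region_start, region_end, gene)
    location = f"intragenic {gene.strand}" if gap == 0 else f"{gap} bp"
    return gene.name, location


# Hypothetical genes and regions.
genes = [Gene("geneA", 100, 500, "+"), Gene("geneB", 800, 1200, "-")]
inside = nearest_gene(300, 300, genes)
outside = nearest_gene(700, 700, genes)
```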
A natural next analysis step is to click on the top of the Nearest Gene column and then perform an enrichment analysis (described in the next section) to determine what these genes have in common.
Enrichment analysis is a computational technique for identifying known categories of objects (e.g., pathways) that are statistically over-represented in a set of objects (e.g., genes that are significantly up-regulated in an expression experiment). For example, enrichment analysis allows us to ask whether a set of genes contains more genes regulated by a given transcriptional regulator than one would expect to occur by chance, or more metabolites in a given metabolic pathway than one would expect to occur by chance. Please see the Pathway Tools Users Manual for more information on enrichment, including a description of the parameters available on the web.
Enrichment analysis can be invoked on a set of objects in a SmartTable by:
This operation always creates a new SmartTable, which contains three columns: the enriched objects, the p-value, and the matched objects from the original SmartTable. The new SmartTable will be sorted by p-value, lowest (most significant matches) first.
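To make the shape of this result concrete, here is a sketch that produces the same three columns sorted by p-value. It assumes a one-sided hypergeometric test with no multiple-testing correction; the guide does not say which statistic or correction the software uses (see the Pathway Tools Users Manual), so treat the choice of test, the function names, and the sample data as assumptions.

```python
from math import comb


def hypergeom_p_value(population, category_size, sample_size, hits):
    """P(X >= hits) when `sample_size` objects are drawn without replacement
    from `population` objects, of which `category_size` belong to the category."""
    upper = min(category_size, sample_size)
    favorable = sum(
        comb(category_size, k) * comb(population - category_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return favorable / comb(population, sample_size)


def enrich(sample, categories, population_size):
    """Return (category, p-value, matched objects) rows, lowest p-value first."""
    rows = []
    for name, members in categories.items():
        matched = sorted(sample & members)
        if not matched:
            continue
        p = hypergeom_p_value(population_size, len(members),
                              len(sample), len(matched))
        rows.append((name, p, matched))
    rows.sort(key=lambda row: row[1])
    return rows


# Hypothetical data: 20 genes in total, two categories, a 4-gene sample.
categories = {
    "pathway A": {"g1", "g2", "g3", "g4"},
    "pathway B": {"g5", "g6", "g7", "g8", "g9", "g10"},
}
results = enrich({"g1", "g2", "g3", "g5"}, categories, population_size=20)
```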
Once a SmartTable is defined, there are a few things that can be done with it (other than browse it on the web). The SmartTable can be exported in a variety of ways or shared with others.
SmartTables can be exported to tab-separated value format files using the SmartTables → Export → to Spreadsheet File … menu command. When selected, the option is given whether to export the frame names of objects stored in the SmartTable or to use the common name of the objects. Keep in mind that, generally, it’s easier to re-import data by using frame names in the generated file, but the file will also be more difficult to read.
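The trade-off between frame names and common names can be illustrated with a short sketch that writes a tab-separated file. The cell representation (a dictionary with `frame` and `common_name` keys) and the function names are hypothetical; the site's actual export format may differ in detail.

```python
import csv
import io


def cell_text(cell, use_frame_names):
    """Render one cell: object cells are dicts, other cells are plain values."""
    if isinstance(cell, dict):
        return cell["frame"] if use_frame_names else cell["common_name"]
    return cell


def export_tsv(headers, rows, use_frame_names):
    """Return the table as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    for row in rows:
        writer.writerow([cell_text(cell, use_frame_names) for cell in row])
    return buffer.getvalue()


rows = [[{"frame": "TRP", "common_name": "L-tryptophan"}, 6.3]]
with_frames = export_tsv(["Compound", "Value"], rows, use_frame_names=True)
with_names = export_tsv(["Compound", "Value"], rows, use_frame_names=False)
```

The frame-name version (`TRP`) is unambiguous when re-imported, while the common-name version (`L-tryptophan`) is easier for a person to read.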
SmartTables with a gene column can be exported to FASTA format files using the Export → to FASTA File… action. The sequences used will be the currently selected column and the names used will be a string representation of the values in the first column.
Objects of the appropriate types (any types that have frame representations in the current PGDB, such as compounds, reactions, or genes) can be displayed over the cellular overview using the Paint Data → On Cellular Overview command. Be sure to select the appropriate column first. If the first column of the SmartTable contains objects (e.g. genes, compounds), and one or more other columns contain numerical data values, then the SmartTable can be displayed on the Cellular Overview Omics Viewer using the command Paint Data → On Cellular Overview Omics Viewer. You will be asked to select the data columns you wish to display, and to specify what kinds of values they are (e.g. absolute or relative, log or linear). Another way to paint data from a SmartTable on the Cellular or Regulatory Overview is to navigate to the desired overview and use the command Overlay Experimental Data → From SmartTable.
By default, SmartTables are readable and writeable only by their creator. Access can be granted to other users by means of the Sharing dialog, available via the Sharing… command.
Access by the general public is controlled by the first two checkboxes. “Public?” means that anyone can view the contents of the SmartTable; “Public and writable?” means that anyone can view and edit the contents of the SmartTable (editing is restricted to logged-in users).
Access can also be controlled on a per-user level using the “Share with users” boxes, which accept email addresses of registered Pathway Tools users.
As part of SmartTables, an enhanced public user page has been created, which can be accessed by clicking on any user name in the SmartTable directory (try the Public SmartTables tab). A user page displays the user’s name, an optional user-settable graphic picture, and a list of the user’s public SmartTables. There is also a user directory available.
Under the Browse this SmartTable command, the current SmartTable can be browsed one row at a time. Depending on the type of data in the SmartTable, various text and image elements will be displayed in a single page for a row. In the upper-left corner of the page, a grey box will be shown that displays the name of the SmartTable being browsed as well as a Next link to move to the next row’s page. The Clear link can be used to stop browsing and stay in the current page.
Pathway Tools-based Web sites offer multiple tools for analysis of gene expression, metabolomics, and other large-scale datasets. The omics data file format is described in Section 8.3.1.
A number of these capabilities are also available as web services.
The following tools can be used for analysis of combined datasets from multiple high-throughput technologies.
Many of the following tools can accept proteomics as well as gene-expression data.
The Cellular Overview enables the user to drill-down to see the data available for specific genes or metabolites. Omics Pop-Ups enable users to see bar charts, X–Y plots, or heat maps of omics data for single genes or metabolites, or for all genes or metabolites within a pathway. The pop-ups can be customized for a publication or to otherwise make them more legible.
First, mouse over a reaction or metabolite in the Cellular Overview and, by selecting the “Keep” button, lock the resulting tooltip in place to create a caption window. Then, to view an omics pop-up for single genes or metabolites, examine the associated caption. The caption pop-up will include an “Omics” button, if there is omics data associated with the selected node. Selecting the “Omics” button transforms the pop-up into a graphic display of the data.
Right-click on a reaction node in a pathway for which there is omics data to expose a menu including the item “Display Omics Data for Every Node in Pathway: <pathway name>”. The graphics will include the omics data for every gene or metabolite in the pathway to which this reaction belongs.
The tool described in this section makes use of a “Pathway Perturbation Score” (PPS). The PPS is meant to capture the activation level of a given pathway at a single point in time. The PPS is computed from the expression levels of the genes or metabolites within each pathway. Note that the PPS differs from the pathway score computed by PathoLogic during pathway prediction; that score captures the likelihood that the pathway is present, as opposed to the pathway activation level captured by the PPS.
The “Differential Pathway Perturbation Score” (DPPS) attempts to capture the degree to which a pathway’s activation level changes across multiple time points, and is computed from multiple values of the PPS for each pathway. You can upload an omics dataset into this website, have the software compute PPS or DPPS scores for each known pathway from those data, and then generate a table depicting each pathway painted with omics data and sorted by the PPS or DPPS scores. You can select how many of the highest-scoring pathways are included in the table. To generate this table, start from the Cellular Overview Diagram (Metabolism → Cellular Overview) for the organism of interest. Use the Upload Data from File command to enter your datafile information. By default, the “Show data” option will overlay the data onto the Cellular Overview Diagram. However, you can instead request that the data be shown either “As a table of pathway diagrams” or “Both on this diagram and as a table in a new tab” — either one of these options will cause a table to be generated. You must specify how many pathways should be included in the table.
The Pathway Perturbation Scores and Differential Pathway Perturbation Scores are computed as follows:
PPS: The PPS computes the overall activation level of a pathway from the activation levels of all reactions in the pathway. A Reaction Perturbation Score (RPS) is computed for each reaction as the maximum absolute value of all data values for objects associated with the reaction. For gene expression data, the RPS is computed from all genes coding for enzymes catalyzing the reaction; for metabolomics data, the RPS is computed from all metabolites that are reactants or products within the pathway. If the data values are not already in log format, they are first converted to log values. For example, if a reaction has three associated genes with log gene expression values -1.5, .3 and 1.2, the RPS would be 1.5.
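The RPS rule above is short enough to express directly. The sketch below reproduces the worked example from the text; the function name is hypothetical, and the use of base-2 logarithms for non-log data is an assumption, since the guide does not state which base is used.

```python
import math


def reaction_perturbation_score(values, already_log=True):
    """Maximum absolute (log) data value among the objects tied to a reaction."""
    if not already_log:
        # Assumption: base-2 logs; the guide does not specify the base.
        values = [math.log2(v) for v in values]
    return max(abs(v) for v in values)


# Three genes with log expression values, as in the example above.
print(reaction_perturbation_score([-1.5, 0.3, 1.2]))  # 1.5
```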
To compute the PPS, we sum the squares of the RPSs for all reactions in the pathway for which data are available, divide by the number of reactions for which data are available, and take the square root of the result (we use the squares of the RPSs rather than a plain average in order to weight larger RPSs more heavily). For a pathway containing N reactions with data: $\mathrm{PPS} = \sqrt{(\mathrm{RPS}_1^2 + \mathrm{RPS}_2^2 + \cdots + \mathrm{RPS}_N^2)/N}$.
DPPS: For multi-column datasets (meaning multiple time points or multiple treatment conditions), the Differential PPS (DPPS) is a single number that measures the extent to which a pathway is perturbed across columns. The DPPS is computed the same way as the PPS, by combining RPS values for each reaction. However, when computing the RPS from the entities (e.g. genes, metabolites) associated with a reaction, the data value we use is not the entity’s expression value for any single column, but rather the difference between its maximum and minimum values across all columns. For example, if a single gene in a three-column series has values 0.1, 2, and -1.5, the value for that gene used in the RPS computation would be (2 - (-1.5)) = 3.5. The differential RPS (DRPS) is then computed as the maximum of these difference values for all entities associated with the reaction. The DPPS is computed from these DRPS values as above, using DRPS values in place of single-column RPS values, i.e. $\mathrm{DPPS} = \sqrt{(\mathrm{DRPS}_1^2 + \mathrm{DRPS}_2^2 + \cdots + \mathrm{DRPS}_N^2)/N}$.
Because the PPS measures perturbation in either direction, the DPPS is not a simple difference between PPS values. A pathway can have a high DPPS even if its PPS is relatively similar for each column if either (a) the value for some object swings between a large positive value and a similar-magnitude negative value between columns, or (b) different reactions in the pathway experience their large perturbations in different columns.
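The PPS and DPPS computations can be sketched as follows. The function names and nested-list inputs are hypothetical, the values are assumed to be logs already, and reactions without data are simply skipped, as the text describes. This is an illustration of the formulas, not the software's code.

```python
import math


def root_mean_square(scores):
    """sqrt of the mean of squared scores; requires at least one score."""
    if not scores:
        raise ValueError("no reactions with data")
    return math.sqrt(sum(s * s for s in scores) / len(scores))


def pps(reactions):
    """`reactions`: one list of log data values per reaction, single column."""
    rps = [max(abs(v) for v in values) for values in reactions if values]
    return root_mean_square(rps)


def dpps(reactions):
    """`reactions`: per reaction, a list of entities; per entity, its values
    across all columns. The DRPS is the largest max-minus-min spread."""
    drps = [
        max(max(series) - min(series) for series in entities)
        for entities in reactions if entities
    ]
    return root_mean_square(drps)


# Two reactions, one column: RPS values are 1.5 and 0.5.
single_column = pps([[-1.5, 0.3, 1.2], [0.5]])

# Two reactions, three columns: DRPS values are 3.5 and 0.5.
multi_column = dpps([[[0.1, 2, -1.5]], [[1.0, 1.5, 1.0]]])
print(multi_column)  # 2.5
```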
Note that for metabolomics datasets, the RPS value for a reaction is the maximum data value for all metabolites (reactants and products) in the reaction. Because side metabolites (those metabolites not shared between adjacent reactions in a pathway) are omitted from the pathway diagrams in the table, and because the colored circles showing metabolite expression levels are shown for main (shared) metabolites only, some data values may not be visible on the diagram.
For multi-omics datasets, the RPS calculation for a reaction will be the maximum of all data values associated with the reaction, whether those are values for a metabolite, gene, protein, or the reaction itself. This is only useful if all data values are normalized, such that a given value for a metabolite is of roughly equal consequence as that same value for a gene. Otherwise, the RPS and therefore the PPS computations will be distorted. Thus, we do not recommend using this tool with most multi-omics data.
The Cellular Overview diagram depicts the biochemical machinery of an organism as described in a PGDB. Each node in the diagram (such as the small circles and triangles) represents a single metabolite, and each blue line represents a single bioreaction. This page describes the organization of the Cellular Overview and the operations users can perform to interrogate it. Different PGDBs will have different components of the diagram present or absent depending on what was included by the PGDB authors. Note: The Cellular Overview has been tested on Firefox 59.0, Safari 11.1, and Chrome 65.0.
Note: The desktop version of Pathway Tools that you can install locally provides different and additional operations on the Web Overview. Click here for more details.
Organization of the Cellular Overview: Within the cytoplasmic membrane, the small-molecule metabolism of the organism is depicted in several regions. The glycolysis and the TCA cycle pathways, if present, will be placed in the middle of the diagram to separate predominantly catabolic pathways on the right from pathways of anabolism and intermediary metabolism on the left. The existence of anaplerotic pathways prevents rigid classification. The majority of pathways operate in the downward direction. Signal transduction pathways, if present, run along the bottom of the diagram. Pathways are grouped into related clusters as indicated by the shaded regions.
The large group of individual reactions at the right of the diagram represents reactions of small-molecule metabolism that have not been assigned to any pathway. The shapes of the metabolite icons represent various compound classes. The different shapes used are as follows:
One or more cellular membranes of the organism are depicted, depending on the cellular architecture of the organism, and on whether that architecture was specified when the PGDB was created. Transporters will be depicted in the membrane in which they reside as blue lines whose arrowhead indicates the direction of transport. For gram-negative bacteria, periplasmic proteins will be depicted when identified in the PGDB.
Getting Started: The Cellular Overview is accessible from the menu bar Metabolism → Cellular Overview. The currently selected organism, as displayed on the right in the banner of the Web page, is used to generate the Cellular Overview diagram. The generation of the diagram can take some time if it was not previously generated by the Web server.
Once the Cellular Overview diagram is displayed, the most common operation is to move it left, right, up or down, since sometimes the entire overview cannot fit in the Web page. This can be done by holding down your left mouse button in a blank area then moving the mouse in the desired direction. This is called a panning operation.
There are four distinct levels of detail, or zoom levels, in the Cellular Overview. The current zoom level is reflected in the ladder-like gadget at the left of the window.
At each level, more information becomes visible:
Each step of the ladder is a zoom level. There’s a slider on the ladder indicating the current zoom level visually.
When using the scroll wheel, the pointer location on the diagram becomes the point around which the zoom is centered: if you point to something at one edge of the diagram and zoom in, the diagram shifts so that this point ends up at the center.
A cellular overview must be generated if it is not cached from a previous use. Typically, this takes a minute or two to complete. Once generated, it is cached until the server is restarted or re-installed (depending on where the CWEST_TEMP environment variable resolves to).
Mousing over a Cellular Overview icon (e.g., a ‘tee’ icon for a tRNA) displays information about the object in a small tooltip popup. Click the ‘Keep’ button to keep that informational window open; drag the window by its title to re-position it.
Note for Mac users with a one-button mouse: left-click is the usual click, and right-click is the Mac control-click (i.e., you hold down the control key and click). But the exact keys can be customized on your Mac via the system preferences panel.
All the commands for the Cellular Overview are available from the right-click menu or the operations box on the right side of the page.
The Cellular Overview can display your experimental data — see Section 8.3.
MetaCyc, which is a multi-organism database, has no cellular diagram.
There are three sliders that control aspects of the display to make highlighted items more (or less) obvious:
The commands in the Cellular Overview menu are:
The following sections describe in more detail these operations and some others.
In this document, ‘Searching’ and ‘Highlighting’ are synonymous terms. There are several commands to search for reactions, pathways, enzymes, genes, and compounds. The search commands are available from the right-click menu and from the Cellular Overview menu in the top menu bar.
When a search is done, the objects found are highlighted in the Cellular Overview diagram which also creates a new overlay. The list of overlays is shown in the Layer Switcher panel on the right of the Overview Web page. This panel might be minimized, in which case a small icon with a plus-sign is shown. Click on the plus-sign icon to open the panel. From this panel you can activate or deactivate specific overlays. You cannot delete an individual overlay. But all highlighting, i.e., all overlays, can be removed by using the command Clear All Highlighting.
Since each overlay corresponds to a search operation, an overlay is identified by the keyword you entered to do the search. This is the name of the overlay. Next to each name is a button labeled ‘List’. Clicking ‘List’ opens a small dialog window listing the objects found for the corresponding search. Each object name is a hyperlink; clicking any of these links centers the Overview on the corresponding object, and a red marker emphasizes its location.
Highlighting operations can also be applied via web services.
The Pathway Tools Omics Viewer uses the Cellular Overview for an organism to visualize data from high-throughput experiments in a global metabolic pathway context. The input to the Cellular Omics Viewer is a set of gene, protein, and/or reaction names or identifiers, and data values for each gene, protein, and reaction. The Omics Viewer generates a new version of the Cellular Overview in which the reaction steps identified by the input genes, proteins, and reactions are colored according to the provided data values. For example, for a gene expression experiment, the software identifies the reactions catalyzed by the product of each supplied gene, and colors that reaction with a color value computed from the data point provided for each gene. The data values in the provided dataset are mapped to a spectrum of colors. Similarly, for metabolomics experiments, compound nodes in the Cellular Overview are colored according to the data values for the specified compounds. This facility enables the user to see which pathways are active or inactive under some set of experimental conditions.
The Omics Viewer can be used for:
The Regulatory Overview also has an omics viewer, but it can display gene data only.
The Cellular Omics Viewer can show absolute data values (such as the concentration of a metabolite or protein, or the absolute expression level of a gene), or it can be used to compare two sets of experimental data by computing a ratio and mapping the ratios onto a color spectrum. The superposition of multiple sets of experimental data on the Cellular Overview can also be animated to show, for example, how gene expression levels of enzymes change with time over the course of an experiment.
The Cellular Omics Viewer can also be invoked via web services.
The commands under Overlay Experimental Data (Omics Viewer), available from the right-click menu and the right-side operations box, overlay experimental data on the Cellular Overview diagram. Once the Overlay Experimental Data command is invoked, a window called the Omics Form will open, where you can specify a data file to upload and various parameters to control the interpretation of the data. The parameters are documented in the window, but more details follow on the file format and the parameters to specify.
Experimental data is imported from a file provided by the user that is stored on the user’s computer. Each line of the file contains data for a single gene, protein, reaction or metabolite, and is of the form:
Columns are separated by the tab character. Lines that
If the first line of the file (that is not blank or a comment line) begins with a $ character, it is treated as column labels rather than data (these column labels will be included in the display for an animation). The software uses the first row of labels or data (i.e., the first line that is not a comment line) to determine the number of data columns to process. For example, if the first row contains five columns, only the first five columns of each subsequent row will be processed. Thus, even if not all fields for the first row contain data, you must make sure that it contains the appropriate number of Tab characters.
Short examples (see 8.3.1 for full example files):
# In this file the data columns are columns 2-4.
#
# The first non-comment line begins with a $ character, which indicates it contains column headers.
$Items	Names	Data 1	Data 2	Data 3
# The first two lines of data specify genes.
trpA	tryptophan synthetase	3.2	3.8	4.3	This line identifies the gene by a gene name
# This next line identifies the gene by an accession number that is
# listed on the EcoCyc gene page, hence we can be sure that EcoCyc
# will recognize it.
b0383	alkaline phosphatase	1.1	4.2	2.9
#
# The next two lines specify metabolites.
#
TRP	L-tryptophan	6.3	2.3	4.3	Column 0 specifies the EcoCyc ID for this metabolite
# This next line specifies spermidine by its name and KEGG ID and PubChem ID
spermidine$KEGG:C00315$PubChem:6992097	spermidine	1.1	2.8	5.1
#
# ---------- END OF FILE ----------
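The file conventions described above (tab-separated columns, an optional label line starting with `$`, and the first row fixing the number of columns) can be sketched as a small parser. This is an illustration only: the function name is hypothetical, and treating lines that start with `#` as comments is an assumption based on the example file rather than a stated rule.

```python
def parse_omics(text):
    """Return (labels, rows) from tab-separated omics text.

    `labels` is None when the first non-comment line does not start with "$".
    Every row is truncated to the number of columns in the first row.
    """
    labels = None
    rows = []
    n_columns = None
    for line in text.splitlines():
        # Assumption: blank lines and lines starting with "#" are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if n_columns is None:
            n_columns = len(fields)      # first row fixes the column count
            if line.startswith("$"):
                fields[0] = fields[0][1:]  # drop the leading "$"
                labels = fields
                continue
        rows.append(fields[:n_columns])
    return labels, rows


sample = "\n".join([
    "# comment line",
    "\t".join(["$Items", "Names", "Data 1", "Data 2", "Data 3"]),
    "\t".join(["trpA", "tryptophan synthetase", "3.2", "3.8", "4.3",
               "trailing note"]),
    "\t".join(["TRP", "L-tryptophan", "6.3", "2.3", "4.3"]),
])
labels, rows = parse_omics(sample)
```

Note that the trailing note on the `trpA` line is dropped because the label row has only five columns.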
The numbers in the data columns can represent either absolute or relative (e.g., ratios or log ratios) values. If the data values represent absolute numbers, you may choose to visualize either a single column of absolute data values (select “Absolute” and one data column), or the ratio of two data columns as relative data values (select “Relative” and two data columns). If the data values themselves represent relative numbers, then you need supply only a single column number, and select “Relative.” An entry (a row of data for a gene or other object) may contain any number of data columns (for example, if you want to compile measurements from several experiments or time points into a single file), but only those data columns specified will be visualized at a time — all other columns will be ignored.
The color scale used depends on the type and, by default, the range of the data. Thus, a particular color may correspond to one gene expression level for one dataset, and a different gene expression level for another dataset, depending on the range of values or the supplied maximum cutoff value for each dataset. We use the spectrum from yellow/green to red, with yellow representing the lowest expression levels or ratios in the dataset, blue representing values in the middle, and red representing the highest values. Reactions for which no data was provided are drawn in black. The legend for mapping colors to data values is shown in the key, which is drawn to the right of the overview for a single experiment, or to the left for an animation.
A maximum cutoff value is chosen. By default, this is computed from the data. Alternatively, the user may supply a maximum cutoff value to use. Supplying the same maximum cutoff value for multiple experiments ensures that the same color scale is used for each one, so that the displays are directly comparable.
The minimum cutoff value is determined based on the maximum cutoff value and the other parameters. For absolute data values, we use a minimum cutoff value of zero. For relative data values that are not logs, we use the inverse of the maximum cutoff. For relative data values that are logs, we use the negative of the maximum cutoff. The color spectrum is then mapped evenly along a log scale between the maximum cutoff and the minimum cutoff.
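The cutoff rules above can be written down as a small sketch. The function names and the `kind` labels are hypothetical. The position function covers relative data only, because the guide does not say how a minimum cutoff of zero is placed on a log scale for absolute data; clamping out-of-range values to the ends of the spectrum is also an assumption.

```python
import math


def min_cutoff(max_cutoff, kind):
    """Minimum cutoff derived from the maximum cutoff and the data kind."""
    if kind == "absolute":
        return 0.0
    if kind == "relative":        # ratios that are not logs
        return 1.0 / max_cutoff
    if kind == "log-relative":    # ratios already expressed as logs
        return -max_cutoff
    raise ValueError(f"unknown kind: {kind}")


def spectrum_position(value, max_cutoff, kind):
    """Position of a relative value along the color spectrum, from 0.0 to 1.0."""
    low = min_cutoff(max_cutoff, kind)
    if kind == "relative":
        fraction = (math.log(value) - math.log(low)) / (
            math.log(max_cutoff) - math.log(low))
    elif kind == "log-relative":
        fraction = (value - low) / (max_cutoff - low)
    else:
        raise ValueError("only relative data are handled in this sketch")
    return min(1.0, max(0.0, fraction))  # assumption: clamp to the spectrum


# A ratio of 1 (no change) falls in the middle of a spectrum with cutoff 4.
middle = spectrum_position(1.0, 4.0, "relative")
```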
In many cases, several genes or proteins, each with its own expression level or concentration, will map to a single reaction. This is because the reaction might be catalyzed by an enzyme complex made up of several gene products, or the reaction might be catalyzed by several isozymes, each with its own gene or genes. Since a reaction can only be colored a single color, we must choose which data value to use. For absolute data values, we choose the maximum. For relative data values, we choose the value whose log has the greatest deviation from zero, under the assumption that the user is primarily interested in identifying the entities whose behavior differs most between the two datasets.
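The selection rule for a reaction with several data values can be sketched as below. The function name and `kind` labels are hypothetical, and treating log-ratio input as "the value itself is the log" is an interpretation of the rule stated above.

```python
import math


def representative_value(values, kind):
    """Pick the single data value used to color a reaction."""
    if kind == "absolute":
        return max(values)
    if kind == "relative":        # plain ratios: compare |log(ratio)|
        return max(values, key=lambda v: abs(math.log(v)))
    if kind == "log-relative":    # values are already logs
        return max(values, key=abs)
    raise ValueError(f"unknown kind: {kind}")


print(representative_value([3.2, 1.1], "absolute"))             # 3.2
print(representative_value([0.2, 2.0, 1.1], "relative"))        # 0.2
print(representative_value([-1.5, 0.3, 1.2], "log-relative"))   # -1.5
```

In the second call, the ratio 0.2 wins over 2.0 because a five-fold decrease is a larger change on a log scale than a two-fold increase.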
Once the form to upload the data is submitted, by clicking the Submit button at the bottom of the Omics Form, the data are processed by the Web server. The time to process the file depends on the speed of the server and the amount of data in the file. The results are returned to your browser in the form of highlighted objects (e.g., reactions). If several data experiments are loaded from the same file (i.e., several data columns are provided from the uploaded file), an animation is created where each step of the animation corresponds to one experiment (i.e., one column). A small dialog window is opened to display the color scale for the experiment(s) and buttons to control the animation, if any. You can pause, restart, go forward or backward, increase or decrease the animation speed from this window.
Overlaying experimental data can be done at any zoom level. Once the data are uploaded and overlaid, you can zoom out or in, and the corresponding highlighting will be adjusted accordingly.
In addition, there are two sliders in this control panel, which have to do with what values are displayed in the diagram: Maximum Value Displayed; Minimum Value Displayed. These can be used in conjunction with each other to, for example, show only the highest values, or only the lowest values.
The tooltips for highlighted objects show the experimental data if one selects the “Omics” button in the tooltip.
Flux Balance Analysis (FBA) is a computational method for simulating an organism’s metabolic network. Metabolic models based on FBA depict a steady-state condition of a cell. Among the components of the simulation are the biochemical reactions in the organism’s metabolic network, the metabolites utilized by the organism as nutrients, the compounds secreted by the organism, and the biomass metabolites synthesized by the metabolic network. The nutrients are the inputs to the metabolic machinery, and the secretions and biomass metabolites are the outputs of that machinery.
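For readers who want a more formal picture, FBA is commonly written as a linear program. The formulation below is the usual textbook form, not a description of how MetaFlux is implemented, and its symbols are not taken from this guide: $S$ is the stoichiometric matrix of the reaction network, $v$ the vector of reaction fluxes, $c$ a weight vector that typically selects the biomass-producing flux, and $l$, $u$ lower and upper flux bounds (which encode, for example, which nutrients are available).

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& l \le v \le u
\end{aligned}
```

The constraint $S\,v = 0$ expresses the steady-state condition mentioned above: every internal metabolite is produced and consumed at the same rate.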
For a quick overview of how to run a metabolic model through this web interface, please execute the following steps.
The modeling tool available from this interface, called Web-MetaFlux, allows you to modify, execute, and store FBA-based metabolic models for organisms available on this website. The Web-MetaFlux interface provides a subset of the functionality of the MetaFlux tool available from the Pathway Tools desktop software. More precisely, Web-MetaFlux provides the ability to execute models for single organisms only (“solving mode”), whereas the desktop version provides several other modes: development mode aids creation of new metabolic models, knockout mode enables modeling of gene and reaction knockouts, and another mode enables modeling of organism communities.
The models on this website can be designated as public or private. You cannot directly modify a public model that you do not own, but you can copy such a model under your user account, and then modify the copy. Modifying a model can include adding or removing nutrients, secretions, or biomass metabolites, or adding or removing reactions. These modifications allow you to study the behavior of an organism for different growth conditions (e.g., anaerobic), or under different reaction availability. Note then that we use the term “model” to include parameters such as the nutrients on which the organism is to be grown.
As you make modifications to a model, those modifications are automatically saved permanently on the web server. Therefore, there is no save button. However, when you modify any entry, you must clearly indicate that you have finished modifying that entry by pressing Tab, pressing Enter, selecting an autocomplete choice, or clicking on any other entry.
Begin by finding an existing metabolic model that you want to execute, or an existing model that you want to modify and then execute. If you want to create a metabolic model de novo, install a local copy of the Pathway Tools software; this website does not support de novo model creation.
To find all organisms in this website having metabolic models, enter the organism selector (click “change organism database”), and select the tab “Having Metabolic Models.” Click on the organism you are interested in modeling to select that organism.
To see the metabolic models available for that organism, run the command Metabolism → Run Metabolic Model.
Click the “Select” button for a given model to select it for execution. Click “Copy” to make your own copy of the model in order to modify the model or its parameters.
Once you have selected or copied a model, you are on the model summary page, which summarizes the state of the current model, and provides tabs near the bottom of the page for viewing the components of the model. Click the “Execute” button to run the model. The results of execution will appear in the Results tab. If a biomass flux of 0.0 is obtained, then no cellular growth was obtained for the model given its specified reactions, biomass metabolites, nutrients, and secretions. If a positive biomass flux is obtained, then this number is the optimal value found for the objective function in the linear programming problem defined for this model. When the model is defined to optimize the production of cellular biomass, then the biomass flux is the steady-state cellular growth rate under the defined conditions of growth.
A table in the Results tab lists the flux values computed for reactions in the model that carry a non-zero flux. Those reactions can be visualized on a zoomable metabolic map diagram by clicking “Show Fluxes on Cellular Overview.” The button labeled “Show Fluxes on Dashboard” opens a window where the Dashboard displays the aggregate fluxes of reactions and compounds according to the default classes selected by the Dashboard. This information is complementary to the fluxes shown on the Cellular Overview, where the flux of each reaction is shown. More details about the model run can be obtained by clicking the buttons “Show Solution File” and “Show Log File.”
A set of four tabs on the model summary page, called Reactions, Biomass, Nutrients, and Secretions, allows you to inspect models owned by others, and to inspect and modify models that you own. Here we discuss these tabs in more detail.
Under the Reactions tab, you can specify the set of reactions from the PGDB (the organism database) to include in your model, which can be done in the following way.
A metabolic model takes up nutrients from the cell’s environment to activate biochemical reactions and produce biomass. The set of nutrients provided must be sufficient to activate the reactions needed to produce all of the specified biomass metabolites. Otherwise, the model cannot show growth.
Nutrients can be added and removed from a simulation using the Nutrients tab. The first row of the nutrients table can be used to add a nutrient based on its name (e.g., palmitoleate) or its frame id (e.g., CPD-9245). Autocompletion is provided for these two types of entries. Once a nutrient is added, optional parameters can be provided, such as a compartment, upper and lower bounds on the flux of the nutrient, and a comment. The compartment specifies the cellular location of the nutrient. Although a nutrient can be provided directly in the cytosol, a more realistic model should provide the nutrient into the extracellular space and provide transport reactions to import nutrients. Bounds are optional but typically at least one nutrient has an upper bound to limit the use of all the nutrients. It is common to limit the carbon source, although other nutrients can be used to control growth (e.g., oxygen). For example, if glucose is a nutrient and an upper bound of 10 is specified, then the flux of glucose in the model will not exceed 10. On the other hand, a lower bound on oxygen would force the uptake and use of oxygen by the model.
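To make the role of the optional bounds concrete, here is a minimal Python sketch of a nutrient entry and a check of whether a given uptake flux respects its bounds. The class, the field names, and the compartment label "extracellular" are hypothetical and chosen only for illustration; they do not describe how Web-MetaFlux stores a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Nutrient:
    compound: str                         # name or frame id, e.g. "CPD-9245"
    compartment: Optional[str] = None     # cellular location (optional)
    lower_bound: Optional[float] = None   # optional minimum uptake flux
    upper_bound: Optional[float] = None   # optional maximum uptake flux
    comment: str = ""


def uptake_allowed(nutrient, flux):
    """Return True if an uptake flux respects the nutrient's optional bounds."""
    if nutrient.lower_bound is not None and flux < nutrient.lower_bound:
        return False
    if nutrient.upper_bound is not None and flux > nutrient.upper_bound:
        return False
    return True


glucose = Nutrient("glucose", compartment="extracellular", upper_bound=10.0)
oxygen = Nutrient("oxygen", compartment="extracellular", lower_bound=1.0)

print(uptake_allowed(glucose, 10.0))  # True: exactly at the upper bound
print(uptake_allowed(glucose, 12.5))  # False: exceeds the upper bound of 10
print(uptake_allowed(oxygen, 0.0))    # False: the lower bound forces uptake
```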
You can remove a nutrient by clicking the red “x” button on the far left of a row.
The computational objective of an FBA model is to produce all biomass metabolites. At least one metabolite must be specified as biomass; otherwise there is no objective to satisfy. The biomass metabolites must be produced given the specified nutrients, reactions, and secretions; otherwise there is no growth. When the model is executed, the fluxes of biomass metabolites are maximized. Furthermore, the fluxes of the biomass metabolites must satisfy the coefficients specified in the Biomass table. Those coefficients are major determinants of the computed reaction fluxes, and they typically reflect the relative masses of the biomass components in dried-down cells. The maximization is constrained by the bound(s) on fluxes specified for nutrients and secretions, if any. You can add a biomass metabolite using the first row of the table shown under the Biomass tab. You can remove a metabolite from that table by clicking the red “x” button on the far left of a row.
The Secretions tab operates very similarly to the Nutrients tab. Production of secreted metabolites is often required for model growth. It is important to note the difference between the secretions and the biomass metabolites. A biomass metabolite must be produced by the model whereas a secretion may be produced by the model. If a secretion is not produced, the model may still grow, but if any biomass metabolite is not produced, the model cannot grow.
In most cases, it is better to specify more secretions than necessary, because secretions that are not active when a model is executed cannot stop growth. On the other hand, a single unspecified secretion that is needed for growth can prevent growth. For example, if CO2 is produced by an organism under a given growth condition, but there is no way for the CO2 to escape the model, the steady-state constraint that fluxes are balanced at all metabolites will be violated, and no solution will be found for the model. It is therefore recommended to work with a set of secretions needed for many different growth environments (e.g., different sets of nutrients). Care should be taken to select the appropriate compartment for each secretion; in a more realistic model, each secretion will be transported to the extracellular space and then secreted from the model. If a secretion is not produced, it will be reported in the solution file when the model is executed. The lower-bound flux and the upper-bound flux specified for a secretion can be used to limit the growth of an organism, and multiple such bounds can be specified at the same time on several secretions. When a model is executed, the computed solution fluxes will be constrained by these bounds.
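The CO2 example can be illustrated with a toy steady-state check in Python. The three-reaction network, its reaction names, and the flux values below are invented for illustration and are not taken from any model on this website. The sketch only checks whether a given set of fluxes is balanced at every metabolite; it does not solve the optimization problem.

```python
def net_production(reactions, fluxes):
    """Net production rate of each metabolite: sum of coefficient * flux."""
    net = {}
    for name, stoichiometry in reactions.items():
        flux = fluxes.get(name, 0.0)
        for metabolite, coefficient in stoichiometry.items():
            net[metabolite] = net.get(metabolite, 0.0) + coefficient * flux
    return net


def unbalanced(reactions, fluxes, tolerance=1e-9):
    """Metabolites whose production and consumption do not cancel out."""
    rates = net_production(reactions, fluxes)
    return sorted(m for m, rate in rates.items() if abs(rate) > tolerance)


# Toy network: nutrient uptake, one conversion that releases CO2, biomass drain.
base = {
    "glucose_uptake": {"glucose": 1},
    "conversion": {"glucose": -1, "precursor": 1, "CO2": 1},
    "biomass": {"precursor": -1},
}
base_fluxes = {"glucose_uptake": 10.0, "conversion": 10.0, "biomass": 10.0}

# Same network with CO2 declared as a secretion.
with_secretion = dict(base, CO2_secretion={"CO2": -1})
secretion_fluxes = dict(base_fluxes, CO2_secretion=10.0)

print(unbalanced(base, base_fluxes))                 # ['CO2']
print(unbalanced(with_secretion, secretion_fluxes))  # []
```

Without the secretion, any flux through the conversion reaction accumulates CO2, so no growing flux distribution can be balanced in this toy network.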
Metabolic Route Search is a software tool to search and analyze routes in the metabolic reaction network of an organism. Given a starting compound, a target compound, and other parameters, the tool finds the best (least cost) routes between these compounds, taking into account atom conservation, path length, and (potentially) adding a minimum number of foreign reactions from MetaCyc.
The tool is activated by first selecting the organism to search using the “change organism database” link on the top right corner of the Web page and then by selecting the command Metabolism → Metabolic Route Search from the menu bar. This command is available for single organism databases only, and is not available for MetaCyc. A Multi-Organism search mode was added (in version 21.0, April 2017), which enables route searches across the union of reactions from multiple organisms. An example use case would be performing a route search across the set of reactions within HumanCyc plus those within a microbiome from a body site, such as the gut or skin. Selecting the Routes across Multiple Organisms checkbox activates the Multi-Organism mode. Primarily, this selection makes a multi-organism selector available, to select or modify the set of organisms that contribute their reactions to the pool considered for route searches.
When Pathway Tools is running as a non-public web server, MetaCyc can be used as a search option, not as a native organism, but as a library of additional reactions (to activate this mode, start the private web server with the option -metaroute-metacyc). In this case, MetaCyc can be used only as a set of foreign reactions to add to a selected single organism database.
To support investigations regarding how a compound is degraded or produced when a goal or start compound is not known, a set of goal or start compounds can be selected, which could consist, for example, of the common intermediates in central metabolism. Therefore, for the start and goal compounds, an additional selector enables choosing a Smart Table containing a set of compounds. When a set is selected for either start or goal, then a separate optimal search will be performed for each compound in the set. At the end, all of the found routes are collected and sorted according to cost, and shown together. Because as many searches are performed as there are compounds in the set, this will take more time overall. The parameter settings below, including Maximum Time, apply to each separate route search.
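The way results are combined for a set of start compounds can be sketched as follows. The per-compound search is represented here by a stand-in function that returns canned results, because the actual route search runs on the Web server; all compound names, costs, and function names are hypothetical.

```python
def search_compound_set(start_compounds, goal, search_one):
    """Run one search per start compound, then pool the routes and sort by cost."""
    pooled = []
    for start in start_compounds:
        # Each call is a separate search with the same parameter settings.
        pooled.extend(search_one(start, goal))
    return sorted(pooled, key=lambda route: route["cost"])


# Canned results standing in for the real route search (illustrative values).
CANNED = {
    "compound-1": [
        {"start": "compound-1", "cost": 120},
        {"start": "compound-1", "cost": 30},
    ],
    "compound-2": [{"start": "compound-2", "cost": 15}],
    "compound-3": [],  # no route found from this compound
}


def fake_search(start, goal):
    return CANNED[start]


routes = search_compound_set(
    ["compound-1", "compound-2", "compound-3"], "goal-compound", fake_search
)
print([(route["start"], route["cost"]) for route in routes])
# [('compound-2', 15), ('compound-1', 30), ('compound-1', 120)]
```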
The parameters to specify before clicking the “Search Routes” button are (defaults are provided for most of them):
A summary of what each parameter means is provided online by clicking the green question mark located on the left of each labelled input box.
The cost of a route is the sum of all costs: the cost of atom losses, and the reaction costs from the native database and, if available, the MetaCyc database.
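As a rough illustration, the sum can be written as a small Python function. The defaults of 100 per lost atom and 5 per native reaction are the ones quoted in the examples of this guide. The MetaCyc reaction cost of 50 is a placeholder chosen for this sketch, and charging the atom-loss cost once per lost atom is an assumption.

```python
def route_cost(atoms_lost, native_reactions, foreign_reactions=0,
               atom_lost_cost=100, native_cost=5, foreign_cost=50):
    """Sum the cost of atom losses and the cost of each reaction in a route.

    foreign_cost (the MetaCyc reaction cost) is a placeholder value here.
    """
    return (atoms_lost * atom_lost_cost
            + native_reactions * native_cost
            + foreign_reactions * foreign_cost)


# A native-only route: 2 atoms lost over 4 reactions.
print(route_cost(atoms_lost=2, native_reactions=4))                       # 220
# A shorter route that loses 1 atom but borrows one MetaCyc reaction.
print(route_cost(atoms_lost=1, native_reactions=3, foreign_reactions=1))  # 165
```

With these settings, losing a single atom costs as much as twenty native reactions, which is why longer routes that conserve more atoms can rank ahead of shorter ones.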
Once the parameters are entered, clicking the “Search Routes” button will initiate the search on the Web server. The solution, that is, the routes found, will be displayed under the parameters. The routes are sorted in ascending order of their cost (best routes are presented first). Displaying a large list of reactions might take significant time due to the complexity of formatting all compound structures and atom mappings.
Each route found is displayed horizontally across the Web page with the starting compound on the left and the target compound on the right. You may need to scroll the window to see some of the compounds since the whole route may not fit the width of your browser window.
On the left of each route is displayed a text summary of the characteristics of the route. The summary includes the cost of the route, the number of atoms kept from the source compound to the target compound, and the number of reactions in the route.
In the Multi-Organism mode, the summary also shows a blue link at the bottom, called Organism Table. Clicking it brings up a temporary SmartTable in a new Web browser tab. This table shows the reactions of the route as the columns, and underneath the reactions is a list of all the organisms that contained the particular reaction. This is useful for a more detailed analysis, because depending on how large the organism set is, there could be hundreds of organisms listed, which could not be shown in the route display in a practical manner. The table data can be exported (for downloading) by all the usual methods available for SmartTables.
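The layout of the Organism Table, with one column per reaction and the contributing organisms listed underneath, corresponds to the following kind of grouping. This is a sketch with made-up organism and reaction identifiers, not the SmartTable implementation.

```python
def organism_table(route_reactions, organism_reactions):
    """Map each reaction of a route to the sorted organisms that contain it."""
    return {
        reaction: sorted(
            organism
            for organism, reactions in organism_reactions.items()
            if reaction in reactions
        )
        for reaction in route_reactions
    }


# Hypothetical reaction content of three selected organisms.
organism_reactions = {
    "organism-A": {"RXN-1", "RXN-2"},
    "organism-B": {"RXN-2", "RXN-3"},
    "organism-C": {"RXN-1", "RXN-2", "RXN-3"},
}

table = organism_table(["RXN-1", "RXN-2", "RXN-3"], organism_reactions)
for reaction, organisms in table.items():
    print(reaction, organisms)
```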
The chemical structure of each compound involved in the route is displayed and its name appears underneath the structure. If the compound is from the native database, its name is in grey; if the compound is from MetaCyc, its name is in red. Clicking the compound opens a new browser tab to display a complete description of the compound.
Each reaction is shown with a right arrow. If the reaction is from MetaCyc, the arrow is red; if it is from the native organism, the arrow is grey. Underneath the arrow, the protein name is displayed. Clicking the arrow stem opens a new browser tab to display a complete description of the reaction.
For each route, the atom mapping (i.e., atom tracing) is displayed using colors on atoms and bonds from compound to compound. A moiety that is conserved across several compounds is colored with a specific color. Mousing over an atom highlights that atom across all compounds that conserve it. For example, an atom that is conserved from the source compound to the target compound can be seen by mousing over it in the source compound; the corresponding atoms in all intermediate compounds up to the target compound will be highlighted. Note that by mousing over each atom of the source compound, you can quickly find out which atoms of the source compound are lost, and in which reaction.
A new search can be initiated by changing any parameter and clicking the “Search Routes” button. The current solution will be erased and a new solution will be displayed.
Examples: (Please select the organism E. coli)
The following searches assume that the default cost parameters are used, that is, 100 for atom lost cost and five for native reaction cost. All five atom species (i.e., C, O, P, N, S) are tracked. The number of routes to search could be set to one or more, depending on the number of optimal routes you would like to analyze. The maximum route length can be left at 10 (the default), although, as shown below, longer routes conserving more atoms exist for the third search.
A Pathway Collage is a diagram containing a user-specified set of pathways for an organism. The initial collage is generated from a SmartTable or omics dataset, and can be manipulated and customized in various ways. Pathways are initially laid out automatically so that pathways in the same general class are placed near each other, but both pathways and individual nodes can be manually relocated. The collage is zoomable, with pathway, metabolite, and enzyme labels becoming visible when the collage is at a sufficiently high magnification level to make them readable. The user can selectively highlight objects of interest, delete unwanted portions, import new pathways, edit labels, and use the diagram to display omics data.
The collage can be saved and later reloaded, or it can be exported to a PNG image file for use in a presentation or publication. See an example of a Pathway Collage which has been manipulated in various ways to illustrate some of the possibilities, and then saved.
The Pathway Collage application should be intuitive and easy to use. A comprehensive help document is available via the Help→Display Help command.
The simplest way to generate a Pathway Collage is from a SmartTable containing a set of pathways, using the command Export→Export pathways to Pathway Collage. If the SmartTable contains multiple columns, make sure that the currently selected column is one that contains pathways (if it does not, the software will attempt to find a column that does, but results could be unpredictable). If the SmartTable column happens to contain a pathway class, then all instances of that class will be included. If the SmartTable, in addition to one or more pathways, contains one or more individual reactions, then those reactions will also be included in the Pathway Collage.
A Pathway Collage generated in this way automatically includes data from the most recently loaded omics dataset, if any (i.e., a dataset loaded onto the Cellular or Regulatory Overview diagram, or onto a pathway diagram), but the data are not visible until the user requests to see them. A new omics dataset can be loaded onto an existing Pathway Collage at any time using the File→Add or Replace Omics Data command.
Metabolism → Pathway Collages will take you to a page where you can select pathways from a list of all pathways in the current organism, and generate a Pathway Collage containing the selected pathways.
From any pathway page, simply invoke the command Generate Pathway Collage. This will generate a Pathway Collage containing just one pathway. You can add to the collage by right-clicking on any metabolite node in the collage and selecting Add Pathways Containing This Compound. A dialog will pop up listing all the pathways that contain that metabolite, and you can choose which ones to include. Note that when building a Pathway Collage in this fashion, you must position the added pathways yourself, and if you import a super-pathway of a pathway that is already present in your collage, you will end up with duplication (but you can always delete any duplicated pathways or parts of pathways manually).
From the Cellular Overview page, invoke the command Upload Data from File, and fill in most of the fields in the pop-up dialog as if you were displaying your data on the Cellular Overview diagram. However, for the “Show data” field, select “As a Pathway Collage” and indicate how many of the highest-scoring pathways should be included (maximum 100). Using this option, a Pathway Collage will be generated containing those pathways with the highest Pathway Perturbation Score (PPS) or Differential PPS.
The Regulatory Overview enables you to visually analyze the regulatory relationships between genes for a specific organism. These relationships are based on the regulatory data available in the database (i.e., PGDB) of the organism. Currently, the relationships are based on transcriptional regulatory data (future versions may cover other types of regulation). Note: The Regulatory Overview has been tested on Internet Explorer 7.0, Firefox 3.3, Safari 4.0 and Chrome 2.0. It is recommended not to use Internet Explorer for the Regulatory Overview since its performance can be very slow when manipulating a large number (more than 100) of highlighted genes. The performance of the three other browsers is much better than that of Internet Explorer.
The Regulatory Overview is represented as a network with nodes and arrows (i.e., arcs). Each node represents a gene of a specific organism. There is an arrow from gene A to gene B if and only if A regulates B.
When first displayed, the overview does not show any regulatory arrow relationships since, typically, their great number would clutter the overview. These arrows can be selectively added by using the highlighting commands. See the sections below for more information on highlighting commands.
Not all organisms have regulatory data in their PGDB. If the command Genome → Regulatory Overview is grayed out, no Regulatory Overview can be displayed for the selected organism. Otherwise, selecting the command Genome → Regulatory Overview opens a Regulatory Overview Web page that displays the complete Regulatory Overview of the selected organism. The operations box on the right has several commands specifically for the Regulatory Overview.
It is possible to display a regulatory subnetwork of a specific organism by performing a series of highlighting operations and then using the command Redisplay Highlighted Genes Only. This command will create a new, smaller layout of the regulatory network that contains only the highlighted genes. Genes that do not regulate, or are not regulated by, any highlighted genes are not included in the subnetwork. Further operations can be done on this subnetwork as for the complete overview. See the Section Redisplay Highlighted Genes Only below for more details.
The most common operation is to move the Regulatory Overview left, right, up or down, since sometimes the entire network cannot fit in the Web page. This can be done by holding down your left mouse button in a blank area and then moving the mouse in the desired direction. This is called a panning operation. Panning can also be done in small increments by clicking the arrows on the graphic at the top left of the screen, called the panning widget.
To zoom in or zoom out, you can use the icon in the form of a ladder on the left of the overview Web page. Each step of the ladder is a zoom level. You can select any one of them at any time. You can also click the plus or minus sign (displayed at the top and bottom of this ladder) to enlarge (zoom in) or shrink (zoom out) the regulatory network. At some zoom levels, the gene names might overlap the network nodes; increasing the zoom level should remove such overlaps. The last zoom level (i.e., the last step of the ladder) will always force the display of all gene names in the network.
Note that depending on the speed of the server, generating large regulatory network overviews (i.e., a zoom-in near the top of the ladder) may require some time. They might have been already generated or they might need to be generated by the server. Accordingly, the response time might vary.
Mousing over a gene node displays a tooltip with data about the gene, its product, the possible ligand, and the direct regulatees and regulators. Left-clicking the gene node will open a new Web page containing even more data specific to the gene. Other more complex visual commands can be reached by right-clicking on genes or in a blank area. This is discussed in detail in the following sections. Note for Mac users with a one-button mouse: left-click is the usual click, and right-click is the Mac control-click (i.e., you hold down the control key and click). However, the exact keys to use may be customized on your Mac via the preferences panel.
Organism Selection: Selecting a new organism through the organism selector does not immediately change the Regulatory Overview to this organism. The next operation, such as zooming in or out, will apply to the newly selected organism. At any moment you can display the complete regulatory overview of the selected organism by selecting the command Display Complete Regulatory Overview under the right-clicking menu in a blank area or from the right operations box Redisplay Complete Regulatory Overview.
The following sections describe these operations and some others in more detail.
For any organism, there are two layouts available: nested ellipses or top to bottom.
The nested ellipses layout uses up to three ellipses to display the gene nodes. The innermost ellipse contains, in alphabetical order of the gene names, the genes that have the largest number of regulatees. The middle ellipse contains genes that regulate at least one gene. The outer ellipse contains the genes that have no regulatees. They might be displayed as groups of genes regulated by the same set of genes (a multi-regulon). This is typically done using triangles, or a short straight line if the group is small.
The top to bottom layout uses several straight rows to display the gene nodes. Each row contains genes that do not directly regulate each other. The top row contains the genes that regulate the largest number of genes. The bottom row contains genes that do not regulate any genes. The rows in between contain genes that regulate some other genes. As in the nested ellipses layout, the bottom row might have genes grouped in straight lines or triangles.
There are several commands to highlight genes and show the regulatory relationship arrows between them. Two commands use the gene name, or a substring of gene names, or a gene frame-id. Both of these commands are available by right-clicking in a blank area, or from the top menu bar under Regulatory Overview. The command Highlight Gene By Name or Frame ID highlights at most one gene. It is essentially a search command since you might not know the location of that gene in the regulatory network. Once found, the regulatory network will be centered on the location of the gene. The command Highlight Genes By Substring may highlight several genes. Selecting the command opens a panel in which you can enter a string of characters. Once you click the button labeled Highlight in the panel, the genes whose names contain the given string are highlighted (this is a case-insensitive search). For this command it is also possible to include the regulatory relationships between the genes found. The command Highlight Genes By Gene Ontology Terms, accessible from the right-clicking menu, enables you to select one or more Gene Ontology (GO) terms. The genes that produce proteins annotated with the selected GO terms will be highlighted. The option Include Relationships Arrows enables you to add relationship arrows between the highlighted genes. Note that if you are displaying a subnetwork, there might be genes with such products in the organism, but these might not be in the subnetwork. In such a case, a warning is given that no genes have been highlighted.
Right-clicking on a gene will open a menu of highlighting commands specific to that gene. The menu may contain from one to seven commands. Since some genes do not have any regulators and/or any regulatees, this list of commands may vary from gene to gene. Here is the list of all possible commands available from this menu, where name will be the gene name (e.g., trpA) on which the right-click was done. Each highlighting is done with one specific color, but that color changes from one executed highlighting command to the next.
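The matching rule used by Highlight Genes By Substring (a name matches if it contains the given string, ignoring case) can be sketched in Python as follows. The function name is hypothetical and the gene list is just sample input.

```python
def genes_matching(gene_names, text):
    """Return the gene names that contain the given string, ignoring case."""
    needle = text.lower()
    return [name for name in gene_names if needle in name.lower()]


genes = ["trpA", "trpB", "lacZ", "trpR", "araC"]
print(genes_matching(genes, "TRP"))  # ['trpA', 'trpB', 'trpR']
print(genes_matching(genes, "xyz"))  # []
```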
When a highlighting operation is done, a new overlay is created. The list of overlays is shown in the Layer Switcher panel on the right of the overview Web page. This panel may be minimized, in which case a small icon with a plus-sign is shown. Click on the plus-sign icon to open the panel. From this panel you can activate or deactivate specific overlays. This is particularly useful if you use the command Redisplay Highlighted Genes Only.
All highlighting can be removed by using the command Clear All Highlighting.
For more information about highlighting, see Section Redisplay Highlighted Genes Only.
The command Redisplay Highlighted Genes Only will display a regulatory network by considering only the genes that are highlighted. The layout is changed to “top to bottom” since it is usually a better layout when using a small set of genes. This command would be used after a series of highlighting operations to select a set of genes to analyze closely. The current displayed regulatory network will be removed and a new regulatory network will be displayed. The active highlighting will remain active. All overlays (active or not) will also remain. It is useful to keep the deactivated overlays since you may come back to the complete regulatory network and reactivate them to recreate a new regulatory subnetwork. Note that genes that do not regulate or are not regulated by any highlighted genes are not included in the subnetwork.
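One way to read the rule above is sketched below in Python: keep only the regulatory relationships whose two ends are both highlighted, and keep only the genes that take part in at least one such relationship. This reading is an interpretation, and the gene names and the dictionary representation of the network (regulator mapped to its set of regulatees) are invented for the example.

```python
def highlighted_subnetwork(network, highlighted):
    """Return (genes, edges) of the subnetwork induced by the highlighted genes.

    network:     dict mapping each regulator to the set of genes it regulates
    highlighted: set of highlighted gene names
    """
    edges = {
        (regulator, regulatee)
        for regulator, regulatees in network.items()
        for regulatee in regulatees
        if regulator in highlighted and regulatee in highlighted
    }
    # A highlighted gene with no relationship to another highlighted gene
    # does not appear in any edge and is therefore left out.
    genes = {gene for edge in edges for gene in edge}
    return genes, edges


network = {
    "geneA": {"geneB", "geneC"},
    "geneC": {"geneD"},
    "geneE": {"geneF"},
}
highlighted = {"geneA", "geneB", "geneC", "geneE"}

genes, edges = highlighted_subnetwork(network, highlighted)
print(sorted(genes))  # ['geneA', 'geneB', 'geneC']
print(sorted(edges))  # [('geneA', 'geneB'), ('geneA', 'geneC')]
```

Here geneE is highlighted but dropped, because its only regulatee (geneF) is not highlighted.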
To redisplay the complete regulatory network, use the command Display Complete Regulatory Overview accessible when right-clicking in a blank area. The current active overlays remain active and the deactivated overlays are not removed.
The information in tooltips within a subnetwork display (produced when mousing over gene nodes) is restricted to that subnetwork. That is, the tooltip’s lists of regulatees and regulators are for the subnetwork, not for the entire regulatory network of the organism. However, when you transition from a subnetwork display back to the display of the entire network, any highlighting done on a subnetwork will be expanded for the entire regulatory network to show relationships within the full network. For example, if gene A has four direct regulatees in a subnetwork, but twenty regulatees in the entire network, when the operation Highlight Gene A and its Direct Regulatees is applied in the subnetwork, only the four regulatees are highlighted, but once you redisplay the entire network, the twenty regulatees will be highlighted.
The Pathway Tools Regulatory Omics Viewer illustrates the results of high-throughput experiments in the context of gene regulation. Genes that are involved in regulation are mapped to gene nodes in the Regulatory Overview, and the range of expression levels in a given experimental dataset is mapped to a spectrum of colors. This facility enables the user to see instantly which genes are active or inactive under some set of experimental conditions.
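The gene A example can be reproduced with a short Python sketch in which the regulatees reported for a gene are intersected with the genes currently displayed. The function name and the gene identifiers are hypothetical.

```python
def regulatees_shown(network, gene, visible_genes=None):
    """Regulatees of a gene, restricted to the displayed genes if given.

    network:       dict mapping each regulator to the set of genes it regulates
    visible_genes: set of genes in the current subnetwork, or None when the
                   complete overview is displayed
    """
    regulatees = network.get(gene, set())
    if visible_genes is None:
        return set(regulatees)
    return regulatees & visible_genes


# Gene A regulates twenty genes in the entire network.
network = {"geneA": {f"target{i}" for i in range(1, 21)}}
# Only four of them are present in the displayed subnetwork.
subnetwork_genes = {"geneA", "target1", "target2", "target3", "target4"}

print(len(regulatees_shown(network, "geneA", subnetwork_genes)))  # 4
print(len(regulatees_shown(network, "geneA")))                    # 20
```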
The Omics Viewer for the Regulatory Overview is very similar to the Omics Viewer for the Cellular Overview. Data files submitted to the Regulatory Omics Viewer must contain in their first column gene names or frame ids. To start the Regulatory Omics Viewer, use the command Overlay Experimental Data (Omics Viewer) under the Regulatory Overview menu. See Section 8.3 for details of how to use the Regulatory Omics Viewer.
Several types of comparative operations are available within Pathway Tools Web sites. Note that all of the PGDBs to be compared must be resident within a single Pathway Tools Web site.
Start a comparative analysis by specifying the organism(s) you want to compare. In many cases this can be done from the menu command Select organisms/databases for comparison operations, which is accessible through the Gene, Pathway, Reaction, and Compound menus. It is also accessible through the Choose Organisms button in the Analysis → Comparative Analysis page. This tool supports multi-organism selection using the following three modes. In each mode, a list of organisms for comparison is built up on the right side; you can add to, remove from, or clear that entire list using the buttons in the middle.
Most object pages in Pathway Tools Web sites contain commands for navigating to that same object in one or more other PGDBs. For example, the command Show this gene in another database on a gene page will find the same gene in a specified PGDB. The command Show this compound in another database from a compound page will show the same metabolite in a specified PGDB. Similarly, Search for this gene in multiple databases on a gene page will generate a table showing information about that gene in multiple specified PGDBs.
Pathway Tools finds “the same object” using different mechanisms for different types of objects:
The following comparison commands are all available under the Gene, Compound, Reaction, and Pathway menus:
In addition, the following command will generate a table comparing the operon context of a gene across multiple organisms: Show orthologs (with operon diagrams) in multiple databases.
The comparative genome browser described in Section 5.2 supports more powerful viewing of genome regions around orthologous genes.
The “Species Comparison” operation in the operations box for pathway and reaction pages generates tables comparing a pathway or reaction across multiple PGDBs. If you wish to change the organisms being compared, use the command Change organisms/databases for comparison operations.
The reaction comparison table lists the enzyme(s) that catalyze the reaction; activators, inhibitors, and cofactors for those enzymes; and the pathway(s) containing the reaction in that organism.
The pathway comparison table includes a graphic of the pathway showing which reactions in the pathway have enzymes present in each organism; a list of the enzymes catalyzing each reaction; and operon diagrams for each gene in the pathway.
Analysis → Comparative Analysis allows users to generate summaries of individual PGDBs, and to compare statistics between PGDBs. Currently we support comparative analysis of reactions, pathways, compounds, proteins, orthologs, transporters, and transcription units. Select the type(s) of reports you wish to generate.
Next, select one or more PGDBs for which to perform the analysis.
Please experiment with these commands to see the detailed reports generated by each comparison.
Pathway/Genome Databases (PGDBs) that have sequence data can be searched using NCBI BLAST. To access the Web interface for BLAST searches, go to: Search Menu → BLAST search.
Documentation on the use of the Web interface for NCBI BLAST can be found here.
PatMatch [2, 1] allows you to search for a short nucleotide or amino-acid sequence within a specific genome, using an exact search or using degenerate nucleotide or amino-acid symbols. The minimum length of the input string is 3 residues. The results are displayed initially as a simple web-page table, with the option of displaying the result as a SmartTable if there are fewer than 5000 results. If there are more than 5000 results, then a file download link is provided.
To access the PatMatch search, go to: Search → Sequence Pattern Search.
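To make the idea of a degenerate-symbol search concrete, here is a minimal Python sketch that expands degenerate nucleotide symbols into a regular expression and reports every match in a sequence. It is not PatMatch's implementation: the standard IUPAC nucleotide codes are assumed as the symbol set, the function name find_pattern is hypothetical, positions are reported 0-based, and the example sequence is made up. Only the 3-residue minimum is taken from the description above.

```python
import re

# Standard IUPAC nucleotide codes (assumed symbol set for this sketch).
IUPAC_NT = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

MIN_PATTERN_LENGTH = 3  # minimum input length stated in the text


def find_pattern(pattern, sequence):
    """Return (0-based start, matched text) for every match of a
    degenerate nucleotide pattern, including overlapping matches."""
    pattern = pattern.upper()
    if len(pattern) < MIN_PATTERN_LENGTH:
        raise ValueError("pattern must be at least 3 residues long")
    try:
        regex = "".join(IUPAC_NT[symbol] for symbol in pattern)
    except KeyError as err:
        raise ValueError(f"unsupported symbol: {err.args[0]}") from None
    # A lookahead with a capture group lets overlapping hits be reported.
    scanner = re.compile(f"(?=({regex}))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence.upper())]


# Made-up example: G, A, then C or T, then any base, then T.
hits = find_pattern("GAYNT", "ttGATCTaaGACGTGACAT")
for start, text in hits:
    print(start, text)
```

An exact search is the special case in which the pattern contains only A, C, G, and T. An amino-acid search would follow the same approach with a different symbol table.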
For each genome, the user can search several alternative sequence databases:
A multiple sequence alignment viewer can be invoked to view alignments of amino-acid and nucleotide sequences. The tool can be invoked on a set of orthologs or on a set of genes or proteins via SmartTables.
To invoke the alignment viewer on a set of orthologs:
To invoke the alignment viewer on a set of genes in a SmartTable:
The sequence alignment viewer enables the user to zoom in to a region of the alignment by clicking on a point within the alignment graphic, to move left or right in the sequence by clicking the green arrows to the left/right of the coordinate line, and to re-render the alignment between specified coordinates.
©2018 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493