
GeneSpeed Frequently Asked Questions:
Q1: “How do I cite the use of the GeneSpeed Database?”
A1: Please cite our recent publication: Nucleic Acids Res. 2007 Jan;35(Database issue):D674-9. Epub 2006 Nov 28.
Q2: “When browsing by Gene Ontology sometimes my output consists of
thousands of family members and other times I get none, why is this?”
A2: The Gene Ontology (GO) dataset contains many annotations and is a great resource for finding gene classification terminology. Regrettably, the GO annotations are far from complete and will probably remain so for many years to come. Fortunately, new annotations are added to the GO dataset on a regular basis (often daily), so it will become progressively more complete over time. Because of this incompleteness, many GO classes do not yet have any annotated members, and these show up as ‘zero’ hits when browsing the GeneSpeed Database. You may try moving higher in the GO tree until hits are observed. When a large number of members appear in a GeneSpeed output, this can be the result of many annotations residing under the particular GO node(s) that were selected. It may also be the result of the e-score cutoff being set too high (too lenient) for the output criteria. First try lowering the e-score cutoff. If many hits are still observed, move to a lower (more specific) node in the GO tree structure.
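The "move higher in the tree until hits are observed" advice can be sketched in a few lines of Python. This is only an illustration and not GeneSpeed code: the `PARENT` and `MEMBERS` tables and the gene names are hypothetical, and each term is given a single parent for simplicity, whereas the real GO allows a term to have several parents.

```python
# Toy GO fragment: child term -> parent term (hypothetical, single parent only).
PARENT = {
    "mesoderm development": "development",
    "blastocyst development": "development",
    "development": "biological process",
}

# Hypothetical annotated members per term; some terms have none yet.
MEMBERS = {
    "mesoderm development": [],
    "blastocyst development": ["geneA"],
    "development": ["geneA", "geneB", "geneC"],
    "biological process": ["geneA", "geneB", "geneC", "geneD"],
}


def first_term_with_hits(term):
    """Climb from `term` toward the root until a term with members is found."""
    while term is not None:
        if MEMBERS.get(term):
            return term
        term = PARENT.get(term)  # None once we run past the root
    return None
```

With these toy tables, starting from "mesoderm development" (no members) the walk stops at "development", the first ancestor that has annotated genes.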
Q3: “When viewing e-scores in a query output, is there a way to trace
what sequence was used to generate that e-score?”
A3: Yes. One of the options for query output is called ‘Blast seq’. This field shows the original protein used in the BLAST search, as well as the amino acids that correspond to the particular domain of interest. GeneSpeed does not provide an actual alignment.
Q4: “Why does the GeneSpeed Database list domains with very low similarity?
Why doesn’t GeneSpeed utilize a standardized e-score cutoff to eliminate
such hits?”
A4: It is true that there are many low-scoring domain hits in the GeneSpeed Database, and some of these hits have such a poor (high) e-score that it is extremely unlikely that they represent a homolog of the given protein domain. This, however, is one of the true strengths of the database; in many cases we have observed ‘true’ hits for domains with fairly insignificant e-scores. If we were to implement a default e-score cutoff, these ‘true’ hits would not reside in the database at all. We have therefore kept these low-scoring hits in the database and given the user the ability to set the e-score cutoff to a stringent or a lenient value. In this way the user may set a stringent cutoff (low e-score) and thus eliminate most false positives. Alternatively, the user may set a lenient cutoff (high e-score) and thus include false positives, but at the same time find homologies that might make biological sense. Indeed, in many cases we have observed ‘true’ members of a family with fairly insignificant (high) e-scores – this is the natural result of billions of years of protein evolution, in which an ancestral domain may have been extensively duplicated and mutated. For more information on how to decide what e-score cutoff to use, see our section on Establishing a Default Expectancy Score Cutoff.
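The trade-off between a stringent and a lenient cutoff can be illustrated with a minimal Python sketch. The hit records, gene names, and e-scores below are invented for illustration and do not come from the GeneSpeed Database; the point is only that the same stored hits yield different outputs depending on the cutoff the user chooses.

```python
# Hypothetical domain hits; lower e-scores are more significant.
HITS = [
    {"gene": "geneA", "domain": "homeobox", "evalue": 1e-40},
    {"gene": "geneB", "domain": "homeobox", "evalue": 1e-12},
    {"gene": "geneC", "domain": "homeobox", "evalue": 1e-2},
]


def filter_hits(hits, max_evalue):
    """Keep only hits whose e-score is at or below the chosen cutoff."""
    return [h for h in hits if h["evalue"] <= max_evalue]


stringent = filter_hits(HITS, 1e-20)  # few hits, few false positives
lenient = filter_hits(HITS, 1.0)      # all hits, including doubtful ones
```

Because no hit is discarded at storage time, a distant but genuine family member such as the hypothetical geneC stays available to anyone willing to relax the cutoff.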
Q5: “The GeneSpeed Database has performed a detailed characterization
of the transcription factor family; why not conduct classifications of other
regulatory families as well?”
A5: The advantage of the transcription factor family is that it is a large, fairly well-studied regulatory family; this is what makes a detailed and reasonably thorough classification possible. Unfortunately, other regulatory families are not as well characterized, and much about them is still unknown. As more information becomes available, we may revisit the idea of classifying other families as well.
Q6: “Why can’t I access the full site – and how do I get
full permissions?”
A6: At present the GeneSpeed Database is in development. Full-site login access is granted on a case-by-case basis to academic or commercial investigators. We monitor incoming IP addresses and reserve the right to cancel log-on permissions at any time. If you would like to use the GeneSpeed Database in your research, please contact us so that we may set up a free account for you. The UCDHSC and the Jensen laboratory maintain all intellectual property rights to the process and content of the GeneSpeed Database, and as such, the source code and database content may not be copied, distributed or otherwise propagated in any way that has not been approved by us. All users of the GeneSpeed Database are required to read, understand, and comply with the GeneSpeed Terms and Conditions.
Q7: “I have read a paper on the gene “interesting”. I think that
the process it regulates is important in the organ I study, but I am not sure
that the gene is expressed there – or if any other homologous gene might
have a role similar to it. How can the GeneSpeed Database help me?”
A7: That sort of question is the primary motivation for our generating the GeneSpeed
Database. We expect that homologous genes from different species may share
similar expression profiles depending on the cell or tissue of origin. Therefore,
it is always valuable to obtain a general overview of how many proteins of
a given type may exist in the genome of a single organism, as an aid
to studying their expression.
The procedure for obtaining this information using the GeneSpeed Database
is as follows. First, perform a “Keyword” search using the text string
“interesting”. GeneSpeed will then display genes containing this keyword
in either the gene name or the annotated symbol. This provides an overview
of previously annotated genes with that name. GeneSpeed is not the only site
to provide such information; a simple text search at NCBI Entrez will give
it as well. However, GeneSpeed also lets you extend the output to un-annotated
genes whose domain content is similar to that of the product of your gene.
Perform a Domain Sub-search on gene “interesting” to find all regulatory-like
domains within “interesting”. Then conduct an IPR (InterPro) Sub-search on
each of these domains, and GeneSpeed will provide a list of genes that contain
that domain. The resulting list may include previously uncharacterized genes
containing domains that are similar to the domain of interest in “interesting”.
You can repeat the IPR Sub-search process with other domains within
“interesting” until you find a domain that truly represents “interesting”
and pulls out homologous genes. Finally, use the linked Novartis array data
series to determine in which tissues gene “interesting” is expressed.
Q8: “I use genomics extensively, and I am a skilled user of gene chip
analysis software. However, I constantly get frustrated by the lack of
protein annotation for some of my top-scoring hits. There are simply too many
ESTs out there with no functional annotation! It takes forever
to gauge what sort of gene is behind every Riken clone, and I feel that I
am wasting my precious time. Can the GeneSpeed Database help me?”
A8: It should. We have implemented an upload mechanism through which you can submit your own gene list (in the form of Unigene IDs), copied from an output file of your array analysis software. If any of the Unigene IDs have been retired or updated, you will be notified, and the list will be converted to the most recent Unigene IDs. Your updated custom list can then be used to query the current content of the GeneSpeed Database. This means that you will be able to see what domain content your unknown ESTs contain (by using the “Domain Sub-search” tool), and if there is any, you can move directly to the InterPro site using the adjacent hyperlink. There, you will find a detailed description of the domain, which may help you in further evaluating the role of the unknown gene you have discovered. You may also find out whether the domain types discovered have any previous Gene Ontology annotation. If so, you may be able to deduce a presumed function for your novel gene immediately. Bear in mind that such connections do not prove functional similarity – they are only a suggestion that must be tested empirically through experimentation. That is the nature of bioinformatics – computer-based alignment strategies can only provide educated guesswork.
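The ID-update step described above can be pictured with a short Python sketch. It is not the GeneSpeed implementation: the function name `update_ids`, the mapping table, and the Unigene-style IDs are all hypothetical, and the real service may handle retired IDs differently (for example, IDs retired without a replacement).

```python
def update_ids(uploaded, retired_to_current):
    """Map retired IDs to current ones; return the new list plus change notices."""
    updated, notices = [], []
    for uid in uploaded:
        current = retired_to_current.get(uid, uid)  # unchanged if not retired
        if current != uid:
            notices.append((uid, current))
        if current not in updated:  # two old IDs may merge into one new ID
            updated.append(current)
    return updated, notices


# Hypothetical upload and retirement table.
uploaded = ["Mm.100", "Mm.200", "Mm.300"]
retired_to_current = {"Mm.200": "Mm.250", "Mm.300": "Mm.250"}
updated, notices = update_ids(uploaded, retired_to_current)
```

In this toy run two retired IDs collapse into a single current ID, so the updated list is shorter than the upload and both changes are reported back to the user.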
Q9: “I am intrigued by the Gene Ontology basis for classifying gene
function. This is a major advancement in getting to grips with ~25,000 genes
in a mammalian organism. I currently investigate the regulatory process called
“very important pathway”, and this is classified as a GO-node,
with some sub-nodes called “this is important too” and “so
is this”. It seems that two proteins are assigned to the “very
important pathway” and to “so is this”. These are very different
– one is 200 amino acids, and the other is 1200 – a huge difference.
I also find no homology between these when I blast them against each other.
I do not really understand this. Can it really be true that only two proteins
are connected to my “very important pathway”? Please explain,
and tell me if the GeneSpeed Database can help me in any way without confusing
me more than I already am.”
A9: We think the GeneSpeed Database can help you to understand the proteins in your “very important pathway”. First let us discuss the basic concept of the Gene Ontology (GO) node structure. A GO node attempts to represent a biological function or process, while its child nodes (the sub-nodes below it) may be thought of as “sub-categories” of that “parent” GO node, representing more specific functions. As an example, consider the parent GO node “development”. Its sub-nodes include “mesoderm development” and “blastocyst development”. Although these two sub-nodes both represent types of “development”, the genes or domains categorized in each may be very different. One would not expect exactly the same assortment of genes to be involved in both mesoderm and blastocyst development, although this does not exclude the possibility of overlaps. Also, as GO is a tiered structure of biological functions, there is no reason that the same biological function cannot be accomplished by completely different protein types. Just consider “Wnt signaling” as a GO term. Wnt signaling may include the ligands (Wnts), soluble receptors (Frzbs), receptors (Frizzled), intracellular mediators (Dishevelled), transcription factors (Tcfs), and co-activators (β-catenin). It should thus be clear why the sub-nodes of your “very important pathway” can include proteins with very different domains that share no similarity with one another. Fortunately, the GeneSpeed Database is an excellent tool for studying your domains further. First browse the GO section of GeneSpeed and find the node representing your “very important pathway”. Press the “Expand Selection” button, which will list all the sub-nodes of “very important pathway”; these should include “this is important too” and “so is this”. 
Select each of these and then press the “Choose Selection” button, which will display all the genes in each of these sub-nodes. NOTE that this does not select genes classified only in the “very important pathway” node itself; it displays only the genes in each of the selected sub-nodes and their corresponding sub-nodal trees. At this point you may use the Domain Sub-search tool to investigate the different domains contained in these proteins. You may also use the IPR (InterPro) Sub-search tool to see what other proteins contain these domains. In addition, take advantage of the hyperlinks out to sites like Pfam, InterPro, Unigene, Ensembl, etc. to further study your domains or genes of interest. There is no guarantee that the closest matches to your initial GO-defined factor also share its GO function. You will need to evaluate such hits on a case-by-case basis.
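The selection behaviour described in the NOTE can be sketched in Python as follows. This is an assumption-laden illustration rather than GeneSpeed code: the `CHILDREN` and `DIRECT` tables, the extra "deeper node" term, and all gene names are hypothetical. It shows that choosing the sub-nodes returns the genes of each sub-node and its subtree, but not genes annotated only at the parent.

```python
# Hypothetical GO fragment: term -> child terms.
CHILDREN = {
    "very important pathway": ["this is important too", "so is this"],
    "so is this": ["deeper node"],
}

# Hypothetical genes annotated directly to each term.
DIRECT = {
    "very important pathway": {"geneP"},
    "this is important too": {"geneX"},
    "so is this": {"geneY"},
    "deeper node": {"geneZ"},
}


def genes_in_subtree(term):
    """Genes annotated to `term` or to any term below it."""
    genes = set(DIRECT.get(term, set()))
    for child in CHILDREN.get(term, []):
        genes |= genes_in_subtree(child)
    return genes


def genes_for_selection(selected_terms):
    """Per-term gene sets for the sub-nodes the user selected."""
    return {term: genes_in_subtree(term) for term in selected_terms}


selection = genes_for_selection(CHILDREN["very important pathway"])
```

Here the hypothetical geneP, annotated only to the parent node, is absent from the result, while geneZ is included because it sits in the subtree of a selected sub-node.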
Q10: “I have just performed a time series on heart development.
I have 30 DNA chips, and a wealth of data. I like using the GO-function in
GeneSpring to extract certain genes and query these based on presumed function.
However, I feel that I do not get sufficient insight, as many genes are not
properly GO-annotated, and as a result, I feel I am drowning in data. Given
that you have developed a method of linking up presumed GO-terms to un-annotated
genes, maybe the GeneSpeed Database can help me solve this?”
A10: It sure can. In addition to uploading a custom gene list into GeneSpeed and studying the associated domains, you may also export a custom gene list. This allows you to generate specific gene lists inside GeneSpeed, say all genes of 10 select protein classes, and export these into your GeneChip analysis program. This method should allow you to focus your array analysis on gene lists of particular functions, thereby avoiding the typical “data drowning syndrome” frequently observed with chip analyses. It is true that several support functions within the GeneChip analysis programs will help you to extract functional groups; NetAffx is one site dedicated to this. The difference when using the GeneSpeed site is that the domain-restricted lists should be more complete than the curated, extracted lists integrated in your program. Check for yourself.
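Conceptually, focusing an array analysis on an exported list amounts to a simple filter, sketched below in Python. The IDs, the expression values, and the function name `restrict_to_list` are hypothetical; in practice the filtering would be done inside your array analysis program after importing the GeneSpeed list.

```python
def restrict_to_list(rows, gene_ids):
    """Keep only array rows whose gene ID is in the exported list."""
    wanted = set(gene_ids)
    return [(uid, value) for uid, value in rows if uid in wanted]


# Hypothetical exported list and array results (ID, log ratio).
exported_ids = ["Mm.1", "Mm.2"]
array_rows = [("Mm.1", 2.5), ("Mm.9", 0.1), ("Mm.2", -1.8)]
focused = restrict_to_list(array_rows, exported_ids)
```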
Q11: “I am confused by the number of hits that the GeneSpeed Database
provides after a simple query. Some proteins contain >10 different suggested
domains, and sometimes a query for transcription factors returns some proteins
typical of a membrane type. What is going on? This surely cannot be
correct.”
A11: You are absolutely right. However, these problems are rooted in the expectancy score (E-score) accepted for the queries you performed, as well as in the domain that you are looking at. Because the E-score depends on domain size, we do not impose a stringent global E-score cutoff, as this could exclude genuine matches to short domains. Please see the discussion on 'Establishing a Default Expectancy Score Cutoff', which describes our “recommendations” for E-score cutoffs based on studies we have performed with domain size. For example, the cutoff for a small domain may be as lenient as 1e-3. This signifies very low homology, but is still above the random polypeptide hit ratio for small domains. Most hits for larger domains at those E-scores, however, truly show no overall homology to the domain input. For comparison, we have found that a typical domain of 100 amino acids should have an E-score below 1e-13 to begin to represent a true domain hit. Therefore, as you lower the E-score cutoff, you will see fewer hits, and your false positive rate will decrease as well.
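A size-dependent cutoff of this kind could be applied as in the Python sketch below. Only the two E-score values (1e-3 for a small domain, 1e-13 for a 100-residue domain) come from the answer above; the 30-residue boundary used for a "small" domain, the fallback for longer domains, and the function names are assumptions made purely for illustration.

```python
# (maximum domain length in residues, recommended maximum E-score)
# 1e-3 and 1e-13 are taken from the text; the 30-residue boundary is
# an assumed, illustrative definition of a "small" domain.
CUTOFFS = [
    (30, 1e-3),
    (100, 1e-13),
]


def cutoff_for_domain(length, cutoffs=CUTOFFS):
    """Return the E-score cutoff for the first size class that fits."""
    for max_length, max_evalue in cutoffs:
        if length <= max_length:
            return max_evalue
    # Assumption: longer domains reuse the strictest listed cutoff.
    return cutoffs[-1][1]


def is_plausible_hit(evalue, domain_length):
    """True if the hit's E-score passes the cutoff for its domain size."""
    return evalue <= cutoff_for_domain(domain_length)
```

Under these assumptions, an E-score of 1e-4 would pass for a 25-residue domain but fail for a 100-residue domain, which mirrors the reasoning given in the answer.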
Q12: “I heard somebody refer to the GeneSpeed Database as ‘Unigene
on steroids’. What is this all about?”
A12: That is an interesting notion. The GeneSpeed Database could not have been created without the existence of Unigene. There are a couple of reasons for this. Unigene represents a non-redundant set of transcribed sequences for different organisms. If the root dataset were redundant, we would have to set aside excessive resources for manual curation of search results. Only two people were responsible for creating the GeneSpeed Database, so we have limited resources available to tackle a redundant root dataset. Also, with the almost complete sequencing of individual organism transcriptomes, very few new Unigenes are added, and fewer are lost, at every Unigene data update. As a result, the root dataset becomes more stable, which improves the search generality. However, two other recent developments have also been crucial. First, the InterPro site has an extensive compilation of known domain structures and has made aligned lists of such grouped domains from any species available as directly downloadable files through the Pfam site. We could thus take advantage of this without having to build our own multi-domain tBLASTn input files, which was a major help. Second, the current gene function structure using the Gene Ontology classification schema, although incompletely populated because of gaps in our biological knowledge, nonetheless provides a good scaffold for extracting valuable classes of proteins, and thus their domains. As the few proteins assigned GO functions are linked via InterPro, this allows for a GO-driven, domain-type extraction process, whereby multiple sequences in Unigene can be assigned. What all this provides is essentially a top-down and more structured view of the root Unigene database. So, yes, “Unigene on steroids” is perhaps a fitting description.