The Splicing Codes
Marcello Barbieri, page 14
The primary transcripts of the genes are often transformed into messenger RNAs by removing some RNA pieces (called introns) and by joining together the remaining pieces (the exons). This cutting-and-sealing operation, known as splicing, is a true assembly because exons are assembled into messengers, and we need therefore to find out if it is a catalyzed assembly (like transcription) or a codified assembly (like translation). In the first case splicing would require only catalysts (comparable to RNA polymerases), whereas in the second case it would need an assembly machine and a set of adaptors (comparable to the ribosome and tRNAs). These parallels immediately suggest that splicing is a codified process because it is implemented by structures that are very much comparable to those of protein synthesis. The splicing bodies, known as spliceosomes, are huge molecular machines like ribosomes, and employ small molecules, known as small nuclear RNAs (snRNAs), which are comparable to tRNAs. The similarity, however, goes much deeper than that because splicing is carried out by molecular structures that are true adaptors. They perform two independent recognition processes, one for the beginning and one for the end of each exon, thus creating a specific correspondence between primary transcripts and messenger RNAs. Splicing, in other words, is a codified process based on adaptors and takes place with sets of rules that have been referred to as splicing codes. It must be underlined, however, that there are two outstanding complications in splicing. One is the fact that the order in which the exons are joined together can be shuffled in various ways, an operation called alternative splicing that allows many species to generate a whole family of variant proteins from the same gene.
The expression of these proteins, furthermore, can change from one tissue to another and in different stages of embryonic development, thus enormously increasing the protein variety that can be associated with a gene. Alternative splicing has in this way a powerful role in the generation of biological complexity, and splicing mistakes often have pathological effects; it has been estimated that they account for about one fifth of all inherited diseases.
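The cutting-and-sealing operation described above can be sketched in a few lines of Python. This is only a toy illustration: the transcript, the exon coordinates, and the function names `splice` and `alternative_messengers` are hypothetical, and the only form of alternative splicing modelled is the inclusion or skipping of internal exons.

```python
from itertools import combinations

def splice(transcript: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments (start, end) of a transcript and drop the rest."""
    return "".join(transcript[start:end] for start, end in exons)

def alternative_messengers(transcript, exons):
    """Toy alternative splicing: keep the first and last exon and
    include or skip every internal exon (exon skipping only)."""
    first, *internal, last = exons
    variants = set()
    for k in range(len(internal) + 1):
        for chosen in combinations(internal, k):
            variants.add(splice(transcript, [first, *chosen, last]))
    return sorted(variants)

# Hypothetical transcript: exons in upper case, introns in lower case.
transcript = "AUGGCC" + "guaaguuuag" + "GAUUAC" + "guaagcccag" + "UGGUAA"
exons = [(0, 6), (16, 22), (32, 38)]  # assumed exon boundaries

print(splice(transcript, exons))
print(alternative_messengers(transcript, exons))
```

With one internal exon the sketch yields two messengers from the same transcript; each additional internal exon doubles the number of possible variants in this simplified model.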
The other great complication of splicing is the fact that many introns carry sequences that are similar to exons but translate into nonsense and for this reason are called pseudo-exons or pseudo-genes. They would create havoc if incorporated into mRNAs, and the splicing machinery needs the means to differentiate real exons from pseudo ones. The result is that real exons contain internal identity marks that are known as exonic splicing enhancers (ESEs) and exonic splicing silencers (ESSs).
Question: Did these identity marks not have to be present in the process right from the beginning? Would their absence, or their presence in a not fully developed form, not make it impossible for the process to run without mistakes?
The presence of these marks, in turn, means that the adaptors of the splicing codes are not single molecules but combinations of molecules because they must be able to recognize not only the beginning and the end of the real exons, but also their internal identity marks.
This makes a stepwise emergence of the whole process even less feasible. Recognition of both the beginning and the end of each exon is required, which means that the genome must carry the start and stop signals at the right places, that the molecular machines programmed to recognize those signals must be in place, fully developed and fully programmed, and that the identity marks must be present alongside the hardware. Furthermore, this seems to be one more irreducibly complex system, since both the software and the hardware had to be in place, just right, fully developed and programmed from the beginning.
The actual deciphering of the splicing codes has already started but it is taking considerably longer than that of the genetic code because it is incredibly more complex. Let us keep in mind that the discovery of the genetic code has been facilitated by two particularly favourable features. More precisely, by the fact that
(1) the adaptors are single molecules (the tRNAs) and
(2) the coding units form a closed set (64 codons and 20 amino acids).
In the case of splicing, instead, the adaptors are combinations of molecules (combinatorial codes), and the domain (or alphabet) of the codes is open and potentially unlimited. The overall complexity of splicing is such that the most practical way of discovering its codes is by building computational models that are capable of predicting new splicing rules on the basis of existing data. Such models have already started appearing in the literature, and represent our first glimpse of the rules of the splicing codes.
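The contrast between a closed and an open domain can be made concrete with a short Python sketch. The codon count follows directly from the text; the function `factor_combinations` is a hypothetical toy model in which a combinatorial adaptor is any non-empty subset of recognition factors, which is an assumption made only for illustration.

```python
from itertools import product

BASES = "UCAG"

# Closed domain of the genetic code: every ordered triplet of four bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

def factor_combinations(n_factors):
    """Non-empty subsets of n recognition factors (toy model of a
    combinatorial adaptor; the subset model is an assumption)."""
    return 2 ** n_factors - 1

print(len(codons))  # the closed set of codons
for n in (5, 10, 20):
    print(n, factor_combinations(n))
```

The codon set is fixed once and for all, whereas the number of possible adaptor combinations in the toy model grows without bound as factors are added.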
The Metabolic Code
This is the first organic code that came to light after the discovery of the genetic code. It was described in Science, in 1975, by Gordon Tomkins, a professor of biochemistry at the University of California, San Francisco. Tragically, Tomkins died that very year, aged 49, from a brain tumour, and apparently his idea died with him. Recently, however, there has been an attempt to rescue his work from oblivion (Swan and Goldberg 2010) and here we will try to show that such an attempt is amply justified. Tomkins investigated the evolution of metabolism and started from the need of the ancestral cells to obtain energy. “Since both nucleic acid and protein synthesis are endergonic reactions, primordial cells were almost certainly endowed with the capacity to capture the necessary energy from the environment and to transform it into usable form, presumably ATP (adenosine triphosphate). The biosynthetic capabilities of primitive cells were, however, probably quite limited ... survival would therefore have required the evolution of regulatory mechanisms that could maintain a relatively constant intracellular environment in the face of changes in external conditions” (Tomkins 1975). Granted this basic need of the cells to evolve regulatory mechanisms, Tomkins distinguished between two types of regulation that he called simple and complex. Simple regulation consists of a direct relationship (positive or negative) between the components of a metabolic circuit, with the end products affecting their own metabolism.
Complex regulation is characterized by two new entities that Tomkins called symbols and domains. In order to illustrate them, Tomkins made the example of molecules that are accumulated inside a cell as a consequence of a particular environment and become a symbol of that environment. In most microorganisms, for example, cyclic AMP is accumulated as a result of carbon starvation and becomes a symbol of that deficiency. Another example is ppGpp (guanosine 5′-diphosphate 3′-diphosphate) that accumulates as a result of amino acid starvation and represents a symbol of that condition. These molecules are symbols because they bear no structural relationship to the molecules that promote their accumulation (cyclic AMP, for example, is accumulated as a result of glucose starvation, but it is not a chemical analog of glucose). This is what suggested to Tomkins the existence of a metabolic code. “Since a particular environmental condition is correlated with a corresponding intracellular symbol, the relationship between the extra- and intracellular events may be considered as a metabolic code in which a specific symbol represents a unique state of the environment.” Tomkins went on to show how metabolic coding in unicellular organisms might have evolved into the endocrine system of the metazoa, and described what happens in the slime mold Dictyostelium discoideum. “Given sufficient nutrients, this organism exists as independent myxamoebas. Upon starvation, they generate cyclic AMP and release it into the surrounding medium. This substance serves as a chemical attractant that causes the aggregation of a large number of myxamoebas to form a multicellular slug. In this case, as in E. coli, cyclic AMP acts as an intracellular symbol of carbon-source starvation. In addition, however, the cyclic nucleotide is released from the Dictyostelium cells in which it is formed and diffuses to other nearby cells, promoting the aggregation response.
Cyclic AMP thus acts in these organisms both as an intracellular symbol of starvation and as a hormone which carries this metabolic information from one cell to another.”
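Tomkins' idea that “a specific symbol represents a unique state of the environment” can be written down as a simple lookup table. The two pairs below are the ones given in the text; the names `METABOLIC_CODE`, `encode` and `decode` are hypothetical and serve only to make the one-to-one correspondence explicit.

```python
# Pairs stated in the text: environmental condition -> intracellular symbol.
METABOLIC_CODE = {
    "carbon starvation": "cyclic AMP",
    "amino acid starvation": "ppGpp",
}

def encode(condition):
    """Return the intracellular symbol that stands for an environmental condition."""
    return METABOLIC_CODE[condition]

def decode(symbol):
    """Recover the environmental condition that a symbol represents."""
    inverse = {sym: cond for cond, sym in METABOLIC_CODE.items()}
    return inverse[symbol]

print(encode("carbon starvation"))
print(decode("ppGpp"))
```

Nothing in the table follows from the chemistry of the symbols themselves, which is the sense in which the text calls them symbols.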
Hormones, according to Tomkins, evolved in order “to carry information from sensor cells in direct contact with the environment, to more sequestered responder cells. Specifically, the metabolic state of a sensor cell, represented by the levels of its intracellular symbols, is encoded by the synthesis and secretion of corresponding levels of hormones. When hormones reach the responder cells, the metabolic message is decoded into corresponding primary intracellular symbols. In this way, endocrine cells act as both sensors and responders, that is, intermediates in the transmission of metabolic information from primary sensor cells to the tissues in which the final chemical responses take place.”
The Signal Transduction Codes
Living cells react to many physical and chemical stimuli from the environment, and in general their reactions consist in the expression of specific genes. We need therefore to understand how the environment interacts with the genes, and the turning point, in this field, came from the discovery that the external signals (known as first messengers) never reach the genes. They are invariably transformed into a different world of internal signals (called second messengers) and only these, or their derivatives, reach the genes. In most cases, the molecules of the external signals do not even enter the cell and are captured by specific receptors of the cell membrane, but even those that do enter (some hormones) must interact with intracellular receptors in order to influence the genes (Sutherland 1972). The transfer of information from environment to genes takes place therefore in two distinct steps: one from first to second messengers, called signal transduction, and a second path from second messengers to genes which is known as signal integration. The surprising thing about signal transduction is that there are literally hundreds of first messengers (ions, nutrients, hormones, growth factors, neurotransmitters, etc.) whereas the second messengers belong to only four molecular families: cyclic AMP or GMP, calcium ions (Ca2+), inositol trisphosphate (IP3), and diacylglycerol (DAG) (Alberts et al. 2007). First and second messengers, in other words, belong to two very different worlds, and this suggests immediately that signal transduction may be based on organic codes. This is reinforced by the discovery that there is no necessary connection between first and second messengers, because it has been proven that the same first messengers can activate different types of second messengers, and that different first messengers can act on the same type of second messengers (Alberts et al. 2007). The only plausible explanation is that signal transduction is based on organic codes, but of course one would like a direct proof. The signature of an organic code, as we have seen, is the presence of adaptors, and the transmembrane receptor proteins of signal transduction do have the defining characteristics of the adaptors.
The transduction system consists of at least three types of molecules:
a receptor for the first messengers,
an amplifier for the second messengers and
a mediator in between (Berridge 1985).
This transmembrane system performs two independent recognition processes, one for the first and the other for the second messenger, and the two steps are connected by the bridge of the mediator. This connection, on the other hand, could be implemented in countless different ways since any first messenger can be coupled with any second messenger, and this makes it imperative to have a selection in order to guarantee biological specificity.
In signal transduction, in short, we find the three defining features of a code:
(1) two independent worlds of objects (first messengers and second messengers),
(2) a potentially unlimited number of arbitrary connections produced by adaptors, and
(3) a set of coding rules (a selection of the adaptors) that ensures the specificity of the correspondence.
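The three features listed above can be sketched as a small data structure. This is an illustrative model only: the class `TransductionSystem`, the function `transduce` and the messenger labels "signal A" and "signal B" are hypothetical placeholders, not real receptor pairings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransductionSystem:
    """Hypothetical adaptor: a receptor coupled, via a mediator, to an amplifier."""
    recognizes: str  # first messenger bound by the receptor
    produces: str    # second messenger released by the amplifier

# A selection of adaptors plays the role of the coding rules.
# The pairings are placeholders chosen for illustration.
CODE = [
    TransductionSystem("signal A", "cyclic AMP"),
    TransductionSystem("signal A", "Ca2+"),
    TransductionSystem("signal B", "cyclic AMP"),
]

def transduce(first_messenger, adaptors):
    """Return the second messengers produced for a given first messenger."""
    return {a.produces for a in adaptors if a.recognizes == first_messenger}

print(transduce("signal A", CODE))  # one first messenger, two second messengers
print(transduce("signal B", CODE))  # a different first messenger, same second messenger
```

Because the two fields of each adaptor are set independently, any first messenger could in principle be coupled with any second messenger; the list `CODE` is what restricts the correspondence to a specific set of rules.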
The effects that external signals have on cells, in short, do not depend on the energy or the information that they carry, but on the meaning that cells give them with sets of rules that have been referred to as signal transduction codes (Barbieri 1998, 2003). One may wonder at this point why signal transduction codes are never mentioned in biochemistry books despite the fact that their molecules are true adaptors. The problem here is that the study of signal transduction started when organic codes were not known, and it has always been assumed a priori that in this process there is no need for them. A code, in short, has not been found simply because it has never been looked for. The genetic code, on the contrary, was predicted on theoretical grounds, and it was discovered precisely because experiments were devised with the specific purpose of looking for it.
The Signal Integration Codes
We have seen that there are only four families of second messengers in the cell, and yet the reactions that they set in motion can pick up an individual gene among tens of thousands. How this is achieved is still a mystery, but some progress has been made. Perhaps the most illuminating discovery, so far, is that second messengers do not act independently. Calcium ions and cyclic AMP, for example, have effects that on some occasions reinforce each other, whereas on others they are mutually exclusive. The cell, in short, can combine its internal signals in countless different ways, and it is precisely this combinatorial ability that explains why a small number of second messengers can generate an extraordinarily high number of specific genetic responses. The activation of second messengers, in other words, sets in motion a cascade of reactions that normally ends with the expression of a target gene, and again we need to understand if they are normal catalyzed reactions or if at least some of them are based on the rules of a code. One of the most interesting clues, in this field, is the fact that signalling molecules have in general more than one function. Epidermal growth factor, for example, stimulates the proliferation of fibroblasts and keratinocytes, but it has an antiproliferative effect on hair follicle cells, whereas in the intestine it is a suppressor of gastric acid secretion. Other findings have proved that all growth factors can have three distinct functions, with proliferative, anti-proliferative, and proliferation-independent effects. They are, in short, multifunctional molecules. In addition to growth factors, it has been found that many other molecules have multiple functions. Adrenaline, for example, is a neurotransmitter, but it is also a hormone produced by the adrenal glands to spring the body into action by increasing the blood pressure, speeding up the heart and releasing glucose from the liver.
Acetylcholine is another common neurotransmitter in the brain, but it also acts on the heart (where it induces relaxation), on skeletal muscles (where the result is contraction), and in the pancreas (which is made to secrete enzymes). Cholecystokinin is a peptide that acts as a hormone in the intestine, where it increases the bile flow during digestion, whereas in the nervous system it is a neurotransmitter. Encephalins are sedatives in the brain, but in the digestive system they are hormones which control the mechanical movements of food. Insulin is universally known for lowering the sugar levels in the blood, but it also controls fat metabolism and in other less known ways it affects almost every cell of the body. The discovery of multifunctional molecules suggests that their function is not decided solely by their structure, but also by the context in which they find themselves. What matters, in other words, is not their ability to catalyze a specific reaction, but the fact that they are employed as molecular signs that can be given one meaning in a certain context and a different meaning in another one. A second finding that points to the existence of codes in signal integration is the fact that the regulation processes set in motion by second messengers are strongly conserved in evolution, and yet the actual reactions involved have undergone great changes in the history of life. The regulation of cellular energy homeostasis, for example, has been highly conserved from yeast to man, with the key role being played by a protein kinase that is called AMPK in animals and Snf1 in yeast. Despite this overall conservation, it has been found that an evolutionary divergence of about 150 million years between two species of budding yeasts (Saccharomyces cerevisiae and Kluyveromyces lactis) has produced substantial differences in their Snf1 regulatory networks.
Again, what seems to matter in these regulation processes is not a specific set of catalysts, but a set of rules that can be implemented in many different ways. The information carried by first messengers, in conclusion, undergoes two great transformations in its journey towards the genes. First, it is transformed into internal messengers with the rules of the signal transduction codes, and then it is channelled along complex three-dimensional circuits that integrate it with other signals according to the rules of one or more signal integration codes.
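The two ideas of this section, context-dependent meaning and combinatorial use of a few second messengers, can be sketched as follows. The three rules are the epidermal growth factor examples given in the text; the names `INTEGRATION_RULES`, `interpret` and `combined_states`, and the assumption that each messenger family takes a small number of discrete levels, are hypothetical simplifications.

```python
# Rules taken from the examples in the text: the same signal, epidermal
# growth factor (EGF), is given different meanings in different cell types.
INTEGRATION_RULES = {
    ("EGF", "fibroblast"): "proliferation",
    ("EGF", "keratinocyte"): "proliferation",
    ("EGF", "hair follicle cell"): "anti-proliferation",
}

def interpret(signal, context):
    """Return the response assigned to a signal in a given cellular context."""
    return INTEGRATION_RULES.get((signal, context), "no rule in this toy table")

def combined_states(n_messengers, levels=2):
    """Count joint states of n second messengers, each at one of `levels` levels."""
    return levels ** n_messengers

print(interpret("EGF", "fibroblast"))
print(interpret("EGF", "hair follicle cell"))
print(combined_states(4))      # four families, each simply on or off
print(combined_states(4, 10))  # four families, ten assumed concentration levels
```

Even under these crude assumptions the number of distinguishable internal states grows quickly, which is one way to read the claim that a few second messengers can specify a very large number of responses.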
The Histone Code
The classic double helix described by Watson and Crick has a width of 2 nm (two millionths of a millimeter), but in eukaryotes many segments of this filament are folded around groups of eight histone proteins and form blocks, called nucleosomes, that give the filament a ‘beads-on-a-string’ appearance. This string, called chromatin, is almost six times thicker than the double helix and is further folded into spirals of nucleosome groups, called solenoids, that arrange it in fibers of increasing thickness and ultimately into the 600 nm fiber of the chromosome. These multiple foldings allow the eukaryotic cells to pack their long chromosomes into the tiny space of their nuclei, and for this reason it was initially assumed that the histones have a purely packaging role. The experimental data, however, have shown that the ‘tails’ of the histones (the parts that protrude from the surface of the nucleosomes) are subject to a wide variety of post-translational modifications (in particular acetylation, methylation and phosphorylation) that have highly dynamic roles and are involved in the activation or repression of gene activity. The histone tails represent about 25–30% of the histone mass, and their post-translational modifications can alter the chromatin either directly or indirectly. The direct modifications are those that physically open or close the molecular space (in particular the electrostatic barrier) that surrounds the genes and in this way control the transit of DNA-binding proteins. Several discoveries, however, have shown that the most frequent effects are obtained by indirect mechanisms. In these cases, the modified histone tails provide ‘marks’ on the surface of the nucleosomes that are recognized by specialized effector proteins which set in motion chains of biological reactions that eventually end in the activation or the repression of specific genes.
A crucial breakthrough, in this field, was the discovery that the post-translational modifications of the histones do not act individually. Most of them are involved in both the activation and the repression of genes (the phosphorylation of histone H3, for example, takes part in the condensation as well as in the decondensation of chromatin), which means that the final result is due to a combination of histone marks rather than a single one. This led David Allis and colleagues to propose that the histone marks operate in combinatorial groups, like letters that are put together into the words of a molecular ‘language’ that was referred to as histone code. The same concept was independently proposed by Brian Turner who argued that there is an epigenetic code at the heart of the regulation mechanisms that are initiated by histone tail modifications. Turner pointed out that these modifications are epigenetic because they operate in addition to genetic changes, and underlined that they have both short-term and long-term effects. The short-term modifications change rapidly in response to external signals and represent a mechanism by which the genome quickly responds to the environment. The long-term modifications, instead, are those that are put in place at early stages of embryonic development and allow the transcription or the silencing of specific genes at more advanced stages. The existence of long-term effects was revealed by the discovery that many histone modifications survive the trauma of mitosis and are transmitted to the daughter cells. This is particularly important in embryonic development where the cells must perpetuate their state of differentiation into distinct tissues. The histone modifications, in other words, provide a mechanism of cell memory, in the sense that they enable the cells to ‘remember’ their specific pattern of gene expression for many generations.
It has been shown, for example, that the expression of Hox genes in embryonic development is regulated by histone modifications. Another example of long-term effects is provided by the histone modifications that allow neural cells to generate faster action potentials the more they are used, making the transmission of action potentials increasingly easy. Today, in conclusion, a large number of data support the idea that the regulation of genetic activity by histone modifications plays a fundamental role in all eukaryotes and is based on the rules of a combinatorial code that has become known as ‘histone code’.
Is the “Histone Code” an Organic Code?
This question is the title of a paper where Stefan Kühn and Jan-Hendrik Hofmeyr described the results of a research project dedicated to finding out whether or not the histone code has all the essential characteristics of an organic code. The prototype example of the genetic code shows that an organic code requires three things:
(1) two independent molecular worlds,
(2) a set of molecular adaptors that create a mapping between them, and
(3) the demonstration that the mapping is arbitrary because its rules can be changed.
Kühn and Hofmeyr tested the histone code with respect to all these points.
1. The Two Independent Worlds of the Histone Code
An organic code is a mapping between organic signs and organic meanings, and in many cases signs and meanings are both organic molecules. The genetic code, for example, is a mapping between codons and amino acids, whereas the signal transduction code is a mapping between first and second messengers. Kühn and Hofmeyr, however, pointed out that the organic meanings can be biological effects rather than molecules. In principle this may not seem an extension of the original definition because biological effects are necessarily implemented by molecules, but in practice it is a very useful generalization because there are cases in which a biological function is an experimental reality even when its molecular components are not fully known. And this is precisely the case in the histone code, where the organic signs are groups of histone modifications and the organic meanings are biological reactions that promote the activation or the repression of specific genes. The histone code, in other words, is a mapping between two independent worlds.
2. The Adaptors of the Histone Code
The effector proteins of the histone code are the molecules that establish a bridge between organic signs and organic meanings, but in order to prove that they are true adaptors it is necessary to show that they operate independently on signs and meanings. Kühn and Hofmeyr underlined that this is precisely what happens because the effector proteins have two distinct domains: one that recognizes histone modifications and a different type that initiates biological reactions. It has been shown, for example, that the acetylated lysines are specifically recognized only by the bromodomains of the effector proteins. The methylated amino acids are recognized by a greater variety of domains but again each recognition step is absolutely specific. The effector proteins, in other words, perform two independent recognition processes on signs and meanings and are therefore true adaptors.
3. The Arbitrariness of the Histone Code
An organic code is arbitrary when its rules are not dictated by physical necessity and in this case it must be possible, at least in principle, to exchange the part of an adaptor that recognizes an organic sign with a different one and show that the modified adaptor associates the old organic meaning to the new sign. Kühn and Hofmeyr noticed that the experimental data support this possibility because there is evidence that the chromodomains of the effector proteins can be interchanged. The histone code, in conclusion, did pass the three tests and Kühn and Hofmeyr ended their paper with these words: “Although we probably do not yet know the complete histone code, we have more than enough information to be able to recognize the histone code as a bona fide organic code.”
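The arbitrariness test described above, exchanging the recognition part of an adaptor and checking that the old meaning follows the new sign, can be sketched as a thought experiment in code. The class `Effector`, the function `build_code` and the labels "mark A", "mark B" and "reaction 1" are hypothetical placeholders, not real histone marks or reactions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Effector:
    """Hypothetical adaptor with two independent domains."""
    reader: str  # domain that recognizes a histone mark (the sign)
    action: str  # domain that initiates a biological reaction (the meaning)

def build_code(effectors):
    """Derive the sign -> meaning mapping implemented by a set of effectors."""
    return {e.reader: e.action for e in effectors}

original = Effector(reader="mark A", action="reaction 1")
# Swap only the recognition domain; the action domain is left untouched.
swapped = replace(original, reader="mark B")

print(build_code([original]))
print(build_code([swapped]))
```

In this toy model the mapping changes from "mark A" to "mark B" while the reaction stays the same, which is the behaviour one would expect if the rule were a convention carried by the adaptor rather than a physical necessity.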
The Tubulin Code
Tubulin is the major component of the microtubules, the filaments that form an internal scaffolding in all eukaryotic cells and give origin to organelles such as cilia, centrioles, basal bodies and the mitotic spindle. Most microtubules are in a state of rapid turnover by dynamic instability and alternate very quickly between growth and shrinkage. Within the cell, however, there is also a population of microtubules that are relatively stable, in the sense that their turnover is measured in hours rather than minutes. The function of the stable microtubules is still not completely known, but there are clear indications that they are involved in the morphogenesis of the eukaryotic cell. What is certain, is that the stable microtubules undergo a variety of post-translational modifications (PTMs) that have been strongly conserved because they are found in all eukaryotic taxa. These PTMs consist in processes like acetylation, phosphorylation, polyglutamylation, polyglycylation, detyrosination, and palmitoylation that act preferentially on stable microtubules. They have been studied with various tests on purified tubulin, but the experiments have failed to detect any direct effect of the PTMs on the dynamics of the microtubules. This means that PTMs do not act by changing directly the intrinsic properties of the microtubules, but rather by providing combinatorial signals for the recruitment of proteins that interact with the microtubules. Different combinations of PTMs, in other words, act like signposts that specify the properties that stable microtubules are going to have in different regions of the cell or in different periods of the cell cycle. To this set of signposts that operate on stable microtubules, Kristen Verhey and Jacek Gaertig gave the name of Tubulin code. Any organic code, as we have seen, requires molecules that act like adaptors between two different domains. 
Verhey and Gaertig have called these molecules ‘interpreters’, and have identified three major classes of microtubule binding proteins that can be considered interpreters of the tubulin code:
“First, microtubule associated proteins (MAPs) such as Tau, MAP1 and MAP2 that bind statically along the length of microtubules.
Second, plus-end tracking proteins (+TIPs) that bind in a transient manner to the plus-ends of growing microtubules.
And third, molecular motors that use the energy of ATP hydrolysis to carry cargoes along microtubule tracks.”
Verhey and Gaertig have also called attention to a unique characteristic of the tubulin code. Many epigenetic modifications are transmitted from one generation to the next, but this does not usually happen in the tubulin world: “Some microtubule-based organelles (e.g., centrosomes and basal bodies) are inherited by a template-driven mechanism but there is no evidence that the template organelle directly influences the PTM pattern in the new organelle. Rather, the PTM pattern is recreated in the newly formed organelle in a gradual manner ... Other microtubule-based structures, such as cytoplasmic microtubules, the mitotic spindle and cilia, are formed de novo mostly, if not entirely, from unmodified tubulin heterodimers. Thus, in case of both template-dependent and template-independent microtubular structures, PTM patterns are probably recreated without a direct influence of preexisting PTMs.” The existence of the tubulin code, in conclusion, is based on sound experimental evidence but the actual deciphering of its rules is still at a preliminary stage and requires a detailed understanding of how the PTMs influence the recruitment of proteins and regulate the functions of the stable microtubules.
The Sugar Code
For a long time, sugars have been regarded as molecules that provide energy (mostly in the form of glucose and glycogen) or structural support (like cellulose in plants), but molecular biology has shown that they also have a third outstanding function: by binding to proteins they generate glycoproteins, molecules that take part in countless communication processes in and between cells. The addition of sugars to proteins is a post-translational modification, called glycosylation, that greatly expands the potentialities of many protein families and gives origin to glycoproteins that perform a wide variety of functions. Some operate on the cell membrane and act as antennae for receiving molecular signals or as docking sites for importing compounds. Other glycoproteins take part in cell-to-cell interactions, for example in sperm-oocyte attachment, in bacteria-to-cell relationships and in the aggregation of platelets. A third family operates in the immune system where glycoproteins interact with antigens, recognize white blood cells, and take part in the major histocompatibility complex (MHC). Yet another family is that of the glycoproteins that act as hormones, like human chorionic gonadotropin (HCG), thyroid-stimulating hormone (TSH) and erythropoietin (EPO). Then there are glycoproteins that have protective functions (mucins), some that are involved in transport (transferrin) and others that act as enzymes (alkaline phosphatase). The key point in these interactions is that in most cases it is the sugar component that determines the recognition ability of the glycoproteins. This point has been particularly underlined by Winterburn and Phelps (1972), who convincingly argued that “the significance of the glycosyl residues is to impart a discrete recognitional role on the protein”. Sugars, in other words, are carriers of information because their sequences have specific biological functions, and yet the information they carry is only partially contained in the genome.
In most cases it is due to subtle epigenetic modifications in the terminals of the sugar antennae. It has been found, furthermore, that sugars have a capacity to store information that is many orders of magnitude higher than that of nucleotides and amino acids. This makes us realize that, after nucleotides and amino acids, sugars are a third great family of informational molecules, but how do they transmit their messages to the other components of the cell? The key discovery, on this point, is that the functions that are associated with sugars are not set in motion by the sugars. In most cases, they are set in motion by proteins that interact with the sugars and recognize the specific role that they have in any given set of circumstances. These sugar-binding proteins became popular in the early 1900s mainly because they served to determine the chemical structure of the ABO blood groups and were originally called agglutinins. In 1954, however, Boyd argued that they should be given a new name that reflects the unique function that they actually perform, i.e., the highly specific selection of carbohydrates. To this purpose he proposed to call them lectins, on the ground that this term derives “from the Latin lectus, the past participle of legere meaning to pick, choose or select” (Boyd 1954). The next step in the discovery of the informational properties of the sugars was the recognition, by Hans-Joachim Gabius, that their messages must be decoded in order to have biological effects, and that lectins are the decoding devices in this process. Gabius, in other words, realized that lectins are adaptors, molecules that act as intermediaries between sugars and biological reactions and establish connections between them that are not determined by physical necessity. This is why he proposed that there is a Sugar code at the basis of the communication processes that involve sugars, and that “lectins are the translators of the Sugar code”.
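A rough counting argument shows why sugars can store more information per unit than nucleotides or amino acids. The sketch below is a toy model under loud assumptions: it counts only linear chains, and the values used for sugars (10 monosaccharide types and 8 ways of forming each bond, for instance four attachment positions times two anomeric configurations) are illustrative choices, not measured figures. The function name `linear_sequences` is hypothetical.

```python
def linear_sequences(alphabet_size, length, linkage_options=1):
    """Count linear chains of `length` units when each of the
    length - 1 bonds can be formed in `linkage_options` distinct ways."""
    return alphabet_size ** length * linkage_options ** (length - 1)

# Nucleotides and amino acids are joined by a single kind of bond.
nucleotides = linear_sequences(4, 6)
peptides = linear_sequences(20, 6)

# Assumed toy values for sugars: 10 monosaccharides, 8 bond variants.
sugars = linear_sequences(10, 6, linkage_options=8)

print(nucleotides, peptides, sugars)
```

Even with a smaller alphabet than the amino acids, the extra freedom in how each bond is formed pushes the sugar count well above the peptide count in this model, and branching, which is not modelled here, would increase it further.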