MAKING PLANS FOR NIGEL: Defining interfaces between computational representations of linguistic structure and output systems: Adding intonation, punctuation and typography systems to the PENMAN system. 

Chapter 5: The Phonological Interface

5.0  Outline

This chapter is similar in function to the last, but it deals with phonology rather than graphology.  The phonological interfacial code was introduced in chapter two (it appears in Appendix II, together with Halliday’s notation for intonation and rhythm).  The main section (5.1) describes the output system in much more general terms than were used in the description of GEKKO.  There is also a brief discussion (5.2) of the broader implications for work on PENMAN.

5.1  The output program: how it works

It is important to note that the discussion below is not a critical survey of the literature on intonation; it is purely descriptive of the output system, which is part of the text-to-speech system being developed by King and Vonwiller.  The recent literature on intonation has been drawn on extensively in the development of this speech synthesiser, but mainly for the empirical research, not for theoretical models.  A good source for a summary of the recent literature, its main themes and theories, is Silverman (1987).

For the purposes of this thesis it is not important to know exactly what the synthesiser does.  In fact, it follows from the interface principles that the internal organisation of the synthesiser should not have to be known, and it must be possible to hook up to different synthesisers.  The motivation for the interfacial code has come from work on grammar and meaning; the process of converting that code into sound can be thought of in purely pragmatic terms.  As long as the machine works, it is unimportant how it does so.  The design of the speech synthesiser is not motivated systemically, so it will not be discussed on the same terms as other components of the NIGEL system.

There is no room here to describe the output program in detail; what follows is a brief summary of the process that will be used to convert into sound a piece of code containing an orthographic string, together with specifications of its Tonality, Tonicity and Tone, and of which words are content words.  The first step is to work out the rhythm of the utterance; the next is to locate the tonic syllable; then the correct sort of tone contour is chosen from a table and “stretched” to fit the tone group.  These steps are not enough, however, to generate natural-sounding speech: it is necessary to take account of a number of syntagmatic aspects of the actual speech signal.  The most important of these syntagmatic aspects are discussed below, following the summary of the way that the paradigmatic variables (tonicity, tonality, etc.) are synthesised together.

The simplest way of assigning rhythm to an utterance is to insert a foot boundary before every content word.  Silverman (1987) writes that such an approach results in “careful sounding” speech.  This can be illustrated using one of Halliday’s examples.  In the interfacial code used here the text would look like this (it is the second sentence with which we are concerned; the first is provided for context):

5.1
In this job, Anne, we’re working with silver.  //Now Csilver [N Cneeds to have Clove]//.

In Halliday’s notation, with ‘/’ representing a foot-boundary, and the word containing the tonic syllable in bold-face, 5.1 becomes 5.2:

5.2

// ^ Now / silver / needs to have / love //

A foot boundary has been placed in front of each content word, and the tonic is the last stressed syllable in the New element.

However, this process cannot generate the correct rhythm in all cases.  As was noted in chapter two, the information structure is somewhat indeterminate (it is often not clear where the New element starts), but there are some cases in which Given/New structure conditions rhythm.  Halliday provides two versions of 5.1 to illustrate this.

5.3
(a)
I’ll tell you about silver. // it [N Cneeds to have Clove]//.
Phonetically: // ^ it / needs to have / love //
(b)
I’ll tell you what silver needs to have. // it Cneeds to have [NEW Clove]//
Phonetically: // ^ it needs to have / love //

In (a) needs is salient, which indicates that it is the beginning of the New; whereas in (b) it is part of the initial proclitic foot, reflecting the fact that in this instance it is Given, being mentioned in the preceding clause.  But not all given elements are characterised by this absence of salience. (Halliday 1985.a p.276)

So, in cases like example 5.3, an algorithm that simply places foot boundaries before content words would fail to capture the distinction between (a), for which the algorithm would work, and (b), for which it would not, because in (b) the content word “needs” does not get a foot boundary.  The more sophisticated algorithm, capable of making this distinction, is still under development.
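
The simple content-word rule can be sketched in code.  The Python fragment below is an illustration only and is not part of the output program: the function names are hypothetical, and it assumes that a leading capital C always marks a content word and that [N or [NEW opens the New element.  It places a foot boundary before every content word, opens the utterance with a silent beat (^) when the first word is not a content word, and takes the last content word in the New element as the word containing the tonic.

```python
def parse_tone_group(code):
    """Split one tone group of interfacial code into
    (word, is_content, is_new) triples.

    Assumptions (not taken from the thesis): a leading capital C marks a
    content word; "[N" or "[NEW" opens the New element and "]" closes it.
    """
    body = code.strip().rstrip(".").strip().strip("/")
    words, in_new = [], False
    for token in body.split():
        if token in ("[N", "[NEW"):      # opening bracket of the New element
            in_new = True
            continue
        closes_new = token.endswith("]")
        token = token.rstrip("]")
        if token:
            is_content = len(token) > 1 and token.startswith("C")
            word = token[1:] if is_content else token
            words.append((word, is_content, in_new))
        if closes_new:
            in_new = False
    return words


def naive_feet(words):
    """Place a foot boundary before every content word."""
    feet = []
    for word, is_content, _ in words:
        if is_content:
            feet.append([word])           # a content word opens a new foot
        elif not feet:
            feet.append(["^", word])      # proclitic foot with a silent beat
        else:
            feet[-1].append(word)         # other words join the current foot
    return feet


def tonic_word(words):
    """Return the last content word inside the New element, if any."""
    candidates = [w for w, is_content, is_new in words if is_content and is_new]
    return candidates[-1] if candidates else None


def render(feet):
    """Write the feet in Halliday-style notation."""
    return "// " + " / ".join(" ".join(foot) for foot in feet) + " //"
```

Applied to the second sentence of 5.1, this sketch gives // ^ Now / silver / needs to have / love // with “love” as the tonic word, as in 5.2.  Applied to 5.3 it gives the same feet for (a) and (b), which is exactly the failure described above.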

Once the rhythm of the utterance and the word containing the tonic syllable have been worked out, the next step is to take an intonation contour from a table of intonation contours and fit it to the utterance.  This is done by a process of scaling: the contour is stretched so that it ‘fits’.
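
The scaling step can be illustrated with a minimal Python sketch.  Everything in it is an assumption made for illustration: the table entries are invented relative pitch values (0 is the bottom of the speaker’s range, 1 the top), the names TONE_TABLE, stretch_contour and to_hertz are hypothetical, and plain linear interpolation stands in for whatever scaling the output program actually performs.

```python
# Hypothetical table of tone contours as relative pitch targets.
# The values are illustrative only, not those of the output program.
TONE_TABLE = {
    1: [1.0, 0.6, 0.2],   # tone 1: falling
    5: [0.3, 1.0, 0.2],   # tone 5: rising-falling
}


def stretch_contour(contour, n_frames):
    """Linearly resample a contour so that it spans n_frames points."""
    if n_frames < 2 or len(contour) < 2:
        raise ValueError("need at least two points on each side")
    span = len(contour) - 1
    stretched = []
    for i in range(n_frames):
        pos = i * span / (n_frames - 1)   # position along the table contour
        lo = int(pos)
        hi = min(lo + 1, span)
        frac = pos - lo
        stretched.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return stretched


def to_hertz(relative, floor_hz=80.0, range_hz=120.0):
    """Map relative pitch values onto an assumed speaker range in Hz."""
    return [floor_hz + range_hz * value for value in relative]
```

For example, stretch_contour(TONE_TABLE[5], 9) spreads the rising-falling shape over nine frames, keeping its start, peak and end values.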

As with the rest of the treatment of phonology, this chapter backgrounds pre- and post-tonics.  Pre- and post-tonics are assigned to each non-tonic foot.  These are generated somewhat like the major intonation contour.  In Vonwiller’s system appropriate contours are assigned and may be internally randomised, so that the output does not become monotonous.  It would be reasonable to expect a very general system to be able to choose whether it ‘wanted’ to sound monotonous or animated.  For the time being the meaningful specification of pre- and post-tonics is not supported.

Syntagmatic organisation

Silverman points out (1989, 5.3) that a simple linear fitting of intonation contours would result in unnatural-sounding speech, so it is necessary to shape the contour so that it conforms to the syntagmatic, phonetic rules of intonation.  A list of some of these syntagmatic factors appears below; the list is by no means exhaustive, but it is representative of the major syntagmatic factors that the output program will take into account.

  1. Successive intonation contours typically exhibit declination; that is, the high and low points of the contours get progressively lower.  Ladd (1984 p.54) says that ‘declination’ ‘may be defined as the gradual decline in the phonetic frame of reference’.  There is some contention about the nature of declination (Ladd 1984).  Adjacent high points in intonation are perceived to have the same pitch if the second is slightly lower.  The break that comes between a series of contours which have exhibited downstepping and the marshalling of resources to ‘start high’ again has been characterised by various theorists as a break between ‘prosodic paragraphs’ (Silverman 1989 ch.2).  As such, the grouping of IUs represents a possible further meaningful extension to NIGEL (see section 5.2 below).

  2. There are various effects on the duration of syllables, depending on their tonicity.  The most noteworthy of these is that tonic syllables are, in general, longer than non-tonic ones (Eady et al. 1986), and the later in the string the tonic occurs, the less the lengthening.  Although they are too complex to detail here, the output program will implement a number of variations in the duration of the tonic depending on its position in the string.

  3. <inserted because of an error in numbering in the original>

  4. Each sound segment, or phoneme, has a certain intrinsic frequency (Silverman 1987, 4.8).  It is necessary to allow for this when producing intonation contours: the relevant deviation in frequency will be added to the tone contour for each segment.

  5. Each intonation contour will have different table entries for different places in the tone group.  Figure 5.1 shows this for tone contour five, “rising-falling”: the starting and finishing points of the contour are specified with respect to each other, depending on the position of the tonic in the tone-group.

    FIGURE 5.1
    Different pitch shapes are chosen for TONE 5 for different placements of the tonic.

  6. When it has been selected from the table and scaled to fit the utterance, each contour is smoothed to look natural.  The work done by Pierrehumbert and collaborators (see Silverman 1987) provides a useful description of the physical properties of intonation contours, which has been used in the design of the output program, but the tone sequence (TS) model of generation is not used here.  TS models could be said to treat this smoothing process as a process of ‘joining the dots’.  In contrast to this approach, King’s program performs mathematical transformations on the chosen contour to make it ‘look’ natural.

  7. As mentioned above, the assignment of pre- and post-tonics is currently treated as a syntagmatic matter.  The program will have tables of intonation contours for different pre- and post-tonic segments.  These are treated in much the same way as the major tonic contours in that they will have different shapes according to their position in the tone-group, and they will be smoothed to sound natural.
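
Three of the syntagmatic factors listed above (declination, intrinsic segment frequency and smoothing) lend themselves to a short Python sketch.  It is offered under stated assumptions rather than as a description of King’s program: the function names are hypothetical, declination is modelled as a constant scaling factor per contour, the intrinsic deviations are supplied by the caller, and a moving average stands in for the mathematical transformations, which are not specified here.

```python
def apply_declination(contours, step=0.95):
    """Lower each successive contour by a constant factor.

    The factor is an illustrative assumption; the first contour is
    left unchanged and the i-th one is scaled by step ** i.
    """
    return [[f0 * step ** i for f0 in contour]
            for i, contour in enumerate(contours)]


def add_intrinsic_offsets(contour, offsets_hz):
    """Add a per-frame deviation (in Hz) for the intrinsic frequency
    of the segment sounding at that frame."""
    if len(contour) != len(offsets_hz):
        raise ValueError("one offset is needed for every frame")
    return [f0 + delta for f0, delta in zip(contour, offsets_hz)]


def smooth(contour, window=3):
    """Centred moving average, with a shortened window at the edges.

    This is only a stand-in for the transformations that make the
    contour 'look' natural.
    """
    half = window // 2
    smoothed = []
    for i in range(len(contour)):
        lo = max(0, i - half)
        hi = min(len(contour), i + half + 1)
        smoothed.append(sum(contour[lo:hi]) / (hi - lo))
    return smoothed
```

In such a sketch the three steps would be applied in the order given: declination across the contours of a prosodic paragraph, then the segmental deviations, then smoothing of each contour.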

The approach to producing intonation taken here is to treat it as an engineering problem.  This program is driven from above and from without; the meaningful input to and the meaning making success of the program are the important aspects.

So far this thesis has sketched the interfacial codes that will be used for phonological and graphological text.  The focus has been different for graphology and phonology: for graphology, the graphological system itself and the typographic output system have received close attention, while the grammatical systems that are realised in graphology have not been discussed.  The discussion of phonology has touched on grammatical systems, but has been less detailed about the operation of the output program.

5.2  Looking ahead

In contrast to the treatment of graphology, which concentrated on describing all aspects of typographic text, the treatment of phonology has been rather narrower.  One dimension of this narrowness has been the rank-scale; phonological organisation into larger units than the tone group was not considered.  One such larger unit is the prosodic paragraph discussed by Silverman (1987).  Another is the turn, a unit in the organisation of dialogue.

It is expected that the basic notation for prosodic information developed for graphology can be applied to phonology, to allow NIGEL to specify such larger phonological units.  It is not clear whether the indexing system for crossed brackets will be needed in adapting the notation to phonology, but it seems likely that the indices will be needed if verbal art is to be considered.  For instance, sounded poetry can explicitly mark textual organisation and metrical organisation using the phonology, in the way that the graphological categories of line and sentence can ‘cross’ as in example 4.4.

Silverman’s work (1987) on synthesising prosodic paragraphs shows that the prosodic paragraph - a series of tone-groups which are grouped together with declination in tone, and marked off by a final drop in tone - is a unit.  It serves to group text in much the same way as typographic paragraphs.  The prosodic paragraph must consist of a number of whole tone-groups, so a simple bracketing without any indexing seems appropriate: we might expect NIGEL to produce text of the form [\PP\ // IU // IU // IU // IU //], a number of information units grouped together, where PP is for prosodic paragraph.  (Work by Brazil (1985) on Key, which is more like the musical notion of key-signature than Halliday’s grammatical Key, provides a framework for this aspect of intonation.)
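
Because the bracketing needs no indexing, reading and writing it is straightforward.  The following Python sketch uses hypothetical function names and assumes that the notation is exactly of the form shown above; it writes a list of information units as a prosodic paragraph and reads such a string back.

```python
PP_OPEN = "[\\PP\\ //"    # the literal text [\PP\ //
PP_CLOSE = "//]"


def write_prosodic_paragraph(information_units):
    """Bracket whole information units as one prosodic paragraph."""
    if not information_units:
        raise ValueError("a prosodic paragraph needs at least one unit")
    return PP_OPEN + " " + " // ".join(information_units) + " " + PP_CLOSE


def read_prosodic_paragraph(text):
    """Recover the information units from a bracketed paragraph."""
    if not (text.startswith(PP_OPEN) and text.endswith(PP_CLOSE)):
        raise ValueError("not a prosodic paragraph")
    inner = text[len(PP_OPEN):-len(PP_CLOSE)]
    return [unit.strip() for unit in inner.split("//")]
```

Since every unit is a whole tone-group delimited by //, no bracket can cross another, which is why the indices used for graphology are not needed in this sketch.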

There has been no discussion in this thesis so far of dialogue.  Systems like NIGEL are, however, frequently used interactively, for instance in allowing people to access databases using natural language.  It has already been noted above that NIGEL lacks a suitable interaction base to allow the generation of co-operative texts, but there are a number of systemic models of dialogic discourse (see Martin 1990 for a discussion).  One such system is Martin’s, which suggests the units dialogue, exchange and turn as a rank-scale.  Whatever model is adopted for NIGEL, it is expected that, just as for the typographic specification of whole documents, it should be possible to specify certain aspects of the phonetic realisation of phonology for long spans of text.  An example of this would be specifications for the selection of male as opposed to female voice (or the other way around), and high volume, for the whole length of telephone dialogues.

In the future, it might be expected that voice quality, loudness and accent could be chosen in an informed way, in much the same way as fonts and styles are chosen through graphology.