Fuzzy Sets and Systems 132 (2002) 11 – 32

www.elsevier.com/locate/fss

Possibilistic information theory: a coding theoretic approach

Andrea Sgarro ∗
Department of Mathematical Sciences (DSM), University of Trieste, 34100 Trieste, Italy

Received 20 April 2001; accepted 19 November 2001

Abstract

We define information measures which pertain to possibility theory and which have a coding-theoretic meaning. We put forward a model for information sources and transmission channels which is possibilistic rather than probabilistic. In the case of source coding without distortion we define a notion of possibilistic entropy, which is connected to the so-called Hartley's measure; we tackle also the case of source coding with distortion. In the case of channel coding we define a notion of possibilistic capacity, which is connected to a combinatorial notion called graph capacity. In the probabilistic case Hartley's measure and graph capacity are relevant quantities only when the allowed decoding error probability is strictly equal to zero, while in the possibilistic case they are relevant quantities for whatever value of the allowed decoding error possibility; as the allowed error possibility becomes larger the possibilistic entropy decreases (one can reliably compress data to smaller sizes), while the possibilistic capacity increases (one can reliably transmit data at a higher rate). We put forward an interpretation of possibilistic coding, which is based on distortion measures. We discuss an application, where possibilities are used to cope with uncertainty as induced by a "vague" linguistic description of the transmission channel. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Measures of information; Possibility theory; Possibilistic sources; Possibilistic entropy; Possibilistic channels; Possibilistic capacity; Zero-error information theory; Graph capacity; Distortion measures

1. Introduction

When one speaks of possibilistic information theory, usually one thinks of possibilistic information measures, like U-uncertainty, say, and of their use in uncertainty management; the approach which one takes is axiomatic, in the spirit of the validation of Shannon's entropy which is obtained by using Hinčin-type axioms. Here, instead, we define information measures which pertain to possibility theory and which have a coding-theoretic meaning. This kind of operational approach to information measures was first taken by Shannon when he laid down the foundations of information theory in his seminal paper of 1948 [18], and has proved to be quite successful; it has led to such important probabilistic functionals as are source entropy or channel capacity. Below we shall adopt a model for information sources and transmission channels which is possibilistic rather than probabilistic (is based on logic rather than statistics); this will lead us to define a notion of possibilistic entropy and a notion of possibilistic capacity in much the same way as one arrives at the corresponding probabilistic notions. An interpretation of possibilistic coding is discussed, which is based on distortion measures, a notion which is currently used in probabilistic coding.

Partially supported by MURST and GNIM-CNR. Part of this paper, based mainly on Section 5, has been submitted for presentation at Ecsqaru-2001, to be held in September 2001 in Toulouse, France.
∗ Corresponding author. Tel.: +40-6762623; fax: +40-6762636. E-mail address: [email protected] (A. Sgarro).

We are confident that our operational approach may be a contribution to enlighten, if not to disentangle, the vexed question of defining adequate information measures in possibility theory. We recall that both the entropy of a probabilistic source and the capacity of a probabilistic channel are asymptotic parameters; more precisely, they are limit values for the rates of optimal codes, compression codes in the case of sources, and error-correction codes in the case of channels; the codes one considers are constrained to satisfy a reliability criterion of the type: the decoding-error probability of the code should be at most equal to a tolerated value ε, 0 ≤ ε < 1. A streamlined description of source codes and channel codes will be given below in Sections 4 and 5; even from our fleeting hints it is however apparent that, at least a priori, both the entropy of a source and the capacity of a channel depend on the value ε which has been chosen to specify the reliability criterion. If in the probabilistic models the mention of ε is usually omitted, the reason is that the asymptotic values for the optimal rates are the same whatever the value of ε, provided however that ε is strictly positive.¹ Zero-error reliability criteria lead instead to quite different quantities, zero-error entropy and zero-error capacity. Now, the problem of compressing information sources at zero error is so trivial that the term zero-error entropy is seldom used, if ever.² Instead, the zero-error problem of data protection in noisy channels is devilishly difficult, and has led to a new and fascinating branch of coding theory, and more generally of information theory and combinatorics, called zero-error information theory, which has been pretty recently overviewed and extensively referenced in [15]. In particular, the zero-error capacity of a probabilistic channel is expressed in terms of a remarkable combinatorial notion called Shannon's graph capacity (graph-theoretic preliminaries are described in Appendix A). So, to be fastidious, even in the case of probabilistic entropy and probabilistic capacity one deals with two step-functions of ε, which can assume only two distinct values, one for ε = 0 and the other for ε > 0. We shall adopt a model of the source and a model of the channel which are possibilistic rather than probabilistic, and shall choose a reliability criterion of the type: the decoding-error possibility should be at most equal to ε, 0 ≤ ε < 1. As shown below, the possibilistic analogues of entropy and capacity exhibit quite a perspicuous step-wise behaviour as functions of ε, and so the mention of ε cannot be disposed of. As for the "form" of the functionals one obtains, it is of the same type as in the case of the zero-error probabilistic measures, even if the tolerated error possibility is strictly positive. In particular, the capacities of possibilistic channels are always expressed in terms of graph capacities; in the possibilistic case, however, as one loosens the reliability criterion by allowing a larger error possibility, the relevant graph changes and the capacity of the possibilistic channel increases.

We describe the contents of the paper. In Section 2, after some preliminaries on possibility theory, possibilistic sources and possibilistic channels are introduced. Section 3 contains two simple lemmas, Lemmas 3.1 and 3.2, which are handy tools apt to "translate" probabilistic zero-error results into the framework of possibility theory. Section 4 is devoted to possibilistic entropy and source coding; we have decided to deal in Section 4 only with the problem of source coding without distortion, and to relegate the more taxing case of source coding with distortion to an appendix (Appendix B); this way we are able to make many of our points in an extremely simple way. In Section 5, after giving a streamlined description of channel coding, possibilistic capacity is defined and a coding theorem is provided.

¹ The entropy and the capacity relative to a positive error probability ε allow one to construct sequences of codes whose probability of a decoding error is actually infinitesimal; it will be argued below that this point of view does not make much sense for possibilistic coding; cf. Remark 4.3.
² No error-free data compression is feasible for probabilistic sources if one insists, as we do below, on using block-codes, i.e., codes whose codewords have all the same length; this is why one has to resort to variable-length codes, e.g., to Huffman codes. As for variable-length coding, the possibilistic theory appears to lack a counterpart for the notion of average length; one should have to choose one of the various aggregation operators which have been proposed in the literature (for the very broad notion of aggregation operators, and of "averaging" aggregations in particular, cf., e.g., [12] or [16]). Even if one insists on using block-codes, the problem of data compression at zero error is far from trivial when a distortion measure is introduced; cf. Appendix B. In this paper we deal only with the basics of Shannon's theory, but extensions are feasible to more involved notions, compound channels, say, or multi-user communication (as for these information-theoretic notions cf., e.g., [3] or [4]).

Section 6 explores the consequences of changing the reliability criterion used in Section 5: one requires that the average error possibility should be small, rather than the maximal error possibility.³ Up to Section 6, our point of view is rather abstract: the goal is simply to understand what happens when one replaces probabilities by possibilities in the standard models for data transmission. A discussion of the practical meaning of our proposal is instead deferred to Section 7: we put forward an interpretation of the possibilistic model which is based on distortion measures. We discuss an application to the design of error-correcting telephone keyboards; in the spirit of "soft" mathematics possibilities are seen as numeric counterparts for linguistic labels, and are used to cope with uncertainty as induced by "vague" linguistic information. Section 7 points also to future work, which does not simply aim at a possibilistic translation and generalization of the probabilistic approach. Open problems are mentioned, which might prove to be stimulating also from a strictly mathematical viewpoint. In this paper we take the asymptotic point of view which is typical of Shannon theory, but one might prefer to take the constructive point of view of algebraic coding, and try to provide finite-length code constructions, as those hinted at in Section 7. We deem that the need for a solid theoretical foundation of "soft" coding, as possibilistic coding basically is, is proved by the fact that several ad hoc coding algorithms are already successfully used in practice, e.g., those for compressing images, which are not based on probabilistic descriptions of the source or of the channel (an exhaustive list of source coding algorithms is to be found in [21]). Probabilistic descriptions, which are derived from statistical estimates, are often too costly to obtain, or even unfeasible, and at the same time they are uselessly detailed.

The paper aims at a minimum level of self-containment, and so we have shortly re-described certain notions of information theory which are quite standard; for more details we refer the reader, e.g., to [3] or [4]. As for possibility theory, and in particular for a clarification of the elusive notion of non-interactivity, which is often seen as the natural possibilistic analogue of probabilistic independence (cf. Section 2), we mention [5,6,9,11,12,16,23].

³ The new possibilistic frame includes the traditional zero-error probabilistic frame, as argued in Section 3: it is enough to take possibilities which are equal to zero when the probability is zero, and equal to one when the probability is positive, whatever its value. However, the consideration of possibility values which are intermediate between zero and one does enlarge the frame; cf. Theorem 6.1 in Section 6, and the short comment made there just before giving its proof.

2. Possibilistic sources and possibilistic channels

We recall that a possibility distribution Π over a finite set A = {a1, …, ak}, called the alphabet, is defined by giving a possibility vector Π = (π1, π2, …, πk) whose components πi are the possibilities Π(ai) of the k singletons ai (1 ≤ i ≤ k, k ≥ 2):

Π(ai) = πi,   0 ≤ πi ≤ 1,   max_{1≤i≤k} πi = 1.

The possibility⁴ of each subset A ⊆ A is the maximum of the possibilities of its elements:

Π(A) = max_{ai ∈ A} πi.    (2.1)

In particular Π(∅) = 0, Π(A) = 1. In logical terms taking a maximum means that event A is possible (at a given level) when at least one of its elements is so, in the sense of a logical disjunction. Instead, probability distributions are defined through a probability vector P = (p1, p2, …, pk), P(ai) = pi, 0 ≤ pi ≤ 1, Σ_{1≤i≤k} pi = 1, and have an additive nature:

P(A) = Σ_{ai ∈ A} pi.
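As a small computational illustration of the maxitive versus additive behaviour just described, the following Python sketch (ours, purely illustrative; the function names and numerical values are not taken from the paper) computes the possibility and the probability of an event over a four-letter alphabet.

```python
# Minimal sketch: possibility of an event is the max of its singletons (2.1),
# probability is the sum; names and numbers are illustrative stand-ins.

def poss_of_event(poss, event):
    """Possibility of a subset of the alphabet: maximum over its elements."""
    return max(poss[a] for a in event) if event else 0.0

def prob_of_event(prob, event):
    """Probability of a subset of the alphabet: sum over its elements."""
    return sum(prob[a] for a in event)

poss = {"a1": 1.0, "a2": 0.5, "a3": 0.5, "a4": 0.0}    # max component must be 1
prob = {"a1": 0.5, "a2": 0.25, "a3": 0.25, "a4": 0.0}  # components must sum to 1

event = {"a2", "a3", "a4"}
print(poss_of_event(poss, event))  # 0.5 (maxitive)
print(prob_of_event(prob, event))  # 0.5 (additive)
```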

With respect to probabilities, an empirical interpretation of possibilities is less clear. The debate on the meaning and the use of possibilities is an ample and long-standing one; the reader is referred to standard texts on possibility theory, e.g., those quoted at the end of Section 1; cf. also Section 7, where the applicability of our model to real-world data transmission is discussed.

⁴ The fact that the symbol Π is used both for vectors and for distributions will cause no confusion; below the same symbol will be used also to denote a stationary and non-interactive source, since the behaviour of the latter is entirely specified by the vector Π. Similar conventions will be tacitly adopted also in the case of probabilistic sources, and of probabilistic and possibilistic channels.

The probability distribution P over A can be extended in a stationary and memoryless way to a probability distribution P^n over the Cartesian power A^n by setting for each sequence x = x1 x2 … xn ∈ A^n:

P^n(x) = ∏_{1≤i≤n} P(xi).

We recall that the elements of A^n are the k^n sequences of length n built over the alphabet A. Each such sequence can be interpreted as the information which is output in n time instants by a stationary and memoryless source, or SML source. The memoryless nature of the source is expressed by the fact that the n probabilities P(xi) are multiplied. Similarly, we shall extend the possibility distribution Π in a stationary and non-interactive way to a possibility distribution Π^[n] over the Cartesian power A^n:

Definition 2.1. A stationary and non-interactive information source over the alphabet A is defined by setting for each sequence x ∈ A^n:

Π^[n](x) = min_{1≤i≤n} Π(xi).

In logical terms, this means that the occurrence of sequence x = x1 x2 … xn is declared possible (at a given level) when this is so for all of the letters xi, in the sense of a logical conjunction. An interpretation of non-interactivity in our models of sources and channels is discussed in Section 7.

Let A = {a1, …, ak} and B = {b1, …, bh} be two alphabets, called in this context the input alphabet and the output alphabet, respectively. Probabilistic channels are usually described by giving a stochastic matrix W whose rows are headed to the input alphabet A and whose columns are headed to the output alphabet B. We recall that the k rows of such a stochastic matrix are probability vectors over the output alphabet B; each entry W(b|a) is interpreted as the transition probability from the input letter a ∈ A to the output letter b ∈ B. A stationary and memoryless channel W^n, or SML channel, extends W to n-tuples, and is defined by setting for each x ∈ A^n and each y = y1 y2 … yn ∈ B^n:

W^n(y|x) = W^n(y1 y2 … yn | x1 x2 … xn) = ∏_{i=1}^{n} W(yi|xi).    (2.2)

Note that W^n is itself a stochastic matrix whose rows are headed to the sequences in A^n, and whose columns are headed to the sequences in B^n. The memoryless nature of the channel is expressed by the fact that the n transition probabilities W(yi|xi) are multiplied. We now define the possibilistic analogue of stochastic (probabilistic) matrices. The k rows of a possibilistic matrix Π with h columns are possibility vectors over the output alphabet B. Each entry Π(b|a) will be interpreted as the transition possibility⁵ from the input letter a ∈ A to the output letter b ∈ B; cf. the example given below. In Definition 2.2 Π is such a possibilistic matrix.

Definition 2.2. A stationary and non-interactive channel, or SNI channel, Π^[n], extends Π to n-tuples and is defined as follows:

Π^[n](y|x) = Π^[n](y1 y2 … yn | x1 x2 … xn) = min_{1≤i≤n} Π(yi|xi).    (2.3)

Products as in (2.2) are replaced in (2.3) by a minimum operation; this expresses the non-interactive nature of the extension. Note that Π^[n] is itself a possibilistic matrix whose rows are headed to the sequences in A^n, and whose columns are headed to the sequences in B^n. Taking the minimum of the n transition possibilities Π(yi|xi) can be interpreted as a logical conjunction: only when all the transitions are possible (at a given level) is it possible, at that level, to obtain output y from input x; cf. also Section 7. If B is a subset of B^n, one has in accordance with (2.1):

Π^[n](B|x) = max_{y∈B} Π^[n](y|x).

⁵ Of course transition probabilities and transition possibilities are conditional probabilities and conditional possibilities, respectively, as made clear by our notation which uses a conditioning bar. We have avoided mentioning explicitly the notion of conditional possibilities because they are the object of a debate which is far from being closed (cf. e.g., Part II of [5]); actually, the worst problems are met when one starts by assigning a joint distribution and wants to compute the marginal and conditional ones. In our case it is instead conditional possibilities that are the starting point: as argued in [2], "prior" conditional possibilities are not problematic, or rather they are no more problematic than possibilities in themselves.
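The non-interactive extensions of Definitions 2.1 and 2.2 amount to replacing products by minima; the short Python sketch below (our own illustration, with hypothetical names) makes the parallel with the memoryless probabilistic extensions explicit.

```python
# Sketch of the stationary extensions: memoryless/probabilistic uses a product,
# non-interactive/possibilistic uses a minimum. Names are illustrative.
from math import prod

def p_n(P, x):
    """P^n(x): product of single-letter probabilities (SML source)."""
    return prod(P[xi] for xi in x)

def poss_n(Pi, x):
    """Pi^[n](x): minimum of single-letter possibilities (SNI source, Def. 2.1)."""
    return min(Pi[xi] for xi in x)

def w_n(W, y, x):
    """W^n(y|x): product of transition probabilities, as in (2.2)."""
    return prod(W[xi][yi] for xi, yi in zip(x, y))

def poss_channel_n(Pi, y, x):
    """Pi^[n](y|x): minimum of transition possibilities, as in (2.3)."""
    return min(Pi[xi][yi] for xi, yi in zip(x, y))

Pi_src = {"a": 1.0, "b": 0.5}
print(poss_n(Pi_src, "aab"))   # 0.5
```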

Example 2.1. For A = B = {a, b} we show a possibilistic matrix Π and its "square" Π^[2], which specifies the transition possibilities from input couples to output couples. The possibility that a is received when b is sent is α; this is also the possibility that aa is received when ab is sent, say; 0 ≤ α ≤ 1. Take B = {aa, bb}; then Π^[2](B|ab) = max[α, 0] = α. In Section 6 this example will be used assuming α ≠ 0, α ≠ 1.

Π  | a  b
---+------
a  | 1  0
b  | α  1

Π^[2] | aa  ab  ba  bb
------+----------------
aa    |  1   0   0   0
ab    |  α   1   0   0
ba    |  α   0   1   0
bb    |  α   α   α   1
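As a quick numerical check of Example 2.1, the snippet below (ours, illustrative only; α is given the stand-in value 0.5) recomputes Π^[2](·|ab) and the set possibility Π^[2](B|ab) for B = {aa, bb}.

```python
# Recomputing Example 2.1 with a concrete stand-in value for alpha.
alpha = 0.5
Pi = {"a": {"a": 1.0, "b": 0.0},
      "b": {"a": alpha, "b": 1.0}}   # Pi[input][output]

def poss2(y, x):
    """Pi^[2](y|x): minimum over the two positions (Definition 2.2)."""
    return min(Pi[x[0]][y[0]], Pi[x[1]][y[1]])

print(poss2("aa", "ab"))                           # alpha
print(max(poss2(y, "ab") for y in ("aa", "bb")))   # Pi^[2](B|ab) = alpha
```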

3. A few lemmas

Sometimes the actual value of a probability does not matter, what matters is only whether that probability is zero or non-zero, i.e., whether the corresponding event E is "impossible" or "possible". The canonical transformation maps probabilities to binary (zero-one) possibilities by setting Poss{E} = 0 if and only if Prob{E} = 0, else Poss{E} = 1; this transformation can be applied to the components of a probability vector P or to the components of a stochastic matrix W to obtain a possibility vector Π or a possibilistic matrix Π, respectively. Below we shall introduce an equivalence relation called ε-equivalence which in a way extends the notion of a canonical transformation; here and in the sequel ε is a real number such that 0 ≤ ε < 1. It will appear that a vector Π or a matrix Π obtained canonically from P or from W are ε-equivalent to P or to W, respectively, for whatever value of ε.

Definition 3.1. A probability vector P and a possibility vector Π over alphabet A are said to be ε-equivalent when the following double implication holds ∀a ∈ A:

P(a) = 0 ⇔ Π(a) ≤ ε.

The following lemma shows that ε-equivalence, rather than a relation between letters, is a relation pertaining to the extended distributions P^n and Π^[n], seen as set-functions over A^n:

Lemma 3.1. Fix n ≥ 1. The probability vector P and the possibility vector Π are ε-equivalent if and only if the following double implication holds ∀A ⊆ A^n:

P^n(A) = 0 ⇔ Π^[n](A) ≤ ε.

Proof. To prove that the double implication implies ε-equivalence, just take A = {aa…a} for each letter a ∈ A. Now we prove that if P and Π are ε-equivalent then the double implication in Lemma 3.1 holds true. First assume A is a singleton, and contains only sequence x. The following chain of double implications holds:

P^n(x) = 0 ⇔ ∃i: P(xi) = 0 ⇔ ∃i: Π(xi) ≤ ε ⇔ min_i Π(xi) ≤ ε ⇔ Π^[n](x) ≤ ε.

This means that, if the two vectors P and Π are ε-equivalent, so are also P^n and Π^[n], seen as vectors with k^n components. Then the following chain holds too, whatever the size of A:

P^n(A) = 0 ⇔ ∀x ∈ A: P^n(x) = 0 ⇔ ∀x ∈ A: Π^[n](x) ≤ ε ⇔ max_{x∈A} Π^[n](x) ≤ ε ⇔ Π^[n](A) ≤ ε.

However simple, Lemma 3.1 and its straightforward generalization to channels, Lemma 3.2 below, are the basic tools used to convert probabilistic zero-error coding theorems into possibilistic ones.
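To make the canonical transformation and ε-equivalence concrete, here is a small Python sketch (our own illustration, not part of the paper's apparatus) that maps a probability vector to its canonical zero-one possibility vector and tests Definition 3.1.

```python
# Canonical transformation and epsilon-equivalence (Definition 3.1); illustrative only.

def canonical(P):
    """Map probabilities to binary possibilities: 0 stays 0, positive becomes 1."""
    return {a: (0.0 if p == 0 else 1.0) for a, p in P.items()}

def eps_equivalent(P, Pi, eps):
    """P(a) = 0  <=>  Pi(a) <= eps, for every letter a."""
    return all((P[a] == 0) == (Pi[a] <= eps) for a in P)

P  = {"a": 0.5, "b": 0.25, "c": 0.25, "d": 0.0}
Pi = canonical(P)                                              # {"a":1, "b":1, "c":1, "d":0}
print(all(eps_equivalent(P, Pi, e) for e in (0.0, 0.3, 0.9)))  # True, whatever eps in [0,1)
```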


Definition 3.2. A stochastic matrix W and a possibilistic matrix Π are said to be ε-equivalent when the following double implication holds ∀a ∈ A, ∀b ∈ B:

W(b|a) = 0 ⇔ Π(b|a) ≤ ε.

Lemma 3.1 soon generalizes as follows:

Lemma 3.2. Fix n ≥ 1. The stochastic matrix W and the possibility matrix Π are ε-equivalent if and only if the following double implication holds ∀x ∈ A^n, ∀B ⊆ B^n:

W^n(B|x) = 0 ⇔ Π^[n](B|x) ≤ ε.

In Sections 5 and 6 on channel coding we shall need the following notion of confoundability between letters: two input letters a and a′ are confoundable for the probabilistic matrix W if and only if there exists at least an output letter b such that the transition probabilities W(b|a) and W(b|a′) are both strictly positive. Given matrix W, one can construct a confoundability graph G(W) whose vertices are the letters of A by joining two letters by an edge if and only if they are confoundable (graph-theoretic notions are described in Appendix A). We now define a similar notion for possibilistic matrices; to this end we introduce a proximity index γ between possibility vectors Π = (π1, π2, …) and Π′ = (π′1, π′2, …), which in our case will be possibility vectors over the output set B:

γ(Π, Π′) = max_{1≤i≤h} [πi ∧ π′i].

Above the wedge symbol ∧ stands for a minimum and is used only to improve readability. The index γ is symmetric: γ(Π, Π′) = γ(Π′, Π). One has 0 ≤ γ(Π, Π′) ≤ 1, with γ(Π, Π′) = 0 if and only if Π and Π′ have disjoint supports, and γ(Π, Π′) = 1 if and only if there is at least a letter a for which Π(a) = Π′(a) = 1; in particular, this happens when Π = Π′ (we recall that the support of a possibility vector is made up by those letters whose possibility is strictly positive). The proximity index γ will be extended to input letters a and a′, by taking the corresponding rows in the possibilistic matrix Π:

γ_Π(a, a′) = γ(Π(·|a), Π(·|a′)) = max_{b∈B} [Π(b|a) ∧ Π(b|a′)].

Example 3.1. We re-take Example 2.1 above. One has: γ_Π(a, a) = γ_Π(b, b) = 1, γ_Π(a, b) = α. With respect to Π^[2], the proximity of two letter couples x and x′ is either 1 or α, according whether x = x′ or x ≠ x′ (recall that Π^[2] can be viewed as a possibilistic matrix over the "alphabet" of letter couples). Cf. also Examples 5.1, 5.2 and the example worked out in Section 7.

Definition 3.3. Once a possibilistic matrix Π and a number ε are given (0 ≤ ε < 1), two input letters a and a′ are said to be ε-confoundable if and only if their proximity exceeds ε:

γ_Π(a, a′) > ε.

Given Π and ε, one constructs the ε-confoundability graph G_ε(Π), whose vertices are the letters of A, by joining two letters by an edge if and only if they are ε-confoundable for Π.

Lemma 3.3. If the stochastic matrix W and the possibilistic matrix Π are ε-equivalent the two confoundability graphs G(W) and G_ε(Π) coincide.

Proof. We have to prove that, under the assumption of ε-equivalence, any two input letters a and a′ are confoundable for the stochastic matrix W if and only if they are ε-confoundable for the possibilistic matrix Π. The following chain of double implications holds:

a and a′ are confoundable for W ⇔ ∃b: W(b|a) > 0, W(b|a′) > 0 ⇔ ∃b: Π(b|a) > ε, Π(b|a′) > ε ⇔ max_b [Π(b|a) ∧ Π(b|a′)] > ε ⇔ γ_Π(a, a′) > ε ⇔ a and a′ are ε-confoundable for Π.
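The proximity index γ_Π and the ε-confoundability graph are straightforward to compute; the sketch below (illustrative, with our own function names and a stand-in value for α) reproduces Example 3.1 for the matrix of Example 2.1.

```python
# Proximity between input letters and the epsilon-confoundability graph; illustrative.
from itertools import combinations

def gamma(Pi, a, a2):
    """gamma_Pi(a, a') = max over outputs b of min(Pi(b|a), Pi(b|a'))."""
    return max(min(Pi[a][b], Pi[a2][b]) for b in Pi[a])

def confoundability_edges(Pi, eps):
    """Edges of G_eps(Pi): pairs of input letters with proximity > eps."""
    return {(a, a2) for a, a2 in combinations(Pi, 2) if gamma(Pi, a, a2) > eps}

alpha = 0.5
Pi = {"a": {"a": 1.0, "b": 0.0}, "b": {"a": alpha, "b": 1.0}}
print(gamma(Pi, "a", "b"))               # alpha, as in Example 3.1
print(confoundability_edges(Pi, 0.0))    # {("a", "b")}: a and b are 0-confoundable
print(confoundability_edges(Pi, alpha))  # set(): no edge once eps >= alpha
```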


Remark 3.1. The index γ_Π(a, a′) establishes a fuzzy relation between input letters, which may be represented by means of a fuzzy graph G(Π) with vertex set equal to the input alphabet A: each edge (a, a′) belongs to the edge set of G(Π) with a degree of membership equal to γ_Π(a, a′). Then the crisp graphs G_ε(Π) are obtained as (strong) ε-cuts of the fuzzy graph G(Π). By the way, we observe that γ_Π(a, a′) is a proximity relation in the technical sense of fuzzy set theory [17]. For basic notions in fuzzy set theory cf., e.g., [7,12] or [16].

4. The entropy of a possibilistic source

We start by the following general observation, which applies both to source and channel coding. The elements which define a code f, i.e., the encoder f+ and the decoder f− (cf. below), do not require a probabilistic or a possibilistic description of the source, or of the channel, respectively. One must simply choose the alphabets (or at least the alphabet sizes): a primary alphabet A, which is the source alphabet in the case of sources and the input alphabet in the case of channels, and the secondary alphabet B, which is the reproduction alphabet in the case of sources and the output alphabet in the case of channels.⁶ One must also specify a length n, which is the length of the messages which are encoded in the case of sources, and the length of the codewords which are sent through the channel in the case of channels, respectively. Once these elements, A, B and n, have been chosen, one can construct a code f, i.e., a couple encoder/decoder. Then one can study the performance of f by varying the "behaviour" of the source (of the channel, respectively): for example one can first assume that this behaviour has a probabilistic nature, while later one changes to a less committal possibilistic description.

The first coding problem which we tackle is data compression without distortion. The results of this section, or at least their asymptotic counterparts, inclusive of the notion of ε-entropy, might have been obtained as a very special case of data compression with distortion, as explained in Appendix B. The reason for confining the general case with distortion to an appendix is just ease of readability: actually, data compression without distortion as covered in this section offers no real problem from a mathematical point of view, but at the same time is very typical of the novelties which our possibilistic approach to coding presents with respect to the standard probabilistic approach.

We give a streamlined description of what a source code f is; for more details we refer to [3] or [4]. A code f is made up of two elements, an encoder f+ and a decoder f−; the encoder maps the n-sequences of A^n, i.e., the messages output by the information source, to binary strings of a fixed length l called codewords; the decoder maps back codewords to messages in a way which should be "reliable", as specified below. In practice (and without loss of generality), the basic element of a code is a subset C ⊆ A^n of messages, called the codebook. The idea is that, out of the k^n messages output by the information source, only those belonging to the codebook C are given separate binary codewords and are properly recognized by the decoder; should the source output a message which does not belong to C, then the encoder will use any of the binary codewords meant for the messages in C, and so a decoding error will be committed. Thinking of a source which is modelled probabilistically, as is standard in information theory, a good code should trade off two conflicting demands: the binary codewords should be short, so as to ensure compression of data, while the error probability should be small, so as to ensure reliability. In practice, one chooses a tolerated error probability ε, 0 ≤ ε < 1, and then constructs a set C as small as possible with the constraint that its probability be at least as great as 1 − ε. The number n^{−1} log |C| is called the code rate; log |C| is interpreted as the (non-necessarily integer) length of the binary sequences which encode source sequences⁷ and so the rate is measured as number of bits per source letter.

⁶ Actually, in source coding without distortion the primary alphabet and the reproduction alphabet coincide and so the latter is not explicitly mentioned in Section 4.
⁷ In Shannon theory one often incurs in the slight but convenient inaccuracy of allowing non-integer "lengths". By the way, the logarithms here and below are all to the base 2, and so the unit we choose for information measures is the bit. Bars as in |C| denote size, i.e., number of elements. Notice that, not to overcharge our notation, the mention of the length n is not made explicit in the symbols which denote coding functions f and codebooks C.


Consequently, the fundamental optimization problem of probabilistic source coding boils down to finding a suitable codebook C:

Minimize the code rate (1/n) log |C| with the constraint Prob{¬C} ≤ ε    (4.1)

(the symbol ¬ denotes negation, or set complementation). As is usual, we shall consider only Bernoullian (i.e., stationary and memoryless, or SML) sources, which are completely described by the probability vector P over the alphabet letters of A; then in (4.1) the generic indication of probability can be replaced by the more specific symbol P^n. Given the SML source P, its ε-entropy H_ε(P) is defined as the limit of the rates R_n(P, ε) of optimal codes which solve the optimization problem (4.1), obtained as the length n of the encoded messages goes to infinity:

H_ε(P) = lim_{n→+∞} R_n(P, ε).

In the probabilistic theory there is a dramatic difference between the case ε = 0 and the case ε ≠ 0. Usually one tackles the case ε ≠ 0, the only one which allows actual data compression, as it can be proved. It is well-known that the rates of optimal codes tend to the Shannon entropy H(P) as n goes to infinity:

H_ε(P) = H(P) = − Σ_{1≤i≤k} pi log pi,   0 < ε < 1

(we use the script symbol H to distinguish Shannon entropy from the operational entropy H_ε). So Shannon entropy is the asymptotic value of optimal rates. Note that this asymptotic value does not depend on the tolerated error probability ε > 0; only the speed of convergence is affected; this is why the mention of ε is in most cases altogether omitted. Instead, we find it convenient to explicitly mention ε, and say that the (probabilistic) ε-entropy H_ε(P) of the SML source ruled by the probability vector P is equal to the Shannon entropy H(P) for whatever ε > 0.

Let us go to the case ε = 0. In this case the structure of optimal codebooks is extremely simple: each sequence of positive probability must be given its own codeword, and so the optimal codebook is

C = {a: P(a) > 0}^n,

whose rate log |{a: P(a) > 0}| is the same whatever the given length n. Consequently, this is also the value of the zero-error entropy H_0(P) of the SML source:

H_0(P) = log |{a: P(a) > 0}|.    (4.2)
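The two quantities are easy to compute; the snippet below (ours, illustrative) evaluates Shannon's entropy and the zero-error (Hartley) entropy for a sample probability vector, the same vector that reappears in Example 4.1 below.

```python
# Shannon entropy vs zero-error (Hartley) entropy for a sample probability vector.
from math import log2

P = [0.5, 0.25, 0.25, 0.0]

shannon = -sum(p * log2(p) for p in P if p > 0)   # H(P) = 1.5
hartley = log2(sum(1 for p in P if p > 0))        # H_0(P) = log2(3) ~ 1.585

print(shannon, hartley)
```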

For ε strictly positive, one has, as is well known, H_ε(P) = H(P) ≤ H_0(P), the inequality being strict unless P is uniform over its support. Note that the ε-entropy is a step-function of ε: however, the function's step is obtained only if one keeps the value ε = 0, which is rather uninteresting because it corresponds to a situation where no data-compression is feasible, but only data transcription into binary; this happens, say, when one uses ASCII. The zero-error entropy H_0(P) is sometimes called Hartley's measure (cf. [12]); in the present context it might be rather called Hartley's entropy (ε = 0), to be set against Shannon's entropy (ε > 0).

Example 4.1. Take P = (1/2, 1/4, 1/4, 0) over an alphabet A of four letters. For ε = 0 one has H_0(P) = log 3 ≈ 1.585, while H_ε(P) = H(P) = 1.5 whenever 0 < ε < 1.

We now go to a stationary and non-interactive source, or SNI source, over alphabet A, which is entirely described by the possibilistic vector Π over alphabet letters. The source coding optimization problem (4.1) will be replaced by (4.3), where one bounds the decoding error possibility rather than the decoding error probability:

Minimize the code rate (1/n) log |C| with the constraint Π^[n](¬C) ≤ ε.    (4.3)

Now we shall define the possibilistic entropy; as in the probabilistic case, the definition is operational, i.e., is given in terms of a coding problem.

Definition 4.1. Given the stationary and non-interactive source Π, its possibilistic ε-entropy H_ε(Π) is defined as the limit of the rates R_n(Π, ε) of optimal codes which solve the optimization problem (4.3), obtained as the length n goes to infinity:

H_ε(Π) = lim_{n→+∞} R_n(Π, ε).


Because of Lemma 3.1, the constraint in (4.3) can be re-written as P^n{¬C} = 0 for whatever P which is ε-equivalent with Π. This means that solving the minimization problem (4.3) is the same as solving the minimization problem (4.1) at zero error for whatever P such as to be ε-equivalent with Π. So, the following lemma holds:

Lemma 4.1. If P and Π are ε-equivalent, the very same code f = (f+, f−) which is optimal for criterion (4.1) at zero error, with P^n(¬C) = 0, is optimal also for criterion (4.3) at ε-error, with Π^[n](¬C) ≤ ε, and conversely.

A comparison with (4.2) shows that an optimal codebook C for (4.3) is formed by all the sequences of length n which are built over the sub-alphabet of those letters whose possibility exceeds ε:

C = {a: Π(a) > ε}^n.

Consequently:

Theorem 4.1. The possibilistic ε-entropy H_ε(Π) is given by:

H_ε(Π) = log |{a: Π(a) > ε}|,   0 ≤ ε < 1.

The fact that the possibilistic entropy is obtained as the limit of a constant sequence of optimal rates is certainly disappointing; however, asymptotic optimal rates are not always so trivially found, as will appear when we discuss channel coding (Section 5) or source coding with distortion (Appendix B); we shall comment there that reaching an optimal asymptotic value "too soon" (for n = 1) corresponds to a situation where one is obliged to use trivial code constructions. In a way, we have simply proved that in possibilistic source coding without distortion trivial code constructions are unavoidable. Below we stress explicitly the obvious fact that the possibilistic entropy H_ε(Π) is a stepwise non-increasing function of ε, 0 ≤ ε < 1. The steps of the function H_ε(Π) begin in correspondence to the distinct possibility components πi < 1 which appear in vector Π, inclusive of the value 0 even if 0 is not a component of Π; below the term "consecutive" refers to an ordering of the numbers πi.


Proposition 4.1. If 0 ≤ ε < ε′ < 1, then H_ε(Π) ≥ H_ε′(Π). If πi < πi+1 are two consecutive entries in Π, then H_ε(Π) is constant for πi ≤ ε < πi+1.

Example 4.2. Take Π = (1, 1, 1/2, 1/2, 1/4, 0) over an alphabet A of six letters. Then H_ε(Π) = log 5 ≈ 2.322 when 0 ≤ ε < 1/4, H_ε(Π) = log 4 = 2 when 1/4 ≤ ε < 1/2, H_ε(Π) = log 2 = 1 when 1/2 ≤ ε < 1.

Remark 4.1. In the probabilistic case the constraint Prob{¬C} ≤ ε can be re-written in terms of the probability of correct decoding as Prob{C} ≥ 1 − ε, because Prob{C} + Prob{¬C} = 1. Instead, the sum Poss{C} + Poss{¬C} can be strictly larger than 1, and so Poss{C} ≥ 1 − ε is a different constraint. This constraint, however, would be quite loose and quite uninteresting, since the possibility Poss{C} of correct decoding and the error possibility Poss{¬C} can be both equal to 1 at the same time.

Remark 4.2. Unlike in the probabilistic case, in the possibilistic case replacing the "weak" reliability constraint Π^[n]{¬C} ≤ ε by a strict inequality, Π^[n]{¬C} < ε, does make a difference even asymptotically. In this case the Definition 3.1 of ε-equivalence should be modified by requiring P(a) = 0 if and only if Π(a) < ε, 0 < ε ≤ 1. The "strict" possibilistic entropy one would obtain is however the same step-function as H_ε(Π) above, only that the "steps" of the new function would be closed on the right rather than being closed on the left.

Remark 4.3. In the probabilistic case one can produce a sequence of source codes whose rate tends to Shannon entropy and whose error probability goes to zero; in other terms Shannon entropy allows one to code not only with a decoding error probability bounded by any ε > 0, but even with an "infinitesimal" (however, positive) error probability. In our possibilistic case, however, requiring that the error possibility goes to zero is the same as requiring that it is zero for n high enough, as easily perceived by considering that the possibilistic entropy is constant in a right neighbourhood of ε = 0. Remarks 4.1, 4.2 and 4.3, suitably reformulated, would apply also in the case of source coding with distortion as in Appendix B and channel coding as in Section 5, but we shall no further insist on them.
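Theorem 4.1 makes the possibilistic ε-entropy a one-line computation; the sketch below (our own, illustrative) evaluates the step function of Example 4.2.

```python
# Possibilistic epsilon-entropy (Theorem 4.1): H_eps(Pi) = log2 |{a : Pi(a) > eps}|.
from math import log2

def poss_entropy(Pi, eps):
    return log2(sum(1 for p in Pi if p > eps))

Pi = [1, 1, 0.5, 0.5, 0.25, 0]          # the vector of Example 4.2
for eps in (0.0, 0.25, 0.5):
    print(eps, poss_entropy(Pi, eps))   # log2(5) ~ 2.322, then 2.0, then 1.0
```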


5. The capacity of a possibilistic channel

Let A = {a1, …, ak} and B = {b1, …, bh} be two alphabets, called in this context the input alphabet and the output alphabet, respectively. We give a streamlined description of what a channel code is; for more details we refer to [3,4], and also to [15], which is specifically devoted to zero-error information theory. The basic elements of a code f are the encoder f+ and the decoder f−. The encoder f+ is an injective (invertible) mapping which takes uncoded messages onto a set of codewords C ⊆ A^n; the set M of uncoded messages is left unspecified, since its "structure" is irrelevant. Codewords are sent as input sequences through a noisy medium, or noisy channel. They are received at the other end of the channel as output sequences which belong to B^n. The decoder f− takes back output sequences to the codewords of C, and so to the corresponding uncoded messages. This gives rise to a partition of B^n into decoding sets, one for each codeword c ∈ C. Namely, the decoding set Dc for codeword c is Dc = {y: f−(y) = c} ⊆ B^n. The most important feature of a code f = (f+, f−) is its codebook C ⊆ A^n of size |C|. The decoder f−, and so the decoding sets Dc, are often chosen by use of some statistical principle, e.g., maximum likelihood, but we shall not need any special assumption (possibilistic decoding strategies are described in [1,10]). The encoder f+ will never be used in the sequel, and so its specification is irrelevant. The rate Rn of a code f with codebook C is defined as

Rn = (1/n) log |C|.

The number log |C| can be seen as the (not necessarily integer) binary length of the uncoded messages, the ones which carry information; then the rate Rn is interpreted as a transmission speed, which is measured in information bits (bit fractions, rather) per transmitted bit. The idea is to design codes which are fast and reliable at the same time. Once a reliability criterion has been chosen, one tries to find the optimal code for each pre-assigned codeword length n, i.e., a code with highest rate among those which meet the criterion. Let us consider a stationary and memoryless channel W^n, or SML channel, as defined in (2.2).

To declare a code f reliable, one requires that the probability that the output sequence does not belong to the correct decoding set is acceptably low, i.e., below a pre-assigned threshold ε, 0 ≤ ε < 1. If one wants to play safe, one has to insist that the decoding error should be low for each codeword c ∈ C which might have been transmitted (a looser criterion will be examined in Section 6). The reliability criterion which a code f must meet is so:

max_{c∈C} W^n(¬Dc|c) ≤ ε.    (5.1)

We recall that the symbol ¬ denotes negation, or set-complementation; of course the inequality sign in (5.1) can be replaced by an equality sign whenever ε = 0. Once the length n and the threshold ε are chosen, one can try to determine the rate Rn = Rn(W, ε) of an optimal code which solves the optimization problem:

Maximize the code rate Rn so as to satisfy the constraint (5.1).

The job can be quite tough, however, and so one has often to be contented with the asymptotic value of the optimal rates Rn, which is obtained when the codeword length n goes to infinity. This asymptotic value is called the capacity of channel W. For 0 < ε < 1 the capacity Cε is always the same, only the speed of convergence of the optimal rates to Cε is affected by the choice of ε. When one says "capacity" one refers by default to this positive ε-capacity; cf. [3] or [4]. Instead, when ε = 0 there is a dramatic change. In this case one uses the confoundability graph G(W) associated with channel W; we recall that in G(W) two vertices, i.e., two input letters a and a′, are adjacent if and only if they are confoundable; cf. Section 3. If W^n is seen as a stochastic matrix with k^n rows headed to A^n and h^n columns headed to B^n, one can consider also the confoundability graph G(W^n) for the k^n input sequences of length n; two input sequences are confoundable, and so adjacent in the graph, when there is an output sequence which can be reached from any of the two with positive probability.


If C is a maximal independent set in G(W^n), the limit of n^{−1} log |C| when n goes to infinity is by definition the graph capacity C(G(W)) of the confoundability graph G(W).⁸ As easily checked, the codebook C ⊆ A^n of an optimal code is precisely a maximal independent set of G(W^n). Consequently, the zero-error capacity C_0(W) of channel W is equal to the capacity of the corresponding confoundability graph G(W):

C_0(W) = C(G(W)).

The paper [19] which Shannon published in 1956 and which contains these results inaugurated zero-error information theory. Observe however that the last equality gives no real solution to the problem of assessing the zero-error capacity of the channel, but simply re-phrases it in a neat combinatorial language; actually, a single-letter expression of the zero-error capacity is so far unknown, at least in general ("single-letter" means that one is able to calculate the limit so as to get rid of the codeword length n). This unpleasant observation applies also to Theorem 5.1 below.

We now pass to a stationary and non-interactive channel Π^[n], or SNI channel, as defined in (2.3). The reliability criterion (5.1) is correspondingly replaced by:

max_{c∈C} Π^[n](¬Dc|c) ≤ ε.    (5.2)

The optimization problem is now:

Maximize the code rate Rn so as to satisfy the constraint (5.2).

The number ε is now the error possibility which we are ready to accept. Again the inequality sign in (5.2) is to be replaced by the equality sign when ε = 0. A looser criterion based on average error possibility rather than maximal error possibility will be examined in Section 6.

Definition 5.1. The ε-capacity of channel Π is the limit of optimal code rates Rn(Π, ε), obtained as the codeword length n goes to infinity.

⁸ We recall that an independent set in a graph, called also a stable set, is a set of vertices no two of which are adjacent, and so in our case the vertices of an independent set are never confoundable; all these graph-theoretic notions, inclusive of graph capacity, are explained more diffusely in Appendix A.
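For small alphabets the confoundability graph of a stochastic matrix and its independence number can be found by brute force; the following sketch (ours, illustrative only, exponential in the number of letters, with a hypothetical three-letter channel) gives the zero-error rate achievable with codewords of length 1, namely the logarithm of the independence number. Graph capacity itself is the limit over G(W^n) and can exceed this single-letter value, as the pentagon example of Section 5 shows.

```python
# Confoundability graph G(W) of a stochastic matrix and its independence number;
# brute-force illustration, only meant for very small alphabets.
from itertools import combinations
from math import log2

def confoundable(W, a, a2):
    """Two input letters are confoundable if some output has positive probability from both."""
    return any(W[a][b] > 0 and W[a2][b] > 0 for b in W[a])

def independence_number(W):
    letters = list(W)
    best = 1
    for r in range(2, len(letters) + 1):
        for subset in combinations(letters, r):
            if all(not confoundable(W, x, y) for x, y in combinations(subset, 2)):
                best = max(best, r)
    return best

# Hypothetical channel: "a" and "b" share output "0", while "c" shares nothing with them.
W = {"a": {"0": 1.0, "1": 0.0, "2": 0.0},
     "b": {"0": 0.5, "1": 0.5, "2": 0.0},
     "c": {"0": 0.0, "1": 0.0, "2": 1.0}}

print(independence_number(W))          # 2: e.g. {a, c} is an independent set
print(log2(independence_number(W)))    # 1.0 bit per letter at zero error, with length-1 codes
```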


The following lemma is soon obtained from Lemmas 3.2 and 3.3, and in its turn soon implies Theorem 5.1; it states that possibilistic coding and zero-error probabilistic coding are different formulations of the same mathematical problem.

Lemma 5.1. Let the SML channel W and the SNI channel Π be ε-equivalent. Then a code f = (f+, f−) satisfies the reliability criterion (5.1) at zero error for the probabilistic channel W if and only if it satisfies the reliability criterion (5.2) at ε-error for the possibilistic channel Π.

Theorem 5.1. The codebook C ⊆ A^n of an optimal code for criterion (5.2) is a maximal independent set of G_ε(Π^[n]). Consequently, the ε-capacity of the possibilistic channel Π is equal to the capacity of the corresponding ε-confoundability graph G_ε(Π):

C_ε(Π) = C(G_ε(Π)).

Observe that the specification of the decoding sets Dc of an optimal code (and so of the decoding strategy) is obvious: one decodes y to the unique codeword c for which Π^[n](y|c) > ε; there cannot be two codewords with this property, because they would be ε-confoundable, and this would violate independence. If Π^[n](y|c) ≤ ε for all c ∈ C, then y can be assigned to any decoding set, this choice being irrelevant from the point of view of criterion (5.2).

Below we stress explicitly the obvious fact that the graph capacity C_ε(Π) = C(G_ε(Π)) is a stepwise non-decreasing function of ε, 0 ≤ ε < 1; the term "consecutive" refers to an ordering of the distinct components πi which appear in Π (πi can be zero even if zero does not appear as an entry in Π):

Proposition 5.1. If 0 ≤ ε < ε′ < 1, then C_ε(Π) ≤ C_ε′(Π). If πi < πi+1 are two consecutive entries in Π, then C_ε(Π) is constant for πi ≤ ε < πi+1.

Example 5.1. Binary possibilistic channels. The input alphabet is binary, A = {0, 1}; the output alphabet is either the same ("doubly" binary channel), or is augmented by an erasure symbol 2 (binary erasure channel); the corresponding possibilistic matrices Π1 and Π2 are given below:

Π1 | 0  1          Π2 | 0  1  2
---+------         ---+---------
0  | 1  β          0  | 1  0  α
1  | α  1          1  | 0  1  β

with 0 < β ≤ α < 1. As soon checked, one has for the proximities between input letters: γ_Π1(0, 1) = α in the case of the doubly binary channel and γ_Π2(0, 1) = β ≤ α in the case of the erasure channel. The relevant confoundability graphs are G_0(Π1) = G_0(Π2), where the input letters 0 and 1 are adjacent, and G_α(Π1) = G_β(Π2), where they are not. One has C_0(Π1) = C_0(Π2) = 0, C_α(Π1) = C_β(Π2) = 1, and so C_ε(Π1) = C_ε(Π2) = 0 for 0 ≤ ε < β, C_ε(Π1) = 0 < C_ε(Π2) = 1 for β ≤ ε < α, else C_ε(Π1) = C_ε(Π2) = 1. Some of these intervals may vanish when the transition possibilities α and β are allowed to be equal and to take on also the values 0 and 1. Data transmission is feasible when the corresponding capacity is positive. In this case, however, the capacity is "too high" to be interesting, since a capacity equal to 1 in the binary case means that the reliability criterion is so loose that no data protection is required: for fixed codeword length n, the optimal codebook is simply C = A^n. Actually, whenever the input alphabet is binary, one is necessarily confronted with two limit situations which are both uninteresting: either the confoundability graph is complete and the capacity is zero (i.e., the reliability criterion is so demanding that reliable transmission of data is hopeless), or the graph is edge-free and the capacity is maximal (the reliability criterion is so undemanding that data protection is not needed). In Section 7 we shall hint at interactive models for possibilistic channels which might prove to be interesting also in the binary case; cf. Remark 7.1.

Example 5.2. A "rotating" channel. Take k = 5; the quinary input and output alphabet is the same; the possibilistic matrix Π "rotates" the row-vector (1, α, β, 0, 0), in which 0 < β < α < 1:

Π  | a1  a2  a3  a4  a5
---+--------------------
a1 |  1   α   β   0   0
a2 |  0   1   α   β   0
a3 |  0   0   1   α   β
a4 |  β   0   0   1   α
a5 |  α   β   0   0   1

After setting by circularity a6 = a1, a7 = a2, one has: γ_Π(ai, ai) = 1 > γ_Π(ai, ai+1) = α > γ_Π(ai, ai+2) = β, 1 ≤ i ≤ 5. Capacities can be computed as explained in Appendix A: C_0(Π) = 0 (the corresponding graph is complete), C_β(Π) = log √5 (the pentagon graph pops up), C_α(Π) = log 5 (the corresponding graph is edge-free). So C_ε(Π) = 0 for 0 ≤ ε < β, C_ε(Π) = log √5 for β ≤ ε < α, else C_ε(Π) = log 5.
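In the middle regime β ≤ ε < α the ε-confoundability graph is the pentagon, and Shannon's classical length-2 construction already beats single letters: five pairwise non-confoundable input couples exist, giving rate (1/2) log 5 = log √5. The sketch below (our own illustration; the numerical values of α and β are stand-ins) verifies this with the proximity index of Section 3 extended to couples.

```python
# Example 5.2, middle regime: the pentagon graph and Shannon's five length-2 codewords.
from itertools import combinations

alpha, beta = 0.8, 0.3          # stand-in values with 0 < beta < alpha < 1
eps = 0.5                        # any eps with beta <= eps < alpha

row = [1, alpha, beta, 0, 0]
Pi = {i: {j: row[(j - i) % 5] for j in range(5)} for i in range(5)}  # rotating matrix

def gamma(a, a2):
    """Proximity between single input letters."""
    return max(min(Pi[a][b], Pi[a2][b]) for b in range(5))

def gamma2(x, x2):
    """Proximity between input couples: minimum over the two positions."""
    return min(gamma(x[0], x2[0]), gamma(x[1], x2[1]))

# Single letters: every consecutive pair is eps-confoundable (the pentagon edges),
# so at most one codeword of length 1 survives.
print(sorted({(a, a2) for a, a2 in combinations(range(5), 2) if gamma(a, a2) > eps}))

# Shannon's construction: the five couples (i, 2i mod 5) are pairwise NOT eps-confoundable.
code = [(i, (2 * i) % 5) for i in range(5)]
print(all(gamma2(x, x2) <= eps for x, x2 in combinations(code, 2)))   # True
# Five codewords of length 2: rate (1/2) * log2(5) ~ 1.16 > 1 = log2(2).
```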

6. Average-error capacity versus maximal-error capacity

Before discussing an interpretation and an application of the possibilistic approach, we indulge in one more "technical" section. In the standard theory of probabilistic coding the reliability criterion (5.1) is often replaced by the looser criterion:

(1/|C|) Σ_{c∈C} W^n(¬Dc|c) ≤ ε,

which requires that the average probability of error, rather than the maximal probability, be smaller than ε so as to be declared acceptable. Roughly speaking, one no longer requires that all codewords perform well, but is contented whenever "most" codewords do so, and so resorts to an arithmetic mean rather than to a maximum operator. The new criterion being looser for ε > 0, higher rates can be achieved; however one proves that the gain evaporates asymptotically (cf., e.g., [4]). So, the average-error ε-capacity and the maximal-error ε-capacity (the only one we have considered so far) are in fact identical. We shall pursue a similar approach also in the case of possibilistic channels, and adopt the reliability criterion:

(1/|C|) Σ_{c∈C} Π^[n](¬Dc|c) ≤ ε    (6.1)

rather than (5.2). The corresponding optimization problem is:

Maximize the code rate Rn so as to satisfy the constraint (6.1).

For ε > 0 one can achieve better rates than in the case of maximal error, as the following example shows.

Example 6.1. We re-take the 2 × 2 matrix Π of Example 2.1, which is basically the matrix Π1 of Example 5.1 when β = 0. We choose n ≥ 1 and adopt the reliability criterion (5.2) which involves the maximal decoding error. For 0 ≤ ε < α the graph G_ε(Π) is complete and so is also G_ε(Π^[n]). Maximal independent sets are made up by just one sequence: the optimal rate is as low as 0; in practice this means that no information is transmittable at that level of reliability. Let us pass instead to the reliability criterion (6.1) which involves the average decoding error. Let us take a codebook whose codewords are all the 2^n sequences in A^n; each output sequence is decoded to itself. The rate of this code is as high as 1. It is easy to check that the decoding error possibility for each transmitted sequence c is always equal to α, except when c = aa…a is sent, in which case the error possibility is zero. This means that with an error possibility ε such that ((2^n − 1)/2^n) α ≤ ε < α the optimal rate is 0 for criterion (5.2) while it is 1 for criterion (6.1). Observe that the interval where the two optimal rates differ evaporates as n increases.

In analogy to the maximal-error ε-capacity C_ε(Π), the average-error ε-capacity is defined as follows:

Definition 6.1. The average-error ε-capacity C̄_ε(Π) of channel Π is the limit of code rates R̄_n(Π, ε) optimal with respect to criterion (6.1), obtained as the codeword length n goes to infinity. (Our result below will make it clear that such a limit does exist.)
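Example 6.1 above is easy to check numerically: the sketch below (ours, with stand-in values for α and n) enumerates all sequences of length n for the matrix of Example 2.1, decodes each output to itself, and compares the maximal and the average decoding-error possibility.

```python
# Example 6.1: maximal vs average error possibility for the "decode to itself" code.
from itertools import product

alpha, n = 0.5, 3
Pi = {"a": {"a": 1.0, "b": 0.0}, "b": {"a": alpha, "b": 1.0}}

def poss_n(y, x):
    return min(Pi[xi][yi] for xi, yi in zip(x, y))

codebook = ["".join(t) for t in product("ab", repeat=n)]

def error_poss(c):
    """Possibility that something other than c is received when c is sent."""
    return max(poss_n(y, c) for y in codebook if y != c)

errors = [error_poss(c) for c in codebook]
print(max(errors))                  # alpha: the maximal-error criterion fails for eps < alpha
print(sum(errors) / len(errors))    # alpha*(2**n - 1)/2**n: the average criterion can still hold
```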


We shall prove below that also in the possibilistic case the maximal-error capacity and the average-error capacity coincide for all ε. We stress that, unlike Theorem 5.1, Theorem 6.1 is not solved by simply re-cycling a result already available in the probabilistic framework (even if the "expurgation" technique used below is a standard tool of Shannon theory). This shows that the possibilistic framework is strictly larger than the zero-error probabilistic framework, as soon as one allows possibility values which are intermediate between zero and one. From now on we shall assume ε ≠ 0, else (5.2) and (6.1) become one and the same criterion, and there is nothing new to say. Clearly, (6.1) being a looser criterion, the average-error possibility of any pre-assigned code cannot be larger than the maximal-error possibility, and so the average-error ε-capacity of the channel cannot be smaller than the maximal-error ε-capacity: C̄_ε(Π) ≥ C_ε(Π). The theorem below will be proven by showing that also the inverse inequality holds true.

Theorem 6.1. The average-error ε-capacity C̄_ε(Π) and the maximal-error ε-capacity C_ε(Π) of the SNI possibilistic channel Π coincide for whatever admissible error possibility ε, 0 ≤ ε < 1:

C̄_ε(Π) = C_ε(Π).

Proof. Let us consider an optimal code which satisfies the reliability criterion (6.1) for fixed codeword length n and fixed tolerated error possibility ε > 0; since the code is optimal, its codebook C has maximal size |C|. Let

π1 = 0 < π2 < ··· < πr = 1    (6.2)

be the distinct components which appear as entries in the possibilistic matrix Π which specifies the transition possibilities, and so specifies the possibilistic behaviour of the channel we are using; r ≥ 1. Fix codeword c: we observe that the error possibility for c, i.e., the possibility that c is incorrectly decoded, is necessarily one of the values which appear in (6.2), as it is derived from those values by using maximum and minimum operators (we add π1 = 0 even if 0 is not to be found in Π). This allows us to partition the codebook C into r classes Ci, 1 ≤ i ≤ r, by putting into the same class Ci those codewords c whose error possibility is equal precisely to πi (some of the classes Ci may be void). The reliability criterion (6.1) satisfied by our code can be re-written as:

Σ_{1≤i≤r} (|Ci|/|C|) πi ≤ ε.    (6.3)

We can now think of a non-negative random variable X which takes on the values πi, each with probability |Ci|/|C|; to this random variable X we shall apply the well-known Markov inequality (cf., e.g., [4]), which is written as:

Prob{X ≥ θ X̄} ≤ 1/θ,

where X̄ is the expectation of X, i.e., the first side of (6.3), while θ is any positive number. Because of (6.3), which can be written also as X̄ ≤ ε, one has a fortiori:

Prob{X ≥ θ ε} ≤ 1/θ

or, equivalently:

Σ_{i: πi ≥ θε} |Ci| ≤ |C|/θ.

Now we choose θ and set it equal to:

θ = (ε + πj)/(2ε),

where πj is the smallest value in (6.2) such as to be strictly greater than ε. With this choice one has θ > 1; we stress that θ is a constant once ε and Π are chosen; in particular θ does not depend on n. The last summation can be now taken over those values of πi for which:

πi ≥ (ε + πj)/2,

i.e., since there is no πi left between ε and πj, the inequality can be re-written as:

Σ_{i: πi > ε} |Ci| ≤ |C|/θ.

The r classes Ci are disjoint and give a partition of C; so, if one considers those classes Ci for which the error possibility πi is at most ε, one can equivalently write:

Σ_{i: πi ≤ ε} |Ci| ≥ ((θ − 1)/θ) |C|.    (6.4)

Now, the union of the classes on the left side of (6.4) can be used as the codebook of a new code with maximal error possibility ≤ ε. It will be enough to modify the decoder by enlarging in whatever way the decoding sets Dc with error possibility Π^[n](¬Dc|c) = πi ≤ ε, so as to cover B^n; by doing so the error possibility cannot become larger. Of course the new code need not be optimal in the class of all codes which satisfy criterion (5.2) for fixed n and ε; so for its rate R*_n one has

R*_n = (1/n) log Σ_{i: πi ≤ ε} |Ci| ≤ Rn,    (6.5)

where Rn is the optimal rate with respect to criterion (5.2) relative to maximal error. On the other hand, in terms of the rate R̄_n = n^{−1} log |C| optimal with respect to criterion (6.1) relative to average error, (6.4) can be re-written as:

2^{n R*_n} ≥ ((θ − 1)/θ) 2^{n R̄_n}.    (6.6)

In (6.6) the term (θ − 1)/θ is a positive constant which belongs to the open interval ]0, 1[, and so its logarithm is negative. Comparing (6.5) and (6.6), and recalling that Rn ≤ R̄_n:

R̄_n + (1/n) log((θ − 1)/θ) ≤ R*_n ≤ Rn ≤ R̄_n.

One obtains the theorem by going to the limit.

7. An interpretation of the possibilistic model based on distortion measures

We have examined a possibilistic model of data transmission and coding which is inspired by the standard probabilistic model: what we did is simply replacing probabilities by possibilities and independence by non-interactivity, a notion which is often seen as the "right" analogue of probabilistic independence in possibility theory. In this section we shall try to give an interpretation of our possibilistic approach. The example of an application to the design of telephone keyboards will be given. We concentrate on noisy channels and codes for correcting transmission errors; we shall consider sources at the end of the section. The idea is that in some cases statistical likelihood may be effectively replaced by what one might call "structural resemblance".


Suppose that a "grapheme" is sent through a noisy channel which we are unable to describe in all statistical details. A distorted grapheme will be received at the other end of the channel; the repertoires of input graphemes and of output graphemes are supposed to be both finite. We assume that it is plausible⁹ that the grapheme which has been received has a small distortion, or even no distortion at all, from the grapheme which has been sent over the channel; large distortions are instead unplausible. Without real loss of generality we shall "norm" the distortions to the interval [0, 1], so that the occurrence of distortion one is seen as quite unplausible. Correspondingly, the one-complement of the distortion can be seen as an index of "structural resemblance" between the input symbol and the output symbol; with high plausibility this index will have a high value. We shall assign a numeric value to the plausibility by setting it equal precisely to the value of the resemblance index; in other words, we assume the "equality":

plausibility = structural resemblance.    (7.1)

Long sequences of graphemes will be sent through the channel. The distortion between the input sequence x and the output sequence y will depend on the distortions between the single graphemes xi and yi which make up the sequences; to specify how this happens, we shall take inspiration from rate-distortion theory, which is shortly reviewed in Appendix B; cf. also [3] or [4]. We recall here how distortion measures are de*ned. One is given two alphabets, the primary alphabet A and the secondary alphabet B, which in our case will be the alphabet of possible input graphemes and the alphabet of possible output graphemes, respectively. A distortion measure d is given which speci*es the distortions d(a; b) between each primary letter a ∈ A and each secondary letter b ∈ B; for each primary letter a there is at least one secondary letter b such that d(a; b) = 0, which perfectly reproduces a. Distortions d(a; b) are always non-negative, but in our case they are also constrained not to exceed 1. The distortion between letters a and b can be extended to 9 The term plausibility is a technical term of evidence theory; actually, possibilities can be seen as very special plausibilities; cf., e.g., [12]. So, the adoption of a term which is akin to “possibility” is more committal than it may seem at *rst sight.

25

a distortion between sequences x ∈ A n and y ∈ B n in several way; one resorts, e.g., to peak distortion: d∗n (x; y) = max d(xi ; yi )

(7.2)

16i6n

or, more commonly and less demandingly, to average distortion: 1  dn (x; y) = d(xi ; yi ): (7.3) n 16i6n

Let us be very demanding and adopt peak distortion: structurally two sequences resemble each other only when they do so in each position. Following the philosophy of the equality (7.1) above, where the term "plausibility" has been replaced by the more specific term "transition possibility" and where the resemblance is interpreted as the one-complement of the distortion, we set:

Π(b|a) = 1 − d(a, b),   Π[n](y|x) = 1 − d*_n(x, y) = min_{1≤i≤n} Π(y_i|x_i).   (7.4)
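As a concrete illustration of formulas (7.2)–(7.4), the following short Python sketch computes the peak and average distortions between two sequences and the transition possibility they induce; the toy distortion matrix and all the names used are merely illustrative and are not taken from the paper.

from itertools import count  # standard library only

def peak_distortion(x, y, d):
    """d*_n(x, y) = max_i d(x_i, y_i), as in (7.2)."""
    return max(d[a][b] for a, b in zip(x, y))

def average_distortion(x, y, d):
    """d_n(x, y) = (1/n) * sum_i d(x_i, y_i), as in (7.3)."""
    return sum(d[a][b] for a, b in zip(x, y)) / len(x)

def transition_possibility(x, y, d):
    """Pi[n](y|x) = 1 - d*_n(x, y) = min_i (1 - d(x_i, y_i)), as in (7.4)."""
    return min(1 - d[a][b] for a, b in zip(x, y))

# Toy single-letter distortion matrix, normed to [0, 1].
d = {"a": {"a": 0.0, "b": 1/3}, "b": {"a": 1/3, "b": 0.0}}
print(transition_possibility("aab", "abb", d))  # min(1, 2/3, 1) = 2/3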
The assignment (7.4) corresponds precisely to a stationary and non-interactive channel Π. To make our point we now examine a small-scale example. We assume that sequences of circled graphemes out of a five-letter alphabet A (among its graphemes, ⊕ and ⊗) are sent through a channel. Because of noise, some of the bars inside the circle can be erased during transmission; instead, in our model the channel cannot add any bars, and so the repertoire of the output graphemes is a superset of A made up of seven graphemes: A itself plus two further graphemes obtainable from letters of A by erasing bars. We do not have any statistical information about the behaviour of the channel; we shall be contented with the following "linguistic judgements":

It is quite plausible that a grapheme is received as it has been sent.
It is pretty plausible that a single bar has been erased.
It is pretty unplausible that two bars have been erased.
Everything else is quite unplausible.

We shall "numerize" our judgements by assigning the numeric values 1, 2/3, 1/3, 0 to the corresponding possibilities. This is the same as setting the distortions d(a, b) proportional to the number of bars which have been deleted during transmission. Our choice is

enough to specify a possibilistic channel Π, whose matrix is given below; zeroes have not been written, to help readability. Underneath Π we have written the matrix σ which specifies the proximities between the input graphemes; since σ is symmetric, i.e., σ(a, a′) = σ(a′, a), we have written only the upper triangle; cf. the definition of σ in Section 3.
[Matrix Π: transition possibilities from the five input graphemes to the seven output graphemes; the entries shown take the values 1 (no bar erased), 2/3 (one bar erased) and 1/3 (two bars erased), zeroes being omitted.]

[Matrix σ: upper triangle of the symmetric proximity matrix between the five input graphemes; the diagonal entries are 1, the off-diagonal entries shown are 2/3 or 1/3.]
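Before commenting on the two matrices, here is a small computational sketch of how σ and the ε-confoundability graph G_ε(Π) could be derived from Π. It assumes (consistently with the description given below) that the Section 3 proximity of two input letters is the largest value of Π(b|a) ∧ Π(b|a′) over common outputs b, and that two inputs are adjacent in G_ε(Π) when their proximity exceeds ε; the toy channel and all names are illustrative.

from itertools import combinations

def proximity(Pi, a1, a2):
    """sigma(a, a') = max over outputs b of min(Pi(b|a), Pi(b|a'))."""
    outputs = set(Pi[a1]) | set(Pi[a2])
    return max(min(Pi[a1].get(b, 0.0), Pi[a2].get(b, 0.0)) for b in outputs)

def confoundability_graph(Pi, eps):
    """Edge set of G_eps(Pi): pairs of inputs whose proximity exceeds eps."""
    return {(a1, a2) for a1, a2 in combinations(Pi, 2) if proximity(Pi, a1, a2) > eps}

def is_independent(vertices, edges):
    return all((a1, a2) not in edges and (a2, a1) not in edges
               for a1, a2 in combinations(vertices, 2))

# Toy channel: two inputs sharing a "close" output, plus one isolated input.
Pi = {"A": {"A": 1.0, "C": 2/3}, "B": {"B": 1.0, "C": 2/3}, "D": {"D": 1.0}}
print(confoundability_graph(Pi, 1/3))                               # {('A', 'B')}
print(is_independent(["A", "D"], confoundability_graph(Pi, 1/3)))   # True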
Both the components of Π and those of σ are in their own way "resemblance indices". However, those in Π specify the structural resemblance between an input grapheme a and an output grapheme b; this resemblance is 1 when b equals a, is 2/3 when b can be obtained from a by deletion of a single bar, is 1/3 when b can be obtained from a by deleting two bars, and is 0 when b cannot be obtained from a in any of these ways. Instead, the components of σ specify how easy it is to confound input graphemes at the other end of the channel: σ(a, a′) = 1 means a = a′; σ(a, a′) = 2/3 means that a and a′ are different, but there is at least one output grapheme which can be obtained by deletion of a single bar in a, or in a′, or in both; σ(a, a′) = 1/3 means that one has to delete at least two bars from one of the input graphemes, or from both, to reach a common output grapheme. Assuming that the channel Π is stationary and non-interactive means that we are adopting a very strict criterion to evaluate the "structural resemblance" between
input sequence and output sequence; this criterion corresponds to peak distortion, as explained above. Let us consider coding. We use the proximity matrix σ to construct the ε-confoundability graph G_ε(Π), which was defined in Section 3. If 0 ≤ ε < 1/3 the ε-confoundability graph is complete and the capacity of the channel is 0: this means that the reliability criterion (5.2) is so strict that no data transmission is feasible. For ε ≥ 2/3 the graph is edge-free and so the ε-capacity is log 5: this means that the reliability criterion (5.2) is so loose that the channel Π behaves essentially as noise-free. Let us proceed to the more interesting case 1/3 ≤ ε < 2/3; actually, one can take ε = 1/3 (cf. Proposition 5.1). A maximal independent set I in G_{1/3}(Π) is made up by three of the five input "vertices", among them ⊕ and ⊗, as is easily checked. Using the notions explained in the appendix, and in particular the inequalities (A.1), one soon shows that the 3^n sequences of I^n give a maximal independent set in G_{1/3}(Π[n]). Fix the codeword length n; as stated by Theorem 5.2, an optimal codebook is C = I^n and so the optimal code rate is log 3, which is also the value of the capacity C_{1/3}(Π). When one uses such a code, a decoding error occurs only when at least one of the n graphemes sent over the channel loses at least two bars, an event which has been judged to be pretty unplausible.

We give the example of an application. Think of the keys of a digital keyboard, like the one of the author's telephone, say, in which the digits from 1 to 9 are arranged on a 3 × 3 grid, left to right, top row to bottom row, while digit 0 is positioned below digit 8. It may happen that, when a telephone number is dialled, the wrong key is pressed (because of "channel noise"). We assume the following model of the "noisy channel", in which possibilities are seen as numeric labels for vague linguistic judgements:


(i) it is quite plausible that the correct key is pressed (possibility 1);
(ii) it is pretty plausible that one inadvertently presses a "neighbour" of the correct key, i.e., a key which is positioned on the same row or on the same column and is contiguous to the correct key (possibility 2/3);


(iii) it is pretty unplausible that the key one presses is contiguous to the correct key, but is positioned on the same diagonal (possibility 1/3);
(iv) everything else is quite unplausible (possibility 0).

When the wrong key is pressed, we shall say that a cross-over of type (ii), of type (iii), or of type (iv) has taken place, according whether its possibility is 2/3, 1/3, or 0. Using these values¹⁰ one can construct a possibilistic matrix Π with the input and the output alphabet both equal to the set of the 10 keys. One has, for example: Π(a|1) = 2/3 for a ∈ {2, 4}, Π(a|1) = 1/3 for a = 5, Π(a|1) = 0 for a ∈ {3, 6, 7, 8, 9, 0}. As for the proximity σ(a, b), it is equal to 2/3 whenever either keys a and b are neighbours as in (ii), or there is a third key c which is a common neighbour of both. One has, for example: σ(1, a) = 2/3 for a ∈ {2, 3, 4, 5, 7}; instead, σ(1, a) = 1/3 for a ∈ {6, 8, 9} and σ(1, a) = 0 for a = 0. A codebook is a bunch of admissible telephone numbers of length n; since a phone number is wrong whenever there is a collision with another phone number in a single digit, it is natural to assume that the "noisy channel" Π is non-interactive. This example had been suggested to us by J. Körner; however, at least in principle, in the standard probabilistic setting one would have to specify three stochastic matrices such as to be 0-, 1/3- and 2/3-equivalent with Π. In these matrices only the opposition zero/non-zero would count; their entries would have no empirical meaning, and no significant relation with the stochastic matrix of the probabilities with which errors are actually committed by the hand of the operator. So, the adoption of a "hard" probabilistic model is in this case pretty unnatural. Instead, in a "soft" possibilistic approach one specifies just one possibilistic matrix Π, which contains precisely the information which is needed and nothing more.

¹⁰ Adopting a different "numerization" for the transition possibilities (or, equivalently, for the distortions) does not make any real difference from the point of view of criterion (5.2), provided the order is preserved and the values 0 and 1 are kept fixed. Instead, arithmetic averages as in criterion (6.1) have no sort of insensitivity to order-preserving transformations; criterion (6.1) might prove to be appropriate in a situation where one interprets possibilities in some other way (recall that possibilities can be viewed as a special case of plausibilities, which in their turn can be viewed as a special case of upper probabilities; cf., e.g., [22]). In (iv) we might have chosen a "negligible" positive value, rather than 0: again, this would have made no serious difference, save adding a negligible initial interval where the channel capacity would have been zero.

Unfortunately, the author's telephone is not especially promising. Let us adopt criterion (5.2). If the allowed error possibility of the code is 2/3 (or more), the confoundability graph is edge-free and no error protection is required. If we choose the error possibility ε = 1/3, we have C_{1/3}(Π) = log α(G_{1/3}(Π)) = log 3; in other words the 1/3-capacity, which is an asymptotic¹¹ parameter, is reached already for n = 1. To see this use the inequalities (A.1) of Appendix A: the independence number of G_{1/3}(Π) is 3, and a maximal independent set of keys, which are far enough from each other so as not to be confoundable, is {0, 1, 6}, as easily checked; however, one checks that 3 is also the chromatic number of the complementary graph. In practice, this means that an optimal codebook as in Theorem 5.1 may be constructed by juxtaposition of the input "letters" 0, 1, 6; the code is disappointing, since everything boils down to allowing only phone numbers which use keys 0, 1, 6. As for decoding, the output sequence y is decoded to the single codeword c for which Π[n](y|c) > 1/3; so, error correction is certainly successful if there have been no cross-overs of type (iii) and (iv). If, for example, one dials the number 2244 rather than 1111, a successful error correction takes place; actually, Π[4](2244|c) > 1/3 only for c = 1111. If instead one is so clumsy as to dial the "pretty unplausible" number 2225, this is incorrectly decoded to 1116. Take instead the more demanding threshold ε = 0; the 0-capacity, as easily checked, goes down to log 2; the 0-error code remains as disappointing as the 1/3-error code, being obtained by allowing only phone numbers made up of "far-away" digits as are 0 and 1, say.

¹¹ When the value of an asymptotic functional (channel capacity, say, or source entropy, or the rate-distortion function as in Appendix B) is reached already for n = 1, its computation is easy, but, unfortunately, this is so because the situation is so hopeless that one is obliged to use trivial code constructions. By the way, this is always the case when one tries to compress possibilistic sources without distortion, as in Section 4. The interesting situations correspond instead to cases when the computation of the asymptotic functional is difficult, as for the pentagon, or even unfeasible, as for the heptagon (cf. Appendix A).

The design of convenient keyboards
such that their possibilistic capacity is not obtained already for n = 1 is a graph-theoretic problem which may be of relevant practical interest in those situations when dialling an incorrect number may cause serious inconveniences. More generally, exhibiting useful finite-length code constructions would have a relation to the material of this paper which is similar to the relation of coding theory (algebraic coding theory, say) to the asymptotic theory of coding (Shannon theory).

Remark 7.1. Rather than peak distortion, in (7.4) one might use average distortion. This would give rise to a stationary but definitely interactive channel for which

Π_n(y|x) = 1 − d_n(x, y) = (1/n) Σ_{1≤i≤n} Π(y_i|x_i).
We leave open the problem of studying such a channel and ascertaining its meaning for real-world data transmission. Actually, one might even define new distortions between sequences based on a different way of averaging single-letter distortions, in the general sense of aggregation operators (the very broad notion of aggregation operators and averaging operators is covered, e.g., in [12] or [16]).

Now we pass to source coding and data compression. We shall pursue an interpretation of possibilistic SNI sources and possibilistic source coding which fits in with a meaning of the word "possible" to be found in the Oxford Dictionary of the English Language: possible = tolerable to deal with, i.e., acceptable, because it possesses all the qualities which are required.¹² Assume that certain items are accepted only if they pass n quality controls; each control i is given a numeric mark from 0 (totally unacceptable) to 1 (faultless); the marks which one can assign are chosen from a finite subset of [0, 1] of numbers which just stand for linguistic judgements. The quality control as a whole is passed only when all the n controls have been passed. As an example, let us take the source alphabet B equal to the alphabet of the seven graphemes output by the possibilistic channel Π which has been considered above. The "items" will be sequences y of n graphemes, and the i-th control will be made on the i-th grapheme y_i. The possibility vector π over the seven graphemes of B will be

π = (1, 1, 1, 2/3, 1, 2/3, 1).

¹² An interpretation which may be worth pursuing is: degree of possibility = level of grammaticality. This may be interesting also in channel coding, in those situations when decoding errors are less serious when the encoded message has a low level of grammatical correctness.
A possibility smaller than 1 has been assigned to the two output graphemes which are not also input graphemes; in practice, vector π has been obtained by taking the maximum of the entries in the columns of the possibilistic matrix Π which describes the channel. When π(b) = 1 it is possible that the grapheme b has been received at the end of the channel exactly as it has been transmitted; when π(b) = 2/3 the grapheme b which has been received is necessarily distorted with respect to the input grapheme, i.e., at least one bar has been erased during transmission.¹³ Let us fix a value ε, 0 ≤ ε < 1, and rule out all the items whose possibility is ≤ ε. Then the accepted items can be encoded by means of a possibilistic source code as in Section 4: each acceptable item is given a binary number whose length is nH_ε(π), or rather ⌈nH_ε(π)⌉, by rounding to the integer ceiling, i.e., to the smallest integer which is at least as large as n times the ε-entropy of π. In our case, when ε ≥ 2/3 only the sequences which do not contain the two output-only graphemes are given a codeword, and so H_ε(π) = log 5; when instead ε < 2/3 all the sequences have their own codeword, and so H_ε(π) = log 7.

¹³ As a matter of fact, we have been using a formula proposed in the literature in order to compute marginal output possibilities π(b), when the marginal input possibilities π(a) and the conditional possibilities Π(b|a) are given, namely

π(b) = max_{a ∈ A} [π(a) ∧ Π(b|a)],

the maximum being taken over all letters a ∈ A. In our case we have set all the input possibilities π(a) equal to 1. The possibilistic formula is inspired by the corresponding probabilistic one, just replacing sums and products by maxima and minima, as is usual when one passes from probabilities to possibilities; cf., e.g., [5] or [11].
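The last computations can be made concrete with a minimal Python sketch of the max–min marginalization of footnote 13 and of the ε-entropy H_ε(π) = log |{b : π(b) > ε}| recalled from Theorem 4.1 (logarithms to base 2); the toy channel and all names are illustrative only.

from math import log2

def output_possibilities(pi_in, Pi):
    """pi(b) = max over inputs a of min(pi_in(a), Pi(b|a)) -- the formula of footnote 13."""
    outputs = {b for row in Pi.values() for b in row}
    return {b: max(min(pi_in[a], Pi[a].get(b, 0.0)) for a in Pi) for b in outputs}

def eps_entropy(pi, eps):
    """H_eps(pi) = log of the number of letters whose possibility strictly exceeds eps."""
    return log2(sum(1 for p in pi.values() if p > eps))

# Toy channel with two inputs and three outputs; all input possibilities are set to 1,
# as in the example of the text.
Pi = {"A": {"A": 1.0, "C": 2/3}, "B": {"B": 1.0, "C": 2/3}}
pi_in = {"A": 1.0, "B": 1.0}
pi_out = output_possibilities(pi_in, Pi)                   # {"A": 1, "B": 1, "C": 2/3}
print(eps_entropy(pi_out, 1/3), eps_entropy(pi_out, 2/3))  # log 3 and log 2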
Acknowledgements

We gladly acknowledge helpful discussions with F. Fabris on the relationship between possibilistic channels and distortion measures as used in probabilistic source coding.

Appendix A. Graph capacity

We consider only simple graphs, i.e., graphs without multiple edges and without loops; we recall that a graph is assigned by giving its vertices and its edges; each edge connects two (distinct) vertices, which are then adjacent. If G is a graph, its complementary graph Ḡ has the same set of vertices, but two vertices are adjacent in Ḡ if and only if they are not adjacent in G. By α(G) and χ(G) we denote the independence number and the chromatic number of G, respectively. We recall that the independence number of a graph is the maximum size of a set of vertices none of which are adjacent (of an independent set, called also a stable set); the chromatic number of a graph is the minimum number of colours which can be assigned to its vertices in such a way that no two adjacent vertices have the same colour. From a graph G with k vertices one may wish to construct a "power graph" G^n whose k^n "vertices" are the vertex sequences of length n. Many such powers are described in the literature; of these we need the following one, called sometimes the strong power: two vertices x = x_1 x_2 … x_n and u = u_1 u_2 … u_n are adjacent in G^n if and only if for each component i, 1 ≤ i ≤ n, either x_i and u_i are adjacent in G or x_i = u_i. The reason for choosing this type of power becomes clear when one thinks of confoundability graphs G(W) and of ε-confoundability graphs G_ε(Π) as defined in Section 3. Actually, one has:

G(W^n) = [G(W)]^n,   G_ε(Π[n]) = [G_ε(Π)]^n.

The first equality is obvious; the second is implied by the first and by Lemma 3.3: just take any stochastic matrix W which is ε-equivalent to Π. If G is a simple graph, its graph capacity C(G), called also Shannon's graph capacity, is defined as

C(G) = lim_n (1/n) log α(G^n).
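As a small illustration of the strong power and of this definition, the following sketch verifies the classical construction which exhibits five pairwise non-adjacent pairs in the strong square of the pentagon (cf. Example A.1 below), so that α(P_5^2) ≥ 5 and hence C(P_5) ≥ (1/2) log 5, which by Lovász's result recalled below is in fact the exact value; all names are illustrative.

from itertools import product

def adjacent_P5(u, v):
    """Adjacency in the pentagon P5: vertices 0..4, i adjacent to i +/- 1 (mod 5)."""
    return u != v and (u - v) % 5 in (1, 4)

def adjacent_strong(x, y):
    """Adjacency in the strong power: the sequences differ and, componentwise, x_i = y_i or x_i ~ y_i."""
    return x != y and all(xi == yi or adjacent_P5(xi, yi) for xi, yi in zip(x, y))

codebook = [(i, (2 * i) % 5) for i in range(5)]
print(all(not adjacent_strong(x, y)
          for x, y in product(codebook, repeat=2) if x != y))   # True: an independent set of size 5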
The limit defining C(G) always exists, as it can be shown. It is rather easy to prove that

log α(G) ≤ (1/n) log α(G^n) ≤ log χ(Ḡ),   (A.1)

and so, whenever α(G) = χ(Ḡ), the graph capacity is very simply C(G) = log α(G). Giving a single-letter characterization of graph capacity can however be a very tough problem, which is still unsolved in its generality [15]. We observe that the minimum value of the graph capacity is zero, and is reached whenever the graph is complete, i.e., has all the k(k − 1)/2 possible edges; the maximum value of the capacity of a graph with k vertices is log k, and is obtained when the graph is edge-free (has no edges at all). We also observe that "pure" combinatorialists prefer to define graph capacity as the limit of the n-th roots α(G^n)^{1/n}, i.e., as 2^{C(G)}.

Example A.1. Let us take the case of a polygon P_k with k vertices. For k = 3, we have a triangle P_3; then α(P_3) = χ(P̄_3) = 1 and the capacity C(P_3) is zero. Let us go to the quadrangle P_4; then α(P_4) = χ(P̄_4) = 2 and so C(P_4) = 1. In the case of the pentagon, however, α(P_5) = 2 < χ(P̄_5) = 3. It was quite an achievement of Lovász to prove in 1979 that C(P_5) = log √5, as long conjectured; the conjecture had resisted a proof for more than twenty years. The capacity of the heptagon P_7 is still unknown.

Appendix B. The possibilistic rate-distortion function

This appendix generalizes source coding as dealt with in Section 4 and is rather more technical than the body of the paper. The reader is referred to Section 4 for a description of the problem of source coding. In the case of source coding with distortion, beside the primary source alphabet A one has a secondary alphabet B, called also the reproduction alphabet, which is used to reproduce primary sequences. A distortion matrix d is given which specifies the distortions d(a, b) between each primary letter a ∈ A and each secondary letter b ∈ B; the numbers d(a, b) are non-negative and for each primary letter a there is at least one secondary letter b such that d(a, b) = 0, i.e., such as to perfectly reproduce a. We recall that distortion measures have already been


used in Section 7; unlike in Section 7, here we do not require d(a, b) ≤ 1. The distortion between letters a and b is extended to average distortion between sequences x ∈ A^n and y ∈ B^n as we did in (7.3), or to peak distortion, called also maximal distortion, as in (7.2). Unlike in the case without distortion, here the decoder f− maps the binary codeword f+(x) to a secondary sequence y ∈ B^n which should have an acceptably small distortion from the encoded primary sequence x. Let us denote by g the composition of encoder and decoder, g(x) = f−(f+(x)); the set of secondary sequences C = g(A^n) ⊆ B^n which are used to reproduce the primary sequences is called the codebook of the code f = (f+, f−). In practice the secondary sequence y = g(x) is usually misinterpreted as if it were the codeword for the primary sequence x, and correspondingly the mapping g is called the encoder (this is slightly abusive, but the specification of f+ and f− turns out to be irrelevant once g is chosen). The rate of the code is the number

R_n = (log |C|)/n.

The numerator can be interpreted as the (not necessarily integer) length of the binary codewords output by the encoder stricto sensu f+ and fed to the decoder f−, and so the rate is the number of bits per primary letter. From now on we shall forget about f+; the term "encoder" will refer solely to the mapping g which outputs secondary sequences y ∈ B^n. Let us begin with the average distortion d_n, as is common in the probabilistic approach. One fixes a threshold Δ ≥ 0 and a tolerated error probability ε ≥ 0, and requires that the following reliability criterion is satisfied:

P^n{x : d_n(x, g(x)) > Δ} ≤ ε.   (B.1)
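For concreteness, the reliability criterion (B.1) can be evaluated exactly, for small n, by enumerating the source sequences; in the following sketch the source distribution, the distortion matrix and the letter-by-letter encoder are toy data chosen only for illustration.

from itertools import product

P = {"a": 0.6, "b": 0.3, "c": 0.1}                 # memoryless source distribution on A
d = {"a": {"u": 0.0, "v": 1.0},                    # single-letter distortion d(a, b), B = {u, v}
     "b": {"u": 1.0, "v": 0.0},
     "c": {"u": 0.5, "v": 0.5}}
g_letter = {"a": "u", "b": "v", "c": "u"}          # a simple letter-by-letter encoder g

def failure_probability(n, Delta):
    """P^n{x : d_n(x, g(x)) > Delta}, computed by brute-force enumeration of A^n."""
    total = 0.0
    for x in product(P, repeat=n):
        prob = 1.0
        for a in x:
            prob *= P[a]
        avg_dist = sum(d[a][g_letter[a]] for a in x) / n
        if avg_dist > Delta:
            total += prob
    return total

print(failure_probability(3, 0.2))   # probability that the average distortion exceeds 0.2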
The encoder g should be constructed in such a way that the codebook C ⊆ B^n be as small as possible, under the constraint that the reliability criterion which has been chosen is satisfied; for fixed n one can equivalently minimize the code rate:

Minimize the code rate (1/n) log |C| so as to satisfy constraint (B.1).
For fixed Δ ≥ 0 and ε ≥ 0, one is interested in the asymptotic value of the optimal rates. For ε > 0 one proves that the solution, i.e., the asymptotic value of optimal code rates, is given by the rate-distortion function

R_ε(P, Δ) = min_{XY : E d(X, Y) ≤ Δ} I(X ∧ Y),   ε > 0,   (B.2)

in whose expression, at the right, ε does not explicitly appear. Above, I(X ∧ Y) is the mutual information¹⁴ of the random couple XY, X being a random primary letter output by the source according to the probability distribution P. The second random component Y of the random couple XY belongs to the secondary alphabet B, and so is a random secondary letter. The minimum is taken with respect to all random couples XY which are constrained to have an expected distortion E d(X, Y) which does not exceed the threshold Δ. The rate-distortion function does not look especially friendly; luckily the problem of its computation has been deeply investigated from a numeric viewpoint [4]. Observe however that, even if the computation of the rate-distortion function involves a minimization, there is no trace of n left and so its expression is single-letter, unlike in the case of graph capacity.

Let us proceed to zero-error coding with distortion. The problem of finding a single-letter expression for the asymptotic value of optimal rates is not at all trivial, but it has been solved; not surprisingly, this value turns out to depend only on the support of P, i.e., on the fact whether the probabilities P(a) of source letters a are zero or non-zero. More precisely, for ε = 0 the asymptotic value is given by the zero-error rate-distortion function:

R_0(P, Δ) = max_{X : P(a)=0 ⇒ P_X(a)=0}   min_{XY : E d(X, Y) ≤ Δ} I(X ∧ Y).   (B.3)
Here the maximum is taken with respect to all random variables X whose support is (possibly strictly) included in the support of P, i.e., in the subset of letters a whose probability P(a) is strictly positive; P_X is the probability distribution of the random variable X.

¹⁴ The mutual information can be expressed in terms of Shannon entropies as I(X ∧ Y) = H(X) + H(Y) − H(XY); it is seen as an index of dependence between the random variables X and Y, and assumes its lowest value, i.e., 0, if and only if X and Y are independent.
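Footnote 14 can be illustrated with a few lines of code computing I(X ∧ Y) from a joint probability table; the toy table and all names are illustrative.

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X ^ Y) = H(X) + H(Y) - H(XY), with joint[(a, b)] = P(X = a, Y = b)."""
    px, py = {}, {}
    for (a, b), p in joint.items():
        px[a] = px.get(a, 0.0) + p
        py[b] = py.get(b, 0.0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# Independent binary X and Y: the mutual information is 0.
joint = {(a, b): 0.25 for a in "01" for b in "01"}
print(mutual_information(joint))   # 0.0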


The minimum in (B.3) is to be compared with R_ε(P, Δ) as in (B.2). In practice, one considers all the rate-distortion functions over the support of P, and then selects the largest value which has been obtained; as for numeric techniques which are available, cf. [4]. If one chooses the peak distortion d*_n rather than the average distortion d_n, the reliability criterion (B.1) and the definition of the rate-distortion function should be modified accordingly; in particular, the new reliability criterion is

P^n{x : d*_n(x, g(x)) > Δ} ≤ ε.   (B.4)

The asymptotic optimal rate for peak distortion R*_ε(P, Δ) has in general a higher value¹⁵ than R_ε(P, Δ), since (B.4) is more demanding than (B.1). The expression of R*_ε(P, Δ) turns out to be the same as in (B.2) and (B.3), only replacing the constraint E d(X, Y) ≤ Δ which defines the minimization set by the more severe constraint d(X, Y) ≤ Δ (cf. [4]; by writing d(X, Y) = 0 we mean that the event d(X, Y) = 0 has probability 1, i.e., that the support of the random couple XY is made up only of couples (a, b) for which d(a, b) = 0).

¹⁵ We recall that peak distortion can be taken back to coding with average distortion with a threshold equal to zero; this is true no matter whether the source is probabilistic or possibilistic. Actually, if one sets δ(a, b) = 0 iff d(a, b) ≤ Δ, else δ(a, b) = d(a, b), the inequality d*_n(x, y) ≤ Δ is clearly equivalent to the equality δ_n(x, y) = 0. So, coding at distortion level Δ with peak distortion is the same as coding at distortion level zero with average distortion, after replacing the old distortion measure d by the new distortion measure δ. The case of average distortion with Δ = 0 and the general case of peak distortion with any Δ ≥ 0 can both be couched into the inspiring mould of graph theory; then the rate-distortion function is rather called the hypergraph entropy, or the graph entropy in the special case when the two alphabets A and B coincide and when the distortion is Hamming distortion, as defined at the end of this appendix; cf. [4,20]. We recall that graph capacity and hypergraph entropy are the two basic functionals of the zero-error theory; both of them originated in a coding-theoretic context, but both have found unexpected and deep applications elsewhere; cf. [15].

If the source is a SNI source described by giving the possibility vector π over primary letters, one can consider the same codes as before, but judge of their reliability by referring to the new reliability criteria:

π[n]{x : d_n(x, g(x)) > Δ} ≤ ε   (B.5)

or, in the case of peak distortion:

π[n]{x : d*_n(x, g(x)) > Δ} ≤ ε,   (B.6)

to be compared with (B.1) and (B.4). The corresponding minimization problems are:

Minimize the code rate (1/n) log |C| so as to satisfy constraint (B.5) or (B.6), respectively.

Definition B.1. The possibilistic rate-distortion function R_ε(π, Δ) for average distortion and the possibilistic rate-distortion function R*_ε(π, Δ) for peak distortion are the limits of the rates R_n of codes which are optimal for criterion (B.5) or (B.6), respectively, as the length n goes to infinity; 0 ≤ ε < 1.
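Mirroring the probabilistic sketch given after (B.1), criterion (B.6) can be illustrated as follows, under the usual max–min conventions for SNI sources (the possibility of a sequence is taken as the minimum of its single-letter possibilities, the possibility of a set of sequences as the maximum over its elements); names and data are illustrative.

from itertools import product

pi = {"a": 1.0, "b": 1.0, "c": 1/3}                # possibility vector on A
d = {"a": {"u": 0.0, "v": 1.0},                    # single-letter distortion d(a, b), B = {u, v}
     "b": {"u": 1.0, "v": 0.0},
     "c": {"u": 0.5, "v": 0.5}}
g_letter = {"a": "u", "b": "v", "c": "u"}          # letter-by-letter reproduction map g

def error_possibility_peak(n, Delta):
    """pi[n]{x : d*_n(x, g(x)) > Delta} for the SNI source pi and the encoder above."""
    bad = 0.0
    for x in product(pi, repeat=n):
        peak = max(d[a][g_letter[a]] for a in x)
        if peak > Delta:
            bad = max(bad, min(pi[a] for a in x))
    return bad

print(error_possibility_peak(3, 0.2))   # 1/3: a violation needs at least one letter "c"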
Lemma 3.1 soon gives the following lemma:

Lemma B.1. If P and π are ε-equivalent, a code f = (f+, f−) is optimal for criterion (B.1) at zero error if and only if it is optimal for criterion (B.5) at ε-error; it is optimal for criterion (B.4) at zero error if and only if it is optimal for criterion (B.6) at ε-error.

The following theorem is obtained from Lemma B.1 after a comparison with the expressions of R_0(P, Δ) and R*_0(P, Δ):

Theorem B.1. The possibilistic rate-distortion functions R_ε(π, Δ) and R*_ε(π, Δ), 0 ≤ ε < 1, are given by

R_ε(π, Δ) = R_0(P, Δ),   R*_ε(π, Δ) = R*_0(P, Δ)

for whatever P such as to be ε-equivalent to π; more explicitly:

R_ε(π, Δ) = max_{X : π(a) ≤ ε ⇒ P_X(a)=0}   min_{XY : E d(X, Y) ≤ Δ} I(X ∧ Y),

R*_ε(π, Δ) = max_{X : π(a) ≤ ε ⇒ P_X(a)=0}   min_{XY : d(X, Y) ≤ Δ} I(X ∧ Y).
Observe that the possibilistic rate-distortion functions R_ε(π, Δ) and R*_ε(π, Δ) are both non-increasing step-functions of ε. Actually, if μ_i < μ_{i+1} are two consecutive entries in π, as in Proposition 4.1, the relation of ε-equivalence is always the same for whatever ε such that μ_i ≤ ε < μ_{i+1}. Unlike R_ε(π, Δ), R*_ε(π, Δ) is also a step-function of Δ. Actually, if the distinct entries of the matrix d are arranged in increasing order, and if d_i < d_{i+1} are two consecutive entries, the constraint (B.6) is the same for whatever Δ such that d_i ≤ Δ < d_{i+1}. In some simple cases the minima and the maxima which appear in the expression of the various rate-distortion functions can be made explicit; the reader is referred once more to [4]; the results given there are soon adapted to the possibilistic case. We shall just mention one such special case: the two alphabets coincide, A = B, the distortion matrix d is Hamming distortion, i.e., d(a, b) is equal to 0 or to 1 according whether a = b or a ≠ b, respectively; Δ = 0. As a matter of fact, one soon realizes that this is just a different formulation of the problem of coding without distortion as in Section 4. A simple computation gives

R_ε(π, 0) = R*_ε(π, 0) = log |{a : π(a) > ε}|,

in accordance with the expression of the possibilistic entropy given in Theorem 4.1. A slight generalization of this case is obtained for arbitrary Δ ≥ 0 when the inequality d(a, b) ≤ Δ is an equivalence relation which partitions the primary alphabet A into equivalence classes E. Then

R*_ε(π, Δ) = log |{E : π(E) > ε}|.

In practice, optimal codes are constructed by taking a letter a_E for each class E whose possibility exceeds ε; each primary letter in E is then reproduced by using precisely a_E. This way the asymptotic optimal rate R*_ε(π, Δ) is achieved already for n = 1, as in the case of coding without distortion. This is bad news, since it means that optimal code constructions are bound to be trivial; cf. footnote 11.
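A minimal sketch of the last formula, under the assumptions stated above (the relation d(a, b) ≤ Δ is an equivalence relation) and taking the possibility of a class E as the largest possibility of its elements; names and data are illustrative.

from math import log2

def rate_distortion_classes(pi, d, Delta, eps):
    """R*_eps(pi, Delta) = log |{E : pi(E) > eps}| (logarithms to base 2)."""
    classes = []
    for a in pi:
        for E in classes:
            if d[a][E[0]] <= Delta:        # join the class of a Delta-close representative
                E.append(a)
                break
        else:
            classes.append([a])
    return log2(sum(1 for E in classes if max(pi[a] for a in E) > eps))

# Hamming distortion with Delta = 0: every letter is its own class, so the formula
# reduces to log |{a : pi(a) > eps}|, the possibilistic entropy of Theorem 4.1.
pi = {"a": 1.0, "b": 1.0, "c": 2/3}
d = {x: {y: (0 if x == y else 1) for y in pi} for x in pi}
print(rate_distortion_classes(pi, d, 0, 2/3))   # log 2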

References

[1] M. Borelli, A. Sgarro, A possibilistic distance for sequences of equal and unequal length, in: C. Calude, Gh. Păun (Eds.), Finite VS Infinite, Discrete Mathematics and Theoretical Computer Science, Springer, London, 2000, pp. 27–38.
[2] B. Bouchon-Meunier, G. Coletti, C. Marsala, Possibilistic conditional events, IPMU 2000, Madrid, July 3–7 2000, Proceedings, pp. 1561–1566.
[3] Th.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[4] I. Csiszár, J. Körner, Information Theory, Academic Press, New York, 1981.
[5] G. De Cooman, Possibility theory, Internat. J. General Systems 25 (4) (1997) 291–371.
[6] D. Dubois, H.T. Nguyen, H. Prade, Possibility theory, probability and fuzzy sets: misunderstandings, bridges and gaps, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 343–438.
[7] D. Dubois, W. Ostasiewicz, H. Prade, Fuzzy sets: history and basic notions, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 21–290.
[8] D. Dubois, H. Prade, Properties of measures of information in evidence and possibility theories, Fuzzy Sets and Systems 24 (1987) 161–182.
[9] D. Dubois, H. Prade, Fuzzy sets in approximate reasoning: inference with possibility distribution, Fuzzy Sets and Systems 40 (1991) 143–202.
[10] F. Fabris, A. Sgarro, Possibilistic data transmission and fuzzy integral decoding, IPMU 2000, Madrid, July 3–7 2000, Proceedings, pp. 1153–1158.
[11] E. Hisdal, Conditional possibilities, independence and non-interaction, Fuzzy Sets and Systems 1 (1978) 283–297.
[12] G.J. Klir, T.A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice-Hall, London, 1988.
[13] G.J. Klir, M.J. Wierman, Uncertainty-Based Information: Elements of Generalized Information Theory, Physica-Verlag/Springer, Heidelberg and New York, 1998.
[14] G.J. Klir, Measures of uncertainty and information, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 439–457.
[15] J. Körner, A. Orlitsky, Zero-error information theory, IEEE Trans. Inform. Theory 44 (6) (1998) 2207–2229.
[16] H.T. Nguyen, E.A. Walker, A First Course in Fuzzy Logic, 2nd Edition, Chapman & Hall, London, 2000.
[17] S. Ovchinnikov, An introduction to fuzzy relations, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 233–259.
[18] C.E. Shannon, A mathematical theory of communication, Bell System Technical J. 27 (3&4) (1948) 379–423, 623–656.
[19] C.E. Shannon, The zero-error capacity of a noisy channel, IRE Trans. Inform. Theory IT-2 (1956) 8–19.
[20] G. Simonyi, Graph entropy: a survey, in: W. Cook, L. Lovász, P. Seymour (Eds.), Combinatorial Optimization, DIMACS Series in Discrete Maths and Computer Science, vol. 20, AMS, Providence, RI, 1995, pp. 399–441.
[21] D. Solomon, Data Compression, Springer, New York, 1998.
[22] P. Walley, Statistical Reasoning with Imprecise Probabilities, Chapman & Hall, London, 1991.
[23] L. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1978) 3–28.
