
For 2. As can be seen from equation 2. As was already mentioned above, the only model which met general acceptance was the 1-displaced Poisson distribution. It is no wonder, then, that the generalized model has practically not been discussed. Fucks Thus, his application of the 1-displaced Poisson distribution included studies on (1) the individual style of single authors, as well as on (2) texts from different authors either 2.

As an example of the study of individual texts, Figure 2. As can be seen from the dotted line in Figure 2. As to a comparison of two German authors, Rilke and Goethe, on the one hand, and two Latin authors, Sallust and Caesar, on the other, Figure 2.

Again, the fitting of the 1-displaced Poisson distribution seems to be convincing. Yet, in re-analyzing his works, there remains at least one major problem: Fucks gives many characteristics of the specific distributions, starting from mean values and standard deviations up to the central moments, entropy etc. Yet, there are hardly ever any raw data given in his texts, a fact which makes it impossible to check the results at which he arrived. Thus, one is forced to believe in the goodness of his fittings on the basis of his graphical impressions, only; and this drawback is further enhanced by the fact that there are no procedures which are applied to test the goodness of his fitting the 1-displaced Poisson distribution.
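The kind of check that is missing from Fucks' presentation can be sketched as follows. This is a minimal illustration, not a reconstruction of his procedure: the frequencies are hypothetical, the parameter is estimated as the mean minus one, and the discrepancy coefficient is assumed to be the Pearson chi-square statistic divided by the sample size. Function and variable names are illustrative only.

```python
import math

def displaced_poisson_pmf(x, a):
    """P(X = x) for the 1-displaced Poisson distribution, x = 1, 2, 3, ..."""
    return math.exp(-a) * a ** (x - 1) / math.factorial(x - 1)

def fit_displaced_poisson(freqs):
    """freqs maps word length x (>= 1) to an absolute frequency.

    Returns the estimate a = mean - 1, the Pearson chi-square statistic,
    and the discrepancy coefficient C = chi-square / N (assumed definition).
    Classes beyond the largest observed length are not pooled in this sketch.
    """
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n
    a = mean - 1.0
    chi2 = 0.0
    for x, observed in freqs.items():
        expected = n * displaced_poisson_pmf(x, a)
        chi2 += (observed - expected) ** 2 / expected
    return a, chi2, chi2 / n

# Hypothetical frequencies of 1- to 5-syllable words (not taken from Fucks).
sample = {1: 520, 2: 310, 3: 120, 4: 40, 5: 10}
a_hat, chi2, c = fit_displaced_poisson(sample)
print(f"a = {a_hat:.3f}, chi-square = {chi2:.2f}, C = {c:.4f}")
```

With absolute frequencies at hand, such a test takes a few lines; with relative frequencies only, the chi-square statistic cannot be computed without knowing N.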

There is only one instance where Fucks presents at least the relative, though not the absolute frequencies of particular distributions in detail. Fucks a: 85ff. The relative frequencies are reproduced in Table 2. We will come back to these data throughout the following discussion, using them as exemplifying material.

Being well aware of the fact that for each of the languages we are concerned with mixed data, we can ignore this fact, and see the data as a representation of a maximally broad spectrum of different empirical distributions which may be subjected to empirical testing.

As was mentioned above cf. Remembering that fitting is considered to be good in case of 0. Still, Fucks and many followers of his pursued the idea of the 1-displaced Poisson distribution as the most adequate model for word length frequencies.

Thus, one arrives at the curve in Figure 2. Fucks a: As can be seen with Fucks a: 88, f. And again, it would have been easy to run such a statistical test, calculating the coefficient of determination R² in order to test the adequacy of the theoretical curve obtained. Let us briefly discuss this procedure: in a nonlinear regression model, R² represents that part of the variance of the variable y which can be explained by variable x.

There are quite a number of more or less divergent formulae to calculate R² cf. Grotjahn, which result in partly significant differences. Usually, the following formula 2. Thus, for each empirical xi, we need both yi which can be obtained by the empirical values yi and the theoretical values b formula 2. Still, there remains a major theoretical problem with the specific method chosen by Fucks in trying to prove the adequacy of the 1-displaced Poisson distribution: this problem is related to the method itself, i.
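Since several formulae for R² are in circulation, the following sketch fixes one common variant as an assumption: one minus the ratio of the residual sum of squares to the total sum of squares. It is not necessarily the formula used in the source, and the function name is illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (one common variant)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical empirical values and theoretical values from some fitted curve.
empirical = [0.52, 0.31, 0.12, 0.04, 0.01]
theoretical = [0.49, 0.35, 0.12, 0.03, 0.01]
print(round(r_squared(empirical, theoretical), 4))
```

Other variants (for instance, the squared correlation between empirical and theoretical values) can give noticeably different numbers for the same data, which is the point made above.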

Taking a second look at formula 2. To summarize, one has thus to draw an important conclusion: Due to the fact that Fucks did not apply any suitable statistics to test the goodness of fit for the 1-displaced Poisson distribution, he could not come to the point of explicitly stating that this model may be adequate in some cases, but is not acceptable as a general standard model.

Most of these subsequent studies concentrated on the 1-displaced Poisson distribution, as suggested by Fucks. In fact, work on the Poisson distribution is by no means a matter of the past. Discussing and testing various distribution models, Rothschild did not find any one of the models he tested to be adequate. As was said above, Michel first found the lognormal distribution to be a completely inadequate model.

He then tested the 1-displaced Poisson distribution and obtained negative results as well: although fitting the Poisson distribution led to better results as compared to the lognormal distribution, word length in his data turned out not to be Poisson distributed, either Michel f. Finally, Grotjahn whose work will be discussed in more detail below cf. In doing so, let us first direct our attention to the 2-parameter model suggested by him, and then to his 3-parameter model.

In a similar way, two related 2-parameter distributions can be derived from the general model 2. It is exactly the latter distribution 2. Fucks has not systematically studied its relevance; still, it might be tempting to see what kind of results are yielded by this distribution for the data already analyzed above cf. As in the case of the 1-displaced Poisson distribution, one has thus to acknowledge that the Fucks 2-parameter 1-displaced Dacey-Poisson distribution is an adequate theoretical model only for a specific type of empirical distributions.

This leads to the question whether the Fucks 3-parameter distribution is more adequate as an overall model. It would lead too far, here, to go into details, as far as their derivation is concerned. Consequently, three solutions are obtained, not all of which must necessarily be real solutions. With this in mind, let us once again analyze the data of Table 2. The results obtained can be seen in Table 2. It can clearly be seen that in some cases, quite reasonably, the results for the 3-parameter model are better, as compared to those of the two models discussed above.

From the results represented in Table 2. These violations can be of two kinds: a. However, some of the problems met might be related to the specific way of estimating the parameters suggested by him, and this might be the reason why other authors following him tried to find alternative ways. Cercvadze, G. As opposed to most of his German papers, Fucks had discussed his generalization at some length in this English synopsis of his work, and this is likely to be the reason why his approach received much more attention among Russian-speaking scholars.

We need not go into details, here, as far as the derivation of the Fucks distribution and its generating function is concerned cf. Unfortunately, Piotrovskij et al. Based on the standard Poisson distribution, as represented in 2. Based on these assumptions, the following special cases are obtained for 2. These analyses comprised nine Polish literary texts, or segments of them, and the results of these analyses indeed proved their approach to be successful.

For the sake of comparison, Table 2. A closer look at these data shows that the Polish text samples are relatively homogeneous: for all texts, the dispersion quotient is in the interval 0. The authors analyzed Croatian data from two corpora, each consisting of several literary works and a number of newspaper articles.

The data of one of the two samples are represented in Table 2. [Figure 2. Observed and Poisson frequencies against syllables per word, 0 to 8.] Rather, it is of methodological interest to see how the authors dealt with the data.

Guided by the conclusion supported by the graphical representation of Figure 2. Still, there remain at least two major theoretical problems: 1. No interpretation is given as to why the weighting modification is necessary: is this a matter of the specific data structure, is this specific for Croatian language products?

As the re-analyses presented in the preceding chapters have shown, neither the standard Poisson distribution nor any of its straightforward modifications can be considered to be an adequate model.

Grotjahn, in his attempt, opened the way for new perspectives: he not only showed that the Poisson model per se might not be an adequate model; furthermore, he initiated a discussion concentrating on the question whether one overall model could be sufficient when dealing with word length frequencies of different origin. As a starting point, Grotjahn analyzed seven letters by Goethe, written in , and tested in how far the 1-displaced Poisson distribution would prove to be an adequate model.

As was pointed out above cf. However, of the concrete data analyzed by Grotjahn, only some satisfied this condition; others clearly did not, the value of d ranging from 1. In a way, this conclusion paved the way for a new line of research. After decades of concentration on the Poisson distribution, Grotjahn was able to prove that this model alone cannot be adequate for a general theory of word length distribution. On the basis of this insight, Grotjahn further elaborated his ruminations.
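The condition on d can be checked numerically. The sketch below assumes that d is the variance divided by the mean minus one, which equals 1 for a 1-displaced Poisson distribution because that distribution has mean a + 1 and variance a. The exact definition in the source is elided, so this definition and the function name are assumptions.

```python
import math

def dispersion_quotient(freqs):
    """d = variance / (mean - 1) for word lengths x >= 1 (assumed definition).

    freqs maps word length to an absolute or relative frequency.
    """
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n
    variance = sum(f * (x - mean) ** 2 for x, f in freqs.items()) / n
    return variance / (mean - 1.0)

# Theoretical 1-displaced Poisson probabilities (a = 0.8): d should be close to 1.
a = 0.8
theoretical = {x: math.exp(-a) * a ** (x - 1) / math.factorial(x - 1)
               for x in range(1, 40)}
d_poisson = dispersion_quotient(theoretical)

# Hypothetical overdispersed sample: d clearly above 1.
overdispersed = {1: 600, 2: 150, 3: 100, 4: 80, 5: 70}
d_sample = dispersion_quotient(overdispersed)
print(round(d_poisson, 6), round(d_sample, 3))
```

A value of d well above 1 signals more variation than a single Poisson parameter can produce, which is the situation Grotjahn's argument addresses.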

Although every single word thus may well follow a Poisson distribution, this assumption does not necessarily imply the premise that the probability is one and the same for all words; rather, it depends on factors such as linguistic context, theme, etc. Grotjahn 56ff. Thus, the so-called negative binomial distribution 2. Therefore, as Grotjahn 71f. With his approach, Grotjahn thus additionally succeeded in integrating earlier research, both on the geometric and the Poisson distributions, which had failed to be adequate as an overall valid model.
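As a rough illustration of why the negative binomial can absorb extra variation, the sketch below gives its probability function in the standard (non-displaced) form and moment estimates for its two parameters. The parametrization with k and p and the helper names are assumptions made for this example; for word length data the 1-displaced form would be used instead.

```python
import math

def neg_binomial_pmf(x, k, p):
    """P(X = x) = Gamma(k + x) / (Gamma(k) x!) * p**k * (1 - p)**x, x = 0, 1, ..."""
    log_coef = math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
    return math.exp(log_coef + k * math.log(p) + x * math.log(1.0 - p))

def moment_estimates(mean, variance):
    """Method-of-moments estimates; only defined if variance exceeds the mean."""
    if variance <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / variance
    k = mean * mean / (variance - mean)
    return k, p

# Hypothetical moments: mean 2, variance 4 (a Poisson model would force variance 2).
k_hat, p_hat = moment_estimates(2.0, 4.0)
probs = [neg_binomial_pmf(x, k_hat, p_hat) for x in range(200)]
print(k_hat, p_hat, round(sum(probs), 6))
```

The error raised for variance not exceeding the mean mirrors the fact that this model only applies to overdispersed data.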

The data are reproduced in Table 2. Poisson d. The results are graphically represented in Figure 2. [Figure 2. f(x) for the negative binomial and Poisson distributions, x = 0 to 9; running head: History and Methodology of Word Length Studies.] Still, it is tempting to see to what extent the negative binomial distribution is able to model the data of nine languages, given by Fucks cf. Their discussion is of unchanged importance, still today, since many more recent studies in this field do not seem to pay sufficient attention to the ideas expressed almost a decade ago.

Before discussing these important reflections, one more model should be discussed, however, to which attention has recently been directed by Kromer a,b,c; In this case, we are concerned with the Poisson-uniform distribution, also called Poisson-rectangular distribution cf.

In his approach, Kromer a derived the Poisson-uniform distribution along a different theoretical way, which need not be discussed here in detail. With regard to formula 2. It would be too much, here, to completely derive the two relevant equations anew. It may suffice therefore to say that the first equation can easily be derived from 2.

Best, in turn, had argued in favor of the negative binomial distribution discussed above, as an adequate model. The results obtained for these data need not be presented here, since they can easily be taken from the table given by Kromer a: These data have been repeatedly analyzed above, among others with regard to the negative binomial distribution cf.

Using the method of moments, it turns out that in four of the nine cases (Esperanto, Arabic, Latin, and Turkish), no acceptable solutions are obtained. Now, what is the reason for no satisfying results being obtained, according to the method of moments? Empirically, this is proven by the results represented in Table 2. Poisson-uniform distribution suggested by Kromer (personal communication) shall be demonstrated here; it is relevant for those cases when parameter a converges with parameter b in equation 2.

Parameter I, according to him, expresses something like the specifics of a given language i. Unfortunately, most of the above-mentioned papers Kromer b,c; have the status of abstracts, rather than of complete papers; as a consequence, only scarce empirical data are presented which might prove the claims brought forth on a broader empirical basis.

If his assumption should bear closer examination on a broader empirical basis, this might as well explain why we are concerned here with a mixture of two distributions. However, one must ask the question, why it is only the rectangular distribution which comes into play here, as one of two components.

Strangely enough, it is just the Poisson-uniform distribution, which converges to almost no other distribution, not even to the Poisson distribution, as can be seen above for details, cf. This discussion was initiated by Grotjahn and Altmann as early as in , and it seems important to call to mind the most important arguments brought forth some ten years ago.

Yet, only recently systematic studies have been undertaken to solve just the methodological problems by way of empirical studies. Nevertheless, most of the ideas discussed (Grotjahn and Altmann combined them in six groups of practical and theoretical problems) are of unchanged importance for contemporary word length studies, which makes it reasonable to summarize at least the most important points, and comment on them from a contemporary point of view.

The problem of the unit of measurement. In other words: There can be no a priori decision as to what a word is, or in what units word length can be measured. Meanwhile, in contemporary theories of science, linguistics is no exception to the rule: there is hardly any science which would not acknowledge, to one degree or another, that it has to define its object, first, and that constructive processes are at work in doing so.

The relevant thing here is that measuring is made possible, as an important thing in the construction of theory. What has not yet been studied is whether there are particular dependencies between the results obtained on the basis of different measurement units; it goes without saying that, if they exist, they are highly likely to be language-specific.

Also, it should be noted that this problem does not only concern the unit of measurement, but also the object under study: the word. It is not even the problem of compound words, abbreviations and acronyms, or numbers and digits, which comes into play here, or the distinction between word forms and lexemes (lemmas); rather, it is the decision whether a word is to be defined on a graphemic, orthographic-graphemic, or phonological level.

The population problem. Again, as to these questions, there are hardly any systematic studies which would aim at a comparison of results obtained on an empirical basis. However, there are some dozens of different types of letters, which can be proven to follow different rules, and which even more clearly differ from other text types.

The goodness-of-fit problem. Rather, the question is, what is a small text, and where does a large text start?

The problem of the interrelationship of linguistic properties. What they have in mind are intralinguistic factors which concern the synergetic organization of language, and thus the interrelationship between word length factors such as size of the dictionary, or the phoneme inventory of the given language, word frequency, or sentence length in a given text to name but a few examples.

As soon as the interest shifts from language, as a more or less abstract system, to the object of some real, fictitious, imagined, or even virtual communicative act, between some producer and some recipient, we are not concerned with language, any more, but with text.

Consequently, there are more factors to be taken into account forming the boundary conditions, factors such as author-specific, or genre-dependent conditions. Ultimately, we are on the borderline here, between quantitative linguistics and quantitative text analysis, and the additional factors are, indeed, more language-related than intralinguistic in the strict sense of the word.

It should be mentioned, however, that very little is known about such factors, and systematic work on this problem has only just begun. The modelling problem. As can be seen, the aim may be different with regard to the particular research object, and it may change from case to case; what is of crucial relevance, then, is rather the question of interpretability and explanation of data and their theoretical modelling.

The problem of explanation. Consequently, in order to obtain an explanation of the nature of word length, one must discover the mechanism generating it, hereby taking into account the necessary boundary conditions. Thus far, we cannot directly concentrate on the study of particular boundary conditions, since we do not know enough about the general system mechanism at work.

Consequently, contemporary research involves three different kinds of orientation: first, we have many bottom-up oriented approaches, partly in the form of ad-hoc solutions for particular problems, partly in the form of inductive research; second, we have top-down oriented, deductive research, aiming at the formulation of general laws and models; and finally, we have much exploratory work, which may be called abductive by nature, since it is characterized by constant hypothesis testing, possibly resulting in a modification of higher-level hypotheses.

In this framework, it is not necessary to know the probabilities of all individual frequency classes; rather, it is sufficient to know the relative difference between two neighboring classes, e. Ultimately, this line of research has in fact provided the most important research impulses in the s, which shall be discussed in detail below. In their search for relevant regularities in the organization of word length, Wimmer et al. Wimmer et al. This model was already discussed above, in its 1-displaced form 2.
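The idea that the relation between neighboring classes determines the whole distribution can be illustrated with a small sketch. It assumes the relation has the form P(x) = g(x) · P(x − 1), with g chosen freely; the concrete functions g used by Wimmer et al. are not reproduced here. Choosing g(x) = a/x is a known way to obtain the Poisson distribution, and a constant g gives a geometric distribution.

```python
import math

def distribution_from_ratio(g, max_x):
    """Build probabilities for x = 0..max_x from P(x) = g(x) * P(x - 1).

    The unknown P(0) is fixed afterwards by normalizing the (truncated) series.
    """
    weights = [1.0]
    for x in range(1, max_x + 1):
        weights.append(weights[-1] * g(x))
    total = sum(weights)
    return [w / total for w in weights]

a = 1.5
poisson_like = distribution_from_ratio(lambda x: a / x, 60)

q = 0.4
geometric_like = distribution_from_ratio(lambda x: q, 200)

print(round(poisson_like[0], 6), round(math.exp(-a), 6))
print(round(geometric_like[0], 6), round(1 - q, 6))
```

Different choices of g thus generate different distribution families from one and the same scheme, without the probabilities of individual classes having to be specified in advance.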

It has also been found to be an adequate model for word length frequencies from a Slovenian frequency dictionary Grzybek After corresponding re-parametrizations, these modifications result in well-known distribution models. In , Wimmer et al. The set of word length classes is organized as a whole, i. Now, different distributions may be inserted for j. Thus, inserting the Borel distribution cf. The parameters a and b of the GPD are independent of each other; there are a number of theoretical restrictions for them, which need not be discussed here in detail cf.

Irrespective of these restrictions, already Wimmer et al. These observations are supported by recent studies in which Stadlober analyzed this distribution in detail and tested its adequacy for linguistic data.

Stadlober As can be seen, the results are good or even excellent in all cases; in fact, as opposed to all other distributions discussed above, the Consul-Jain GPD is able to model all data samples given by Fucks.
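For orientation, the probability function of the Consul-Jain generalized Poisson distribution can be sketched as follows, in the non-displaced form with x = 0, 1, 2, ... and the parameter range restricted to a > 0 and 0 ≤ b < 1. The moment estimates use the known mean a/(1 − b) and variance a/(1 − b)³. The function names are illustrative, and the fitting procedure used by Stadlober is not reproduced here.

```python
import math

def gpd_pmf(x, a, b):
    """P(X = x) = a * (a + b*x)**(x - 1) * exp(-(a + b*x)) / x!, computed via logs."""
    log_p = (math.log(a) + (x - 1) * math.log(a + b * x)
             - (a + b * x) - math.lgamma(x + 1))
    return math.exp(log_p)

def gpd_moment_estimates(mean, variance):
    """Solve mean = a / (1 - b) and variance = a / (1 - b)**3 for a and b."""
    b = 1.0 - math.sqrt(mean / variance)
    a = mean * (1.0 - b)
    return a, b

a, b = 1.2, 0.3
gpd_probs = [gpd_pmf(x, a, b) for x in range(400)]
gpd_mean = sum(x * p for x, p in enumerate(gpd_probs))
print(round(sum(gpd_probs), 6), round(gpd_mean, 6), round(a / (1 - b), 6))
```

For b = 0 the function reduces to the ordinary Poisson probabilities, which is one way to see the GPD as an extension of the Poisson model by a second parameter.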

It can also be seen from Table 2. In this respect, i. As to this problem, it seems however important to state that this is not a problem specifically related to the GPD; rather, any mixture of distributions will cause the very same problems.

In this respect, it is important that other distributions which imply no mixtures can also be derived from 2. It would go beyond the frame of the present article to discuss the various extensions and modifications in detail here. As a result, there seems to be increasing reason to assume that there is indeed no unique overall distribution which might cover all linguistic phenomena; rather, different distributions may be adequate with regard to the material studied.

This assumption has been corroborated by a lot of empirical work on word length studies from the second half of the s onwards. Best More often than not, the relevant analyses have been made with specialized software, usually the Altmann Fitter. This is an interactive computer program for fitting theoretical univariate discrete probability functions to empirical frequency distributions; fitting starts with the common point estimates and is optimized by way of iterative procedures.

There can be no doubt about the merits of such a program. Now, the door is open for inductive research, too, and the danger of arriving at ad-hoc solutions is more virulent than ever before. What is important, therefore, at present, is an abductive approach which, on the one hand, has theory-driven hypotheses at its background, but which is open for empirical findings which might make it necessary to modify the theoretical assumptions.

In addition to the C values of the discrepancy coefficient, the values for parameters a and b as a result of the fitting are given. As can be seen, fitting results are really good in all cases. As to the data analyzed, at least, the hyper-Poisson distribution should be taken into account as an alternative model, in addition to the GPD, suggested by Stadlober. Comparing these two models, a great advantage of the GPD is the fact that its reference value can be very easily calculated; this is not so convenient in the case of the hyper-Poisson distribution.
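A minimal sketch of the hyper-Poisson probabilities may make the role of the two parameters concrete. It assumes the usual non-displaced form, in which P(x) is proportional to a^x divided by the rising factorial b(b + 1)...(b + x − 1), and it approximates the normalizing constant, the confluent hypergeometric series 1F1(1; b; a), by a truncated sum. The function name and the truncation point are choices made for this example.

```python
import math

def hyper_poisson(a, b, max_x):
    """Probabilities for x = 0..max_x, with P(x) proportional to a**x / b^(x).

    b^(x) = b * (b + 1) * ... * (b + x - 1) is the rising factorial; the sum of
    the terms approximates 1F1(1; b; a) when max_x is large enough.
    """
    terms = [1.0]
    for x in range(1, max_x + 1):
        terms.append(terms[-1] * a / (b + x - 1))
    norm = sum(terms)
    return [t / norm for t in terms]

# With b = 1 the rising factorial is x!, so the Poisson distribution results.
as_poisson = hyper_poisson(1.5, 1.0, 80)
# A hypothetical parameter pair with b different from 1.
example = hyper_poisson(1.5, 2.5, 80)
print(round(as_poisson[0], 6), round(example[0], 6))
```

Because the normalizing constant is itself a series, the moments of the hyper-Poisson distribution are less convenient to compute than those of the GPD.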

On the other hand, the generation of the hyper-Poisson distribution does not involve any secondary distribution to come into play; rather, it can be directly derived from equation 2. In its 1-displaced form, equation 2. To summarize, we can thus state that the synergetic approach as developed by Wimmer et al. Generally speaking, the authors understand their contribution to be a logical extension of their synergetic approach, unifying previous assumptions and empirical findings.

The individual hypotheses belonging to the proposed system have been set up earlier; they are well-known from empirical research of the last decades, and they are partly derived from different approaches. Specifically, Wimmer et al. it is confined to the first four terms of formula 2. Many distributions can be derived from 2.

It can thus be said that the general theoretical assumptions implied in the synergetic approach have received strong empirical support. One may object that this is only one of several possible models, only one theory among others. However, thus far, we do not have any other which is as theoretically sound, and as empirically supported, as the one presented.

On the other hand, hardly any systematic studies have been undertaken to empirically study possible influencing factors, neither as to the data basis in general i. Ultimately, the question, what may influence word length frequencies, may be a bottomless pit; after all, any text production is an historically unique event, the boundary conditions of which may never be reproduced, at least not completely.

Still, the question remains open if particular factors may be detected, the relevance of which for the distribution of word length frequencies may be proven. This point definitely goes beyond a historical survey of word length studies; rather, it directs our attention to research desires, as a result of the methodological discussion above.

Best, Karl-Heinz ed. Brainerd, Barron Weighing evidence in language and literature: A statistical approach. Chebanow Chebanow, S. Dewey, G. Cambridge; Mass. Elderton, William P. London, Fucks, Wilhelm Nach allen Regeln der Kunst. Leningrad, Nauka: Dordrecht, NL. Grzybek, Peter ed. Ljubljana etc. The Impact of Word Length. Kromer, Victor V. Materialy konferencii. Materialy konferencii. Markov, Andrej A. Mendenhall, Thomas C. Studien zum 1. Internationalen Bulgaristikkongress Sofia Piotrovskij, Rajmond G. Williams, Carrington B. Wimmer, Gejza; Altmann, Gabriel Thesaurus of univariate discrete probability distributions. Zerzwadse, G. In: Grundlagenstudien aus Kybernetik und Geisteswissenschaft 4,

The idea is derived from the Fitts-Garner controversy in mathematical psychology cf. Fitts et al. Obviously, the problem is quite old but has not penetrated into linguistics as yet.

A word in a text can be thought of as a realization of a number of different alternative possibilities, see Fig.

They can even be understood in different ways, e. What is neglected when correlating the lengths and the frequencies of words in real texts is the fact that for the text producer there is not at all free choice of all existing words at every moment.

Trying to fill in the blank is a model for determining the uncertainty of the missing word. It must be noted that SIC or HIC are associated not only with words but also with whole phrases or clauses, so that they represent rather polystratic structures and sequences.

The present approach is the first approximation at the word level. Preparation In order to illustrate the variables which will be treated later on, let us first define some quantities.

The cardinality of the set X will be symbolized as |X|. P the set of positions in a text, whatever the counting unit. The elements of this set are tokens tijk, i. If the type and its token are known, the indices i and j can be left out. The elements of this set, aij, are not necessarily synonyms but in the given context they are admissible alternatives of the given token.

The index k can be omitted. |Aij| the number of elements in the set Aij, i. This entity can be called tokeme. By defining Mij, we are able to distinguish between tokens of the same type but with different alternatives and different number ai; so they are different tokemes.

Example Using Table 9 cf. The text is reproduced word for word in the second column of Table 9 p. The length is measured in terms of the number of syllables of the word.

Thus, e. We can define it for types too: then it is the mean of all LLs of all tokens of this type in the text. LL is usually a positive real number. The errors compensate each other in the long run, so the distribution of L equals that of LL. It can be ascertained for any text. We can set up the hypothesis that Hypothesis 1 The longer the token, the longer the tokeme at the given position.
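The quantities defined so far can be made concrete with a small sketch. It assumes that a tokeme consists of the token together with its admissible alternatives and that LL is the mean syllable length over all members of the tokeme, which is consistent with the rows of Table 9 (for example Krimi with the alternatives Kriminalroman and Thriller, tokeme size 3). The class name and the syllable counts entered for the alternatives are supplied for this illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Tokeme:
    """A token at one text position plus its admissible alternatives."""
    token: str
    token_length: int                                  # L, in syllables
    alternatives: dict = field(default_factory=dict)   # alternative -> syllables

    @property
    def size(self):
        """Tokeme size: the token itself plus its alternatives."""
        return 1 + len(self.alternatives)

    @property
    def latent_length(self):
        """LL: mean length of all tokeme members (assumed definition)."""
        total = self.token_length + sum(self.alternatives.values())
        return total / self.size

# Rows taken from Table 9; syllable counts of alternatives are assumptions.
krimi = Tokeme("Krimi", 2, {"Kriminalroman": 5, "Thriller": 2})
weltuhr = Tokeme("Weltuhr", 2, {"Uhr": 1})
bahnhof = Tokeme("Bahnhof", 2)

for t in (krimi, weltuhr, bahnhof):
    print(t.token, t.token_length, t.size, t.latent_length)
```

A token without alternatives has tokeme size 1, so its latent length coincides with its own length.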

This hypothesis can be tested in different ways. As an empirical consequence of hypothesis 1 it can be expected that the distribution of L and LL is approximately equal. A token of length L has alternatives which are on average the same length, i.

Since LL is a positive real number it is an average we divide the range of lengths in the text in continuous intervals and ascertain the number of cases the frequency in the particular intervals. This can easily be made using the third and the sixth column of Table 9 p.

The result is presented in Table 3. It can easily be shown that the frequencies differ non-significantly. Since the distributions are equal, they must abide by the same theoretical distribution. Using the well corroborated theory of word length cf. Wimmer et al. As a matter of fact, for the distribution of LL we take the middles of the intervals as variable.

It would, perhaps, be more correct to use for both data the continuous equivalent of the geometric distribution, namely the exponential distribution — however, again not quite correct. Thus we adhere to discreteness without loss of validity. The result of fitting the geometric distribution to the data from Table 3.

Length range in tokemes In each tokeme the lengths of words (local latent lengths) are themselves distributed in a special way. It is not fruitful to study them individually since the majority of them is deterministic i.

It is more productive to consider the ranges of latent lengths for the whole text. For this phenomenon we set up the hypothesis Hypothesis 2 The range of latent lengths within the tokemes is geometric-Poisson. Since the latent length distribution LLx is geometric and each LLx is almost identical on average with that of Lx (the alternatives tend to keep the length of the token), the range of the latent lengths in the tokeme is very restricted.

The deviations seem to be distributed at random, i. Evidently, the fitting is very good and corroborates in addition hypothesis 1, too. Thus latent length is a kind of latent mechanism controlling the token length at the given position.

Latent length is not directly measurable, it is an invisible result of the complex flow of information. Nevertheless, it can be made visible — as we tried to do above — or it can be approximately deduced on the basis of token lengths.

Information Content of Words in Texts 99 Table 3. Stable latent length Consider the deviations of the individual token lengths from those of the respective tokeme lengths as shown in Table 9 p. This encourages us to set up the hypothesis that Hypothesis 3 There is no tendency to choose the smallest possible alternative at the given position in text.

The hypothesis can easily be tested. SIC of the text Above, we defined SIC of a type as the dual logarithm of the mean size of all tokeme sizes of the given type, as shown in formula 3. Two possibilities can be proposed. We shall use here 3. For the given text it can be computed directly using the fifth column of Table 9 p. We suppose that it is smaller the more formal the text is. We can construct a confidence interval around it.

Here the tokeme sizes build a sequence (a) 1, 16, 3, 2, 1, 8, 2, 1, ... Taking the dual logarithms we obtain a new sequence (b) 0, 4, 1. In order to control the information flow and at the same time to allow licentia poetica, zeros and non-zeros must display some pattern which is characteristic of different text types. Thus we obtain the two-state sequence (c) 0, 1, 1, 1, 0, 1, 1, 0, ... We begin with the examination of runs of 0 and 1 and set up the hypothesis that Hypothesis 4 The building of text blocks with zero uncertainty (0) and those with selection possibilities (1) is random i.
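The three sequences can be derived mechanically from the tokeme sizes. The sketch below uses only the eight sizes quoted above (the remainder of the text is not reproduced here) and additionally collects the runs of zeros and ones that Hypothesis 4 refers to. Variable names are illustrative.

```python
import math
from itertools import groupby

# First eight tokeme sizes as quoted in the text.
sizes = [1, 16, 3, 2, 1, 8, 2, 1]

# Dual (base-2) logarithms of the sizes.
information = [math.log2(s) for s in sizes]

# Two-state sequence: 0 = no choice (size 1), 1 = at least one alternative.
states = [0 if s == 1 else 1 for s in sizes]

# Runs of identical states as (state, run length) pairs.
runs = [(state, len(list(group))) for state, group in groupby(states)]

print([round(v, 3) for v in information])
print(states)
print(runs)
```

A runs test would compare the observed number of runs with the number expected under random ordering; with only eight positions this is an illustration of the bookkeeping, not a test.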

In practice it means that the runs of zeros and ones are distributed at random. In our text see Table 9 , p. Another possibility is to consider sequence (c) as a two-state Markov chain or sequences (a) and (b) as multi-state Markov chains. In the first approximation we consider case (c) as a dynamical system and compute the transition matrix between zeros and ones. Taking the powers of the above matrix we can easily see that the probabilities are stable to four decimal places with P^4 yielding a matrix with equal rows [0.
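The Markov-chain view can be sketched in a few lines. The example estimates a two-state transition matrix from a 0/1 sequence and raises it to a power by repeated multiplication. It uses only the eight-element prefix of sequence (c) quoted earlier, so its numbers differ from those obtained for the full text; the helper names are illustrative.

```python
def transition_matrix(states):
    """Estimate a 2x2 transition matrix from a sequence of 0s and 1s."""
    counts = [[0, 0], [0, 0]]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left would give an empty row; keep it at zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def mat_mul(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_power(p, n):
    """n-step transition matrix P^n for n >= 1."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result

prefix = [0, 1, 1, 1, 0, 1, 1, 0]   # first eight elements of sequence (c)
p1 = transition_matrix(prefix)
p30 = mat_power(p1, 30)
print(p1)
print([[round(v, 4) for v in row] for row in p30])
```

Once both rows of P^n agree to the desired precision, the chain has effectively reached its stationary distribution; the smallest such n is the exponent referred to in the text.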

Since P^n represents the n-step transition probability matrix, the exponent n is also a characteristic of the text. Alternatives, length and frequency Since SIC has not been embedded in the network of synergetic linguistics as yet, it is quite natural to ask whether it is somehow associated with basic language properties such as length and frequency.

In the present paper all other properties e. The data for testing can easily be prepared using Table 9 p. Below we show merely lengths 4 and 5 because the full Table is very extensive cf. Table 3. This results in Table 3.

In such cases they must be taken into account explicitly. In our case this leads to partial differential equations. Let us assume that length has a constant effect, i.

Fitting this curve to the data in Table 3. This is, of course, merely the first approximation using data smoothing because the text was rather a short one. Interpretation and outlook Looking at Tables 3. But we recognize that the influence of frequency is considerably weaker than that of length.

If we regard 3. The direction of this influence is even more astonishing: with increasing length the number of alternatives is increasing too, longer words are more often freely chosen, while one perhaps would expect a preference for choosing shorter words.

Since the e-function plays an important role in psychology, for example in cognitive tasks like decision making, we suppose that word length is a variable which is connected with some basic cognitive psychological processes. Andersen, S. Attneave, F. New York. Berlyne, D. Coombs, C. Englewood Cliffs, N. Evans, T. Fitts, P. Garner, W. Hartley, R.

Piotrowski, R. Wimmer, G. June 21—23, , Graz University. Bahnhof 2 – 1 2. Altona 3 – 1 3. Kinderbuch 3 – 1 3. Stiftung 2 – 1 2. Deutschland 2 – 1 2.

Kinder- 2 – 1 2. Krimi 2 Kriminalroman, Thriller 3 3. Kinderbuchautor 5 Autor, Schriftsteller, Kinderbuchschriftsteller 4 4. Andreas 3 – 1 3. Anfang 2 Beginn, Start 3 1. Jungen 2 – 1 2. Weltuhr 2 Uhr 2 1. Seiten 2 – 1 2. Aktion 3 Leistung, Tat, Sache 4 2. Guinness 2 – 1 2. Rekorde 3 – 1 3. Altona 3 Hamburg 2 2. Bahnhofshalle 4 Halle, Vorhalle, Wandelhalle 4 3. Szenen 2 Bilder, Teile, Partien 4 2.

Detektive 4 – 1 4.

Introduction

This paper concentrates on the question of zero-syllable words i. As an essential result of these studies it turned out that, due to the specific structure of syllable and word in Slavic languages, (a) several probability distribution models have to be taken into account, and (b) this depends on whether zero-syllable words are considered a separate word class in their own right or not.

Placing particular emphasis on zero-syllable words, we examine whether and how the major statistical measures are influenced by the theoretical definition of the above-mentioned units.

We do not, of course, generally neglect the question whether and how the choice of an adequate frequency model is modified depending on these preconditions; it is simply not pursued in this paper, which has a different emphasis. Basing our analysis on Slovenian texts, we are mainly concerned with the following two questions: (a) How can word length reasonably be defined for automatic analyses, and (b) what influence does the determination of the unit of measurement i.

Thus, subsequent to the discussion of (a), it will be necessary to test how the decision to consider zero-syllable words as a specific word length class in its own right influences the major statistical measures. Any answer to the problem outlined should lead to the solution of specific problems: among others, it should be possible to see to what extent the proportion of x-syllable words can be interpreted as a discriminating factor in text typology, to give but one example.

In a way, the scope of this study may be understood to be more far-reaching, however, insofar as it focuses on relevant preconditions which are of general methodological importance. To this end, we will empirically test, on the basis of Slovenian texts, which effects can be observed depending on diverging definitions of these units.

Word Definition

Without a doubt, a consistent definition of the basic linguistic units is of utmost importance for the study of word length.

Zero-syllable Words in Determining Word Length

Irrespective of the theoretical problems of defining the word, there can be no doubt that the latter is one of the main formal textual and perceptive units in linguistics, which has to be determined in one way or another.

Since there is no generally accepted definition that could serve as a standard for our purposes, it seems reasonable to discuss the relevant available definitions. As a result, we should then choose one intersubjectively acceptable definition, adequate for dealing with the concrete questions we are pursuing.

Taking into consideration syntactic qualities, and differentiating autosemantic vs. Subsequent to this discussion of three different theoretical definitions, we will try to work with one of these definitions, which we require to be acceptable on an intersubjective level.

The decisive criterion in this next step will be a sufficient degree of formalization, allowing for automatic text processing and analysis. Rather, what can be realized is an attempt to show which consequences arise if one makes a decision in favor of one of the described options.

Since this, too, cannot be done in detail for all of the above-mentioned alternatives, within the framework of this article, there remains only one reasonable way to go: We will tentatively make a decision for one of the options, and then, in a number of comparative studies, empirically test which consequences result from this decision as compared to the theoretical alternatives.

This will be briefly analyzed in the following work and only in the Slovenian language, but under special circumstances, and with specific modifications. In the previous discussion, we already pointed out the weaknesses of this definition; therefore, we will now have to explain why we regard it as reasonable to take just the graphematic-orthographic definition as a starting point.

It can therefore be expected that the results allow for some intersubjective comparability, at least to a particular degree. (b) Second, since the definition of the units involves complex problems of quantifying linguistic data, this question can be solved only by way of the assumption that any quantification is some kind of a process which needs to be operationally defined. The word thus being defined according to purely formal criteria, i.

This, in turn, can serve as a guarantee that an analysis on all other levels of language i. The definition chosen above is, of course, of relevance for the automatic processing and quantitative analysis of text(s). In detail, a number of concrete textual modifications result from the above-mentioned definition. In case of single elements, they are processed according to their syllabic structure. Particularly with regard to foreign language elements and passages, attention must be paid to syllabic and non-syllabic elements which, for the two languages under consideration, differ in function: cf.

It should be noted here that, irrespective of these secondary manipulations, the original text structure remains fully recognizable to a researcher; in other words, the text remains open for further examinations e. Altmann et al. Unuk 3. In order to measure word length automatically, it is therefore not primarily necessary to define the syllable boundaries; rather, it is sufficient to determine all those units (phonemes) which are characterized by an increased sonority and thus have syllabic function.
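The idea of counting syllabic units instead of locating syllable boundaries can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure used in the study: the vowel inventory, the simple rule for syllabic r, and the function name `count_syllables` are all hypothetical choices made for the example.

```python
# Minimal sketch: word length in syllables = number of syllabic units.
# ASSUMPTION: the five vowel letters are always syllabic, and "r" is
# syllabic when it has no vowel directly before or after it.
VOWELS = set("aeiou")


def count_syllables(word: str) -> int:
    """Count syllabic units in an orthographic word (illustrative rule set)."""
    w = word.lower()
    count = 0
    for i, ch in enumerate(w):
        if ch in VOWELS:
            count += 1
        elif ch == "r":
            prev_is_vowel = i > 0 and w[i - 1] in VOWELS
            next_is_vowel = i + 1 < len(w) and w[i + 1] in VOWELS
            if not prev_is_vowel and not next_is_vowel:
                count += 1  # syllabic r, as in "vrt"
    return count


if __name__ == "__main__":
    for token in ["k", "v", "vrt", "hiša", "beseda"]:
        print(token, count_syllables(token))
```

Under these assumptions, a preposition consisting of a single consonant receives length 0, which is exactly the case the following sections discuss.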

On the other hand, empirical sonographic studies show that there are no bilabial fricatives in Slovenian standard language cf. Srebot-Rejec Of course,

2 For further discussions on this topic see: Tivadar, Srebot-Rejec, Slovenski pravopis; cf.

On the Question of Zero-syllabic Words

The question whether there is a class of zero-syllabic words in its own right is of utmost importance for any quantitative study on word length. With regard to this question, two different approaches can be found in the research on the frequency of x-syllabic words.

In this context, it will be important to see if consideration or neglect of this special word class results in statistical differences, and how much information consideration of them offers for quantitative studies. As can be seen, we are concerned with two zero-syllable prepositions and with corresponding orthographical-graphematic variants for their phonetic realizations. As opposed to this, these prepositions are treated as zero-syllable words in modern Slovenian; they thus exemplify the following general trend: original one-syllable words have been transformed into zero-syllable words.

Obviously, there are economic reasons for this reduction tendency. From a phonematic point of view, one might add the argument that these prepositions do not display any suprasegmental properties, i.

Following this diachronic line of thinking might lead one to assume that zero-syllable words should not, or need not, be considered as a specific class in linguo-statistic studies. Incidentally, the depicted trend i. Yet, as was said above, it is not our aim to provide a theoretical solution to this open question. Nor do we have to make a decision, here, whether zero-syllable words should or should not be treated as a specific class, i. Rather, we will leave this question open and shift our attention to the empirical part of our study, testing what importance such a decision might have for particular statistical models.

Descriptive Statistics

The statistical analyses are based on Slovenian texts, which are considered to represent the text corpus of the present study. The texts are divided into the following groups: literary prose, poetry, journalism. The detailed references for the prose and poetic texts are given in Tables 4.

Table 4. Based on these considerations, and taking into account that the text data basis is heterogeneous both with regard to content and text types, statistical measures, such as mean, standard deviation, skewness, kurtosis, etc.

Level I: The whole corpus is analyzed under two conditions, once considering zero-syllable words to be a separate class in their own right, and once not doing so. One can thus, for example, calculate relevant statistical measures or analyze the distribution of word length within one of the two corpora.

Level II: Corresponding groups of texts in each of the two corpora can be compared to each other: one can, for example, compare the poetic texts, taken as a group, in the corpus with zero-syllable words, with the corresponding text group in the corpus without zero-syllable words.

Level III: Individual texts are compared to each other. Here, one has to distinguish different possibilities: the two texts under consideration may be from one and the same text group, or from different text groups; additionally, they may be part of the corpus with zero-syllable words or the corpus without zero-syllable words.

Level IV: An individual text is studied without comparison to any other text.

A larger positive skewness implies a right skewed distribution. In the next step, we analyze which percentage of the whole text corpus is represented by x-syllable words.
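The descriptive measures named above (mean, standard deviation, skewness, kurtosis) can be computed directly from a word length frequency distribution. The following Python sketch uses moment-based (population) definitions; this choice, the function name `describe`, and the example frequencies are assumptions for illustration and are not taken from the study's data.

```python
from math import sqrt


def describe(freqs):
    """Moment-based descriptive measures for {word length: frequency}.

    ASSUMPTION: population moments are used (division by n); sample-corrected
    estimators would differ slightly.
    """
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n

    def central_moment(k):
        return sum(f * (x - mean) ** k for x, f in freqs.items()) / n

    m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)
    return {
        "mean": mean,
        "sd": sqrt(m2),
        "skewness": m3 / m2 ** 1.5,  # > 0 means a longer right tail
        "kurtosis": m4 / m2 ** 2,
    }


if __name__ == "__main__":
    # Hypothetical frequencies of 1-, 2-, 3- and 4-syllable words.
    print(describe({1: 5, 2: 3, 3: 1, 4: 1}))
```

A distribution in which short words dominate and long words are rare yields a positive skewness, in line with the remark on right skewed distributions.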

[Figure 4.: ... Three Text Types]

It should be noted that many poetic texts do not contain any zero-syllable words at all. Of the 51 poetic texts, only 26 contain such words.

Analysis of Mean Word Length in Texts

The statistical analysis is carried out twice, once considering the class of zero-syllable words as a separate category, and once considering them to be proclitics. Our aim is to answer the question whether the influence of the zero-syllable words on the mean word length is significant.

In the next step, concentrating on the mean word length value of all texts (Level I), two vector variables are introduced, each of them with components: W C 0 and W C. The i-th component of the vector variable W C 0 defines the mean word length of the i-th text including zero-syllable words.

In analogy to this, the i-th component of the vector variable W C gives the mean word length of the i-th text excluding zero-syllable words (see Table 4. In order to obtain a more precise structure of the word length mean values, the analyses will be run both over all texts of the whole corpus (Level I), and over the given number of texts belonging to one of the following three text types only (Level II): (i) literary prose (L), (ii) poetry (P), (iii) journalistic prose (J).
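How the two mean values of a single text differ can be sketched as follows. The sketch assumes that, when zero-syllable words are not counted as a class of their own, they attach to a neighbouring word as proclitics, so that they drop out of the word count while the syllable total stays the same. The function name `mean_word_lengths` and the example counts are hypothetical.

```python
def mean_word_lengths(syllable_counts):
    """Return (mean incl. zero-syllable words, mean excl. zero-syllable words).

    ASSUMPTION: excluded zero-syllable words are treated as proclitics, i.e.
    they reduce the number of words but not the number of syllables.
    """
    n_words = len(syllable_counts)
    n_zero = sum(1 for c in syllable_counts if c == 0)
    if n_words == 0 or n_words == n_zero:
        raise ValueError("text must contain at least one word with a syllable")
    total_syllables = sum(syllable_counts)
    with_zero = total_syllables / n_words
    without_zero = total_syllables / (n_words - n_zero)
    return with_zero, without_zero


if __name__ == "__main__":
    # Hypothetical text of six words, two of them zero-syllable prepositions.
    print(mean_word_lengths([0, 2, 1, 0, 2, 2]))
```

Collecting the first value over all texts gives the components of W C 0, the second those of W C; by construction the second value can never be smaller than the first.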

A scatterplot is a graph which uses a coordinate plane to show the relation (correlation) between two variables X and Y. Each point in the scatterplot represents one case of the data set. In such a graph, one can see if the data follow a particular trend: if both variables tend in the same direction (that is, if one variable increases as the other increases, or if one variable decreases as the other decreases), the relation is positive.

There is a negative relationship if one variable increases whereas the other decreases. The more tightly the data points are arranged around a negatively or positively sloped line, the stronger the relation. If the data points appear to be a cloud, there is no relation between the two variables. In the following graphical representations of Figure 4. In our case, the scatterplot shows a clear positive, linear dependence between mean word length in the texts (both with and without zero-syllable words), for each pair of variables.
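The strength of such a linear relation is commonly quantified by Pearson's correlation coefficient. The following Python sketch computes it from scratch; the function name `pearson_r` and the example values are illustrative assumptions, not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical mean word lengths of four texts, with and without
    # zero-syllable words.
    with_zero = [1.80, 1.95, 2.10, 2.30]
    without_zero = [1.88, 2.02, 2.21, 2.40]
    print(pearson_r(with_zero, without_zero))
```

Values close to +1 correspond to points lying tightly around a positively sloped line, values close to -1 to a negatively sloped line, and values near 0 to a cloud of points.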

This result is corroborated by a correlation analysis.

[Figure 4.: scatterplots of mean word length; surviving panel labels: W C 0, (b) Scatterplot W L vs., W P 0, (d) Scatterplot W J vs. W J 0]

As to our data, a strong dependence at the 0. Let us therefore take a look at the histograms of each of the eight new variables. The first pair of histograms cf. Figure 4. Still, we have to test these assumptions. Usually, either the Kolmogorov-Smirnov test or the Shapiro-Wilk test is applied in order to test whether the data follow the normal distribution. Since, in our case, the parameters of the distribution must be estimated from the sample data, we use the Shapiro-Wilk test.

This test is specifically designed to detect deviations from normality, without requiring that the mean or variance of the hypothesized normal distribution are specified in advance.

To determine whether the null hypothesis of normality has to be rejected, the probability associated with the test statistic i. If this value is less than the chosen level of significance such as 0. The obtained p-values support our assumptions, i. In the following analyses, we shall focus on the second analytical level, i. In order to test this, we can apply the t-test for paired samples. This test compares the means of two variables; it computes the difference between the two variables for each case, and tests if the average difference is significantly different from zero.
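The computation behind the paired-samples t-test just described can be sketched in a few lines of Python. The sketch returns only the test statistic and the degrees of freedom; obtaining a p-value requires the t distribution, which is left to a statistics package. The function name `paired_t_statistic` and the example values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(xs, ys):
    """t statistic and degrees of freedom for a paired-samples t-test.

    Tests whether the mean of the case-wise differences x_i - y_i is zero.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample standard deviation
    return mean(diffs) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical mean word lengths of four texts under the two conditions.
    without_zero = [2.0, 2.2, 2.4, 2.6]
    with_zero = [1.9, 2.0, 2.3, 2.3]
    t, df = paired_t_statistic(without_zero, with_zero)
    print(f"t = {t:.3f}, df = {df}")
```

A large absolute value of t relative to the t distribution with n - 1 degrees of freedom speaks against the hypothesis that the average difference is zero.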

This means that we test the following hypothesis: H0: There is no significant difference between the theoretical means i. Before applying the t-test, we have to test if the variables dL, dP, dJ are also normally distributed. As they are linear combinations of normally distributed variables, there is sound reason to assume that this is the case. The Shapiro-Wilk test yields the p-values given in Table 4. The histogram of the variable dP shows the same result cf.

In spite of the result of the Shapiro-Wilk test, we therefore apply a one-sample t-test, assuming that dP is normally distributed. Two distribution functions for variables which denote mean word length of texts with and without zero-syllable words have the same shape, but they are shifted, since their expected values differ. The following Figures 4. It should be noted that this conclusion cannot be generalized. As long as the variables dL, dP, dJ are normally distributed, our statement is true.

Yet, normality has to be tested in advance, and we cannot generally assume normally distributed variables. In the next step we show the box plots and error bars of the variables dL, dP, dJ. A box plot is a graphical display which shows a measure of location (the median-center of the data), a measure of dispersion (the interquartile range, i.

Horizontal lines are drawn both at the median — the 50th percentile q0. The horizontal lines are joined by vertical lines to produce the box. A vertical line is drawn up from the upper quartile to the most extreme data point i. The most extreme data point thus is min x n , q0. Short horizontal lines are added in order to mark the ends of these vertical lines. The difference in the mean values of the three samples is obvious; also it can clearly be seen that all three samples produce symmetric distributions, variable dJ displaying the largest variability.
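The quantities drawn in such a box plot can be computed as in the following Python sketch. Since the exact whisker rule is not fully legible in the text above, the sketch assumes the common convention that whiskers reach to the most extreme data points within 1.5 interquartile ranges of the quartiles; this convention, the quartile interpolation method, and the function name `box_plot_summary` are assumptions.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Median, quartiles, whisker ends and outliers of a sample.

    ASSUMPTION: whiskers extend to the most extreme observations inside
    [q1 - whisker * IQR, q3 + whisker * IQR]; anything outside is an outlier.
    """
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "median": q2,
        "q1": q1,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


if __name__ == "__main__":
    print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

The box spans q1 to q3 with a line at the median, and the whisker ends are always actual data points rather than the fence values themselves.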

As can be seen, the confidence intervals do not overlap; we can therefore conclude that the percentage of zero-syllable words may allow for a distinction between different text types. It turns out that the number of syllables per word i.
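The comparison of confidence intervals can be illustrated with a short Python sketch. It uses a normal approximation for the interval of the mean; for small samples a t-based interval would be somewhat wider. The function names `mean_confidence_interval` and `intervals_overlap`, as well as the example values, are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, level=0.95):
    """Confidence interval for the mean (ASSUMPTION: normal approximation)."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95 %
    centre = mean(values)
    return centre - z * standard_error, centre + z * standard_error


def intervals_overlap(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical differences in mean word length for two text groups.
    d_group_1 = [0.10, 0.12, 0.11, 0.13, 0.09]
    d_group_2 = [0.03, 0.04, 0.02, 0.05, 0.03]
    ci_1 = mean_confidence_interval(d_group_1)
    ci_2 = mean_confidence_interval(d_group_2)
    print(ci_1, ci_2, intervals_overlap(ci_1, ci_2))
```

Non-overlapping intervals are a conservative visual indication that the group means differ; the formal decision still rests on the tests described above.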

This class of words may either be considered to be a separate word-length class in its own right, or as clitics. Without making an a priori decision as to this question, the mean word length of Slovenian texts is analyzed in the present study under these two conditions, in order to test the statistical effect of the theoretical decision. In the present study, the material is analyzed from two perspectives only: mean word length is calculated both in the whole text corpus (Level I), and in three different groups of text types, representing Level II: literary, journalistic, poetic.

These empirical analyses are run under two conditions, either including the zero-syllable words as a separate word length class in its own right, or not doing so.

Based on these definitions and conditions, the major results of the present study may be summarized as follows: (1) As a first result, the proportion of zero-syllable words turned out to be relatively small i. Furthermore, it can be shown that the mean word lengths in texts under both conditions are highly correlated with each other; the positive linear trend, which is statistically tested in the form of a correlation analysis and graphically represented in Figure 4.

As a result, it turns out that mean word length is normally distributed in the three text groups analyzed (Level II), but, interestingly enough, not in the whole corpus (Level I). Based on this finding, further analyses concentrate on Level II only. Therefore, t-tests are run in order to compare the mean lengths between the three groups of texts on the basis of the differences between the mean lengths under both conditions. As a result, the expected values of mean word length differ significantly between all three groups.

To summarize, we thus obtain a further hint at the well-organized structure of word length in texts.

Altmann, G. Figge zum Stuttgart.
Bajec, A. Predlogi in predpone.
Best, K. Genzor; S. Wimmer; G. Altmann; R.
Girzig, P.
Grotjahn, R.
Grzybek, P.
Jachnow, H.
Lehfeldt, W. Jachnow ed.
Lekomceva, M. Tom 1.
Rottmann, Otto A.
Royston, P.
Schaeder, B.
Srebot-Rejec, T.
Tivadar, H.
Unuk, D. Doktorska disertacija.

Figure 5.

Many of the relevant psychological findings seem to be interesting for linguistics as well: the results reported e. Of particular linguistic interest is the question whether the serial position effects shown in the recall of lists of unconnected words also appear in the recall of real sentences.

Are the underlying processes also efficient in real sentence processing and in connected discourse? It is the aim of this volume to contribute to a better mutual exchange of ideas. Generally speaking, the aim of the conference was to diagnose and to discuss the state of the art in word length studies, with experts from the above-mentioned disciplines. Since, with the exception of the introductory essay, the articles appear in alphabetical order, they shall be briefly commented upon here in relation to their thematic relevance.

The introductory contribution by Peter Grzybek on the History and Methodology of Word Length Studies attempts to offer a general starting point and, in fact, provides an extensive survey on the state of the art. This contribution concentrates on theoretical approaches to the question, from the 19th century up to the present, and it offers an extensive overview not only of the development of word length studies, but of contemporary approaches as well.

The contributions by Gejza Wimmer from Slovakia and Gabriel Altmann from Germany, as well as the one by Victor Kromer from Russia, follow this line of research, in so far as they are predominantly theory-oriented.

From this point of view their publications show the efficiency of cooperation between the different fields. Another block of contributions represents concrete analyses, though from differing perspectives, and with different objectives. Applying the theoretical framework outlined by Altmann, Wimmer, and their colleagues, this is one example of theoretically modelling word length frequencies in a number of texts of a given language, Lower Sorbian in this case.

A number of further contributions discuss the relevance of word length studies within a broader linguistic context. The remaining three contributions have the common aim of shedding light on the interdependence between word length and other linguistic units. The volume, thus offering a broad spectrum of word length studies, should be of interest not only to experts in general linguistics and text scholarship, but in related fields as well.

Only a closer co-operation between experts from the above-mentioned fields will provide an adequate basis for further insight into what is actually going on in language(s) and text(s), and it is the hope of this volume to make a significant contribution to these efforts. This volume would not have seen the light of day without the invaluable help and support of many individuals and institutions.

No doubt, there is more than one theory of science, and this is not the place to discuss the philosophical implications of this field in detail. Furthermore, it has become commonplace to reject the concept of a unique theory of science, and to distinguish between a general theory of science and specific theories of science, relevant for individual sciences or branches of science.

This tendency is particularly strong in the humanities, where 19th century ideas as to the irreconcilable antagonism of human and natural, of weak and hard sciences, etc. As far as linguistics, which is at stake here, is concerned, the self-evaluation of this discipline clearly is that it fulfills the requirements of being a science, as Smith 26 correctly puts it: Linguistics likes to think of itself as a science in the sense that it makes testable, i.

The relevant question is not, however, to what extent linguistics considers itself to be a science; rather, the question must be to what extent linguistics satisfies the needs of a general theory of science. And the same holds true, of course, for related disciplines focusing on specific language products and processes, starting from subfields such as psycholinguistics, up to the area of text scholarship in general.

Generally speaking, it is commonplace to say that there can be no science without theory, or theories. Altmann 1. In each of these cases, we are concerned with not more and not less than a system of concepts whose function it is to provide a consistent description of the object under study.

Smith 14 But the hallmark of a scientific theory is that it gives rise to hypotheses which can be the object of rational argumentation. Now, it goes without saying that the existence of a system of concepts is necessary for the construction of a theory: yet, it is a necessary, but not sufficient condition cf.

Altmann 2: One should not have the illusion that one constructs a theory when one classifies linguistic phenomena and develops sophisticated conceptual systems, or discovers universals, or formulates linguistic rules. Though this predominantly descriptive work is essential and stands at the beginning of any research, nothing more can be gained but the definition of the research object [.

Altmann 3: The main part of a theory consists of a system of hypotheses. A scientific theory is a system in which some valid hypotheses are tenable and almost no hypotheses untenable. Thus, theories presuppose the existence of specific hypotheses, the formulation of which, following Bunge, implies the three main requisites: (i) the hypothesis must be well formed (formally correct) and meaningful (semantically nonempty) in some scientific context; (ii) the hypothesis must be grounded to some extent on previous knowledge, i.

In a next step, therefore, different levels in conjecture making may be distinguished, depending on the relation between hypothesis (h), antecedent knowledge (A), and empirical evidence (e); Figure 1. Figure 1. In the beginnings of this tradition, predominantly in the Neogrammarian approach to Indo-European language history, these laws, though of descriptive rather than explanative nature, allowed no exceptions to the rules, and they were indeed understood as deterministic laws.

At this stage, the phonetic law was not considered to be a law of nature [Naturgesetz], as yet; rather, we are concerned with metaphorical comparisons, which nonetheless signify a clear tendency towards scientific exactness in linguistics. Consequently, for Schleicher, the science of language must be a natural science, and its method must by and large be the same as that of the other natural sciences. Many a scholar in the second half of the 19th century would elaborate on these ideas: if linguistics belonged to the natural sciences, or at least worked with equivalent methods, then linguistic laws should be identical with the natural laws.

Natural laws, however, were considered mechanistic and deterministic, and partly continue to be even today. Consequently, in the mids, scholars such as August Leskien, Hermann Osthoff, and Karl Brugmann repeatedly emphasized that the sound laws they studied were exceptionless.

Every scholar admitting exceptions was condemned as being addicted to subjectivism and arbitrariness. The rigor of these claims began to be heavily discussed from the s on, mainly by scholars such as Berthold G.

Verner —96 put it in Curiously enough, this was almost the very same year that Austrian physicist Ludwig Boltzmann — re-defined one of the established natural laws, the second law of thermodynamics, in terms of probability. As will be remembered, the first law of thermodynamics implies the statement that the energy of a given system remains constant without external influence. No claim is made as to the question, which of various possible states, all having the same energy, is at stake, i.

In fact, this idea may be regarded as the foundation of statistical mechanics, as it was later called, describing thermodynamic systems by reference to the statistical behavior of their constituents. What Boltzmann thus succeeded in doing was in fact nothing less than delivering proof that the second law of thermodynamics is not a natural law in the deterministic understanding of the term, as was believed in his time, and is still often mistakenly believed, even today.

In fact, this question is of utmost relevance in theoretical physics, still today or, perhaps, more than ever before. Historically speaking, this aversion has been supported by the spirit of the time, when scholars like Dilthey 27 established the hermeneutic tradition in the humanities and declared singularities and individualities of socio-historical reality to be the objective of the humanities.

Ultimately, this would result in what Snow would term the distinction of Two Cultures, in the s, a myth strategically upheld even today. This myth is apt to perpetuate the overall skepticism as to mathematical methods in the field of the humanities.

Mathematics, in this context, tends to be discarded since it allegedly neglects the individuality of the object under study. However, mathematics can never be a substitute for theory; it can only be a tool for theory construction (Bunge). Ultimately, in science as well as in everyday life, any conclusion as to the question whether observed or assumed differences, relations, or changes are essential or merely due to chance must involve a decision.

In everyday life, this decision may remain a matter of individual choice; in science, however, it should obey conventional rules. More often than not, in the realm of the humanities, the empirical test of a given hypothesis has been replaced by the acceptance of the scientific community; this is only possible, of course, because, more often than not, we are concerned with specific hypotheses, as compared to the above Figure 1.

Actually, this is the reason why mathematics in general, and particularly statistics as a special field of it, is so essential to science: ultimately, the crucial function of mathematics in science is its role in the expression of scientific models. Observing and collecting measurements, as well as hypothesizing and predicting, typically require mathematical models. In this context, it is important to note that the formation of a theory is not identical to the simple transformation of intuitive assumptions into the language of formal logic or mathematics; not each attempt to describe!

Rather, it is important that there be a model which allows for formulating the statistical hypotheses in terms of probabilities.

On The Science of Language In Light of The Language of Science

At this moment, human sciences in general, and linguistics in particular, tend to bring forth a number of objections, which should be discussed here in brief cf. Altmann 5ff. Bunge Thus, although linguistics, text scholarship, etc. Altmann ff. The formulation of a linguistic hypothesis, usually of qualitative kind.

The linguistic hypothesis must be translated into the language of statistics; qualitative concepts contained in the hypothesis must be transformed into quantitative ones, so that the statistical models can be applied to them. This may lead to a re-formulation of the hypothesis itself, which must have the form of a statistical hypothesis.

Furthermore, a mathematical model must be chosen which allows the probability to be calculated with which the hypothesis may be valid with regard to the data under study. Data have to be collected, prepared, evaluated, and calculated according to the model chosen. The result obtained is represented by one or more digits, by a particular function, or the like.

Its statistical evaluation leads to an acceptance or rejection of the hypothesis, and to a statement as to the significance of the results. The result must be linguistically interpreted, i. Now what does it mean, concretely, if one wants to construct a theory of language in the scientific understanding of this term? According to Altmann 5, designing a theory of language must start as follows: When constructing a theory of language we proceed on the basic assumption that language is a self-regulating system all of whose entities and properties are brought into line with one another in some way or other.

From this perspective, general systems theory and synergetics provide a general framework for a science of language; the statistical formulation of the theoretical model thus can be regarded to represent a meta-linguistic interface to other branches of sciences.

As a consequence, language is by no means understood as a natural product in the 19th century understanding of this term; neither is it understood as something extraordinary within culture. Most reasonably, language lends itself to being seen as a specific cultural sign system. Culture, in turn, offers itself to be interpreted in the framework of an evolutionary theory of cognition, or of evolutionary cultural semiotics, respectively. Culture thus is defined as the cognitive and semiotic device for the adaptation of human beings to nature.

In this sense, culture is a continuation of nature on the one hand, and simultaneously a reflection of nature on the other — consequently, culture stands in an isologic relation to nature, and it can be studied as such.

Primarily, language is understood as a sign system serving as a vehicle of cognition and communication. Based on the further assumption that communicative processes are characterized by some kind of economy between the participants, language, regarded as an abstract sign system, is understood as the economic result of communicative processes.

Rather, we are concerned with a permanent process of mutual adaptation, and of a specific interrelation of partly contradictory forces at work, leading to a specific dynamics of antagonistic interest forces in communicative processes. Communicative acts, as well as the sign system serving communication, thus represent something like a dynamic equilibrium.

In principle, this view was delineated by G. Zipf as early as in the s and 40s (cf. Zipf). Today, Zipf is mostly known for his frequency studies, mainly on the word level; however, his ideas have been applied to many other levels of language too, and have been successfully transferred to other disciplines as well. It would be going too far to discuss the relevant ideas in detail here; still, the basic implications of this approach should be presented in order to show that the focus on word length chosen in this book is far from accidental.

Word Length in a Synergetic Context

Word length is, of course, only one linguistic trait of texts, among others. In this sense, word length studies cannot be but a modest contribution to an overall science of language. However, a focus on the word is not accidental, and the linguistic unit of the word itself is far from trivial.

Rather, word length is an important factor in a synergetic approach to language and text, and it is by no means an isolated linguistic phenomenon within the structure of language. The question here cannot be, of course, how far each of the units mentioned is equally adequate for linguistic models, how far their definitions should be modified, or how far there may be further levels, particularly with regard to specific text types such as poems, for example, where verses and stanzas may be more suitable units.

At closer inspection cf. Table 1. Consequently, on each of these levels, the re-occurrence of units results in particular frequencies, which may be modelled with recourse to specific frequency distribution models.

To give but one example, the famous Zipf-Mandelbrot distribution has become a generally accepted model for word frequencies. Models for letter and phoneme frequencies have recently been discussed in detail. It turns out that the Zipf-Mandelbrot distribution is not an adequate model on this linguistic level cf.

Moreover, the units of all levels are characterized by length; and again, the length of the units on one level is directly interrelated with those of the neighboring levels, and, probably, indirectly with those of all others (Altmann). Finally, systematic dependencies can be observed not only on the level of length; rather, each of the length categories displays regularities in its own right. Thus, particular frequency length distributions may be modelled on all levels distinguished.

Yet, many a problem still begs a solution; in fact, even many a question remains to be asked, at least in a systematic way. Thus, the descriptive apparatus has been excellently developed by structuralist linguistics; yet, structuralism has never made the decisive next step, and has never asked the crucial question as to explanatory models.

Also, the methodological apparatus for hypothesis testing has been elaborated, along with the formation of a great amount of valuable hypotheses. Still, much work remains to be done. From another perspective, this work will throw us back to the very basics of empirical study.

Last but not least, the quality of scientific research depends on the quality of the questions asked, and any modification of the question, or of the basic definitions, will lead to different results. As long as we do not know, for example, what a word is, i. And how, or in how far, do the results change — and if so, do they systematically change? These questions have never been systematically studied, and it is a problem sui generis, to ask for regularities such as frequency distributions on each of the levels mentioned.

But ultimately, these questions concern only the first degree of uncertainty, involving the qualitative decision as to the measuring units: given that we clearly distinguish these factors and study them systematically, the next questions concern the quality of our data material: will the results be the same, and how, or in how far, will they change systematically? At this point, the important distinction of types and tokens comes into play, and again the question must be how, or in how far, the results depend upon a decision as to this point.

Thus far, only language-intrinsic factors have been named which possibly influence word length; and this enumeration is not even complete; other factors such as the phoneme inventory size, the position in the sentence, the existence of suprasegmentals, etc. And, finally, word length does of course not only depend on language-intrinsic factors, according to the synergetic schema represented in Table 1. More questions than answers, it seems. And this may well be the case. Asking a question is a linguistic process; asking a scientific question is also a linguistic process, and a scientific process at the same time.

The crucial point, thus, is that if one wants to arrive at a science of language, one must ask questions in such a way that they can be answered in the language of science.

Koch ed. Faust; R. Harweg; W. Lehfeldt; G. Wienold eds. Altmann, Gabriel; Schwibbe, Michael H. Hildesheim etc. Bunge, Mario Scientific Research I. The Search for Systems. Berlin etc. Collinge, Neville E.

Stuttgart, Koch, Walter A. Evolutionary Cultural Semiotics. Struktur und Dynamik der Lexik. Rickert, Heinrich Kulturwissenschaft und Naturwissenschaft. Smith, Neilson Y. Snow, Charles P. Cambridge, Woodbury, NY. Windelband, Wilhelm Geschichte und Naturwissenschaft. Zipf, George K. Cambridge, Mass. An introduction to human ecology. Cam- bridge, Mass. Peter Grzybek ed.

Dordrecht: Springer, , pp.

Historical roots

The study of word length has an almost 150-year-long history: it was on August 18, 1851, that Augustus de Morgan (1806–1871), the well-known English mathematician and logician, in a letter to a friend of his, brought forth the idea of studying word length as an indicator of individual style, and as a possible factor in determining authorship.

Specifically, de Morgan concentrated on the number of letters per word and suspected that the average length of words in different Epistles by St. Paul might shed some light on the question of authorship; generalizing his ideas, he assumed that the average word lengths in two texts, written by one and the same author, though on different subjects, should be more similar to each other than in two texts written by two different individuals on one and the same subject cf.

Lord Thackeray, and John Stuart Mill. Figure 2. Still, Mendenhall concentrated solely on word length, as he did in his follow-up study of 1901, when he continued his earlier line of research, extending it to include selected passages from French, German, Italian, Latin, and Spanish texts.

In fact, what Mendenhall basically did was what would nowadays rather be called a frequency analysis, or frequency distribution analysis. He personally was mainly attracted to the frequency distribution technique by its resemblance to spectroscopic analysis. Particularly as to the question of authorship, Williams emphasized that before discussing the possible significance of the Shakespeare–Bacon and the Shakespeare–Marlowe controversies, it is important to ask whether any differences, other than authorship, were involved in the calculations.

Grzybek et al. Thus, the least one would expect would be to count the number of sounds, or phonemes, per word; as a matter of fact, it would seem much more reasonable to measure word length in more immediate constituents of the word, such as syllables, or morphemes. Yet, even today, there are no reliable systematic studies on the influence of the measuring unit chosen, nor on possible interrelations between them (and if they exist, they are likely to be extremely language-specific). More often than not, the reason for this procedure is based on the statistical assumption that, from a well-defined sample, one can, with an equally well-defined degree of probability, make reliable inferences about some totality, usually termed population.

Now, for some linguistic questions, samples of words may be homogeneous — for example, this seems to be the case with letter frequencies cf.

The very same, of course, has to be said about corpus analyses, since a corpus, from this point of view, is nothing but a quasi text. However, much of this criticism must then be directed towards contemporary research, too.

Particularly the last point mentioned above leads to the next period in the history of word length studies. As can be seen, no attempt was made by Mendenhall to find a formal mathematical model which might be able to describe (or rather, theoretically model) the frequency distribution. As a consequence, no objective comparison between empirical and theoretical distributions has been possible.

In this respect, the work of a number of researchers, whose work has only recently (and, in fact, only partially) been appreciated adequately, is of utmost importance. These scholars have proposed particular frequency distribution models, on the one hand, and they have developed methods to test the goodness of the results obtained, on the other. Initially, most scholars implicitly or explicitly shared the assumption that there might be one overall model which is able to represent a general theory of word length; more recently, ideas have been developed assuming that there might rather be some kind of general organizational principle, on the basis of which various specific models may be derived.

The present treatment concentrates on the rise and development of such models. It goes without saying that without empirical data, such a discussion would be as useless as the development of theoretical models. Consequently, the following presentation, in addition to discussing relevant theoretical models, will also try to present the results of empirical research.

Studies of merely empirical orientation, without any attempt to arrive at some generalization, will not be mentioned, however — this deliberate concentration on theory may be an important explanation as to why some quite important studies of empirical orientation will be absent from the following discussion.

The first models were discussed as early as the late 1940s. Research then concentrated on two models: the Poisson distribution, on the one hand, and the geometric distribution, on the other. Later, from the mid-1950s onwards, the Poisson distribution in particular was submitted to a number of modifications and generalizations, and this shall be discussed in detail below.

The first model to be discussed at some length here is the geometric distribution, which was suggested as an adequate model by Elderton. Elderton, who had published a book on Frequency-Curves and Correlation some decades before (London), studied the frequency of word lengths in passages from English writers, among them Gray, Macaulay, Shakespeare, and others.

As opposed to Mendenhall, Elderton measured word length in the number of syllables, not letters, per word. His assumption was that the frequency distributions might follow the geometric distribution.

It seems reasonable to take a closer look at this suggestion, since, historically speaking, this was the first attempt ever made to arrive at a mathematical description of a word length frequency distribution.
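As a minimal sketch of what fitting such a model involves, the following Python snippet fits a 1-displaced geometric distribution, P(x) = p·q^(x−1) for x = 1, 2, ..., to a table of syllable counts. The displacement by one (no zero-syllable words), the moment estimate p = 1/mean, the function names and the frequencies are all assumptions made for illustration; they are not Elderton's data or his documented procedure.

```python
def fit_displaced_geometric(freqs):
    """Fit P(x) = p * q**(x - 1), x = 1, 2, ..., by matching the mean.

    freqs maps word length x (in syllables) to its absolute frequency.
    For this displaced form the mean equals 1/p, so p is estimated as 1/mean.
    """
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n
    p = 1.0 / mean
    return n, p


def expected_frequencies(n, p, max_x):
    """Theoretical frequencies N * P(x) for x = 1 .. max_x."""
    q = 1.0 - p
    return {x: n * p * q ** (x - 1) for x in range(1, max_x + 1)}


# Hypothetical counts, made up for illustration only.
observed = {1: 600, 2: 250, 3: 100, 4: 40, 5: 10}
n, p = fit_displaced_geometric(observed)
theoretical = expected_frequencies(n, p, max(observed))
for x in sorted(observed):
    print(x, observed[x], round(theoretical[x], 1))
```

Because q is smaller than 1, the theoretical frequencies fall with every step in x, so a model of this form can only reproduce frequency profiles that decrease from the first class onwards.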

Where are zero-syllable words, i. Table 2. Gray Elderton Number of Frequency of syllables x-syllable words xi fi pi 1 0. Therefore, formula 2. The theoretical data, obtained by fitting the geometric distribution 2 to the empirical data from Table 2. Thus, with d. Therefore, the larger a sample, the more likely the deviations tend to be statistically significant. What is problematic about his approach is not so much that his attempt was only partly successful for some English texts; rather, it is the fact that the geometrical distribution is adequate to describe monotonously decreasing distributions only.

Analyzing randomly chosen lexical material from a Lithuanian dictionary, he found differences as to the distribution of root words and words with affixes. As an empirical test shows, the geometric distribution indeed turns out to be a good model.

In order to test his hypothesis, he gives, by way of an example, the relative frequencies of a list of dictionary words taken from a Lithuanian-French dictionary, represented in Table 2. The whole sample is thus arbitrarily divided into two portions, assuming that at a particular point of the data, there is a rupture in the material.

With regard to the data presented in Table 2. The approach as a whole thus implies that word length frequency would not be explained as an organic process, regulated by one overall mechanism, but as being organized by two different, overlapping mechanisms. In fact, this is a major theoretical problem: Given one accepts the suggested separation of different word types — i. Yet, this raises the question whether a unique, common model might not be able to model the Lithuanian data from Table 2.

In fact, as the re-analysis shows, there is such a model which may very well be fitted to the data; we are concerned, here, with the Conway-Maxwell-Poisson cf. What is more important, however, is the fact that, in the case of the Conway-Maxwell-Poisson distribution, no separate treatment of two more or less arbitrarily divided parts of the whole sample is necessary, so that in this case, the generation of word length follows one common mechanism.

His linguistic interests, to our knowledge, mainly concentrated on the process of language development. Since the support of 2. By way of an example, his approach will be demonstrated here, with reference to three texts. These data shall be additionally analyzed here because they are a good example for showing that word length frequencies do not necessarily imply a monotonically decreasing profile cf. The absolute frequencies f_i, as presented by Cebanov, as well as the corresponding relative frequencies p_i, are represented in Table 2.

Let us demonstrate this with reference to the data from Parzival in Table 2. Well 5 As compared to the calculations above, the theoretical frequencies slightly differ, due to rounding effects.

In Figure 2. As opposed to the approaches thus far discussed, these authors did not try to find a discrete distribution model; rather, they worked with continuous models, mainly the so-called lognormal model. Herdan was not the first to promote this idea with regard to language. Before him, Williams had applied it to the study of sentence length frequencies, arguing in favor of the notion that the frequencies with which sentences of a particular length occur are lognormally distributed.

This assumption was brought forth based on the observation that sentence length or word length frequencies do not seem to follow a normal distribution; hence, the idea of lognormality was promoted. Later, the idea of word length frequencies being lognormally distributed was only rarely picked up, for example by the Russian scholar Piotrovskij and colleagues Piotrovskij et al.

Generally speaking, the theoretical background of this assumption can be characterized as follows: the frequency distribution of linguistic units as of other units occurring in nature and culture often tends to display a right-sided asymmetry, i. One of the theoretical reasons for this can be seen in the fact that the variable in question cannot go beyond or remain below a particular limit; since it is thus characterized by a one-sided limitation in variation, the distribution cannot be adequately approximated by the normal distribution.

In other words: the left part of the distribution is stretched, and at the same time, the right part is compressed. Given the probability density function for the normal distribution as in 2. These two studies contain data on word length frequencies, the former 78, words of written English, the latter 76, words of spoken English. Thus, Herdan had the opportunity to do comparative analyses of word length frequencies measured in letters and phonemes.

In order to test his hypothesis as to the lognormality of the frequency distribution, Herdan confined himself to graphical techniques only. The most widely applied method in his time was the use of probability grids, with a logarithmically divided abscissa (x-axis) and the cumulative frequencies on the ordinate (y-axis). If the resulting graph showed a more or less straight line, one regarded a lognormal distribution to be proven. As can be seen from Figure 2. The latter had analyzed several French samples, among them the three picked up by Herdan in Figure 2.

The corresponding graph is reproduced in Figure 2. In his book, he offered theoretical arguments for the lognormal distribution to be an adequate model Herdan However, Herdan did not do any comparative analyses as to the efficiency of the normal or the lognormal distribution, neither graphically nor statistically. Therefore, both procedures shall be presented here, by way of a re-analysis of the original data.

As far as graphical procedures are concerned, probability grids have today been replaced by so-called P-P plots, which also show the cumulative proportions of a given variable and should result in a linear rise in case of normal distribution.
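The idea behind a P-P plot can be sketched in a few lines of Python: for each word length, the empirical cumulative proportion is paired with the cumulative proportion predicted by a fitted normal distribution (or, after taking logarithms, a lognormal one). The sketch below uses only the standard library; the data are hypothetical, the function names are invented for illustration, and evaluating a continuous distribution at discrete lengths without any continuity correction is a simplifying assumption.

```python
from math import log
from statistics import NormalDist


def weighted_mean_sd(values, freqs):
    """Mean and standard deviation of values weighted by their frequencies."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    var = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    return mean, var ** 0.5


def pp_points(lengths, freqs, lognormal=False):
    """Return (empirical, theoretical) cumulative proportions per length.

    lengths must be positive and sorted in ascending order.
    With lognormal=True the normal model is fitted to log(length).
    """
    values = [log(x) for x in lengths] if lognormal else list(lengths)
    mean, sd = weighted_mean_sd(values, freqs)
    model = NormalDist(mean, sd)
    n = sum(freqs)
    points, cum = [], 0
    for v, f in zip(values, freqs):
        cum += f
        points.append((cum / n, model.cdf(v)))
    return points


# Hypothetical letter counts per word, made up for illustration only.
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [30, 160, 210, 190, 150, 110, 70, 45, 25, 10]

for name, flag in (("normal", False), ("lognormal", True)):
    pts = pp_points(lengths, freqs, lognormal=flag)
    max_dev = max(abs(emp - theo) for emp, theo in pts)
    print(name, round(max_dev, 4))
```

If the model were adequate, the points would lie close to the diagonal, so the largest deviation printed for each model gives a rough numerical counterpart to the visual impression.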

By way of an example, Figure 2. It can clearly be seen that there are quite some deviations for the lognormal distribution cf. What is even more important, however, is the fact that the deviations are clearly less pronounced for the normal distribution cf. Although this can, in fact, be shown for all three data samples mentioned above, we will concentrate on a statistical analysis of these observations.

Furthermore, differences between normal and lognormal are minimal; in case of Manon Lescaut, the lognormal distribution is even worse than the normal distribution. The same holds true, by the way, for the above-mentioned data presented by Piotrovskij et al. As a re-analysis of the data shows, this claim may not be upheld, however cf. However, as can be seen the deviation from the lognormal distribution is highly significant as well, and, strictly speaking, even greater compared to the normal distribution.

With regard to this negative finding, one may add the result of a further re-analysis, saying that in case of all three data samples discussed by Herdan, the binomial distribution can very well be fitted to the empirical data, with 0. Incidentally, Michel arrived at the very same conclusion, in an extensive study on Old and New Bulgarian, as well as Old and New Greek material.

He tested the adequacy of the lognormal distribution for the word length frequencies of the above-mentioned material on two different premises, basing his calculation of word length both on the number of letters per word, and on the number of syllables per word.

Additionally, and this is even more important in the given context, one must state that there are also major theoretical problems which arise in the context of the lognormal distribution as a possible model for word length frequencies: a.

With this in mind, let us return to discrete models. The next step in the history of word length studies was the important theoretical and empirical analyses by Wilhelm Fucks, a German physicist, whose theoretical models turned out to be of utmost importance in the 1950s and 1960s.

The Fucks Generalized Poisson Distribution

5. Cebanov in the late s. Interestingly enough, some years later the very same model — i. Piotrowski et al. Furthermore, Fucks, in a number of studies, developed many important ideas on the general functioning not only of language, but of other human sign systems, too. In its most general form, this weighting generalization results in the following formula 2. For 2. As can be seen from equation 2. As was already mentioned above, the only model which met general acceptance was the 1-displaced Poisson distribution.

It is no wonder, then, that the generalized model has practically not been discussed. Fucks Thus, his application of the 1-displaced Poisson distribution included studies on 1 the individual style of single authors, as well as on 2 texts from different authors either 2.

As an example of the study of individual texts, Figure 2. As can be seen from the dotted line in Figure 2. As to a comparison of two German authors, Rilke and Goethe, on the one hand, and two Latin authors, Sallust and Caesar, on the other, Figure 2.

Again, the fitting of the 1-displaced Poisson distribution seems to be convincing. Yet, in re-analyzing his works, there remains at least one major problem: Fucks gives many characteristics of the specific distributions, starting from mean values and standard deviations up to the central moments, entropy etc. Yet, there are hardly ever any raw data given in his texts, a fact which makes it impossible to check the results at which he arrived. Thus, one is forced to believe in the goodness of his fittings on the basis of his graphical impressions, only; and this drawback is further enhanced by the fact that there are no procedures which are applied to test the goodness of his fitting the 1-displaced Poisson distribution.
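The kind of check that is missing here can be sketched as follows. The snippet fits the 1-displaced Poisson distribution, P(x) = e^(−a)·a^(x−1)/(x−1)! for x = 1, 2, ..., by setting a to the mean word length minus one, and then computes Pearson's chi-square together with the chi-square divided by the sample size. The frequencies are hypothetical, the function names are invented for illustration, small classes are not pooled, and using chi-square over N as a size-independent discrepancy measure is an assumption of this sketch rather than a procedure taken from Fucks.

```python
from math import exp, factorial


def displaced_poisson_pmf(x, a):
    """P(x) = exp(-a) * a**(x - 1) / (x - 1)!  for x = 1, 2, ..."""
    return exp(-a) * a ** (x - 1) / factorial(x - 1)


def chi_square_fit(freqs):
    """Fit the 1-displaced Poisson by the mean and return (a, chi2, chi2 / N).

    freqs maps word length x (in syllables) to its absolute frequency.
    """
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n
    a = mean - 1.0
    chi2 = 0.0
    for x, f in freqs.items():
        expected = n * displaced_poisson_pmf(x, a)
        chi2 += (f - expected) ** 2 / expected
    return a, chi2, chi2 / n


# Hypothetical counts, made up for illustration only.
observed = {1: 300, 2: 350, 3: 200, 4: 100, 5: 40, 6: 10}
a, chi2, discrepancy = chi_square_fit(observed)
print(round(a, 3), round(chi2, 2), round(discrepancy, 4))
```

Since chi-square grows with the sample size for a fixed pattern of deviations, dividing it by N makes samples of different size easier to compare.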

There is only one instance where Fucks presents at least the relative, though not the absolute frequencies of particular distributions in detail.

Fucks a: 85ff. The relative frequencies are reproduced in Table 2. We will come back to these data throughout the following discussion, using them as exemplifying material. Being well aware of the fact that for each of the languages we are concerned with mixed data, we can ignore this fact, and see the data as a representation of a maximally broad spectrum of different empirical distributions which may be subjected to empirical testing.

As was mentioned above cf. Remembering that fitting is considered to be good in case of 0. Still, Fucks and many followers of his pursued the idea of the 1-displaced Poisson distribution as the most adequate model for word length frequencies. Thus, one arrives at the curve in Figure 2. Fucks a: As can be seen with Fucks a: 88, f. And again, it would have been easy to run such a statistical test, calculating the coefficient of determination R² in order to test the adequacy of the theoretical curve obtained.

Let us shortly discuss this procedure: in a nonlinear regression model, R² represents that part of the variance of the variable y which can be explained by variable x. There are quite a number of more or less divergent formulae to calculate R² (cf. Grotjahn), which result in partly significant differences. Usually, the following formula 2. Thus, for each empirical x_i, we need both y_i which can be obtained by the empirical values y_i and the theoretical values b formula 2. Still, there remains a major theoretical problem with the specific method chosen by Fucks in trying to prove the adequacy of the 1-displaced Poisson distribution: this problem is related to the method itself, i.
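Since several formulae for the coefficient of determination are in use, the following Python sketch shows only one common variant, R² = 1 − SS_res/SS_tot, where SS_res is the sum of squared differences between empirical and theoretical values and SS_tot the sum of squared deviations of the empirical values from their mean. Whether this coincides with the formula referred to in the text is an assumption; the relative frequencies are hypothetical, and the theoretical values are taken from a 1-displaced Poisson distribution purely by way of example.

```python
from math import exp, factorial


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical relative frequencies for word lengths 1..6 (illustration only).
lengths = [1, 2, 3, 4, 5, 6]
p_obs = [0.30, 0.35, 0.20, 0.10, 0.04, 0.01]

# Theoretical values: 1-displaced Poisson with a = mean length - 1.
a = sum(x * p for x, p in zip(lengths, p_obs)) - 1.0
p_theo = [exp(-a) * a ** (x - 1) / factorial(x - 1) for x in lengths]

print(round(r_squared(p_obs, p_theo), 4))
```

With this definition, a perfect fit yields exactly 1, and a model that does no better than the mean of the empirical values yields 0.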

Taking a second look at formula 2. To summarize, one has thus to draw an important conclusion: Due to the fact that Fucks did not apply any suitable statistics to test the goodness of fit for the 1-displaced Poisson distribution, he could not come to the point of explicitly stating that this model may be adequate in some cases, but is not acceptable as a general standard model.

Most of these subsequent studies concentrated on the 1-displaced Poisson distribution, as suggested by Fucks. In fact, work on the Poisson distribution is by no means a matter of the past. Discussing and testing various distribution models, Rothschild did not find any one of the models he tested to be adequate. As was said above, Michel first found the lognormal distribution to be a completely inadequate model. He then tested the 1-displaced Poisson distribution and obtained negative results as well: although fitting the Poisson distribution led to better results as compared to the lognormal distribution, word length in his data turned out not to be Poisson distributed, either Michel f.

Finally, Grotjahn whose work will be discussed in more detail below cf. In doing so, let us first direct our attention to the 2-parameter model suggested by him, and then to his 3-parameter model. In a similar way, two related 2-parameter distributions can be derived from the general model 2.

It is exactly the latter distribution 2. Fucks has not systematically studied its relevance; still, it might be tempting to see what kind of results are yielded by this distribution for the data already analyzed above cf. As in the case of the 1-displaced Poisson distribution, one has thus to acknowledge that the Fucks 2-parameter 1-displaced Dacey-Poisson distribution is an adequate theoretical model only for a specific type of empirical distributions.

This leads to the question whether the Fucks 3-parameter distribution is more adequate as an overall model. It would lead too far, here, to go into details, as far as their derivation is concerned.

Consequently, three solutions are obtained, not all of which must necessarily be real solutions. With this in mind, let us once again analyze the data of Table 2. The results obtained can be seen in Table 2. It can clearly be seen that in some cases, quite reasonably, the results for the 3-parameter model are better, as compared to those of the two models discussed above.

From the results represented in Table 2. These violations can be of two kinds: a. However, some of the problems met might be related to the specific way of estimating the parameters suggested by him, and this might be the reason why other authors following him tried to find alternative ways. Cercvadze, G. As opposed to most of his German papers, Fucks had discussed his generalization at some length in this English synopsis of his work, and this is likely to be the reason why his approach received much more attention among Russian-speaking scholars.

We need not go into details, here, as far as the derivation of the Fucks distribution and its generating function is concerned cf. Unfortunately, Piotrovskij et al. Based on the standard Poisson distribution, as represented in 2.

Based on these assumptions, the following special cases are obtained for 2. These analyses comprised nine Polish literary texts, or segments of them, and the results of these analyses indeed proved their approach to be successful. For the sake of comparison, Table 2. A closer look at these data shows that the Polish text samples are relatively homogeneous: for all texts, the dispersion quotient is in the interval 0. The authors analyzed Croatian data from two corpora, each consisting of several literary works and a number of newspaper articles.

The data of one of the two samples are represented in Table 2.

[Figure 2.: observed and Poisson frequencies; x-axis: syllables per word (0 to 8); y-axis: frequency.]

Rather, it is of methodological interest to see how the authors dealt with the data. Guided by the conclusion supported by the graphical representation of Figure 2. Still, there remain at least two major theoretical problems: 1. No interpretation is given as to why the weighting modification is necessary: is this a matter of the specific data structure, is this specific for Croatian language products?

As the re-analyses presented in the preceding chapters have shown, neither the standard Poisson distribution nor any of its straightforward modifications can be considered to be an adequate model. Grotjahn, in his attempt, opened the way for new perspectives: he not only showed that the Poisson model per se might not be an adequate model; furthermore, he initiated a discussion concentrating on the question whether one overall model could be sufficient when dealing with word length frequencies of different origin.

As a starting point, Grotjahn analyzed seven letters by Goethe, written in , and tested in how far the 1-displaced Poisson distribution would prove to be an adequate model. As was pointed out above cf. However, of the concrete data analyzed by Grotjahn, only some satisfied this condition; others clearly did not, the value of d ranging from 1.
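The condition on d can be made concrete with a small sketch. For a 1-displaced Poisson distribution with parameter a, the mean is a + 1 and the variance is a, so the ratio of the variance to the mean minus one equals 1. Taking d to be exactly this ratio is an assumption here, since the definition is not reproduced in the text above; the function name and the observed counts are likewise made up for illustration.

```python
from math import exp, factorial


def dispersion_quotient(freqs):
    """Return variance / (mean - 1) for a table mapping length x to frequency."""
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n
    var = sum(f * (x - mean) ** 2 for x, f in freqs.items()) / n
    return var / (mean - 1.0)


# Theoretical 1-displaced Poisson probabilities for a = 0.8 (tail cut at x = 39).
a = 0.8
theoretical = {x: exp(-a) * a ** (x - 1) / factorial(x - 1) for x in range(1, 40)}
d_theoretical = dispersion_quotient(theoretical)

# Hypothetical counts, made up for illustration only.
observed = {1: 300, 2: 350, 3: 200, 4: 100, 5: 40, 6: 10}
d_observed = dispersion_quotient(observed)

print(round(d_theoretical, 6), round(d_observed, 4))
```

A value clearly above 1, as for the hypothetical counts, indicates more variation than a single 1-displaced Poisson distribution can accommodate.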

In a way, this conclusion paved the way for a new line of research. After decades of concentration on the Poisson distribution, Grotjahn was able to prove that this model alone cannot be adequate for a general theory of word length distribution. On the basis of this insight, Grotjahn further elaborated his ruminations. Although every single word thus may well follow a Poisson distribution, this assumption does not necessarily imply the premise that the probability is one and the same for all words; rather, it depends on factors such as linguistic context, theme, etc.

Grotjahn 56ff. Thus, the so-called negative binomial distribution 2. Therefore, as Grotjahn 71f. With his approach, Grotjahn thus additionally succeeded in integrating earlier research, both on the geometric and the Poisson distributions, which had failed to be adequate as an overall valid model. The data are reproduced in Table 2. Poisson d. The results are graphically represented in Figure 2.
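How the negative binomial distribution connects the geometric and the Poisson models can be illustrated with a short sketch. It uses the 1-displaced form P(x) = Γ(k + x − 1) / (Γ(k)·(x − 1)!) · p^k · q^(x−1) for x = 1, 2, ..., with q = 1 − p; this parametrization, the moment estimators and the function names are assumptions made for illustration and need not match the notation of the formula referred to above. For k = 1 the expression reduces to the 1-displaced geometric distribution, and for very large k (with the mean held fixed) it approaches the 1-displaced Poisson distribution.

```python
from math import exp, factorial, lgamma, log


def displaced_negbin_pmf(x, k, p):
    """1-displaced negative binomial probability for x = 1, 2, ..."""
    y = x - 1
    log_coef = lgamma(k + y) - lgamma(k) - lgamma(y + 1)
    return exp(log_coef + k * log(p) + y * log(1.0 - p))


def moment_estimates(freqs):
    """Method-of-moments estimates (k, p); requires variance > mean - 1."""
    n = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / n
    var = sum(f * (x - mean) ** 2 for x, f in freqs.items()) / n
    m = mean - 1.0
    if var <= m:
        raise ValueError("no overdispersion: moment estimates undefined")
    return m * m / (var - m), m / var


# Special case k = 1: identical to the 1-displaced geometric distribution.
geometric_case = displaced_negbin_pmf(3, 1.0, 0.4)

# Large k with fixed mean a: close to the 1-displaced Poisson distribution.
a, big_k = 1.2, 1e5
poisson_like = displaced_negbin_pmf(3, big_k, big_k / (big_k + a))
poisson_exact = exp(-a) * a ** 2 / factorial(2)

# Hypothetical counts, made up for illustration only.
observed = {1: 300, 2: 350, 3: 200, 4: 100, 5: 40, 6: 10}
k_hat, p_hat = moment_estimates(observed)
print(round(k_hat, 2), round(p_hat, 4))
```

The moment estimates exist only when the variance exceeds the mean minus one, which is one way of seeing why the model is suited to data that are overdispersed relative to the Poisson case.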

History and Methodology of Word Length Studies

[Figure 2.: observed frequencies f(x) compared with the negative binomial and Poisson distributions; x-axis: 0 to 9.]

Still, it is tempting to see how far the negative binomial distribution is able to model the data of nine languages given by Fucks cf. Their discussion is of unchanged importance, still today, since many more recent studies in this field do not seem to pay sufficient attention to the ideas expressed almost a decade ago. Before discussing these important reflections, one more model should be discussed, however, to which attention has recently been directed by Kromer a,b,c; In this case, we are concerned with the Poisson-uniform distribution, also called Poisson-rectangular distribution cf.

In his approach, Kromer a derived the Poisson-uniform distribution along a different theoretical way, which need not be discussed here in detail. With regard to formula 2. It would be too much, here, to completely derive the two relevant equations anew.

It may suffice therefore to say that the first equation can easily be derived from 2. Best, in turn, had argued in favor of the negative binomial distribution discussed above, as an adequate model.
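For orientation, the Poisson-uniform distribution can be written down as a Poisson distribution whose parameter is itself uniformly distributed on an interval from a to b. The sketch below uses the undisplaced form on x = 0, 1, 2, ..., where the probability of x equals the difference of the two Poisson cumulative probabilities at a and at b, divided by b − a. This standard mixture form, the parameter names a and b, and the function names are assumptions of the sketch; how the support is shifted for word length data, and how the parameters are estimated, is not reproduced here. The closed form is checked against a direct numerical integration.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Ordinary Poisson probability of x for parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)


def poisson_cdf(x, lam):
    """Ordinary Poisson probability of a value not exceeding x."""
    return sum(poisson_pmf(j, lam) for j in range(x + 1))


def poisson_uniform_pmf(x, a, b):
    """Poisson mixed over a uniform parameter on (a, b), with 0 <= a < b."""
    return (poisson_cdf(x, a) - poisson_cdf(x, b)) / (b - a)


def mixture_by_integration(x, a, b, steps=2000):
    """Midpoint-rule average of the Poisson probability over (a, b)."""
    h = (b - a) / steps
    return sum(poisson_pmf(x, a + (i + 0.5) * h) for i in range(steps)) / steps


closed_form = poisson_uniform_pmf(2, 0.5, 2.5)
numerical = mixture_by_integration(2, 0.5, 2.5)
print(round(closed_form, 6), round(numerical, 6))
```

Note that the closed form is undefined for a = b, since the denominator vanishes; that boundary case needs separate treatment.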

The results obtained for these data need not be presented here, since they can easily be taken from the table given by Kromer a: These data have been repeatedly analyzed above, among others with regard to the negative binomial distribution cf. Using the method of moments, it turns out that in four of the nine cases Esperanto, Arabic, Latin, and Turkish , no acceptable solutions are obtained.

Now, what is the reason for no satisfying results being obtained, according to the method of moments? Empirically, this is proven by the results represented in Table 2. Poisson-uniform distribution suggested by Kromer (personal communication) shall be demonstrated here; it is relevant for those cases when parameter a converges with parameter b in equation 2.

Parameter I, according to him, expresses something like the specifics of a given language i. Unfortunately, most of the above-mentioned papers Kromer b,c; have the status of abstracts, rather than of complete papers; as a consequence, only scarce empirical data are presented which might prove the claims brought forth on a broader empirical basis.

If his assumption should bear closer examination on a broader empirical basis, this might as well explain why we are concerned here with a mixture of two distributions. However, one must ask the question, why it is only the rectangular distribution which comes into play here, as one of two components. Strangely enough, it is just the Poisson-uniform distribution, which converges to almost no other distribution, not even to the Poisson distribution, as can be seen above for details, cf. This discussion was initiated by Grotjahn and Altmann as early as in , and it seems important to call to mind the most important arguments brought forth some ten years ago.

Yet, only recently have systematic studies been undertaken to solve just these methodological problems by way of empirical studies. Nevertheless, most of the ideas discussed (Grotjahn and Altmann combined them in six groups of practical and theoretical problems) are of unchanged importance for contemporary word length studies, which makes it reasonable to summarize at least the most important points, and comment on them from a contemporary point of view.

The problem of the unit of measurement. In other words: There can be no a priori decision as to what a word is, or in what units word length can be measured. Meanwhile, in contemporary theories of science, linguistics is no exception to the rule: there is hardly any science which would not acknowledge, to one degree or another, that it has to define its object, first, and that constructive processes are at work in doing so.

The relevant thing here is that measuring is made possible, as an important thing in the construction of theory. What has not yet been studied is whether there are particular dependencies between the results obtained on the basis of different measurement units; it goes without saying that, if they exist, they are highly likely to be language-specific. Also, it should be noted that this problem does not only concern the unit of measurement, but also the object under study: the word.

It is not even the problem of compound words, abbreviations and acronyms, or numbers and digits, which comes into play here, or the distinction between word forms and lexemes (lemmas); rather, it is the decision whether a word is to be defined on a graphemic, orthographic-graphemic, or phonological level.

The population problem. Again, as to these questions, there are hardly any systematic studies which would aim at a comparison of results obtained on an empirical basis.

However, there are some dozens of different types of letters, which can be proven to follow different rules, and which even more clearly differ from other text types. The goodness-of-fit problem. Rather, the question is, what is a small text, and where does a large text start?

d. The problem of the interrelationship of linguistic properties. What they have in mind are intralinguistic factors which concern the synergetic organization of language, and thus the interrelationship between word length and factors such as size of the dictionary, or the phoneme inventory of the given language, word frequency, or sentence length in a given text (to name but a few examples). As soon as the interest shifts from language, as a more or less abstract system, to the object of some real, fictitious, imagined, or even virtual communicative act, between some producer and some recipient, we are not concerned with language, any more, but with text.

Consequently, there are more factors to be taken into account forming the boundary conditions, factors such as author-specific, or genre-dependent conditions. Ultimately, we are on the borderline here, between quantitative linguistics and quantitative text analysis, and the additional factors are, indeed, more language-related than intralinguistic in the strict sense of the word.

It should be mentioned, however, that very little is known about such factors, and systematic work on this problem has only just begun. The modelling problem. As can be seen, the aim may be different with regard to the particular research object, and it may change from case to case; what is of crucial relevance, then, is rather the question of interpretability and explanation of data and their theoretical modelling. The problem of explanation. Consequently, in order to obtain an explanation of the nature of word length, one must discover the mechanism generating it, hereby taking into account the necessary boundary conditions.

Thus far, we cannot directly concentrate on the study of particular boundary conditions, since we do not know enough about the general system mechanism at work. Consequently, contemporary research involves three different kinds of orientation: first, we have many bottom-up oriented studies, partly in the form of ad-hoc solutions for particular problems, partly in the form of inductive research; second, we have top-down oriented, deductive research, aiming at the formulation of general laws and models; and finally, we have much exploratory work, which may be called abductive by nature, since it is characterized by constant hypothesis testing, possibly resulting in a modification of higher-level hypotheses.

In this framework, it is not necessary to know the probabilities of all individual frequency classes; rather, it is sufficient to know the relative difference between two neighboring classes, e.
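The idea that the ratio between neighboring classes suffices can be illustrated with a minimal sketch. It assumes a proportionality function g(x) such that P_x = g(x) · P_{x-1}; the function name `distribution_from_ratio` and the parameter value are hypothetical and chosen for illustration only. With g(x) = a/x, the recurrence yields the Poisson distribution, which the last lines check numerically.

```python
import math

def distribution_from_ratio(g, max_x, start=0):
    """Build P_start..P_max_x from P_x = g(x) * P_{x-1}, then normalise."""
    weights = [1.0]  # unnormalised probability of the first class
    for x in range(start + 1, max_x + 1):
        weights.append(weights[-1] * g(x))
    total = sum(weights)
    return {start + i: w / total for i, w in enumerate(weights)}

a = 1.5  # hypothetical parameter value, not taken from the text
poisson_like = distribution_from_ratio(lambda x: a / x, max_x=60)

# Compare one class with the closed-form Poisson probability.
closed_form = math.exp(-a) * a ** 3 / math.factorial(3)
print(poisson_like[3], closed_form)
```

Only the ratio function and a normalisation step are needed; the individual class probabilities follow from them. The truncation at `max_x` is an approximation that is harmless here because the tail is negligible.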

Ultimately, this line of research has in fact provided the most important research impulses in the s, which shall be discussed in detail below. In their search for relevant regularities in the organization of word length, Wimmer et al. Wimmer et al. This model was already discussed above, in its 1-displaced form 2. It has also been found to be an adequate model for word length frequencies from a Slovenian frequency dictionary Grzybek After corresponding re-parametrizations, these modifications result in well-known distribution models.

In , Wimmer et al. The set of word length classes is organized as a whole, i. Now, different distributions may be inserted for j. Thus, inserting the Borel distribution cf.

The parameters a and b of the GPD are independent of each other; there are a number of theoretical restrictions for them, which need not be discussed here in detail cf. Irrespective of these restrictions, already Wimmer et al. These observations are supported by recent studies in which Stadlober analyzed this distribution in detail and tested its adequacy for linguistic data.
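As a rough illustration of the GPD with its two parameters, the following sketch assumes the standard Consul-Jain form P_x = a (a + bx)^(x-1) e^(-(a + bx)) / x! for x = 0, 1, 2, ..., restricted to a > 0 and 0 ≤ b < 1. The equation itself is not reproduced in this text, so the notation and the function names are assumptions. The 1-displaced variant simply shifts the support to x = 1, 2, 3, ..., as is needed for word length.

```python
import math

def gpd_pmf(x, a, b):
    """Consul-Jain generalized Poisson probability for x = 0, 1, 2, ...

    Assumes a > 0 and 0 <= b < 1; computed in log space to avoid overflow.
    """
    if x < 0:
        return 0.0
    log_p = (math.log(a) + (x - 1) * math.log(a + b * x)
             - (a + b * x) - math.lgamma(x + 1))
    return math.exp(log_p)

def displaced_gpd_pmf(x, a, b):
    """1-displaced form: word length classes start at x = 1."""
    return gpd_pmf(x - 1, a, b)

a, b = 1.0, 0.2  # hypothetical parameter values for illustration
probs = [gpd_pmf(x, a, b) for x in range(151)]
total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
print(total, mean, a / (1 - b))
```

For b = 0 the expression reduces to the ordinary Poisson distribution, and the mean of the non-displaced form is a / (1 - b), which the printed values allow one to check.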

Stadlober As can be seen, the results are good or even excellent in all cases; in fact, as opposed to all other distributions discussed above, the Consul-Jain GPD is able to model all data samples given by Fucks. It can also be seen from Table 2. In this respect, i. As to this problem, it seems however important to state that this is not a problem specifically related to the GPD; rather, any mixture of distributions will cause the very same problems.

In this respect, it is important that other distributions which imply no mixtures can also be derived from 2. It would go beyond the frame of the present article to discuss the various extensions and modifications in detail here. As a result, there seems to be increasing reason to assume that there is indeed no unique overall distribution which might cover all linguistic phenomena; rather, different distributions may be adequate with regard to the material studied.

This assumption has been corroborated by a lot of empirical work on word length studies from the second half of the s onwards. Best More often than not, the relevant analyses have been made with specialized software, usually the Altmann Fitter.

This is an interactive computer program for fitting theoretical univariate discrete probability functions to empirical frequency distributions; fitting starts with the common point estimates and is optimized by way of iterative procedures. There can be no doubt about the merits of such a program. Now, the door is open for inductive research, too, and the danger of arriving at ad-hoc solutions is more virulent than ever before.
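The two-stage procedure described here (a point estimate first, then a refinement) can be sketched as follows. This is not the Altmann Fitter's actual algorithm, which is not described in this text: the sketch swaps in a plain grid search around the moment estimate, uses the 1-displaced Poisson distribution as the model, and runs on invented frequencies. It also ignores the probability mass beyond the largest observed class, which a careful analysis would pool.

```python
import math

def displaced_poisson(x, a):
    """1-displaced Poisson probability for word length x = 1, 2, 3, ..."""
    return math.exp(-a) * a ** (x - 1) / math.factorial(x - 1)

def chi_square(observed, a):
    """Pearson chi-square between observed counts and the model with parameter a."""
    n = sum(observed.values())
    total = 0.0
    for x, f in observed.items():
        expected = n * displaced_poisson(x, a)
        total += (f - expected) ** 2 / expected
    return total

def fit_displaced_poisson(observed, steps=1000):
    """Return (moment estimate, grid-refined estimate) of parameter a."""
    n = sum(observed.values())
    mean = sum(x * f for x, f in observed.items()) / n
    start = mean - 1.0  # moment estimate for the 1-displaced Poisson
    # Grid from 0.5*start to 1.5*start; it contains the start value itself.
    candidates = [start * (0.5 + i / steps) for i in range(steps + 1)]
    best = min(candidates, key=lambda a: chi_square(observed, a))
    return start, best

# Hypothetical frequencies of words with 1..6 syllables (not real data).
observed = {1: 420, 2: 310, 3: 160, 4: 70, 5: 30, 6: 10}
start, best = fit_displaced_poisson(observed)
print(start, chi_square(observed, start))
print(best, chi_square(observed, best))
```

Because the grid includes the moment estimate, the refined value can never fit worse than the starting point in terms of chi-square; a real fitting program would use a more efficient optimizer.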

What is important, therefore, at present, is an abductive approach which, on the one hand, has theory-driven hypotheses at its background, but which is open for empirical findings which might make it necessary to modify the theoretical assumptions. In addition to the C values of the discrepancy coefficient, the values for parameters a and b as a result of the fitting are given.
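For readers unfamiliar with the discrepancy coefficient, a minimal sketch follows. It assumes the definition commonly used in quantitative linguistics, C = χ²/N, where χ² is the Pearson chi-square statistic and N the sample size; the frequencies are invented for illustration and the function name is hypothetical.

```python
def discrepancy_coefficient(observed, expected):
    """Return C = chi-square / N for paired observed and expected frequencies."""
    n = sum(observed)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 / n

# Hypothetical observed and theoretical frequencies (not real data).
observed = [50, 30, 20]
expected = [48.0, 32.0, 20.0]
c_value = discrepancy_coefficient(observed, expected)
print(c_value)
```

Dividing by N offsets the tendency of the chi-square statistic to grow with sample size, which is why C is often preferred for the large samples typical of word length studies.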

As can be seen, fitting results are really good in all cases. As to the data analyzed, at least, the hyper-Poisson distribution should be taken into account as an alternative model, in addition to the GPD, suggested by Stadlober. Comparing these two models, a great advantage of the GPD is the fact that its reference value can be very easily calculated; this is not so convenient in the case of the hyper-Poisson distribution.

On the other hand, the generation of the hyper-Poisson distribution does not involve any secondary distribution to come into play; rather, it can be directly derived from equation 2. In its 1-displaced form, equation 2.

To summarize, we can thus state that the synergetic approach as developed by Wimmer et al. Generally speaking, the authors understand their contribution to be a logical extension of their synergetic approach, unifying previous assumptions and empirical findings. The individual hypotheses belonging to the proposed system have been set up earlier; they are well-known from empirical research of the last decades, and they are partly derived from different approaches.

Specifically, Wimmer et al. it is confined to the first four terms of formula 2. Many distributions can be derived from 2. It can thus be said that the general theoretical assumptions implied in the synergetic approach have experienced strong empirical support.

One may object that this is only one of several possible alternative models, only one theory among others. However, thus far, we do not have any other which is as theoretically sound, and as empirically supported, as the one presented. On the other hand, hardly any systematic studies have been undertaken to empirically study possible influencing factors, neither as to the data basis in general i.

Ultimately, the question of what may influence word length frequencies may be a bottomless pit; after all, any text production is an historically unique event, the boundary conditions of which may never be reproduced, at least not completely. Still, the question remains open if particular factors may be detected, the relevance of which for the distribution of word length frequencies may be proven. This point definitely goes beyond a historical survey of word length studies; rather, it directs our attention to research desires, as a result of the methodological discussion above.

A, ; — Best, Karl-Heinz ed. Brainerd, Barron Weighing evidence in language and literature: A statistical approach. Chebanow Chebanow, S. Dewey, G. Cambridge; Mass. Elderton, William P.

London, Fucks, Wilhelm Nach allen Regeln der Kunst. Leningrad, Nauka: — Dordrecht, NL. Grzybek, Peter ed. Ljubljana etc. The Impact of Word Length. Kromer, Victor V. Materialy konferencii. Materialy konferencii. Markov, Andrej A. Mendenhall, Thomas C. Studien zum 1. Internationalen Bulgaristikkongress Sofia Piotrovskij, Rajmond G. Williams, Carrington B.

Wimmer, Gejza; Altmann, Gabriel Thesaurus of univariate discrete probability distributions. Zerzwadse, G. In: Grundlagenstudien aus Kybernetik und Geisteswissenschaft 4, — The idea is derived from the Fitts—Garner controversy in mathematical psychology cf.

Fitts et al. Obviously, the problem is quite old but has not penetrated into linguistics as yet. A word in a text can be thought of as a realization of a number of different alternative possibilities, see Fig. They can even be understood in different ways, e. What is neglected when correlating the lengths and the frequencies of words in real texts is the fact that for the text producer there is not at all free choice of all existing words at every moment.

Trying to fill in the blank is a model for determining the uncertainty of the missing word. It must be noted that SIC or HIC are associated not only with words but also with whole phrases or clauses, so that they represent rather polystratic structures and sequences. The present approach is the first approximation at the word level.

Preparation

In order to illustrate the variables which will be treated later on, let us first define some quantities. The cardinality of the set X will be symbolized as |X|.

P the set of positions in a text, whatever the counting unit. The elements of this set are tokens $t_{ijk}$, i. If the type and its token are known, the indices i and j can be left out. The elements of this set, $a_{ij}$, are not necessarily synonyms but in the given context they are admissible alternatives of the given token. The index k can be omitted. $|A_{ij}|$ the number of elements in the set $A_{ij}$, i. This entity can be called tokeme. By defining $M_{ij}$, we are able to distinguish between tokens of the same type but with different alternatives and different number $a_i$, so they are different tokemes.

Example Using Table 9 cf. The text is reproduced word for word in the second column of Table 9 p. The length is measured in terms of the number of syllables of the word. Thus, e. We can define it for types too: then it is the mean of all LLs of all tokens of this type in the text.

LL is usually a positive real number. The errors compensate each other in the long run, so the distribution of L equals that of LL. It can be ascertained for any text.

We can set up the hypothesis that Hypothesis 1 The longer the token, the longer the tokeme at the given position.
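A first, informal check of Hypothesis 1 can be sketched as a correlation between token length L and tokeme length LL. The sketch assumes that LL at a position is the mean syllable length of the admissible alternatives there (with the token itself counted among them); this reading, the function names, and the data are all assumptions made for illustration, not values from Table 9.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical positions: (token length L, syllable lengths of the
# admissible alternatives at that position, token included).
positions = [
    (1, [1, 1, 2]),
    (2, [2, 3, 1]),
    (3, [3, 2, 4]),
    (1, [1, 2]),
    (4, [3, 4, 5]),
]

token_lengths = [length for length, _ in positions]
tokeme_lengths = [mean(alternatives) for _, alternatives in positions]
r = pearson(token_lengths, tokeme_lengths)
print(tokeme_lengths, r)
```

A clearly positive r on real data would be in line with the hypothesis, although a correlation coefficient alone does not establish the form of the dependence.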

