Oxford Dictionary of English: Current Developments

James McCracken
Oxford University Press
mccrackj@oup.co.uk

Abstract

This research note describes the early stages of a project to enhance a monolingual English dictionary database as a resource for computational applications. It considers some of the issues involved in deriving formal lexical data from a natural-language dictionary.

1 Introduction

The goal of the project is to enhance the database of the Oxford Dictionary of English (a forthcoming new edition of the 1998 New Oxford Dictionary of English) so that it contains not only the original dictionary content but also additional sets of data formalizing, codifying, and supplementing this content. This will allow the dictionary to be exploited effectively as a resource for computational applications.

The Oxford Dictionary of English (ODE) is a high-level dictionary intended for fluent English speakers (especially native speakers) rather than for learners. Hence its coverage is very extensive, and definitional detail is very rich. By the same token, however, a certain level of knowledge is assumed on the part of the reader, so not everything is spelled out explicitly. For example, ODE frequently omits morphology and variation which is either regular or inferable from related words. Entry structure and defining style, while mostly conforming broadly to a small set of basic patterns and formulae, may often be more concerned with detail and accuracy than with simplicity of explanation. Such features make the ODE content relatively difficult to convert into comprehensive and formalized data. Nevertheless, the richness of the ODE text, particularly in the frequent use of example sentences, provides a wealth of cues and clues which can help to control the generation of more formal lexical data.

A basic principle of this work is that the enhanced data should always be predicated on the original dictionary content, and not the other way round.
There has been no attempt to alter the original content in order to facilitate the generation of formal data. The enhanced data is intended primarily to constitute a formalism which closely reflects, summarizes, or extrapolates from the existing dictionary content.

The following sections describe some of the data types currently in progress.

2 Morphology

A fundamental building block for formal lexical data is the creation of a complete morphological formalism (verb inflections, noun plurals, etc.) covering all lemmas (headwords, derivatives, and compounds) and their variant forms, and encoding relationships between them. This is being done largely automatically, assuming regular patterns as a default but collecting and acting on anything in the entry which may indicate exceptions (explicit grammatical information, example sentences, pointers to other entries, etc.).

The original intention was to generate a morphological formalism which reflected whatever was stated or implied by the original dictionary content. Hence pre-existing morphological lexicons were not used except when an ambiguous case needed to be resolved. As far as possible, issues relating to the morphology of a word were to be handled by collecting evidence internal to its dictionary entry.

However, it became apparent that there were some key areas where this approach would fall short. For example, there are often no conclusive indicators as to whether or not a noun may be pluralized, or whether or not an adjective may take a comparative or superlative. In such cases, any available clues are collected from the entry but are then weighted by testing possible forms against a corpus.

3 Idioms and other phrases

Phrases and phrasal verbs are generally lemmatized in an 'idealized' form which may not represent actual occurrences.
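The defaulting strategy of section 2 (assume regular inflection, but let entry-internal evidence override the default) can be sketched roughly as follows. The exception table, rule set, and function name here are invented for illustration and are not drawn from the ODE database:

```python
# Sketch of morphology generation: regular patterns apply by default,
# and an exception table (as might be harvested from explicit grammar
# notes or example sentences in an entry) overrides them.

# Hypothetical exceptions collected from entry-internal evidence.
EXCEPTIONS = {
    ("child", "plural"): "children",
    ("run", "past"): "ran",
}

def pluralize(noun):
    """Return the plural form: exception if known, else a regular rule."""
    override = EXCEPTIONS.get((noun, "plural"))
    if override:
        return override
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"
```

In the real system the exception table would be populated automatically from the cues the paper describes, and ambiguous cases resolved against a corpus rather than hard-coded.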
Variation and alternative wording is embedded parenthetically in the lemma:

    (as) nice (or sweet) as pie

Objects, pronouns, etc., which may form part of the phrase are indicated in the lemma by words such as 'someone', 'something', 'one':

    twist (or wind or wrap) someone around one's little finger

In order to be able to match such phrases to real-world occurrences, each dictionary lemma was extended as a series of strings which enumerate each possible variant and codify how pronouns, noun phrases, etc., may be interpolated. Each occurrence of a verb in these strings is linked to the morphological data in the verb's own entry, to ensure that inflected forms of a phrase (e.g. 'she had him wrapped around her little finger') can be identified.

4 Semantic classification

We are seeking to classify all noun senses in the dictionary according to a semantic taxonomy, loosely inspired by the Princeton WordNet project. Initially, a relatively small number of senses were classified manually. Statistical data was then generated by examining the definitions of these senses. This established a definitional 'profile' for each classification, which was then used to automatically classify further senses. Applied iteratively, this process succeeded in classifying all noun senses in a relatively coarse-grained way, and is now being used to further refine the granularity of the taxonomy and to resolve anomalies.

Definitional profiling here involves two elements.

The first element is the identification of the 'key term' in the definition. This is the most significant noun in the definition (not a rigorously defined concept, but one which has proved pragmatically effective). It is not always coterminous with the genus term; for example, in a definition beginning 'a morsel of food which …', the 'key term' is taken to be food rather than morsel.

The second element is a scoring of all the other meaningful vocabulary in the definition (i.e. ignoring articles, conjunctions, etc.).
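The lemma expansion described in section 3 amounts to enumerating strings from the parenthetical conventions. A simplified sketch of that enumeration step, handling only the optional-word and alternation conventions shown above (the parsing is illustrative, not the actual ODE code):

```python
import itertools
import re

def expand_lemma(lemma):
    """Enumerate variant strings from a parenthesized idiom lemma.
    '(word)' marks an optional word; 'W (or V)' marks alternatives to W."""
    tokens = re.findall(r"\(([^)]*)\)|(\S+)", lemma)
    slots = []  # each slot is the list of choices at that position
    for paren, word in tokens:
        if word:
            slots.append([word])              # plain word: one choice
        elif paren.startswith("or "):
            # alternatives attach to the previous word
            slots[-1] = slots[-1] + paren[3:].split(" or ")
        else:
            slots.append(["", paren])         # optional word
    return [" ".join(w for w in choice if w)
            for choice in itertools.product(*slots)]
```

A fuller version would also expand placeholders like 'someone' into pronoun and noun-phrase patterns, and link verbs to their inflection tables so that forms like 'wrapped around her little finger' still match.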
A simple weighting scheme is used to give slightly more importance to words at the beginning of a definition (e.g. a modifier of the key term) than to words at the end.

These two elements are then assigned mutual information scores in relation to each possible classification, and the two MI scores are combined in order to give an overall score. This overall score is taken to be a measure of how 'typical' a given definition would be for each possible classification. This enables one very readily to rank and group all the senses for a given classification, thus exposing misclassifications or points where a classification needs to be broken down into subcategories.

The semantic taxonomy currently has about 1250 'nodes' (each representing a classification category) on up to 10 levels. The dictionary contains 95,000 defined noun senses in total, so there are on average 76 senses per node. However, this average disguises the fact that there are a small number of nodes which classify significantly larger sets of senses. Further subcategorization of large sets is desirable in principle, but is not considered a priority in all cases. For example, there are several hundred senses classified simply as tree; the effort involved in subcategorizing these into various tree species is unlikely to pay dividends in terms of value for normal NLP applications. A pragmatic approach is therefore to deprioritize work on homogeneous sets (sets where the range of 'typicality' scores for each sense is relatively small), more or less irrespective of set size.

Hence the goal is not to achieve granularity on the order of WordNet's 'synset' (a set in which all terms are synonymous, and hence are rarely more than four or five in number) but rather a somewhat more coarse-grained 'similarset' in which every sense is similar enough to support general-purpose word-sense disambiguation, document retrieval, and other standard NLP tasks.
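A toy rendering of the profiling idea in section 4: score a definition's vocabulary against a per-class profile, weighting words near the start more heavily. The profiles, counts, and decay constant below are invented purely for illustration (the paper combines two mutual information scores; this sketch shows only the position-weighted vocabulary element):

```python
import math
from collections import Counter

# Hypothetical definitional profiles: counts of words seen in
# already-classified definitions for each semantic class.
PROFILE = {
    "food": Counter({"edible": 12, "eaten": 9, "dish": 7, "sweet": 4}),
    "tree": Counter({"evergreen": 10, "wood": 8, "leaves": 6}),
}

def score_definition(words, cls):
    """Position-weighted score of a definition's meaningful words
    against one class profile; earlier words count slightly more."""
    profile = PROFILE[cls]
    total = sum(profile.values())
    score = 0.0
    for i, w in enumerate(words):
        weight = 1.0 / (1 + 0.1 * i)   # mild decay toward the end
        if profile[w]:
            score += weight * math.log(1 + profile[w] / total)
    return score
```

Ranking all senses of a class by such a score is what exposes outliers: a sense whose definition scores far below its classmates is a candidate misclassification or a cue that the class needs subdividing.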
At this level, automatic analysis and grading of definitions is proving highly productive in establishing classification schemes and in monitoring consistency, although extensive supervision and manual correction is still required.

It should be noted that a significant number of nouns and noun senses in ODE do not have definitions and are therefore opaque to such processes. Firstly, some senses cross-refer to other definitions; secondly, derivatives are treated in ODE as undefined subentries. Classification of these will be deferred until classification of all defined senses is complete. It should then be possible to classify most of the remainder semi-automatically, by combining an analysis of word formation with an analysis of target or 'parent' senses.

5 Domain indicators

Using a set of about 200 subject areas (biochemistry, soccer, architecture, astronomy, etc.), all relevant senses and lemmas in ODE are being populated with markers indicating the subject domain to which they relate. It is anticipated that this will support the extraction of specialist lexicons, and will allow the ODE database to function as a resource for document classification and similar applications.

As with semantic classification above, a number of domain indicators were assigned manually, and these were then used iteratively to seed assignment of further indicators to statistically similar definitions. Automatic assignment is a little more straightforward and robust here, since most of the time the occurrence of strongly-typed vocabulary will be a sufficient cue, and there is little reason to identify a key term or otherwise parse the definition.

Similarly, assignment to undefined items (e.g. derivatives) is simpler, since for most two- or three-sense entries a derivative can simply inherit any domain indicators of the senses of its 'parent' entry.
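This inheritance step for derivatives might be sketched as follows; the entry representation and the three-sense threshold are assumptions for illustration:

```python
def inherit_domains(parent_senses, max_senses=3):
    """Give a derivative the union of its parent entry's domain
    indicators when the entry is short; return None for longer
    entries, which need manual checking instead."""
    if len(parent_senses) > max_senses:
        return None                      # defer to manual review
    domains = set()
    for sense in parent_senses:
        domains.update(sense.get("domains", ()))
    return domains
```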
For longer entries this process has to be checked manually, since the derivative may not relate to all the senses of the parent.

Currently, about 72,000 of a total 206,000 senses and lemmas have been assigned domain indicators. There is no clearly-defined cut-off point for iterations of the automatic assignment process; each iteration will continue to capture senses which are less and less strongly related to the domain. Beyond a certain point, the relationship will become too tenuous to be of much use in most contexts; but that point will differ for each subject field (and for each context). Hence a further objective is to implement a 'points' system which not only classifies a sense by domain but also scores its relevance to that domain.

6 Collocates for senses

We are currently exploring methods to automatically determine key collocates for each sense of multi-sense entries, to assist in applications involving word-sense disambiguation. Since collocates were not given explicitly in the original dictionary content of ODE, the task involves examining all available elements of a sense for clues which may point to collocational patterns.

The most fruitful areas in this respect are firstly definition patterns, and secondly example sentences.

Definition patterns are best illustrated by verbs, where likely subjects and/or objects are often indicated in parentheses:

    fly: (of a bird, bat, or insect) move through the air
    impound: (of a dam) hold back (water)

The terms in parentheses can be collected as possible collocates, and in some cases can be used as seeds for the generation of longer lists (by exploiting the semantic classifications described in section 4 above). Similar constructions are often found in adjective definitions. For other parts of speech (e.g. nouns), and for definitions which happen not to use the parenthetic style, inference of likely collocates from definition text is a less straightforward process; however, by identifying a set of characteristic constructions it is possible to define search patterns that will locate collocate-like elements in a large number of definitions. The defining style in ODE is regular enough to support this approach with some success.

Some notable 'blind spots' have emerged, often reflecting ODE's original editorial agenda; for example, the defining style used for verbs often makes it hard to determine automatically whether a sense is transitive or intransitive.

Example sentences can be useful sources since they were chosen principally for their typicality, and are therefore very likely to contain one or more high-scoring collocates for a given sense. The key problem is to identify automatically which words in the sentence represent collocates, as opposed to those words which are merely incidental. Syntactic patterns can help here; if looking for collocates for a noun, for example, it makes sense to collect any modifiers of the word in question, and any words participating in prepositional constructions.
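A crude sketch of such pattern-based collection over a noun's example sentence; the patterns, stoplist, toy sentence, and function name are invented for illustration (the actual system presumably uses fuller parsing):

```python
import re

# Function words that should never count as collocates.
STOP = {"a", "an", "the", "of", "in"}

def candidate_collocates(headword, sentence):
    """Collect candidate collocates for a noun headword from an
    example sentence: an immediate premodifier, and the noun the
    headword governs in an 'of'-phrase."""
    cands = []
    m = re.search(r"(\w+)\s+" + headword + r"\b", sentence)
    if m and m.group(1).lower() not in STOP:
        cands.append(m.group(1))          # premodifier
    m = re.search(headword + r"\s+of\s+(\w+)", sentence)
    if m and m.group(1).lower() not in STOP:
        cands.append(m.group(1))          # 'of'-phrase noun
    return cands
```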
Thus if a sense of the entry for breach has the example sentence

    She was guilty of a breach of trust.

then some simple parsing and pattern-matching can collect guilty and trust as possible collocates.

However, it will be apparent from this that examination of the content of a sense can do no more than build up lists of candidate collocates: a number of which will be genuinely high-scoring collocates, but others of which may be more or less arbitrary consequences of an editorial decision. The second step will therefore be to build into the process a means of testing each candidate against a corpus-based list of collocates, in order to eliminate the arbitrary items and to extend the list that remains.

7 Conclusion

In order for a non-formalized, natural-language dictionary like ODE to become properly accessible to computational processing, the dictionary content must be positioned within a formalism which explicitly enumerates and classifies all the information that the dictionary content itself merely assumes, implies, or refers to. Such a system can then serve as a means of entry to the original dictionary content, enabling a software application to quickly and reliably locate relevant material, and guiding interpretation.

The process of automatically generating such a formalism by examining the original dictionary content requires a great deal of manual supervision and ad hoc correction at all stages. Nevertheless, the process demonstrates the richness of a large natural-language dictionary in providing cues and flagging exceptions. The stylistic regularity of a dictionary like ODE supports the enumeration of a finite (albeit large) list of structures and patterns which can be matched against a given entry or entry element in order to classify it, mine it for pertinent information, and note instances which may be anomalous.

The formal lexical data is being built up alongside the original dictionary content in a single integrated database.
This arrangement supports a broad range of possible uses. Elements of the formal data can be used on their own, ignoring the original dictionary content. More interestingly, the formal data can be used in conjunction with the original dictionary content, enabling an application to exploit the rich detail of natural-language lexicography while using the formalism to orient itself reliably. The formal data can then be regarded not so much as a stripped-down counterpart to the main dictionary content, but more as a bridge across which applications can productively access that content.

Acknowledgements

I would like to thank Adam Kilgarriff of ITRI, Brighton, and Ken Litkowski of CL Research, who have been instrumental in both devising and implementing significant parts of the work outlined above.

References

Christiane Fellbaum and George Miller. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass.

Judy Pearsall. 1998. The New Oxford Dictionary of English. Oxford University Press, Oxford, UK.
