Wednesday, July 17, 2019

Part of Speech Recognizer

Improving Identifier Informativeness Using Part of Speech Information

Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland
Baltimore MD 21210-2699, USA
{binkley, lawrie}@cs.loyola.edu

Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis

Abstract

Recent software engineering tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source-code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags, several rules emerge that seek to improve structure-field naming.

Figure 1. Process for POS tagging of field names. [Pipeline stages: Source Code -> Mark-up Code -> Extract Field Names -> Split Names -> Apply Template -> Part-of-Speech Tagging]

1 Introduction

Software engineering can benefit from leveraging tools and techniques of other disciplines. Traditionally, natural language processing (NLP) tools attack problems by processing the natural language found in documents such as news articles and web pages. One such NLP tool is a part-of-speech (POS) tagger. Tagging is, for example, pivotal to Named-Entity Recognition [3], which enables information about a person to be tracked within and across documents. Many POS taggers are built using machine learning based on newswire training data. Conventional wisdom is that these taggers work well on the newswire and similar artifacts; however, their effectiveness degrades as the input moves further away from the highly structured sentences found in traditional newswire articles.

The text available in source-code artifacts, in particular a program's identifiers, has a very different structure. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval results from software repositories [1, 11] and to investigate the comprehensibility of source code identifiers [4, 6].

Fortunately, machine learning techniques are robust and, as reported in Section 2, good results are obtained using several sentence-forming templates. This initial investigation also suggests improvements specific to software that would improve tagging. For example, the type of a declared entity can be factored into its tags. As an example application of POS tagging for source code, the tagger is then used to tag over 145,000 structure-field names. Equivalence classes of tags are then examined to elicit rules for the automatic identification of poor names (as described in Section 3) and to suggest improved names, which is left to future work.
2 Part-of-Speech Tagging

Before a POS tagger's output can be used as input to downstream SE tools, the POS tagger itself needs to be vetted. This section describes an experiment performed to investigate the accuracy of POS tagging on field names mined from source code. The process used for mining and tagging the fields is first described, followed by the empirical results from the experiment.

Figure 1 shows the pipeline used for the POS tagging of field names. On the left, the input to the pipeline is source code. This is then marked up using XML tags by srcML [5] to identify various syntactic categories. Third, field names are extracted from the marked-up source using XPath queries; Figure 2 shows the queries for C++ and Java.

Figure 2. XML queries for extracting C++ and Java fields from srcML.

The fourth stage splits field names by replacing underscores with spaces and inserting a space where the case changes from lowercase to uppercase. For example, the names spongeBob and sponge_bob both become sponge bob. After splitting, all characters are shifted to lowercase. This stage also filters names so that only those that consist entirely of dictionary words are retained. Filtering uses Debian's American (6-2) dictionary package, which consists of the 98,569 words from Kevin Atkinson's SCOWL word lists that have size 10 through 50 [2]. This dictionary includes some common abbreviations, which are consequently included in the final data set. Future work will avoid the need for filtering through vocabulary normalization, in which non-words are split into their abbreviations and then expanded to their natural-language equivalents [9].

The fifth stage applies a set of templates (described below) to each split field name. Each template in effect wraps the words of the field name in an attempt to improve the performance of the POS tagger. Finally, POS tagging is performed by Version 1.6 of the Stanford Log-linear POS Tagger [12]. The default options are used, including the pretrained bidirectional model [10].

The remainder of this section considers empirical results concerning the effectiveness of the tagging pipeline. A total of 145,163 field names were mined from 10,985 C++ files and 9,614 Java files found in 171 programs. From this full data set, 1,500 names were randomly chosen as a test set (683 came from C++ files and 817 from Java files). A human annotator (a university student majoring in English) marked the 1,500 field names with POS information, producing the oracle set. This oracle set is used to evaluate the accuracy of the automatic tagging techniques when applied to the test set.

Preliminary study of the Stanford tagger indicated that it needed guidance when tagging field names. Following the work of Abebe and Tonella [1], four templates were used to provide this guidance. Each template includes a slot into which the split field name is inserted, and each template's accuracy is then evaluated using the oracle set:

Sentence Template: "<name>."
List Item Template: "<name>" (as an item in a list)
Verb Template: "Please, <name>."
Noun Template: "<name> is a thing."

The Sentence Template, the simplest of the four, considers the identifier itself to be a sentence by appending a period to the split field. The List Item Template exploits the tagger having learned about the POS information found in the sentence fragments used in lists. The Verb Template tries to assist the tagger in treating the field name as a verb or a verb phrase by prefixing it with "Please," since usually a command follows. Finally, the Noun Template tries to encourage the tagger to treat the field as a noun by postfixing it with "is a thing", as was done by Abebe and Tonella [1]. Table 1 shows the accuracy of each template applied to the test set, with the output compared to the oracle.
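Before turning to the results, the splitting and template-wrapping stages above can be sketched as follows. This is a minimal illustration based on the descriptions in this section; the helper names and the exact template strings (in particular the list-item form) are assumptions, not the authors' code.

```python
import re

def split_name(name):
    """Split an identifier into space-separated lowercase words.

    Mirrors the fourth pipeline stage: underscores become spaces and a
    space is inserted at each lowercase-to-uppercase case change.
    """
    words = name.replace("_", " ")
    words = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", words)
    return words.lower()

# The four templates, each wrapping the split name in a slot ("{}").
# The wording follows the paper's descriptions; the precise strings
# are illustrative assumptions.
TEMPLATES = {
    "sentence": "{}.",
    "list_item": "- {}",
    "verb": "Please, {}.",
    "noun": "{} is a thing.",
}

def apply_template(name, template):
    """Wrap a split field name in the named template."""
    return TEMPLATES[template].format(split_name(name))
```

For example, `apply_template("spongeBob", "noun")` yields the string `sponge bob is a thing.`, which would then be handed to the POS tagger.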
The major diagonal of Table 1 reports each technique in isolation, while the remaining entries require two techniques to agree and thus lower the percentage. The similarity of the percentages in a column gives an indication of how similar the set of correctly tagged names is for two techniques. For example, considering the Sentence Template, the Verb Template has the lowest overlap of the remaining three, as indicated by its joint percentage of 71.7%. Overall, the List Item Template performs the best, and the Sentence Template and Noun Template produce essentially similar results, getting the correct tagging on almost all the same fields. Perhaps unsurprisingly, the Verb Template performs the worst. Nonetheless, it is interesting that this template does produce the correct output on 3.2% of the fields where no other template succeeds. As shown in Table 2, overall at least one template correctly tagged 88% of the test set. This suggests that it may be possible to combine these results, possibly using machine learning, to produce higher accuracy than achieved using the individual templates. Although 88% is lower than the 97% achieved by natural-language taggers on newswire data, the performance is still quite high considering the lack of context provided by the words of a single structure field.

           | Sentence | List Item | Verb  | Noun
Sentence   | 79.1%    | 76.5%     | 71.7% | 77.0%
List Item  | 76.5%    | 81.7%     | 71.0% | 76.0%
Verb       | 71.7%    | 71.0%     | 76.0% | 70.8%
Noun       | 77.0%    | 76.0%     | 70.8% | 78.7%

Table 1. Each percentage is the percent of correctly tagged field names using both the row and column technique; thus the major diagonal represents each technique independently.

Correct in all templates: 68.9%
Correct in at least one template: 88.0%

Table 2. Correctly tagged identifiers.

As illustrated in the following section, the identification is sufficiently accurate for use by downstream consumer applications.

3 Rules for Improving Field Names

As an example application of POS tagging for source code, the 145,163 field names of the full data set were tagged using the List Item Template, which showed the best performance in Table 1. The resulting tags were then used to form equivalence classes of field names. Analysis of these classes led to four rules for improving the names of structure fields. Rule violations can be automatically identified using POS tagging. Further, as illustrated in the examples, by mining the source code it is possible to suggest potential replacements. The assumption behind each rule is that high-quality field names will provide better abstract information, which aids an engineer in the task of forming a mental understanding of the code. Correct part-of-speech information can help inform the naming of identifiers, a process that is essential in communicating intent to future programmers. Each rule is first informally introduced and then formalized. After each rule, the percentage of fields that violate the rule is given. Finally, some rules are followed by a discussion of rule exceptions or related notions.

The first rule observes that field names represent objects, not actions; thus they should exclude present-tense verbs. For example, the field name create mp4 clearly implies an action, which is unlikely the intent (unless perhaps the field represents a function pointer). Inspection of the source code reveals that this field holds the desired mp4 video stream container type. Based on the context of its use, a better, less ambiguous name for this identifier is created mp4 container type, which includes the past-tense verb created. A notable exception to this rule is fields of type Boolean, for example is logged in, where the present tense of the verb to be is used. A present-tense verb in this context is used to represent a current state, and is therefore not confusing.

Rule 1: Non-Boolean field names should never contain a present-tense verb.
Violations detected: 27,743 (19.1% of field names)

Looking at the violations of Rule 1, one pattern that emerges suggests an improvement to the POS tagger that would better specialize it to source code. A pattern that frequently occurs in GUI programming finds verbs used as adjectives when describing GUI elements such as buttons. Recognizing such fields based on their type should improve tagger accuracy. Consider the fields delete button and, to a lesser extent, continue box. In isolation these appear to represent actions. However, they actually represent GUI elements. Thus, a special context-sensitive case in the POS tagger would tag such verbs as adjectives.

The second rule considers field names that contain only a verb, for example the field name recycle. This name communicates little to a programmer unfamiliar with the code. Examination of the source code reveals that this variable is an integer and, based on the comments, it counts the number of things recycled. While this meaning can be inferred from the declaration and the comments surrounding it, field-name uses often occur far from their declaration, reducing the value of the declared type and supporting comments. A potential fix in this case is to change the name to recycled count or things recycled. Both alternatives improve the clarity of the name.

Rule 2: Field names should never be only a verb.
Violations detected: 4,661 (3.2% of field names)

The third rule considers field names that contain only an adjective. While adjectives are useful when used with a noun, an adjective alone relies too much on the type of the variable to fully explain its use. For example, consider the identifier interesting. In this case, the declared type of list provides the insight that this field holds a list of interesting items. Replacing this field with interesting list or interesting items should improve code understanding.
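Concretely, the first three rules reduce to simple pattern checks over a field name's tag sequence. The sketch below assumes Penn Treebank tags (VBP/VBZ for present-tense verbs, VB* for verbs, JJ for adjectives); the tag vocabulary and the handling of Boolean fields are our reading of the rules, not the authors' exact formalization.

```python
# Checks over a name's POS tag sequence, one function per rule.
# The tag sequences would normally come from the tagger; here they
# are passed in directly.

PRESENT_TENSE = {"VBP", "VBZ"}  # Penn Treebank present-tense verb tags

def violates_rule1(tags, is_boolean=False):
    """Rule 1: non-Boolean field names should never contain a
    present-tense verb (Boolean fields like `is logged in` are exempt)."""
    return not is_boolean and any(t in PRESENT_TENSE for t in tags)

def violates_rule2(tags):
    """Rule 2: a field name should never be only a verb
    (e.g., the lone name `recycle`)."""
    return len(tags) == 1 and tags[0].startswith("VB")

def violates_rule3(tags):
    """Rule 3: a field name should never be only an adjective
    (e.g., the lone name `interesting`)."""
    return tags == ["JJ"]
```

For instance, `recycle` tagged as `["VB"]` trips Rule 2, while `recycled count` tagged as `["VBN", "NN"]` passes all three checks.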
Rule 3: Field names should never be only an adjective.
Violations detected: 5,487 (3.8% of field names)

An interesting exception to this rule occurs with data structures where the field name has an established conventional meaning. For example, when naming the next node in a linked list, next is commonly accepted. Other similar common names include previous and current.

The final rule deals with field names for booleans. Boolean variables represent a state that is or is not, and this notion needs to be evident in the name. The identifier deleted offers a good example. By itself there is no way to know for sure what is being represented. Is this a pointer to a deleted thing? Is it a count of deleted things? Source-code inspection reveals that such boolean variables tend to represent whether or not something is deleted. Thus potential improved names include is deleted or was deleted.

Rule 4: Boolean field names should contain a third-person form of the verb to be or the auxiliary verb should.
Violations detected: 5,487 (3.8% of field names)

Simply adding is or was to booleans does not guarantee a fix to the problem. For example, consider a boolean variable that indicates whether something should be allocated in a program. In this case, the boolean captures whether some event should take place in the future. In this example, an appropriate temporal sense is missing from the name. A name like allocated does not provide enough information, and naming it is allocated does not make logical sense in the context of the program. A solution to this naming problem is to change the identifier to should be allocated, which includes the necessary temporal sense, communicating that this boolean is a flag for something expected to happen in the future.

4 Related Work

This section briefly reviews three projects that use POS information. Each uses an off-the-shelf POS tagger or lookup table. First, Høst et al. study the naming of Java methods using a lookup table to establish POS tags [7]. Their aim is to find what they call naming bugs by checking to see if a method's implementation is properly indicated by the name of the method. Second, Abebe and Tonella study class, method, and attribute names using a POS tagger based on a modification of minipar to extract domain concepts [1]. Nouns in the identifiers are examined to form ontological relations between concepts. Based on a case study, their approach improved concept searching. Finally, Shepherd et al. considered finding concepts in code using natural-language information [11]. The resulting Find-Concept tool locates action-oriented concerns more effectively than the other tools and with less user effort. This is made possible by POS information applied to source code.

5 Summary

This paper presents the results of an experiment into the accuracy of the Stanford Log-linear POS Tagger applied to field names. The best template, List Item, has an accuracy of 81.7%. If an optimal combination of the four templates were used, the accuracy rises to 88%. These POS tags were then used to develop field-name formation rules that 28.9% of the identifiers violated. Thus the tagging can be used to support improved naming. Looking forward, two avenues of future work include automating this improvement and enhancing POS tagging for source code. For the first, the source code would be mined for related terms to be used in suggested improved names. The second would look at training a POS tagger using, for example, the machine-learning technique of domain adaptation [8], which emphasizes the text in the training data that is most similar to identifiers, to produce a POS tagger for identifiers.

6 Acknowledgments

Special thanks to Mike Collard for his help with srcML and the XPath queries and Phil Hearn for his help with creating the oracle set. Support for this work was provided by NSF grant CCF 0916081.

References

[1] S. L. Abebe and P. Tonella. Natural language parsing of program element names for concept extraction. In 18th IEEE International Conference on Program Comprehension. IEEE, 2010.
[2] K. Atkinson. Spell checking oriented word lists (SCOWL).
[3] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[4] B. Caprile and P. Tonella. Restructuring program identifier names. In ICSM, 2000.
[5] M. L. Collard, H. H. Kagdi, and J. I. Maletic. An XML-based lightweight C++ fact extractor. In 11th IEEE International Workshop on Program Comprehension, pages 134-143, 2003.
[6] E. Høst and B. Østvold. The programmer's lexicon, volume I: The verbs. In International Working Conference on Source Code Analysis and Manipulation, Beijing, China, September 2008.
[7] E. W. Høst and B. M. Østvold. Debugging method names. In ECOOP '09. Springer Berlin / Heidelberg, 2009.
[8] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL 2007, 2007.
[9] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In Proceedings of the 17th Working Conference on Reverse Engineering, 2010.
[10] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In ACL '07. ACL, June 2007.
[11] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD '07. ACM, March 2007.
[12] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003, 2003.
