Practical Corpus Linguistics

An Introduction to Corpus-Based Language Analysis

Paperback, English, 2016

By Martin Weisser

749 kr

Special-order item. Ships within 7-10 business days

Free shipping for members on purchases of at least 249 kr.

Available in other formats (1)

Hardcover

1 509 kr

Ships within 7-10 business days

This is the first book of its kind to provide a practical and student-friendly guide to corpus linguistics that explains the nature of electronic data and how it can be collected and analyzed.

Designed to equip readers with the technical skills necessary to analyze and interpret language data, both written and (orthographically) transcribed.

Introduces a number of easy-to-use, yet powerful, free analysis resources consisting of standalone programs and web interfaces for use with Windows, Mac OS X, and Linux.

Each section includes practical exercises, a list of sources and further reading, and illustrated step-by-step introductions to analysis tools.

Requires only a basic knowledge of computer concepts in order to develop the specific linguistic analysis skills required for understanding/analyzing corpus data.

Product information

Publication date: 2016-02-05
Dimensions: 185 x 241 x 18 mm
Weight: 590 g
Format: Paperback
Language: English
Number of pages: 312
Publisher: John Wiley and Sons Ltd
ISBN: 9781118831885

Belongs to the following categories

Language Studies and Linguistics, within Language and Dictionaries

Martin Weisser is a Professor in the National Key Research Center for Linguistics and Applied Linguistics at Guangdong University of Foreign Studies, China. He is the author of Essential Programming for Linguistics (2009), and has published numerous articles and book chapters, including contributions to The Encyclopedia of Applied Linguistics (Wiley, 2012) and Corpus Pragmatics: A Handbook (2014).

List of Figures
List of Tables
Acknowledgements

1 Introduction
1.1 Linguistic Data Analysis
1.1.1 What’s data?
1.1.2 Forms of data
1.1.3 Collecting and analysing data
1.2 Outline of the Book
1.3 Conventions Used in this Book
1.4 A Note for Teachers
1.5 Online Resources

2 What’s Out There?
2.1 What’s a Corpus?
2.2 Corpus Formats
2.3 Synchronic vs. Diachronic Corpora
2.3.1 ‘Early’ synchronic corpora
2.3.2 Mixed corpora
2.3.3 Examples of diachronic corpora
2.4 General vs. Specific Corpora
2.4.1 Examples of specific corpora
2.5 Static Versus Dynamic Corpora
2.6 Other Sources for Corpora
Solutions to/Comments on the Exercises
Note
Sources and Further Reading

3 Understanding Corpus Design
3.1 Food for Thought – General Issues in Corpus Design
3.1.1 Sampling
3.1.2 Size
3.1.3 Balance and representativeness
3.1.4 Legal issues
3.2 What’s in a Text? – Understanding Document Structure
3.2.1 Headers, ‘footers’ and meta-data
3.2.2 The structure of the (text) body
3.2.3 What’s (in) an electronic text? – understanding file formats and their properties
3.3 Understanding Encoding: Character Sets, File Size, etc.
3.3.1 ASCII and legacy encodings
3.3.2 Unicode
3.3.3 File sizes
Solutions to/Comments on the Exercises
Sources and Further Reading

4 Finding and Preparing Your Data
4.1 Finding Suitable Materials for Analysis
4.1.1 Retrieving data from text archives
4.1.2 Obtaining materials from Project Gutenberg
4.1.3 Obtaining materials from the Oxford Text Archive
4.2 Collecting Written Materials Yourself (‘Web as Corpus’)
4.2.1 A brief note on plain-text editors
4.2.2 Browser text export
4.2.3 Browser HTML export
4.2.4 Getting web data using ICEweb
4.2.5 Downloading other types of files
4.3 Collecting Spoken Data
4.4 Preparing Written Data for Analysis
4.4.1 ‘Cleaning up’ your data
4.4.2 Extracting text from proprietary document formats
4.4.3 Removing unnecessary header and ‘footer’ information
4.4.4 Documenting what you’ve collected
4.4.5 Preparing your data for distribution or archiving
Solutions to/Comments on the Exercises
Sources and Further Reading

5 Concordancing
5.1 What’s Concordancing?
5.2 Concordancing with AntConc
5.2.1 Sorting results
5.2.2 Saving, pruning and reusing your results
Solutions to/Comments on the Exercises
Sources and Further Reading

6 Regular Expressions
6.1 Character Classes
6.2 Negative Character Classes
6.3 Quantification
6.4 Anchoring, Grouping and Alternation
6.4.1 Anchoring
6.4.2 Grouping and alternation
6.4.3 Quoting and using special characters
6.4.4 Constraining the context further
6.5 Further Exercises
Solutions to/Comments on the Exercises
Sources and Further Reading

7 Understanding Part-of-Speech Tagging and Its Uses
7.1 A Brief Introduction to (Morpho-Syntactic) Tagsets
7.2 Tagging Your Own Data
Solutions to/Comments on the Exercises
Sources and Further Reading

8 Using Online Interfaces to Query Mega Corpora
8.1 Searching the BNC with BNCweb
8.1.1 What is BNCweb?
8.1.2 Basic standard queries
8.1.3 Navigating through and exploring search results
8.1.4 More advanced standard query options
8.1.5 Wildcards
8.1.6 Word and phrase alternation
8.1.7 Restricting searches through PoS tags
8.1.8 Headword and lemma queries
8.2 Exploring COCA through the BYU Web-Interface
8.2.1 The basic syntax
8.2.2 Comparing corpora in the BYU interface
Solutions to/Comments on the Exercises
Sources and Further Reading

9 Basic Frequency Analysis – or What Can (Single) Words Tell Us About Texts?
9.1 Understanding Basic Units in Texts
9.1.1 What’s a word?
9.1.2 Types and tokens
9.2 Word (Frequency) Lists in AntConc
9.2.1 Stop words – good or bad?
9.2.2 Defining and using stop words in AntConc
9.3 Word Lists in BNCweb
9.3.1 Standard options
9.3.2 Investigating subcorpora
9.3.3 Keyword lists
9.4 Keyword Lists in AntConc and BNCweb
9.4.1 Keyword lists in AntConc
9.4.2 Keyword lists in BNCweb
9.5 Comparing and Reporting Frequency Counts
9.6 Investigating Genre-Specific Distributions in COCA
Solutions to/Comments on the Exercises
Sources and Further Reading

10 Exploring Words in Context
10.1 Understanding Extended Units of Text
10.2 Text Segmentation
10.3 N-Grams, Word Clusters and Lexical Bundles
10.4 Exploring (Relatively) Fixed Sequences in BNCweb
10.5 Simple, Sequential Collocations and Colligations
10.5.1 ‘Simple’ collocations
10.5.2 Colligations
10.5.3 Contextually constrained and proximity searches
10.6 Exploring Colligations in COCA
10.7 N-grams and Clusters in AntConc
10.8 Investigating Collocations Based on Statistical Measures in AntConc, BNCweb and COCA
10.8.1 Calculating collocations
10.8.2 Computing collocations in AntConc
10.8.3 Computing collocations in BNCweb
10.8.4 Computing collocations in COCA
Solutions to/Comments on the Exercises
Sources and Further Reading

11 Understanding Markup and Annotation
11.1 From SGML to XML – A Brief Timeline
11.2 XML for Linguistics
11.2.1 Why bother?
11.2.2 What does markup/annotation look like?
11.2.3 The ‘history’ and development of (linguistic) markup
11.2.4 XML and style sheets
11.3 ‘Simple XML’ for Linguistic Annotation
11.4 Colour Coding and Visualisation
11.5 More Complex Forms of Annotation
Solutions to/Comments on the Exercises
Sources and Further Reading

12 Conclusion and Further Perspectives

Appendix A: The CLAWS C5 Tagset
Appendix B: The Annotated Dialogue File
Appendix C: The CSS Style Sheet
Glossary
References
Index

"This textbook makes Practical Corpus Linguistics accessible to everyone. The focus on methodological and technical aspects and the instructive dimension of the book – nothing is considered obvious or already known – make it very useful to any corpus linguist aiming at a better understanding of his/her data...Through the various exercises, it is very easy to test one’s comprehension and the reader gradually gains confidence. The educational, sometimes entertaining tone as well as the glossary also contribute to gradually enhance the reader’s learning capacities in a field in which many feel insecure...It should accompany scholars at the beginning of any research to raise awareness about technical issues that are too often overlooked..." - Robert A. Cote for The LINGUIST List, December 2016