Mapping ancient literary classics by numbers

2022-06-11

Big data and its associated technologies have had a significant impact on systems of social knowledge and on ways of thinking. Deep and efficient analysis of ancient literary classics with these techniques can give literary research a broader perspective, improve the accuracy, stability and verifiability of its conclusions, and promote new research concepts, methods and paradigms.

Since the information revolution, great progress has been made in accumulating data on ancient books and building knowledge bases for them. From this vast sea of books, text sets of any size can be assembled, each with data features along different dimensions. Statistics on characters, sentences and whole texts yield insights that reading on paper does not. Researchers at Google and Harvard University used a jointly developed database to calculate the frequency of words and phrases in nearly 5.2 million books published between 1800 and 2000, so that the frequency and trend of any word or phrase across those years can be clearly displayed. Such word-frequency statistics are widely used to study the rise and fall of named things, the changing popularity of topics, and the influence of individuals or groups. Similarly, in the era of big data, new technologies and research ideas make it possible to compensate for the weaknesses of paper editions of ancient books in structured arrangement, in aggregating and ordering large amounts of data, and in connecting and presenting relationships in multiple dimensions.

We use the Guoxue Baodian database, which contains more than 2.2 billion characters of ancient texts from successive dynasties in over ten thousand titles. From it we selected nearly one hundred of the most central classics and examined them from several angles, including character counts, character ratio (TTR_H) and character and word frequency, comparing them by association across period and genre. This has produced a series of significant propositions and findings touching on the history of the Chinese language, stylistics, the archaeology of knowledge, the study of traditional primers, the modern transition from classical to vernacular Chinese, and other fields and cross-disciplines. It is an effective example of “mapping the classics by numbers and renewing the humanities through technology”.

In a longitudinal survey of the data from the pre-Qin period to the Qing dynasty, the first thing to notice is that both the total length of a single classic and the number of distinct characters it uses tend to increase. The former is clearly tied to changes in the material form of texts. The latter is also affected by developments within medieval Chinese, and is related to the growth in the total number of books and the spread of knowledge in society from the Han dynasty to the medieval period. Reference works and texts used for teaching literacy rank near the top, such as Erya (3,360 characters), the Commentary on the Water Classic (4,490) and Guwen Guanzhi (3,863). From the Han dynasty onward, scholars increasingly sought to accumulate scholarly and social views and to distill the essence of life, so their works are often rich and wide-ranging. The Records of the Grand Historian, which set out “to explore the boundary between heaven and man and to comprehend the changes of past and present”, and Huainanzi use 4,730 and 3,900 distinct characters respectively. These figures stand out among the ancient and medieval texts in the survey and are comparable to those of Ming and Qing novels (the Four Great Classical Novels and Strange Stories from a Chinese Studio use between 3,931 and 4,936).

Data alone cannot bring about an “intelligent” transformation; how the data are interpreted matters more than the data themselves. Besides linking statistical analysis to classic questions, segmenting and clustering the data are also fundamental. A classic case: when the text of A Dream of Red Mansions is divided into blocks of forty chapters, the marked difference in character usage in the last block corroborates the long-standing doubts about its authorship.

However, judging the quality or difficulty of a work directly by its number of characters would be mechanical statistics. For example, the novels that rank highest in distinct characters owe their counts to their great length, broad content and mixture of refined and colloquial styles. Moreover, because the stock of commonly used Chinese characters is limited, the character ratio falls as a text grows longer. The TTR_H model, commonly used in computational linguistics, was therefore introduced to correct the character ratio. In the final results, the books with the highest character ratio were all primers: Thousand Character Classic (1), Hundred Family Names (0.986), Three Character Classic (0.894) and Shenglü Qimeng (0.857). Evidently their authors deliberately packed as many different characters as possible into a limited space and level of difficulty, so that students could learn many characters in a concentrated way. What criteria did the compilers of these primers use? Did they choose characters that were frequent in the classics of their time, characters common in daily life, or something else? How was the selection carried out? These questions all deserve further exploration.

Because function characters and content characters differ in nature and in what they can explain, they are usually counted separately when character frequencies are used to examine classic questions. Function words often serve as characteristic data in research on the history of the Chinese language and related fields, and they are also telling parameters for comparing the styles of works.
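
The two measurements described above, a length-corrected character ratio and separate frequency lists for function and content characters, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual procedure. The article does not define TTR_H, so the sketch assumes Herdan's logarithmic ratio log(V) / log(N), where V is the number of distinct characters and N the total number of characters. The names `char_stats` and `split_frequencies` and the small `FUNCTION_CHARS` set are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny set of classical particles treated as
# "function characters"; a real study would rely on a curated list.
FUNCTION_CHARS = frozenset("之乎者也而於以其矣焉")


def cjk_chars(text):
    """Keep only CJK unified ideographs, dropping punctuation and spaces."""
    return [c for c in text if "\u4e00" <= c <= "\u9fff"]


def char_stats(text):
    """Return total characters, distinct characters and two ratios."""
    chars = cjk_chars(text)
    tokens = len(chars)      # N: total characters
    types = len(set(chars))  # V: distinct characters
    ratio = types / tokens if tokens else 0.0
    # Assumed correction (not confirmed by the article): log(V) / log(N).
    log_ratio = math.log(types) / math.log(tokens) if tokens > 1 else ratio
    return {"tokens": tokens, "types": types,
            "ratio": ratio, "log_ratio": log_ratio}


def split_frequencies(text, top=3):
    """Rank function and content characters in two separate lists."""
    counts = Counter(cjk_chars(text))
    function = Counter({c: n for c, n in counts.items() if c in FUNCTION_CHARS})
    content = Counter({c: n for c, n in counts.items() if c not in FUNCTION_CHARS})
    return function.most_common(top), content.most_common(top)


if __name__ == "__main__":
    sample = "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？"
    print(char_stats(sample))
    print(char_stats(sample * 2))  # same distinct characters, twice the length
    print(split_frequencies(sample))
```

Repeating the sample doubles its length without adding any new characters, so the plain ratio is halved while the logarithmic ratio falls less sharply. This is the length effect that motivates a corrected ratio.
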
The proportion of function words is itself a stylistic marker of different authors. Among the Five Classics, to borrow the later practice of “distinguishing genres”, the Shi is verse, so content words account for a larger share of its high-frequency words than in the other four. The traditional view in poetics that “the more content words the better, the more function words the weaker” may have originated here. In general, content words are more likely to be high-frequency words in shi, ci and qu poetry than in prose. Shang Shu, the earliest of the Five Classics, also preserves traces of the evolution of ancient Chinese. The function word “wei” is the most frequent word in Shang Shu, which both relates to the nature of many of its documents and reflects the difference between early Chinese and the language of later periods.

From the same perspective, we can observe another great change in the history of the Chinese language. Dialogue is an important element of novels, so verbs of speaking naturally rank high in frequency: “yue” in Romance of the Three Kingdoms and Liaozhai, and “dao” in Journey to the West and Water Margin. The use of “dao” is an important sign that the classical-Chinese coloring of the second pair of works had weakened. The real shift to the vernacular occurred in A Dream of Red Mansions.
In that novel, “di” for the first time replaced “zhi”, which has the same grammatical function, as the second most frequent word on the list. The most frequent word in A Dream of Red Mansions is another function word characteristic of the vernacular, “ha”, which is also the most frequent word in Water Margin.

In contrast to function words, content words map the content and themes of a text, and behind them lie important propositions about the historical evolution of ideas. Taking the Five Classics again as an example, the most frequent content words in the Shi, Shu, Li, Yi and Spring and Autumn Annals are “I”, “wang” (king), “ren” (person), “xiang” (image) and “zi” respectively. The Book of Songs has the strongest lyrical coloring, just as the Preface to the Mao Poems says that “the affairs of a whole state are tied to the person of one man”. Shang Shu is a documentary record of the canons, counsels, instructions, announcements, oaths and charges of the three ancient dynasties, and at its core it records the words and deeds of kings. Confucius disciplined people through “restraining the self and returning to li”. Li, or “courtesy”, is the outward expression of a person's inner character.
To speak of li, or “courtesy”, without grounding it in people is therefore to deprive it of its foundation. That “image” is the object of interpretation in the Zhouyi needs no argument. “In antiquity, when Paoxi ruled the world, he looked up to observe the images in the sky and looked down to observe the patterns on the earth; he observed the markings of birds and beasts and what suited the land; drawing on his own body near at hand and on all things far away, he first made the eight trigrams of the Yi in order to hand down exemplary images.” This passage from the preface to Shuo Wen Jie Zi shows that “image” is not only the key to the Zhouyi but also embodies the principles by which Chinese characters were formed and the way Chinese culture thinks. The most frequent content word in the Zuo Zhuan of the Spring and Autumn Annals is “zi”, which means both the second person singular and a title of feudal lords and rulers. The latter sense is the core of the narrative of the Spring and Autumn Annals. Confucius composed the Annals to record, in subtle words, an extraordinary age in which “rites and music proceeded from the feudal lords”.
As a chronicle, it takes the hierarchy and moral choices of the feudal lords as its underlying thread.

Qian Zhongshu's Tan Yi Lu opens with the widely influential essay “Poetry divided into Tang and Song”. Yan Yu of the Song dynasty said in his remarks on poetry that “people of the present dynasty value principle, while people of the Tang valued inspired interest”. The differences in character between Tang and Song poetry are rather elusive, but quantitative analysis lets us grasp their linguistic features in detail. Word-frequency statistics were compiled for the more than 57,000 poems of the Complete Tang Poems and the 254,000 poems of the Complete Song Poems. The top ten high-frequency words in the Complete Tang Poems are, in order: do not know, where, ten thousand li, a thousand li, do not see, must not, white clouds, today, spring breeze, cannot. In the Complete Song Poems they are: do not know, spring breeze, life, cannot, ten thousand li, a thousand li, the human world, do not see, ten years, where. (Ranks are given in parentheses below without further comment.) When the statistics are extended to the top 100, many propositions about the styles of Tang and Song poetry can be drawn out from between the words. As an illustration of Yan Yu's judgment, among the top 100 words, words for scenery are more prominent in Tang poems than words of other kinds, such as “baiyun” (white clouds, 7th) and “mingyue” (bright moon, 11th).
Fragments of vocabulary though they are, they make the atmosphere of the Tang vividly visible. By comparison, these two images fall to 19th and 23rd place respectively in the word-frequency statistics for the Complete Song Poems. Yan Yu's view that the present dynasty “values principle” is also borne out by the data: philosophical expressions about life, such as “life” (3rd) and “the human world” (8th), rank higher in Song poems than in Tang poems (30th and 13th respectively). Another point is worth pondering. Under the influence of Neo-Confucianism, Song writers advocated self-restraint and self-examination, yet their poetry has no shortage of “fame” (36th) and “wealth” (78th), which Tang poets rarely wrote about. By contrast, the words of feeling that were common in Tang poetry, “melancholy” (15th) and “love” (22nd), dropped out of the top one hundred in Song poetry.

Tang poetry emphasizes space, while Song poetry emphasizes time. The vast cosmic consciousness and boundless sense of space of Tang poetry can be seen in its five most frequent words (do not know, where, ten thousand li, a thousand li, do not see). The Japanese Sinologist Kojiro Yoshikawa once proposed that Tang poetry gazes at the precious moments of life as they burn, attending only to the climax of its object. Song poetry, on the other hand, is oriented toward duration and sees life as a long and continuous process. Measured against the word statistics, the highest-ranking time word in Tang poetry is “today” (8th), on which time and emotion are concentrated, while in Song poetry “ten years” (9th) ranks highest, followed by “today” (12th) and “a hundred years” (20th). Yoshikawa extended the idea of “burning and persistence” to contrasts in the choice of imagery.
Sunset is the scene of burning and rain is the scene of persistence, hence the classic judgment that “Tang poets write of the sunset and Song poets write of rain”. The word-frequency statistics confirm this as well: three expressions for sunset that rank 55th, 59th and 69th in the Complete Tang Poems all fall below 90th in the Song poems.

Unlike the modern era of information explosion, the body of classical texts handed down to us has relatively clear boundaries, but its volume is still too large for researchers focused on a particular topic or field to master. Text analysis of ancient literary classics based on big data technology does not focus on the classics alone but takes the vast body of basic literature as its foundation. It aims to use efficient and comprehensive data mining to carry out accurate and effective text analysis in a short time. Conclusions in traditional classical studies are usually reached through observation, reflection and insight in the course of an individual's limited reading, and they are often subjective and even a priori. The aggregation of big data and the application of computational analysis can let conclusions “emerge automatically”, sometimes unexpectedly. Using big data to reconnect things that were previously divided and isolated has changed the paths by which we understand literature, texts and knowledge and the scale at which we grasp them.

Starting from one small aspect of the big data toolkit, character and word frequency statistics, we have gained preliminary experience of exploring texts in classics, linguistics, literature and other fields in a new way. Compared with an integrated literature knowledge base rebuilt with different technical means, different structuring methods and different levels of granularity, the work described above may be only a small attempt. We believe that with the accumulation, layering and mapping of statistical data, the study of ancient books and traditional culture will surely show more vigor and vitality.

(Authors: Liu Shi, chief expert of “Analysis and Research on Ancient Literary Classics Based on Big Data Technology”, a major project of the National Social Science Fund, and professor at Tsinghua University; Yin Xiaolin, full-time researcher at the Chinese Poetry Research Center of Capital Normal University)