[Paper Summary] The Future of Computational Linguistics: On Beyond Alchemy
— Kenneth Church & Mark Liberman, 2021
1950s empirical approach: Information theory and AI as applied statistical methods
1970s rational framework: Formalized language studies and logic
1990s empirical approach: Stochastic grammatical structures (probability & preference)
2010s empirical approach: Deep neural networks (computer vision & natural language processing)
The Past
- CL is an interdisciplinary topic that has been closer to Linguistics at times, but is currently closer to CS (Engineering), and especially ML.
- To understand the future of our field, we need to understand its past, which we will describe as tribes of researchers migrating through a changing conceptual and socio-economic landscape. In the history of AI, logic and probability have sometimes co-existed in harmony. The conceptual split implicit in Boole’s conjunction “Logic and Probabilities” (in his seminal 1854 book on mathematical logic, “An Investigation of the Laws of Thought on Which are Founded the Mathematical Theories of Logic and Probabilities”) foreshadowed much of what was to come. A century later, Claude Shannon made two crucial contributions, one on each side of the divide. On the logic side, his 1937 Master’s thesis, “A Symbolic Analysis of Relay and Switching Circuits”, made digital computers possible by translating expressions of Boolean logic into electromechanical terms. On the probability side, his 1948 monograph “A Mathematical Theory of Communication” introduced information theory, based on the assumption that messages are sequences of symbols generated by a stochastic process.
- Logic played a larger role when rationalism was in fashion, probability played a larger role when empiricism was in fashion, and both logic and probability faded into the background as deep nets gave procedural life to an associationist (rather than statistical) flavor of empiricism.
- The pendulum paper predicted that the next generation would soon rebel against us. Instead of a return to Rationalism, though, the rebellion took an unexpected turn with the revival of Connectionism.
- Actually, the third concept, Connectionism, arose in Ronald Fisher’s 1936 paper “The use of multiple measurements in taxonomic problems”, which proposed a method for dividing multivariate measurements into categories based on thresholding a linear combination of the measurement vectors (Linear Discriminant Analysis). This is where the main story begins: generalizations of this idea — networks of matrix multiplications interconnected via point non-linearities — have risen to prominence under a variety of names: “perceptron”, “neural nets”, “parallel distributed processing”, “connectionism”, and most recently “deep learning”. Fisher’s contributions have been extremely influential and live on to this day, though the 1936 paper was published in a journal on eugenics, a topic that has fallen out of favor.
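To make the idea concrete, here is a minimal sketch (my own illustration in numpy, not Fisher’s notation) of classifying by thresholding a linear combination of the measurement vectors:

```python
# Sketch: classify two groups by thresholding a linear combination of measurements.
import numpy as np

def fisher_rule(X0, X1):
    """Return weights w and threshold t from two (n, d) arrays of measurements."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # pooled within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), mu1 - mu0)
    t = w @ (mu0 + mu1) / 2.0                                   # midpoint of projected means
    return w, t

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(100, 4))   # toy "species A" measurements
X1 = rng.normal(1.5, 1.0, size=(100, 4))   # toy "species B" measurements
w, t = fisher_rule(X0, X1)
labels = (np.vstack([X0, X1]) @ w > t).astype(int)   # 1 = assigned to class B
```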
- It was an extension of Fisher’s discriminant analysis that succeeded in the neural nets’ arena, starting with Frank Rosenblatt’s 1958 paper: “The perceptron: a probabilistic model for information storage and organization in the brain”. The phrase “probabilistic model” in Rosenblatt’s title seems to mean only that most of the values used in learning, computation, and storage are gradient rather than categorical. Decisions are made by thresholding, on an all-or-nothing basis. The important novelty in this work was a simple iterative method for learning the weights and thresholds needed to make the perceptron compute the right answer.
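And a minimal sketch of the kind of simple iterative weight update Rosenblatt’s perceptron uses (a textbook reconstruction, not the 1958 formulation; labels assumed in {-1, +1}):

```python
# Sketch of the perceptron rule: nudge weights only when the thresholded decision is wrong.
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else -1    # all-or-nothing threshold decision
            if pred != yi:                         # mistake-driven update
                w += lr * yi * xi
                b += lr * yi
    return w, b
```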
Fads come and fads go. There are booms and there are busts.
- In hindsight, the metaphoric expressions of rebellions and pendulums are regrettable.
- Sometimes ideas are conflicting, especially when powerful individuals like Chomsky emerge; however, there is usually ample room for multiple positions to coexist.
- It’s tempting to highlight differences; yet more advantageous to focus on synergies.
- The future may encompass a blend of logic, probability, and connectionism; though likely varying in proportion.
Connectionism, Associationism and Empiricism
- Associationism: Mental constructs become interlinked based on an individual’s past experiences. The extent to which an individual has encountered and interacted with Xs and Ys in their environment determines the likelihood that thoughts about X and thoughts about Y will co-occur in their future cognitive processes.
- The adoption of end-to-end systems has become increasingly prevalent in recent years. Mercer, an early proponent, championed this approach in the context of [Church & Mercer, Special Issue on Computational Linguistics Using Large Corpora], an attitude captured by the well-known quip: "Every time I fire a linguist, the performance of the speech recognizer goes up."
The Alchemy
- A common motivation for looking inside the black box is explanation: these systems work remarkably well most of the time, but occasionally fail in unusually bad ways.
- Audiences tend to have strong priors in the Alchemy debate. There are unlikely to be many persuadable voters left on this issue, yet people love to hear their side score (pointless) debating points.
- The alchemy argument tends to come up when practice appears to be ahead of theory.
- In his NIPS 2017 Test-of-Time award talk, Rahimi bluntly argued that "machine learning has become alchemy." The debate reflects growing concern about the ever-expanding space of unprincipled tricks (the so-called model zoo).
- LeCun stands on the other side of the debate: criticizing an entire field merely because its theoretical tools have not yet caught up with practice is dangerous. In his view, "theory often follows invention" [Yann LeCun 2019, The Epistemology of Deep Learning].
- In "The Great AI Debate" (https://www.youtube.com/watch?v=93Xv8vJ2acI), Rich Caruana gives a classic example: judging from the patterns in the data, asthma appears to help with pneumonia. The core problem is a missing variable: asthma actually serves as a proxy for something we really want to know, namely how quickly patients seek care. Asthma patients, at least, get to care faster, but the interval between the onset of symptoms and treatment is not something we can currently measure.
Ok now, which one do you want? The one that is explainable, or the one that actually works?
The debate is often seen as a contest between engineering and science
- LeCun cares deeply about making machines that work, whereas Tenenbaum would like to understand common sense and learning, especially in young children (and animals).
- LeCun is, in fact, very sympathetic to the psychologist Josh Tenenbaum’s position. In his talk [Yann LeCun 2017, How does the brain learn so much so quickly?]: "You know, the punchline is … our AI systems need to learn models of the world … it is very similar to what Josh was saying." LeCun is well aware of various limitations of currently popular methods, and he has a nice way of making such points extremely crisply: "the revolution will not be supervised".
- Commonsense is the ability to fill in the blanks: infer the state of the world from partial information / infer the future from the past and present / infer past events from the present state.
- Predicting any part of the past, present or future percepts from whatever information is available. That’s what predictive learning is. But really, that’s what people mean by unsupervised learning.
- The number of samples required to train a large learning machine (for any task) depends on the amount of information that we ask it to predict. Predicting human-provided labels is not enough; predicting a value function is not enough! The brain has about 10^14 synapses and we only live for about 10^9 seconds, so we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning, since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second. [Geoffrey Hinton 2014 AMA on Reddit] (See the arithmetic written out after the next bullet.)
- LeCun stands by the perspective that "technology drives and motivates science (and vice versa); sciences are born from the study of technological artifacts," which he also uses as a counterargument in the Great AI Debate. Let’s be inspired by nature, but not too much. For airplanes, we developed aerodynamics and compressible fluid dynamics; we figured out that feathers and wing flapping weren’t crucial. Question: what is the equivalent of aerodynamics for understanding intelligence?
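To spell out Hinton’s back-of-the-envelope arithmetic: roughly 10^14 synapses (parameters) divided by roughly 10^9 seconds of lifetime means on the order of 10^5 constraints are needed per second, far more than labels or reward signals provide, which is why the raw perceptual stream is the only plausible source.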
- Dyna: an Integrated Architecture for Learning, Planning and Reacting: The main idea of Dyna is the old, commonsense idea that planning is “trying things in your head,” using an internal model of the world. This suggests the existence of a more primitive process for trying things not in your head, but through direct interaction with the world. Reinforcement learning is the name we use for this more primitive, direct kind of trying, and Dyna is the extension of reinforcement learning to include a learned world model.
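A minimal tabular Dyna-Q-style sketch of that idea (my own illustration; `env` is a hypothetical toy environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`; hyperparameters are arbitrary):

```python
# Sketch of Dyna: direct RL updates from real experience, plus planning updates
# ("trying things in your head") replayed from a learned world model.
import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=50, alpha=0.1, gamma=0.95,
           epsilon=0.1, planning_steps=10):
    Q = defaultdict(float)      # Q[(state, action)] value estimates
    model = {}                  # learned world model: (state, action) -> (reward, next_state)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = (random.randrange(n_actions) if random.random() < epsilon
                 else max(range(n_actions), key=lambda a_: Q[(s, a_)]))
            s_next, r, done = env.step(a)
            # 1) Direct reinforcement learning from the real transition.
            best_next = max(Q[(s_next, a_)] for a_ in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # 2) Learn the world model from the same transition.
            model[(s, a)] = (r, s_next)
            # 3) Planning: imagined transitions sampled from the learned model.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                best_plan = max(Q[(ps_next, a_)] for a_ in range(n_actions))
                Q[(ps, pa)] += alpha * (pr + gamma * best_plan - Q[(ps, pa)])
            s = s_next
    return Q
```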
DARPA’s AI Next Campaign
A 1962 DARPA mission document set out the aim of developing advanced computational systems; despite significant advances, substantial effort is still required, particularly on the final goal of being able to "respond robustly to unforeseen situations".
DARPA’s AI Next Conference
- The time is ripe for change: it was well understood that we do not need more cat detectors.
- The first slide of Tenenbaum’s keynote said: “Where is the gap?” — Intelligence is not just about pattern recognition. It is about modeling the world.
- There are three waves: 1) Expert Systems: “if this, do that” 2) ML: Siri, face recognition, etc. 3) AI Next: TBD.
- The first wave relied on programming by hand, with no capacity for learning.
- In the current second wave, benchmarks often underestimate real-world difficulties. The unrealistic idealization that "train and test come from the same population" leads to Godfrey’s Gap, where performance on benchmarks is unrealistically rosy.
- The emphasis of the third wave is on adding what you might call "contextual reasoning", or even commonsense. In situations an AI system has never encountered or been explicitly trained for, can it pick up clues from the environment and come to accurate conclusions? So far, evaluations have focused on typical (or average) case performance, but to achieve acceptable robustness we probably need to do more to handle novelty and adversarial attacks. The machines DARPA envisions will function more as colleagues than as tools.
Paths Forward
The situation has changed considerably. Conferences have grown enormously, perhaps beyond reasonable limits, and most of the work is incremental, perhaps too much so. This can look like going through the motions: keeping up appearances while doing the minimum that is required. The answer to "what can individuals do?" depends on personal interests, skills, and opportunities; people in different positions within the field encounter different aspects of its challenges and responsibilities.
The paper outlines paths forward toward its goals, tailored to people in different roles: funding agencies, managers in industrial research, senior researchers, and younger researchers. Below, I focus on the recommendations for younger researchers.
- The very first point hits the mark, and echoes my own "inner dichotomy": "So younger researchers need to keep an eye open for promising new developments, while demonstrating their ability to work within the currently standard framework."
- The publish-or-perish rat race is more intense than ever before. The old cliché "publish or perish" needs to be updated to "publish every month or perish". This rat race is not good for anyone, especially younger researchers caught in the middle of the maze.
- Given these realities, it is probably safer to set relatively modest goals. When starting out, it is probably wise to spend more time on smaller, conference-sized projects that are more likely to work out, and to avoid taking chances on great questions such as a) compositionality, b) causality, and c) long-distance dependencies. Questions that have been open for so long are unlikely to be solved quickly.
- Learn the old stuff, so that you have the freedom to choose and to reject, and so that others respect you when you do: traditional methods (that have stood the test of time) are more likely to solve your problem than whatever was posted on arXiv last night. As mentioned above, deep nets can be viewed both positively and negatively. On the positive side, end-to-end systems can be viewed as a way to finesse around potentially distracting details that have proved challenging for traditional models. On the other hand, end-to-end systems can be viewed as a lame excuse for ignorance (about traditional topics). So only by mastering the traditional methods do you earn the standing to accept or reject them, rather than having your choice dismissed as a lame excuse.
IAS Event: Deep Learning: Alchemy or Science? (a few notes)
Zachary Lipton: Troubling Trends in ML Scholarship
- We run lots of experiments, but few are aimed at generating knowledge of the natural world. Most empirical work is dedicated to leaderboard chasing and exhibits aspects of a pre-paradigmatic field.
- There is too much "journalism" today: the key metrics are clicks and impressions, news production is demand-driven, and we have lost the ability to focus on the pith.
- Good side: a) A vibrant economy around technologies reduced to practice. b) Financial resources for salaries, data and compute. c) More of us can now pursue full-time research careers past the PhD. d) Strong tooling and abundant tutorials widen access to ML.
- Down side: a) Overburdened reviewers. b) Emergence of new incentives unaligned with scholarship. c) Whole areas of research and discourse that are flatly wrong. d) Emergence of bad academic patterns sans (without) consequences.
- Abuse of language: suggestive terms, overloaded technical language, and suitcase words (the vague catch-all kind: fairness, interpretability, consciousness).
- We want it all! We want financial resources and objectives aligned with intellectual work. We want to grow labs past bandwidth and still provide mentorship. We want fame and notoriety, but not for the press to distort incentives. We want strong peer review, but not to become conservative like economics or statistics (economists and statisticians, hold off a moment before coming after me). We want better peer reviewers, but culling (selectively thinning out) either the community or the reviewer pool seems problematic.
Yann LeCun: The Epistemology of Deep Learning
- But does anyone use generalization bounds to perform model selection? No! Everyone uses cross-validation; everyone does hyperparameter selection through validation. Generalization bounds wildly over-estimate the generalization error. What about theories from Bayesian/frequentist statistics? Do they explain why a 100-million-parameter neural net generalizes with 1 million training samples?
- Street-light effect: theory can limit our creative thinking. Theory is our lamppost, but the keys to AI might be elsewhere.
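As a concrete illustration of what "everyone uses" in the first bullet above: hyperparameter selection by cross-validated score rather than by a bound (a minimal sketch assuming scikit-learn; the model and grid are arbitrary):

```python
# Sketch: pick a hyperparameter by cross-validation, not by a generalization bound.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

# Average held-out score for each candidate regularization strength.
scores = {alpha: cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
          for alpha in (0.01, 0.1, 1.0, 10.0)}
best_alpha = max(scores, key=scores.get)   # chosen by validation performance
```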
- What is an SVM, really? A two-layer net! Layer 1: kernels; layer 2: linear. The first layer is "trained" with the simplest unsupervised method ever devised: using the training samples as templates for the kernel function. Kernel methods are glorified template matching.
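A sketch of that two-layer view (my own toy illustration: layer 1 is an RBF kernel against the training samples used as templates; layer 2 is linear, fit here by ridge regression as a stand-in for the actual SVM solver):

```python
# Sketch of the "SVM as a two-layer net" view: kernel similarities to training
# templates form layer 1, followed by a linear layer 2.
import numpy as np

def rbf_layer(X, templates, gamma=1.0):
    # Layer 1: one unit per template, "trained" by memorizing the training samples.
    sq_dists = ((X[:, None, :] - templates[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = np.sign(X_train[:, 0] * X_train[:, 1])   # XOR-like labels in {-1, +1}

Phi = rbf_layer(X_train, X_train)
# Layer 2: linear weights (ridge regression stand-in for the SVM's linear layer).
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(len(X_train)), Phi.T @ y_train)
train_acc = (np.sign(Phi @ w) == y_train).mean()
```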
