Should LLMs be used in English writing support? What does research say?

10 research papers later....


I recently dissected ten new research papers, each one analyzing the usefulness of LLMs across various grammatical error detection and correction (GED / GEC) use cases in education. Naturally I was very excited, for a moment. I was keen to dig into the hypotheses, tests and conclusions of each researcher, and to stay in a narrow swim lane discussing LLM usefulness.

Well. What a heavy session of reading that ended up being. Luckily I had fourteen focus hours spare as I flew from Sydney to LA. At some point my iPad was taken off me and put in the seat pocket as I processed the detail. I persevered and ended up with my own meta-summary of each paper, in simple speak.

My mission was to weigh up the opportunities and risks LLMs pose, and to find a consensus of research opinion about LLMs’ effectiveness in GED and GEC use cases. Are they safe or scary to inject into education use cases? Should we be using various LLMs in education as an active medium for teaching and learning? Or as a component piece to get a job done with more efficiency and effectiveness? Or should we simply understand what’s possible and work on the basis of caveat emptor? Or all of the above?

In summary, all researchers finished by saying more research is needed. No surprises there. The main limitation of the research was that each LLM is in effect a black box. Until researchers can see into and validate what is in the box, research is limited to inputs and outputs, not a validation of the process. 

All is not lost. Common themes emerge across the opportunities and risks identified in each research paper. I have a detailed breakdown of opportunities and risks for each research paper if anyone is interested. The research papers I based my ‘meta-research’ on are listed in the appendix.

Summarised opportunity / risk score card for LLMs used in a GED / GEC scenario.

Opportunities of LLMs in GED / GEC:

  • Quick Feedback and Scalability: Helps learners improve faster and can be used widely.
  • Accurate Error Detection: Usually spots real mistakes accurately.
  • Flexible Language Use: Can change sentences without making them incorrect.
  • Creates High-Quality Example Data: Helps improve model performance in different areas.
  • Explainable Systems: Makes it easier for non-native speakers to understand corrections.
  • Self-Improvement: Uses its own feedback to get better over time.
  • Effective Multilingual Use: Works well with multiple languages.
  • Less Need for Human Input: Reduces the need for human-annotated examples.

Risks of LLMs in GED / GEC:

  • Potential Biases: Might introduce biases based on its training data, which is often focused on US adult writing.
  • Dependence on a Single Model: Over-reliance on models from just one source.
  • Ethical Concerns: Requires careful consideration of ethical issues.
  • Not a Replacement for Teachers: Can't fully replace human judgement in final assessments.
  • Low Recall: Can miss many errors.
  • Context-Sensitive Mistakes: Struggles with errors that depend on context.
  • Over-Correction: May correct too much, changing the original meaning.
  • Inaccuracies and Biases: Can generate incorrect or biased information.
  • Time and Cost Issues: May face challenges with time and cost efficiency.
  • Small Data Limits: Limited data size can hinder detailed analysis.
  • Need for Better Methods: Requires better techniques for improvement.
  • Relies on Powerful Critic Models: Needs a strong critic model for accurate feedback.
  • Higher Costs: Increased computational costs due to more extensive output.
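
"Accurate Error Detection" and "Low Recall" can look contradictory on the same score card. They are not: the first is about precision (how many of the flagged errors are real) and the second is about recall (how many of the real errors get flagged). Here is a minimal Python sketch of the arithmetic. The error positions are invented for illustration and do not come from any of the papers. GEC benchmarks commonly report F0.5, which weights precision more heavily than recall.

```python
def precision_recall_f(flagged, actual, beta=0.5):
    """Score flagged error positions against the true error positions."""
    flagged, actual = set(flagged), set(actual)
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # F-beta: beta < 1 favours precision, beta > 1 favours recall.
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical essay: a teacher marked errors at these word positions.
actual_errors = {3, 7, 12, 18, 25}
# Hypothetical LLM output: it flagged only two positions, both real errors.
llm_flags = {3, 12}

precision, recall, f_half = precision_recall_f(llm_flags, actual_errors)
print(f"precision={precision:.2f} recall={recall:.2f} F0.5={f_half:.2f}")
# prints: precision=1.00 recall=0.40 F0.5=0.77
```

In this made-up case everything the model flagged was a genuine mistake, yet it still missed three of the five errors, which is why a human check on the final text still matters.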


Conclusion:

Human-powered Actual Intelligence has no substitute or real competitor for now. We need to keep a human calibration cycle in the Venn diagram of connectivity between teachers and students. Efficiency in workflows and an increased cadence of support for students are very positive benefits. Errors, training-data biases and hallucinations remain on the risk list, coupled with a probable reduction in human cognitive motivation.

Are LLMs safe or scary to inject into education use cases? They are both. The context, reliance and oversight drive the safety level.

Should we be using various LLMs in education as an active medium for teaching and learning? Yes. Being cognisant of possibilities and risks encourages a safety net of actual intelligence when relying on LLM feedback. This is a must-have skill everyone should develop. It is really interesting to recall the cut, copy and paste days when, if you were not careful, you could go the wrong way simply by searching the wrong site. Has anything really changed in text creation using LLMs?

Should we be using various LLMs as a resource component to get a job done (teacher and/or student) with more efficiency, personalization and effectiveness? Yes, for sure. Time efficiencies are massive when LLMs are used well. Student engagement and self-directed learning increase, and students are willing to write more, more often. These are three elements Professor Paul Deane’s research suggests make the biggest difference to writing improvement. Again, being cognisant of possibilities and risks encourages a safety net of actual intelligence when relying on LLM feedback.

Should we understand what’s possible with LLMs and work on the basis of caveat emptor? Definitely. The risk in LLMs remains with the buyer, not the seller. LLMs 'are what they are' and beauty often emerges from the hands of the buyer. Every day I am encouraged by human gumption and the application of Actual Intelligence. May this never stop.

Or, finally, should we do all of the above? Probably the best answer research can give right now is to carry on, play safe and be smart. Meanwhile, more research is needed.

Research papers.

A large language model-assisted education tool to provide feedback on open-ended responses. Jordan K. Matelsky, Felipe Parodi, Tony Liu, Richard D. Lange, and Konrad P. Kording.

Assessing the Efficacy of Grammar Error Correction: A Human Evaluation Approach in the Japanese Context. Qiao Wang and Zheng Yuan.

ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark. Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael R. Lyu.

ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction. Jeiyoon Park, Chanjun Park, and Heuiseok Lim.

Correcting Challenging Finnish Learner Texts With Claude, GPT-3.5 and GPT-4 Large Language Models. Mathias Creutz.

Enhancing Grammatical Error Correction Systems with Explanations. Yuejiao Fei, Leyang Cui, Sen Yang, Wai Lam, Zhenzhong Lan, and Shuming Shi.

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large. Jussi S. Jauhiainen and Agustín Garagorry Guerra.

GPT-3.5 for Grammatical Error Correction. Anisia Katinskaia and Roman Yangarber.

No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models. Agnes Luhtaru, Elizaveta Korotkova, and Mark Fishel.

Prompting open-source and commercial language models for grammatical error correction of English learner text. Christopher Davis, Andrew Caines, Øistein Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, and Paula Buttery.

Teaching Language Models to Self-Improve by Learning from Language Feedback. Chi Hu, Yimin Hu, Hang Cao, Tong Xiao, and Jingbo Zhu.

Universality and Limitations of Prompt Tuning. Yihan Wang, Jatin Chauhan, Wei Wang, and Cho-Jui Hsieh.


