When the Machine Listens: How an AI Chatbot is Transforming Chinese Language Learning

This is the first entry in a multi-part series on designing an English–Chinese AI chatbot. More to come.


Speaking a new language is terrifying. For many students, the classroom is a high-pressure environment where the fear of making a mistake in front of peers can silence them completely. Teachers want to help, but in a class of twenty or thirty people, they simply cannot sit down with every single student for hours of one-on-one conversation practice.

But what if you could practice with a partner who never gets tired, never judges you, and is available at 3:00 AM? What happens when a machine listens—not just to transcribe your words, but to evaluate, coach, and converse with you?

Traditional automated systems, like those used in standardized tests, score pronunciation with mathematical precision. But they lack something essential: the ability to have a real conversation, to respond thoughtfully, and to help students understand not just what they got wrong, but why.

This post introduces a digital project that explores these questions. For a group of Chinese language students at Florida State University in summer 2025, I developed an AI-powered chatbot (built on GPT-4o mini) designed to listen to students’ spoken Chinese, evaluate their performance, and facilitate unlimited practice. This isn’t just about technology replacing teachers; it’s about using AI to create a “safe space” for students to find their voices. The results revealed both the promise and the complexity of using AI as a language learning partner.

How It Works

Students complete three types of speaking tasks. In picture-based story narration, they look at images and tell a story about what they see. In sentence repetition, they listen to a sentence and repeat it back. In free conversation, they talk spontaneously about topics of their choice.

After each task, the chatbot provides both a numerical score and qualitative feedback—written and spoken in conversational English. Students can respond to the feedback and keep talking with the chatbot. The grading criteria are based on the AP Chinese Language and Culture Scoring Guidelines. The AI doesn’t grade on its own terms—it follows the same rubric a human teacher would use.
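
For readers curious about the plumbing, here is a minimal sketch of what that evaluate-and-respond step might look like in code. To be clear, this is an illustration rather than the actual implementation: it assumes the OpenAI Python SDK, uses Whisper for transcription, omits the text-to-speech step, and RUBRIC_SUMMARY is a hypothetical stand-in for the full AP guidelines.

```python
# A minimal sketch of the evaluate-and-respond step, assuming the OpenAI
# Python SDK. RUBRIC_SUMMARY and the prompt wording are hypothetical
# placeholders; the real system's prompts are not shown in this post.
from openai import OpenAI

client = OpenAI()

# Hypothetical condensed version of the AP Chinese scoring guidelines.
RUBRIC_SUMMARY = "Score 0-6 on delivery, language use, and task completion."

def evaluate_speech(audio_path: str, task: str) -> str:
    """Transcribe a student's recording, then grade it against the rubric."""
    # Step 1: speech-to-text with Whisper.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: ask GPT-4o mini for a score plus conversational English feedback.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a Chinese speaking coach. Grade this "
                    f"{task} response using the rubric: {RUBRIC_SUMMARY} "
                    "Give a numerical score and friendly feedback in English."
                ),
            },
            {"role": "user", "content": transcript.text},
        ],
    )
    return response.choices[0].message.content
```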

The design draws on ancient teaching traditions. Just as Socratic dialogue unfolds understanding through conversation, and Confucian learning refines skills through guided correction, this chatbot creates a digital version of that cycle: practice, receive feedback, reflect, and try again.

What Students Found Hardest

When nine students reflected on their experiences over several weeks of using the chatbot, a clear pattern emerged: sentence repetition was the most challenging task.

The reasons were consistent. Students struggled with memory retention. The AI spoke quickly. Reproducing sentences word-for-word while maintaining correct pronunciation and rhythm created intense cognitive pressure. As one student put it, “it’s very difficult to just listen to a sentence and then map it in your head and then immediately just say it again.”

Another student described the overwhelming pace: “It would come at me very quickly, and I would have trouble making out the initials and basically only get the vowel sound.” The exercise demanded sharp focus on short-term memory and auditory processing under time constraints—skills that aren’t easy to develop.

A few students also found story narration difficult, but for different reasons. They struggled to structure coherent narratives when their vocabulary was limited. As one student put it, “my vocab is limited and therefore I can’t convey things to make it long enough to meet the time requirement.”

What Students Found Easiest

Interestingly, roughly as many students named sentence repetition the easiest task as named it the hardest. The difference came down to perspective and learning style.

Students who preferred sentence repetition appreciated that it required no original thinking. They didn’t have to create their own sentences or search for vocabulary. They just listened and repeated. One student explained, “I did not have to create my own sentences. I only repeated pre-written sentences, so it was easier to follow along.”

Others noted practical advantages: the sentences were generally short, covered familiar topics, and the AI would repeat them on request. “They were shorter and they were topics I already understood and learned,” one student wrote.

This split in student experiences reveals something important: the same exercise can feel completely different depending on whether a student finds structure liberating or constraining.

How Students Planned to Improve

The weekly reflections showed students developing real self-awareness about their learning. They weren’t just passively receiving feedback—they were actively thinking about how to get better.

For story narration, students recognized they needed to add more detail, smoother transitions, and better logical flow. One reflected, “in order to improve it, I should be more cohesive and truly flow with my story and not just have sentences based on the picture but make them all.”

For sentence repetition, several planned to adopt specific techniques like active listening and shadowing (repeating along with a recording). “I can improve my sentence repetition by speaking more clearly and remembering the sentences,” one student wrote.

For free conversation, students aimed to improve their grammar and expand their vocabulary. As one put it, “I think in order to improve my performance, I would need to make my vocabulary and grammar clearer and have more clarity in my sentences.”

These reflections suggest students were developing metacognitive awareness—the ability to think about their own thinking and learning processes. They were identifying specific weaknesses and devising practical strategies to address them.

What Students Thought About AI Grading

Most students agreed with the AI’s ratings, viewing them as fair and helpful. “I do agree with the AI-generated responses and ratings because it is obvious that my grammar isn’t the best, and it helps me point out errors in sentence repetition and words that I misspoke,” one student wrote.

Another appreciated the insights: “Overall, I think that the AI-generated ratings of my performance were pretty accurate. I think that I definitely need to work on the things that the AI-generated responses pointed out, and I was able to learn some different insights about my performance from the AI-generated responses, so I definitely agree with them.”

But not everyone was convinced. A minority voiced concerns about inconsistency. “No, probably not. I think sometimes it’s rating too high, and then sometimes it’s giving something that would be a 6 out of 6 something, like a 2. So, I’m not sure it’s that accurate.”

Another student felt the AI graded too mechanically: “Mmm, not really, I don’t think it’s grading on effort, only on how well you pronounce it.”

This mixed response highlights both the promise and limitations of automated evaluation. Students generally trust AI feedback, but they also want transparency in how scores are determined—and they can tell when something doesn’t feel quite right.

Lessons Learned

Creating this chatbot revealed several important insights about using AI in education.

First, students don’t follow instructions as predictably as designers expect. The exercises were meant to be completed in order: story narration, then sentence repetition, then free conversation. When students deviated from the expected sequence, the system sometimes got confused—pulling up the wrong rubric or getting stuck in loops. This taught me that user behavior is rarely linear, even in controlled instructional settings.
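
One possible safeguard, sketched below with invented task names and keyword lists, is to infer which exercise the student is attempting from what they say, rather than trusting a fixed sequence:

```python
# Illustrative only: infer which exercise the student is attempting from
# their message instead of hard-coding the sequence. Task names and
# keyword lists here are made up for this sketch.
TASK_KEYWORDS = {
    "story_narration": ["picture", "story", "image"],
    "sentence_repetition": ["repeat", "say it again", "repetition"],
    "free_conversation": ["chat", "talk about", "conversation"],
}

def detect_task(message: str, current_task: str) -> str:
    """Guess the student's current task; fall back to the last known one."""
    text = message.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return task
    return current_task  # No clear signal: stay where we think we are.
```

Even a crude guess like this beats assuming students will march through the exercises in order.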

Second, AI evaluation still needs human oversight. The chatbot scored consistently but lacked nuanced judgment. When a student spoke with correct pronunciation but wrong tones, a human listener could still understand and might give partial credit. The AI typically gave zero for tonal errors, regardless of whether the meaning came through. Human raters also motivated students to take the work seriously—knowing a real person would eventually review their performance encouraged fuller engagement.
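
To make that tonal-scoring gap concrete, here is a toy partial-credit check (again, an illustration, not something the chatbot actually did), using the pypinyin library to compare syllables with and without their tones:

```python
# Illustrative only: this partial-credit check was not part of the system.
# pypinyin converts Chinese text to pinyin; Style.TONE3 appends the tone
# number to each syllable (e.g. "ni3 hao3").
from pypinyin import lazy_pinyin, Style

def partial_credit(expected: str, actual: str) -> float:
    """1.0 for an exact match, 0.5 if only the tones differ, else 0.0."""
    exp = lazy_pinyin(expected, style=Style.TONE3)
    act = lazy_pinyin(actual, style=Style.TONE3)
    if exp == act:
        return 1.0
    # Strip the trailing tone digits and compare the bare syllables.
    bare = lambda syllables: [s.rstrip("12345") for s in syllables]
    if bare(exp) == bare(act):
        return 0.5  # Right sounds, wrong tones: meaning may still come through.
    return 0.0
```

Whether and how to encode that kind of leniency is exactly the sort of judgment call that, for now, still belongs to a human rater.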

Third, humans are still essential for catching things machines miss. In one case, a student who initially scored poorly on sentence repetition recorded the AI’s prompt on their phone and played it back. The AI couldn’t distinguish between authentic speech and a recording, so it awarded a perfect score. Only a human rater could catch this kind of deception.

Looking Forward

Overall, the project demonstrated that AI can meaningfully enhance language learning when designed thoughtfully. The chatbot didn’t replace the teacher—it extended what the teacher could do.

The student reflections paint a picture of learners who are introspective, self-directed, and increasingly skilled at using feedback to improve. They’re identifying concrete goals around grammar, pronunciation, and narrative structure. They’re developing strategies to address their challenges. And they’re doing this through a tool that combines ancient pedagogical traditions—dialogue, practice, reflection—with modern computational capabilities.

The success of this project suggests a broader principle: AI works best in education when it functions as an interpreter rather than a mere analyzer. The chatbot’s real contribution wasn’t in generating precise scores—it was in transforming evaluation into dialogue, making assessment a reflective rather than a final act. That approach resonates with humanistic learning traditions, where understanding emerges through conversation and revision.

Technology won’t replace teachers. But thoughtfully designed AI tools can give students more opportunities to practice, more feedback to guide their improvement, and more agency over their own learning. That’s not the future of education—it’s already happening, one conversation at a time.
