Friday, January 27, 2017

Active English Vocabulary

Ever since I decided to learn as many languages as possible, I've been thinking about how to make the learning process more efficient. Once you have learned one language, it's much easier to learn another one from the same language family. Language schools, however, do not offer separate classes for "beginners - advanced learners" and "beginners - beginner learners".

Recently, interesting challenges have emerged on the Internet, such as learning a language in 30 days. As a person with a computer science degree, I started to think about how I could develop software for such a challenge that would help me learn languages more efficiently.


What is active and passive vocabulary


Vocabulary is the set of words a person knows. According to Wikipedia, acquiring an extensive vocabulary is one of the biggest challenges in learning a foreign language. For experienced language learners, however, this doesn't have to be such a challenge.

What always bothered me about language courses was that the learning process started from point zero, as if we were learning our first language: from simple words such as the names of animals, colors, and family members, to more and more difficult vocabulary. After years of attending such classes, you still weren't able to understand what you wanted to understand: life around you, newspapers, TV shows, movies. The only suitable media content for you was kids' movies and slow BBC podcasts.

Getting over this barrier was painful and involved talking to real native speakers in real time. This led me to the conclusion that I could learn a language much faster by choosing a different approach. That was ages ago.

Later on I started to think about vocabulary learning. The usual way of learning a new language starts with grammar rules, with new vocabulary picked up along the way. I would like to test a somewhat reversed approach, which I find more natural: learning vocabulary first and picking up grammar and syntax along the way. Children learn the syntax of their native language empirically and build their vocabulary according to their interests (which are focused first on naming things at home, animals, and toys). As we grow older, our interests change and we want to talk about emotions, politics, biology, or tech. Why don't we also try to learn grammar empirically and build new vocabulary from the most frequently used words? The question is: which words do people use the most?

The active vocabulary of an average adult native English speaker consists of about 20,000 words. However, it's been said that a vocabulary of just 3000 words provides coverage for around 95% of common texts such as news items, blogs, etc.
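A coverage figure like this can be checked against any list of words. The following is a minimal Python sketch, not the code used in this project; the function name coverage and the toy token list are assumptions made for illustration only.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Return the share of all word occurrences in `tokens`
    that belong to the `top_n` most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the occurrences of the top_n most frequent words.
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Toy input (hypothetical); a real check would use a large corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
share = coverage(tokens, 2)  # "the" (3) + "sat" (2) out of 11 tokens
```

With a real corpus, calling coverage(tokens, 3000) would show how close a given text collection comes to the 95% figure.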

What if we started learning with a top-down approach, rather than a bottom-up approach? What if we learned the most used grammar rules and just 3000 words first, and then built up our vocabulary and grammar knowledge from there?

Efficient way of learning languages based on statistical methods


I came up with this idea long before I started working on it, and then I forgot about it. Instead, I started to study data processing, with a big passion for going further than just SQL and relational databases. This was a few years after massive online courses started to spread around the Internet. I picked a course on Data Science and plunged deeply into learning about awesome examples of how data processing can help us in daily life, in completely ordinary activities.

That was the time I came back to the idea of learning languages more efficiently using my programming skills ;)

In 2015 I wrote a program for scraping data from Twitter's real-time stream. Processing just about 15,000 English tweets gave me a list of more than 15,000 unique words I could play with. So many "15"s there, right?

My processing methods were and, unfortunately, still are pretty simple and straightforward. They were more focused on fast and efficient data processing than on the language itself. Based on the frequency of individual words, I built a list of the words most frequently used by English speakers on Twitter.
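The frequency counting described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the original program: the sample tweets are made up, and the names tokenize and word_frequencies are hypothetical.

```python
import re
from collections import Counter

# Made-up sample; the real input came from the Twitter stream.
SAMPLE_TWEETS = [
    "I love the weekend",
    "The weather is great and I love it",
    "Are you going to the game?",
]

# Very simple tokenizer: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and return its word-like tokens."""
    return WORD_RE.findall(text.lower())


def word_frequencies(tweets):
    """Count how often each word occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


freqs = word_frequencies(SAMPLE_TWEETS)
top_words = freqs.most_common(3)  # the three most frequent words
```

Real tweets also contain links, mentions and hashtags, which this sketch does not handle.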

My results weren't that far from the official Oxford Dictionaries statistics. At least the top 10 most used words were exactly the same, just in a different order. Among them were words like the, I, are and so on. Not so meaningful, though; the word love was in 23rd place.

Learning only 3000 words for daily communication


Recently I returned to this project and saw my naive Python code. I've never been excellent at Python, nor at linguistics. I've just been a motivated enthusiast with an idea to build something of interest. I decided to build another, more useful version of my dictionary. (The first one consisted of an HTML page with words and links to a Slovak-English dictionary, with each word passed in the q= option of the URL.)
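A page like that first version could be generated roughly as follows. This is a sketch under assumptions, not the original code: the dictionary address is a placeholder (the real Slovak-English dictionary URL is not given here), and the names word_link and build_page are hypothetical.

```python
from html import escape
from urllib.parse import urlencode

# Placeholder address; the real dictionary URL is an unknown here.
DICTIONARY_URL = "https://dictionary.example/search"


def word_link(word):
    """Return one list item linking the word to the dictionary via q=."""
    href = DICTIONARY_URL + "?" + urlencode({"q": word})
    return '<li><a href="{}">{}</a></li>'.format(escape(href), escape(word))


def build_page(words):
    """Return a minimal HTML page with one dictionary link per word."""
    items = "\n".join(word_link(w) for w in words)
    return "<html><body><ul>\n" + items + "\n</ul></body></html>"


page = build_page(["love", "don't"])
```

urlencode takes care of characters that are not safe in a URL, and escape keeps the word itself from breaking the HTML.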


Active English Vocabulary

The application Active English Vocabulary is now in development again. The second version will include the following features:
  • Classifying words into groups (part of speech word groups)
  • Building a desktop application supporting translations of English words into other languages
  • Cleaning the data
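
As an idea of what the data cleaning step might look like, here is a small Python sketch. It is only one possible approach under my own assumptions, not the planned implementation; the function name clean_tweet and the rules it applies (dropping links, mentions and the "rt" marker) are hypothetical choices.

```python
import re

URL_RE = re.compile(r"https?://\S+")         # links
MENTION_RE = re.compile(r"@\w+")             # @user mentions
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # words, optionally with one apostrophe


def clean_tweet(text):
    """Return the lowercase words of a tweet without links,
    mentions, hashtag signs, or the retweet marker."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the sign
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t != "rt"]


cleaned = clean_tweet("RT @user: I don't LOVE Mondays!!! #mood https://example.com/x")
# cleaned == ["i", "don't", "love", "mondays", "mood"]
```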