
Analytics: Optimal starting ease for core vocab in Anki

I’ve long wondered what the optimum starting ease setting is for learning vocabulary in Anki. Starting ease is the primary setting that affects accuracy, workload, and ultimately how much I can learn in a given time. There’s SuperMemo’s theory page, but it’s not specific to Japanese vocabulary or even language learning. I want to know the best settings for the deck I’m studying, so I decided to analyze my Anki learning data to find out.

The first scatter chart shows the relationship between a card’s ease and my accuracy answering the card. The blue data points are from when I first started studying core vocabulary and was using a lot of filtered decks. I’ve since realized that filtered decks aren’t as efficient as simply using Anki’s algorithm and sensible settings. I’m also guessing that there’s a learning effect that makes it easier to learn Japanese vocabulary once I’m a few thousand words in. Either way, it seems that some combination of those factors is allowing me to be more accurate lately (red) than when I started (blue).



The second chart shows what happens when I simulate my workload for various values along the combined best fit curve. The blue line (left axis) is simply the combined line from the chart above. The red line (right axis) is the simulated workload and the yellow line (right axis) is a smoothed version of the red line. As you can see on the left side of the chart, if I try for high accuracy, my workload is twice what it could be if I accepted a lower accuracy. At an ease of around 210, my accuracy should be around 61%, but my workload is about half what it is with ease 130, allowing me to study twice as many cards in the same amount of time.


The problem with the chart above is that the yellow line doesn’t accurately show how much of the vocabulary I actually “know” for any ease/accuracy setting. In other words, if I am getting 60% accuracy instead of 80% accuracy, I “know” 20 percentage points less vocabulary, but it’s counted the same in the chart above. So the following chart is the same, only the yellow workload line is adjusted to account for accuracy, so that every point on the line represents the same number of known cards.
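
A minimal sketch of one way such an adjustment could work, assuming it simply divides the simulated workload by accuracy (this is an assumption, not necessarily how the chart was produced). The function name and the sample numbers are hypothetical and for illustration only.

```python
def workload_per_known_card(workload, accuracy):
    """Scale a workload figure so it is expressed per card actually known.

    Assumption: a card answered correctly `accuracy` of the time counts
    as `accuracy` of a known card.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in the range (0, 1]")
    return workload / accuracy


# Hypothetical (ease, accuracy, workload) points, not real data.
points = [
    (130, 0.80, 100.0),
    (210, 0.61, 50.0),
]
for ease, accuracy, workload in points:
    adjusted = workload_per_known_card(workload, accuracy)
    print(f"ease {ease}: adjusted workload {adjusted:.1f}")
```

Dividing by accuracy penalizes the low-accuracy end of the curve, which is why the most efficient point can move away from the setting with the lowest raw workload.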



Judging by this last chart, my most efficient starting ease for my core vocabulary deck is around 175, which should put my accuracy around 67%. Lately, I’ve had my ease set to rather easy settings because it makes the learning process a lot more fun when I feel like I’m winning. However, I realized that the slope of that yellow line is so steep that a small sacrifice in accuracy should result in a large decrease in workload, allowing me to add more cards. So, I’ve decided to slowly raise my ease settings until I find a good compromise between accuracy, efficiency and enjoyment.

Analytics: The difficulty of finding leeches

When I first started thinking about leeches, I assumed cards that were easy or hard in the learning phase would stay easy or hard in the review phase. I had noticed problematic cards that I had a hard time getting out of learning and that were still giving me trouble months later. If I could just find the cards that were giving me trouble early on, I could suspend them and learn just the easy cards. Unfortunately, the next set of charts shows I was not a very good judge of what was actually going on; presumably I only remembered the cards that continued giving me trouble.

The following charts show the relationship between how many reps it took to get each card to an interval of 7 days and how many reps to get the same card from interval 7 to interval 90.  I was fully expecting to see a nice relationship where difficult cards would stay difficult.  In a scatter chart, you would see a tight grouping of dots sloping from the lower left to the upper right.  Instead, what I got was the following set of charts where just as many easy cards became difficult as difficult cards became easy.  This exercise is making me think that it will be difficult to find leeches with any accuracy.

[Charts: reps to reach interval 7 vs. reps from interval 7 to interval 90, for core sentences, JFBP, core vocab, RTK, and Tae Kim]
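
For anyone who wants to try the same comparison, here is a rough sketch of the counting step. It assumes you already have a chronological list of (card id, interval in days after the review) pairs, for example pulled from Anki's review log; the function name and input format are hypothetical, and extracting the data is left out.

```python
from collections import defaultdict


def reps_by_phase(reviews, first_ivl=7, second_ivl=90):
    """Count reps per card up to first_ivl, then from first_ivl to second_ivl.

    reviews: iterable of (card_id, interval_days_after_review), oldest first.
    Returns {card_id: (reps_to_first, reps_first_to_second)} for every card
    that eventually reached second_ivl. Assumes second_ivl >= first_ivl.
    """
    early = defaultdict(int)   # reps before the card first reached first_ivl
    late = defaultdict(int)    # reps between first_ivl and second_ivl
    reached_first = set()
    done = set()
    for card_id, ivl in reviews:
        if card_id in done:
            continue  # ignore anything after the card reached second_ivl
        if card_id in reached_first:
            late[card_id] += 1
        else:
            early[card_id] += 1
            if ivl >= first_ivl:
                reached_first.add(card_id)
        if ivl >= second_ivl:
            done.add(card_id)
    return {cid: (early[cid], late[cid]) for cid in done}
```

Each resulting pair is one dot on a scatter chart: the first number on the x axis, the second on the y axis. Cards that never reached the second interval are dropped, which is one of several choices that could change the picture.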

Anki Analytics: Card difficulty

In a follow-up to my post on leeches, I graphed the number of reps it took to get each card to an interval of 90 days. Again, the chart for kanji is the odd man out, with its plot being more linear than the others, suggesting that different types of memories behave differently. But even with kanji, we see that the most difficult cards take many times the number of repetitions that the median card takes.

By my calculations, the easiest 80% of my core vocabulary cards take roughly the same total number of reps as the hardest 20%. In other words, I could learn 4 easy cards in the same time it takes to learn 1 of the harder cards. It sure would be nice to identify those difficult cards early somehow.
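
As a rough sketch of that kind of calculation (the function name is hypothetical and the rep counts are made up for illustration, not taken from my decks), you can sort cards by reps and see what share of all reps the hardest 20% consume:

```python
def hardest_share(reps_per_card, fraction=0.2):
    """Return the share of all reps spent on the hardest `fraction` of cards."""
    if not reps_per_card:
        raise ValueError("need at least one card")
    ordered = sorted(reps_per_card, reverse=True)  # hardest cards first
    n_hard = max(1, round(len(ordered) * fraction))
    return sum(ordered[:n_hard]) / sum(ordered)


# Made-up example: 8 easy cards at 2 reps each, 2 hard cards at 8 reps each.
reps = [2, 2, 2, 2, 2, 2, 2, 2, 8, 8]
print(hardest_share(reps))
```

In this made-up deck the hard cards cost 4 times as many reps as the easy ones, so the hardest 20% account for half of all reps.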

Reps to reach a 90-day interval, by deck:

Deck             low   high   mean    median
Tae Kim          5     20     8.39    8
RTK              3     98     37.78   34.5
Core sentences   2     59     9.31    7
Core vocab       2     194    30.17   19
JFBP             5     164    19.60   7

[Charts: reps to reach interval 90 for RTK, Tae Kim, JFBP, core vocab, and core sentences]

Analytics: Leeches

This is a new series where I combine a few things that I am currently learning into a topic I have no business pretending to know anything about. In addition to teaching myself Japanese, I am also attempting to teach myself programming and data analysis. Although it’s going very slowly, I am hoping to figure out a few things that will make my Anki studying a little more efficient.

My first target is those damn leeches. Leeches are what Anki calls those cards that you keep forgetting over and over. According to the SuperMemo site, around 50% of your time can be spent learning 2.5% of the material. The 2.5% taking half of your time are the leeches. Depending on your goals, wouldn’t it be nice to be able to identify that 2.5% of material and spend that 50% of your time learning twice as much? Personally, I would rather learn 97.5% of core twice as fast before spending the time to learn that last 2.5%.

Unfortunately, we don’t know which vocab words make up that hard 2.5%, and even worse, Anki doesn’t give us much in the way of tools to find them. All that Anki gives us is a leech threshold: once you fail a card a set number of times (the default is 8), Anki will suspend that card. The thinking is that you are likely to learn a new card in less additional time than it would take to keep trying (and failing) to learn the one you’ve failed so many times already. But I’ve always wondered which setting has you learning the most material in the least amount of time.

This is the question I set out to answer. I wrote a small program that counts the number of reps it takes for a card to either be learned or become a leech. I considered a card to be “learned” once its interval surpassed 4 months. I did this for every card I’ve studied, averaging the reps to learn a card and the reps to become a leech. The result is the average number of reps it would take to learn a card assuming a given leech threshold in Anki.
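
A rough sketch of that kind of replay follows. It is not my actual program, and it makes loud assumptions: each card's history is a hypothetical list of (passed, interval in days after the review) pairs, every failure counts toward the threshold, "learned" means an interval over 120 days, and the average is total reps divided by learned cards.

```python
LEARNED_IVL_DAYS = 120  # "about 4 months"; the exact cutoff is an assumption


def simulate_threshold(histories, threshold, learned_ivl=LEARNED_IVL_DAYS):
    """Replay review histories as if leeches were suspended at `threshold` fails.

    histories: list of per-card lists of (passed, interval_days) pairs.
    Returns (reps_per_learned_card, learned_count, leech_count).
    """
    total_reps = learned = leeches = 0
    for history in histories:
        fails = 0
        for passed, ivl in history:
            total_reps += 1
            if not passed:
                fails += 1
                if fails >= threshold:
                    leeches += 1  # card would have been suspended here
                    break
            elif ivl > learned_ivl:
                learned += 1  # card counts as learned; stop counting reps
                break
    reps_per_learned = total_reps / learned if learned else float("inf")
    return reps_per_learned, learned, leeches


# Two made-up cards: one easy, one that fails three times before sticking.
histories = [
    [(True, 1), (True, 10), (True, 130)],
    [(False, 0), (True, 1), (False, 0), (True, 5),
     (False, 0), (True, 3), (True, 50), (True, 150)],
]
for threshold in (1, 2, 4):
    print(threshold, simulate_threshold(histories, threshold))
```

The learned and leech counts returned alongside the average also show how much of a deck survives at each threshold. One caveat of replaying real data this way: it only works for thresholds at or below whatever threshold was active when the reviews were recorded, since suspended cards have no further history.
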
[Graph: average reps to learn a card at each leech threshold, by deck]

The above graph shows the results for the 4 decks I’ve been studying. The first thing to notice is that my “core sentences” and “Japanese for Busy People” decks are much easier than my “core vocabulary” and “kanji” decks. The other thing to notice is that for all decks except kanji, setting the leech threshold to the lowest setting results in learning the most cards in the fewest reps. Kanji appears to be most efficient with the leech threshold set to 8, but any number higher than 4 appears to be just fine. The final thing to notice is that all of the vocabulary and sentence decks appear to have a similar curve, and a very smooth one. I take this to suggest that for all vocabulary decks I study, setting the leech threshold to the lowest setting will result in learning the most vocab words in the fewest reps. However, this isn’t the only consideration.
[Graph: ratio of learned cards to suspended leeches at each leech threshold, by deck]

The second graph shows the ratio of learned cards to suspended leeches for each deck and each leech threshold. As you can see with the “hard” vocab and kanji decks, at lower thresholds Anki is suspending more cards than I would be learning. In fact, setting the leech threshold to 1 for core vocab and kanji would result in learning only 18% of the vocab deck and 6% of the kanji deck. This is hardly desirable, but finding a good balance between efficiency and completeness might make sense for some people. For instance, setting the threshold to 9 for kanji and 6 for core vocab gets me in the 50-60% coverage range. That still seems less than optimal to me, but it’s something I have to think about, as there is unfortunately no clear-cut answer.

That’s it for now. Please put your thoughts, criticism, praise and especially suggestions in the comments, as I’m happy to make this better with your help.