I’ve written parts of several blog posts on this topic. This post started its life as a comment on this post on Dear Author and then became way too long, so I moved it here.
Hugh Howey and a mystery coder released a report on author earnings a little more than a week ago. It has been hailed as everything from total genius to utter crap. I’ve thought about writing about it since it was released. Lots of people have written about this, and I will sum up what they say: The study has convinced almost everyone who already believed what Howey said in the report, and convinced almost nobody who did not already believe it.
This is an ambitious project that is likely taking a lot of work on the part of Howey and his mystery coder. They’ve aggregated a bunch of information that people have discussed only anecdotally up until now. That’s pretty cool. That being said, it’s pretty obvious to me that they desperately need someone with some kind of background in science and statistics and data collection, because right now they’re spending a ton of time sifting through data without any sense of how to properly quantify things.
A note about my background: I spent three years doing computational modeling of physical events. I’ve forgotten a bunch of stuff and don’t feel competent (at this point) to actually quantify things myself without spending months relearning everything I’ve forgotten, but I remember enough to tell when things are running off the rails.
A note about my bias: I have been self-publishing exclusively since late 2011, although I’m technically a hybrid author. Given the current market, I find it unlikely that a publisher will offer me something for digital rights for future books that would be even remotely close to the value of the rights they would want. I’m open to any sort of publication, but I am deeply skeptical that anyone other than me will ever maximize the value of my rights. At this point, my bias in publishing would (you’d think) skew me to take Howey’s point of view on things.
With those two things out of the way, I’d like to make some points that I don’t believe have been made at this juncture, although I have not trawled the entire internet to see what everyone is saying.
1) Some people are saying that the study bases author earnings for a year off of sales for a day. I believe it’s incorrect to say that the authors are using a day’s worth of sales. As best I can tell from the methodology, they used a single snapshot of sales rank–so that would be ranks for one hour on Amazon, not one day.
2) The study is not actually providing raw data, and it is a mistake for it to use those words to describe the Excel spreadsheet that is included at the end. The raw data used by the study is book rankings, as that is the only data that is used as input. (Technically, there must also be raw data used to determine unit sales–that’s what formed the KDP Calculator in the first place–but that is not disclosed at all.) The Excel spreadsheet provided at the end of the report provides us with the information produced by the study’s algorithm–in other words, the Excel spreadsheet tells us what the study’s model suggests for author earnings. This is not raw data. This data has been chopped up and boiled.
[Edited @2:37PM CST: Okay, my point number 2 was based on the Excel file I downloaded. I downloaded pretty much the instant I saw the report, and that was at a point when the server was wonky. Apparently, I didn’t get a full download, so I didn’t have the full data, but assumed I was seeing everything there was. That’s my bad. I was wrong on this point–they do include the full data in the Excel file. I’ll update this and some of the points below on that basis.]
3) I don’t think Howey or his mystery programmer have ever done any kind of scientific modeling, or they would probably refer not to their results as “author earnings”–which they do not in fact poll–but as “the Howey model of author earnings.” They are not giving us actual data about author earnings. They are giving us author earnings as calculated by their model, which takes as input data from Amazon which is loosely correlated with author earnings.
Note that saying that something is a model is not an insult, nor is it intended to be one. Models are extremely useful–they allow you to collect and aggregate data you might not be able to get from reality, or to try to ignore extraneous circumstances. For instance, when you use high school physics to calculate the speed of a falling bowling ball from a tower, you’re modeling reality: your model neglects electromagnetic effects, quantum effects, relativistic effects. It probably assumes a smooth bowling ball. It also probably assumes that the effects of gravity are equal throughout the fall even though the distance from the earth changes. Data would be an actual measurement of the speed of an actual falling bowling ball. A model is when you plug the measured weight of a bowling ball into an equation and come up with a number.
If your model works well under the circumstances you use, the measurement and the model will provide very, very similar answers.
(Aside: I spent three years working with a guy who believed that almost everything interesting in the world could be modeled with an Ising model. I actually wrote a final class project in law school using a modified Ising model to test assumptions about free speech. I am not a person who is opposed to models.)
This brings me to…
4) One of the reasons you need to differentiate between reality and a model is that every model has limits. High school physics equations are awesome at calculating the speed of a falling bowling ball from a tower, or a frictionless ice skater going in a straight line. They are total crap at calculating the behavior of quarks. If you tell me what your model is, and do not tell me what the limits of the model are, you’re not telling me when I can take your results seriously.
It’s a model. You can’t take it seriously all the time. If you think you can take your model seriously all the time, you are engaging in the practice of religion, not science.
So, for instance, if your model is that author earnings can be calculated by looking at sales on Amazon US, and you neglect print sales entirely, you should mention that your model does not account for print earnings. And boom. There you are. You have a model. You explain when it works best and when it fails, and you try to use it only in situations where it works well, and only to the extent that the model gives good results. That’s decent science.
If someone else has to do that for you, you’re not engaging in science; you’re engaging in opinion fluff. At that point, you’re not going to convince anyone who is on the fence; you’re only providing fodder for the people who agree with you.
5) Note that you are not slamming yourself if you disclose the limitations of your model. In fact, that’s how you come up with legitimate models that you can use to make reasonable approximations about reality in the first place. The biggest failure, in my mind, of the author earnings report was its failure to try to figure out how and when it was wrong.
6) The study hasn’t disclosed some important elements of the model, and that makes it impossible for other people to make educated guesses as to how good it is. Specifically, how does the study come up with a single number for unit sales when the KDP Calculator (which it cites for converting sales rank into sales data) gives a range, and in some cases, a very broad range? For instance, the KDP calculator says that books ranked 201-1000 all sell between 100 and 300 copies a day. How many sales does this study attribute to a book ranked at 300?
We aren’t given a range of unit sales or author earnings in the report. We’re given a single number, and I can’t tell how that number is derived. Do the authors use the midpoint of the range given? Are they linearizing over the entire range, so that a book at sales rank 201 will be credited with 300 sales, and a book at sales rank 1000 will be credited with 100 sales? Is there some kind of a non-linear best fit formula? I don’t know. I’m guessing that they are linearizing sales between data points, but I don’t really know that that’s true because I’m not given enough information to reproduce their results.
[Edited @2:42 PM: Okay, now that I have the full spreadsheet, I can see what they’re doing. They are linearizing around the data ranges of the KDP calculator.]
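A minimal sketch of what that linearization could look like, in Python. The rank bands are the ones quoted in this post from the KDP calculator; the function name, the table layout, and the convention that the best rank in a band gets the top of the sales range are assumptions for illustration, not the study’s actual code.

```python
# Rank bands quoted in this post from the KDP calculator.
# Each entry: (best rank, worst rank, sales/day at worst rank, sales/day at best rank).
# The tuple layout is an assumption made for this sketch.
KDP_BANDS = [
    (201, 1000, 100, 300),
    (3500, 8000, 10, 30),
    (8001, 40000, 1, 10),
]


def estimate_daily_sales(rank):
    """Linearly interpolate daily unit sales within the band containing `rank`.

    Hypothetical helper: the best rank in a band is credited with the top of
    the sales range and the worst rank with the bottom.
    """
    for best_rank, worst_rank, low_sales, high_sales in KDP_BANDS:
        if best_rank <= rank <= worst_rank:
            # 0.0 at the best rank in the band, 1.0 at the worst rank.
            fraction = (rank - best_rank) / (worst_rank - best_rank)
            return high_sales - fraction * (high_sales - low_sales)
    raise ValueError(f"rank {rank} is outside the bands quoted in this post")


if __name__ == "__main__":
    for rank in (201, 300, 1000):
        print(rank, round(estimate_daily_sales(rank), 2))
```

Under these assumptions, a book ranked 300 would be credited with roughly 275 sales a day. A different convention (the midpoint of the range, say) would give a different answer, which is exactly why the method needs to be spelled out.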
(Sidenote: I’ve found the KDP calculator to be a little out of date; Theresa Ragan’s range giver is usually more accurate. But assuming that I’m going to take this as a given, we know the instances in which they fail (the rank calculators assume steady sales at that rank over multiple days; when you’re rising quickly in rank or falling quickly in rank, they tend to underestimate sales and overestimate sales respectively)–and this is usually because the range given is broad enough to swamp effects on the sales rank that are due to things other than that day’s sales.)
7) If you’re going to use your model to make suggestions about reality, you need to spend some time demonstrating that your model gives results that are comparable to reality. A model that is not tested against reality is just a thought experiment.
I’m sure Howey could get hundreds of people to donate actual data on author earnings to compare against his model on the self-publishing side, and probably can find dozens willing to sacrifice old royalty statements from traditional publishers to check the other end of things. Historical Amazon ranks are available to authors through Author Central, too, so there could absolutely be attempts to determine what sort of error his model introduces.
But the report never makes one comparison between modeled author earnings and actual author earnings. That, in my mind, makes it almost impossible for me to take it seriously. If Howey had taken his model and compared 5% of the data points in it to actual earnings, he could have gotten some kind of an idea how good the model was. That didn’t happen.
8) [Edited 3:01 PM: Note that I’m adding a point 8.5 below this now that I have the full spreadsheet.] I tried to do this in a limited form. I took a calculation for a single book of mine–Unraveled. As a side note, this book is MOST LIKELY to fit his data calculations, because it’s been out for a long time; it’s self-published and has been out long enough that there are essentially zero print sales; I haven’t been trying to promote it much, mostly because I don’t control the first two books in the series, and so there are very few big jumps; and I had a new release about two months ago, so the book in January is at about the “average” rank in its release-to-release lifecycle.
This book, at this time, is probably one of the more likely books to fit the Howey model of earnings.
On January 30, 2014, the rank was 9,990. According to the KDP calculator, I’m selling 1-10 books a day–specifically, it gives that response for sales ranks from 8,001-40,000. If we linearize around those points, I should be selling about 9.44 copies a day on Amazon US, meaning that I’m making $26.36 every day according to Howey’s calculations, or $9623 every year.
As it is, I know the actual amount I made on Unraveled last year on Amazon US from February 1, 2013 through January 31, 2014, and it’s $13,831. In other words, for a relatively stable book with very few jumps in income–a book that’s MOST LIKELY to be correct under this analysis–the model is off by 44%.
Of course, this is drastically dependent on the day on which the observation is made. If I’d used January 25 instead of January 30, my sales rank would have been 5,895. The KDP calculator tells me that the range 3500 through 8000 gives us 10-30 sales per day, and so assuming linear behavior, we get 19.3 sales per day, which is $53.90 per day, or $19,675 per year–or off by about 30% in the other direction.
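To make the two comparisons above concrete, here is a small Python sketch using only the dollar figures quoted in this post. It assumes the percentage is measured relative to the model’s estimate, which is the convention that appears to reproduce the 44% and 30% figures; the variable and function names are made up for illustration.

```python
# Actual Amazon US earnings on Unraveled for the year, as stated in this post.
actual_earnings = 13831

# Annualized model estimates from the two single-day snapshots in this post.
model_estimates = {
    "January 30 (rank 9,990)": 9623,
    "January 25 (rank 5,895)": 19675,
}


def relative_gap(model, actual):
    """Gap between actual and modeled earnings, as a fraction of the model.

    Positive means the model underestimated; negative means it overestimated.
    Using the model's figure as the denominator is an assumption.
    """
    return (actual - model) / model


for label, estimate in model_estimates.items():
    gap = relative_gap(estimate, actual_earnings)
    print(f"{label}: model ${estimate:,}, actual ${actual_earnings:,}, gap {gap:+.0%}")
```

Two snapshots taken five days apart land on opposite sides of the real number, which is the point: a single snapshot says more about the day it was taken than about the year.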
(Yet another side note: I don’t think that error on the order of 30 to 40% for a book most likely to fit the model means that the model is inherently crap. Like I said, I spent years using Ising models as models of glassy behavior, so I know what it’s like to work with a model where if you’re within an order of magnitude, it’s all good. But this does tell me that the model is unlikely to be capable of making fine-grained comparisons in author earnings, where “fine-grained” means the model is unlikely to be able to distinguish between $10,000 and $30,000.)
And that’s where I think we have some problems. No attempt was made to figure out how closely the Howey model hews to reality, but the authors discuss the results as if they were reality instead of a model. They give author earnings results to two decimal places without knowing if their model is giving them any significant digits. As best I can tell, the Howey model of author earnings is at best a quick, back-of-the-envelope calculation, useful for an order-of-magnitude estimate.
Measurements of actual events must be the ultimate arbiter of reality, and this is a place where we have the capacity to make measurements. The failure to compare the model’s results to actual measurements before making pronouncements is a huge problem.
[8.5) Added around 3:24 PM.
Now that I’ve played around with the full spreadsheet, I’m even more confused. The full spreadsheet gives enough information for a relatively dedicated searcher to find her own books in the mess, which I thought I did.
I believe that the following books are mine: #356 on the spreadsheet is The Countess Conspiracy, #564 is The Duchess War, #591 on the spreadsheet is The Heiress Effect, and #1059 is A Kiss for Midwinter. These lines fit the rankings for the day polled, have the right number of reviews on the relevant days, and the right review averages. No other books of mine, either indie or traditionally published, appear to be included in the spreadsheet. This gives me 4 books, with a total of $1153.98 in author sales per day. Hypothetically speaking.
There is no author on the spreadsheet with that daily sales total.
I checked to see if perhaps I had misattributed one of those books to me, but there is no author on the spreadsheet whose earnings are a combination of any of those three numbers. I thought maybe I’d missed one or two of my books on the spreadsheet, but there is no author whose earnings are higher than $1153.98 per day who could rationally be me, unless the author data includes more books than are included in the title data. So who knows how this goes?
Assuming that I’m right about identifying my titles above–and if I am, their author data is royally screwed up–this model would say that I earned $117,238 on The Duchess War, $11,384 on A Kiss for Midwinter, $115,198 on The Heiress Effect, and $177,383 on The Countess Conspiracy.
I can’t evaluate that estimate for The Heiress Effect and The Countess Conspiracy, since they haven’t been out for a year. It’s too high by 15% for The Duchess War (on Amazon US; if we include global earnings on the title, though, it’s too low by 42%). And at this point, I’m too tired to add up earnings on A Kiss for Midwinter. In any event, I’m coming up with the same gut sense: there’s a pretty massive margin of error on these numbers, and my innate guess at it is “big.” I don’t have the data to get better than that.]
9) This doesn’t mean I think the methodology is useless. But I think that until the study authors start really talking about sources of error, and finding ways to quantify those errors, the study isn’t going to be much use to anyone except people who already believe it. If you can’t tell me how wrong you are, you haven’t taken the effort to figure out how right you are.
Determining how wrong you are is what makes your work believable.
10) I do think there are useful things in that report–aggregate measurements are probably more likely to have individual sources of error cancelling out, as long as we recognize that they’re likely crude. So I would feel safe pointing to that report and saying unit sales of indie books on the bestseller lists on Amazon are on the same order of magnitude as trade published books on the bestseller lists on Amazon. I would feel more comfortable about that if those numbers were drawn on a daily basis for a lengthy period of time and aggregated, as compared to looking at a single slice of time, and if individual model results had numbers that were tested and compared with reality on a regular basis. I’m willing to send Howey and his team my royalty statements for months in order to help them come up with that estimate, and my guess is he could find many authors who would be equally willing.
11) In short, I think someone using similar methodologies could, if careful, make reasonable, cautious statements that would have some value. But they’d have to be damned clear about how they’re calculating unit sales, would have to aggregate sufficient data so that they had an error estimate in their calculation, would have to poll for lengthy periods of time, and would have to test their results against actual data to see how the model (and this is a model of earnings, not data about earnings) corresponds with reality.
Thanks for making the reality check. You’ve put your finger on the thing that bothered me most, which is the apparent lack of critical thinking (on the part of the study authors and proponents) about the limitations of the data and the model. As always, thanks for being so open about your own data and experience.
Thanks for the thoughtful analysis, Courtney. I haven’t given this topic much time or attention as I’ve been busy launching a new series, but it seemed to me the hysteria surrounding this report was slightly premature. Everyone is looking for ways to validate their decisions, which seems counterproductive to me. There’s a lot of time and effort being put into analyzing the market by authors who really have control over one thing and one thing only: their books. That’s where all my focus is these days, and life inside my little bubble is pretty good! 🙂
Thank you. Thank you. Thank you.
I can’t decide whether to laugh at some of the hysteria, on both sides, or roll my eyes. A little of both, I think.
Thanks for the post.
I’m almost entirely traditionally published at the moment (I have self-published short pieces that were only available in out of print anthologies), but I hope to change that in future. So I watch everything with interest.
But…my income according to this study would have been, I think, zero last year. And it was definitively not zero. A UK crime author said he would have earned zero, and actually, earned 72K pounds. I didn’t earn as much as that, but at least to my household, I earned a significant amount.
I have no objections to any of the data, etc., but – given that there are a number of us who wouldn’t have earned much, or anything, in this scan, but did in fact make a living, I don’t think it’s a useful indicator of “Which path to choose”. There’s just not enough information there. So people who are self-publishing will continue to self-publish, but people who are evaluating … won’t have much reason to change their evaluations, given that they are invisible and off the scale but actually making money.
I would actually be perfectly happy to send off royalty statements for traditionally published books if that gained a better overall picture.
Hey Courtney, sorry to be, erm, thick about this all, but when you say the Excel file doesn’t provide the raw data, do you mean b/c of the translation of book rank to book sales and the calculations about earnings? I thought the raw data (book ranks) was included, and I guess I don’t understand how that part of the data was cooked. (Or if I should care/if it’s a bad thing, or if it’s just something that should be noted.)
I’m (really) trying to wade through all this, looking for understanding, but stats and data analysis are so not my thing. Sorry. Mainly, I’m most curious about the percentage of trade published books vs self-published books in the top X rankings on Amazon for the big genres, as well as a comparison for how much of the market is print vs. digital for those genres (mainly romance). My (limited) understanding is that more snapshots will give a more reliable answer to the first question, but probably not the second.
Am I way off base?
Christina,
I downloaded the excel file as soon as the report came out, and apparently I downloaded the file while the server was wonky, so it appears I didn’t get the full report. I just went back and checked and I made those statements above on the basis of the original file I downloaded. I’ll amend.
Thank you Courtney – I always appreciate your break downs and the way you logically lay out your arguments.
And quite off topic, but this:
“If you can’t tell me how wrong you are, you haven’t taken the effort to figure out how right you are.”
This sounds like something a Courtney Milan hero or heroine would say and I think I can see how you’re able to pull it off in your romance novels when I follow your lines of reasoning IRL. 🙂
Thanks for the thoughtful commentary.
I read the report, saw the note at the top that it was based on data from one day only, and immediately viewed the findings as a very general overview. I assumed that they would repeat the exercise at regular intervals and refine the model. It would be a shame for Howey and his mystery programmer to give up at this point, because long-term modelling of the data is valuable.
It would be a great pity if an author took the information and assumed they would get a monthly paycheck of X if only they could game their Amazon ranking to Y.
Sarah, count on it, Hugh will be back with more data. He’s pretty good about accepting constructive criticism.
I’ve been reading both sides because I know nothing about statistics, and thank you, Courtney, for an evenhanded analysis of the data without any offensive snark. It is much appreciated.
Thank you for making a reasonable comment about this. Not much of that happening. I found this data more useful than you did, but that is because I have been building my own model of book sales. One thing that I can say is that using one day as a cross-section appears to work really well for comparing aggregate performance. I can check their estimates against my (unpublished) model tracking best seller performance over time. The fact that titles further down the list follow similar patterns is enlightening. Given the distribution of sales per title that was something I was unwilling to assume.
I have suggested to the @authorearnings folks some specific things they could do to improve the quality of their data for longitudinal purposes. I think they need to use specific anonymized identifiers to track authors and titles. I offered to help, if they need it.
Amazon customers have very different behavior than customers of offline bookstores. I found the separation of the data, even though it was a matter of convenience, to be helpful rather than a weakness. Again, that is because I have been building a model of reader/customer behavior already.
All in all, Hugh’s claims go beyond what his model supports and are still probably mostly true.
The publishing industry’s grasp of science can be likened to a kid pushing a lawn mower around an overgrown yard, back and forth, exactly even rows one after another… Just imagine how the landscape might improve if somebody showed the kid how to turn that thing on.
You’re cracking me up, Grace Burrowes! Thanks, Courtney, for the analysis. I’m no statistician but appreciate that you have expertise and are willing to share.
I think some folks that take a side in this ebook or traditional publisher conversation forget that we’re capitalists and there’s a reason that there are no more shoe cobblers. Massive upheaval of an industry creates new winners and plenty of losers, and often a change in the status quo. The folks that have called the shots for a long time see/imagine influence and dollars slipping away so they demonize the new business models. Nobody should be surprised. Conversely, new business models don’t necessarily have the capital to survive, the necessary skills for success, or the commitment to the long haul which is inevitably required. Things will shake out in a decade or two or until a new technology or event comes along and changes things again. In the meantime, I’ve burned my Excel spreadsheet of Doom chronicling my agent and publisher rejections and will continue to sell books directly to readers!
There’s absolutely no doubt that Howey and his supporters have jumped to massive and unsupportable conclusions from their data. That said, the one thing that I think IS clear in the data is that the standard 25% of net digital royalty rate being paid by traditional publishers is not in the long-term best interest of the vast majority of authors they sign. Yes, a publisher might pay you a large advance in lieu of a greater royalty, but I think it has to be a LOT of money before it compensates for the fact that you might not get your rights back for 35 years. And when you take into consideration that net can change (i.e., what if Amazon takes over the world and decides it will only pay publishers 40% of list?), there’s a lot of risk in traditional publishing, too. It’s just DIFFERENT risk.
I did a little worksheet to pinpoint earn-out points based on format, royalty rate, list price, and publisher’s net. I found the results pretty helpful, because they gave me a sense of what I’d need to get in advance to accept 25% of net. It’s way more than any publisher in their right mind would pay me :).
If anyone wants to check it out, it’s on my blog here: http://bit.ly/1oEwlZp. And if you want to play around with it yourself, you can email me and ask for a copy.
It’s nice to see a reasoned reply within all the dreck flying about on this subject, so thank you Courtney.
I agree completely with Jackie Barbosa above in that the real takeaway from this snippet of data is the unconscionably huge portion of earnings that the trad pubs get from each ebook they sell. This should be the real eye opener for anyone considering signing a Big 5 contract or querying to obtain one.
The other nugget I find fascinating, and that has gotten lost among the fun but speculative graphs the report produced at the end, is the percentage of ebooks vs. paper books that Amazon is selling at this given point in time. We knew from Amazon’s own reports that ebooks were outselling paper for them now, but I don’t think many people understood the disparity to be THAT large. I know I didn’t. It seems to paint a bleaker picture for the future of print than others would like us to believe.
I couldn’t agree more, Marie. It’s all about the book—always has been, always will be.
Thanks, Courtney, for adding additional perspective to this. For me there seem to be two issues. The raw data extracted, which actually covers a nice swath of authors as the rankings go from 1 to 752,000. This grabs the mid-list nicely and we can do a lot with that data alone. Then there are the calculations that take that data and attempt to determine daily sales and later yearly incomes. Yeah, there are a lot of problems with this…not the least of which is that a one shot pull doesn’t show the “range of rankings” for a book on a given day, or that sales at rank x are different in March than they are in December, or that a book won’t stay at the same rank for a year, which is where the train comes off the rails.
I do appreciate that Hugh has provided the raw data, and I’m sure others are going to tease some interesting information out of it. If we look at just the raw data (which I think for the most part will be accurate, even though you couldn’t find all your books), it proves what many have been saying for a long time…that self-published authors are running toe-to-toe with traditionally published authors in terms of sales on the largest bookseller in the world. With higher royalties, even if these authors ONLY get income from Amazon they are making a really nice living wage. There are thousands of authors, without household names, that are making five and six figure incomes. It proves once and for all that self-publishing can be a viable option for those with the entrepreneurial spirit, and even those seeking traditional publishing should rejoice, since it shows publishers that they are not in competition only with each other, but with the concept of “going it alone.” In such an environment, publishers will have to adjust their contracts and “industry standards” to attract and retain authors.