Tuesday, November 12, 2013

Learning Statistics

I've been getting some feedback on my previous post, in which people complain about how much they hated their statistics classes, and how little useful material they felt they learned.

Perhaps my fresh feeling about statistics is due to my never having taken a formal statistics class.  I pretty much had to teach myself everything I know.  So I thought I would take this opportunity to highlight some of the excellent sources I used to learn statistics.

My first raw exposure was as a post-doc with Andy Gould at Ohio State University.  Andy is probably the smartest person I've ever worked with.  I'm probably going to mess up his potted biography a bit, since it's from memory, but as I recall him saying, Andy had dropped out of college in the '60s to work on an assembly line, I think at Ford, hoping to radicalize the proletarian masses.  But in his spare time, he dabbled in astrophysics, leading him to write a letter to Stephen Hawking pointing out a mistake in a paper on black holes.  Hawking suggested he go to graduate school.

In order to go to graduate school, Andy had to finish his undergraduate physics degree.  While going through the usual physics labs we've all taken, he independently invented modern Bayesian statistics, developing his own notation and terminology in the process.  Andy's work as I remember it is centered on Cramér–Rao bounds and Fisher information, though he had his own names and notation.  He summarized some of his techniques in this ArXiv preprint.  I remember this paper on the optimal design of microlensing experiments as an example of his techniques in application.
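To make the Cramér–Rao idea concrete, here is a minimal sketch of my own (a toy example, not from Andy's papers): for n Gaussian samples with known noise sigma, the Fisher information is n/sigma², so no unbiased estimator of the mean can have variance below sigma²/n -- and the sample mean actually achieves that bound.

```python
import numpy as np

# Toy Cramér–Rao check: estimate the mean of a Gaussian with known sigma.
# Fisher information for n samples is n / sigma**2, so the variance of any
# unbiased estimator is bounded below by sigma**2 / n.
rng = np.random.default_rng(0)
n, sigma, mu = 100, 2.0, 5.0
trials = 2000

# Run many repeated experiments and compute the sample mean in each one.
estimates = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
empirical_var = estimates.var()
crb = sigma**2 / n  # Cramér–Rao lower bound

print(f"empirical variance of sample mean: {empirical_var:.4f}")
print(f"Cramér–Rao bound:                  {crb:.4f}")
```

The two numbers agree closely, which is the sense in which the sample mean is an efficient estimator.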

I gradually became complacent in my understanding of estimation theory, until I had a big shock when I switched to Medical Physics.  There, my bible was the tome Foundations of Image Science by Barrett and Myers.  I think that this book is of great value for all scientists, not just those working in image theory.  After all, an image is just a collection of data with a particular organization.  These authors brought together a great deal of the general science of data analysis.  The first six chapters are basically an advanced undergraduate degree in Mathematics, chapters 11 and 12 study the statistics of general detectors and photon-noise-limited detectors, while chapter 13 is a study of general statistical analysis.

In particular, these authors divided statistical tasks into Estimation tasks and Classification tasks.  In a nutshell, an estimation task tries to attach a number, or several numbers, to a set of data, while a classification task tries to interpret data as selecting among a finite number of options.  As a physical scientist, my background was purely in estimation tasks: the goal of a statistical understanding of an estimation task is to place a properly sized error bar on a graph, and the appropriate tool is the Cramér–Rao bound I had learned from Andy.

But, in reality, we are often performing classification tasks, and the true underlying purpose of an estimation task is to distinguish between two distinct underlying theories.  Astronomers love to perform classification tasks with their Type I and Type II supernovae, their Population I and Population II stars, their Elliptical, Spiral, and Irregular galaxies, etc.  The workhorse of the classification task is the ROC curve.
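Here is a bare-bones sketch of an ROC curve, using two made-up Gaussian score distributions (my own toy example, not one from the sources above): sweep a threshold over the detector scores and trace out the true-positive rate against the false-positive rate.

```python
import numpy as np

# Toy ROC curve: scores when the effect is present vs. absent, assumed
# Gaussian here purely for illustration.
rng = np.random.default_rng(1)
signal = rng.normal(1.0, 1.0, 500)  # scores with the effect present
noise = rng.normal(0.0, 1.0, 500)   # scores with the effect absent

# Sweep the decision threshold over every observed score, high to low.
thresholds = np.sort(np.concatenate([signal, noise]))[::-1]
tpr = [(signal >= t).mean() for t in thresholds]  # true-positive rate
fpr = [(noise >= t).mean() for t in thresholds]   # false-positive rate

# Area under the curve via the trapezoid rule: 0.5 is chance, 1.0 is perfect.
auc = np.trapz(tpr, fpr)
print(f"AUC = {auc:.3f}")
```

Plotting tpr against fpr gives the familiar bowed curve; the area under it summarizes how separable the two classes are with a single number.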

While working for Human Network Labs, I ran into a big cultural difference in statistical analysis between scientists and engineers.  As scientists, we go through distinct phases: we gather data, then we interpret data.  The interpretation is done retrospectively on a complete data set.  But real life is not like that.  We are constantly receiving new information, and have to constantly revise our assessments based on what we've learned.  Roboticists especially have to deal with data in this manner.  Here, there is an emphasis on systems which can be easily updated with the introduction of new data without having to start again from scratch.  I think you can learn a lot about Bayesian statistics by understanding the Particle Filter.
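To show what I mean, here is a minimal bootstrap particle filter of my own devising, tracking a one-dimensional random walk from noisy measurements.  The point is the update structure: each new observation revises the belief in place, with no need to reprocess the whole history.

```python
import numpy as np

# Toy bootstrap particle filter: a hidden 1-D random walk observed with
# Gaussian noise.  The particle cloud is the current belief about the state.
rng = np.random.default_rng(2)
n_particles, steps = 1000, 50
process_sigma, obs_sigma = 0.5, 1.0

# Simulate a hidden trajectory and its noisy observations.
truth = np.cumsum(rng.normal(0, process_sigma, steps))
obs = truth + rng.normal(0, obs_sigma, steps)

particles = np.zeros(n_particles)
for z in obs:
    # Predict: propagate each particle through the motion model.
    particles += rng.normal(0, process_sigma, n_particles)
    # Update: weight each particle by the likelihood of the new observation.
    weights = np.exp(-0.5 * ((z - particles) / obs_sigma) ** 2)
    weights /= weights.sum()
    # Resample: draw a fresh, equally-weighted cloud in proportion to weights.
    particles = rng.choice(particles, size=n_particles, p=weights)

estimate = particles.mean()
print(f"true final state: {truth[-1]:+.2f}, filter estimate: {estimate:+.2f}")
```

The predict/update/resample loop is the whole algorithm; everything Bayesian about it lives in the likelihood weighting step.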

My latest career change into internet advertising forced a new shift into Machine Learning techniques.  As I see it, in physics, there is a theory, or multiple theories, and the goal is to distinguish between multiple theories (classification) or refine the parameters of a theory (estimation).  But when dealing with humans and other living things, there is no theory; humans are complex and we cannot deduce how people will act, even statistically, from first principles like we can for stars and crystals.  We can only describe how people behave based on many different noisy data sources.  "Learning" here can be reduced to drawing a "smooth" model through noisy data in a very high dimensional space.  The best source I have found here is The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, best in part because it is freely available as a .pdf download.
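As a toy illustration of "drawing a smooth model through noisy data" (my example, not one from the book), here is a Gaussian kernel smoother: each smoothed value is a weighted average of its neighbors, and the bandwidth h controls how smooth the resulting model is.

```python
import numpy as np

# Toy kernel smoother: recover a smooth curve from noisy samples of sin(2*pi*x).
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)  # noisy observations

# Gaussian kernel weights: each point is averaged with its neighbors,
# with bandwidth h setting the trade-off between smoothness and fidelity.
h = 0.05
W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
W /= W.sum(axis=1, keepdims=True)  # normalize each row of weights
smooth = W @ y

rms_error = np.sqrt(np.mean((smooth - np.sin(2 * np.pi * x)) ** 2))
print(f"RMS error vs the noiseless curve: {rms_error:.3f}")
```

The smoothed curve sits much closer to the true function than the raw noisy data does; shrinking h chases the noise, while growing h washes out the signal, which is the basic bias-variance trade-off that the Hastie, Tibshirani, and Friedman book develops at length.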

Thursday, November 7, 2013

Why we need a Common Core

The ethically-informed, statistically-literate policy maker for our modern multicultural technological society.
Thinking about the ostensible reasons why people should study the humanities instead of science.
I saw this quote from the Statistician to the Stars:

Science and math give us terrific toasters, efficient ways of annoying strangers with our electronic toys, and are darn good fields at extracting money from Leviathan. But none of them say word one about what is the best in life, which is the ideal way to live, what life is about, why life even exists, why anything exists, what is good and what evil, what is right and what wrong.

But in my work as a medical physicist, I find that I often have to make tradeoffs that the humanities people should be making: how can I weigh extra pain and discomfort to a patient vs. the cost of some improvement, or the improved accuracy of some diagnosis?  How many unnecessary mastectomies should we perform to save one woman's life from breast cancer?  How should I trade off risk to the general public from radiation exposure vs. benefits to the patient of some new procedure vs. the cost (in dollars) of shielding?  How do I weigh having more false alarms and potentially scaring patients vs. the potential for improved patient care?

These seem like the kind of questions that a humanities person should be helping me with!  But I barely even get help from the physicians who work on the project, let alone our mythical on-call bioethicist or even my poet, philosopher, or artist friends.  People just don't understand the details of the trade-offs -- the math is too hard for them, because they decided they aren't "math people".  So the million small decisions and tradeoffs and some of the large ones end up being made by the engineering team.

To me, this argues for the kind of multidisciplinary education that I had, but with tweaks.  Humanities people, studying what is right and wrong presumably so that they can best set policy, must must must have a deep knowledge of statistics.  At my college, they were required to study calculus, but calculus is just a tool to solve some problems, and other schools are happy if their students can be taught again how to subtract fractions (only to immediately forget it again).  Statistics is applied epistemology -- it teaches us how to distinguish between what is (likely to be) true and what is (likely to be) false.  Of course, you can't really do statistics without calculus.  If people want to guide society by helping us with these life-or-death questions, they need to have the tools to understand what the engineers who are building society are doing.

Meanwhile, the science geeks should be taught humanities, but with ethics as the focus.  Not the day-to-day ethics of should I accept this gift from a lobbyist, but the overarching ethics of what is good in life, and what should we do with our limited time on this dizzy planet, and what is good for the millions of people who will use the technology we produce or maintain.  Multicultural studies are important because we will have to make decisions that affect people different from us, and we have to understand them, or at a minimum, understand that they may be different from us.  We need to study Plato and Aristotle and the Buddha and Muhammad (PBUH) because without them we can't understand Bentham and Quine, and the latest modern theories of how we can best serve our patients, customers, clients, co-workers, friends and family.

Modern capitalism has an answer to this for the engineers, one they are adept at calculating: do what maximizes the long-term risk-adjusted inflation-adjusted tax-adjusted time-adjusted expected profit, measured in dollars, and to hell with anything else.  In practice, I get more feedback from the marketers and investors about the technical decisions I make than from the physicians and patients.

But at present, until the poets buckle down and learn their statistics, they are stepping away from their responsibilities in a modern technological society.

Friday, August 30, 2013

Equation tester

I've just added the MathJax equation displayer to my blog. This will allow me to put some real science up here with the LaTeX I love. I followed the steps outlined here.

Pythagoras' theorem is $a^2 + b^2 = c^2$. The definition of a limit is $$\lim_{x \rightarrow x_0} f(x) = f_0 \equiv \forall \epsilon > 0: \exists \delta > 0: 0 < |x - x_0| < \delta \implies |f(x) - f_0| < \epsilon$$