Perhaps my fresh feeling about statistics is due to my never having taken a formal statistics class; I pretty much had to teach myself everything I know. I thought I would take this opportunity to highlight some of the excellent sources I used along the way.
My first raw exposure was as a post-doc with Andy Gould at Ohio State University. Andy is probably the smartest person I've ever worked with. I'm probably going to mess up his potted biography a bit, since it's from memory, but as I recall him saying, Andy had dropped out of college in the '60s to work on an assembly line, I think at Ford, hoping to radicalize the proletariat. But in his spare time he dabbled in astrophysics, which led him to write a letter to Stephen Hawking pointing out a mistake in a paper on black holes. Hawking suggested he go to graduate school.
In order to go to graduate school, Andy had to finish his undergraduate physics degree. While going through the usual physics labs we've all taken, he independently invented modern Bayesian statistics, developing his own notation and terminology in the process. As I remember it, Andy's work centered on Cramér–Rao bounds and Fisher information, though he had his own names and notation for both. He summarized some of his techniques in this arXiv preprint, and I remember this paper on the optimal design of microlensing experiments as an example of his techniques in application.
I gradually became complacent in my understanding of estimation theory, until I had a big shock when I switched to Medical Physics. There, my bible was the tome Foundations of Image Science by Barrett and Myers. I think that this book is of great value for all scientists, not just those working in image theory. After all, an image is just a collection of data with a particular organization. These authors brought together a great deal of the general science of data analysis. The first six chapters are basically an advanced undergraduate degree in Mathematics, chapters 11 and 12 study the statistics of general detectors and photon-noise-limited detectors, while chapter 13 is a study of general statistical analysis.
In particular, these authors divide statistical tasks into Estimation tasks and Classification tasks. In a nutshell, an estimation task tries to put a number, or multiple numbers, on a set of data, while a classification task tries to interpret data as selecting among a finite set of options. As a physical scientist, my background was purely in estimation tasks, where the goal of a statistical understanding is to place a properly-sized error bar on a graph, and the appropriate tool is the Cramér–Rao bound I had learned from Andy.
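To make that concrete, here is a minimal sketch (my own toy example, not Andy's notation) of the Cramér–Rao bound at work: for N Gaussian samples with known sigma, the Fisher information for the mean is N/sigma², so no unbiased estimator can have variance below sigma²/N. The sample mean saturates the bound, which a quick simulation confirms.

```python
# Toy check of the Cramér–Rao bound for the mean of Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 3.0, 2.0, 50, 10_000

# Fisher information for mu from N iid samples is N / sigma**2,
# so the bound on the variance of any unbiased estimator is:
crb = sigma**2 / N

# The sample mean is the maximum-likelihood estimator of mu.
estimates = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
print(f"Cramér–Rao bound on variance: {crb:.4f}")
print(f"Empirical variance of the sample mean: {estimates.var():.4f}")
```

The square root of that bound is exactly the properly-sized error bar I mentioned above.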
But in reality we are often performing classification tasks, and the true purpose of an estimation task is frequently to distinguish between two distinct underlying theories. Astronomers love to perform classification tasks with their Type I and Type II supernovae, their Population I and Population II stars, their elliptical, spiral, and irregular galaxies, etc. The workhorse of the classification task is the ROC curve.
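For anyone who hasn't met it, an ROC curve is simple enough to build by hand. Here is a minimal sketch on synthetic scores (my own toy example, not from Barrett and Myers): sweep a decision threshold and trace the true-positive rate against the false-positive rate.

```python
# Hand-rolled ROC curve for two overlapping score distributions.
import numpy as np

rng = np.random.default_rng(1)
present = rng.normal(1.0, 1.0, 500)  # scores when the signal is present
absent = rng.normal(0.0, 1.0, 500)   # scores when the signal is absent

thresholds = np.sort(np.concatenate([present, absent]))[::-1]
tpr = np.array([(present >= t).mean() for t in thresholds])  # hits
fpr = np.array([(absent >= t).mean() for t in thresholds])   # false alarms

# The area under the curve equals the probability that a random "present"
# score outranks a random "absent" score: 0.5 is chance, 1.0 is perfect.
auc = (present[:, None] > absent[None, :]).mean()
print(f"ROC curve traced through {len(thresholds)} points; AUC = {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the familiar curve bowing up toward the top-left corner.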
While working for Human Network Labs, I ran into a big cultural difference in statistical analysis between scientists and engineers. As scientists, we go through distinct phases: first we gather data, then we interpret it. The interpretation is done retrospectively, on a complete data set. But real life is not like that: we are constantly receiving new information and have to constantly revise our assessments based on what we've learned. Roboticists especially have to deal with data in this manner. Here the emphasis is on systems that can be updated with each new piece of data without having to start again from scratch. I think you can learn a lot about Bayesian statistics by understanding the Particle Filter.
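As a taste of that style, here is a minimal particle filter (my own toy example): it tracks a one-dimensional random walk from noisy observations, folding each new measurement into the posterior as it arrives rather than re-fitting the whole history.

```python
# Minimal bootstrap particle filter tracking a 1-D random walk.
import numpy as np

rng = np.random.default_rng(2)
n_particles, steps = 1000, 50
process_noise, obs_noise = 0.1, 0.5

truth = 0.0
particles = rng.normal(0.0, 1.0, n_particles)  # initial belief about the state

for _ in range(steps):
    truth += rng.normal(0, process_noise)   # the world moves
    obs = truth + rng.normal(0, obs_noise)  # we observe it noisily

    # Predict: push every particle through the motion model.
    particles += rng.normal(0, process_noise, n_particles)

    # Update: weight each particle by the likelihood of the new observation...
    weights = np.exp(-0.5 * ((obs - particles) / obs_noise) ** 2)
    weights /= weights.sum()

    # ...and resample, so the particle cloud represents the new posterior.
    particles = rng.choice(particles, size=n_particles, p=weights)

print(f"truth: {truth:.3f}   filtered estimate: {particles.mean():.3f}")
```

The appeal for a roboticist is that each loop iteration costs the same no matter how much data has come before.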
My latest career change, into internet advertising, forced yet another shift, this time into Machine Learning techniques. As I see it, in physics there is a theory, or multiple theories, and the goal is to distinguish between the theories (classification) or refine the parameters of a theory (estimation). But when dealing with humans and other living things, there is no theory; humans are complex, and we cannot deduce how people will act, even statistically, from first principles the way we can for stars and crystals. We can only describe how people behave based on many different noisy data sources. "Learning" here reduces to drawing a "smooth" model through noisy data in a very high-dimensional space. The best source I have found here is The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, best in part because it is freely available as a .pdf download.
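As one illustration of what drawing a smooth model through noisy data can look like, here is a Nadaraya–Watson kernel smoother, one of the simpler techniques covered in the book, sketched (my own toy example) in one dimension for readability.

```python
# Kernel smoothing: a weighted local average drawn through noisy samples.
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)  # noisy samples of a smooth truth

def kernel_smooth(x_train, y_train, x_query, bandwidth=0.3):
    """Average the training targets, weighted by a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

grid = np.linspace(0, 2 * np.pi, 100)
fit = kernel_smooth(x, y, grid)
rms = np.sqrt(np.mean((fit - np.sin(grid)) ** 2))
print(f"RMS error of the smooth fit against the true curve: {rms:.3f}")
```

The bandwidth plays the role of the smoothness knob: too small and the model chases the noise, too large and it flattens out the real structure, which is the bias-variance trade-off the book keeps returning to.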