PHIL'S PUZZLE: Finding a Function



Subject: Phil's Puzzle for Monday

We have a lot of people on this board with some math or science or technical education. A question just occurred to me. I'm wondering if anyone knows whether in math there is a -general- method to fit a function (create an actual equation) to data points. It would be extremely useful in the real world in hundreds of areas.

Math and scientific applications as you study them in textbooks are regularly about starting with an equation or function and then applying it, using it to see what data is derived from it. (If I know the rate of fall of an object, I can tell you at what time it will hit the earth; if I know the force of gravitational attraction ...) But what about GOING THE OPPOSITE WAY? Often what you know is the data and you would love to find a function that most closely quantifies or 'formulaizes' it.

Let me give an example to make it clear the sort of thing I'm looking for: We have census numbers for the population of major cities in the U.S. (and perhaps even around the world) -- or even population of countries -- for each decade stretching back a century or more. What is the best equation that fits that data?

You could use a graphing calculator and experiment: draw a number of polynomials (ax^n + bx^(n-1) + ... + z) and see which comes closest to the data points. Once you have an equation, the first derivative would tell you at what rate the population is or has been growing. And the second derivative tells you whether the growth is slowing down or speeding up. And by how much.
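
To make the derivative idea concrete, here is a minimal Python sketch. The helper names (`poly_eval`, `poly_derivative`) and the quadratic "population" curve are made up for illustration; they are not taken from any census data.

```python
# Hypothetical helpers. Coefficients are listed from the highest power down,
# so [a, b, c] means a*t**2 + b*t + c.

def poly_eval(coeffs, t):
    """Evaluate the polynomial at t using Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * t + c
    return result

def poly_derivative(coeffs):
    """Return the coefficients of the derivative polynomial."""
    degree = len(coeffs) - 1
    return [c * (degree - i) for i, c in enumerate(coeffs[:-1])]

# Made-up fitted curve: population (in thousands) t decades after the first census.
population = [2.0, 30.0, 500.0]                 # 2*t**2 + 30*t + 500
growth_rate = poly_derivative(population)       # first derivative: 4*t + 30
acceleration = poly_derivative(growth_rate)     # second derivative: constant 4

print(poly_eval(population, 5))     # size at t = 5
print(poly_eval(growth_rate, 5))    # how fast it is growing at t = 5
print(poly_eval(acceleration, 5))   # positive, so growth is speeding up
```

The hard part, as the next paragraph says, is getting a trustworthy equation in the first place; the calculus afterwards is mechanical.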

Very useful information. But the problem is whether you can induce or derive an equation or function in the first place. (And polynomials are not the only kinds of functions or necessarily the best fit. For example, bacteria and disease grow or spread exponentially due largely to basic rules of reproductive biology.)

Edited by Philip Coates

Another area where this would be useful is practical applications of economics, e.g., forecasting.

There are all kinds of data about hiring numbers for major corporations emerging from recession, unemployment statistics, how much gold is being bought in India and China, housing starts, etc. Can you add them up to show whether (and when) the economy may pull out of recession or are we in for a decade of stagnation? Will housing recover so I can sell or buy a house? What has the supply and demand been like for the commodity I sell or for certain kinds of foods that are part of my staples? What is the trend line for stock prices in a recovery -- you can make money or avoid losing it if you have a mathematical handle on that. And on and on, endlessly.

Edited by Philip Coates

What criterion of fit are you using? A fit which minimizes the sum of the squares of the errors? A fit in which the maximum difference between the curve and any of the sample points is less than a given bound? A fit with Chebyshev polynomials?

Given n+1 points with distinct x-values, one can find a polynomial of degree at most n which passes through all the points, but such a polynomial would probably not tell you much about the underlying stuff from which the sample points came.
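
A small Python sketch of that point, using Lagrange's formula (one of several equivalent ways to build the interpolating polynomial). The function name and the four data points are invented for illustration, and the x-values are assumed to be all different.

```python
from fractions import Fraction

def lagrange_interpolate(points, x):
    """Evaluate at x the unique polynomial of degree <= n through n+1 points.

    `points` is a list of (x_i, y_i) pairs whose x_i are all different.
    Fractions keep the arithmetic exact.
    """
    x = Fraction(x)
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / Fraction(xi - xj)
        total += term
    return total

# Made-up data: four points, so the interpolant is a cubic (or lower).
data = [(0, 1), (1, 3), (2, 2), (3, 5)]

for xi, yi in data:
    assert lagrange_interpolate(data, xi) == yi   # hits every sample exactly

print(float(lagrange_interpolate(data, 1.5)))   # a value between the samples
print(float(lagrange_interpolate(data, 10)))    # far outside the samples
```

The curve reproduces every sample, yet the value far outside the sampled range is dictated entirely by the cubic term rather than by anything known about the data, which is the sense in which an exact fit may say little about the underlying process.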

Please have a look at:

http://www.amazon.com/Fitting-Equations-Data-Computer-Multifactor/dp/0471376841

which shows the first few pages of a book on curve fitting. There are loads of ways of fitting curves to data.

Ba'al Chatzaf

Edited by BaalChatzaf

Given n+1 data points with distinct x-values, a polynomial of degree n or less can be found to fit the data exactly. See here, here, and here.

Curve fitting can be done in MS Excel using the Solver add-in. The target curve need not be a polynomial.

A genetic algorithm can be used to find a function with the closest fit to any data, but you must supply the function's form. Despite the name, it is not restricted to evolution. For example, see Evolver.
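
As a rough illustration of the idea, here is a stripped-down evolutionary search in Python. It uses only mutation and selection (no crossover), so it is a simplified relative of a genetic algorithm, not the method used by Evolver or any other product. The functional form y = a*exp(b*x), the function names, the data, and every tuning number are assumptions chosen for the example.

```python
import math
import random

def sum_sq_error(params, data):
    """Sum of squared differences between a*exp(b*x) and the data."""
    a, b = params
    return sum((a * math.exp(b * x) - y) ** 2 for x, y in data)

def evolve(data, generations=200, pop_size=30, seed=0):
    """Mutation-and-selection search for (a, b); returns (best, history)."""
    rng = random.Random(seed)
    population = [(rng.uniform(0, 5), rng.uniform(0, 1)) for _ in range(pop_size)]
    history = []                                    # best error per generation
    for _ in range(generations):
        population.sort(key=lambda p: sum_sq_error(p, data))
        history.append(sum_sq_error(population[0], data))
        survivors = population[: pop_size // 3]     # keep the fittest third
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(survivors)            # pick a parent at random
            children.append((a + rng.gauss(0, 0.1), b + rng.gauss(0, 0.02)))
        population = survivors + children           # the best is never discarded
    population.sort(key=lambda p: sum_sq_error(p, data))
    history.append(sum_sq_error(population[0], data))
    return population[0], history

# Made-up data generated from a = 2, b = 0.5, so the target is known.
data = [(x, 2.0 * math.exp(0.5 * x)) for x in range(6)]
best, history = evolve(data)
print(best, history[0], history[-1])
```

Because the fittest candidates always survive, the best error can only stay the same or shrink from one generation to the next. Note that the search only tunes a and b; the exponential form still had to be supplied by hand.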

Edited by Merlin Jetton

Phil, there are in fact ways of doing exactly what you suggest. Much finite element or finite difference modeling assumes ignorance about system behavior, but augments mathematics with real-time data gathering. The better economics does this with Monte Carlo simulations. Much of current economics is corrupted by assuming normal or even Bayesian-adjusted statistical distributions. The kind of math you want for this is conditional probability and statistics of random variables. Often there are no closed-form solutions, but there are heuristics and rules that allow a problem to be described given what is currently known.

There is no way to get around the problem of inducing a representative equation. Often more promising mathematical possibilities fall out of what we know physically about a system, but a general rule or methodology depends on the subject and conditions specific to a given system.

Often breakthroughs happen with unwieldy mathematics that is the best for the time period. Maxwell's original equations were expressed in unwieldy quaternion form, not the cleaned up vector calculus notation familiar to us from Heaviside.

Jim


Given a finite number of data points there is an uncountable infinity of ways of fitting continuous real-valued functions to the data points. In a sense, curve fitting with no further constraints than the data points themselves is not a well-posed problem.

Ba'al Chatzaf


Phil,

As the guys before me said, there are a lot of ways to do this. There are some pretty neat numerical techniques related to the idea.

Least-Squares Regression (http://en.wikipedia.org/wiki/Least_squares) is my 'favorite.' If you know the general form of the equation, then you can find values for your unknown constants which most effectively fit the data. The biggest issue with Least-Squares Regression is getting a good form of the equation. But that can be done by guessing, using the situation's governing physical equations (mass/momentum/energy conservation, equations of state, etc.), or dimensional analysis. Or by using the form your boss gives you.
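
For the simplest assumed form, a straight line y = m*x + c, the least-squares constants have a closed-form answer. The Python sketch below uses a hypothetical helper name and made-up numbers; it is an illustration of the technique, not anyone's production code.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the assumed form y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up measurements scattered around a rising trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

m, c = fit_line(xs, ys)
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
print(m, c)
print(sum(r ** 2 for r in residuals))   # the quantity the fit minimizes
```

Other forms (exponential, power law, and so on) either reduce to this after a transformation or need an iterative solver, but the criterion being minimized is the same.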

For a population or an economy, getting the right form of equation would be tricky. Especially in economics, where a small change in the inputs would likely give a large change in the outputs. You could try a polynomial, but polynomial fits (except for cubic splines) can be funky if the data points aren't aligned nicely. And a polynomial fit is often completely out of the question if extrapolation is your goal.

Mike


Thanks, Baal, Jim, Merlin, and Mike!

All right, you guys are cool! ...A wealth of useful leads and answers for me to chew through!!

...This is why I wish I had majored in applied math, practical stuff that the engineers were taking at my school, instead of "pure" or theoretical math for those who were going to become math professors. Lots more stuff you can do with 'analysis' beyond calculus as well, as opposed to group theory, abstract algebras, topology. (There are many applications, though, for some of the courses I took, like linear algebra. But we never covered them for chrissake!)

And I didn't have sense enough to choose probability or statistics among my electives either. Too practical and 'easy' for an arrogant smug little egghead Platonist shithead in training like Young Phil.

You just did proofs and theory in pure math as it was taught: Platonic crapola to a large extent, in which pride was taken in not sullying ourselves with real world applications...although you certainly learned rigor and long-chain logical deductive thinking.

(One of the reasons I turned my back on math, bailed out on it as my career choice before going on to the Ph.D. I didn't see beyond the treeline and lost respect for it.)

Edited by Philip Coates

> And a polynomial fit is often completely out of the question if extrapolation is your goal. [Mike H.]

Extrapolation is -very definitely- the goal in the practical, real world examples I gave --- such as those from economics where you are trying to decide whether to buy gold, or sell stocks, or determine when to sell your house or when the job market might bounce back. Or in the case of population demographics where one wants to know if the existing freeways or other resources or infrastructure will be adequate for population five or twenty years away.

(( Obviously if you fully know the causality involved (reproductive biology in the bacteria and disease example), you don't need to 'invent' a new equation. But in the complex cases such as the economic ones I mentioned, either the full magnitude of the causal factors is staggering or their quantification is variable. Or, as in the case of population growth or shrinkage, there is a large element of volition involved which can shift quickly. ))

Edited by Philip Coates

It just occurs to me also that you are going to have to deal with 'boundedness' on the causality.


Phil -

There is an abundance of methodology of this sort. The most commonly known one is called regression, in which the criterion of fit is the sum of squared "mistakes" (the mistake, called the "residual", is the difference between the value given by your function/equation and the value in the data). The method of least squares minimizes the sum of the squared residuals by choice of the unknown parameters in the equations. See almost any applied low-level statistics textbook (say, sophomore level).
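
Written out, the criterion described above looks like this. The symbols are assumptions for the sake of notation: (x_i, y_i) are the n data points, f is the chosen functional form, and theta stands for its unknown parameters.

```latex
% Residual of the i-th data point under the model f with parameters \theta
r_i(\theta) = y_i - f(x_i;\,\theta), \qquad i = 1, \dots, n

% Least-squares criterion: the sum of squared residuals
S(\theta) = \sum_{i=1}^{n} r_i(\theta)^2

% The least-squares estimate is the parameter choice that minimizes S
\hat{\theta} = \arg\min_{\theta} S(\theta)
```

For a straight line, f(x; theta) = theta_0 + theta_1 x, and the minimizer has the familiar closed form found in elementary textbooks.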

I'm not certain where you are going with this. I have a PhD in the field of statistics and can steer you to reading if you want to learn about the large amount of methodology which has been developed over many decades to handle such problems. There are a variety of different criteria (pros and cons of these can be described), and the associated algorithms, etc... The properties of the least squares estimators (ordinary regression) are those discussed in most elementary statistics textbooks.

Under the acronym "GLM" you can find a broad variety of other methodologies, of which the above is a special case.

Let me know what is of interest...

Bill P


Thanks, Bill.

...I think we cross-posted. My posts 8-10 narrow it down a bit. It sounds like you are eminently qualified in this area (insofar as it's purely an issue in the field of statistics.)

Edited by Philip Coates

Both linear regression and logistic regression assume a specific functional form and span only two possible functional forms. (Logistic regression is especially suitable for fitting probabilities and is heavily used on epidemiological data.) The divided difference method assumes a polynomial form. If you want to fit any other functional form, then genetic algorithms offer the greatest flexibility.

Edited by Merlin Jetton

Phil,

Extrapolation is really tough with complex systems. Usually there are distinct points where models break down (continuity goes out the window). For example, the Ideal Gas Equation of State is great for situations where the gas is nearly ideal. But once we reach certain limits (e.g. high pressure or low temperature) a different model is needed. So we go to something more complicated like the Van der Waals Equation of State or the good ol' Soave-Redlich-Kwong Equation of State.
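
A quick numerical sketch of that breakdown, in Python. The function names are hypothetical, and the van der Waals constants for CO2 are approximate textbook values, so treat the numbers as illustrative only.

```python
R = 0.083145  # gas constant in L*bar/(mol*K)

def ideal_gas_pressure(n, T, V):
    """Pressure in bar from the ideal gas law P = nRT/V."""
    return n * R * T / V

def van_der_waals_pressure(n, T, V, a, b):
    """Pressure in bar from the van der Waals equation of state."""
    return n * R * T / (V - n * b) - a * n ** 2 / V ** 2

# Approximate textbook constants for CO2 (assumed values for illustration).
CO2_A = 3.640     # L^2*bar/mol^2
CO2_B = 0.04267   # L/mol

def relative_gap(V, n=1.0, T=300.0):
    """Fractional disagreement between the two models at volume V (litres)."""
    p_ideal = ideal_gas_pressure(n, T, V)
    p_vdw = van_der_waals_pressure(n, T, V, CO2_A, CO2_B)
    return abs(p_ideal - p_vdw) / p_ideal

print(relative_gap(25.0))   # dilute gas: the two models nearly agree
print(relative_gap(0.2))    # compressed gas: the ideal model is far off
```

Nothing in the ideal-gas formula itself warns you when you have left its range of validity, which is exactly the extrapolation hazard.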

We can see that models of economies or populations will act in the same way. So our function has to be able to account for discontinuities which could arise in the future we're trying to extrapolate into. This means that we either have to use a universal equation for everything (which is usually tough to create, and may not be computationally feasible for a large number of points), or we have to create a piecewise function which allows for the discontinuities... but if we're going to account for the discontinuities in such a complex and ill-conditioned system, we'd better know damn well where they are.

Also, if we're going to use a computer, we have to worry about iteration error. Which means, if we base our extrapolation of point i+1 on point i, then the extrapolation of point i+2 has the error created in i to i+1 as well as the error created in i+1 to i+2. In chaotic systems (i.e. populations, economies, climates) this propagation of error is awful. That's why weathermen are only "accurate" for a few days.
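
Here is a toy Python demonstration of that kind of error growth. It uses the logistic map as a stand-in for a chaotic system; the map, the starting value, and the size of the initial error are all assumptions picked for illustration, not a model of any real population or economy.

```python
def logistic_step(x, r=4.0):
    """One iteration of the logistic map x -> r*x*(1 - x)."""
    return r * x * (1.0 - x)

def gap_history(x0, perturbation, steps):
    """Iterate two nearly identical starting values and record their gap."""
    a, b = x0, x0 + perturbation
    gaps = []
    for _ in range(steps):
        a, b = logistic_step(a), logistic_step(b)   # each step builds on the last
        gaps.append(abs(a - b))
    return gaps

# Two starting points that differ by one part in ten billion.
gaps = gap_history(0.2, 1e-10, 60)

for step in (1, 10, 20, 30, 40, 50, 60):
    print(step, gaps[step - 1])
```

The gap starts out negligible and grows with each pass until the two runs have nothing to do with each other, even though both follow the same exact rule. Any measurement or rounding error in point i gets the same treatment on the way to i+1, i+2, and beyond.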

Mike
