
References

Agre, 1988
Agre, P. E. (1988). The Dynamic Structure of Everyday Life. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA. AI-TR 1085, MIT Artificial Intelligence Laboratory.

Agre and Chapman, 1990
Agre, P. E. and Chapman, D. (1990). What are plans for? Robotics and Autonomous Systems, 6:17--34.

Albus, 1971
Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10:25--61.

Albus, 1981
Albus, J. S. (1981). Brain, Behavior, and Robotics. Byte Books.

Anderson, 1986
Anderson, C. W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, University of Massachusetts, Amherst, MA.

Anderson, 1987
Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. Technical Report TR87-509.3, GTE Laboratories, Incorporated, Waltham, MA. (This is a corrected version of the report published in Proceedings of the Fourth International Workshop on Machine Learning,103--114, 1987, San Mateo, CA: Morgan Kaufmann.).

Anderson et al., 1977
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84:413--451.

Andreae, 1963
Andreae, J. H. (1963). STELLA: A scheme for a learning machine. In Proceedings of the 2nd IFAC Congress, Basle, pages 497--502, London. Butterworths.

Andreae, 1969a
Andreae, J. H. (1969a). A learning machine with monologue. International Journal of Man-Machine Studies, 1:1--20.

Andreae, 1969b
Andreae, J. H. (1969b). Learning machines---a unified view. In Meetham, A. R. and Hudson, R. A., editors, Encyclopedia of Information, Linguistics, and Control, pages 261--270. Pergamon, Oxford.

Andreae, 1977
Andreae, J. H. (1977). Thinking with the Teachable Machine. Academic Press, London.

Baird, 1995
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 30--37, San Francisco, CA. Morgan Kaufmann.

Bao et al., 1994
Bao, G., Cassandras, C. G., Djaferis, T. E., Gandhi, A. D., and Looze, D. P. (1994). Elevator dispatchers for down peak traffic. Technical report, ECE Department, University of Massachusetts.

Barnard, 1993
Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23:357--365.

Barto, 1985
Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229--256.

Barto, 1986
Barto, A. G. (1986). Game-theoretic cooperativity in networks of self-interested units. In Denker, J. S., editor, Neural Networks for Computing, pages 41--46. American Institute of Physics, New York.

Barto, 1990
Barto, A. G. (1990). Connectionist learning for control: An overview. In Miller, T., Sutton, R. S., and Werbos, P. J., editors, Neural Networks for Control, pages 5--58. MIT Press, Cambridge, MA.

Barto, 1991
Barto, A. G. (1991). Some learning tasks from a control perspective. In Nadel, L. and Stein, D. L., editors, 1990 Lectures in Complex Systems, pages 195--223. Addison-Wesley Publishing Company, The Advanced Book Program, Redwood City, CA.

Barto, 1992
Barto, A. G. (1992). Reinforcement learning and adaptive critic methods. In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 469--491. Van Nostrand Reinhold, New York.

Barto, 1995a
Barto, A. G. (1995a). Adaptive critics and the basal ganglia. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 215--232. MIT Press, Cambridge, MA.

Barto, 1995b
Barto, A. G. (1995b). Reinforcement learning. In Arbib, M. A., editor, Handbook of Brain Theory and Neural Networks, pages 804--809. The MIT Press, Cambridge, MA.

Barto and Anandan, 1985
Barto, A. G. and Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15:360--375.

Barto and Anderson, 1985
Barto, A. G. and Anderson, C. W. (1985). Structural learning in connectionist systems. In Program of the Seventh Annual Conference of the Cognitive Science Society, pages 43--54, Irvine, CA.

Barto et al., 1982
Barto, A. G., Anderson, C. W., and Sutton, R. S. (1982). Synthesis of nonlinear control surfaces by a layered associative search network. Biological Cybernetics, 43:175--185.

Barto et al., 1991
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer and Information Science, University of Massachusetts, Amherst, MA.

Barto et al., 1995
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81--138.

Barto and Duff, 1994
Barto, A. G. and Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems: Proceedings of the 1993 Conference, pages 687--694, San Francisco, CA. Morgan Kaufmann.

Barto and Jordan, 1987
Barto, A. G. and Jordan, M. I. (1987). Gradient following without back-propagation in layered networks. In Caudill, M. and Butler, C., editors, Proceedings of the IEEE First Annual Conference on Neural Networks, pages II629--II636, San Diego, CA.

Barto and Sutton, 1981a
Barto, A. G. and Sutton, R. S. (1981a). Goal seeking components for adaptive intelligence: An initial assessment. Technical Report AFWAL-TR-81-1070, Air Force Wright Aeronautical Laboratories/Avionics Laboratory, Wright-Patterson AFB, OH.

Barto and Sutton, 1981b
Barto, A. G. and Sutton, R. S. (1981b). Landmark learning: An illustration of associative search. Biological Cybernetics, 42:1--8.

Barto and Sutton, 1982
Barto, A. G. and Sutton, R. S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behavioural Brain Research, 4:221--235.

Barto et al., 1983
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835--846. Reprinted in J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA, 1988.

Barto et al., 1981
Barto, A. G., Sutton, R. S., and Brouwer, P. S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40:201--211.

Bellman and Dreyfus, 1959
Bellman, R. and Dreyfus, S. E. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13:247--251.

Bellman et al., 1973
Bellman, R., Kalaba, R., and Kotkin, B. (1973). Polynomial approximation---A new computational technique in dynamic programming: Allocation processes. Mathematics of Computation, 17:155--161.

Bellman, 1956
Bellman, R. E. (1956). A problem in the sequential design of experiments. Sankhya, 16:221--229.

Bellman, 1957a
Bellman, R. E. (1957a). Dynamic Programming. Princeton University Press, Princeton, NJ.

Bellman, 1957b
Bellman, R. E. (1957b). A Markovian decision process. Journal of Mathematics and Mechanics, 6:679--684.

Berry and Fristedt, 1985
Berry, D. A. and Fristedt, B. (1985). Bandit Problems. Chapman and Hall, London.

Bertsekas, 1982
Bertsekas, D. P. (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, 27:610--616.

Bertsekas, 1983
Bertsekas, D. P. (1983). Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107--120.

Bertsekas, 1987
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.

Bertsekas, 1995
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena, Belmont, MA.

Bertsekas and Tsitsiklis, 1989
Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Bertsekas and Tsitsiklis, 1996
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

Biermann et al., 1982
Biermann, A. W., Fairfield, J. R. C., and Beres, T. R. (1982). Signature table systems and learning. IEEE Transactions on Systems, Man, and Cybernetics, SMC-12:635--648.

Bishop, 1995
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon, Oxford.

Booker, 1982
Booker, L. B. (1982). Intelligent Behavior as an Adaptation to the Task Environment. PhD thesis, University of Michigan, Ann Arbor, MI.

Boone, 1997
Boone, G. (1997). Minimum-time control of the acrobot. In 1997 International Conference on Robotics and Automation, Albuquerque, NM.

Boutilier et al., 1995
Boutilier, C., Dearden, R., and Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.

Boyan and Moore, 1995
Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 369--376, San Mateo, CA. Morgan Kaufmann.

Boyan et al., 1995
Boyan, J. A., Moore, A. W., and Sutton, R. S., editors (1995). Proceedings of the Workshop on Value Function Approximation. Machine Learning Conference 1995, Pittsburgh, PA. School of Computer Science, Carnegie Mellon University. Technical Report CMU-CS-95-206.

Bradtke, 1993
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pages 295--302, San Mateo, CA. Morgan Kaufmann.

Bradtke, 1994
Bradtke, S. J. (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 94-62.

Bradtke and Barto, 1996
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33--57.

Bradtke and Duff, 1995
Bradtke, S. J. and Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 393--400, San Mateo, CA. Morgan Kaufmann.

Bridle, 1990
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 211--217, San Mateo, CA. Morgan Kaufmann.

Broomhead and Lowe, 1988
Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321--355.

Bryson, 1996
Bryson, Jr., A. E. (1996). Optimal control---1950 to 1985. IEEE Control Systems, 13(3):26--33.

Bush and Mosteller, 1955
Bush, R. R. and Mosteller, F. (1955). Stochastic Models for Learning. Wiley, New York.

Byrne et al., 1990
Byrne, J. H., Gingrich, K. J., and Baxter, D. A. (1990). Computational capabilities of single neurons: Relationship to simple forms of associative and nonassociative learning in Aplysia. In Hawkins, R. D. and Bower, G. H., editors, Computational Models of Learning, pages 31--63. Academic Press, New York.

Campbell, 1959
Campbell, D. T. (1959). Blind variation and selective survival as a general strategy in knowledge-processes. In Yovits, M. C. and Cameron, S., editors, Self-Organizing Systems, pages 205--231. Pergamon.

Carlström and Nordström, 1997
Carlström, J. and Nordström, E. (1997). Control of self-similar ATM call traffic by reinforcement learning. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 3 (IWANNT*97), Hillsdale, NJ. Lawrence Erlbaum.

Chapman and Kaelbling, 1991
Chapman, D. and Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the 1991 International Joint Conference on Artificial Intelligence.

Chow and Tsitsiklis, 1991
Chow, C.-S. and Tsitsiklis, J. N. (1991). An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36:898--914.

Chrisman, 1992
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 183--188, Menlo Park, CA. AAAI Press/MIT Press.

Christensen and Korf, 1986
Christensen, J. and Korf, R. E. (1986). A unified theory of heuristic evaluation functions and its application to learning. In Proceedings of the Fifth National Conference on Artificial Intelligence AAAI-86, pages 148--152, San Mateo, CA. Morgan Kaufmann.

Cichosz, 1995
Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(lambda) for reinforcement learning. Journal of Artificial Intelligence Research, 2:287--318.

Clark and Farley, 1955
Clark, W. A. and Farley, B. G. (1955). Generalization of pattern recognition in a self-organizing system. In Proceedings of the 1955 Western Joint Computer Conference, pages 86--91.

Clouse, 1997
Clouse, J. (1997). On Integrating Apprentice Learning and Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 96-026.

Clouse and Utgoff, 1992
Clouse, J. and Utgoff, P. (1992). A teaching method for reinforcement learning systems. In Proceedings of the Ninth International Machine Learning Conference, pages 92--101.

Colombetti and Dorigo, 1994
Colombetti, M. and Dorigo, M. (1994). Training agents to perform sequential behavior. Adaptive Behavior, 2(3):247--275.

Connell, 1989
Connell, J. (1989). A colony architecture for an artificial creature. Technical Report AI-TR-1151, MIT Artificial Intelligence Laboratory, Cambridge, MA.

Craik, 1943
Craik, K. J. W. (1943). The Nature of Explanation. Cambridge University Press, Cambridge.

Crites, 1996
Crites, R. H. (1996). Large-Scale Dynamic Optimization Using Teams of Reinforcement Learning Agents. PhD thesis, University of Massachusetts, Amherst, MA.

Crites and Barto, 1996
Crites, R. H. and Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pages 1017--1023, Cambridge, MA. MIT Press.

Curtiss, 1954
Curtiss, J. H. (1954). A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo method for computing one component of the solution of a set of linear algebraic equations. In Meyer, H. A., editor, Symposium on Monte Carlo Methods, pages 191--233. Wiley, New York.

Cziko, 1995
Cziko, G. (1995). Without Miracles: Universal Selection Theory and the Second Darwinian Revolution. The MIT Press.

Daniel, 1976
Daniel, J. W. (1976). Splines and efficiency in dynamic programming. Journal of Mathematical Analysis and Applications, 54:402--407.

Dayan, 1991
Dayan, P. (1991). Reinforcement comparison. In Touretzky, D. S., Elman, J. L., Sejnowski, T. J., and Hinton, G. E., editors, Connectionist Models: Proceedings of the 1990 Summer School, pages 45--51. Morgan Kaufmann, San Mateo, CA.

Dayan, 1992
Dayan, P. (1992). The convergence of TD(lambda) for general lambda. Machine Learning, 8:341--362.

Dayan and Hinton, 1993
Dayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pages 271--278, San Mateo, CA. Morgan Kaufmann.

Dayan and Sejnowski, 1994
Dayan, P. and Sejnowski, T. (1994). TD(lambda) converges with probability 1. Machine Learning, 14:295--301.

Dean and Lin, 1995
Dean, T. and Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.

DeJong and Spong, 1994
DeJong, G. and Spong, M. W. (1994). Swinging up the acrobot: An example of intelligent control. In Proceedings of the American Control Conference, pages 2158--2162.

Denardo, 1967
Denardo, E. V. (1967). Contraction mappings in the theory underlying dynamic programming. SIAM Review, 9:165--177.

Dennett, 1978
Dennett, D. C. (1978). Brainstorms, chapter Why the Law-of-Effect Will Not Go Away, pages 71--89. Bradford/MIT Press, Cambridge, MA.

Dietterich and Flann, 1995
Dietterich, T. G. and Flann, N. S. (1995). Explanation-based learning and reinforcement learning: A unified view. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 176--184, San Francisco, CA. Morgan Kaufmann.

Doya, 1996
Doya, K. (1996). Temporal difference learning in continuous time and space. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pages 1073--1079, Cambridge, MA. MIT Press.

Doyle and Snell, 1984
Doyle, P. G. and Snell, J. L. (1984). Random Walks and Electric Networks. The Mathematical Association of America. Carus Mathematical Monograph 22.

Dreyfus and Law, 1977
Dreyfus, S. E. and Law, A. M. (1977). The Art and Theory of Dynamic Programming. Academic Press, New York.

Duda and Hart, 1973
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.

Duff, 1995
Duff, M. O. (1995). Q-learning for bandit problems. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 209--217, San Francisco, CA. Morgan Kaufmann.

Estes, 1950
Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review, 57:94--107.

Farley and Clark, 1954
Farley, B. G. and Clark, W. A. (1954). Simulation of self-organizing systems by digital computer. IRE Transactions on Information Theory, 4:76--84.

Feldbaum, 1960
Feldbaum, A. A. (1960). Optimal Control Theory. Academic Press, New York.

Friston et al., 1994
Friston, K. J., Tononi, G., Reeke, G. N., Sporns, O., and Edelman, G. M. (1994). Value-dependent selection in the brain: Simulation in a synthetic neural model. Neuroscience, 59:229--243.

Fu, 1970
Fu, K. S. (1970). Learning control systems---Review and outlook. IEEE Transactions on Automatic Control, pages 210--221.

Galanter and Gerstenhaber, 1956
Galanter, E. and Gerstenhaber, M. (1956). On thought: The extrinsic theory. Psychological Review, 63:218--227.

Gällmo and Asplund, 1995
Gällmo, O. and Asplund, H. (1995). Reinforcement learning by construction of hypothetical targets. In Alspector, J., Goodman, R., and Brown, T. X., editors, Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 2 (IWANNT-2), pages 300--307. Stockholm, Sweden.

Gardner, 1981
Gardner (1981). Samuel's checkers player. In Barr, A. and Feigenbaum, E. A., editors, The Handbook of Artificial Intelligence, I, pages 84--108. William Kaufmann, Los Altos, CA.

Gardner, 1973
Gardner, M. (1973). Mathematical games. Scientific American, 228:108.

Gelperin et al., 1985
Gelperin, A., Hopfield, J. J., and Tank, D. W. (1985). The logic of Limax learning. In Selverston, A., editor, Model Neural Networks and Behavior. Plenum Press, New York.

Gittins and Jones, 1974
Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241--266.

Goldberg, 1989
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.

Goldstein, 1957
Goldstein, H. (1957). Classical Mechanics. Addison-Wesley, Reading, MA.

Goodwin and Sin, 1984
Goodwin, G. C. and Sin, K. S. (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, N.J.

Gordon, 1995
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 261--268, San Francisco, CA. Morgan Kaufmann. An expanded version was published as Technical Report CMU-CS-95-103, Carnegie Mellon University, Pittsburgh, PA, 1995.

Gordon, 1996
Gordon, G. J. (1996). Stable fitted reinforcement learning. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pages 1052--1058, Cambridge, MA. MIT Press.

Griffith, 1966
Griffith, A. K. (1966). A new machine learning technique applied to the game of checkers. Technical Report Project MAC Artificial Intelligence Memo 94, Massachusetts Institute of Technology.

Griffith, 1974
Griffith, A. K. (1974). A comparison and evaluation of three machine learning procedures as applied to the game of checkers. Artificial Intelligence, 5:137--148.

Gullapalli, 1990
Gullapalli, V. (1990). A stochastic reinforcement algorithm for learning real-valued functions. Neural Networks, 3:671--692.

Gurvits et al., 1994
Gurvits, L., Lin, L.-J., and Hanson, S. J. (1994). Incremental learning of evaluation functions for absorbing Markov chains: New methods and theorems. Preprint.

Hampson, 1983
Hampson, S. E. (1983). A Neural Model of Adaptive Behavior. PhD thesis, University of California, Irvine, CA.

Hampson, 1989
Hampson, S. E. (1989). Connectionist Problem Solving: Computational Aspects of Biological Learning. Birkhauser, Boston.

Hawkins and Kandel, 1984
Hawkins, R. D. and Kandel, E. R. (1984). Is there a cell-biological alphabet for simple forms of learning? Psychological Review, 91:375--391.

Hersh and Griego, 1969
Hersh, R. and Griego, R. J. (1969). Brownian motion and potential theory. Scientific American, pages 66--74.

Hilgard and Bower, 1975
Hilgard, E. R. and Bower, G. H. (1975). Theories of Learning. Prentice-Hall, Englewood Cliffs, NJ.

Hinton, 1984
Hinton, G. E. (1984). Distributed representations. Technical Report CMU-CS-84-157, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA.

Hochreiter and Schmidhuber, 1997
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation.

Holland, 1975
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

Holland, 1976
Holland, J. H. (1976). Adaptation. In Rosen, R. and Snell, F. M., editors, Progress in Theoretical Biology, volume 4, pages 263--293. Academic Press, NY.

Holland, 1986
Holland, J. H. (1986). Escaping brittleness: The possibility of general-purpose learning algorithms applied to rule-based systems. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M., editors, Machine Learning: An Artificial Intelligence Approach, Volume II, pages 593--623. Morgan Kaufmann, San Mateo, CA.

Houk et al., 1995
Houk, J. C., Adams, J. L., and Barto, A. G. (1995). A model of how the basal ganglia generates and uses neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 249--270. MIT Press, Cambridge, MA.

Howard, 1960
Howard, R. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.

Hull, 1943
Hull, C. L. (1943). Principles of Behavior. D. Appleton-Century, NY.

Hull, 1952
Hull, C. L. (1952). A Behavior System. Wiley, NY.

Jaakkola et al., 1994
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6.

Jaakkola et al., 1995
Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 345--352, San Mateo, CA. Morgan Kaufmann.

Kaelbling, 1996
Kaelbling, L. P., editor (1996). A Special Issue of Machine Learning on Reinforcement Learning, volume 22. Machine Learning.

Kaelbling, 1993a
Kaelbling, L. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pages 167--173. Morgan Kaufmann.

Kaelbling, 1993b
Kaelbling, L. P. (1993b). Learning in Embedded Systems. MIT Press, Cambridge MA.

Kaelbling et al., 1996
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4.

Kakutani, 1945
Kakutani, S. (1945). Markov processes and the Dirichlet problem. Proc. Jap. Acad., 21:227--233.

Kalos and Whitlock, 1986
Kalos, M. H. and Whitlock, P. A. (1986). Monte Carlo Methods. Wiley, NY.

Kanerva, 1988
Kanerva, P. (1988). Sparse Distributed Memory. MIT Press, Cambridge, MA.

Kanerva, 1993
Kanerva, P. (1993). Sparse distributed memory and related models. In Hassoun, M. H., editor, Associative Neural Memories: Theory and Implementation, pages 50--76. Oxford University Press, NY.

Kashyap et al., 1970
Kashyap, R. L., Blaydon, C. C., and Fu, K. S. (1970). Stochastic approximation. In Mendel, J. M. and Fu, K. S., editors, Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications. Academic Press, New York.

Keerthi and Ravindran, 1997
Keerthi, S. S. and Ravindran, B. (1997). Reinforcement learning. In Fiesler, E. and Beale, R., editors, Handbook of Neural Computation. Oxford University Press, USA.

Kimble, 1961
Kimble, G. A. (1961). Hilgard and Marquis' Conditioning and Learning. Appleton-Century-Crofts, Inc., New York.

Kimble, 1967
Kimble, G. A. (1967). Foundations of Conditioning and Learning. Appleton-Century-Crofts.

Kirkpatrick et al., 1983
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220:671--680.

Klopf, 1972
Klopf, A. H. (1972). Brain function and adaptive systems---A heterostatic theory. Technical Report AFCRL-72-0164, Air Force Cambridge Research Laboratories, Bedford, MA. A summary appears in Proceedings of the International Conference on Systems, Man, and Cybernetics, 1974, IEEE Systems, Man, and Cybernetics Society, Dallas, TX.

Klopf, 1975
Klopf, A. H. (1975). A comparison of natural and artificial intelligence. SIGART Newsletter, 53:11--13.

Klopf, 1982
Klopf, A. H. (1982). The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence. Hemisphere, Washington, D.C.

Klopf, 1988
Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiology, 16:85--125.

Kohonen, 1977
Kohonen, T. (1977). Associative Memory: A System Theoretic Approach. Springer-Verlag, Berlin.

Korf, 1988
Korf, R. E. (1988). Optimal path finding algorithms. In Kanal, L. N. and Kumar, V., editors, Search in Artificial Intelligence, pages 223--267. Springer Verlag, Berlin.

Kraft and Campagna, 1990
Kraft, L. G. and Campagna, D. P. (1990). A summary comparison of CMAC neural network and traditional adaptive control systems. In Miller, T., Sutton, R. S., and Werbos, P. J., editors, Neural Networks for Control, pages 143--169. MIT Press, Cambridge, MA.

Kraft et al., 1992
Kraft, L. G., Miller, W. T., and Dietz, D. (1992). Development and application of CMAC neural network-based control. In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 215--232. Van Nostrand Reinhold, New York.

Kumar and Varaiya, 1986
Kumar, P. R. and Varaiya, P. (1986). Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice-Hall, Englewood Cliffs, NJ.

Kumar, 1985
Kumar, P. R. (1985). A survey of some results in stochastic adaptive control. SIAM Journal of Control and Optimization, 23:329--380.

Kumar and Kanal, 1988
Kumar, V. and Kanal, L. N. (1988). The CDP: A unifying formulation for heuristic search, dynamic programming, and branch-and-bound. In Kanal, L. N. and Kumar, V., editors, Search in Artificial Intelligence, pages 1--37. Springer-Verlag.

Kushner and Dupuis, 1992
Kushner, H. J. and Dupuis, P. (1992). Numerical Methods for Stochastic Control Problems in Continuous Time. Springer-Verlag, New York.

Lai, 1987
Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091--1114.

Lang et al., 1990
Lang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:33--43.

Lin and Kim, 1991
Lin, C.-S. and Kim, H. (1991). CMAC-based adaptive critic self-learning control. IEEE Transactions on Neural Networks, 2:530--533.

Lin, 1992
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293--321.

Lin and Mitchell, 1992
Lin, L.-J. and Mitchell, T. (1992). Reinforcement learning with hidden states. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 271--280. MIT Press.

Littman, 1994
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157--163, San Francisco, CA. Morgan Kaufmann.

Littman et al., 1995a
Littman, M. L., Cassandra, A. R., and Kaelbling, L. P. (1995a). Learning policies for partially observable environments: Scaling up. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 362--370, San Francisco, CA. Morgan Kaufmann.

Littman et al., 1995b
Littman, M. L., Dean, T. L., and Kaelbling, L. P. (1995b). On the complexity of solving Markov decision processes. In Proceedings of the Eleventh International Conference on Uncertainty in Artificial Intelligence.

Ljung and Söderstrom, 1983
Ljung, L. and Söderstrom, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.

Lovejoy, 1991
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47--66.

Luce, 1959
Luce, D. (1959). Individual Choice Behavior. Wiley, NY.

Zweben et al., 1994
Zweben, M., Daun, B., and Deale, M. (1994). Scheduling and rescheduling with iterative repair. In Zweben, M. and Fox, M. S., editors, Intelligent Scheduling, pages 241--255. Morgan Kaufmann, San Francisco, CA.

Maclin and Shavlik, 1994
Maclin, R. and Shavlik, J. W. (1994). Incorporating advice into agents that learn from reinforcements. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94).

Mahadevan, 1996
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159--196.

Markey, 1994
Markey, K. L. (1994). Efficient learning of multiple degree-of-freedom control problems with quasi-independent Q-agents. In Mozer, M. C., Smolensky, P., Touretzky, D. S., Elman, J. L., and Weigend, A. S., editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Erlbaum.

Mazur, 1994
Mazur, J. E. (1994). Learning and Behavior, Third Edition. Prentice-Hall, Englewood Cliffs, NJ.

McCallum, 1992
McCallum, A. K. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 183--188, Menlo Park, CA. AAAI Press/MIT Press.

McCallum, 1993
McCallum, A. K. (1993). Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, pages 190--196. Morgan Kaufmann.

McCallum, 1995
McCallum, A. K. (1995). Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, Rochester.

Mendel, 1966
Mendel, J. M. (1966). Applications of artificial intelligence techniques to a spacecraft control problem. Technical Report NASA CR-755, National Aeronautics and Space Administration.

Mendel and McLaren, 1970
Mendel, J. M. and McLaren, R. W. (1970). Reinforcement learning control and pattern recognition systems. In Mendel, J. M. and Fu, K. S., editors, Adaptive, Learning and Pattern Recognition Systems: Theory and Applications, pages 287--318. Academic Press, New York.

Michie, 1961
Michie, D. (1961). Trial and error. In Barnett, S. A. and McLaren, A., editors, Science Survey, Part 2, pages 129--145, Harmondsworth. Penguin.

Michie, 1963
Michie, D. (1963). Experiments on the mechanisation of game learning. 1. characterization of the model and its parameters. Computer Journal, 1:232--263.

Michie, 1974
Michie, D. (1974). On Machine Intelligence. Edinburgh University Press.

Michie and Chambers, 1968
Michie, D. and Chambers, R. A. (1968). BOXES: An experiment in adaptive control. In Dale, E. and Michie, D., editors, Machine Intelligence 2, pages 137--152. Oliver and Boyd.

Miller and Williams, 1992
Miller, S. and Williams, R. J. (1992). Learning to control a bioreactor using a neural net Dyna-Q system. In Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, pages 167--172, Center for Systems Science, Dunham Laboratory, Yale University.

Miller et al., 1994
Miller, W. T., Scalera, S. M., and Kim, A. (1994). Neural network control of dynamic balance for a biped walking robot. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pages 156--161, Dunham Laboratory, Yale University. Center for Systems Science.

Minsky, 1954
Minsky, M. L. (1954). Theory of Neural-Analog Reinforcement Systems and its Application to the Brain-Model Problem. PhD thesis, Princeton University.

Minsky, 1961
Minsky, M. L. (1961). Steps toward artificial intelligence. Proceedings of the Institute of Radio Engineers, 49:8--30. Reprinted in E. A. Feigenbaum and J. Feldman, editors, Computers and Thought. McGraw-Hill, New York, 406--450, 1963.

Minsky, 1967
Minsky, M. L. (1967). Computation: Finite and Infinite Machines. Prentice Hall, Englewood Cliffs, NJ.

Montague et al., 1996
Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16:1936--1947.

Moore, 1990
Moore, A. W. (1990). Efficient Memory-Based Learning for Robot Control. PhD thesis, University of Cambridge, Cambridge, UK.

Moore, 1994
Moore, A. W. (1994). The parti-game algorithm for variable resolution reinforcement learning in multidimensional spaces. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems: Proceedings of the 1993 Conference, pages 711--718, San Francisco, CA. Morgan Kaufmann.

Moore and Atkeson, 1993
Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103--130.

Moore et al., 1986
Moore, J. W., Desmond, J. E., Berthier, N. E., Blazis, E. J., Sutton, R. S., and Barto, A. G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: I. Response topography, neuronal firing, and interstimulus intervals. Behavioural Brain Research, 21:143--154.

Narendra and Thathachar, 1989
Narendra, K. and Thathachar, M. A. L. (1989). Learning Automata: An Introduction. Prentice Hall, Englewood Cliffs, NJ.

Narendra and Thathachar, 1974
Narendra, K. S. and Thathachar, M. A. L. (1974). Learning automata---A survey. IEEE Transactions on Systems, Man, and Cybernetics, 4:323--334.

Nie and Haykin, 1996
Nie, J. and Haykin, S. (1996). A dynamic channel assignment policy through Q-learning. CRL Report 334, Hamilton, Ontario, Canada L8S 4K1.

Page, 1977
Page, C. V. (1977). Heuristics for signature table analysis as a pattern recognition technique. IEEE Transactions on Systems, Man, and Cybernetics, SMC-7:77--86.

Parr and Russell, 1995
Parr, R. and Russell, S. (1995). Approximating optimal policies for partially observable stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.

Pavlov, 1927
Pavlov, I. P. (1927). Conditioned Reflexes. Oxford University Press, London.

Pearl, 1984
Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley.

Peng, 1993
Peng, J. (1993). Efficient Dynamic Programming-Based Learning for Control. PhD thesis, Northeastern University, Boston, MA.

Peng and Williams, 1993
Peng, J. and Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4).

Peng and Williams, 1994
Peng, J. and Williams, R. J. (1994). Incremental multi-step Q-learning. In Cohen, W. W. and Hirsh, H., editors, Proceedings of the Eleventh International Conference on Machine Learning, pages 226--232.

Peng and Williams, 1996
Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22(1/2/3).

Phansalkar and Thathachar, 1995
Phansalkar, V. V. and Thathachar, M. A. L. (1995). Local and global optimization algorithms for generalized learning automata. Neural Computation, 7:950--973.

Poggio and Girosi, 1989
Poggio, T. and Girosi, F. (1989). A theory of networks for approximation and learning. A.I. Memo 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

Poggio and Girosi, 1990
Poggio, T. and Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978--982.

Powell, 1987
Powell, M. J. D. (1987). Radial basis functions for multivariate interpolation: A review. In Mason, J. C. and Cox, M. G., editors, Algorithms for Approximation. Clarendon Press, Oxford.

Puterman, 1994
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, NY.

Puterman and Shin, 1978
Puterman, M. L. and Shin, M. C. (1978). Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24:1127--1137.

Reetz, 1977
Reetz, D. (1977). Approximate solutions of a discounted Markovian decision process. Bonner Mathematische Schriften, vol 98: Dynamische Optimierung, pages 77--92.

Ring, 1994
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712.

Rivest and Schapire, 1987
Rivest, R. L. and Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty-Eighth Annual Symposium on Foundations of Computer Science, pages 78--87.

Robbins, 1952
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527--535.

Robertie, 1992
Robertie, B. (1992). Carbon versus silicon: Matching wits with TD-Gammon. Inside Backgammon, 2(2):14--22.

Rosenblatt, 1961
Rosenblatt, F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 6411 Chillum Place N.W., Washington, D.C.

Ross, 1983
Ross, S. (1983). Introduction to Stochastic Dynamic Programming. Academic Press, New York.

Rubinstein, 1981
Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. Wiley, NY.

Rumelhart et al., 1986
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol.1: Foundations. Bradford Books/MIT Press, Cambridge, MA.

Rummery, 1995
Rummery, G. A. (1995). Problem Solving with Reinforcement Learning. PhD thesis, Cambridge University.

Rummery and Niranjan, 1994
Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.

Russell and Norvig, 1995
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ.

Rust, 1996
Rust, J. (1996). Numerical dynamic programming in economics. In Amman, H., Kendrick, D., and Rust, J., editors, Handbook of Computational Economics, pages 614--722. Elsevier, Amsterdam.

Bradtke et al., 1994
Bradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In Proceedings of the American Control Conference.

Samuel, 1959
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, pages 210--229. Reprinted in E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, New York, 1963.

Samuel, 1967
Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. II---Recent progress. IBM Journal of Research and Development, pages 601--617.

Schultz and Melsa, 1967
Schultz, D. G. and Melsa, J. L. (1967). State Functions and Linear Control Systems. McGraw-Hill, New York.

Schultz et al., 1997
Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275:1593--1598.

Schwartz, 1993
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298--305. Morgan Kaufmann.

Schweitzer and Seidmann, 1985
Schweitzer, P. J. and Seidmann, A. (1985). Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568--582.

Selfridge et al., 1985
Selfridge, O. J., Sutton, R. S., and Barto, A. G. (1985). Training and tracking in robotics. In Joshi, A., editor, Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pages 670--672, San Mateo, CA. Morgan Kaufmann.

Shannon, 1950a
Shannon, C. E. (1950a). A chess-playing machine. Scientific American, 182:48--51.

Shannon, 1950b
Shannon, C. E. (1950b). Programming a computer for playing chess. Philosophical Magazine, 41:256--275.

Shewchuk and Dean, 1990
Shewchuk, J. and Dean, T. (1990). Towards learning time-varying functions with high input dimensionality. In Proceedings of the Fifth IEEE International Symposium on Intelligent Control, pages 383--388. IEEE.

Singh, 1992a
Singh, S. P. (1992a). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 202--207, Menlo Park, CA. AAAI Press/MIT Press.

Singh, 1992b
Singh, S. P. (1992b). Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Proceedings of the Ninth International Machine Learning Conference, pages 406--415, San Mateo, CA. Morgan Kaufmann.

Singh, 1993
Singh, S. P. (1993). Learning to Solve Markovian Decision Processes. PhD thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 93-77.

Singh and Bertsekas, 1997
Singh, S. P. and Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, Cambridge, MA. MIT Press.

Singh et al., 1994
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision problems. In Cohen, W. W. and Hirsh, H., editors, Proceedings of the Eleventh International Conference on Machine Learning, pages 284--292, San Francisco, CA. Morgan Kaufmann.

Singh et al., 1995
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 359--368, Cambridge, MA. MIT Press.

Singh and Sutton, 1996
Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123--158.

Sivarajan et al., 1990
Sivarajan, K. N., McEliece, R. J., and Ketchum, J. W. (1990). Dynamic channel assignment in cellular radio. In Proceedings of the 40th Vehicular Technology Conference, pages 631--637.

Skinner, 1938
Skinner, B. F. (1938). The Behavior of Organisms. Appleton-Century, NY.

Sofge and White, 1992
Sofge, D. A. and White, D. A. (1992). Applied learning: Optimal control for manufacturing. In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 259--281. Van Nostrand Reinhold, New York.

Spong, 1994
Spong, M. W. (1994). Swing up control of the acrobot. In Proceedings of the 1994 IEEE Conference on Robotics and Automation, San Diego, CA.

Staddon, 1983
Staddon, J. E. R. (1983). Adaptive Behavior and Learning. Cambridge University Press, Cambridge.

Sutton, 1978a
Sutton, R. S. (1978a). Learning theory support for a single channel theory of the brain.

Sutton, 1978b
Sutton, R. S. (1978b). Single channel theory: A neuronal theory of learning. Brain Theory Newsletter, 4:72--75.

Sutton, 1978c
Sutton, R. S. (1978c). A unified theory of expectation in classical and instrumental conditioning.

Sutton, 1984
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.

Sutton, 1988
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9--44.

Sutton, 1990
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216--224, San Mateo, CA. Morgan Kaufmann.

Sutton, 1991a
Sutton, R. S. (1991a). Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2:160--163. Also appeared in Working Notes of the 1991 AAAI Spring Symposium, pages 151--155.

Sutton, 1991b
Sutton, R. S. (1991b). Planning by incremental dynamic programming. In Birnbaum, L. A. and Collins, G. C., editors, Proceedings of the Eighth International Workshop on Machine Learning, pages 353--357, San Mateo, CA. Morgan Kaufmann.

Sutton, 1992
Sutton, R. S., editor (1992). A Special Issue of Machine Learning on Reinforcement Learning, volume 8. Machine Learning. Also published as Reinforcement Learning, Kluwer Academic Press, Boston, MA 1992.

Sutton, 1995
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 531--539, San Francisco, CA. Morgan Kaufmann.

Sutton, 1996
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pages 1038--1044, Cambridge, MA. MIT Press.

Sutton and Barto, 1981a
Sutton, R. S. and Barto, A. G. (1981a). An adaptive network that constructs and uses an internal model of its world. Cognition and Brain Theory, 3:217--246.

Sutton and Barto, 1981b
Sutton, R. S. and Barto, A. G. (1981b). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88:135--170.

Sutton and Barto, 1987
Sutton, R. S. and Barto, A. G. (1987). A temporal-difference model of classical conditioning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Hillsdale, NJ. Erlbaum.

Sutton and Barto, 1990
Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In Gabriel, M. and Moore, J., editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 497--537. MIT Press, Cambridge, MA.

Sutton and Pinette, 1985
Sutton, R. S. and Pinette, B. (1985). The learning of world models by connectionist networks. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, Irvine, CA.

Sutton and Singh, 1994
Sutton, R. S. and Singh, S. (1994). On bias and step size in temporal-difference learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pages 91--96, New Haven, CT. Yale University.

Tadepalli and Ok, 1994
Tadepalli, P. and Ok, D. (1994). H-learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-01, Oregon State University.

Tan, 1991
Tan, M. (1991). Learning a cost-sensitive internal representation for reinforcement learning. In Birnbaum, L. A. and Collins, G. C., editors, Proceedings of the Eighth International Workshop on Machine Learning, pages 358--362, San Mateo, CA. Morgan Kaufmann.

Tan, 1993
Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330--337. Morgan Kaufmann.

Tesauro, 1986
Tesauro, G. J. (1986). Simple neural models of classical conditioning. Biological Cybernetics, 55:187--200.

Tesauro, 1992
Tesauro, G. J. (1992). Practical issues in temporal difference learning. Machine Learning, 8:257--277.

Tesauro, 1994
Tesauro, G. J. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215--219.

Tesauro, 1995
Tesauro, G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58--68.

Tesauro and Galperin, 1997
Tesauro, G. J. and Galperin, G. R. (1997). On-line policy improvement using Monte Carlo search. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, Cambridge, MA. MIT Press.

Tham, 1994
Tham, C. K. (1994). Modular On-Line Function Approximation for Scaling up Reinforcement Learning. PhD thesis, Cambridge University.

Thathachar and Sastry, 1986
Thathachar, M. A. L. and Sastry, P. S. (1986). Estimator algorithms for learning automata. In Proceedings of the Platinum Jubilee Conference on Systems and Signal Processing, Bangalore, India.

Thathachar and Sastry, 1995
Thathachar, M. A. L. and Sastry, P. S. (1995). A new approach to the design of reinforcement schemes for learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15:168--175.

Thompson, 1933
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285--294.

Thompson, 1934
Thompson, W. R. (1934). On the theory of apportionment. American Journal of Mathematics, 57:450--457.

Thorndike, 1911
Thorndike, E. L. (1911). Animal Intelligence. Hafner, Darien, Conn.

Thorp, 1966
Thorp, E. O. (1966). Beat the Dealer: A Winning Strategy for the Game of Twenty-One. Random House, New York.

Tolman, 1932
Tolman, E. C. (1932). Purposive Behavior in Animals and Men. Century, New York.

Tsetlin, 1973
Tsetlin, M. L. (1973). Automaton Theory and Modeling of Biological Systems. Academic Press, New York.

Tsitsiklis, 1994
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185--202.

Tsitsiklis and Van Roy, 1996
Tsitsiklis, J. N. and Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94.

Tsitsiklis and Van Roy, 1997
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control.

Ungar, 1990
Ungar, L. H. (1990). A bioreactor benchmark for adaptive network-based process control. In Miller, W. T., Sutton, R. S., and Werbos, P. J., editors, Neural Networks for Control, pages 387--402. MIT Press, Cambridge, MA.

Varga, 1962
Varga, R. S. (1962). Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, NJ.

Waltz and Fu, 1965
Waltz, M. D. and Fu, K. S. (1965). A heuristic approach to reinforcement learning control systems. IEEE Transactions on Automatic Control, 10:390--398.

Watkins, 1989
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.

Watkins and Dayan, 1992
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279--292.

Werbos, 1992
Werbos, P. (1992). Approximate dynamic programming for real-time control and neural modeling. In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 493--525. Van Nostrand Reinhold, New York.

Werbos, 1977
Werbos, P. J. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22:25--38.

Werbos, 1982
Werbos, P. J. (1982). Applications of advances in nonlinear sensitivity analysis. In Drenick, R. F. and Kosin, F., editors, System Modeling and Optimization. Springer-Verlag. Proceedings of the Tenth IFIP Conference, New York, 1981.

Werbos, 1987
Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, pages 7--20.

Werbos, 1988
Werbos, P. J. (1988). Generalization of back propagation with applications to a recurrent gas market model. Neural Networks, 1:339--356.

Werbos, 1989
Werbos, P. J. (1989). Neural networks for control and system identification. In Proceedings of the 28th Conference on Decision and Control, pages 260--265, Tampa, Florida.

Werbos, 1990
Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179--189.

White, 1969
White, D. J. (1969). Dynamic Programming. Holden-Day, San Francisco.

White, 1985
White, D. J. (1985). Real applications of Markov decision processes. Interfaces, 15:73--83.

White, 1988
White, D. J. (1988). Further real applications of Markov decision processes. Interfaces, 18:55--61.

White, 1993
White, D. J. (1993). A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44:1073--1096.

Whitehead and Ballard, 1991
Whitehead, S. D. and Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7(1):45--83.

Whitt, 1978
Whitt, W. (1978). Approximations of dynamic programs I. Mathematics of Operations Research, 3:231--243.

Whittle, 1982
Whittle, P. (1982). Optimization over Time, volume 1. Wiley, NY.

Whittle, 1983
Whittle, P. (1983). Optimization over Time, volume 2. Wiley, NY.

Widrow et al., 1973
Widrow, B., Gupta, N. K., and Maitra, S. (1973). Punish/reward: Learning with a critic in adaptive threshold systems. IEEE Transactions on Systems, Man, and Cybernetics, 5:455--465.

Widrow and Hoff, 1960
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 WESCON Convention Record Part IV, pages 96--104. Reprinted in J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA, 1988.

Widrow and Smith, 1964
Widrow, B. and Smith, F. W. (1964). Pattern-recognizing control systems. In Computer and Information Sciences (COINS) Proceedings, Washington, D.C. Spartan.

Widrow and Stearns, 1985
Widrow, B. and Stearns, S. D. (1985). Adaptive Signal Processing. Prentice-Hall, Inc., Englewood Cliffs, N.J.

Williams, 1986
Williams, R. J. (1986). Reinforcement learning in connectionist networks: A mathematical analysis. Technical Report ICS 8605, Institute for Cognitive Science, University of California at San Diego, La Jolla, CA.

Williams, 1987
Williams, R. J. (1987). Reinforcement-learning connectionist systems. Technical Report NU-CCS-87-3, College of Computer Science, Northeastern University, Boston, MA.

Williams, 1988
Williams, R. J. (1988). On the use of backpropagation in associative reinforcement learning. In Proceedings of the IEEE International Conference on Neural Networks, pages 263--270, San Diego, CA.

Williams, 1992
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229--256.

Williams and Baird, 1990
Williams, R. J. and Baird, L. C. (1990). A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. In Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, pages 96--101, New Haven, CT.

Wilson, 1994
Wilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation, 2:1--18.

Witten, 1976
Witten, I. H. (1976). The apparent conflict between estimation and control---A survey of the two-armed problem. Journal of the Franklin Institute, 301:161--189.

Witten, 1977
Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34:286--295.

Witten and Corbin, 1973
Witten, I. H. and Corbin, M. J. (1973). Human operators and automatic adaptive controllers: A comparative study on a particular control task. International Journal of Man-Machine Studies, 5:75--104.

Yee et al., 1990
Yee, R. C., Saxena, S., Utgoff, P. E., and Barto, A. G. (1990). Explaining temporal differences to create useful concepts for evaluating states. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 882--888, Cambridge, MA.

Young, 1984
Young, P. (1984). Recursive Estimation and Time-Series Analysis. Springer-Verlag.

Zhang and Yum, 1989
Zhang, M. and Yum, T. P. (1989). Comparisons of channel-assignment strategies in cellular mobile telephone systems. IEEE Transactions on Vehicular Technology, 38.

Zhang, 1996
Zhang, W. (1996). Reinforcement Learning for Job-shop Scheduling. PhD thesis, Oregon State University. Tech Report CS-96-30-1.

Zhang and Dietterich, 1995
Zhang, W. and Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114--1120.

Zhang and Dietterich, 1996
Zhang, W. and Dietterich, T. G. (1996). High-performance job-shop scheduling with a time-delay TD network. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pages 1024--1030, Cambridge, MA. MIT Press.


