Thursday, July 26, 2012

REST


Something to be careful about when designing a RESTful API is the conflation of GET and POST, as if they were the same thing. It's easy to make this mistake with Django's function-based views and CherryPy's default dispatcher, although both frameworks now provide a way around this problem (class-based views and MethodDispatcher, respectively).
HTTP-verbs are very important in REST, and unless you're very careful about this, you'll end up falling into a REST anti-pattern.
Some frameworks that get it right are web.py, Flask and Bottle. When combined with the mimerender library (full disclosure: I wrote it), they allow you to write nice RESTful webservices:
import web
import json
from mimerender import mimerender

render_xml = lambda message: '<message>%s</message>'%message
render_json = lambda **args: json.dumps(args)
render_html = lambda message: '<html><body>%s</body></html>'%message
render_txt = lambda message: message

urls = (
    '/(.*)', 'greet'
)
app = web.application(urls, globals())
class greet:
    @mimerender(
        default = 'html',
        html = render_html,
        xml  = render_xml,
        json = render_json,
        txt  = render_txt
    )
    def GET(self, name):
        if not name: 
            name = 'world'
        return {'message': 'Hello, ' + name + '!'}
if __name__ == "__main__":
    app.run()
The service's logic is implemented only once, and the correct representation selection (Accept header) + dispatch to the proper render function (or template) is done in a tidy, transparent way.
$ curl localhost:8080/x
<html><body>Hello, x!</body></html>

$ curl -H "Accept: application/html" localhost:8080/x
<html><body>Hello, x!</body></html>

$ curl -H "Accept: application/xml" localhost:8080/x
<message>Hello, x!</message>

$ curl -H "Accept: application/json" localhost:8080/x
{"message": "Hello, x!"}

$ curl -H "Accept: text/plain" localhost:8080/x
Hello, x!
Update (April 2012): added information about Django's class-based views, CherryPy's MethodDispatcher and the Flask and Bottle frameworks. None of these existed back when the question was asked.

REST anti-patterns


When people start trying out REST, they usually start looking around for examples – and not only find a lot of examples that claim to be “RESTful”, or are labeled as a “REST API”, but also dig up a lot of discussions about why a specific service that claims to do REST actually fails to do so.

The usual standard disclaimer applies: REST, the Web, and HTTP are not the same thing; REST could be implemented with many different technologies, and HTTP is just one concrete architecture that happens to follow the REST architectural style. So I should actually be careful to distinguish “REST” from “RESTful HTTP”. I’m not, so let’s just assume the two are the same for the remainder of this article.
Why does this happen? HTTP is nothing new, but it has been applied in a wide variety of ways. Some of them were in line with the ideas the Web’s designers had in mind, but many were not. Applying REST principles to your HTTP applications, whether you build them for human consumption, for use by another program, or both, means that you do the exact opposite: You try to use the Web “correctly”, or if you object to the idea that one is “right” and one is “wrong”: in a RESTful way. For many, this is indeed a very new approach.
As with any new approach, it helps to be aware of some common patterns. In the first two articles of this series, I’ve tried to outline some basic ones – such as the concept of collection resources, the mapping of calculation results to resources in their own right, or the use of syndication to model events. A future article will expand on these and other patterns. For this one, though, I want to focus on anti-patterns – typical examples of attempted RESTful HTTP usage that create problems and show that someone has attempted, but failed, to adopt REST ideas.
Let’s start with a quick list of anti-patterns I’ve managed to come up with:
  1. Tunneling everything through GET
  2. Tunneling everything through POST
  3. Ignoring caching
  4. Ignoring status codes
  5. Misusing cookies
  6. Forgetting hypermedia
  7. Ignoring MIME types
  8. Breaking self-descriptiveness
Let’s go through each of them in detail.

Tunneling everything through GET

To many people, REST simply means using HTTP to expose some application functionality. The fundamental and most important operation (strictly speaking, “verb” or “method” would be a better term) is an HTTP GET. A GET should retrieve a representation of a resource identified by a URI, but many, if not all existing HTTP libraries and server programming APIs make it extremely easy to view the URI not as a resource identifier, but as a convenient means to encode parameters. This leads to URIs like the following:
http://example.com/some-api?method=deleteCustomer&id=1234
The characters that make up a URI do not, in fact, tell you anything about the “RESTfulness” of a given system, but in this particular case, we can guess the GET will not be “safe”: The caller will likely be held responsible for the outcome (the deletion of a customer), although the spec says that GET is the wrong method to use for such cases.
The only thing in favor of this approach is that it’s very easy to program, and trivial to test from a browser – after all, you just need to paste a URI into your address bar, tweak some “parameters”, and off you go. The main problems with this anti-pattern are:
  1. Resources are not identified by URIs; rather, URIs are used to encode operations and their parameters
  2. The HTTP method does not necessarily match the semantics
  3. Such links are usually not intended to be bookmarked
  4. There is a risk that “crawlers” (e.g. from search engines such as Google) cause unintended side effects
Note that APIs that follow this anti-pattern might actually end up being accidentally restful. Here is an example:
http://example.com/some-api?method=findCustomer&id=1234
Is this a URI that identifies an operation and its parameters, or does it identify a resource? You could argue both cases: This might be a perfectly valid, bookmarkable URI; doing a GET on it might be “safe”; it might respond with different formats according to the Accept header, and support sophisticated caching. In many cases, this will be unintentional. Often, APIs start this way, exposing a “read” interface, but when developers start adding “write” functionality, you find out that the illusion breaks (it’s unlikely an update to a customer would occur via a PUT to this URI – the developer would probably create a new one).
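To make the contrast concrete, here is a minimal sketch in Python 3 of a dispatcher that keys on the HTTP method and a resource URI instead of a `method=` query parameter. The `CUSTOMERS` store, the `/customers/<id>` layout and the `handle` function are all hypothetical names invented for this illustration; they do not come from any particular framework.

```python
# Hypothetical in-memory store; a real service would sit on a database.
CUSTOMERS = {"1234": {"name": "Alice"}}


def handle(method, path):
    """Return (status, body) for a request to /customers/<id>."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "customers":
        return 404, None
    customer_id = parts[1]
    if customer_id not in CUSTOMERS:
        return 404, None
    if method == "GET":
        # Safe: reading never changes server state.
        return 200, CUSTOMERS[customer_id]
    if method == "DELETE":
        # The side effect is requested explicitly through the verb.
        del CUSTOMERS[customer_id]
        return 204, None
    # Any other verb is not supported on this resource.
    return 405, None
```

With this shape, a crawler that only issues GET requests cannot delete anything, and the same URI keeps identifying the same customer whichever verb is applied to it.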

Tunneling everything through POST

This anti-pattern is very similar to the first one, only that this time, the POST HTTP method is used. POST carries an entity body, not just a URI. A typical scenario uses a single URI to POST to, and varying messages to express differing intents. This is actually what SOAP 1.1 web services do when HTTP is used as a “transport protocol”: It’s actually the SOAP message, possibly including some WS-Addressing SOAP headers, that determines what happens.
One could argue that tunneling everything through POST shares all of the problems of the GET variant; it’s just a little harder to use and can neither exploit caching (not even accidentally) nor support bookmarking. It actually doesn’t end up violating any REST principles so much – it simply ignores them.

Ignoring caching

Even if you use the verbs as they are intended to be used, you can still easily ruin caching opportunities. The easiest way to do so is by simply including a header such as this one in your HTTP response:
Cache-control: no-cache
Doing so effectively prevents caches from reusing anything without first going back to your server. Of course this may be what you intend to do, but more often than not it’s just a default setting that’s specified in your web framework. However, supporting efficient caching and re-validation is one of the key benefits of using RESTful HTTP. Sam Ruby suggests that a key question to ask when assessing something’s RESTfulness is “do you support ETags”? (ETags are a mechanism introduced in HTTP 1.1 to allow a client to validate whether a cached representation is still valid, by means of an opaque validator that is often a checksum of the representation.) The easiest way to generate correct headers is to delegate this task to a piece of infrastructure that “knows” how to do this correctly – for example, by generating a file in a directory served by a Web server such as Apache HTTPD.
Of course there’s a client side to this, too: when you implement a programmatic client for a RESTful service, you should actually exploit the caching capabilities that are available, and not unnecessarily retrieve a representation again. For example, the server might have sent the information that the representation is to be considered “fresh” for 600 seconds after a first retrieval (e.g. because a back-end system is polled only every 30 minutes). There is absolutely no point in repeatedly requesting the same information in a shorter period. Similarly to the server side of things, going with a proxy cache such as Squid on the client side might be a better option than building this logic yourself.
Caching in HTTP is powerful and complex; for a very good guide, turn to Mark Nottingham’s Cache Tutorial.
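As a rough illustration of the ETag re-validation described above, the following Python 3 sketch computes a validator from the response body and answers a conditional GET. The function names `make_etag` and `conditional_get` are hypothetical, hashing the body is only one possible way to derive an ETag, and the sketch assumes the client sends a single value in If-None-Match (lists and `*` are not handled).

```python
import hashlib


def make_etag(body):
    # An ETag is an opaque quoted string; a body hash is one way to get one.
    return '"%s"' % hashlib.sha256(body).hexdigest()


def conditional_get(body, if_none_match=None):
    """Return (status, headers, payload) for a GET on a representation."""
    etag = make_etag(body)
    # max-age=600 mirrors the "fresh for 600 seconds" example in the text.
    headers = {"ETag": etag, "Cache-Control": "max-age=600"}
    if if_none_match == etag:
        # The client's copy is still valid: send no body at all.
        return 304, headers, b""
    return 200, headers, body
```

A client would store the ETag from the first response and send it back in If-None-Match once its copy is no longer fresh; as long as the representation is unchanged, only headers travel over the wire.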

Ignoring status codes

Unknown to many Web developers, HTTP has a very rich set of application-level status codes for dealing with different scenarios. Most of us are familiar with 200 (“OK”), 404 (“Not found”), and 500 (“Internal server error”). But there are many more, and using them correctly means that clients and servers can communicate on a semantically richer level.
For example, a 201 (“Created”) response code signals that a new resource has been created, the URI of which can be found in a Location header in the response. A 409 (“Conflict”) informs the client that there is a conflict, e.g. when a PUT is used with data based on an older version of a resource. A 412 (“Precondition Failed”) says that the server couldn’t meet the client’s expectations.
Another aspect of using status codes correctly affects the client: The status codes in different classes (e.g. all in the 2xx range, all in the 5xx range) are supposed to be treated according to a common overall approach – e.g. a client should treat all 2xx codes as success indicators, even if it hasn’t been coded to handle the specific code that has been returned.
Many applications that claim to be RESTful return only 200 or 500, or even 200 only (with a failure text contained in the response body – again, see SOAP). If you want, you can call this “tunneling errors through status code 200”, but whatever you consider to be the right term: if you don’t exploit the rich application semantics of HTTP’s status codes, you’re missing an opportunity for increased re-use, better interoperability, and looser coupling.
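The rule that a client should fall back on the class of a status code can be sketched in a few lines of Python 3. The `classify` helper is a hypothetical name; the class names follow the five ranges defined by HTTP.

```python
def classify(status):
    """Map an HTTP status code to its class, based on the first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(status // 100, "unknown")
```

A client written this way treats a 226 it has never seen before as a success, and a 409 as its own mistake rather than the server's, without needing a case for every individual code.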

Misusing cookies

Using cookies to propagate a key to some server-side session state is another REST anti-pattern.
Cookies are a sure sign that something is not RESTful. Right? No; not necessarily. One of the key ideas of REST is statelessness – not in the sense that a server can not store any data: it’s fine if there is resource state, or client state. It’s session state that is disallowed due to scalability, reliability and coupling reasons. The most typical use of cookies is to store a key that links to some server-side data structure that is kept in memory. This means that the cookie, which the browser passes along with each request, is used to establish conversational, or session, state.
If a cookie is used to store some information, such as an authentication token, that the server can validate without reliance on session state, cookies are perfectly RESTful – with one caveat: They shouldn’t be used to encode information that can be transferred by other, more standardized means (e.g. in the URI, some standard header or – in rare cases – in the message body). For example, it’s preferable to use HTTP authentication from a RESTful HTTP point of view.
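One way to build a token that the server can validate without any session state is to sign it. The Python 3 sketch below is an assumption-laden illustration, not a complete scheme: `SECRET_KEY`, `sign` and `verify` are hypothetical names, and a production token would also carry an expiry time.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it never leaves the server.
SECRET_KEY = b"change-me"


def _mac(username):
    return hmac.new(SECRET_KEY, username.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def sign(username):
    """Return a token of the form '<username>:<hex mac>'."""
    return username + ":" + _mac(username)


def verify(token):
    """Return the username if the signature is valid, else None."""
    username, sep, mac = token.rpartition(":")
    if not sep:
        return None
    expected = _mac(username)
    # compare_digest avoids leaking information through timing.
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return username
    return None
```

Because the check needs only the secret key, any server in a cluster can validate the token, and nothing has to be kept in memory between requests.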

Forgetting hypermedia

The first REST idea that’s hard to accept is the standard set of methods. REST theory doesn’t specify which methods make up the standard set, it just says there should be a limited set that is applicable to all resources. HTTP fixes them at GET, PUT, POST and DELETE (primarily, at least), and casting all of your application semantics into just these four verbs takes some getting used to. But once you’ve done that, people start using a subset of what actually makes up REST – a sort of Web-based CRUD (Create, Read, Update, Delete) architecture. Applications that expose this anti-pattern are not really “unRESTful” (if there even is such a thing), they just fail to exploit another of REST’s core concepts: hypermedia as the engine of application state.
Hypermedia, the concept of linking things together, is what makes the Web a web – a connected set of resources, where applications move from one state to the next by following links. That might sound a little esoteric, but in fact there are some valid reasons for following this principle.
The first indicator of the “Forgetting hypermedia” anti-pattern is the absence of links in representations. There is often a recipe for constructing URIs on the client side, but the client never follows links because the server simply doesn’t send any. A slightly better variant uses a mixture of URI construction and link following, where links typically represent relations in the underlying data model. But ideally, a client should have to know a single URI only; everything else – individual URIs, as well as recipes for constructing them e.g. in case of queries – should be communicated via hypermedia, as links within resource representations. A good example is the Atom Publishing Protocol with its notion of service documents, which offer named elements for each collection within the domain that it describes. Finally, the possible state transitions the application can go through should be communicated dynamically, and the client should be able to follow them with as little before-hand knowledge of them as possible. A good example of this is HTML, which contains enough information for the browser to offer a fully dynamic interface to the user.
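A small Python 3 sketch may help to show what "following links" means for a programmatic client. The `ORDER` representation, its `links` layout and the relation names are hypothetical; real formats such as Atom or HTML define their own link syntax.

```python
# Hypothetical representation of an order as the server might send it.
ORDER = {
    "id": 42,
    "status": "open",
    "links": [
        {"rel": "self", "href": "/orders/42"},
        {"rel": "payment", "href": "/orders/42/payment"},
        {"rel": "cancel", "href": "/orders/42/cancellation"},
    ],
}


def find_link(representation, rel):
    """Return the href of the first link with the given relation, or None."""
    for link in representation.get("links", []):
        if link["rel"] == rel:
            return link["href"]
    return None
```

The client knows relation names such as "payment", not URI recipes. If the server later moves payments to another path or host, the client keeps working, and a missing "cancel" link tells it that this transition is currently not available.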
I considered adding “human readable URIs” as another anti-pattern. I did not, because I like readable and “hackable” URIs as much as anybody. But when someone starts with REST, they often waste endless hours in discussions about the “correct” URI design, but totally forget the hypermedia aspect. So my advice would be to limit the time you spend on finding the perfect URI design (after all, they’re just strings), and invest some of that energy into finding good places to provide links within your representations.

Ignoring MIME types

HTTP’s notion of content negotiation allows a client to retrieve different representations of resources based on its needs. For example, a resource might have a representation in different formats such as XML, JSON, or YAML, for consumption by consumers implemented in Java, JavaScript, and Ruby respectively. Or there might be a “machine-readable” format such as XML in addition to a PDF or JPEG version for humans. Or it might support both the v1.1 and the v1.2 versions of some custom representation format. In any case, while there may be good reasons for having one representation format only, it’s often an indication of another missed opportunity.
It’s probably obvious that the more unforeseen clients are able to (re-)use a service, the better. For this reason, it’s much better to rely on existing, pre-defined, widely-known formats than to invent proprietary ones – an argument that leads to the last anti-pattern addressed in this article.
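To illustrate content negotiation, here is a simplified Python 3 sketch of how a server might pick a representation from the Accept header. The names `parse_accept` and `choose` are hypothetical, and the sketch deliberately leaves out partial wildcards such as `text/*` and the rule that `q=0` means "not acceptable".

```python
def parse_accept(header):
    """Return the media types of an Accept header, most preferred first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by original position.
        entries.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def choose(header, available, default):
    """Pick the first acceptable type the server offers, else the default."""
    for media_type in parse_accept(header):
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    return default
```

Falling back to a default is only one possible policy; a stricter server could answer 406 (“Not Acceptable”) when nothing matches.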

Breaking self-descriptiveness

This anti-pattern is so common that it’s visible in almost every REST application, even in those created by those who call themselves “RESTafarians” – myself included: breaking the constraint of self-descriptiveness (which is an ideal that has less to do with AI science fiction than one might think at first glance). Ideally, a message – an HTTP request or HTTP response, including headers and the body – should contain enough information for any generic client, server or intermediary to be able to process it. For example, when your browser retrieves some protected resource’s PDF representation, you can see how all of the existing agreements in terms of standards kick in: some HTTP authentication exchange takes place, there might be some caching and/or revalidation, the content-type header sent by the server (“application/pdf”) triggers the startup of the PDF viewer registered on your system, and finally you can read the PDF on your screen. Any other user in the world could use his or her own infrastructure to perform the same request. If the server developer adds another content type, any of the server’s clients (or service’s consumers) just need to make sure they have the appropriate viewer installed.
Every time you invent your own headers, formats, or protocols you break the self-descriptiveness constraint to a certain degree. If you want to take an extreme position, anything not being standardized by an official standards body breaks this constraint, and can be considered a case of this anti-pattern. In practice, you strive for following standards as much as possible, and accept that some convention might only apply in a smaller domain (e.g. your service and the clients specifically developed against it).

Summary

Ever since the “Gang of Four” published their book, which kick-started the patterns movement, many people misunderstood it and tried to apply as many patterns as possible – a notion that has been ridiculed for equally as long. Patterns should be applied if, and only if, they match the context. Similarly, one could religiously try to avoid all of the anti-patterns in any given domain. In many cases, there are good reasons for violating any rule, or in REST terminology: relax any particular constraint. It’s fine to do so – but it’s useful to be aware of the fact, and then make a more informed decision.
Hopefully, this article helps you to avoid some of the most common pitfalls when starting your first REST projects.
Many thanks to Javier Botana and Burkhard Neppert for feedback on a draft of this article.
Stefan Tilkov is the lead editor of InfoQ’s SOA community and co-founder, principal consultant and lead RESTafarian of Germany/Switzerland-based innoQ.

Wednesday, July 25, 2012

How to use HTTP cookies in Python


by Jay Conrod
posted on 2009-02-08
If you've ever done any web programming, or even used a web browser, you've almost certainly heard of HTTP cookies. Cookies are basically key-value pairs that a web server can set for a client's web browser. This lets server programs track what pages a user has visited or what actions the user has performed. Cookies are very useful for implementing things like log-in sessions or shopping carts.
A perfect example of real-life cookies is the parking garage. When you enter a parking garage, a machine gives you a ticket, which has the current time printed on it. You leave the ticket in your car while you go watch a movie. When you leave the garage, you put the ticket in another machine, and it charges you based on the time on the ticket and the current time. In this case, the ticket is like a cookie, since it stores a small amount of information that you keep until later. Your car is like your web browser because it keeps the ticket for you while you do other things. The machines that generate and accept tickets are like scripts on a web server.
A log-in session for a website works almost the same way. You start at a log-in page and type your username and password. The server validates your credentials and gives you a cookie. The cookie contains something that identifies your session, usually just a unique number the server generated for you. The server remembers this session identifier as long as it is valid, and maps it to your username in its database. Each time you load a new page on that site, your browser sends the cookie to the server, which uses it to customize the page for you. When you log out, the server invalidates the session in the database and tells your browser to delete the cookie.
A shopping cart is another good application of cookies. When you first visit a site, your browser has no cookie, so the server creates a new shopping cart in its database and sends you a cookie with a key that identifies your cart. Every time you add an item to the cart, your browser sends the cookie so the server knows which cart the item should be placed in.
In addition to holding a simple key-value pair, several other properties can be associated with a cookie. Cookies can have a specific domain (the server's domain by default), a specific path (the browser only sends the cookie when loading pages under that path), and an expiration date. If no expiration date is set, the browser will delete the cookie when it shuts down. There is also a "version" attribute which should be set to 1, indicating how the cookie should be interpreted. This is required by the specification, but browsers don't seem to care if you don't set it.
Let's get down to business. How do we read and write cookies using Python CGI scripts? When a CGI script is executed, the first few lines of its output are interpreted as HTTP headers, and the rest of the output is the actual content the browser displays. A very simple "Hello, world" script looks like this:
#!/usr/bin/env python

print """Content-Type: text/plain

Hello, world!"""
The "Content-Type" line is actually an HTTP header. This is the only required header, but you can add more headers to control how the browser interprets your page. In particular, you can use a "Set-Cookie" header like this:
Content-Type: text/html
Set-Cookie: session=12345
You can replace session and 12345 with any key and value you like. You can set other cookie attributes like this:
Set-Cookie: session=12345; expires=Sun, 07-Feb-2010 03:10:00 GMT; path=/; domain=.jayconrod.com; version=1
The browser will store the cookie until it expires. Every time it loads a new page with the appropriate domain and path, it will submit cookies using a Cookie HTTP header like this:
Cookie: session=12345
CGI scripts cannot access client HTTP headers directly, but you can access all cookies sent from the client using the HTTP_COOKIE environment variable. This environment variable is formatted as a series of key=value pairs delimited by semicolons. This script will print the value of the variable:
#!/usr/bin/env python

import os

print "Content-type: text/plain\n"

if "HTTP_COOKIE" in os.environ:
    print os.environ["HTTP_COOKIE"]
else:
    print "HTTP_COOKIE not set!"
Reading and writing cookies manually can be tedious, since you have to deal with formatting and parsing text. You can't assume that the client will send you valid cookies, so you also have to be prepared for errors. Python's Cookie module will do the hard work for you. It provides the SimpleCookie class, which you can initialize using the actual HTTP_COOKIE string sent to your script. You can also initialize it with your own attributes and generate an appropriate HTTP header using the output() method. Here's an example for setting a cookie:
#!/usr/bin/env python

import Cookie
import datetime
import random

# Cookie expiration dates are expressed in GMT, so start from UTC.
expiration = datetime.datetime.utcnow() + datetime.timedelta(days=30)
cookie = Cookie.SimpleCookie()
# randint() takes an inclusive lower and upper bound.
cookie["session"] = random.randint(0, 1000000000)
cookie["session"]["domain"] = ".jayconrod.com"
cookie["session"]["path"] = "/"
cookie["session"]["expires"] = \
  expiration.strftime("%a, %d-%b-%Y %H:%M:%S GMT")

print "Content-type: text/plain"
print cookie.output()
print
print "Cookie set with: " + cookie.output()
You can set more than one cookie at a time using this module (it refers to each one as a "morsel").
This next example shows how to read the same cookie:
#!/usr/bin/env python

import Cookie
import os

print "Content-type: text/plain\n"

try:
    cookie = Cookie.SimpleCookie(os.environ["HTTP_COOKIE"])
    print "session = " + cookie["session"].value
except (Cookie.CookieError, KeyError):
    print "session cookie not set!"
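The scripts above are written for Python 2. In Python 3 the Cookie module is named http.cookies, and SimpleCookie is used in the same way. The following sketch, with made-up cookie values, shows both directions in one place: producing a Set-Cookie header and parsing a Cookie header string such as the one found in HTTP_COOKIE.

```python
from http.cookies import SimpleCookie

# Server to browser: build a Set-Cookie header line.
outgoing = SimpleCookie()
outgoing["session"] = "12345"
outgoing["session"]["path"] = "/"
header_line = outgoing.output()

# Browser to server: parse the string found in HTTP_COOKIE.
incoming = SimpleCookie("session=12345; theme=dark")
session_value = incoming["session"].value
theme_value = incoming["theme"].value

if __name__ == "__main__":
    print(header_line)
    print("session = " + session_value)
```

Each key of a SimpleCookie maps to a morsel, so `incoming["theme"]` can be read independently of `incoming["session"]`.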
Always keep security in mind when programming with cookies. Since cookies are submitted by the client they can easily be forged, so don't write scripts where security depends on cookies being accurate. For instance, if you're writing a log-in script, don't store just the username (or password!) in the cookie, since an attacker could substitute anyone's username in the cookie and gain access to that person's information.
Usually the best thing to do is to store a large, random number as a session ID. This ID can be used as a key in a database to associate any relevant information with that session. If the numbers come from a large range, an attacker would have a difficult time guessing a valid ID. However, make sure that the random number generator you use isn't seeded with the current time. If the attacker knows the random generator algorithm, she can seed her own random number generator with the current time to produce session IDs that might be valid for someone else. The Python random number generator is seeded by default using os.urandom(), which is fine on Linux, Windows, OS X, and other platforms that support it.
Finally, be aware that if cookies are sent in plaintext, they can be stolen and copied by an eavesdropper. If you have a website where the log-in form is encrypted but other pages are not, an attacker can hijack a user's session when the user loads a non-secure page. The only way to protect against this is to encrypt every page for which the browser will send a cookie, which may have a hefty performance penalty. If you do end up doing this, add the secure flag to cookies you set. This instructs the browser not to send the cookie when requesting unencrypted content, which might happen if an encrypted page has unencrypted parts, such as images.
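The advice in the last two paragraphs can be sketched together in Python 3. This is an illustration under assumptions rather than the article's own method: it swaps the random module for the standard secrets module (available since Python 3.6), which draws from the operating system's cryptographic random source, and `new_session_cookie` is a hypothetical helper name.

```python
import secrets
from http.cookies import SimpleCookie


def new_session_cookie():
    """Return (session_id, header_line) for a fresh, secure-only session."""
    # 16 random bytes, hex encoded: 128 bits that are impractical to guess.
    session_id = secrets.token_hex(16)
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["path"] = "/"
    # Ask the browser to send this cookie over encrypted connections only.
    cookie["session"]["secure"] = True
    return session_id, cookie.output()
```

The server would store the returned session ID as a database key, as described above, and emit the header line along with the other response headers.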

Thursday, July 19, 2012

A Programmer's Guide to Leveling Up Technical Skills (程序员技术练级攻略)


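On June 12, 月光博客 (Moonlight Blog) published "A Letter to Novice Programmers", translated from "An open letter to those who want to start programming". A friend of mine (his ID on this site is Mailper) told me he hoped to see a more hands-on article on CoolShell (酷壳). Since he is also someone who loves programming and technology, I asked him to summarize what he had picked up while learning Python and Web programming. He sent me his notes and experiences; I made a few additions and changes and, based on my own experience, added the "Intermediate" section. This article was written by a newcomer and by an old hand like me, based on what we have each been through.
My friend titled this article "Build Your Programming Technical Skills". I really did not know how to render that in Chinese, but while writing I felt that the process resembles doing quests and leveling up in an online game, so I called it a "technical leveling guide". The title is a bit grand, heh; it is purely for fun. This is only a sharing of Mailper's and my own learning experiences. (Note: I have left out some technologies I learned as a beginner that are clearly obsolete today, such as Delphi/PowerBuilder, as well as some I learned and found uninteresting: Lotus Notes/ActiveX/COM/ADO/ATL/.NET ...)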

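Preface

Do you feel that by the time you graduated you had only ever written toy programs? Once you start working, even with little experience, you can go through the extracurricular exercises below. (My friend's complaint: school courses always start from theory, and the assignments never seem to have any practical use; it would be better to start from the needs of real work.)
Suggestions:
  • Don't buy books at random and don't chase every new technology and buzzword. The fundamentals were accumulated over a long time and will remain applicable for at least the next 10 years.
  • Look back at history and at how technology developed along the timeline; only then can you see what tomorrow will look like.
  • Always get your hands dirty. However simple an example is, type it in yourself at least once to check that you understand its fine details.
  • Always think. Ask why something is done this way and not another way, and learn to generalize from one case to others.
Note: you may wonder why what follows leans so heavily toward Unix/Linux. It is because I think programming on Windows may have little future, for these reasons:
  • User interfaces today are dominated by two things: 1) the Web and 2) mobile devices running iOS or Android. The Windows GUI is no longer in demand.
  • More and more companies build their systems on low-cost, high-performance Linux and open-source technology; Windows costs too much.
  • Microsoft's technology changes too fast and does not last; they are simply toying with programmers. For details see "A History of Revolutions in Windows Programming" (《Windows编程革命史》).
So I personally think the trend is Web + mobile on the front end and Linux + open source on the back end. Windows has basically no part to play in development.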

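Getting Started

1. Learn a scripting language, such as Python or Ruby
It frees you from the fear of low-level languages and lets you quickly build small programs you can actually use. Practice projects:
  • Process text files or CSV (keywords: python csv, python open, python sys): read a local file and process it line by line (for example a word count, or log processing)
  • Walk the local file system (sys, os, path): for example, write a program that collects the sizes of all files under a directory, sorts them by various criteria and saves the result
  • Work with a database (python sqlite): write a small script that counts the rows in a database
  • Learn to debug with simple, blunt tools such as print
  • Learn to use Google (phrase, domain, use reader to follow tech blogs)
Why learn a scripting language? Because they are so convenient. We often need a small tool or script to solve a problem, and that is when you discover how cumbersome the "proper" programming languages are.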
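As a possible starting point for the first two practice projects in the list above, here is a short Python 3 sketch. The function names `word_count` and `file_sizes` are hypothetical, and splitting on whitespace is a deliberately naive definition of a "word".

```python
import os
from collections import Counter


def word_count(lines):
    """Count whitespace-separated words, ignoring case."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def file_sizes(root):
    """Return (size_in_bytes, path) pairs for all files under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Skip files that vanish or cannot be read.
                pass
    return sorted(sizes, reverse=True)
```

An open file object can be passed straight to `word_count`, because iterating over a file yields its lines one at a time.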
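2. Get fluent in a programmer's editor (not an IDE) and a few basic tools
  • Vim / Emacs / Notepad++: learn how to configure code completion, appearance, external commands and so on.
  • Source Insight (or ctags)
You use these not to look cool, but because these editors make viewing and editing code, configuration files and logs faster and more efficient.
3. Get familiar with the Unix/Linux shell and common command-line tools
  • If you use Windows, at least learn to use Linux in a virtual machine. VMware Player is free, so install Ubuntu
  • Use the graphical interface as little as you possibly can.
  • Learn to use man to read the documentation
  • File system layout and basic operations: ls/chmod/chown/rm/find/ln/cat/mount/mkdir/tar/gzip ...
  • Learn some text-processing commands: sed/awk/grep/tail/less/more ...
  • Learn some administration commands: ps/top/lsof/netstat/kill/tcpdump/iptables/dd ...
  • Get to know the configuration files under /etc, and learn to read the system logs under /var/log and the runtime information under /proc
  • Learn regular expressions, and use them to search files.
For a programmer, Unix/Linux is much simpler than Windows. (See my CSDN post from four years ago, "Unix Is Actually Simple" (《其实Unix很简单》).) Once you can use Unix/Linux, you will find that graphical interfaces are at times really hard to use and drag down productivity considerably.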
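4. Learn the basics of the Web (HTML/CSS/JS) plus server-side technology (LAMP)
The future is bound to be a Web world. The best site for learning Web basics is W3School.
  • Learn basic HTML syntax
  • Learn how CSS selects HTML elements and applies basic styles (keyword: box model)
  • Learn to use Firefox + Firebug or Chrome to inspect the structure of pages you find impressive, and to modify them live.
  • Learn to manipulate HTML elements with JavaScript. Understand the DOM and dynamic web pages (http://oreilly.com/catalog/9780596527402); there are free chapters online, and they are enough. Or see DOM.
  • Learn to debug JavaScript with Firefox + Firebug or Chrome (breakpoints, inspecting variables, performance, console and so on)
  • Configure Apache or Nginx on a machine
  • Learn PHP, make back-end PHP and front-end HTML exchange data, and form a first understanding of how a server responds to browser requests. Implement a form that is submitted and then displays the submitted values.
  • Connect PHP to a local or remote MySQL database (for MySQL and SQL, learning as you go is enough)
  • Follow a web programming course from a top university to the end (for example http://www.stanford.edu/~ouster/cgi-bin/cs142-fall10/index.php). Don't assume you need more than a semester: university students take 3 to 5 courses per semester full-time, so you can certainly keep up in your spare time
  • Learn a JavaScript library (such as jQuery or ExtJS) + Ajax (asynchronously load an image or database content from the server) + the JSON data format.
  • "HTTP: The Definitive Guide": after reading the first 4 chapters you will understand what happens every day when you use your browser (proxy, gateway, browsers)
  • Build a small website (for example a small message board with user login, Cookie/Session, create, delete, update and query operations, image attachment upload, and paginated display)
  • Buy a domain, rent some hosting space, and build your own website.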

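Intermediate

1. C and operating system calls
  • Relearn C, understand pointers and the memory model, and implement the classic algorithms and data structures in C. Recommended: "The Art of Computer Programming", "Introduction to Algorithms" and "Programming Pearls".
  • Study the free MIT course Introduction to Computer Science and Programming
  • Study the free MIT course on memory management in C
  • Learn Unix/Linux system calls ("Advanced Programming in the UNIX Environment") to understand things at the system level.
    • Use this knowledge to work with the file system and users (write a small program that copies a directory tree)
    • Write a multi-process program with fork/wait/waitpid, and a multi-threaded program with synchronization or mutual exclusion using pthread, for example a multi-process ticket-purchasing program.
    • Use signal/kill/raise/alarm/pause/sigprocmask to write a program in which several processes communicate through signals.
    • Learn to build and debug programs with gcc and gdb (see my "Debugging Programs with gdb" (《用gdb调试程序》))
    • Learn to build programs with a makefile (see my "Let's Write a Makefile Together" (《跟我一起写makefile》))
    • IPC and sockets can be left for practice at the advanced level.
  • Learn Windows SDK programming ("Programming Windows", "Programming Windows with MFC")
    • Write a window, and understand WinMain/WinProcedure and the Windows message mechanism.
    • Write some programs that work with resource files and the various graphical controls of the Windows SDK, and do some drawing.
    • Learn to use MSDN to look up SDK functions, the various WM_ messages and sample programs.
    • These books contain many sample programs. When practicing, do not copy them verbatim; try to write your own.
    • There is no need to master all of this, because the GUI is being replaced by the Web; the point is to get a feel for Windows GUI programming. @virushuo says: "I think GUI is indeed less hot now, but fully understanding how a GUI works is important. That includes mobile development, which remains hard without the fundamentals. In other words, mobile development requires understanding how GUIs work, whether you learn it on Windows or on Mac/iOS."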
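2. Learn Java
  • Learning Java mostly means reading the classics "Core Java" (《Java 核心技术编程》) and "Thinking in Java" (《Java编程思想》). ("Core Java" comes in two volumes; I only linked the first, which is enough, because a passing knowledge of Java GUIs will do.)
  • Learn the JDK, and learn to consult the Java API documentation: http://download.oracle.com/javase/6/docs/api/
  • Understand how Java, as a virtual-machine language, differs from C and Python in compilation and execution. Think about what "cross-platform" means from the viewpoints of C, Java and Python.
  • Learn the Eclipse IDE, and use it to build, debug and develop Java programs.
  • Set up a Tomcat site and try Web development with JSP/Servlet/JDBC/MySQL. Reimplement the small PHP project mentioned earlier with JSP and Servlets.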
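3. Web security and architecture
  • Learn HTML5. There are plenty of tutorials online, and CoolShell has covered many before, so I will not list them here.
  • Learn about security issues in Web development (see the attack on Sina Weibo, and this article about Ruby)
  • Learn the rewrite mechanism of HTTP servers, Nginx reverse proxying, and FastCGI (for example PHP-FPM)
  • Learn static page caching for the Web.
  • Learn Web architectures for asynchronous workflow processing, data caching, data partitioning, load balancing and horizontal scaling.
  • Practice tasks:
    • Make some Web animations with the HTML5 canvas.
    • Try SQL injection, JS injection and XSS attacks against the Web application you developed earlier.
    • Rebuild that Web application as a site on Nginx + PHP-FPM + static page caching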
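4. Learn relational databases
  • You can install MS SQL Server or MySQL to learn databases.
  • Learn the normal forms of database design from the textbooks: 1NF, 2NF, 3NF, ...
  • Learn stored procedures, triggers, views, index creation, cursors and so on.
  • Learn SQL and understand the various kinds of table joins (see "A Visual Explanation of SQL Joins" (《SQL Join的图示》))
  • Learn how to optimize database queries (see "MySQL Optimization" (《MySQL的优化》))
  • Practice task: design a database for a forum that satisfies at least 3NF, and use SQL to query the newest posts of this week and this month, the most-commented posts, and the most active users.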
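One possible shape for the forum practice task, sketched with Python 3's built-in sqlite3 module so that it runs without a database server. The table and column names and the sample rows are made up for illustration, and the schema is only a starting point, not a full forum design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE posts (
    id         INTEGER PRIMARY KEY,
    author_id  INTEGER NOT NULL REFERENCES users(id),
    title      TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE comments (
    id        INTEGER PRIMARY KEY,
    post_id   INTEGER NOT NULL REFERENCES posts(id),
    author_id INTEGER NOT NULL REFERENCES users(id),
    body      TEXT NOT NULL
);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO posts VALUES (1, 1, 'Hello', '2012-07-01'),
                         (2, 2, 'Pipes', '2012-07-15');
INSERT INTO comments VALUES (1, 2, 1, 'nice'),
                            (2, 2, 1, 'thanks'),
                            (3, 1, 2, 'hi');
""")

# Posts ordered by number of comments, most commented first.
most_commented = conn.execute("""
    SELECT p.title, COUNT(c.id) AS n
    FROM posts p LEFT JOIN comments c ON c.post_id = p.id
    GROUP BY p.id
    ORDER BY n DESC
""").fetchall()

# The user who wrote the most comments.
most_active = conn.execute("""
    SELECT u.name, COUNT(c.id) AS n
    FROM users u JOIN comments c ON c.author_id = u.id
    GROUP BY u.id
    ORDER BY n DESC
    LIMIT 1
""").fetchone()
```

The LEFT JOIN keeps posts that have no comments in the result with a count of 0, which an inner join would silently drop.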
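5. Some development tools
  • Learn to manage program versions with SVN or Git.
  • Learn to unit-test Java with JUnit.
  • Learn a coding standard or coding guideline for C and for Java. (Years ago I wrote a very simple article about C, "Programming Discipline" (《编程修养》); you can find plenty of material like this online.)
  • Recommended reading: "Code Complete", "Refactoring" and "Clean Code".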

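Advanced

1. C++ / Java and object orientation
In my opinion, once you have learned C++ well, Java takes hardly any effort. But the C++ learning curve is quite steep. Still, I think C++ is the language most worth learning well. See two amusing pieces: "The C++ Learning Confidence Chart" and "Learn C++ in 21 Days".
  • Study the free MIT course on object-oriented programming in C++
  • Read the books recommended in my "How to Learn C++ Well" at least twice (if your understanding of C++ goes as deep as what I wrote in "C++ Virtual Function Tables Explained" (《C++虚函数表解析》), "C++ Object Memory Layout", or "The Pitfall of Returning Internal Static Members in C/C++" (《C/C++返回内部静态成员的陷阱》), that is excellent)
  • Then ask yourself why C++ does things this way and Java does not. You must learn to compare C++ and Java, for example in initialization, garbage collection, interfaces, exceptions, virtual functions and so on.
  • Practice tasks:
    • Implement a BigInt in C++ that supports addition, subtraction, multiplication and division of 128-bit integers.
    • Wrap a data-structure container in C++, for example a hash table.
    • Design and implement a smart pointer in C++ (be sure to use templates).
  • "Design Patterns" is a must-read, at least twice; think about where each of the 23 patterns applies. Two main points: 1) favor composition over inheritance, 2) favor interfaces over implementations. ("Head First Design Patterns" is also recommended.)
  • Practice tasks:
    • Implement a memory pool using the Factory pattern.
    • Use the Strategy pattern to build a class that can left-align, right-align and center a text file.
    • Use the Command pattern to implement a command-line calculator that supports undo and redo.
    • Use the Decorator pattern to implement a hotel room pricing policy, with factors such as peak season, services, VIP status and tour groups affecting the price.
  • Learn how to use the STL and its design concepts: containers, algorithms, iterators, functors. If possible, read its source code.
  • Practice task: try combining object orientation, the STL, design patterns and Windows SDK graphics programming
    • Build a Snake or Tetris game that supports different levels and difficulties.
    • Build a file browser that lists the files in a directory and handles different file types differently: text files are opened for editing, executables are run, mp3 or avi files are played, and image files are displayed.
  • Study the design of some C++ libraries, such as MFC (read Hou Jie's "Dissecting MFC" (《深入浅出MFC》)), Boost, ACE, CppUnit and the STL. (The STL may be too hard, but it is great if you can understand its design patterns and design, and excellent if you can go as deep as my "Copy-on-Write in the STL string Class" (《STL string类的写时拷贝技术》). ACE requires strong systems knowledge; see "Strengthen your understanding of the system" below.)
  • Java is a truly object-oriented language. It has design patterns in abundance and is the best language for learning object-oriented design patterns (see "Design Patterns in Java").
  • Recommended reading: "Effective Java" and "Java Puzzlers".
  • Learn Java frameworks. There are many, such as Spring, Hibernate and Struts; the main thing is to learn Java design ideas such as IoC.
  • Java technologies are also overwhelmingly numerous. Focus on the J2EE architecture and on messaging and remote-invocation technologies such as JMS and RMI.
  • Learn to build Web Services in Java (see the official tutorial)
  • Practice task: under the Spring or Hibernate framework, try to build a networked program that calls a Web Service remotely, and pass messages between two services via JMS.
Neither C++ nor Java can be learned well in a short time. C++ is about depth and Java is about breadth; I suggest choosing one of the two. My own path was:
  • Dig deep into C++ (I have been digging into C/C++ for over ten years)
  • Learn the various design patterns in Java.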
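2. Strengthen your understanding of the system
Focus on reading the following books:
  • "The Art of Unix Programming": learn the design and development philosophy, culture, principles and experience of the Unix world. You are sure to feel enlightened.
  • "Unix Network Programming, Volume 1: The Sockets Networking API": after this book you will understand network programming. Pay special attention to TCP and UDP, and to the differences among the multiplexing system calls select/poll/epoll.
  • "TCP/IP Illustrated, Volume 1: The Protocols": after this book you could become a network hacker. Understand how Ethernet works, the TCP/IP protocols and how they operate, and how to tune TCP.
  • Practice tasks:
    • Understand the I/O techniques of blocking (synchronous I/O), non-blocking (asynchronous I/O) and multiplexing (select, poll, epoll).
    • Write a network chat program with a chat server and several chat clients (the server uses UDP to multicast or broadcast to some or all of the clients).
    • Write a simple HTTP server.
  • "Unix Network Programming, Volume 2: Interprocess Communications": semaphores, pipes, shared memory, message queues and other kinds of IPC. These techniques may seem a bit dated, but they are still worth knowing.
  • Practice tasks:
    • Mainly practice the various IPC methods.
    • Try writing a pipe program in which parent and child processes exchange data through a pipe.
    • Try writing a shared-memory program in which two processes exchange an array of C structs through shared memory.
  • Study the book "Windows Core Programming" (《Windows核心编程》). Master these major topics: CreateProcess, Windows threads, thread scheduling, thread synchronization (events, semaphores, mutexes), asynchronous I/O, memory management and DLLs.
  • Practice tasks: use CreateProcess to launch Notepad or IE and monitor the running program. Reimplement the simple HTTP server written earlier with a thread pool. Write a DLL hook that monitors the close event of a given window, or records the keystrokes in a window.
  • With a grounding in multithreading, interprocess communication, TCP/IP, sockets, C++ and design patterns, you can start studying ACE. Rewrite the chat program and the HTTP server (with a thread pool) using ACE.
  • Practice tasks: using all of the above, try to
    • Write a server that sends a large file to a client and uses more than 80% of a 100M link. (Note that disk I/O and network I/O may well become a problem; think about how to solve that. Also mind the network's maximum transmission unit, the MTU.)
    • Understand how BitTorrent downloading works, and simulate it using multiple processes.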
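The pipe exercise from the practice tasks above is meant to be done in C, but the same system calls are exposed by Python's os module, which makes for a quick first experiment. The sketch below assumes a Unix-like system (os.fork does not exist on Windows); `ask_child` is a hypothetical name, and a single read of up to 4096 bytes is a simplification that only suits short messages.

```python
import os


def ask_child(message):
    """Fork a child that upper-cases the bytes the parent sends through a pipe."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: read the request, write the reply, then exit immediately.
        try:
            os.close(to_child_w)
            os.close(to_parent_r)
            data = os.read(to_child_r, 4096)
            os.write(to_parent_w, data.upper())
        finally:
            os._exit(0)
    # Parent: close the ends it does not use, send, then wait for the reply.
    os.close(to_child_r)
    os.close(to_parent_w)
    os.write(to_child_w, message)
    os.close(to_child_w)
    reply = os.read(to_parent_r, 4096)
    os.close(to_parent_r)
    os.waitpid(pid, 0)
    return reply
```

Two pipes are used because a pipe is one-directional; closing the unused ends in each process is what lets a reader see end-of-file once the writer is done.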
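3. System architecture
  • Load balancing: hash-based and fully dynamic. (Search Google Scholar for some papers on load balancing.)
  • Multi-tier distributed systems: client service node tier, compute node tier, data cache tier, data tier. J2EE is the classic multi-tier architecture.
  • CDN systems: access from the nearest location, with content pushed to the edge.
  • P2P systems: study the algorithms of BitTorrent and eDonkey, for example DHT.
  • Server backup and dual-machine backup systems (Live-Standby and Live-Live): how do two machines monitor each other through heartbeats? Backup of the cluster master node.
  • Virtualization: with it, an operating system can be switched, reconfigured and redeployed as if it were an application.
  • Learn Thrift, a high-performance binary communication middleware that supports data (object) serialization and several kinds of RPC services.
  • Learn Hadoop. The core of the Hadoop framework is MapReduce and HDFS. The idea of MapReduce became widely known through a Google paper; in one sentence, MapReduce is "splitting up tasks and aggregating the results". HDFS stands for Hadoop Distributed File System and provides the underlying storage for distributed computing.
  • Get to know NoSQL databases (some say this may be an over-hyped technology). Still, as very large, highly concurrent, fully dynamic websites become mainstream, and as social networking sites have hard real-time requirements for data access, NoSQL databases have gradually become a focus of attention and show signs of replacing relational databases as the mainstream data storage model of the future. There are many NoSQL databases today, most of them open source; the better-known ones include MemcacheDB, Redis, Tokyo Cabinet (successor: Kyoto Cabinet), Flare, MongoDB, CouchDB, Cassandra and Voldemort.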
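Having written so much, I look back and feel quite a sense of accomplishment. I hope nobody is scared off. I have been learning continuously for these ten-odd years and am still learning today; life is a process of continual learning and leveling up. There are certainly omissions and mistakes here, and I hope you will add to and correct them. (I will update this article as feedback comes in.) You are welcome to reach me on Weibo (@左耳朵耗子) and Twitter (@haoel).
----- Update 2011/07/19 -----
1) Some friends wondered why I mentioned Web + mobile at the start of this article but said nothing later about iOS/Android front-end development. That is because I have a feeling that the UI on mobile devices will eventually be taken over by JavaScript as well. Look at Google+ on an iPhone or Android device and you will see what I mean.
2) Some friends said there is too much here and that one should not learn for the sake of learning. I fully agree, and I also said at the start of the article that you have to think. Also, please do not assume that what I list here is new technology: more than 95% of this guide is fundamentals, and battle-tested fundamentals at that. They are techniques that help you understand everything else once you understand them, and techniques that can get you a good job.
3) Some friends said that by the time you have learned all this you will be 40, and that it would be better to think about how to make money. Let me say: first, I am not yet 40 this year; second, there is no end to learning; third, I do not think making money is that hard. What is hard is making yourself worth that much money. Whether you are an employee or a founder, what makes you, or your company, more valuable? I cannot speak for other fields, but for Internet and IT companies, technical strength is definitely one of those things.
4) Some friends said that technologies are only tools and that one should not be so obsessed with them. That is not wrong. Sometimes what we need most is to look up and see things beyond technology. And if we work on technology without ever asking why a technology exists and why it is this one and not another, the problem is not the technology. The problem is that we study mechanically, read without thinking, and turn into technical bookworms.
5) NoSQL has been popular lately, but I am somewhat conservative about it, so I only said that a passing knowledge is enough. As for Hadoop, I think it has enormous potential for distributed systems, so it is worth learning. Relational databases are indeed very important; leaving them out was my oversight, and I have added them to the original article.
(End of article. Please credit the author and source when reposting.)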