Projects

1 PROJECT WIKI
2 PROJECT DEADLINE
3 PROJECT GOAL
4 PROJECT TYPES

4.1 Experimental projects
4.2 Theoretical projects

5 USEFUL DATA SETS/TOOLS
6 REMINDER

PROJECT WIKI

All projects should be entered here *before* you start the actual work.

PROJECT DEADLINE

The project reports have to be sent to me by email no later than Sunday, April 3 (= as long as it's Sunday, April 3 somewhere on this planet). This gives a total of 4 weeks.

PROJECT GOAL

All of the projects aim to be (mini-)research related projects. They are not text book exercises with wrong/right answers.

PROJECT TYPES

There are two types of projects. One implementation-based, using real data, and one literature-based, with a pure theoretical nature

Experimental projects

Punchline: Handle real data sets, implement real algorithms, evaluate the performance of your algorithm.

Project description: The following is meant to be an *example* proposal. You're free to work on this but you're also encouraged to think of a different project that's more aligned with your research interests. Any reasonable proposal of similar complexity will be accepted. Ask first if you're unsure.

Sample description: Try to predict the "advertisability" of a query, i.e. whether sponsored search ads should be shown for this query. You can use the AOL query log (see data sources) as a real query log. To get a set of labeled data you can use your preferred search engine (again, see data sources). The final evaluation should have at least a partial human contribution as the goal is not to recreate Google's/Yahoo's/Bing's algorithm.

Output: A written project report. The report must include (i) the problem you addressed, (ii) how you obtained the data you used, (iii) which algorithms you used to solve the problem, and (iv) how exactly the algorithms were evaluated. It does not need to include any related work or any bibliography, though if you use existing methods you have to acknowledge that. There is no minimal or maximal length.

Evaluation: Projects will be evaluated on (i) the clarity of the project presentation/write-up, (ii) the scientific thoroughness/rigor of the evaluation, and (iii) the originality of the work. You can trade (ii) for (iii), and vice versa. E.g. you can simply use an existing algorithm from the literature (= low originality) but then have a very thorough and convincing evaluation (= high thoroughness, given the available resources of course). Or you can think of a new approach, in which case the evaluation does not need to be as thorough. In no case will you be evaluated on the absolute performance of your methods. It is irrelevant if your methods perform worse than the baseline as long as you demonstrate that you've learned something and, hopefully, you've had fun playing with the data and your algorithms.

Group size: These projects will typically be in done in pairs. If you're very experienced with the required tools you might be able to do them alone (not encouraged though). If there is a good reason (ask first!) and, e.g. you want to compile a very large data set and compare lots of different algorithms, larger groups (up to 4 students) might be possible. In either case, sharing of the obtained data sets for inter-group evaluation of the algorithms is strongly encouraged (assuming more than one group works on the same project).

Theoretical projects

Punchline: read a small set of related theoretical papers, give a "synthesis", have a stab at open problems.

Project description: Find 3-4 related (= addressing similar problems and/or citing each other) theoretical papers in the domain of online advertising, e.g. discussing various auction mechanism. Eligible papers must have non-trivial theorems and contain a substantial part of proofs and must *not* be co-authored by Ingmar. The goal is to "connect" these papers and to look for (easy) extensions/open problems.

Output: A written report without minimum/maximum page limits. The report should (i) summarize the papers investigated (not merely copying the abstracts), (ii) give an insight into the difficulties of the "main" proof, (iii) put the papers into context, i.e. explain how they relate and why one theorem is/is not applicable in the other setting, and, finally, (iv) look at easy extensions. Concerning the last point, general extensions are things such as more expressive mechanisms (e.g. interdependence between things), changes over time or things being repeated or other issues mentioned by the authors. These things will be hard/impossible to solve but usually there are solvable special cases, such as considering a constant X (where X can be anything depending on the problem), assuming very simple models, such as factorizability, allowing only a single Y of a certain type and so on.

Evaluation: The report should show that you've read and understood all the papers in detail, including the main aspects of the proofs. E.g. where exactly does on proof break when transferred from the setting of one paper to the other? The extensions considered should also not be completely trivial (such as 1 bidder, 1 item and anything else constant - such that a setting is a special case of almost any problem) and demonstrate that you've given the problem some serious thought.

Group size: These projects will typically be done alone. However, pairs of people are also possible if you think that there would be useful synergy (ask first!) and if there will a total of 6-8 papers covered.

USEFUL DATA SETS/TOOLS

1. AOL query log About 20M web queries from about 650k users over three months. Includes both queries and clicked URLs, but no ad clicks.

2. Ads for search terms Sadly, there is no open data set of ads for search terms. However, using a typical "scraping" approach, which might break the terms of service and should only be used for instructional purposes if at all, one can get a reasonable data set within, say, a day. If you want to follow this approach keep the following in mind: - Fake a user-agent: all tools such as "wget" have options for this.

Throttle the rate and use random waiting intervals: avoid issuing more than 5-7 queries per minute and regularly check the progress.
Use different IPs: If you have access to open proxy servers, great. Note that the ad served will often depend on the geo-location of your IP address.
Email Ingmar if you want Perl source code (essentially just regular expressions) that extracts the ads from a search result page. He can also provide you with a (trivial) wget command to get such a search result page.

3. Yahoo! Webscope data Real data for advertisers' bids on keywords but the keywords (and advertiser ids) are anonymized. Still potentially useful to study the stability of the ads (c.f. truthfulness) and the strategic behavior of competitors.

4. The Open Directory Project data Millions of URLs and snippets classified by topic.

5. Yahoo! Web Search (http://search.yahoo.com/ http://www.ysearchblog.com/category/site-explorer/ http://help.yahoo.com/l/us/yahoo/search/basics/basics-08.html) You can use "linkdomain:" to get inlinks for a URL (to find related pages). Note: this service will be discontinued soon.

6. Wikipedia Complete wikipedia - or use only the hierarchy to find related terms/topics.

7. Wordnet Another standard tool. Gives you e.g. synonyms and other related terms.

REMINDER

All projects should be entered here *before* you start the actual work.