Showing posts with label search engines. Show all posts

Thursday, March 1, 2012

Search Challenge 002

A very popular search challenge is the Kermit Challenge.

I'd rate this a novice challenge and a good one to introduce elementary level students to search strategy, search engines, keywords, snippets and urls.

When it was first created, we posted a time to beat of 10 minutes. The only thing that would ever take this long is an inability to describe in words what is in the picture. If someone failed to use the character's name, that could slow down the search.

 Today I lowered the time to beat to 5 minutes. It takes less than a minute if you know what you are looking for.

Search Strategy 
Start by asking, "what am I looking for?" The directions call for finding a URL of a matching picture of Kermit, a URL where Kermit can be heard talking. If students don't know what a URL is, this is a good opportunity to point to one. No need to define it, just call it the address where a page on the Internet lives. Show a URL.

Keywords 
Also part of the search strategy is, "what words do I already know that I could use to find the matching picture?" The most important is given in the directions: Kermit. This is a proper noun and as such, has a very specific meaning. We want to use words that have specific meanings--if we can--when looking for information on the Internet. Other words need to come from the picture. "What do you see in the picture?" Describe it. "What is Kermit wearing?" "At what kind of an event would you wear clothes like that?"

Search Engines 
Search engines use keywords to find matching information. The engine used here is Yahoo. Students should know that there is more than one search engine (Google is another). You can put any combination of words in a search engine, but it's best to use just a few. The order of the words doesn't really matter. Like most searches today, this one does not require any Boolean operators, but I'd leave that topic for older grades.

Snippets 
Search engines return matches to your keywords on a page as snippets, shortened sections of text that include the URL of the page where matching words were found, maybe the date the page was last updated, some text from the page so you can see how the words are used, a link to the page and some other information that can be topics for older grades (cached, similar). Snippets are REALLY important in finding information that matches the keywords. The search engine just finds the words, you have to determine if the way the words are used makes sense. The top result may not be the best one. Snippets may also (often) contain better words than the ones you started with. Maybe the words commencement or graduation show up. That's where people wear caps and gowns. Those words could be put in a new query such as KERMIT GRADUATION.

URL 
A little more about URLs could be introduced, such as the parts of a URL and what they tell us. In this case, the answer has the name of the organization that owns the information and the names of several folders where that information is stored: first, a news folder. Inside the news folder is another folder called 'commence' and in that folder is another one labeled '1996.' Finally in that folder is the page that matches the challenge. This page is an .htm page; the .htm ending tells you the kind of file it is, a pretty common information file on the Internet.
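For older students, the folder idea can be shown with a few lines of Python. This is only a sketch: the address below is a made-up stand-in (the host www.example.edu and the file name speech.htm are assumptions, not the real answer page). Only the folder names news, commence and 1996 come from the description above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical address: host and file name are placeholders, not the real answer.
url = "http://www.example.edu/news/commence/1996/speech.htm"

parts = urlparse(url)
path = PurePosixPath(parts.path)

host = parts.netloc                     # the organization that owns the information
folders = list(path.parent.parts[1:])   # folders from outermost to innermost
page = path.name                        # the page itself
file_type = path.suffix                 # the kind of file

print("Owner:  ", host)
print("Folders:", " > ".join(folders))
print("Page:   ", page, "(file type", file_type + ")")
```

Reading the folders from left to right mirrors opening one folder inside another on a computer.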

Planting the seeds that information can be organized (structured) in folders is a good computational mindset to introduce. A discussion about how to organize information (one big pile, separate piles without names, all laid out in a row, etc.) might help students think about the fastest ways to find something and what works best on computers.

 Try the Challenge. Don't miss the opportunity for learning. What other lessons can you squeeze out of this experience?

2018 Update: The Kermit Challenge became quite a bit more difficult when Long Island University absorbed Southampton College and took down the latter's Website. The answer page has been updated accordingly.

Thursday, October 22, 2009

For Elementary Students the Challenge is...


The younger a student is, the more challenging searching becomes.

On average, middle school students have more difficulty than high school students (the subject of an earlier post). Compared to middle school students, searchers in the elementary grades find themselves with fewer advantages.

There are a number of reasons (developmental, social, economic, etc.) for this and by no means is this post intended to be exhaustive. In fact, I'm going to focus on only one cause. Early childhood specialists, elementary teachers and librarians will have much more to contribute on this subject--and I encourage you to do so!

An obvious obstacle children face is vocabulary.

This might not be as big a problem if there were universal words used to describe the things we seek. Elsewhere I've talked about the "1 in 5 rule:" that, on average, there are four other words that may be used to describe an object or action. Unless children stick with fairly simple terms (butterfly, acorn, planet) there is a good chance they will not use the correct term (the term that matches the information they want). Moreover, 1-word searches are among the least effective ways to search--two words prove to be much better--and matching both words (if they're not the right ones) can produce unexpected results.

Because their vocabularies are limited, they may be unable to think of alternate words. That being said, it's amazing how few older students approach searching as a task of finding better keywords. Apparently, knowing more words is not the only obstacle to becoming an effective searcher.

But not having a good command of words, their meanings and the relationships among them poses a serious limitation. It is on this point I welcome insights from practitioners: what do you see as the limits?

There are some accommodations that can be made. One is to limit the search. Having children search in a closed, vetted environment is a fairly popular solution. Using Nettrekker.com or creating a custom search engine provides children with an authentic search experience while fishing in a pond that's appropriately stocked (not like a free range search engine where you can hook on to some real sharks).

Another alternative is to use a subject directory where the keywords are already supplied and the choices are vetted. This approach, while less efficient than a search engine, can be used to teach relationships among words. And again, notwithstanding your school's filters, the pond is largely protected (to check this yourself, use the search box on a directory site to search for an objectionable term).

That's all for now. I want to hear your thoughts on what makes searching challenging for children and what you may have found that helps.

Monday, September 28, 2009

Revisiting Stop Words and Clutter Words

The final item on the Query Checklist that I'm revisiting is #7: Did I use any stop words or clutter words?

Briefly, stop words are terms ignored by the search engine: common parts of speech that don't add significant content such as pronouns, prepositions and conjunctions. Google lists some of its exceptions to the "every word counts" rule. Here's a more complete list of overlooked words.

One way to tell if a word is being overlooked is to examine the query results. Consider the query here are all the stop words (not using quotes). In Google, all the words will appear in bold if the exact phrase is found (you don't need quotes to return the exact phrase). If only certain words from the query were used to find a matching result, those words will be shown in bold. In my query example, the second snippet contains the word 'the' but it does not appear in bold. Yahoo is similar.

One way to guarantee ALL the words are used is to link them with the AND operator (returning results containing all the words but not necessarily in the order you used them) or to put quotes around the phrase (returning results containing the exact phrase you submitted).
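The difference between the two can be sketched in a few lines of Python. This is a toy model, not how any real search engine works: the two "pages" are invented sentences, and matching is done on whole lowercase words only.

```python
def matches_all(query, text):
    """AND-style match: every query word appears somewhere in the text."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())


def matches_phrase(query, text):
    """Quoted-phrase match: the query words appear together and in order."""
    terms = query.lower().split()
    words = text.lower().split()
    return any(
        words[i:i + len(terms)] == terms
        for i in range(len(words) - len(terms) + 1)
    )


query = "here are all the stop words"

# Invented page text, for illustration only.
page_a = "here are all the stop words that google ignores"
page_b = "all the words are here so stop looking"

for name, page in [("page_a", page_a), ("page_b", page_b)]:
    print(name, "AND:", matches_all(query, page), "phrase:", matches_phrase(query, page))
```

Both pages pass the AND test, but only page_a contains the exact phrase, which is why quotes return fewer results.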

Stop words are so common they add little to the uniqueness of a query, and it is uniqueness that drives you to well-matched, meaningful results. Students might be tempted to use stop words with a natural language query (e.g., I want a list of all the stop words), thinking this means something to the search engine. The query list stop words gets to the point.

In a similar vein, clutter words are less common than stop words but don't add value to the query. In fact, they may detract from it, forcing the search engine to look for words you think are important but do not occur with the information you are seeking. Clutter words include unnecessary redundancies (like earthquake AND damage--in which case damage is redundant: it's hard to write about earthquakes without referring to damage or destruction or a bunch of other words you might not have used). Verbs, adjectives and adverbs are often clutter terms as well. A good rule of thumb to keep in mind is "if you can't clearly see it, don't use the word." Stick to objects--nouns and numbers.
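As a classroom illustration, trimming a natural language question down to a query can be sketched in Python. The two word lists below are my own small samples, chosen for this example; they are not the list any search engine actually uses.

```python
# Illustrative samples only, not an official stop word list.
STOP_WORDS = {"i", "a", "an", "the", "of", "all", "and", "or", "to", "in", "for"}
CLUTTER_WORDS = {"want", "find", "information", "about"}


def tidy_query(question):
    """Drop stop words and clutter words, keeping the remaining words in order."""
    kept = [
        word
        for word in question.lower().split()
        if word not in STOP_WORDS and word not in CLUTTER_WORDS
    ]
    return " ".join(kept)


print(tidy_query("I want a list of all the stop words"))  # list stop words
```

A fixed list can only go so far. Deciding that damage is redundant next to earthquake still takes a person who knows the subject.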

All in all, the Query Checklist has held up well over the past few years. Once the list is internalized it can help you cut down on search time and produce more relevant results.

Next time: It's probably time for another Search Challenge!

Wednesday, September 23, 2009

Revisiting words with multiple meanings


Of the items on the Query Checklist, one that could be dropped is #6: avoiding words that have multiple meanings.

If your query includes an adequate number of keywords--and not more than necessary--a word with multiple meanings does little to prevent you from finding what you seek. From my perspective, today's search engines (the ones that continue to develop) are less sensitive to multiple meanings and more sensitive to contextual clues provided by the other keywords. This is why a search for roman spears no longer returns top results about Britney Spears. A few years back this type of search challenge was pretty easy to construct: find a word with a very popular or common usage and use it in a search for a less common object or idea.

Nowadays, the pairing of words in a meaningful context excludes other uses of the terms. As long as the accompanying term is sufficiently unique, using a word with multiple meanings is not a problem. The challenge is to find the word that uniquely modifies the more ambiguous term.

If some of the 21cif Search Challenges seem easier than they once were, credit search engines for producing more focused results.

It's still a good idea to be mindful of words with multiple meanings and pair them with unique terms. If you are looking for information on a disc jockey whose name is "Bill Gates" you definitely need some unique terms to ferret out someone other than the Bill Gates of Microsoft fame. This one still may be challenging.

Finally, does it need to be stated that one-word searches are confounded by words that have more than one meaning?

Next time: revisiting stop words and clutter words

Tuesday, September 15, 2009

Revisiting Word Order


When does the order of keywords matter?

The ninth item of the query checklist was always last because keyword order mattered the least. This remains largely the case.

Take a query I used today while doing some IMSA program planning: business ethics simulation. There are five other ways to order the terms. But does it make any difference?

Analyzing the top ten results in Google, Bing and Yahoo, here's how many different results were obtained when the order was switched (a total of 60 different results per engine is theoretically possible):
14 - Google
15 - Bing
15 - Yahoo
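For anyone who wants to repeat the experiment, the six orderings and the theoretical maximum of 60 can be generated with a short Python sketch. It assumes ten results per page, as in the top-ten comparison above, and it does not contact any search engine.

```python
from itertools import permutations

keywords = ["business", "ethics", "simulation"]

# Every possible ordering of the three keywords.
orderings = [" ".join(order) for order in permutations(keywords)]

results_per_page = 10  # assumption: only the top ten results are examined
max_results = len(orderings) * results_per_page

for query in orderings:
    print(query)
print(len(orderings), "orderings, at most", max_results, "different results per engine")
```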
A few other insights are worth mentioning:

Google returned the identical top result no matter the keyword order. The second and third slots were filled consistently by the same two pages with minimal alternation. In all, six returns were common across all possible keyword combinations. Queries that returned the most diverse results were: business ethics simulation, ethics simulation business and ethics business simulation. I'm not sure what to make of this observation, but I thought I'd mention it nonetheless. Any ideas?

Compared to Google, Bing was more varied in its ranking of results. No page was consistently the top result, although five pages appeared in the top ten on all trials. While Bing produced one more unique page than Google, several pages were from the same site. Of greater interest, Bing and Google returned a number of pages not replicated by the other (see below).

Yahoo, like Google, consistently returned the identical top page no matter what the query order. The second return was also identical across all queries, although this page was related to the first, so not entirely a unique return. Again, five of the same results were found with every query. Yahoo did not return Google's top return at all, but both Google and Bing included Yahoo's top result.

All three search engines combined produced a total of 31 unique returns. If I had stopped after entering the first query--business ethics simulation--the three search engines would have yielded 21 different pages. Fifteen additional queries netted only 10 additional, unique pages. Probably not worth the effort.

Pages unique to each search engine:
7 - Google
4 - Bing
9 - Yahoo
What to make of this? The biggest lesson, it seems to me, is that searching different databases is more worthwhile than playing with word order. Without looking past the first page of each, I netted twice as many highly ranked results as I would have if I had only used Google. (Now whether the results are all that relevant is a matter of investigation). By contrast, I netted only 4-5 new pages by sticking with one search engine and varying the keyword order.

Based on the number of unique results, if you're not using Yahoo, you might consider adding it to your list of go-to search engines.

Some differences are obtained by changing the word order, but maybe not enough (in this case) to warrant going through all the permutations. In general, stick with the natural language order of the words. It seems natural to say business ethics simulation. The other forms seem a bit awkward or forced. Since search engines look for words in relationship to one another, and this is the order most people might use when writing about business ethics simulations, it's good enough. I'm sure there are cases you can think of when a particular order works better. If there are, post your reply.

There's one case when order is highly important: when operators are used. The operator modifies the keywords around it, so if placed in the wrong order, the results may be wildly unpredictable. For example: business OR ethics OR simulation (a student favorite when they stumble upon the OR operator).
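Here is a rough Python sketch of why operator placement matters. It rests on two assumptions of mine: that OR groups only the words on either side of it while the remaining word is still required, and that three invented one-line "pages" can stand in for real results. Actual engines may group terms differently.

```python
def has(word, text):
    """True if the word appears as a whole word in the text."""
    return word in text.lower().split()


# Invented pages, for illustration only.
pages = {
    "page 1": "business simulation games for managers",
    "page 2": "ethics simulation for medical students",
    "page 3": "business ethics case studies",
}

# business ethics OR simulation  ->  business AND (ethics OR simulation)
first = [name for name, text in pages.items()
         if has("business", text) and (has("ethics", text) or has("simulation", text))]

# business OR ethics simulation  ->  (business OR ethics) AND simulation
second = [name for name, text in pages.items()
          if (has("business", text) or has("ethics", text)) and has("simulation", text)]

# business OR ethics OR simulation  ->  any one of the three words is enough
third = [name for name, text in pages.items()
         if has("business", text) or has("ethics", text) or has("simulation", text)]

print(first)   # ['page 1', 'page 3']
print(second)  # ['page 1', 'page 2']
print(third)   # ['page 1', 'page 2', 'page 3']
```

Moving the operator one word over changes which pages qualify, and using OR everywhere lets in anything that mentions any one of the words.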

Next time: revisiting the optimal number of keywords.

Saturday, September 12, 2009

Query Checklist Revisited


A number of years ago, we published the Query Checklist, a guide for turning questions into queries. I seldom think about the checklist anymore--I guess I've internalized the list, so checking off the items as I search really isn't necessary.

I thought it would be helpful to revisit the checklist to see if it's still relevant. As search engines have evolved, maybe something has changed.

The original list was the combined search wisdom of the 21cif team back when there were 7 of us and IMSA was the publisher. Now there are just two of us and 21cif is privately owned. That's what happens when federal funding runs out. At least the program survived, thanks to IMSA's decision to release it to its authors.

Here's the list, if you're not already familiar with it:

1. How many key concepts (important ideas) are found in the question?
2. How many key concepts will I search for?
3. What keywords are probably effective “as is?”
4. For which concepts are more effective keywords probably needed?
5. Are there hyponyms or professional language for any of the intermediate words?
6. Are there words that have multiple meanings?
7. Did I use any stop words or clutter words?
8. Did I spell the words correctly?
9. Did I put the most important words first?


There's too much here to cram into one blog posting, so I'll spread it out over a series.

Let me start with number 8, which seems pretty obvious. The importance of this question depends on the search engine being used. For instance, Google has a built-in spell checker so you might think spelling no longer matters much. When spelling does matter is when the misspelled word turns out to be a bona fide yet different word.

Example 1: If you are looking for information on bear tracks but mistakenly type bare tracks, Google thinks there's nothing wrong with your spelling and returns information on an Australian nudist colony. An honest mistake, but not one you'd likely want to make in front of a class of students.

When words have more than one spelling, or a different word happens to match the misspelling, then spelling counts.

Example 2: When a search engine lacks the capacity to spell check, then literal matching is all you've got. If you search for Mississipi using the Farmers Almanac, you won't get any results.
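The difference between literal matching and a forgiving search can be sketched in Python with the standard difflib module. The three-item index is made up for the example, and real spell checkers are far more sophisticated than this.

```python
import difflib

# A made-up index standing in for a site's searchable terms.
index = ["Mississippi", "Missouri", "Minnesota"]


def literal_search(term):
    """Return only entries that match the typed term exactly."""
    return [entry for entry in index if entry == term]


def forgiving_search(term):
    """Return the closest entry, if one is similar enough to the typed term."""
    return difflib.get_close_matches(term, index, n=1, cutoff=0.8)


print(literal_search("Mississipi"))    # []
print(forgiving_search("Mississipi"))  # ['Mississippi']
```

Neither approach helps with bear versus bare, since both are correctly spelled words.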

Challenge: Take the top ten search engines you use and see which ones check spelling. Better yet, have your students do this for practice. Don't have 10 search engines that you use? Time to branch out.

Monday, August 17, 2009

More information, a smaller fraction


I'm still thinking about my last post.

The paradoxical thing about information and searching is that the more of it there is, the less of it we will see. The results we retrieve will be a smaller and smaller sample of what's actually available. And I don't see how this trend can be reversed.

Today when I search Google for the phrase "information fluency" it reports that about 39,800 results are found. But if I had the time and wanted to look at all those returns, I could only get to 981 of them, or 2.5%.

When I ask workshop participants if they've ever gone to the end of the list retrieved, I've never encountered anyone who has. Most searchers stop after the first page; the number who look at two pages is much smaller. The 8 people in 100 who go beyond the third page have access to 0.1% of the information theoretically available. For the majority who never look beyond page one, that number falls to 0.025%.

Keep in mind these are the stats for "information fluency." If the search were for "civil war" (90,200,000 results), single-page searchers would be able to reach about 0.00001% (one hundred-thousandth of one percent) of the information theoretically available. That's a very small percentage and it's going to get smaller.
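The arithmetic behind these percentages is easy to check. The sketch below assumes ten results per page and treats "beyond the third page" as four pages, or 40 results; the result counts are the ones reported above.

```python
def percent_seen(viewed, reported):
    """Share of the reported results a searcher actually looks at, in percent."""
    return viewed / reported * 100


RESULTS_PER_PAGE = 10  # assumption: ten results on each page

# "information fluency": 39,800 reported results, 981 reachable
reachable = round(percent_seen(981, 39_800), 1)
four_pages = round(percent_seen(4 * RESULTS_PER_PAGE, 39_800), 1)
one_page = round(percent_seen(RESULTS_PER_PAGE, 39_800), 3)

# "civil war": 90,200,000 reported results, first page only
civil_war = f"{percent_seen(RESULTS_PER_PAGE, 90_200_000):.5f}"

print(reachable, four_pages, one_page, civil_war)  # 2.5 0.1 0.025 0.00001
```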

One of the changes in Google I noticed over the last year was greater variation in the results on the top pages. It used to be that 21cif.com (previously 21cif.imsa.edu) took 8 of the top 10 spots. Today, 21cif shares the top ten places with four other organizations. By eliminating similar results, Google is able to show more choices for the query. That's better in terms of available information--there's no reason why 21cif.com should have a monopoly on the content--but the results remain limited to a handful of major players in the field of information fluency.

There are other ways to serve up a broader array of results, such as clustering. Polymeta is one example. But nothing digests more than a fraction of the information out there. That places the responsibility for gaining a broader perspective on the searcher. And that takes work: multiple queries to get at information buried too deep to be retrieved by a single query.

If we want to stay on top of information, given today's tools it's not going to be easy.

Saturday, August 15, 2009

Is Searching Getting Easier?


There's little doubt that search engines are indispensable when it comes to searching for information online. But is searching easier today than it was a couple years ago?

I'm inclined to say yes.

Search engines are constantly being tweaked (it seems) to improve their performance, or at least their user-friendliness. How well they do at improving the task of finding information is largely a matter of personal opinion, and it depends on what you are searching for as well.

The ultimate problem with searching is that there is so much information to choose from. Search engines will always struggle to keep up with the production of new information. If IBM's prediction is right, and the amount of information produced by 2010 doubles every 11 hours, the engines we know today will never catch up. I imagine engine developers are working on how to stay on top of that kind of exponential growth curve, but it still comes down to this: what you retrieve is only a tiny fraction of the information that's available.

In a world awash with information, we increasingly need help reducing the torrent to a trickle. Largely content to browse the top ten results, we have become information reductionists. It seems harsh to put it that way, but most of the time we opt for easy. Scanning a few results is easy and, much of the time, serves us adequately.

Search engines are designed to serve up manageable results. Innovations like Google Squared and Bing make it relatively easy (though the quality of the information still has to be checked). Smarter engines will (if they don't already) track the things you query and the results you browse to better determine the type of information that interests you. That's intended to make things easier, but I wonder if that will actually make it harder to find information that challenges you.

Something to think about.