{\rtf1\mac\deff2 {\fonttbl{\f0\fswiss Chicago;}{\f2\froman New York;}{\f3\fswiss Geneva;}{\f4\fmodern Monaco;}{\f13\fnil Zapf Dingbats;}{\f14\fnil Bookman;}{\f15\fnil N Helvetica Narrow;}{\f16\fnil Palatino;}{\f18\fnil Zapf Chancery;}{\f20\froman Times;} {\f21\fswiss Helvetica;}{\f22\fmodern Courier;}{\f23\ftech Symbol;}{\f33\fnil Avant Garde;}{\f34\fnil New Century Schlbk;}{\f55\fnil Code 3 of 9;}{\f1904\fnil AppleIcon;}{\f2029\fnil Nadianne;}{\f2052\fnil Zeal;}{\f12899\fnil AppleGaramond LtIt;} {\f12900\fnil AppleGaramond BkIt;}{\f12901\fnil AppleGaramond BdIt;}{\f12902\fnil AppleGaramond Lt;}{\f12903\fnil AppleGaramond Bk;}{\f12904\fnil AppleGaramond Bd;}}{\colortbl\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blu e255; \red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\ green255\blue0;\red255\green255\blue255;}{\stylesheet{\s243\qj\fi180\sa24 0\keep\tqc\tx4320\tqr\tx8640 \f20 \sbasedon0\snext243 footer;}{\s253\qj\fi180\li360\sa240\keep \b\f20 \sbasedon0\snext0 heading 3;}{\s254\qj\sb120\sa240\keep \b\f20 \sbasedon0\snext0 heading 2;}{\s255\qj\sb240\sa240\keep \f20\fs36 \sbasedon0\snext0 heading 1;}{\qj\fi180\sa240\keep \f20 \sbasedon222\snext0 Normal;}{\s2\li360\sa240\keep \f20 \sbasedon0\snext0 BulletList;}{\s3\sa240\keep \f22 \sbasedon0\snext0 code;}{\s4\qc\fi180\sa240\keep \f20\fs48 \sbasedon0\snext4 Title;}{\s5\qc\fi180\sa240\keep \f20\fs28 \sbasedon0\snext5 Byline;}{\s6\qc\fi180\sa240\keep \f20 \sbasedon0\snext0 Caption;}{ \s7\qj\sb240\sa240\keep \f22\fs36 \sbasedon255\snext7 CodeRef;}}\margl720\margr720\margt1080\margb720\deftab360\widowctrl\ftnbj \sectd \sbknone\linemod0\linex0\cols1\endnhere {\footer \pard\plain \s243\qr\fi180\sa240\keep\tqc\tx4320\tqr\tx8640 \f20 More Soup? \endash \chpgn \par }\pard\plain \s4\qc\fi180\sa240\keep \f20\fs48 More Soup?\par \pard\plain \qc\fi180\sa240\keep \f20 {\fs36 DRAFT 5\par }\pard\plain \s5\qr\fi180\sa240\keep \f20\fs28 Michael S. Engber\line Apple Computer - PIE Developer Technical Support\line Copyright \'a9 1993 - Michael S. Engber\par \pard\plain \fi180\sa240\keep \f20 This article was (will be) published in the January 1994 issue of PIE Developers magazine. For information about PIE Developers, contact Creative Digital Systems at CDS.SEM@APPLELINK.APPLE.COM or 415.621.4252.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Introduction\par \pard\plain \qj\fi180\sa240\keep \f20 Last month's issue had an article, "Soup's On," which covered the basic things you should know about creating and using soups. This article focuses optimizing soup usage. I assume that the reader is already familiar with the chapter on using soups in the N ewton Programmer's Guide and has some experience writing Newton applications.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 The Nature of Soups\par \pard\plain \qj\fi180\sa240\keep \f20 Soups on the Newton store flattened versions of the entry-frames you work with in NewtonScript. When you retrieve an entry from a soup, you get back a frame which is a cache of the actual data in the soup. Calling EntryChange causes the cache to be written out to the soup. Calling EntryUndoChanges causes the cache to be re freshed from the soup.\par Retrieving an entry from a soup allocates a frame in the NewtonScript heap. This takes time and heap space. The current soup implementation optimizes this operation by not creating a complete frame for the entry until you access one of its slots. Once you access a slot, it brings the whole entry into memory. An obvious area for future optimization would be to bring slots into memory on an as-needed basis.\par \pard \qj\fi180\sa240\keep The above information may not be new to you. If you didn't know it already, after reading about soups in the Newton Programmer's Guide and a few minutes of reflection, you probably would have guessed it. What you may not have thought about, however, are so me of the implications of the implementation of soups.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Organizing Your Data Across Multiple Entries\par \pard\plain \qj\fi180\sa240\keep \f20 One implication of the way soup data is retrieved is that as you loop through your soup looking at all the entries, there is overhead (in both space and time) for allocating entry-frames in the NewtonScript heap. The larger the entries, the larger the over head. If you've created a soup whose entries each contain a large amount of data, you might have noticed this already.\par \pard \qj\fi180\sa240\keep One way you can speed things up is to store your data in multiple soups. The built-in Calendar application uses five different soups to store its data. This is probably more soups than most applications need. I think two soups should suffice for most purposes. Call one soup the query-soup and the other the data-soup. The basic strategy is to keep the entries in the query -soup small. They hold just enough information for the queries you commonly make, plus a reference (EntryUniqueId) to an entry in the data-soup which contains the bulk of the information.\par \pard \qj\fi180\sa240\keep This allows you to quickly access the query-soup and pay the overhead for retrieving large amount of information only when you actually retrieve it from the data-soup.\par \pard \qj\fi180\sa240\keep Actually, there is no reason you have to use two soups to take advantage of this technique. You can keep both query-entries and data-entries in the same soup. If only the query-entries have the slots used for indexing, then your queries will ignore the dat a-entries.\par \pard \qj\fi180\sa240\keep The exact details of what slots go where will depend on your particular application. It will probably take some experimentation on your part to c ome up with an optimal organization. Another consideration is whether or not you're planning to allow your users to edit their soup data using Newton Connection. If you are, splitting your data across multiple entries may make editing the data more diffic ult, perhaps impractical.\par Bear in mind, this type of optimization is useful only when you need to access a large number of soup entries to find relatively few entries that you are actually interested in. Whenever possible, it's better to handle this proble m by using indexes. Often, proper use of indexes can allow you to construct queries that only examine the entries of interest.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Use Your Indexes\par \pard\plain \qj\fi180\sa240\keep \f20 Another point to keep in mind when using soups is: take advantage of your indexes whenever possible. This seems obvious enough, but I've seen lots of programmers carefully design their soups with indexes and then fail to take advantage of them when writing queries.\par \pard\plain \s254\qj\sb120\sa240\keep \b\f20 StartKey vs ValidTest\par \pard\plain \qj\fi180\sa240\keep \f20 A common error is to be seduced by the flexibility of using a validTest. The te mptation is to write a query that relies solely on a validTest to decide which entries to return. For example, a validTest like the one below works \endash it returns the desired entries. However, it totally negates the benefit of having an index on slot x.\par \pard\plain \s3\sa240\keep \f22 validTest: func(e) e.x >= "C" AND e.x <= "D"\par \pard\plain \qj\fi180\sa240\keep \f20 If you create a cursor using this valid test, and naively use a "{\f22 while cursor:Entry()} " loop to traverse your soup, you'll end up visiting every single element in you soup. If instead you had used a startKey and an endTest, as show below, you would only have visited the entries of interest. You can't hope to do any better than that.\par \pard\plain \s3\sa240\keep \f22 startKey: "C",\line endTest: func(e) e.x < "C" OR e.x > "D"\par \pard\plain \qj\fi180\sa240\keep \f20 Of course, not all queries are going to be as simple as these, but you should use a startKey and an endTest whenever possible. This makes the difference between an O(N) search and an O(logN) search. You may recall from your introductory data structures cou rse how big a difference this can be \endash one million versus six (log10{\fs20\up6 6} = 6).\par \pard\plain \s254\qj\sb120\sa240\keep \b\f20 GotoKey\par \pard\plain \qj\fi180\sa240\keep \f20 A related point is using GotoKey instead of Move. Move lets you move the entries one at a time. The one at a time part is actually hidden from you since Move accepts an arbitrary integer offset. Using GotoKey moves the cursor directly to the desired entry in one shot \endash a single O(logN) search.\par \pard \qj\fi180\sa240\keep Consider the following two code fragments for advancing a string-index based cursor to the last entry in a soup:\par \pard\plain \s3\sa240\keep \f22 cursor:Move(0x1FFFFFFF);\line while cursor:Next() do; //make sure we're off the end\line cursor:Prev(); //back up one - to the last entry\par cursor:GotoKey("ZZZZZZ"); //pick a "large" key\line while cursor:Next() do; //make sure we're off the end\line cursor:Prev(); //back up one - to the last entry\par \pard\plain \qj\fi180\sa240\keep \f20 These code fragments may not look very different, but for a large soup, the second one will be significantly faster.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Avoiding Explicit Search Loops\par \pard\plain \qj\fi180\sa240\keep \f20 You can often avoid writing an explicit search loop by using startKeys, endTests, and validTests. This reduces the size of your code and is more efficient since l ooping is handled by the soup implementation. There are two examples of this in the section titled "Uniquely Identifying Entries." The following code is from one of these examples.\par \pard\plain \s3\sa240\keep \f22 Query(soup,\{type: 'index,\line indexPath: '_uniqueId,\line startKey: entryId,\line endTest: func(e) EntryUniqueId(e) <> entryId,\line validTest: func(e) EntryStore(e):GetSignature() = storeSig,\line \}):Entry();\par \pard\plain \qj\fi180\sa240\keep \f20 This code evaluates to either {\f22 nil} or to the entry whose id is {\f22 entryId} and whose store signature is {\f22 storeSig}. No explicit search loop is required \endash only a single Entry message.\par \pard \qj\fi180\sa240\keep When using startKeys, endTests, and validTests be sure to think carefully about your queries. Because the underlying search is handled by the soup implementation, it's much easier to inadvertently do things inefficiently than if you were writing the search loops yourself.\par \pard \qj\fi180\sa240\keep As was pointed out in the previous section, a common error is omitting the startKey and consequently ignoring the benefit of an index. A rel ated error is omitting the endTest. Forgetting an endTest can force a query to do unnecessary work \endash continuing to search after all entries of interest have been returned. For example, consider the above query without the endTest. If there was no entry mat ching the criteria, the query would examine every entry in the soup whose id was greater than or equal to {\f22 entryId}. By using the endTest, it examines only entries whose id equals {\f22 entryId} \endash at most, two.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Simulating Joins by Using Multiple Indexes\par \pard\plain \qj\fi180\sa240\keep \f20 Currently, you can only query soups using one index at a time. Sometimes you want queries involving two or more indexes at time. In relational database jargon, this is termed performing a join.\par \pard \qj\fi180\sa240\keep I will present a technique that is applicable only to two indexes and, even then, is practical only if one of the indexes is over a small set of values. We will call the two "conceptual" indexes of interest a director-index and a date-filmed-index. The dir ector-index only allows the values moe, larry, and curly, indicating whic h stooge directed the episode. The date-filmed-index uses integers to represent the date the episode was filmed. The kinds of queries we will be interested in will be ones like: "retrieve all the episodes directed by moe in the order they were filmed." \par \pard \qj\fi180\sa240\keep The basic idea is to build an actual soup index for each of the distinct values of director-index. Each entry in the soup will have either a moe-slot, a larry-slot, or a curly-slot. The content of this slot will be the date the episode was filmed. Our inde x specification will look something like:\par \pard\plain \s3\sa240\keep \f22 [\{structure: 'slot, path: 'moe, type: 'int\},\line \{structure: 'slot, path: 'larry, type: 'int\},\line \{structure: 'slot, path: 'curly, type: 'int\},\line ]\par \pard\plain \qj\fi180\sa240\keep \f20 To "retrieve all the episodes directed by moe in the order they were filmed," we simply create a cursor using the code shown below. Only entries with moe-slots will be returned by this cursor. They will be returned ordered by the contents of their moe-slot \endash the date on which they were filmed.\par \pard\plain \s3\sa240\keep \f22 moeCurs := Query(stoogeSoup,'\{type: index, indexPath: moe\});\par \pard\plain \qj\fi180\sa240\keep \f20 You will note that we never actually built a unified date-filmed-index. Instead, we distributed this information over three separate indexes. This means we can't do a query to retrieve all episodes in date-filmed-index order without first giving each entry another slot, call it dateFilmed, and then building an index on the dateFilmed slot. There is nothing preventing us from doing this, it will just require more space \endash an extra slot in each entry.\par \pard \qj\fi180\sa240\keep This is not the most general purpose technique. We might be able to use it for "Snow White and the Seven Dwarves," but we'll run into problems with "101 Dalmations." It's something to keep in mind for use in suitable situations. \par To give credit where credit is due, this section was inspired by a posting on the Internet by Kent Borg. The posting specifically addressed implementing filing (which categorizes user data in one of up to thirteen categories). As you can see, the technique has some wider applicability.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Using Identical Frame Maps\par \pard\plain \qj\fi180\sa240\keep \f20 St oring entries in a soup requires storing the contents of their slots as well as information about the structure of the slots themselves. The information describing the structure of the slots in a frame is called the map. Soups save space by reusing maps am ong entries with the same structure. The issue of shared maps has implications for general NewtonScript programming. This section will only discuss it with respect to minimizing soup size.\par Obviously, in order for two entries to share the same map they must have the same set of slots. However, the current of soup implementation also requires entries that share a map to have their slots in the same internal order. This means that even if your entries all use the same set of slots, it's possible to have a diff erent map stored for each permutation of slot ordering.\par \pard \qj\fi180\sa240\keep The internal ordering of slots in a frame is something that programmers normally don't think about. NewtonScript purposely leaves the details of slot ordering unspecified \endash open to changes in the futu re. Programmers are not supposed to write code that relies on slot ordering and there are no functions provided to manipulate slot ordering. So it may seem unreasonable to suggest optimizations the rely on slot ordering. However, the way in which you creat e a frame gives you some control in the area of slot ordering. If you create your entry frames in an identical way, you can assume they use the same slot ordering.\par \pard \qj\fi180\sa240\keep For example, if you use the following line of code to create all your entries you can be sure that each entry will use the same slot ordering.\par \pard\plain \s3\sa240\keep \f22 newEntry := \{slot1: x, slot2: y, slot3: z\};\par \pard\plain \qj\fi180\sa240\keep \f20 You can achieve the same effect by cloning a dummy frame and then filling in the slots individually as show in the following code:\par \pard\plain \s3\sa240\keep \f22 newEntry := Clone('\{slot1: nil, slot2: nil, slot3: nil\});\line newEntry.slot1 := x;\line newEntry.slot2 := y;\line newEntry.slot3 := z;\par \pard\plain \qj\fi180\sa240\keep \f20 If the set of slots needed varies from entry to entry, you might consider giving all entries the same set of slots and filling in unused slots with nil. Indexed slot s containing nil will be ignored by indexes. (Note that this last point is only true for the last system update (1.05). Prior to that, using nil in an indexed slot would cause errors.) You have to weigh the benefits of map sharing against the cost of the e xtra slots.\par \pard \qj\fi180\sa240\keep For instance, if you know that over time, all the entries in your soup will end up with the same set of slots, it's probably better to give each entry this set of slots when it's initially created. On the other hand, if the entries in your soup tend to fall into a few different categories, with a different set of slots for each category, then it's wasteful to force every entry in the soup to use the same set of slots. You can still achieve space savings by ensuring that entries in the same categ ory share a map.\par \pard \qj\fi180\sa240\keep Generally speaking, most programs create their soup entries using the same few lines of code \endash much like the ones shown above. You only need to worry about the issues in this section if you build up entries in a piecemeal fashion, using a large number of different of slot sets or slot orderings.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Alternatives to Soups\par \pard\plain \qj\fi180\sa240\keep \f20 Remember that soup data is stored as frames. Consequently, there is a certain amount of overhead associated with each soup entry. For example, using a soup to store a list of pairs (e.g. < 95051, CA >) would be a poor choice. You may be b etter off storing large, read-only data sets in a slot in your base view as a single array, frame, or binary object. Information stored this way will be compressed along with y our package and will not be brought into the NewtonScript heap when it is accessed. The primary disadvantages of such a scheme are that the data will be read-only and that you won't have any of the nice conveniences that soup queries provide.\par \pard \qj\fi180\sa240\keep If your application uses a large initial data set and allows additions by the user, you might consider a hybrid approach: Keep the initial data set in your base view and use a soup only for the user's additions.\par \pard \qj\fi180\sa240\keep If you decide not to store your data in a soup, here are some points to think about:\par \pard\plain \s2\li360\sa240\keep \f20 \bullet If you keep an array sorted, you can use binary search to find elements quickly. Once again, this is the difference between an O(N) and O(logN) search.\line \line \bullet Don't be too quick to discount frames as your data structure. Slot lookup is fast. A binary search is used for large frames (as soon as there are enough slots to make it faster than a linear search). Also, remember that slot names (symbols) can consist of any characters if you use vertical bars to escape them. For example, {\f22 \{|1|: "Reg"\}} is a legal syntax for a frame.\line \line \bullet If you're storing a lot of repeated strings, consider using symbols instead. This way, only one copy of the actual string will be stored in your package \endash all the symbols will reference it. Again, remember you can use vertical bars to allow arbitrary characters in your symbols.\line \line \bullet Storing your data as a binary object can avoid some of the overhead of the array and frame data structures. You'll have to use the various ExtractXXX functions to retrieve your data. If strings are part of the data in your binary object, extracting them with ExtractCString or ExtractPString will create a string object in the NewtonScript heap. In general, binary objects may let you store your data more compactly, but you'll pay more to a ccess it.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Uniquely Identifying Entries\par \pard\plain \qj\fi180\sa240\keep \f20 Sometimes soup entries need to refer to each other. Consider a soup whose entries represent people. One person may need to refer to another person in the soup \endash for example, their father. You may recall that when a fr ame is added to a soup, a deep copy is made. This means if the father slot contains a reference to the father's frame, you would end up copying the father's entire frame into the son's entry. This is undesirable. You want each person to be represented by a single entry in the soup and for other entries only to reference that single entry.\par \pard \qj\fi180\sa240\keep If you knew, for example, that every person in the soup had a unique social security number you could store the father's social security number and then be able to retriev e the father's entry. Some soups you design will naturally have a slot that uniquely identifies an entry. If the soup you're working with doesn't have such a slot, there may not be a need to waste space by artificially creating one. \par \pard \qj\fi180\sa240\keep Every soup entry has a unique id that uniquely identifies it within its soup. This id is stored in a {\f22 _uniqueId} slot, but you should access it with the EntryUniqueId entry function. Every soup has an index on the {\f22 _uniqueId} slot. It's important to know that these id's are not unique across union-soups, they are unique only within a single soup.\par \pard \qj\fi180\sa240\keep In the "Soup's On" article I said that "since applications normally use union-soups, the {\f22 _uniqueId} slot isn't of much practical use." I should have said "the {\f22 _uniqueId} slot, by itself, isn't of much practical use." You can uniquely identify an entry in a union-soup using its id plus the signature of the store on which it resides. Given an entry, you can obtain this information using the EntryUniqueId and EntryStore entr y-functions and the GetSignature store method. The following code illustrates getting this information.\par \pard\plain \s3\sa240\keep \f22 fatherStoreSig := EntryStore(fatherEntry):GetSignature();\line fatherId := EntryUniqueId(fatherEntry);\par \pard\plain \qj\fi180\sa240\keep \f20 Given an id and store signature, you can retrieve the entry either by getting it directly from its store (using GetSoup, bypassing union-soups) or by examining the store signatures of all the entries in the union-soup with that id.\par \pard \qj\fi180\sa240\keep The following function retrieves an entry by getting it directly from its store.\par \pard\plain \s3\sa240\keep \f22 {\fs20 func(storeSig,soupName,entryId)\line begin\line pos := ArrayPos(GetStores(),storeSig,0,func(id,store) id = store:GetSignature());\line if pos then\line begin\line soup := GetStores()[pos]:GetSoup(soupName); \line if soup then\line Query(soup,\{type: 'index,\line indexPath: '_uniqueId,\line startKey: entryId,\line endTest: func(e) EntryUniqueId(e) <> entryId,\line validTest: func(e) EntryUniqueId(e) = entryId,\line \}):Entry();\line end;\line end;\par }\pard\plain \qj\fi180\sa240\keep \f20 The following function retrieves an entry by examining all the entries a union soup with the specified id.\par \pard\plain \s3\sa240\keep \f22 {\fs20 func(storeSig,soupName,entryId)\line begin\line local soup := GetUnionSoup(soupName);\line if soup then\line Query(soup,\{type: 'index,\line indexPath: '_uniqueId,\line startKey: entryId,\line endTest: func(e) EntryUniqueId(e) <> entryId,\line validTest: func(e) EntryStore(e):GetSignature() = storeSig,\line \}):Entry();\line end;\par }\pard\plain \qj\fi180\sa240\keep \f20 Notice both of the above functions use endTests and validTests instead of explicit search loops. Therefore, a single Entry message is sufficient to return the entry of interest or {\f22 nil}.\par \pard \qj\fi180\sa240\keep These functions are only meant to illustrate the technique I've been describing, not as code you should blindly paste into your application. They are not efficient for retrieving large numbers of entries \endash they creates a cursor each time they're called. It would easy to generalize these functions to accept arrays as arguments and return arrays of entries. Depending on your particular application, you can probably make other optimizations.\par \pard \qj\fi180\sa240\keep Before you start adapting this technique to your application, you should consider its major drawback. Most applications let users move their data between stores using the "Move to card" and "Move from card" items in the routing menu. If you plan to support this feature in your application, and you probably should, then any saved store signatures and entry id values will be incorrect if the user moves data. (Note that when an entry is moved to a new store, its unique id must be cha nged if it's already in use by an entry in the new store.) You could try to fix up all the references to an entry when it is moved, but this may prove impractical unless you have a quick way to find all the references.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Indexing on Modification Time\par \pard\plain \qj\fi180\sa240\keep \f20 To support date finds, you need to time stamp your soup entries. There is no need to create a special slot for this purpose. Soups already maintain this information.\par \pard \qj\fi180\sa240\keep In addition to the {\f22 _uniqueId} slots, entries that have been modified have a {\f22 _modTime} slot. This sl ot contains the time the entry was last modified as the number of minutes since January 1, 1904. Normally, you should access this value with the EntryModTime entry-function.\par \pard \qj\fi180\sa240\keep To query your entries in the order of their modification dates you should build an index on the {\f22 _modTime} slot. Remember, entries don't automatically have a {\f22 _modTime} slot; only entries that have been modified have one. Entries without {\f22 _modTime} slots will be skipped by a query on the {\f22 _modTime} index so you have to ensure that every entry i n your soup has one. You can do this by creating the slot yourself in the frames you pass to AddToDefaultStore or you can immediately call EntryChange on entries after you add them.\par \pard\plain \s255\qj\sb240\sa240\keep \f20\fs36 Summary\par \pard\plain \qj\fi180\sa240\keep \f20 Newton's soups provide a lot of functionality for free. The down side of this is that people often fail to think about the nature of the underlying soup implementation when they are designing their data structures and applications. Understanding these issu es and designing your code appropriately can make a huge difference in the performance you realize from soups.\par }