A regular expression (a.k.a. regex or RE) is a pattern to be searched for in some body of text. These are not specific to Python, but by combining simple regular expressions with basic Python statements, we can quickly achieve powerful results.

Commonly used regex syntax

.

Matches any one character exactly once

[abcdef]

Matches any one of the characters a, b, c, d, e, f exactly once

[^abcdef]

Matches any one character **other than** a, b, c, d, e, f

?

After a character/group, makes that character/group optional (i.e. match zero or one times)

*

After a character/group, makes that character/group match zero or more times

+

After a character/group, makes that character/group match one or more times

{2,5}

After a character/group, makes that character/group match 2, 3, 4, or 5 times

{2,}

After a character/group, makes that character/group match 2 or more times

\3

Matches whatever was matched into group number 3 (the first group from the left is numbered 1)
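As a quick illustration of groups and backreferences (this example is mine, not part of the original table): parentheses create a numbered group, and \1 in the pattern matches whatever group 1 matched.

```python
import re

# The pattern matches any character, then a second character, then the
# first character again -- i.e. an "XYX" pattern.
text = "道可道，非常道。名可名，非常名。"
for match in re.finditer(r"(.)(.)\1", text):
    print(match.group(0))
```

This prints 道可道 and 名可名: punctuation counts as "any character" here, which is why later examples take care to exclude it.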

To use regexes in Python, we use another module called “re” (this is a very common module and should already be installed).

In [53]:

import re

laozi = "上德不德，是以有德；下德不失德，是以無德。上德無為而無以為；下德為之而有以為。上仁為之而無以為；上義為之而有以為。上禮為之而莫之應，則攘臂而扔之。故失道而後德，失德而後仁，失仁而後義，失義而後禮。"
for match in re.finditer(r".德", laozi):  # re.finditer returns "match objects", each of which describes one match
    matched_text = match.group(0)  # In Python, group(0) gives the full text that was found
    print("Found a match: " + matched_text)

Found a match: 上德
Found a match: 不德
Found a match: 有德
Found a match: 下德
Found a match: 失德
Found a match: 無德
Found a match: 上德
Found a match: 下德
Found a match: 後德
Found a match: 失德

[Aside: in Python, regexes are often written in strings with an "r" in front of them, e.g. r"德" rather than just "德". All this does is tell Python not to try to interpret the contents of the string (e.g. backslashes) as meaning something else. The result of r"德" is still an ordinary string variable with 德 in it.]
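To see the difference the raw-string prefix makes, compare how backslashes are handled (a quick check of my own, not from the original notebook):

```python
# An ordinary string interprets \n as a single newline character;
# a raw string keeps the backslash and the "n" as two separate characters.
print(len("\n"))     # 1
print(len(r"\n"))    # 2
# With no backslashes present, the r prefix makes no difference at all.
print(r"德" == "德")  # True
```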

Exercise 1 (very easy): Change the above code to verify the results of some of the simple example regexes from the slides. Try these ones:

而無以為

是以.德

失.而後.

上[仁義]為之

後(.)，失\1

For the last of these (“後(.)，失\1”), see what happens to the output when you change group(0) to group(1). (Change it back to group(0) afterwards though, as we will reuse this code using group(0).)

Exercise 2: Write regular expressions to match the following things (you can keep on modifying the example above to check that they work, but you may want to write down your answers somewhere – remember, you can edit this cell by double-clicking on it).

Match any three characters where the middle character is “之” – i.e. “為之而”, “莫之應”, etc. Modify your regex so that it does not match things with punctuation in them, like “扔之。”.

Match each “phrase” (i.e. punctuated section) of the text. In other words, the first match should be “上德不德”, the second should be “是以有德”, and so on. You only need to handle the three punctuation marks “。”, “，”, and “；”.

Match each phrase which contains the term “之” in it. (Double check that you get 5 matches.)

We can do the same kind of thing on an entire text in one go if we have the whole text in a single string, as in the next example. (If we wanted to know which paragraph or chapter each match appeared in, we would instead run the same regex on each paragraph or chapter in turn.)

In [54]:

from ctext import *
setapikey("demo")

# The gettextasstring function gives us a single string variable with the whole text in it
laozi = gettextasstring("ctp:dao-de-jing")
for match in re.finditer(r"足.", laozi):
    matched_text = match.group(0)
    print(matched_text)

足，
足。
足，
足，
足者
足見
足聞
足既
足以
足；
足不
足；
足之
足，
足矣
足以
足下
足者
足。
足以

Exercise 3

Often we don’t want to include matches that have punctuation in them. Modify the regex from the last example so that it excludes all the matches where the character after “足” is “，”, “。”, or “；”. (You should do this by modifying the regex; the rest of the code does not need to change.)

Find all the occurrences of X可X – i.e. “道可道” and “名可名” (there is one more item that should be matched too).

Modify your regex so you match all occurrences of XYX – i.e. not just “道可道” but also things like “學不學”. You may need to make some changes to avoid matching punctuation – we don’t want to match “三，三” or “、寡、”.

Exercise 4: (Optional) Using what was covered in the previous tutorial, write a program in the cell below to perform one of these searches again, but this time running it once on each paragraph in turn so that you can also print out the number of the passage in which each match occurs.

One of the advantages of using regexes from within a programming language like Python is that as well as simply finding results, we can easily do things to collate our data, such as count up how many times a regex gave various different results. Another type of variable that is useful here is the “dictionary” variable.

A dictionary variable works in a very similar way to a list, except that whereas in a list the items are numbered 0, 1, 2, … and accessed using these numbers, a dictionary uses other things – in the case we will look at, strings – to identify the items. This lets us “look up” values for different strings, just like looking up the translation of a word in a dictionary. The things we use instead of numbers to “look up” values in a dictionary are called “keys”.

Dictionaries can be defined in Python using the following notation:

In [55]:

my_titles = {"論語": "Analects", "孟子": "Mengzi", "荀子": "Xunzi"}

The above example defines one dictionary variable called “my_titles”, and sets values for three keys: “論語”, “孟子”, and “荀子”. Each of these keys is set to have the corresponding value (“Analects”, “Mengzi”, and “Xunzi” respectively). In this simple example, our dictionary gives us a way of translating Chinese-language titles into English-language titles.

We can access the items in a dictionary in a very similar way to accessing items from a list:

In [56]:

print(my_titles["論語"])

Analects

In [57]:

print(my_titles["荀子"])

Xunzi

Unlike in a list, our items don’t have numbers, and don’t come in any particular order. So one thing we will sometimes need to do is to get a list of all the keys – i.e., a list telling us what things there are in our dictionary.

In [58]:

list_of_titles = list(my_titles.keys())
print(list_of_titles)

['孟子', '論語', '荀子']

Often we will store numbers in our dictionary; the keys will be strings, but the value for each key will be a number. This lets us do things like count how many times we’ve seen each particular string – for all of the strings we happen to come across at the same time, using just one dictionary variable. In cases like this, we will often want to sort the keys by their values. One way of doing this is using the “sorted” function:

In [59]:

# In this example, we use a dictionary to record people's year of birth
# Then we sort the keys (i.e. the names) by the values (i.e. year of birth)
year_of_birth = {"胡適": 1891, "梁啟超": 1873, "茅盾": 1896, "王韜": 1828, "魯迅": 1881}
list_of_people = sorted(year_of_birth, key=year_of_birth.get, reverse=False)
for name in list_of_people:
    print(name + " was born in " + str(year_of_birth[name]))

王韜 was born in 1828
梁啟超 was born in 1873
魯迅 was born in 1881
胡適 was born in 1891
茅盾 was born in 1896

Don’t worry about the rather complex-looking syntax for sorted() – you can just follow this model whenever you need to sort a dictionary (and change “reverse=False” to “reverse=True” if you want to reverse the order).

Using a dictionary, we can keep track of every regex result we found, and at the same time collate the data. Instead of having a long list with repeated items in it, we build a dictionary in which the keys are the unique regex matches, and the values are the number of times we have seen that particular string.

In [60]:

match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable
for match in re.finditer(r"(.)為", laozi):
    matched_text = match.group(0)  # e.g. "心為"
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

# Our dictionary now contains a frequency count of each different pair we found
print("match_count contains: " + str(match_count))

# The sorted() function gets us a list of the items we matched, starting with the most frequent
unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")

We can use this idea and almost exactly the same code to start answering quite complex questions about patterns appearing in texts. This code can tell us which actual phrases matching a certain pattern occurred most frequently.

For example, in poetry we often find various kinds of repetition. We can use part of the 詩經 as an example, and using a regex quickly find out which repeated XYXY patterns are most common:

In [61]:

shijing = gettextasstring("ctp:book-of-poetry/lessons-from-the-states")

In [62]:

match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable
for match in re.finditer(r"(.)(.)\1\2", shijing):
    matched_text = match.group(0)
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")

Exercise 5: Write a regex to match paired lines of four-character poetry that both begin with the same two characters (examples: “亦既見止、亦既覯止”, “且以喜樂、且以永日”, etc.). Re-run the program above to verify your answer.

Exercise 6: Create a regex to match book titles that appear in punctuated Chinese texts, e.g. “《呂氏春秋》”. Your regex should extract the title without the punctuation marks into a group – i.e. you must use “(” and “)” in your regex. You can test it using the short program below – your output should look like this:

爾雅
廣雅
尚賢
呂氏春秋·順民
呂氏春秋·不侵
左·襄十一年傳
韓詩外傳
廣雅

In [ ]:

test_input = "昔者文公出走而正天下，畢云：「正，讀如征。」王念孫云「畢讀非也，《爾雅》曰：『正，長也。』晉文為諸侯盟主，故曰『正天下』，與下『霸諸侯』對文。又《廣雅》『正，君也』。《尚賢》篇曰：『堯、舜、禹、湯、文、武之所以王天下正諸侯者』。凡墨子書言正天下正諸侯者，非訓為長，即訓為君，皆非征伐之謂。」案：王說是也。《呂氏春秋·順民》篇云：「湯克夏而正天下」，高誘注云：「正，治也」，亦非。桓公去國而霸諸侯，越王句踐遇吳王之醜，蘇時學云：「醜，猶恥也。」詒讓案：《呂氏春秋·不侵》篇「欲醜之以辭」，高注云：「醜，或作恥。」而尚攝中國之賢君，畢云：「尚與上通。攝，合也，謂合諸侯。郭璞注爾雅云：『聶，合』，攝同聶。」案：畢說未允。攝當與懾通，《左·襄十一年傳》云：「武震以攝威之」，《韓詩外傳》云：「上攝萬乘，下不敢敖乎匹夫」，此義與彼同，謂越王之威足以懾中國賢君也。三子之能達名成功於天下也，皆於其國抑而大醜也。畢云：「猶曰安其大醜。《廣雅》云：『抑，安也』」。俞樾云：「抑之言屈抑也。抑而大醜，與達名成功相對，言於其國則抑而大醜，於天下則達名成功，正見其由屈抑而達，下文所謂敗而有以成也。畢注於文義未得。」案：俞說是也。太上無敗，畢云：「李善文選注云：『河上公注老子云：太上，謂太古無名之君也』。」案：太上，對其次為文，謂等之最居上者，不論時代今古也。畢引老子注義，與此不相當。其次敗而有以成，此之謂用民。言以親士，故能用其民也。"
for match in re.finditer(r"your regex goes here!", test_input):
    print(match.group(1))  # group() extracts the text of a group from a matched regex: so your regex must have a group in it

Now modify your regex so that instead of getting book titles together with chapter titles, your regex only captures the title of the work – i.e., capture “呂氏春秋” instead of “呂氏春秋·順民”, and “左” instead of “左·襄十一年傳”.

Optional: Bonus points if you can also capture the chapter title on its own in a separate regex group at the same time. This is a bit fiddly though, and we don’t need to do it for this exercise.

Now modify the example code below (it’s almost identical to one of examples above) so that it lists how often every title was mentioned in the 墨子閒詁 (a commentary on the classic text “墨子” – in this example we only use the first chapter, though the code can also be run on the whole text by changing the URN).

Then modify your code so that it only lists the top 10 most frequently mentioned texts. Hint: “unique_items” is a list, and getting part of a list is very similar to getting part of a string.

In [ ]:

test_input = gettextasstring("ctp:mozi-jiangu/qin-shi")
match_count = {}  # This tells Python that we're going to use match_count as a dictionary variable
for match in re.finditer(r"your regex goes here!", test_input):
    matched_text = match.group(1)
    if not matched_text in match_count:
        match_count[matched_text] = 0  # If we don't do this, Python will give an error on the following line
    match_count[matched_text] = match_count[matched_text] + 1

unique_items = sorted(match_count, key=match_count.get, reverse=True)
for item in unique_items:
    print(item + " occurred " + str(match_count[item]) + " times.")

Dictionaries also allow us to produce graphs summarizing our data.

In [ ]:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
%matplotlib inline

# Unfortunately some software still has difficulty dealing with Chinese.
# Here we may need to tell matplotlib to use a specific font containing Chinese characters.
# If your system doesn't display the Chinese text in the graph below, you may need to specify a different font name.
import platform
if platform.system() == 'Darwin':  # I.e. if we're running on Mac OS X
    mpl.rcParams['font.family'] = "Arial Unicode MS"
else:
    mpl.rcParams['font.family'] = "SimHei"
mpl.rcParams['font.size'] = 20

# The interesting stuff happens here:
s = pd.Series(match_count)
s.sort_values(ascending=False, inplace=True)
s = s[:10]
s.plot(kind='barh')

Now modify your regex so that you only match texts that are cited as pairs of book title and chapter, i.e. you should only match cases like “《呂氏春秋·順民》” (and not 《呂氏春秋》), and capture into a group the full title (“呂氏春秋·順民” in this example). This may be harder than it looks! You will need to be careful that your regex does not sometimes match too much text.

Re-run the above programs to find out (and graph) which chapters of which texts are most frequently cited in this way by this commentary.

As well as finding things, regexes are ideal for other very useful tasks including replacing and splitting textual data.

For example, we saw in the last notebook cases where it would be easier to process a text without any punctuation in it. We can easily match all punctuation using a regex, and once we know how to search and replace, we can just replace each matched piece of punctuation with a blank string to get an unpunctuated text.

We can do a simple search-and-replace using a regex like this:

In [63]:

import re

input_text = "道可道，非常道。"
print(re.sub(r"道", r"名", input_text))

名可名，非常名。

For very simple regexes that don’t use any special regex characters, this gives exactly the same result as replace(). But because we can specify patterns, we can do much more powerful replacements.

In [64]:

input_text = "道可道，非常道。"
print(re.sub(r"[。，]", r"", input_text))

道可道非常道

Of course, as usual the power of this is that we can quickly do it for however much data we like:
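For instance, the same substitution strips the punctuation from an entire passage at once (a short excerpt is used here for illustration; the original notebook applied this to the full downloaded text):

```python
import re

laozi_excerpt = "上德不德，是以有德；下德不失德，是以無德。"
print(re.sub(r"[。，；]", r"", laozi_excerpt))
```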

Another useful feature is that the replacement can refer back to data captured by regex groups in the match. This makes it easy to write replacements that add some particular string before or after something we want to match. For example, we can find any punctuation character, capture it into regex group 1, and then replace it with group 1 followed by a return character – in other words, add a line break after every punctuation character.
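The code cell for this example is missing from this copy of the notebook; a minimal reconstruction might look like the following (the exact punctuation class used in the original may have differed):

```python
import re

input_text = "道可道，非常道。名可名，非常名。"
# Capture each punctuation mark into group 1, then replace it with the
# contents of group 1 (written \1 in the replacement) followed by a line break.
print(re.sub(r"([。，；])", r"\1\n", input_text))
```

Each phrase is now printed on its own line, with its punctuation mark kept at the end.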

Regular expressions can be very useful when we want to transform text from one format to another, or when we want to read text from a file and it isn’t in the format we want.

In this section, instead of using the ctext.org API, we will experiment with a text from Project Gutenberg. Before starting, download the plain text UTF-8 file from the website and save it on your computer as a file called “mulan.txt”. You should save this in the same folder as this Jupyter notebook (.ipynb) file.

Note: you don’t have to save files in the same folder as your Jupyter notebook, but if you save them somewhere else, when opening the file you will need to tell Python the full path to your file instead of just the filename – e.g. “C:\Users\user\Documents\mulan.txt” instead of just “mulan.txt”.

One practical issue when dealing with a lot of data in a string is that printing it to the screen so we can see what’s happening in our program may take up a lot of space. One thing we can do is to just print a substring – i.e. only print the first few hundred or so characters:

In [68]:

print(data_from_file[0:700])

﻿The Project Gutenberg EBook of Mu Lan Ji Nu Zhuan, by Anonymous
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Mu Lan Ji Nu Zhuan
Author: Anonymous
Editor: Anonymous
Release Date: December 20, 2007 [EBook #23938]
Language: Chinese
Character set encoding: UTF-8
*** START OF THIS PROJECT GUTENBERG EBOOK MU LAN JI NU ZHUAN ***
序
嘗思人道之大，莫大於倫常；學問之精，莫精於性命。自有書籍以來，所載傳人不少，
求其交盡乎倫常者鮮矣，求其交盡乎性命者益鮮矣。蓋倫常之地，或盡孝而不必兼忠，
或盡忠而不必兼孝，或盡忠孝而安常處順，不必兼勇烈。遭際未極其變，即倫常未盡其
難也。性命之理，有不悟性根者，有不知命蒂者，有修性

One thing that will be handy is if we can delete the English blurb at the top of this file automatically. There are several ways we could do this. One way is to use a negative character class – matching everything except some set of characters – to match all characters that are non-Chinese, and delete them.

The re.sub() function takes three parameters:

The regular expression to match

What we want to replace each match with

The string we want to do the matching in

It returns a new string containing the result after making the substitution.

[The example below also makes use of another kind of special syntax in a character class: we can match a range of characters by their Unicode codepoints. Here we match everything from U+25A1 through U+FFFF, a range which includes the Chinese characters (along with various other non-ASCII symbols). Don't worry too much about the contents of this regex - we won't need to write regexes like this most of the time.]

We’ve got rid of the English text, but we’ve now got too many empty lines. Depending on what data is in the text, we might want to remove all the line breaks… but in this case there are some things like chapter titles that are best kept on separate lines so we can tell where the chapters begin and end.

Remember: “\n” means “one line break”, and “{3,}” will match 3 or more of something one after the other (and as many times as possible).

In [70]:

# This regex matches three or more line breaks in a row, and replaces them with two
without_spaces = re.sub(r'\n{3,}', "\n\n", new_data)
print(without_spaces[0:700])

Exercise 7: (Harder) Make another substitution using a regex which removes only the line breaks within a paragraph (and does not remove linebreaks before and after a chapter title). The output should look like this:

Exercise 8: The text contains comments in it which we might want to delete before doing further processing or calculating any statistics. Create a regex substitution which removes each of these comments.

Next, use your chapter-detecting regex to add immediately before each chapter the text “CHAPTER_STARTS_HERE”.

In [ ]:

# Your code goes here!

Lastly, we can use a regex to split a string variable into a Python list using the re.split() function. At any point in the string where the specified regex is matched, the data is split into pieces. For instance:
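For instance, splitting on the three punctuation marks gives a list of phrases (note that a match at the very end of the string produces a trailing empty item):

```python
import re

input_text = "道可道，非常道。名可名，非常名。"
# Split the string at every occurrence of any of the three punctuation marks
phrases = re.split(r"[。，；]", input_text)
print(phrases)
```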

Use re.split() to split your full text into a Python list, in which each chapter is one list item. (For simplicity you can ignore things like the preface etc.)

In [ ]:

# Your code goes here!# Call your list variable "chapters"

Now we have this data in a Python list, we can do things to each chapter individually. We can also put each of the chapters into its own text file – this is something we will sometimes need to do when we want to use other tools that are not in Python.
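A sketch of how that might be done (the list contents and file names here are placeholders of my own, not from the original notebook):

```python
# Write each chapter in the "chapters" list to its own numbered text file
chapters = ["第一回之內容", "第二回之內容"]  # placeholder data for illustration
for number, chapter in enumerate(chapters):
    output_file = open("chapter" + str(number + 1) + ".txt", "w", encoding="utf-8")
    output_file.write(chapter)
    output_file.close()
```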

Regular expressions (often shortened to “regexes”) are a powerful extension of the simple string search widely available in computer software (e.g. word processors, web browsers, etc.): a regular expression is a specification of something to be matched in some body of text. At their simplest, regular expressions are just strings of characters to search for, like “君子” or “巧言令色”. As a starting point, you can use Text Tools to search for multiple terms within a text by entering your terms one per line in the “Regex” tab:

Text Tools will highlight each match in a different color, and show only the paragraphs with at least one match. Of course, you can specify as many search terms as you like, for example:

Clicking on any of the matched terms adds it as a “constraint”, meaning that only passages containing that term will be shown (though still highlighting any other matches present). For instance, clicking “君子” will show all the passages with the term “君子” in them, while still highlighting any other matches:

As with the similarity function of the same plugin, if your regular expression query results in relational data, this can be visualized as a network graph. This is done by setting “Group rows by” to either “Paragraph” or “Chapter”, which gives results in the “Summary” tab tabulated by paragraph (or chapter) – each row represents a paragraph which matched a term, and each column corresponds to one of the matched items:

This can be visualized as a network graph in which edges represent co-occurrence of terms within the same paragraph, and edge weights represent the number of times such co-occurrence is repeated in the texts selected:

This makes it clear where the most frequently repeated co-occurrences occur in the selected corpus – in this example, “君子” and “小人”, “君子” and “禮”, etc. Similarly to the way in which similarity graphs created with the Text Tools plugin work, double-clicking on any edge in the graph returns to the “Regex” tab with the two terms joined by that edge chosen as constraints, thus listing all the passages in which those terms co-occur, this being the data explaining the selected edge:

So far these examples have used fixed lists of search strings. But as the name suggests, the “Regex” tool also supports regular expressions, and so by making use of standard regular expression syntax, it’s possible to make far more sophisticated queries. [If you haven't come across regular expressions before, some examples are covered in the regex section of the Text Tools tutorial.] For example, we could write a regular expression that matches any one of a specified set of color terms, followed by any other character, and see how these are used in the Quan Tang Shi (my example regex is “[黑白紅]\w”: match any one of “黑”, “白”, or “紅”, followed by one non-punctuation character):

If we use “Group by: None”, we get total counts of each matched value – i.e. counts of how frequently “白雪”, “白水”, “紅葉”, and whatever other combinations there are occurred in our text. We can then use the “Chart” link to chart these results and get an overview of the most frequently used combinations:

If we go back to the Regex tab and set “Group by” to “Paragraph”, we can visualize the relationships just like in the Analects example — except that this time we don’t need to specify a list of terms, rather these terms can be extracted using the pattern we specified as a regular expression (in this graph I have set “Skip edges with weight less than” to “2” to reduce clutter caused by pairs of terms that only ever occur once):

Although overall – as we can see from the bar chart above – combinations with “白” in them are the most common, the relational data shown in the graph above immediately highlights other features of the use of these color pairings: the three most frequent pairings in our data are actually pairings between “白” and “紅”, like “白雲” and “紅葉”, or “白髮” and “紅顏”. As before, our edges are linked to the data, so we can easily go back to the text to see how these are actually being used:

Regular expressions are a hugely powerful way of expressing patterns to search for in text — see the tutorial for more examples and a step-by-step walk-through.

The plugin system and API for ctext.org make it possible to import textual data from ctext.org directly into other online tools. One such tool is the new “Text Tools” plugin, which provides a set of textual analysis and visualization tools designed to work with texts from ctext.org. There is a step-by-step online tutorial describing how to actually use the tool (as well as the instructions on the tool’s own help page); I won’t repeat those here, but instead will give some examples of what the tool can be used to do.

One of the most interesting features of the tool is its function to identify text reuse within and between texts (via the “Similarity” tab). This takes as input one or more texts, and identifies and visualizes similarities between them. For example, with the text of the Analects:

This uses a heat map effect somewhat similar to the ctext.org parallel passage feature: here n-grams are matched (e.g. 3-grams, i.e. triples of identical characters used in identical sequence), and overlapping matched n-grams are shown in successively brighter shades of red. By default, all paragraphs having any shared n-grams with anything else in the selected text or texts are shown. The visualization is interactive, so clicking on any highlighted section switches the view to show all locations in the chosen corpus containing the selected n-gram (which is then highlighted in blue, like the 6-gram “如己者過則勿” in the following image):

Since the texts are read in from ctext.org via the API, the program also knows the structure of the text; clicking on “Chapter summary” shows instead a table of calculated total matches aggregated on a chapter-by-chapter basis:

This data is relational: each row expresses strength of similarity of a certain kind between two entities (two chapters of text). It can therefore be visualized as a weighted network graph – the Text Tools plugin can do this for you:

What’s nice about this type of graph is that every edge has a very concrete meaning: the edge weights are simply a representation of how much reuse there is between the two nodes (i.e. chapters) which it connects. Even better, this visualization is also interactive: double-clicking an edge (e.g. the edge connecting 先進 and 雍也) returns to the passage level visualization and lists all the similarities between those two specified chapters – in other words, it lists precisely the data forming the basis for the creation of that edge:

What this means is that the graph can be used as a map to see where similarities occur and with which to navigate the results. It also makes it possible to visualize broader trends in the data which might not be easily visible by looking directly at the raw data. For instance, in the following graph created using the tool from three early texts, several interesting patterns are observable at a glance (key: light green = Mozi; dark green = Zhuangzi; blue = Xunzi):

Some at-a-glance patterns suggested by this graph: chapters of the three texts tend to have stronger relationships within their own text, with a few exceptions. There are several disjoint clusters of chapters, which have text reuse relationships with other members of their own group, but not with the rest of the text they appear in – most striking is the group of eight “military chapters” of the Mozi at the top right of the graph, which have strong internal connections but none to anything else in the graph:

Double-clicking on some edges to view the full data indicates that some of these pairs have quite significant reuse relationships:

The only other entirely disjoint cluster is the group formed by the 大取 and 小取 pair of texts – in this case the edge is formed by one short but highly significant parallel:

Another interesting observation: of those Zhuangzi chapters having text reuse relationships with other chapters in the set considered, only the 天下 chapter lacks any significant reuse relationship with any other part of the Zhuangzi – though it does contain a significant parallel with the Xunzi:

Something similar is seen with the 賦 chapter of the Xunzi:

There is a lot of complex detail in this graph, and interpretation requires care and attention to the actual details of what is being “reused” (as well as the parameters of the comparison and visualization); the Text Tools program makes it possible to easily explore the larger trends while also being able to quickly jump into the detailed instance-level view to examine the underlying text. Text Tools works “out of the box” with texts from ctext.org read in via API (ideally you will need an institutional subscription or API key to do this efficiently), but it can also use texts from other sources.

There are a number of ways to add direct full-text search of a ctext.org text to an external website. One of the most straightforward is to use the API “getlink” function to link to a text using its CTP URN. For example, to make a text box which will search this Harvard-Yenching copy of the 茶香閣遺草, you can first locate the corresponding transcribed text on ctext.org, go to the bottom-right of its contents page to get its URN (you need the contents page for the transcription, not the associated scan), which in this case is “ctp:wb417980” – this step can also be done programmatically by API if you want to repeat it for a large number of texts. Once you have the URN, you can create an HTML form which will send the URN and any user-specified search term to the ctext API, which will redirect the user’s browser to the search results. For example, the following HTML creates a search box for 茶香閣遺草:

This will display the following type of search box (try entering a search term in Chinese and clicking “Search”):

You can also supply the optional “if” and “remap” parameters if you want users of your form to be directed to the Chinese interface, or to use the simplified Chinese version of the site (the defaults are English and traditional Chinese). For Chinese interface, between the <form … /> and </form> tags, add the following line:

<input type="hidden" name="if" value="zh" />

For simplified Chinese, add this line:

<input type="hidden" name="remap" value="gb" />

If you want to make a link to the text itself using the URN, you can also directly link to the API endpoint:

While transiting at Schiphol and using the airport wifi, I noticed the sudden appearance of a bunch of adverts on normally advert-free websites. For example:

Some investigation indicated that this time the adverts were not injected via Google Analytics, but instead attached directly into the HTML content of the page. First at the top we have some injected CSS:

Then at the bottom we have the real payload, injected JavaScript code:

It appears this is the same type of advertising afflicting AT&T hotspots – information gleaned from Jonathan Mayer, whose website describing the issue is itself also affected by the Schiphol adverts:

Again it seems that given the large scale involved, someone, somewhere – perhaps including a company called “RaGaPa” who seem to be responsible for the ads – is making quite a bit of money through unsavory and perhaps legally questionable means.

Just in case the adverts on their own are not spammy enough, the icon at the top right of the adverts links to the following explanation, casually noting that in addition to standard user tracking and user-history ad serving, “You may also be redirected to sponsor’s websites or welcome pages at a set frequency”:

Perhaps the real take-home though is that HTTPS sites are, again, not affected by this: content injection of this type is not possible on sites served using HTTPS without defeating the certificate authority chain or sidestepping it with other kinds of trickery. Digital Sinology recently moved to HTTPS, so is not affected by this particular attack.