Entity Matching Benchmark (EMBench++)

Source Repository

An important part of EMBench is data collection. We do not want the synthetic data we generate to consist of completely random strings; the values should be real-world values. For this reason we have introduced the so-called shredders. A shredder is a software component that takes a database (relational or XML) and shreds it into a series of column tables. There are general-purpose shredders for relational and XML databases, as well as shredders designed specifically for popular, freely available databases such as Wikipedia, IMDb, DBLP, and OKKAM.
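To make the idea of shredding concrete, the following is a minimal sketch of a general-purpose relational shredder. It is not the EMBench implementation: the function name `shred_sqlite`, the use of SQLite, the `table.column` naming, and the decision to keep only distinct non-null values are all assumptions made for illustration.

```python
import sqlite3


def shred_sqlite(conn):
    """Shred every user table of a SQLite database into column tables.

    Hypothetical helper (not part of EMBench). Returns a dict that maps
    "table.column" to the list of distinct non-null values of that column.
    """
    column_tables = {}
    cur = conn.cursor()

    # Collect the user tables, skipping SQLite's internal ones.
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]

    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        columns = [row[1] for row in cur.execute(
            f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows = cur.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL')
            column_tables[f"{table}.{column}"] = [r[0] for r in rows]
    return column_tables


def demo():
    """Build a tiny in-memory database and shred it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (title TEXT, studio TEXT)")
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?)",
        [("Alpha", "Studio One"), ("Beta", "Studio One"), ("Gamma", None)])
    return shred_sqlite(conn)
```

Each resulting column table can then serve as a pool of real-world values for one attribute, for example movie titles or studio names.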

The user of the benchmark can select which databases are shredded, or add further databases by supplying each one together with its shredder or by using the general-purpose shredder that comes with the system. Due to security restrictions, this functionality is not available through the EMBench Web System. Online data generation therefore relies on the source repository available in the default EMBench implementation. A summary of the data in this repository is provided in the following table.

Random Values
	No.	Name	Record Number
	1.	fullname	on-the-fly value creation
	2.	movietitle	458,143
	3.	lastname	323,021
	4.	firstname	128,958
	5.	song	110,361
	6.	company	107,364
	7.	movieoccupation	83,303
	8.	occupation	83,303
	9.	movieproducers	70,877
	10.	film	52,979
	11.	distributor	22,184
	12.	editor	13,792
	13.	album	12,007
	14.	university	11,804
	15.	mountain	9,811
	16.	software	7,768
	17.	booktitle	5,208
	18.	organization	4,414
	19.	studio	4,179
	20.	disease	4,003
	21.	newspaper	3,338
	22.	subsidiarycompany	3,126
	23.	band	2,888
	24.	museum	2,456
	25.	theatre	629
	26.	publisher	482
	27.	athleticconference	229
	28.	monastery	189
	29.	series	153
	30.	symptom	135
	31.	toy	81
	32.	school	47
	33.	movierelatedoccupation	15
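
The table lists fullname as created on the fly rather than stored with a record count. The page does not say how this is done; one plausible reading, shown here purely as an assumption, is that a full name is composed from one value of the firstname column table and one value of the lastname column table. The function name `make_fullname` and the sample values are hypothetical.

```python
import random


def make_fullname(firstnames, lastnames, rng):
    """Compose a full name from two column tables (assumed behaviour).

    firstnames, lastnames: non-empty lists of values, e.g. shredded columns.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    return f"{rng.choice(firstnames)} {rng.choice(lastnames)}"


# Example with made-up values standing in for the real column tables.
sample_firstnames = ["Ada", "Alan", "Grace"]
sample_lastnames = ["Lovelace", "Turing", "Hopper"]
sample_name = make_fullname(sample_firstnames, sample_lastnames,
                            random.Random(42))
```

Generating the combination on demand avoids storing every first name and last name pairing, which may be why no record count is given for this entry.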

Last modified: May 2018, Page maintained by: Ekaterini Ioannou, Yannis Velegrakis